The developed audiovisual tracker works on the 2D image plane and performs best in single-target scenarios; the organization of the project in terms of data processing is shown in the flowchart in Fig. 1.1.
Figure 1.1: Project organization in terms of data processing
This thesis is organized in three main parts: state of the art, proposed solution, and results; furthermore, the appendix presents the audiovisual data acquisition system developed and used in this project.
The State of the Art chapter is divided into three sections, which review the existing technology for audio analysis, video analysis, and data fusion techniques oriented to tracking.
• The Audio Analysis section discusses a robust cross-correlation estimation method that works on real audio datasets; this method is used to obtain information about the location of audio sources.
• The Video Analysis section presents a background subtraction method and two further image processing steps: image smoothing and morphological operations. Together they are used to segment objects of interest out of a video sequence.
• The Tracker and Data Fusion Algorithm section presents the particle filter framework as a technique for performing tracking and data fusion.
The Proposed Solution chapter is also divided into three main sections and presents the specific implementations used to overcome the problems encountered in this project.
• The Audio Analysis section presents an innovative method to compute the cross-correlation function that is robust to reverberation and correlated noise, and a specific algorithm to extract the audio data of interest for the data fusion algorithm.
• The Video Analysis section presents a specific algorithm, based on background subtraction, to extract the video data of interest for the data fusion algorithm.
• The Tracker and Data Fusion Algorithm section presents an innovative particle filter application that tracks objects using audio and video data together.
The Results chapter presents the analysis of the obtained results, where performance is evaluated by processing different real datasets with the tracker developed in this project.
1.1 Contribution of the Thesis
The purpose of this thesis is to study and develop a complete audiovisual tracking system, in which both audiovisual data recording and audiovisual data analysis are taken into consideration and implemented.
The work covers the audiovisual data recording system, built around the professional surveillance equipment presented in Appendix A, and the implementation of C++ software that processes the recorded data and carries out the audiovisual tracking.
1.1.1 Audiovisual Data Recording System
This part of the work consists of a hardware setup and a software implementation to manage the data recording. On the hardware side, the goal is to set up a multi-sensor recording system capable of managing up to three different sensors in synchronized mode; for details see Appendix A. On the software side, a Visual Basic application is implemented that reads the audiovisual data received from the recording equipment and stores it on a remote PC.
The proposed multi-sensor recording system proved to be flexible, easy to use, and suitable for different applications.
1.1.2 Audiovisual Data Analysis
Once the audiovisual data are stored, they must be analyzed and the significant information extracted. The developed tracker builds on previous projects and follows the latest audiovisual tracking methods also used in other applications, such as [18][1][8][5]. Testing the tracker on our audiovisual data exposed a large number of real-world problems, which were solved by introducing some innovative ideas related to data analysis.
• Tracker algorithm
It was decided to use a tracking algorithm based on a particle filter, because it appeared to be the most suitable solution for this purpose. This decision is based on the results of previous projects [18][1][8][5], which present and analyze the advantages of the particle filter with respect to other classic tracking methods. The main particle filter features can be summarized as follows: its computational cost depends on the number of particles used and increases proportionally with the number of targets under analysis; furthermore, the particle filter can be used in a wide range of applications because it can handle non-Gaussian, non-linear, and multi-modal variables.
The contribution of this part is the development of a tracking algorithm in which the audio and video data are fused through the estimation of different distributions, as presented in Section 3.3.
• Video Analysis
The video component analysis is based on background subtraction, a method well known and widely used in computer vision applications [14]. After the background subtraction step, the targets of interest are segmented out and the significant data to be used in the particle filter algorithm are extracted.
The contribution of this part is the development of a multi-threshold background subtraction, where detection is improved by means of an emphasizing system based on frame-by-frame analysis. Moreover, video information extraction oriented to the particle filter is implemented as presented in Section 3.2.
• Audio Analysis
Audio analysis turned out to be the most problematic part, being the main cause of estimation errors and false detections; accordingly, part of the work is devoted to improving the audio detection. Starting from previous research in this field, it was observed that the most widespread technique for DOA detection in reverberant environments is generalized cross-correlation estimation [18][13].
It was noted that the results obtained by processing the recorded data with the plain generalized cross-correlation were still unsatisfactory, so some innovative ideas were introduced to overcome these problems.
Starting from the analysis of the available methods for reducing the reverberation component, proposed in [20] and [3], these two methods were implemented and merged to improve the detection, as presented in Section 3.1.2. Furthermore, to reduce strongly correlated noise, the generalized cross-correlation function is estimated in different frequency bands, as presented in Section 3.1.2.
After the cross-correlation estimation, the estimated function is reshaped in order to introduce the audio information into the particle filter algorithm, as described in Section 3.1.3.
The contributions of this part are the development of a robust audio detection method, based on cross-correlation estimation, and the way the audio detection is introduced into the particle filter algorithm proposed in Section 3.3.
Observing the results presented in Section 4, it is possible to conclude that both the audiovisual recording system and the audiovisual data processing work on real data and give positive results; moreover, these results are a good starting point for assessing the tracker's potential and possible future improvements.
Chapter 2
State Of The Art
2.1 Audio Analysis
2.1.1 Introduction
Using microphone arrays to locate sound sources has been an active research topic since the early 1950s.
Given high-quality audio waveforms from a remote sound source, captured by a microphone array, it is possible to localize the audio source and thereby track it. Recent advances in technology have made digital signal processing systems capable of real-time processing of multiple audio channels, a fundamental requirement for processing and extracting valid information from microphone array signals.
Locating sound sources with microphone arrays has many important applications, including video conferencing, video surveillance, and speech recognition; indeed, it is still being studied in order to improve its performance.
The most common and widely used technique for source localization with a pair of microphones is the direction of arrival (DOA) approach. This method estimates the direction of the audio source with respect to the microphone alignment.
DOA is a well-known method in the telecommunications field; it is used in a very large number of applications, such as underwater sound source localization, radio telescopes, and wireless communication.
Various DOA algorithms have been developed to work with different data typologies and to overcome different real-world problems.
While researchers are making good progress on various aspects of DOA, there is still no good solution for environments where reverberation, destructive noise sources, or correlated noise are present.
This chapter presents a common DOA algorithm, already available in the literature, used to overcome the typical audio source localization problems on a real audio dataset.
2.1.2 Direction of Arrival (DOA)
The purpose of this audio processing algorithm is to analyze the signals from two microphones and to estimate the position of the sound source. With only two microphones available, it is only possible to estimate the direction of arrival of the source's waves. This direction of arrival, relative to the microphone alignment, gives an estimate of the audio source location in terms of angle.
The fundamental principle behind direction of arrival estimation is to use the phase information present in the audio signals picked up by spatially separated microphones.
When the microphones are spatially separated, the acoustic signals arrive at them with a time difference; for a constant microphone distance, this time difference depends only on the angular location of the source. In other words, the time difference, or delay time, is related to the DOA of the signals.
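To make the delay-to-angle relation concrete, the following is a minimal sketch of the standard far-field conversion; the speed-of-sound constant, function name, and numeric values are illustrative assumptions, not taken from this project:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature

def tdoa_to_angle(delay_s: float, mic_distance_m: float) -> float:
    """Convert a time difference of arrival into a DOA angle (degrees).

    Far-field model: delay = d * sin(theta) / c, hence
    theta = arcsin(c * delay / d), measured from broadside
    (the direction perpendicular to the microphone alignment).
    """
    sin_theta = SPEED_OF_SOUND * delay_s / mic_distance_m
    sin_theta = np.clip(sin_theta, -1.0, 1.0)  # guard against estimation noise
    return float(np.degrees(np.arcsin(sin_theta)))

# A 1 ms delay across a 1 m pair corresponds to an angle of about 20 degrees.
print(tdoa_to_angle(1e-3, 1.0))
```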
The main problems of audio source position estimation on real datasets are the reverberation component and the correlated noise added to the signal.
2.1.3 Generalized Cross-Correlation
The general framework for DOA estimation can be represented as in Fig. 2.1 [13].
Figure 2.1: Audio analysis scheme; the audio waves picked up at two microphones are processed to estimate the delay D.
An audio wave signal from a remote source, picked up in the presence of noise at two spatially separated microphones, can be mathematically written as in Eq. 2.1 and Eq. 2.2:
$x_1(t) = s_1(t) + n_1(t)$  (2.1)
$x_2(t) = \alpha s_1(t + D) + n_2(t)$  (2.2)
where s1(t) is the wave signal, n1(t) and n2(t) are the noise components, and D is the delay time between the two microphone signals x1(t) and x2(t). H1 and H2 represent the pre-filtering step. Assuming s1(t), n1(t), n2(t) ∈ ℝ and that the noise components are random processes, the signal s1(t) may be assumed to be uncorrelated with the noise n1(t) and n2(t).
The two signals x1(t) and x2(t), which differ by a time shift and an attenuation factor α, represent the source signal as converted by the microphones into electrical waveforms.
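To make the model of Eq. 2.1 and Eq. 2.2 concrete, the sketch below synthesizes a two-microphone signal pair; the sample rate, window length, delay, attenuation, and noise level are illustrative assumptions, not values from this project:

```python
import numpy as np

rng = np.random.default_rng(0)

fs = 16000              # sample rate in Hz (assumed)
n = 4096                # samples per observation window
delay_samples = 25      # true delay D, expressed in samples
alpha = 0.8             # attenuation at the second microphone

# Source signal s1(t), long enough to cover the advanced copy.
s = rng.standard_normal(n + delay_samples)

x1 = s[:n] + 0.1 * rng.standard_normal(n)                      # Eq. 2.1
x2 = alpha * s[delay_samples:] + 0.1 * rng.standard_normal(n)  # Eq. 2.2
```

This synthetic pair x1, x2 is reused by the estimation sketches later in the chapter.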
One of the standard methods for determining the delay D is to compute the cross-correlation between the two input signals x1(t) and x2(t), where the cross-correlation function is a measure of the similarity of two signals over time.
The cross-correlation function between two signals x1(t) and x2(t) in the time domain can be written as Eq. 2.3:
$R_{x_1 x_2}(\tau) = E[x_1(t)\, x_2(t - \tau)]$  (2.3)
where E denotes expectation. The argument τ that maximizes Eq. 2.3 provides an estimate of the delay time D.
Because of the finite observation time, only an estimate of Rx1x2(τ) can be computed. For example, for ergodic processes, an estimate of the cross-correlation function is given by Eq. 2.4:
$\hat{R}_{x_1 x_2}(\tau) = \frac{1}{T_2 - T_1} \int_{T_1}^{T_2} x_1(t)\, x_2(t - \tau)\, dt$  (2.4)
where T2 − T1 represents the observation time interval; for the present purpose it is set equal to the video frame period.
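A direct discrete-time version of the Eq. 2.4 estimator picks the lag that maximizes the finite-window cross-correlation. A minimal sketch, reusing the synthetic x1, x2 pair above (the search range max_lag is an assumption):

```python
import numpy as np

def estimate_delay_xcorr(x1: np.ndarray, x2: np.ndarray, max_lag: int) -> int:
    """Estimate D as the lag maximizing the finite-window
    cross-correlation of Eq. 2.4, searched over +/- max_lag samples."""
    # np.correlate(x1, x2, 'full')[center + k] = sum_t x1[t + k] * x2[t],
    # which matches R_x1x2(k) in the convention of Eq. 2.3.
    full = np.correlate(x1, x2, mode="full")
    center = len(x2) - 1
    lags = np.arange(-max_lag, max_lag + 1)
    r = full[center - max_lag : center + max_lag + 1]
    return int(lags[np.argmax(r)])

print(estimate_delay_xcorr(x1, x2, 100))  # recovers the 25-sample delay above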
The cross-correlation function alone is not enough to obtain a valid delay estimate on a real audio dataset [13]; in order to improve the accuracy of the delay estimate D̂, it is desirable to pre-filter x1(t) and x2(t) prior to the integration in Eq. 2.4.
As shown in Fig. 2.1, x1 and x2 may be filtered through H1 and H2 respectively. The filtered signals y1 and y2 are then cross-correlated, and the peak of the estimated cross-correlation function is used to estimate D̂.
The pre-filtering step is performed to reduce the noise in the signals; it is applied as a frequency-dependent weighting of the input signals through the two filters H1 and H2. This pre-filtering step represents the generalization of the cross-correlation (GCC).
The time shift producing the peak of the cross-correlation function is an estimate of the true delay D. When the filters are H1(f) = H2(f) = 1 ∀f, the estimated D̂ is simply the abscissa at which the cross-correlation function peaks.
To model the GCC function mathematically, let us start by analyzing the cross-correlation function between x1(t) and x2(t); it is related to the cross power spectral density Gx1x2(f) by the Fourier transform relationship given by Eq. 2.5:
$R_{x_1 x_2}(\tau) = \int_{-\infty}^{\infty} G_{x_1 x_2}(f)\, e^{j 2\pi f \tau}\, df$  (2.5)
When x1(t) and x2(t) have been filtered as shown in Fig. 2.1, the cross power spectrum between the filter outputs is given by Eq. 2.6:
$G_{y_1 y_2}(f) = H_1(f)\, H_2^{*}(f)\, G_{x_1 x_2}(f)$  (2.6)
where ∗ denotes the complex conjugate.
Therefore, the generalized cross-correlation function between the two signals coming from the filtering blocks, y1(t) and y2(t), can be written as Eq. 2.7:
$R_{y_1 y_2}^{(g)}(\tau) = \int_{-\infty}^{\infty} \psi_g(f)\, G_{x_1 x_2}(f)\, e^{j 2\pi f \tau}\, df$  (2.7)
where
$\psi_g(f) = H_1(f)\, H_2^{*}(f)$  (2.8)
and ψg(f) denotes the general frequency weighting.
The general frequency weighting is a frequency-domain function that reshapes the cross power spectrum so as to reduce the noise effects.
In practice, only an estimate Ĝx1x2(f) of Gx1x2(f) can be obtained from finite observations of x1(t) and x2(t). Consequently, the integral in Eq. 2.9 is evaluated and used for estimating the delay D̂:
$\hat{R}_{y_1 y_2}^{(g)}(\tau) = \int_{-\infty}^{\infty} \psi_g(f)\, \hat{G}_{x_1 x_2}(f)\, e^{j 2\pi f \tau}\, df$  (2.9)
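The estimator of Eq. 2.9 is usually evaluated in the frequency domain. Below is a minimal sketch with the weighting ψg(f) left as a pluggable function; the identity weighting reduces it to the plain cross-correlation, and the helper is reused by the later sketches (the zero-padding length is an implementation assumption):

```python
import numpy as np

def gcc(x1, x2, weighting=lambda G: G, max_lag=None):
    """Generalized cross-correlation of Eq. 2.9.

    `weighting` implements the frequency weighting psi_g(f); the default
    identity weighting yields the plain cross-correlation estimate.
    Returns (lags, r): correlation values r over lags in samples.
    """
    n = len(x1) + len(x2)                                 # zero-pad against circular wrap-around
    G = np.fft.rfft(x1, n) * np.conj(np.fft.rfft(x2, n))  # estimate of G_x1x2(f)
    r = np.fft.irfft(weighting(G), n)                     # Eq. 2.9 as an inverse FFT
    if max_lag is None:
        max_lag = n // 2 - 1
    lags = np.arange(-max_lag, max_lag + 1)
    r = np.concatenate([r[-max_lag:], r[: max_lag + 1]])  # reorder lags to [-max_lag, +max_lag]
    return lags, r

lags, r = gcc(x1, x2)      # identity weighting on the synthetic pair
print(lags[np.argmax(r)])  # the peak lag is the delay estimate D-hat
```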
It is interesting to examine the effect of the pre-processing weighting on the shape of Ry1y2(τ) under ideal conditions. For models of the form presented in Eq. 2.1 and Eq. 2.2, the cross-correlation function of x1(t) and x2(t) is given by Eq. 2.10:
$R_{x_1 x_2}(\tau) = \alpha R_{s_1 s_1}(\tau - D) + R_{n_1 n_2}(\tau)$  (2.10)
Taking the Fourier transform of Eq. 2.10, the cross power spectrum is given by Eq. 2.11:
$G_{x_1 x_2}(f) = \alpha G_{s_1 s_1}(f)\, e^{j 2\pi f D} + G_{n_1 n_2}(f)$  (2.11)
If n1(t) and n2(t) are uncorrelated, Gn1n2(f) = 0 and the cross power spectrum between x1(t) and x2(t) is a scaled signal power spectrum times a complex exponential. Since multiplication in the frequency domain corresponds to convolution in the time domain, it follows that for Gn1n2(f) = 0 the cross-correlation can be written as Eq. 2.12:
$R_{x_1 x_2}(\tau) = \alpha R_{s_1 s_1}(\tau) \circledast \delta(t - D)$  (2.12)
where $\circledast$ denotes convolution.
One interpretation of Eq. 2.12 is that the delta function has been spread by the autocorrelation of the signal, i.e., by the inverse Fourier transform of the signal spectrum. If s1(t) is a white noise source, its autocorrelation is itself a delta function and no spreading takes place. For a single delay it may not be a serious problem to find the peak position; however, when the signal has multiple delays, the true cross-correlation function is given by Eq. 2.13:
$R_{x_1 x_2}(\tau) = R_{s_1 s_1}(\tau) \circledast \sum_i \alpha_i\, \delta(t - D_i)$  (2.13)
In this case, the convolution with Rs1s1(τ) can spread one delta function into another, making it impossible to distinguish the individual peaks or delay times.
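The spreading described by Eq. 2.12 is easy to reproduce with the gcc helper above: a white source yields a very narrow correlation peak, while band-limiting the same source smears the delta so that nearby delays would merge. A small sketch, where the moving-average low-pass and the lengths are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 8192, 40

white = rng.standard_normal(n + d)
# Band-limit the same source with a moving average (a crude low-pass filter).
smooth = np.convolve(white, np.ones(32) / 32, mode="same")

for name, src in [("white", white), ("band-limited", smooth)]:
    xa, xb = src[:n], src[d : d + n]        # xb leads xa by d samples
    lags, r = gcc(xa, xb)                   # identity weighting: plain cross-correlation
    width = int(np.sum(r > 0.5 * r.max()))  # half-maximum width of the peak
    print(f"{name}: peak at lag {lags[np.argmax(r)]}, half-max width {width}")
```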
Under ideal conditions, where Ĝx1x2(f) ≈ Gx1x2(f) ∀f, ψg(f) should be chosen to ensure a large sharp peak in Ry1y2(τ) rather than a broad one, in order to ensure good time-delay resolution.
However, sharp peaks are more sensitive to errors introduced by the finite observation time, particularly in cases of low S/N ratio. Thus, as with other spectral estimation problems, the choice of ψg(f) is a compromise between good resolution and stability.
To perform the audio detection, it is important to define a good ψg(f) function, one that reduces the noise component and at the same time allows a good delay estimate to be computed.
2.1.4 Generalized Cross-Correlation with Phase Transform
When working on DOA estimation with a real audio dataset, the major problem observed is that the noise at the two microphones is correlated and the reverberation component of the signal is not negligible. Moreover, no additional information about the statistical characteristics of signal and noise is available; hence any kind of noise filtering is not trivial.
To improve the delay estimation, a general frequency weighting ψg(f) has to be defined, where ψg(f) represents a pre-filtering transform. Many different pre-filtering transforms have been studied; some examples are the Roth filter [13], the smoothed coherence transform (SCOT) [13], the phase transform (PHAT) [13], the Eckart filter [13], and the maximum likelihood filter [13].
Of all these pre-filtering transforms, PHAT offers the most interesting properties for working with a real audio dataset, as discussed in previous research [17].
When no assumptions are available about the statistical characteristics of the audio signal and noise, the most convenient way to sharpen the cross-correlation peak is to whiten the input signals.
Contrary to the other pre-filtering techniques, PHAT does not require modeling the statistical characteristics of the audio source signal and noise, so the PHAT approach is independent of the characteristics of the audio input waveform. The pre-filter used in the phase transform (PHAT) is presented in Eq. 2.14:
$\psi_p(f) = \frac{1}{|G_{x_1 x_2}(f)|} = \frac{1}{|S_1(f)|\,|S_2(f)|}$  (2.14)
where |Gx1x2(f)| is an estimate of the cross power spectral density of the two signals x1(t) and x2(t); it is computed as the product of the spectral magnitudes |S1(f)| and |S2(f)|. Thus the estimated generalized cross-correlation with phase transform (GCC-PHAT) is given by Eq. 2.15:
$\hat{R}_{y_1 y_2}^{(p)}(\tau) = \int_{-\infty}^{\infty} \frac{\hat{G}_{x_1 x_2}(f)}{|G_{x_1 x_2}(f)|}\, e^{j 2\pi f \tau}\, df$  (2.15)
In the ideal case, when the noise signals n1(t) and n2(t) are uncorrelated and Ĝx1x2(f) is equal to Gx1x2(f), we can write:
$|G_{x_1 x_2}(f)| = \alpha G_{s_1 s_1}(f)$  (2.16)
$\dfrac{\hat{G}_{x_1 x_2}(f)}{|G_{x_1 x_2}(f)|} = \dfrac{S_1(f)\, S_2^{*}(f)}{|S_1(f)|\,|S_2(f)|} = e^{j\theta(f)} = e^{j 2\pi f D}$  (2.17)
$\hat{R}_{y_1 y_2}^{(p)}(\tau) = \delta(t - D) = \mathrm{IFFT}\!\left( \dfrac{\hat{G}_{x_1 x_2}(f)}{|G_{x_1 x_2}(f)|} \right)$  (2.18)
In this ideal case, the result R̂(p)y1y2(τ) is a delta function centered at the correct delay.
In Eq. 2.17, S1(f) and S2(f) are the spectra of the two signals x1(t) and x2(t) respectively; the equation makes it easy to see that only the phase information of x1(t) and x2(t) is preserved, and that this phase information is carried by the term e^{j2πfD}.
The phase transform (PHAT) is an ad-hoc technique that pre-whitens the signals before computing the cross-correlation in order to obtain a sharp peak. The time delay information is present in the phases of the various frequencies, and these are not modified by the weighting transform ψg(f). The weighting transform tends to enhance the true delay and suppress all spurious delays. In real situations, this property gives a low sensitivity to the drawbacks caused by reverberation and multi-path distortion.
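With the gcc helper sketched earlier, PHAT reduces to one extra line: divide the cross spectrum by its own magnitude (Eq. 2.14) so that only phase survives. The small epsilon guarding the division is an implementation assumption, not part of the transform:

```python
import numpy as np

def phat(G, eps=1e-12):
    """PHAT weighting of Eq. 2.14: keep only the phase of G_x1x2(f)."""
    return G / np.maximum(np.abs(G), eps)

# GCC-PHAT delay estimate (Eq. 2.15) on the synthetic pair.
lags, r = gcc(x1, x2, weighting=phat)
d_hat = int(lags[np.argmax(r)])
```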
One disadvantage of PHAT is that it weights Ĝx1x2(f) by the inverse of Gs1s1(f); thus, errors are accentuated where the signal power is small.
The phase transform generalization assumes the noise at the two microphones to be uncorrelated; this assumption does not fully hold when the audio data come from two microphones separated by less than a meter. In such a situation, the localization error is proportional to the noise power. One easy way to overcome this problem is to place the microphones as far apart as possible so as to reduce the noise correlation; however, physical constraints on the microphone locations are often imposed.
2.1.5 Reverberation Reduction
To improve the estimate of the time delay D̂, a further processing step is introduced in the generalization of the cross-correlation function: part of the reverberation is removed by treating the reverberation component as additive noise added to the signal [20].
Reverberation appears when sound is produced in an enclosed environment and multiple reflections build up and blend together. The reflections due to walls, floor, or ceiling are more noticeable for the high-frequency components than for the low ones. Consider, for example, a sound such as a handclap, which contains the full range of frequency components: the sound itself stops after a short time, but the reflections continue, decreasing in amplitude over time.
To study the reverberation components, they are first introduced into the mathematical model; the new scenario is shown in Fig. 2.2, where the orange lines represent reverberations due to a wall.
Starting from the idea of considering the reflections as additive noise, the mathematical model of the system described in Eq. 2.1 and Eq. 2.2 becomes Eq. 2.19 and Eq. 2.20, where hr1(t) ⊛ s1(t) and hr2(t) ⊛ s1(t) represent the reverberations.
Figure 2.2: Audio analysis flowchart taking into account the reverberations
$x_1(t) = s_1(t) + h_{r_1}(t) \circledast s_1(t) + n_1(t)$  (2.19)
$x_2(t) = \alpha s_1(t + D) + h_{r_2}(t) \circledast s_1(t) + n_2(t)$  (2.20)
The functions hr1(t) and hr2(t) describe the way the reverberant components are added to the original signal; these components are strictly dependent on the local environment. hr1(t) and hr2(t) may be drawn as sums of shifted Dirac deltas, where the shifts describe the time delays between the reverberation components and the original signal.
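In discrete time such a sum of shifted, scaled Dirac deltas is just a sparse impulse response; the sketch below builds one and forms the reverberant term of Eq. 2.19 on the synthetic source s from the earlier sketch (the delays and gains are purely illustrative):

```python
import numpy as np

fs = 16000                            # sample rate in Hz (assumed, as above)
h_r = np.zeros(fs // 4)               # 250 ms reverberation tail
echo_delays_ms = [12.0, 31.0, 55.0]   # reflection arrival times (illustrative)
echo_gains = [0.6, 0.35, 0.2]         # decaying reflection amplitudes

for delay_ms, gain in zip(echo_delays_ms, echo_gains):
    h_r[int(delay_ms * fs / 1000)] = gain  # one shifted, scaled delta per reflection

# Reverberant component h_r1(t) convolved with s1(t), as in Eq. 2.19.
reverb = np.convolve(h_r, s)[: len(s)]
```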
Considering the reverberations as noise, and assuming the two reverberation transfer functions have the same power spectrum |HR(f)|² (with the two microphones spaced 1 meter apart, this assumption holds well), the overall noise power spectrum |N′(f)|² becomes Eq. 2.21:
$|N'(f)|^2 = |H_R(f)|^2 |S(f)|^2 + |N(f)|^2$  (2.21)
The optimum cross-correlation estimator from Eq. 2.15 can now be written as Eq. 2.22; this represents the GCC function estimated taking into account the additive noise due to the reverberation component:
$\hat{R}_{x_1 x_2}(\tau) = \mathrm{IFFT}\!\left( \dfrac{\hat{G}_{x_1 x_2}(f)}{|H_R(f)|^2 |S(f)|^2 + |N(f)|^2} \right)$  (2.22)
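In practice |HR(f)|², |S(f)|², and |N(f)|² are not known and must themselves be estimated; the sketch below only shows where they enter the weighting of Eq. 2.22, taking the power estimates as given arrays (how they are obtained is outside this sketch):

```python
import numpy as np

def reverb_aware_weighting(S_pow, HR_pow, N_pow, eps=1e-12):
    """Weighting of Eq. 2.22: divide the cross spectrum by the estimated
    reverberation-plus-noise power |HR|^2 |S|^2 + |N|^2.

    The power arrays must match the length of the rfft grid used in gcc().
    """
    denom = np.maximum(HR_pow * S_pow + N_pow, eps)  # epsilon is an implementation guard
    return lambda G: G / denom

# Usage with the gcc() helper, given spectral power estimates:
# lags, r = gcc(x1, x2, weighting=reverb_aware_weighting(S_pow, HR_pow, N_pow))
```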