University of Surrey


Audio-visual tracking of multiple moving speakers.

Kilic, V. (2016) Audio-visual tracking of multiple moving speakers. Doctoral thesis, University of Surrey.

thesis_final_volkan.pdf - Version of Record
Available under License Creative Commons Attribution Non-commercial Share Alike.

Download (19MB) | Preview


In this thesis, a novel approach is proposed for multi-speaker tracking that integrates audio and visual data in a particle filtering (PF) framework. The approach is then extended to adaptively estimate two critical parameters of the PF, namely the number of particles and the noise variance, based on the tracking error and the area occupied by the particles in the image. Both methods assume that the number of speakers is known and constant during tracking. To relax this assumption, random finite set (RFS) theory is used, since it can handle the problem of tracking a variable number of speakers. However, the computational complexity of the RFS approach increases exponentially with the number of speakers. The probability hypothesis density (PHD) filter, which is a first-order approximation of the RFS, is therefore applied with a sequential Monte Carlo (SMC), i.e. particle filter, implementation, whose computational complexity increases only linearly with the number of speakers. In visual tracking, the SMC-PHD filter uses three types of particles (surviving, spawned and born particles) to model the state of the speakers and to estimate the number of speakers. We propose to use audio data in the distribution of these particles to improve the visual SMC-PHD filter in terms of estimation accuracy and computational efficiency. The tracking accuracy of the proposed algorithm is further improved by a modified mean-shift algorithm, and the extra computational cost introduced by mean-shift is controlled with a sparse sampling technique. Quantitative evaluation requires both audio and video sequences, together with calibration information for the cameras and the (circular) microphone arrays. To this end, the AV16.3 dataset is used to demonstrate the performance of the proposed methods in a variety of scenarios, such as occlusion and rapid movements of the speakers.
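
To make the particle filtering idea in the abstract concrete, the following is a minimal sketch of a bootstrap particle filter that fuses an audio and a visual position measurement for a single speaker. It is not the algorithm developed in the thesis. The random-walk motion model, the isotropic Gaussian likelihoods, the image size, all noise parameters and every function name (`gaussian_likelihood`, `systematic_resample`, `pf_step`, `track_demo`) are assumptions chosen purely for illustration.

```python
import math
import random


def gaussian_likelihood(particle, measurement, sigma):
    """Unnormalised isotropic Gaussian likelihood of a 2D measurement."""
    dx = particle[0] - measurement[0]
    dy = particle[1] - measurement[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))


def systematic_resample(particles, weights, rng):
    """Draw len(particles) samples in proportion to the normalised weights."""
    n = len(particles)
    start = rng.random() / n
    resampled, cumulative, i = [], weights[0], 0
    for k in range(n):
        u = start + k / n
        while u > cumulative and i < n - 1:
            i += 1
            cumulative += weights[i]
        resampled.append(particles[i])
    return resampled


def pf_step(particles, visual_z, audio_z, rng,
            motion_sigma=5.0, visual_sigma=10.0, audio_sigma=25.0):
    """One predict / update / resample cycle; returns (particles, estimate)."""
    # Predict: random-walk motion model (an assumption of this sketch).
    predicted = [(x + rng.gauss(0.0, motion_sigma),
                  y + rng.gauss(0.0, motion_sigma)) for x, y in particles]
    # Update: the two modalities are treated as conditionally independent,
    # so their likelihoods multiply (also an assumption of this sketch).
    weights = [gaussian_likelihood(p, visual_z, visual_sigma)
               * gaussian_likelihood(p, audio_z, audio_sigma)
               for p in predicted]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed: fall back to uniform weights.
        weights = [1.0 / len(predicted)] * len(predicted)
    else:
        weights = [w / total for w in weights]
    # Point estimate: weighted mean of the predicted particles.
    estimate = (sum(w * p[0] for w, p in zip(weights, predicted)),
                sum(w * p[1] for w, p in zip(weights, predicted)))
    return systematic_resample(predicted, weights, rng), estimate


def track_demo(seed=0, n_particles=500, n_frames=30):
    """Track one synthetic speaker moving in a hypothetical 360x288 image."""
    rng = random.Random(seed)
    particles = [(rng.uniform(0, 360), rng.uniform(0, 288))
                 for _ in range(n_particles)]
    truth = (100.0, 150.0)
    estimate = truth
    for _ in range(n_frames):
        truth = (truth[0] + 2.0, truth[1] + 1.0)
        # Synthetic measurements: the visual one is less noisy than the audio one.
        visual_z = (truth[0] + rng.gauss(0, 3.0), truth[1] + rng.gauss(0, 3.0))
        audio_z = (truth[0] + rng.gauss(0, 10.0), truth[1] + rng.gauss(0, 10.0))
        particles, estimate = pf_step(particles, visual_z, audio_z, rng)
    error = math.hypot(estimate[0] - truth[0], estimate[1] - truth[1])
    return estimate, truth, error


if __name__ == "__main__":
    est, true_pos, err = track_demo()
    print("estimate:", est, "truth:", true_pos, "error (px):", round(err, 2))
```

The sketch keeps the number of particles and the noise variance fixed and tracks a single, always-present speaker. Those are the limitations that the adaptive PF and the SMC-PHD filter described above are intended to address.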

Item Type: Thesis (Doctoral)
Divisions : Theses
Authors : Kilic, V.
Date : 29 January 2016
Funders : Turkish government
Contributors :
Depositing User : Volkan Kilic
Date Deposited : 09 Feb 2016 10:36
Last Modified : 31 Oct 2017 18:00


© The University of Surrey, Guildford, Surrey, GU2 7XH, United Kingdom.
+44 (0)1483 300800