The bimodality of human speech has two aspects.

    (1) Speech is produced by the movement of the articulatory organs, including the nasal cavity, tongue, teeth, velum, and lips, together with the muscles that generate facial expressions. Since some of these articulators are visible, there are inherent correlations between the audio and visual speech.
    (2) The audio and visual features have different durations; in other words, there is only loose timing synchrony between them. For instance, the mouth opens before speech is produced, so the visual speech begins before the audio; and the mouth closes only after speech production ends, so the visual speech ends after the audio. Furthermore, the time lag between the mouth movement and the voice may depend on the speaker or the context.

    An audio-visual bimodal fusion model should account for both aspects of this bimodality. We propose a new dynamic Bayesian network (DBN) based model to describe these correlations; the topology of the DBN is depicted below.

structure of the proposed DBN

    Dynamic Bayesian networks (DBNs) are a class of Bayesian networks designed to model temporal processes as the stochastic evolution of a set of random variables over time. A DBN is a directed acyclic graph whose topology can be easily configured to describe various relations among variables, offering a flexible and extensible means of modeling both the feature-level and temporal correlations between audio and visual cues.
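
    To make the idea of a DBN topology concrete, the sketch below writes a generic two-slice audio-visual template as an edge list and unrolls it over time. It is only an illustration under assumed node names (a hidden speech state W emitting an audio observation A and a visual observation V); the actual topology of the proposed model is the one shown in the figure above.

    # Illustrative two-slice DBN template (not the exact proposed topology).
    # Nodes are (name, slice) pairs; slice 0 is time t, slice 1 is time t+1.
    INTRA_SLICE_EDGES = [
        (("W", 0), ("A", 0)),   # hidden speech state emits the audio observation
        (("W", 0), ("V", 0)),   # the same state emits the visual observation,
                                # capturing the audio-visual feature correlation
    ]
    INTER_SLICE_EDGES = [
        (("W", 0), ("W", 1)),   # state transition between consecutive slices
    ]

    def unroll(num_slices):
        """Unroll the two-slice template into a DAG over num_slices frames."""
        edges = []
        for t in range(num_slices):
            for (u, du), (v, dv) in INTRA_SLICE_EDGES:
                edges.append(((u, t + du), (v, t + dv)))
        for t in range(num_slices - 1):
            for (u, du), (v, dv) in INTER_SLICE_EDGES:
                edges.append(((u, t + du), (v, t + dv)))
        return edges

    if __name__ == "__main__":
        for parent, child in unroll(3):
            print(parent, "->", child)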

    For the estimation of the fusion weights, we explore a new approach based on support vector regression (SVR), which estimates the weights directly from the original audio features, as depicted in the following figure.

fusion weight estimation using SVR

    The primary audio features are first extracted from the original audio signal. These features are then re-sampled by the Sigma-Pi method to obtain secondary distribution features that describe the distribution of the original acoustic features; this re-sampling also reduces the feature dimensionality. Finally, SVR is used to predict the fusion weights. A sketch of this pipeline is given below.
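
    The following sketch shows the shape of such a pipeline using scikit-learn's SVR, which is an assumed implementation choice, not the one used in the work. The Sigma-Pi re-sampling step is replaced here by a hypothetical stand-in (per-dimension mean and standard deviation of the frame-level features), since its exact formulation is not reproduced here; the training targets are synthetic placeholders.

    import numpy as np
    from sklearn.svm import SVR

    def distribution_features(frames):
        # frames: (num_frames, feature_dim) primary acoustic features (e.g. MFCCs).
        # Hypothetical stand-in for Sigma-Pi re-sampling: summarize the frame-level
        # feature distribution with per-dimension mean and standard deviation,
        # yielding a single low-dimensional vector per utterance.
        return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

    rng = np.random.default_rng(0)

    # Toy training set: 50 utterances with random "primary" features and a
    # synthetic target fusion weight in [0, 1] (in practice the targets would
    # come from the audio-visual training procedure).
    X_train = np.stack([
        distribution_features(rng.normal(size=(120, 13))) for _ in range(50)
    ])
    y_train = rng.uniform(0.0, 1.0, size=50)

    # Fit the regressor and predict the fusion weight for a new utterance.
    svr = SVR(kernel="rbf", C=1.0, epsilon=0.05)
    svr.fit(X_train, y_train)

    new_utterance = distribution_features(rng.normal(size=(100, 13)))
    predicted_weight = float(svr.predict(new_utterance[None, :])[0])
    print(f"predicted audio-stream fusion weight: {predicted_weight:.3f}")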


Related Publication: