CUHK - Research on audio-visual bimodal fusion

    We have also investigated the fusion of audio and visual features for speaker identification. The proposed audio-visual correlative model (AVCM) describes the correlations between the audio and visual features; the model itself is described in "Technologies".

    We also explore the fusion of audio and visual evidence through a multi-level hybrid fusion architecture based on a dynamic Bayesian network (DBN). The architecture combines model-level and decision-level fusion and achieves higher speaker identification performance.

    The multi-level fusion strategy is illustrated in the following figure. There are three models altogether: the audio-only model, the video-only model, and the AVCM, which performs model-level audio-visual fusion. The outputs of these three models are then combined by decision-level fusion to deliver the final speaker identification result.

Strategy for audio-visual multi-level fusion
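
    The decision-level step can be pictured with a small sketch. It is only an illustration under stated assumptions: the text above does not say how the three model outputs are combined, so the weighted sum of per-speaker log-likelihoods, the weight values, the function names and the example scores below are all hypothetical.

```python
# Illustrative sketch only: a weighted-sum decision-level fusion of three models.
# The combination rule, weights, names and scores are assumptions, not the
# method used in the work described above.

def fuse_scores(audio, video, avcm, weights=(0.5, 0.25, 0.25)):
    """Combine per-speaker log-likelihoods from the three models.

    Each argument maps a speaker label to that model's log-likelihood.
    Only speakers scored by all three models are kept.
    """
    w_audio, w_video, w_avcm = weights
    speakers = audio.keys() & video.keys() & avcm.keys()
    return {
        s: w_audio * audio[s] + w_video * video[s] + w_avcm * avcm[s]
        for s in speakers
    }


def identify(fused):
    """Return the speaker with the highest fused score."""
    return max(fused, key=fused.get)


# Hypothetical scores for two enrolled speakers.
audio_scores = {"alice": -10.0, "bob": -12.0}
video_scores = {"alice": -9.0, "bob": -8.5}
avcm_scores = {"alice": -7.0, "bob": -9.0}

fused = fuse_scores(audio_scores, video_scores, avcm_scores)
best = identify(fused)
```

    In this made-up example the video-only model slightly prefers "bob", but the audio-only model and the AVCM both prefer "alice", so the fused decision is "alice". In practice the weights would need to be tuned on held-out data.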

DBN for audio-visual multi-level fusion

DBN structure for audio-visual multi-level fusion

Related Publication:

Wu, Z. Y., Cai, L. H., Meng, H., "Multi-level Fusion of Audio and Visual Features for Speaker Identification", International Conference on Biometrics (IAPR ICB), Hong Kong, China, 5-7 January 2006