We have also investigated the fusion of audio and visual features for speaker identification. The proposed audio-visual correlative model (AVCM), described in "Technologies", captures the correlations between the audio and visual features.

We also explore the fusion of audio and visual evidence through a multi-level hybrid fusion architecture based on DBN, which combines model-level and decision-level fusion and achieves higher speaker identification performance.

The multi-level fusion strategy is illustrated in the following figure. There are three models altogether: the audio-only model, the video-only model, and the AVCM, which performs model-level audio-visual fusion. These three models are further combined by means of decision-level fusion to deliver the final speaker identification result.

Strategy for audio-visual multi-level fusion

DBN structure for audio-visual multi-level fusion
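
The decision-level step described above, in which the audio-only, video-only and AVCM outputs are combined into one result, can be illustrated with a minimal sketch. The text does not state the combination rule, so the weighted sum of normalized log-scores used here is an assumption, as are the function names (log_normalize, fuse_decisions), the weights and the example scores. It is not the DBN-based implementation itself.

```python
import math


def log_normalize(log_scores):
    """Shift per-speaker log-scores so they form log-posteriors (log-softmax)."""
    peak = max(log_scores)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in log_scores))
    return [s - log_total for s in log_scores]


def fuse_decisions(audio, video, avcm, weights=(1.0, 1.0, 1.0)):
    """Combine three per-speaker log-score lists and pick the best speaker.

    Assumption: fusion is a weighted average of normalized log-scores.
    Returns (index of the winning speaker, list of fused scores).
    """
    streams = [log_normalize(s) for s in (audio, video, avcm)]
    if len({len(s) for s in streams}) != 1:
        raise ValueError("all models must score the same set of speakers")
    if len(weights) != 3 or sum(weights) <= 0:
        raise ValueError("expected three weights with a positive sum")

    total = sum(weights)
    n_speakers = len(streams[0])
    fused = [
        sum(w * s[k] for w, s in zip(weights, streams)) / total
        for k in range(n_speakers)
    ]
    best = max(range(n_speakers), key=fused.__getitem__)
    return best, fused


# Hypothetical log-likelihoods for three enrolled speakers.
audio_scores = [-10.0, -12.0, -11.0]   # audio-only model favours speaker 0
video_scores = [-20.0, -19.0, -25.0]   # video-only model favours speaker 1
avcm_scores = [-30.0, -28.5, -33.0]    # AVCM favours speaker 1

best_speaker, fused_scores = fuse_decisions(audio_scores, video_scores, avcm_scores)
```

Normalizing each model's scores before averaging keeps one model from dominating only because its likelihoods are on a different scale. In practice the weights would presumably be tuned on held-out data.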


Related Publication: