Automatic recognition of multiparty human interactions using dynamic Bayesian networks
Relating statistical machine learning approaches to the automatic analysis of multiparty communicative events, such as meetings, is an ambitious research area. We have investigated automatic meeting segmentation both in terms of “Meeting Actions” and “Dialogue Acts”. Dialogue acts model the discourse structure at a fine grained level highlighting individual speaker intentions. Group meeting actions describe the same process at a coarse level, highlighting interactions between different meeting participants and showing overall group intentions. A framework based on probabilistic graphical models such as dynamic Bayesian networks (DBNs) has been investigated for both tasks. Our first set of experiments is concerned with the segmentation and structuring of meetings (recorded using multiple cameras and microphones) into sequences of group meeting actions such as monologue, discussion and presentation. We outline four families of multimodal features based on speaker turns, lexical transcription, prosody, and visual motion that are extracted from the raw audio and video recordings. We relate these lowlevel multimodal features to complex group behaviours proposing a multistreammodelling framework based on dynamic Bayesian networks. Later experiments are concerned with the automatic recognition of Dialogue Acts (DAs) in multiparty conversational speech. We present a joint generative approach based on a switching DBN for DA recognition in which segmentation and classification of DAs are carried out in parallel. This approach models a set of features, related to lexical content and prosody, and incorporates a weighted interpolated factored language model. In conjunction with this joint generative model, we have also investigated the use of a discriminative approach, based on conditional random fields, to perform a reclassification of the segmented DAs. The DBN based approach yielded significant improvements when applied both to the meeting action and the dialogue act recognition task. On both tasks, the DBN framework provided an effective factorisation of the state-space and a flexible infrastructure able to integrate a heterogeneous set of resources such as continuous and discrete multimodal features, and statistical language models. Although our experiments have been principally targeted on multiparty meetings; features, models, and methodologies developed in this thesis can be employed for a wide range of applications. Moreover both group meeting actions and DAs offer valuable insights about the current conversational context providing valuable cues and features for several related research areas such as speaker addressing and focus of attention modelling, automatic speech recognition and understanding, topic and decision detection.