Alignment of speech and co-speech gesture in a constraint-based grammar
Amand, Katya Saint
This thesis concerns the form-meaning mapping of multimodal communicative actions consisting of speech signals and improvised co-speech gestures, produced spontaneously with the hand. The interaction between speech and speech-accompanying gestures has standardly been addressed from a cognitive perspective, to establish the cognitive mechanisms underlying synchronous speech and gesture production, and from a computational perspective, to build computer systems that communicate through multiple modalities. Based on the findings of this previous research, we advance a new theory in which the mapping from the form of the combined speech-and-gesture signal to its meaning is analysed in a constraint-based multimodal grammar. We propose several construction rules about multimodal well-formedness that we motivate empirically from an extensive and detailed corpus study. In particular, the construction rules use the prosody, syntax and semantics of speech, the form and meaning of the gesture signal, and the temporal performance of the speech relative to that of the gesture to constrain the derivation of a single multimodal syntax tree, which in turn determines a meaning representation via standard mechanisms for semantic composition. Gestural form often underspecifies its meaning, and so the output of our grammar is underspecified logical formulae that support the range of possible interpretations of the multimodal act in its final context-of-use, given current models of the semantics/pragmatics interface. It is standardly held in the gesture community that the co-expressivity of speech and gesture is determined on the basis of their temporal co-occurrence: that is, a gesture signal is semantically related to the speech signal that happened at the same time as the gesture.
Whereas this is usually taken for granted, we propose a methodology for establishing, in a systematic and domain-independent way, which spoken element(s) a gesture can be semantically related to, based on their form, so as to yield a meaning representation that supports the intended interpretation(s) in context. The ‘semantic’ alignment of speech and gesture is thus driven not by temporal co-occurrence alone, but also by the linguistic properties of the speech signal that the gesture overlaps with. In so doing, we contribute a fine-grained system for articulating the form-meaning mapping of multimodal actions that uses standard methods from linguistics. We show that just as language exhibits ambiguity in both form and meaning, so do multimodal actions: for instance, the integration of gesture is not restricted to a unique speech phrase; rather, speech and gesture can be aligned in multiple multimodal syntax trees, thus yielding distinct meaning representations. These multiple mappings stem from the fact that the meaning derived from gesture form is highly incomplete, even in context. An overall challenge is thus to account for the range of possible interpretations of the multimodal action in context using standard methods from linguistics for syntactic derivation and semantic composition.