1 Purpose and Theoretical Background

The CoGesT [kost] Conversational Gesture Transcription System is being developed in the DFG-funded project ``Theory and design of multimodal lexica'', which forms part of the research group ``Text technological modelling of information'' at the Universities of Bielefeld, Dortmund, Giessen and Tübingen. It is a twofold system for the transcription of gestures produced in conversational speech, covering both their form and their function.

The term gesture as used here refers to the communicatively relevant position or movement of any limbs or parts of the body of a person involved in face-to-face discourse, most prominently conversation. Conversation refers to any type of speech that includes at least one other person apart from the speaker. This other person has to be physically present, or the speaker needs to know that she or he is visible to the other person (this includes video conference conversations, for example).

CoGesT is based on the notion that conversation is multimodal and comprises at least the acoustic and the visual modalities (see also Gibbon et al., 2000). Speakers are assumed to express meanings by means of information in the acoustic and the visual modalities, which are inextricably linked and codependent. Each modality can be further divided into several submodalities: the acoustic modality, for example, consists of parallel information streams on different linguistic levels, such as the segmental (speech sounds) and the prosodic (pitch, speech tempo, voice quality). Speakers may, for example, use intonation, fast speech or a breathy voice to convey a specific meaning. The visual modality can be divided into several submodalities according to the various body parts which produce gestures during conversation, e.g. the hands and the eyebrows. In face-to-face conversation, visual information is always present, and a parallel use of acoustic information is possible.

The purpose of CoGesT is the description of gestures within this multimodal conversational context. The focus does not lie on an exhaustively detailed description of every aspect of gesture, as provided for example by HamNoSys (see Prillwitz et al., 1989), a phonetic notation for signs in German Sign Language, or by FORM (see Martell, 2002), developed for conversational gestures and general body motions. Rather, CoGesT focuses on linguistically relevant gestural forms, motivated by the functions of gestures within multimodal conversations and appropriate for collating in a multimodal lexicon.

The theoretical assumptions underlying the CoGesT system differ from other descriptions of gestures in three important ways.

  1. Linguistically motivated categories. All categories for gestural transcriptions are linguistically motivated. This means that we assume the null hypothesis that the patterning of visual gesture is semiotically organised in much the same way as the acoustically transformed articulatory gestures in speech. We distinguish between form and function and assume that gestures have both morphological and syntactic rules for structural and sequential combinations.
  2. Clear distinction between form and function. The CoGesT system is being developed for the purpose of representing gestures in both a corpus and a multimodal lexicon. The classification of gestures proceeds according to their functional relation with the other modalities of the conversation. Being systematic in nature, it is intended to serve as the basis for formalization and implementation. In contrast to McNeill (1992), who claims that there are no separately structured systems of form and meaning in gestures (p. 23), the CoGesT system is based on the theoretical assumption that a clear distinction between gestural forms and functions is possible. When making a gesture whilst saying ``And there was this circle in the sand.'', it does not matter whether one uses one's left or right hand, the index finger only, the thumb, index finger and middle finger, or all fingers of the hand in a circling movement; the function of the gesture, i.e. the illustration of the meaning of the word ``circle'', is the same. This is a case of gestural paraphrase. Likewise, a single gesture may be ambiguous in or out of context, i.e. have more than one function, and therefore potentially be a source of misunderstanding. As an example, take a circle formed with thumb and forefinger, which can be interpreted as an icon for a circle, as a sign meaning ``okay'', or as an insult, depending on the surrounding physical, social and cultural context. This means that although the form of gestures may allow certain variations, their functions in communication can be described as a separate set of categories. It also implies that despite the fact that gestures are ``spontaneous, unique and personal'' (McNeill, 1992), they are instances of a system and can be classified into functionally relevant categories. This, in principle, also applies to prosody. It has not yet been clarified whether gestural functions are discrete or gradient.
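The many-to-many relation between form and function described above (paraphrase: several forms, one function; ambiguity: one form, several functions) can be sketched as a simple lookup structure. All form and function labels below are invented for illustration and are not part of the actual CoGesT category set.

```python
# Hypothetical sketch of the form/function distinction: each gestural
# form maps to the set of functions it can realise. Several forms
# sharing a function are paraphrases; a form with several functions
# is ambiguous. Labels are illustrative, not CoGesT notation.

FORM_FUNCTIONS = {
    "right_index_circle":    {"depict:circle"},
    "left_hand_circle":      {"depict:circle"},  # paraphrase of the above
    "thumb_forefinger_ring": {"depict:circle", "sign:okay", "insult"},  # ambiguous
}

def paraphrases(form, lexicon):
    """Other forms whose possible functions overlap with this form's."""
    return sorted(f for f, fns in lexicon.items()
                  if f != form and fns & lexicon[form])

def is_ambiguous(form, lexicon):
    """A form is ambiguous if it can realise more than one function."""
    return len(lexicon[form]) > 1

print(paraphrases("right_index_circle", FORM_FUNCTIONS))
# ['left_hand_circle', 'thumb_forefinger_ring']
print(is_ambiguous("thumb_forefinger_ring", FORM_FUNCTIONS))  # True
```

Keeping forms and functions in separate vocabularies, linked only by such a mapping, is one way to operationalize the theoretical claim that the two can be described as independent category sets.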

  3. Notational system. CoGesT provides a notational system based on a clear distinction between categories of gestural form and function, and on a distinction between compulsory basic categories and optional additional categories of description. The categories are mapped on to a notation which can be adapted to various scientific requirements and which should not be confused with the category set itself. Possible uses are the description and comparison of gestures by speakers of different competence (child vs. adult, native vs. non-native speaker), of different languages and personalities, and in different types of conversations.

The CoGesT system currently distinguishes between the function of gestures and a number of dimensions of form (phase, location, directionality, and shape), and allows further extensions as required.
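One way to picture such a transcription record is as a structured object in which the form dimensions named above are grouped together, compulsory fields are required and optional ones may be omitted, and the functional label is kept strictly separate from the form description. The field names and values below are purely illustrative assumptions, not CoGesT's actual notation.

```python
# Hypothetical record for a single gesture annotation: the form
# dimensions from the text (phase, location, directionality, shape)
# are grouped in one structure, with some fields compulsory and some
# optional, and the function is stored separately from the form.
# All names and values are invented for illustration.

from dataclasses import dataclass
from typing import Optional

@dataclass
class GestureForm:
    phase: str                            # compulsory, e.g. "stroke"
    location: str                         # compulsory, e.g. "centre"
    directionality: Optional[str] = None  # optional extension
    shape: Optional[str] = None           # optional extension

@dataclass
class GestureAnnotation:
    form: GestureForm                # form description of the gesture
    function: Optional[str] = None   # functional label, assigned separately

g = GestureAnnotation(
    form=GestureForm(phase="stroke", location="centre", shape="circle"),
    function="depict:circle",
)
print(g.form.shape, g.function)  # circle depict:circle
```

Because the notation is only a mapping from the category set, the same record could equally be serialized in another concrete format without changing the underlying categories.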

The paper is organized as follows: Section 2 describes the classification of gestures according to form. Section 3 describes ways in which gestures can be combined with regard to precedence and overlap, while Section 4 outlines the functional relationship of gestures with other modalities. The notational scheme for the gesture categories is introduced in Section 5, and an operationalization of the CoGesT system in annotation is exemplified in Section 6. Results of the evaluation of the system in terms of an inter-rater reliability study are presented in Section 7.

Thorsten Trippel 2003-08-12