-------------------------------------------
Simplex annotations and complex annotations
DG 2004-04-27, internal lexicography workshop
-------------------------------------------

DESCRIPTION OF A SPEECH OR LANGUAGE CORPUS

General term: annotation
Term used in speech technology: labelling
Term used in text technology: markup

A simplex annotation (e.g. label) is an event, e.g. an occurrence of a
word, phoneme, syllable, feature at a specific interval in a corpus.

A complex annotation consists of one or more tiers of events.

A tier is a sequence of events of the same type.

--------------------
Allen relations:
- calculus of intervals
- 13 relations
--------------------

Event logic: axiomatic relational logic
  event =def pair of a property and an interval
  - Johan van Benthem, Amsterdam
  - Applied to phonology in order to explicate autosegmental
    phonology by Steven Bird and Ewan Klein (1989)
    - "Event Phonology"

  event  <[attribute: value], interval>
         <[attribute: value],

Examples of annotation:

Xwaves (esps-waves+):
  e.g.: `table 1.030'
  problem: the beginning is only implicit and has to be inferred
           by the user or added in ad hoc fashion
  - implicit, partial interval definition

SAM:
  e.g.: `table 1.030 1.659'
  corresponding to <[orth: table], <1.030, 1.659>>

Praat: same as SAM, but with its own notation

TASX: same as SAM, but with XML notation

-------------------------------------------------------------------------

HOW DOES THIS RELATE TO THE LEXICON?

- Lexical acquisition
  - list of lexical items, e.g. a wordlist
    - problem: what is a word?
    - make list by...
      - converting the text to a list of words
      - sorting the list of words
      - removing duplicates
  - extract corpus properties of the list items (i.e. microstructure
    elements which can be inferred from corpus relations) by...
    - frequency count (absolute or relative / percent)
    - rank ordering

- Lexical representation
  - macrostructure: overall structure of dictionary
  - mesostructure: generalisations over microstructures
    - definitions of grammar, pronunciation, ...
    - cross references (e.g. semantic relations)
    - references to corpus (e.g. concordance, examples)
  - microstructure: types of lexical information, `DATCATS'
    (data categories), e.g.
    - STRUCTURAL PROPERTIES (can be extracted from corpus)
      - external context, e.g. collocations
      - internal structure, e.g. derived, compound words, idioms
    - INTERPRETATIVE PROPERTIES
      - MEANING: semantic, pragmatic
      - FORM: phonetic, orthographic
    - METADATA PROPERTIES (local housekeeping properties)
      - lexicographer
      - source
      - dates of creation, modification

  Note: there are global metadata properties which apply to the
  whole lexicon, e.g.
    - language
    - corpus used
    - publication details

  Note:
    - macrostructure contains mesostructure contains microstructure

- Lexical access
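
Sketch for the `Lexical acquisition' steps above (convert the text to a
list of words, sort, remove duplicates, then frequency count and rank
ordering). It assumes one naive answer to `what is a word?': a
lowercased run of letters or apostrophes. The names tokenize, wordlist
and frequency_table are illustrative only and do not come from any of
the tools mentioned in these notes.

```python
import re
from collections import Counter


def tokenize(text):
    """Convert text to a list of words.

    Assumption: a 'word' is a lowercased run of letters or apostrophes.
    Any real project would need to decide this more carefully.
    """
    return re.findall(r"[a-z']+", text.lower())


def wordlist(tokens):
    """Remove duplicates and sort the remaining words alphabetically."""
    return sorted(set(tokens))


def frequency_table(tokens):
    """Return rows of (rank, word, absolute count, relative count in percent).

    Rows are ordered by descending count; ties are broken alphabetically,
    so tied words still receive distinct consecutive ranks.
    """
    counts = Counter(tokens)
    total = len(tokens)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, word, n, 100.0 * n / total)
        for rank, (word, n) in enumerate(ordered, start=1)
    ]


if __name__ == "__main__":
    tokens = tokenize("The table is on the table. The end.")
    print(wordlist(tokens))
    for rank, word, n, percent in frequency_table(tokens):
        print(f"{rank:>4}  {word:<10} {n:>4}  {percent:6.2f}%")
```

Breaking ties alphabetically is one possible choice; giving tied words
the same rank would be an equally reasonable convention.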