Next: Special lexica: the case Up: Steps in practical lexicography Previous: Wordlist extraction

Lemmatisation

If fully inflected words are required as they occur in the text stream, then corpus wordlist construction is essentially complete after the wordlist extraction step, once the list has been validated for corpus-determined and procedure-determined errors. This kind of lexicon, supplemented with pronunciation information, is the standard type of lexicon used in current automatic speech recognition (ASR) systems.

However, this kind of lexicon is not particularly interesting or useful from a linguistic point of view, or for many speech synthesis applications. Many languages, unlike English, are moderately or highly inflecting. In order to construct a useful lexicon for these languages, and for English too, a lemma (generally the uninflected stem) must be identified for each attested form. In the simplest case, affix-stripping, especially suffix-stripping, is applied: inflections are removed and the stem is identified as an inflection-neutral underlying form. The lemmata are often represented by a canonical orthographic form, such as the nominative singular for nouns or the infinitive for verbs.
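The simplest case described above can be illustrated with a minimal Python sketch. The suffix list, the minimum stem length and the function names are assumptions made purely for illustration; they are not taken from any particular lexicon-building tool, and the candidate stems produced are orthographic strings, not validated lemmata.

```python
from collections import defaultdict

# Hypothetical suffix list for illustration; suffixes are tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def strip_suffix(form, min_stem=3):
    """Remove the first matching suffix, provided enough of the stem remains."""
    for suffix in SUFFIXES:
        stem = form[: len(form) - len(suffix)]
        if form.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return form


def group_by_stem(wordlist):
    """Map each candidate stem to the sorted attested forms that reduce to it."""
    groups = defaultdict(set)
    for form in wordlist:
        lowered = form.lower()
        groups[strip_suffix(lowered)].add(lowered)
    return {stem: sorted(forms) for stem, forms in groups.items()}


lexicon = group_by_stem(["walk", "walks", "walked", "walking", "talks"])
```

Here the four attested forms of walk are collected under the single candidate stem walk, while talks is reduced to talk. The minimum stem length is a crude guard that keeps very short words such as bus from being stripped.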

Affix-stripping is not as simple as it may seem. Different affixes trigger different morphographemic and morphophonemic modifications of stem forms, and stems may end with affix-like sequences (cf. the Eng. adverbial suffix -ly, correctly identifiable in the adverb slowly, but not in the adjective friendly, the noun and verb fly, or the verb apply). It is not sufficient just to remove the 's' in the following examples, for instance: ladies, lady's, ladies'. Additionally, morphographemic and morphophonemic modifications may not be entirely parallel, and when the stem itself is modified (Eng. swim, swam, swum; Ger. Stadt, Städte), affix-stripping is clearly inadequate. However, it is known that the operations involved in affix-stripping can be described by finite state means, and this suggests that affix-stripping routines are well within the reach of UNIX tools. In fact, a number of text-to-speech front ends for speech synthesis have been written in UNIX scripting languages, e.g. David Haubensack's Perl script for a French TTS front end to the MBROLA synthesiser (check sources on the Web).
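The difficulties listed above can be made concrete with a small sketch in Python, written in the spirit of a UNIX scripting solution. Each rule is an ordered regular-expression rewrite, which is one simple way of approximating a finite state description. The rule set, the exception table and the protected-word list are hypothetical and cover only the examples discussed here; a realistic lemmatiser would need far larger tables and would still misanalyse some forms.

```python
import re

# Hypothetical exception table for forms with stem modification,
# which no amount of suffix-stripping can recover (keys are lower-cased).
IRREGULAR = {"swam": "swim", "swum": "swim", "städte": "stadt"}

# Forms that merely end in an affix-like sequence and must be left alone.
PROTECTED = {"friendly", "fly", "apply"}

# Ordered (pattern, replacement) rewrite rules; the first match wins.
RULES = [
    (re.compile(r"ies'?$"), "y"),       # ladies, ladies' -> lady
    (re.compile(r"'s$"), ""),           # lady's -> lady
    (re.compile(r"s'$"), ""),           # girls' -> girl
    (re.compile(r"(?<=[^s])s$"), ""),   # cats -> cat, but glass is untouched
    (re.compile(r"ly$"), ""),           # slowly -> slow
]


def lemmatise(form):
    """Return a candidate lemma for one attested form (lower-cased)."""
    word = form.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word in PROTECTED:
        return word
    for pattern, replacement in RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word


for form in ["ladies", "lady's", "ladies'", "slowly", "friendly", "swam"]:
    print(form, "->", lemmatise(form))
```

The three forms ladies, lady's and ladies' all reduce to lady because the rewrite restores the stem-final y instead of simply deleting the s. The protected list and the exception table show where pure affix-stripping gives out: friendly, fly and apply have to be listed explicitly, and swam and swum can only be related to swim by lookup.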



Dafydd Gibbon
Thu Nov 19 10:12:05 MET 1998