
Wordlist extraction

Extracting an alphabetically sorted wordlist (more precisely, an "alphabetically ordered set of fully inflected word types") after tokenisation is, among other things, a convenient heuristic for checking the accuracy of the corpus representation. In an alphabetical wordlist, for example, misspellings and strange character combinations are immediately obvious.

There is a simple standard procedure for constructing a wordlist from a corpus which has been tokenised in the manner described above (see Figure 11 for an alphabetically sorted wordlist, and Figure 12 for a word frequency list in order of decreasing frequency).
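
As a rough illustration of both kinds of list, the following Python sketch builds them from a corpus that is assumed to be already tokenised into a list of strings. It is not the procedure shown in the figures; the function names and the sample tokens are invented for this example, and sorting follows Python's default code-point order rather than any language-specific collation.

```python
from collections import Counter


def sorted_wordlist(tokens):
    """Return the alphabetically ordered set of word types."""
    return sorted(set(tokens))


def frequency_list(tokens):
    """Return (word, count) pairs in order of decreasing frequency.

    Words with equal counts are listed alphabetically.
    """
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical tokenised corpus; "teh" is a deliberate misspelling.
tokens = ["the", "cat", "saw", "the", "dog", "teh", "cat", "the"]

for word in sorted_wordlist(tokens):
    print(word)

for word, count in frequency_list(tokens):
    print(count, word)
```

In the sorted output the misspelling "teh" appears directly before "the", which is the kind of visual check the alphabetical list makes possible.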

For spoken language system development, the next step is to create a pronunciation table or pronunciation lexicon: for each orthographic form, one or more pronunciations in phonemic notation are provided. There are several ways of doing this:

  1. lookup of irregular forms only in a reference lexicon,
  2. grapheme-to-phoneme (letter-to-sound) rules,
  3. lemmatisation, morphological classification, and generation of full paradigms with spelling and pronunciation,
  4. manual transcription.
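
A minimal sketch of how the first two methods might be combined is given below: irregular forms are looked up in a small exception lexicon, and all other words are passed through letter-to-sound rules. Everything in it is an assumption made for illustration: the names `EXCEPTIONS`, `RULES`, `letter_to_sound` and `pronunciation_table`, the toy rule set, and the SAMPA-like phoneme symbols. A realistic rule set for a language such as English would be far larger and context-sensitive.

```python
# Hypothetical exception lexicon for irregular forms (method 1).
EXCEPTIONS = {"one": ["w V n"]}

# Toy context-free letter-to-sound rules (method 2), illustrative only.
RULES = {
    "sh": "S", "th": "T",
    "i": "I", "p": "p", "n": "n", "t": "t",
}


def letter_to_sound(word):
    """Transcribe a word by greedy longest-match application of RULES."""
    phonemes = []
    i = 0
    while i < len(word):
        for length in (2, 1):  # try digraphs before single letters
            chunk = word[i:i + length]
            if len(chunk) == length and chunk in RULES:
                phonemes.append(RULES[chunk])
                i += length
                break
        else:
            raise ValueError(f"no rule for {word[i]!r} in {word!r}")
    return " ".join(phonemes)


def pronunciation_table(wordlist):
    """Map each orthographic form to a list of one or more pronunciations."""
    return {w: EXCEPTIONS.get(w) or [letter_to_sound(w)] for w in wordlist}


table = pronunciation_table(["one", "pin", "ship", "thin"])
for word, pronunciations in table.items():
    print(word, pronunciations)
```

Each entry holds a list so that a word can carry more than one pronunciation, as the text requires; words that the rules cannot cover raise an error and would have to be handled by one of the other methods, for example manual transcription.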

Languages differ greatly in the complexity of the letter-to-sound relation. In some, such as English or French, the relation is extremely complex; in others, such as Spanish, it is more straightforward. In any case, the construction of a pronunciation table is a complex task, and will not be dealt with in detail here (cf. [Adda-Decker & Lamel (this volume)]).

Figure 11: Construction of an alphabetically sorted wordlist. 

Figure 12: Construction of wordlist with decreasing frequency. 



Dafydd Gibbon
Thu Nov 19 10:12:05 MET 1998