Extraction of an alphabetically sorted wordlist (more precisely: an 'alphabetically ordered set of fully inflected word types') after tokenisation is, among other things, a convenient heuristic for checking the accuracy of the corpus representation. In an alphabetic wordlist, for example, misspellings and strange character combinations are immediately obvious.
There is a simple standard procedure for constructing a wordlist from a corpus which has been tokenised in the manner described above (see Figure 11, and Figure 12 for a word frequency list in order of decreasing frequency).
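As an illustration of the two wordlist types just mentioned, the following is a minimal sketch in Python, not a reproduction of the procedures in the figures. It assumes the tokenised corpus is already available as a sequence of word tokens; the function names and the sample tokens are hypothetical.

```python
from collections import Counter


def alphabetical_wordlist(tokens):
    """Return the distinct word types in sorted order.

    Note: sorted() compares by character code, so upper-case
    forms sort before lower-case ones unless tokens are
    normalised first.
    """
    return sorted(set(tokens))


def frequency_wordlist(tokens):
    """Return (type, count) pairs in order of decreasing frequency.

    Types with equal counts are listed alphabetically.
    """
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample; "teh" stands for a misspelling in the corpus.
tokens = ["the", "cat", "saw", "the", "dog", "teh", "cat", "the"]

print(alphabetical_wordlist(tokens))
# ['cat', 'dog', 'saw', 'teh', 'the']

print(frequency_wordlist(tokens))
# [('the', 3), ('cat', 2), ('dog', 1), ('saw', 1), ('teh', 1)]
```

In the alphabetical list the misspelt form "teh" appears directly next to "the", which is the kind of visual check described above.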
For spoken language system development, the next step is to create a pronunciation table or pronunciation lexicon: for each orthographic form, one or more pronunciations in phonemic notation are provided. There are several ways of doing this:
Languages differ greatly in the complexity of the letter-to-sound relation. In some, such as English or French, the relation is extremely complex; in others, such as Spanish, it is more straightforward. In any case, the construction of a pronunciation table is a complex task, and will not be dealt with in detail here (cf. [Adda-Decker & Lamel (this volume)]).
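The shape of a pronunciation table, with one or more pronunciations per orthographic form, can be pictured with a small sketch. It shows only the data structure, not a method for deriving pronunciations. The function name is hypothetical, and the entries are illustrative SAMPA-style transcriptions chosen for this example, not taken from any lexicon discussed here.

```python
def build_pronunciation_table(entries):
    """Group (orthography, pronunciation) pairs by orthographic form.

    Each orthographic form maps to a list of distinct pronunciations,
    kept in the order in which they were first seen.
    """
    table = {}
    for orthography, pronunciation in entries:
        variants = table.setdefault(orthography, [])
        if pronunciation not in variants:
            variants.append(pronunciation)
    return table


# Illustrative entries only (assumed SAMPA-style notation).
entries = [
    ("read", "r i: d"),  # present tense
    ("read", "r E d"),   # past tense
    ("cat", "k { t"),
    ("read", "r i: d"),  # duplicate, ignored
]

table = build_pronunciation_table(entries)
for orthography in sorted(table):
    print(orthography, table[orthography])
```

Storing a list per entry, rather than a single string, is what allows homographs such as "read" to carry more than one pronunciation.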

Figure 11: Construction of an alphabetically sorted wordlist.

Figure 12: Construction of wordlist with decreasing frequency.