
Text normalisation

A lexicon acquisition procedure requires input data in well-defined formats. If the data does not already conform to such a format, it must be normalised.

Figure 9: Simple normalisation of text data.

An elementary example of corpus normalisation is the removal of arbitrary line feeds and the insertion of systematic line feeds after punctuation marks which are followed by blanks (requiring the following blank is a heuristic intended to avoid introducing line feeds inside abbreviations); see Figure 9. In this form the technique is not very sophisticated: a line feed will still be incorrectly introduced after the period in cases like `Dr. Jones'. For the final period of an abbreviation, another strategy or combination of strategies must be used, such as lexical listing of the abbreviations, or detection of upper-case or non-syllable-based letter sequences; a sketch of the listing strategy is given below.
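
The script behind Figure 9 is not reproduced here, but the heuristic it illustrates can be sketched with standard UNIX tools. The pipeline below is a minimal sketch, assuming an input file corpus.txt (a hypothetical name) and GNU sed, which accepts \n in the replacement text:

  # Join all lines into one stream, removing the arbitrary line feeds,
  # then introduce a systematic line feed after each punctuation mark
  # that is followed by a blank.
  tr '\n' ' ' < corpus.txt |
    sed 's/\([.!?;:]\) /\1\n/g' > normalised.txt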

After normalisation, the text is in a database-like form which is suitable for lexical processing. Note, though, that the normalisation procedure already, in effect, incorporates a linguistic theory of what the basic units are, such as sentences, punctuation marks, and words; where this theory is wrong, the procedure creates errors, such as the `Dr. Jones' error above.
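
One way to patch the `Dr. Jones' error is the lexical-listing strategy mentioned above: break after sentence-final punctuation only if the preceding token is not a listed abbreviation. The awk sketch below is an illustration under stated assumptions, not the procedure from Figure 9; the file names corpus.txt and abbrev.txt (one abbreviation per line, e.g. `Dr.', `Prof.', `etc.') are hypothetical.

  # Remove arbitrary line feeds, then break after tokens ending in
  # sentence-final punctuation, unless the token is a listed abbreviation.
  tr '\n' ' ' < corpus.txt |
    awk 'BEGIN { while ((getline a < "abbrev.txt") > 0) abbr[a] = 1 }
    {
      n = split($0, w)
      for (i = 1; i <= n; i++)
        printf "%s%s", w[i],
          ((w[i] ~ /[.!?]$/ && !(w[i] in abbr)) ? "\n" : " ")
    }' > normalised.txt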



Dafydd Gibbon
Thu Nov 19 10:12:05 MET 1998