Data-driven lexicon development

Modern lexicon development is heavily data-dependent (see [Boguraev & Pustejovsky 1995]). Large quantities of text or transcriptions are processed automatically, mainly with probabilistic finite state devices such as Hidden Markov Models, in order to estimate transition probabilities between linguistic units. These `stochastic language models' use bigram, trigram, or general n-gram sequences to describe transition probabilities in sequential contexts of length two, three, or n, and thereby induce approximations to grammatical categories. From the lexicographer's point of view, such models can be seen as probabilistic collocations of adjacent items, or as elementary probabilistic lexicalist grammars. Much current work is directed towards developing more sophisticated notions of probabilistic context.
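The bigram case can be illustrated with a minimal sketch in Python. It is not taken from any system discussed in this volume: the function name `bigram_model` and the toy corpus are assumptions made purely for illustration, and the probabilities are plain relative frequencies without smoothing.

```python
from collections import Counter, defaultdict


def bigram_model(tokens):
    """Estimate P(next | current) by relative frequency over adjacent pairs.

    Illustrative helper only; no smoothing is applied.
    """
    # Count each adjacent pair (current item, following item).
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count how often each item occurs as the left-hand context.
    context_counts = Counter(tokens[:-1])
    model = defaultdict(dict)
    for (current, following), count in pair_counts.items():
        model[current][following] = count / context_counts[current]
    return dict(model)


# Toy corpus, invented for this example.
corpus = "the cat sat on the mat and the cat slept".split()
model = bigram_model(corpus)

for following, probability in sorted(model["the"].items()):
    print(f"P({following} | the) = {probability:.2f}")
```

Each entry of `model` is in effect a probabilistic collocation: for a given item it lists the items observed immediately after it, weighted by relative frequency. Any pair that does not occur in the corpus simply receives no entry, however plausible it may be in the language.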

But data will always be incomplete, because the creativity of language is set against the finiteness of any set of observations. For this reason, numerical, data-driven models and algebraic, knowledge-based models are being increasingly integrated ([Daelemans & Durieux (this volume)]), and the old polarity between `statistical approaches' and `rule-based approaches' is now seen as simplistic and unhelpful. New techniques such as genetic search and genetic learning, together with applications of neural networks (if they can be scaled up to handle very large and `noisy' quantities of language data), are likely to contribute towards integrated solutions to this problem; a web search will readily reveal useful introductory material on these techniques.

In conclusion, it is in this final area that the most significant breakthroughs in lexicography may be expected in the near future; all the areas of lexicography, lexicology and lexicon theory represented in this volume are likely to benefit from these developments.



Dafydd Gibbon
Thu Nov 19 10:12:05 MET 1998