
Special lexica: the case of dynamic concordances

A concordance is a lexicon in which keywords are paired with complete lists of the contexts in which they occur in a given text. A static concordance is compiled in full, in advance, for a given list of keywords. A dynamic concordance (`on the fly' concordance) is calculated on demand, and can thus cope with a potentially very large set of keywords or keyword combinations for which a full precompiled concordance would be intractable. Interesting examples of powerful dynamic concordancing systems are Web search engines like ALTAVISTA. These search engines are currently text-string-based: they search for combinations of keywords or strings indexed in text databases which are constructed from web pages.

To illustrate the point: a very simple concordance to identify occurrences of words in context (KWIC = `KeyWord In Context' concordance) can be constructed in just a few lines of UNIX shell script. The example in Figure 13 takes as input text data and a list of keywords in arbitrary order; its output is an alphabetic list of the keywords in upper case, each followed by a list of indented, asterisked sentences in the order of their occurrence in the text. That is all there is to it, and there are even simpler (but less understandable) UNIX hacks to do this.

Figure 13: Simple concordance construction. 
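
The following is a minimal sketch of a script of this kind, written for illustration and not necessarily identical to the script in the figure. It assumes one keyword per line in the key file and one sentence per line in the text file; the function name `kwic' and both file arguments are hypothetical.

```shell
#!/bin/sh
# kwic KEYFILE TEXTFILE
# Illustrative sketch only. Assumptions: KEYFILE holds one keyword
# per line, TEXTFILE holds one sentence per line.
kwic() {
    keyfile=$1
    textfile=$2
    # Sort the keywords alphabetically and drop duplicates.
    sort -u "$keyfile" | while IFS= read -r key; do
        # Skip empty lines in the key file.
        [ -n "$key" ] || continue
        # Print the keyword in upper case as a heading.
        printf '%s\n' "$key" | tr '[:lower:]' '[:upper:]'
        # List every line containing the key as a fixed string (-F),
        # indented and asterisked, in order of occurrence in the text.
        grep -F -e "$key" "$textfile" | sed 's/^/    * /'
    done
}
```

After loading the function into a shell, it might be called as `kwic keys.txt text.txt'. Because grep matches the key anywhere in a line, a key such as `sign' will also retrieve sentences containing `design'.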

Of course, there are also far more sophisticated things to do and ways to do them. One problem with this simple routine is that any substring which matches the key will be found, not just tokens, i.e. words. In order to restrict matches to whole tokens, the key must be provided with leading and trailing token separators, or a more sophisticated pointer system must be used. Alternatively, fields in the normalised text token database must be analysed and compared individually using one of the database scripting languages such as perl or awk.
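
The field-by-field comparison can be sketched with awk as follows. This is only one possible approach under simple assumptions: tokens are separated by white space, punctuation attached to a token is stripped before comparison, and case is ignored. The function name `kwic_tokens' is hypothetical.

```shell
#!/bin/sh
# kwic_tokens KEY TEXTFILE
# Illustrative sketch only: print the lines of TEXTFILE in which KEY
# occurs as a whole token, not merely as a substring.
kwic_tokens() {
    key=$1
    textfile=$2
    awk -v key="$key" '
    {
        # Compare each white-space separated field with the key.
        for (i = 1; i <= NF; i++) {
            tok = $i
            # Strip leading and trailing non-alphanumeric characters.
            gsub(/^[^A-Za-z0-9]+|[^A-Za-z0-9]+$/, "", tok)
            if (tolower(tok) == tolower(key)) {
                # Print the line once, indented and asterisked.
                print "    * " $0
                break
            }
        }
    }' "$textfile"
}
```

With this version, the key `sign' retrieves a sentence containing `Sign?' but not one containing only `design'.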

A sample interaction with this simple concordance script, using a key file containing the keys `lexicography' and `sign' and part of the present document as input, is shown in Figure 14.

Figure 14: Concordance script output for `lexicography' and `sign'. 

Again, a caveat for the beginning computational lexicographer: this example obviously shows only some basic techniques, and does not remotely compare with very large scale concordancing systems such as Web search engines.



Dafydd Gibbon
Thu Nov 19 10:12:05 MET 1998