
Lexicography with UNIX tools

UNIX offers a range of powerful ASCII text database management tools. Most of these are based on the text line data structure, i.e. a string of printable characters with a newline terminator. The terminator generally needs to be converted when changing platforms (for instance, newline is linefeed = ASCII 10 on UNIX systems, but a sequence of carriage return = ASCII 13 and linefeed = ASCII 10 on PCs). This contrasts with some other representation codes (e.g. HTML) in which spaces, tabs and linefeeds are lumped together as `white space' and need to be tagged explicitly if required separately. On UNIX systems, numbering is very frequently octal (base 8), with the convention `backslash and three digits', i.e. `\012' rather than the decimal numbering given above, for example:

Decimal:   9     10    13    32
Octal:     \011  \012  \015  \040
ASCII:     TAB   LF    CR    SP
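
For instance, the standard od and tr utilities accept this octal notation directly. The following sketch (the file names are purely illustrative) inspects the line terminators of a PC text file and converts it to the UNIX convention by deleting the carriage returns:

    od -c pc_file.txt | head -2                    # terminators show up as \r \n
    tr -d '\015' < pc_file.txt > unix_file.txt     # delete CR (octal 015), keep LF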

Commonly used tools include basic search, comparison and construction programs, and scripting languages for combining other tools. A scripting language is a programming language which permits many different kinds of instruction to be combined flexibly as a `script', in general for processing a text stream in the form of sequences of UNIX text lines as defined above. The tools perl, the awk family, sed and lex are scripting languages which are useful in lexicography. Some of the most frequently used UNIX tools are illustrated in what follows.

The basic format for UNIX tools is the ASCII database, in which records are separated by the line-feed code (LF, newline, ASCII decimal 10, octal \012), and fields are separated by blanks (SP, space, ASCII decimal 32, octal \040) or by the tabulator (TAB, ASCII decimal 9, octal \011), or in standard programming language syntax notation:

database ::= record \012 (record \012)*
record   ::= field (\011 field)*
field    ::= char*,  char ∈ {\040 ... \177}

That is, translated into decimal terms and running text, a database consists of at least one record terminated by ASCII decimal 10 (LF), a record consists of at least one field, with following fields preceded by ASCII decimal 9 (TAB), a field consists of a sequence of any characters except LF and TAB, and a character is any ASCII value between decimal 32 (SP) and decimal 127.
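
For concreteness, a fragment conforming to this format might look as follows, with <TAB> standing for the tabulator character; the three fields (orthography, transcription, part of speech) and the file name are purely illustrative, not part of the definition above:

    ban<TAB>b a n<TAB>N
    banner<TAB>b a n @<TAB>N

A tool such as awk then addresses the fields of each record by position:

    awk -F'\t' '{ print $1 }' lexicon.txt          # print the first field of every record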

The choice of field and record separator is in principle free, and depends on the peculiarities of the task at hand and the capabilities of the UNIX tools being used; the characters shown above are the most conventional. The simplest procedure is often to use spaces, but where fields themselves contain spaces, tabs are preferable if it is inconvenient to re-code those spaces with an otherwise unused character.
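
As a sketch (the file layout is an assumption, not taken from the text), the following awk one-liner turns a file in which only the first blank separates a headword from a gloss into a TAB-separated two-field database, leaving the spaces inside the gloss untouched; alternatively, tr re-codes every space with an otherwise unused character such as the underscore:

    awk '{ sub(/ /, "\t"); print }' space_file.txt > tab_file.txt    # first blank becomes TAB
    tr ' ' '_' < space_file.txt > recoded_file.txt                   # re-code all spaces as _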

UNIX tools are frequently used to re-format database information as human-readable texts by introducing ASCII text markup, such as LaTeX markup or markup conforming to an SGML document type definition (DTD), as in HTML, or to the related XML standard. Since all notation is in ASCII characters, it is also feasible to generate standardised markup automatically in the scripting languages themselves, or in other programming languages such as C, C++, LISP or Prolog.
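
A minimal sketch of such an export function, assuming the hypothetical three-field TAB-separated lexicon shown earlier, wraps each record in HTML table markup with awk:

    awk -F'\t' 'BEGIN { print "<table>" }
                      { printf "<tr><td>%s</td><td>%s</td><td>%s</td></tr>\n", $1, $2, $3 }
                END   { print "</table>" }' lexicon.txt > lexicon.html

The same pattern, with different print statements, yields LaTeX tabular rows or any other ASCII markup.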

The basic steps in typical UNIX lexicon building are shown in Figure 8, which illustrates the kinds of steps required for lexicon construction in general (cf. Daelemans & Durieux, this volume).

Figure 8: Lexicon project development design. 

Input to the procedure is a text corpus, either of printed matter or of transcriptions of spoken language utterances, together with the practical, analytic and constructive experience of the lexicographer. The output is in the first instance the database, in some pre-selected format, but secondary outputs in printed or electronic formats may be created by transformations (export functions) which re-format the database representation as book-structured dictionaries or as graph-structured electronic hyperlexica.

The procedure itself consists of the following steps, assuming that an overall concept for the macrostructure and the database implementation already exists (see especially Quazza & van den Heuvel, this volume):

  1. format normalisation, i.e. adapting character sets, record structures, etc., to the requirements of the lexicographer's workbench;
  2. tokenisation, i.e. identification of the smallest structural units of the input text, such as words and punctuation, and resolution of coded items such as numbers, dates, abbreviations;
  3. wordlist extraction, i.e. identification of the fully inflected word forms occurring in context in the corpus (a rough UNIX pipeline for steps 2 to 4.1 is sketched after this list);
  4. information extraction:
    1. statistical analysis, e.g. word frequency, bigram frequency, collocation frequency, probability estimation as microstructure information;
    2. linguistic analysis, i.e. lemmatisation (headword extraction), phonological, orthographic, morphological, phrase syntactic, semantic and pragmatic microstructure information;
  5. microstructure specification: definition of the attribute-structure, database record structure, etc. for the types of lexical information which are required.
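
The following pipeline, built from standard UNIX tools, is a rough sketch of steps 2 to 4.1 above; the corpus file name is hypothetical, and serious tokenisation of numbers, dates and abbreviations requires more than this. Every non-letter is mapped to a newline so that each word form occupies one line, case is normalised, and the resulting wordlist is counted and sorted by descending frequency:

    tr -cs 'A-Za-z' '\n' < corpus.txt |     # crude tokenisation: one word form per line
        tr 'A-Z' 'a-z' |                    # case normalisation
        sort |
        uniq -c |                           # word form frequency counts
        sort -rn > wordfreq.txt             # most frequent word forms first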


Dafydd Gibbon
Thu Nov 19 10:12:05 MET 1998