UNIX offers a range of powerful ASCII text database management tools.
Most of these are based on the text line data structure,
i.e. a string of printable characters with a newline terminator.
The terminator generally needs to be converted when changing platforms
(for instance, newline is linefeed = ASCII 10 on UNIX systems,
but a sequence of carriage return = ASCII 13
and linefeed = ASCII 10 on PCs).
This contrasts with some other representation codes (e.g. HTML)
in which spaces, tabs and linefeeds are lumped together as `white space'
and need to be tagged explicitly if required separately.
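
The terminator conversion described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names `to_unix_newlines` and `to_pc_newlines` are made up for this example, and the data is assumed to fit in memory as a byte string.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Replace PC-style CR LF pairs (ASCII 13, 10) with a single LF (ASCII 10)."""
    return data.replace(b"\r\n", b"\n")


def to_pc_newlines(data: bytes) -> bytes:
    """Replace LF with CR LF; normalise first so existing CR LF pairs are not doubled."""
    return to_unix_newlines(data).replace(b"\n", b"\r\n")


if __name__ == "__main__":
    pc_text = b"cat\tnoun\r\ndog\tnoun\r\n"   # hypothetical two-line file from a PC
    unix_text = to_unix_newlines(pc_text)
    print(list(unix_text))                    # byte values as decimal ASCII codes
```

Working on bytes rather than text avoids any implicit newline translation by the runtime, so the decimal codes that are printed are exactly those stored in the file.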
On UNIX systems, character codes are very frequently given in octal (base 8), with
the convention `backslash and three digits', i.e. `\ddd',
rather than in the decimal numbering given above, for example:
| Decimal: | 9 | 10 | 13 | 32 |
| Octal: | \011 | \012 | \015 | \040 |
| ASCII: | TAB | LF | CR | SP |
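
The correspondence between the decimal and octal notations can be checked mechanically. The following Python sketch is an illustration only; the helper name `octal_escape` is an assumption of this example. Python string literals happen to accept the same backslash-and-three-digits convention.

```python
def octal_escape(code: int) -> str:
    """Return the backslash-and-three-octal-digits notation for a character code."""
    return "\\" + format(code, "03o")


# The four control and spacing characters discussed in the text,
# written here with octal escapes inside Python string literals.
CHARS = [("TAB", "\011"), ("LF", "\012"), ("CR", "\015"), ("SP", "\040")]

if __name__ == "__main__":
    for name, char in CHARS:
        code = ord(char)  # decimal ASCII value
        print(name, "decimal", code, "octal", octal_escape(code))
```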
Commonly used tools include basic search, comparison and construction programmes, and scripting languages for combining other tools. A scripting language is a programming language which permits many different kinds of instruction to be combined flexibly as a `script', in general for processing a text stream in the form of sequences of UNIX text lines as defined above. The tools perl, the awk family, sed and lex are scripting languages which are useful in lexicography. Some of the most frequently used UNIX tools are the following:
The basic format for UNIX tools is the ASCII database,
in which records are separated by the line-feed code
(LF, newline, ASCII decimal 10, octal 012), and fields are separated
by blanks (SP, space, ASCII decimal 32, octal 040) or tabulators
(TAB, ASCII decimal 9, octal 011),
or in standard programming language syntax notation:
| database | ::= | record+ |
| record | ::= | field ( \011 field )* \012 |
| field | ::= | char* |
| char | ::= | \040 ... \177 |
That is, translated into decimal terms and running text, a database consists of at least one record terminated by ASCII decimal 10 (LF), a record consists of at least one field, with following fields preceded by ASCII decimal 9 (TAB), a field consists of a sequence of any characters except LF and TAB, and a character is any ASCII value between decimal 32 (SP) and decimal 127.
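
The format just described can be read with a very small parser. The Python sketch below is one possible reading of the prose definition, not a reference implementation: the names `parse_database`, `RECORD_SEP` and `FIELD_SEP` are assumptions of this example, and empty fields are accepted because the text does not rule them out.

```python
RECORD_SEP = "\012"  # LF, ASCII decimal 10, terminates each record
FIELD_SEP = "\011"   # TAB, ASCII decimal 9, precedes each following field


def parse_database(text: str) -> list[list[str]]:
    """Split a text database into records (lines) and fields (TAB-separated)."""
    if not text.endswith(RECORD_SEP):
        raise ValueError("every record must be terminated by LF")
    records = []
    # Drop the final terminator so that split() yields no spurious empty record.
    for line in text[:-1].split(RECORD_SEP):
        fields = line.split(FIELD_SEP)
        for field in fields:
            for ch in field:
                # Characters are restricted to ASCII decimal 32 (SP) to 127.
                if not 32 <= ord(ch) <= 127:
                    raise ValueError(f"illegal character code {ord(ch)}")
        records.append(fields)
    return records


if __name__ == "__main__":
    sample = "cat\tnoun\ndog\tnoun\n"  # hypothetical two-record lexicon
    for record in parse_database(sample):
        print(record)
```

An empty input is rejected because it contains no LF-terminated record, which matches the requirement that a database consists of at least one record.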
The choice of field and record separator is in principle free, and depends on peculiarities of the task at hand and the capabilities of the UNIX tools being used; the characters shown above are the most conventional. The simplest procedure is often to use spaces, but where fields contain spaces, tabs are preferable if it is inconvenient to re-code spaces with an otherwise unused character.
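
Re-coding spaces with an otherwise unused character, as mentioned above, might look as follows in Python. This is a sketch under stated assumptions: the underscore as placeholder and the function names `write_record` and `read_record` are choices made for this example only.

```python
PLACEHOLDER = "_"  # assumed not to occur anywhere in the data


def write_record(fields: list[str]) -> str:
    """Join fields with single spaces after re-coding field-internal spaces."""
    for field in fields:
        if PLACEHOLDER in field:
            raise ValueError("placeholder already occurs in the data")
    return " ".join(field.replace(" ", PLACEHOLDER) for field in fields)


def read_record(line: str) -> list[str]:
    """Split on spaces and restore the original field-internal spaces."""
    return [field.replace(PLACEHOLDER, " ") for field in line.split(" ")]


if __name__ == "__main__":
    line = write_record(["ice cream", "noun"])
    print(line)
    print(read_record(line))
```

The explicit check for the placeholder is what makes the re-coding reversible; with TAB as the field separator, neither the check nor the re-coding would be needed.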
UNIX tools are frequently used to re-format database information as human-readable texts by introducing ASCII text markup, such as LaTeX, or markup as specified by SGML document type definitions (DTDs), such as HTML, or by XML. Since all notation is in ASCII characters, it is also feasible to generate standardised markup automatically in the scripting languages themselves, or in other programming languages such as C, C++, LISP or Prolog.
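
As an illustration of generating markup from a database, the following Python sketch turns TAB-separated records into an HTML definition list. The three-field layout (headword, part of speech, gloss) and the function names are hypothetical and chosen only for this example; `html.escape` is the standard-library function for protecting markup characters.

```python
from html import escape


def record_to_html(fields: list[str]) -> str:
    """Render one record; the field layout used here is a hypothetical example."""
    headword, pos, gloss = fields
    return (f"<dt><b>{escape(headword)}</b> <i>{escape(pos)}</i></dt>"
            f"<dd>{escape(gloss)}</dd>")


def database_to_html(text: str) -> str:
    """Convert LF-terminated, TAB-separated records into an HTML definition list."""
    entries = [record_to_html(line.split("\t"))
               for line in text.split("\n") if line]
    return "<dl>\n" + "\n".join(entries) + "\n</dl>"


if __name__ == "__main__":
    sample = "cat\tnoun\tsmall feline\ndog\tnoun\tdomestic canine\n"
    print(database_to_html(sample))
```

The same pattern applies to LaTeX output; only the template strings and the set of characters that need escaping change.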
The basic steps in typical UNIX lexicon building are shown in Figure 8, which illustrates the kinds of steps required for lexicon construction in general (cf. [Daelemans & Durieux (this volume)]).

Figure 8: Lexicon project development design.
Input to the procedure is a text corpus, either of printed matter or of transcriptions of spoken language utterances, and, of course, the practical, analytic, and constructive experience of the lexicographer. The output is in the first instance the database, in some pre-selected format, but secondary outputs in printed or electronic formats may be created by transformations (export functions) which re-format the database representation as book-structured dictionaries or as graph-structured electronic hyperlexica.
The procedure itself consists of the following steps, assuming that an overall concept for the macrostructure and the database implementation already exists (see especially [Quazza & van den Heuvel (this volume)]):