The database structure and encoding criteria are ASCII-based, and were designed for maximally simple processing with standard UNIX tools based on regular expressions, such as tr, sed, grep and awk/gawk.
The database structure for the VERBMOBIL lexical word list is defined as follows.
- The structure of the lexical word list is defined as an n x m matrix of n rows and m columns, where n defines extensional coverage (of a specified entry type) and m defines intensional coverage (at the coarsest level of granularity). For the standard VERBMOBIL demonstrator lexical word list, n = 1292 (fully inflected forms) and m = 2: orthographic keys and pronunciation (including pronunciation variants). In the extended lexical database, for the submatrix corresponding to a conventional lexical word list m = 3: orthographic keys, morphologically segmented orthography, pronunciation.
- The matrix is realised as a UNIX ASCII database file of n records of m fields.
- The first record has irregular structure, and provides a version identifier.
- The second record contains the names of the attributes (fields, columns).
- The following record and field separator choices (cf. ANSI X3.4-1986) were made for convenience in processing with standard UNIX tools:
  - Record separator: RS = LF, i.e. newline, ASCII LF, 10 (octal 012)
  - Field separator: FS = aa*, where a = SP | HT, i.e. a sequence of white space consisting of:
    - either blank, SP, ASCII 32 (octal 040),
    - or tab, HT, ASCII 9 (octal 011).
  - Disjunction operator in fields: `;' (no white space: separates pronunciation variants)
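The layout described above can be read with awk's default field splitting, which already treats any run of blanks and tabs as a single separator. The following is a minimal sketch only, not part of the VERBMOBIL tooling: the sample records (including the version line, the attribute names and the transcriptions) are invented for illustration, and the function names `sample_lexicon`, `list_variants` and `check_fields` are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample in the layout described above (m = 2).
# Record 1: version identifier; record 2: attribute names; then data records.
# The entries are invented and are not taken from the VERBMOBIL word list.
sample_lexicon() {
    printf 'Version 0.0 (hypothetical)\n'
    printf 'ORTH\tPHON\n'
    printf 'aber     a:b6\n'
    printf 'Montag\tmo:nta:k;mo:ntax\n'
}

# Print one "key<TAB>variant" line per pronunciation variant.
# NR > 2 skips the version record and the attribute-name record;
# split() breaks the second field at the `;' disjunction operator.
list_variants() {
    awk 'NR > 2 {
        n = split($2, variant, ";")
        for (i = 1; i <= n; i++) print $1 "\t" variant[i]
    }'
}

# Report data records whose field count differs from m (first argument).
# The exit status is non-zero if any such record was found.
check_fields() {
    awk -v m="$1" 'NR > 2 && NF != m {
        printf("record %d: %d fields, expected %d\n", NR, NF, m)
        bad = 1
    }
    END { exit (bad + 0) }'
}

sample_lexicon | list_variants
sample_lexicon | check_fields 2 && echo "all data records have 2 fields"
```

Because the field separator is defined as any sequence of blanks and tabs, the five-blank record and the tab-separated record in the sample are split in the same way.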
For convenience in manual checking, in the first lexical word list, the field separator was defined more specifically as a sequence of five blanks; this was superseded by the more general definition given here, however. Different disjunction operators were used in earlier versions, but are now superseded by the semicolon.
For other operating systems, it may be convenient to augment the record separator by carriage return, ASCII CR, 13 (octal 015), i.e. as a system dependent sequence of LFCR or CRLF. Simple operational UNIX tool definitions are:
- (1) From LFCR or CRLF: `dos2unix' for conversion from CRLF, if available, or the following, in which tr maps each CR (octal 015) to LF (octal 012) and grep then drops the resulting empty lines:
tr "\015" "\012" < <infile> | grep . > <outfile>
- (2) To CRLF: `unix2dos' for conversion to CRLF, if available, or:
gawk '{printf("%s\015\012", $0)}' <infile> > <outfile>
- (3) To LFCR:
gawk '{printf("%s\012\015", $0)}' <infile> > <outfile>
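A round trip is a quick way to check conversions of this kind. The sketch below wraps the CRLF case in shell functions and compares byte dumps before and after. It rests on several assumptions: the function names `to_crlf`, `from_crlf` and `sample` are hypothetical, the sample records are invented, plain `awk` stands in for `gawk`, `od` is assumed to be available, and CR is removed with `tr -d` rather than with the translate-then-grep form given above.

```shell
#!/bin/sh
# LF -> CRLF: append CR LF to each record.
# Passing $0 through "%s" keeps any "%" in the data literal.
to_crlf() {
    awk '{ printf("%s\015\012", $0) }'
}

# CRLF -> LF: delete every CR (assumes CR never occurs inside a field).
from_crlf() {
    tr -d '\015'
}

# Hypothetical two-record sample; the second pronunciation contains "%".
sample() {
    printf 'aber\ta:b6\nMontag\tmo:n%%ta:k\n'
}

# Show the first lines of the byte dump of the converted data.
sample | to_crlf | od -c | head -n 3

# Converting to CRLF and back should reproduce the original bytes.
if [ "$(sample | to_crlf | from_crlf | od -c)" = "$(sample | od -c)" ]; then
    echo "round trip preserves the data"
fi
```

The `%` in the sample is deliberate: a transcription containing that character is altered if the record itself is used as the printf format string, which is why the record is passed as an argument to `%s` here.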
Dafydd Gibbon
Fri Sep 1 19:40:09 MET DST 1995