
Lexicographic database representation

Computerised lexicographic databases reflect many facets of traditional lexicography rather closely. The most common kinds of Database Management System (DBMS) are based on a relational model [Draxler (this volume)], and in their simplest form can be visualised as a matrix or table, in which the rows constitute the lexical entries, and the columns define the lexical microstructure for each entry; an example will be given later. A full relational database consists of a set of interlinked tables (relations) of this kind, often modelled by a so-called `entity-relationship diagram'. Neither paradigmatic nor syntagmatic generalisations are captured well by this kind of structure; any generalisations must be searched for on demand using classification algorithms. Object-oriented DBMS, with inheritance mechanisms related to those of DATR, and hybrid object-oriented/relational databases, are likely to supersede relational databases in time.
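The table view described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names, column names and sample rows (including the rough SAMPA-style pronunciations) are hypothetical and are not taken from any existing lexicographic database. A main relation holds one row per lexical entry, and a sub-relation holds the senses linked to it.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entry (
    entry_id      INTEGER PRIMARY KEY,
    orthography   TEXT NOT NULL,
    pronunciation TEXT,
    pos           TEXT
);
CREATE TABLE sense (
    sense_id INTEGER PRIMARY KEY,
    entry_id INTEGER NOT NULL REFERENCES entry(entry_id),
    gloss    TEXT NOT NULL
);
""")

# Rows are lexical entries; columns are the microstructure (sample data).
conn.executemany("INSERT INTO entry VALUES (?, ?, ?, ?)", [
    (1, "table", "teIb@l", "NOUN"),
    (2, "tables", "teIb@lz", "NOUN"),
    (3, "walk", "wO:k", "VERB"),
])
conn.executemany("INSERT INTO sense VALUES (?, ?, ?)", [
    (1, 1, "piece of furniture"),
    (2, 1, "arrangement of data in rows and columns"),
    (3, 3, "move on foot"),
])


def senses_of(orthography):
    """Follow the link from the main relation to the sense sub-relation."""
    rows = conn.execute(
        "SELECT s.gloss FROM entry e JOIN sense s ON s.entry_id = e.entry_id "
        "WHERE e.orthography = ? ORDER BY s.sense_id",
        (orthography,),
    )
    return [gloss for (gloss,) in rows]


def entries_with_pos(pos):
    """A generalisation has to be searched for on demand, here by a query."""
    rows = conn.execute(
        "SELECT orthography FROM entry WHERE pos = ? ORDER BY entry_id", (pos,)
    )
    return [orth for (orth,) in rows]
```

Note that nothing in the schema states that `table` and `tables` belong to the same paradigm; in this sketch the inflected form is simply a separate row with no senses of its own.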

In practice, the best-known and fairly widely accepted modern variety of lexicographic database representation is lexical text markup, often using SGML (Standard Generalized Markup Language), in which labelled bracketings indicate the microstructure and certain aspects of the macrostructure. For further information about these aspects of current lexicography see [Leech, Myers & Thomas 1995]. It must be said, however, that SGML suffers from the same deficiencies as traditional lexica and basic lexicographic databases:

  1. SGML contains no devices for expressing paradigmatic generalisations. This disadvantage will make itself felt more and more as linguistic resources are amassed: without a generalisation (ISA) relation, shared information must be repeated in every entry, so the lexicon is less compact, and the markup conventions inflate it further.
  2. SGML does not take into account the distinction between PARTOF compositionality and surface ordering (e.g. ID/LP, i.e. immediate dominance/linear precedence, and association relations); this is currently dealt with by hybrid combinations of different languages for various aspects of document `presentation'.
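The compactness point in item 1 can be illustrated with a small Python sketch. It is not DATR and not SGML; it only contrasts a flat lexicon, in which every entry repeats class-level information, with an ISA-style lexicon in which entries inherit defaults from a class and state only their exceptions. All names and values (`NOUN_REGULAR`, `plural_suffix`, the sample words) are assumptions made for the illustration.

```python
from collections import ChainMap

# Hypothetical class of regular nouns; the attribute names are illustrative.
NOUN_REGULAR = {"pos": "NOUN", "plural_suffix": "s"}

# Flat style: every entry spells out everything, as in plain markup.
flat = {
    "cat": {"stem": "cat", "pos": "NOUN", "plural_suffix": "s"},
    "dog": {"stem": "dog", "pos": "NOUN", "plural_suffix": "s"},
    "ox": {"stem": "ox", "pos": "NOUN", "plural_suffix": "en"},
}

# ISA style: an entry lists only idiosyncratic values; ChainMap looks up
# the entry first and falls back to the class defaults.
isa = {
    "cat": ChainMap({"stem": "cat"}, NOUN_REGULAR),
    "dog": ChainMap({"stem": "dog"}, NOUN_REGULAR),
    "ox": ChainMap({"stem": "ox", "plural_suffix": "en"}, NOUN_REGULAR),
}


def plural(entry):
    """Build the plural form from the stem and the (possibly inherited) suffix."""
    return entry["stem"] + entry["plural_suffix"]


def stated_values(lexicon):
    """Count the attribute-value pairs written explicitly in the entries."""
    return sum(
        len(e.maps[0]) if isinstance(e, ChainMap) else len(e)
        for e in lexicon.values()
    )
```

Both lexica yield the same forms, but the ISA version states fewer values per entry, and the gap grows with the number of entries that share a class.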

For many reasons, not least its relative simplicity, the use of SGML as a representation language for traditionally structured lexicons is increasing. An important factor which favours the spread of SGML (with derivatives such as XML) is that it is the specification language for the document types used on the World Wide Web, particularly HTML.
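As a rough sketch of labelled bracketing, the following Python fragment marks up two entries in XML (used here because the standard library can parse it) and reads the microstructure of each entry back out. The element names `lexicon`, `entry`, `orth`, `pos` and `gloss` are hypothetical and do not come from any published document type definition.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element names are illustrative, not from a real DTD.
LEXICON = """
<lexicon>
  <entry id="e1">
    <orth>walk</orth>
    <pos>VERB</pos>
    <gloss>move on foot</gloss>
  </entry>
  <entry id="e2">
    <orth>walks</orth>
    <pos>VERB</pos>
    <gloss>move on foot</gloss>
  </entry>
</lexicon>
""".strip()

root = ET.fromstring(LEXICON)


def microstructure(entry):
    """Return the labelled fields of one entry as a field -> value mapping."""
    return {child.tag: child.text for child in entry}


# The sequence of entry elements plays the role of the macrostructure.
records = [microstructure(e) for e in root.findall("entry")]
```

The two entries repeat the part of speech and the gloss verbatim, which is the kind of redundancy that the markup itself has no means of factoring out.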

There are four major (and many minor) prerequisites to the design of any lexicographic database:

  1. linguistic specification (of macrostructure and microstructure);
  2. database management system (DBMS) specification;
  3. specification of the phases of lexicographic database construction;
  4. presentation of and access to lexical information.

The prerequisites are given in order of importance from the lexicographic point of view. In practice, of course, lower-order practical constraints such as price, availability, or the databases and computing platforms already in use may force higher-order choices. For example, selection of a DBMS may be based on the availability of a proprietary system like Access, Paradox or Oracle, or of the Shoebox basic lexicographic database system distributed by the Summer Institute of Linguistics (SIL). DBMS specification is the implementation-level analogue of macrostructure specification: the choice is between a flat DBMS (though perhaps with hierarchical records, like Shoebox), a relational DBMS with a main relation and sub-relations, an object-oriented DBMS, a hybrid relational-object-oriented DBMS, or a hypertext document.

However, DBMS aside, the first main decision is the definition of the appropriate macrostructure and its mapping onto the record structure of the DBMS, with specifications such as the following: semasiological (orthographic list vs. pronunciation lexicon ...) vs. onomasiological (synonym list vs. hierarchical thesaurus ...) vs. multilingual lexicon ... The macrostructure specification thus determines the basic unit represented by the database record.
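How the macrostructure fixes the basic unit of the record can be sketched in Python as follows. The same set of form-concept pairs is organised once with the word form as the record key (semasiological) and once with the concept as the record key (onomasiological). The sample pairs and concept labels are invented for the illustration.

```python
from collections import defaultdict

# Illustrative (form, concept) pairs; the concept labels are hypothetical.
PAIRS = [
    ("big", "LARGE_SIZE"),
    ("large", "LARGE_SIZE"),
    ("bank", "FINANCIAL_INSTITUTION"),
    ("bank", "RIVER_EDGE"),
]


def build(pairs, key_index):
    """Group the pairs by one member; that member becomes the record key."""
    index = defaultdict(list)
    for pair in pairs:
        index[pair[key_index]].append(pair[1 - key_index])
    return dict(index)


semasiological = build(PAIRS, 0)   # one record per form: form -> concepts
onomasiological = build(PAIRS, 1)  # one record per concept: concept -> forms
```

In the first organisation `bank` is a single record with two readings; in the second, `big` and `large` share one record as synonyms.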

The linguistic specification phase is of primary importance in the present context. At the very least, the linguistic content of the database must be known, but ideally a comprehensive specification of the lexical organisation and types of information is desirable.

The microstructure definition completes the linguistic specification, and is the most difficult part of the procedure, involving detailed linguistic analysis. Typical questions to be resolved include morphological paradigm definition (e.g. standard inflectional categories), lemmatisation (i.e. extraction of a canonical reference form from morphological variants), syntactic analysis (definition of a part of speech set, with carefully chosen granularity of subcategories such as VERB, VERB_TRANSITIVE, VERB_DITRANSITIVE...), semantic analysis (semantic components, relations, fields, frames etc.), and pragmatic analysis (functional, dialectal, sociolinguistic usage). A microstructure corresponds to what is traditionally known as `types of lexical information', and may vary from simple glossaries or spelling-pronunciation tables to vectors of theoretically well-founded categories as in the following selection:

Classical theoretical lexicology as represented by the work of Fillmore (modified from [Fillmore 1971], p. 370):

  1. syntactic environments,
  2. collocational idiosyncrasies,
  3. semantic valency, i.e. number of conceptual arguments,
  4. roles played by each argument (e.g. Agent, Instrument, ...),
  5. presuppositions (concerning beliefs, facts, ...) which are required for apt use,
  6. semantic and morphological relations to other items in the lexicon,
  7. meaning,
  8. phonological or orthographic shapes.
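The eight information types above can be read as the fields of a record type. The following Python dataclass is a minimal sketch of that reading; the field names and the sample entry for `break` are assumptions made for illustration and are not Fillmore's own notation or analysis.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """One record; each field corresponds to one information type above."""
    shape: str                                                   # item 8
    meaning: str                                                 # item 7
    syntactic_environments: list = field(default_factory=list)  # item 1
    collocations: list = field(default_factory=list)            # item 2
    valency: int = 0                                             # item 3
    roles: list = field(default_factory=list)                   # item 4
    presuppositions: list = field(default_factory=list)         # item 5
    related_items: list = field(default_factory=list)           # item 6

    def is_consistent(self):
        """Check that one role is listed for each conceptual argument."""
        return self.valency == len(self.roles)


# Hypothetical sample entry, for illustration only.
break_entry = LexicalEntry(
    shape="break",
    meaning="cause to separate into pieces",
    syntactic_environments=["NP _ NP", "NP _ NP with NP"],
    valency=3,
    roles=["Agent", "Object", "Instrument"],
    presuppositions=["the object is intact beforehand"],
    related_items=["breakable", "breakage"],
)
```

A consistency check such as `is_consistent` is one example of the kind of constraint that a fixed microstructure makes it possible to state over all records.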

Contemporary formal sign-based lexical microstructure as in HPSG ([Pollard & Sag 1987], p. 108); the boxed indices denote shared substructures (Figure 5).

[Figure 5: Attribute-value structure for HPSG 1987.]

In a lexicographic database, only the vector of most deeply embedded values would be used; the hierarchical structure would not be directly represented but `squashed' into a flat value vector. Complex objects could then be represented as sub-relations for the purpose of describing cross-referencing (re-entrancy, structure sharing).
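The `squashing' step can be sketched in Python as follows. A nested attribute-value structure is flattened into a vector of path-value pairs, and shared values are collected into a small sub-relation. The sample structure is a simplified, hypothetical one and does not reproduce Figure 5; in particular, the convention that shared substructures are marked by tags beginning with `#` is an assumption of this sketch.

```python
# Simplified, hypothetical attribute-value structure (not Figure 5 itself).
# The tag "#1" marks a value shared between two paths (re-entrancy).
AVM = {
    "PHON": "walks",
    "SYN": {
        "LOC": {
            "HEAD": {"MAJ": "V", "VFORM": "FIN"},
            "SUBCAT": "#1",
        },
    },
    "SEM": {"CONT": {"RELN": "WALK", "WALKER": "#1"}},
}


def flatten(avm, prefix=()):
    """Map a nested structure to a flat {attribute path: atomic value} vector."""
    flat = {}
    for attr, value in avm.items():
        path = prefix + (attr,)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[".".join(path)] = value
    return flat


def sharing_relation(flat):
    """Collect, for each tag, the paths that share it (a sub-relation)."""
    shared = {}
    for path, value in flat.items():
        if isinstance(value, str) and value.startswith("#"):
            shared.setdefault(value, []).append(path)
    return shared


flat_record = flatten(AVM)
shared_paths = sharing_relation(flat_record)
```

The flat record keeps only the most deeply embedded values, one column per path; the hierarchy survives only in the column names, and the structure sharing only in the separate sub-relation.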

The later version of HPSG ([Pollard & Sag 1994], p. 82) simplifies the outer levels of this structure (Figure 6).

[Figure 6: Attribute-value structure for HPSG 1994.]

Lexical semantic microstructure, as in Pustejovsky's Generative Lexicon Theory (for feature structure details, see [Pustejovsky 1995]).

The following is an example of a Generative Lexicon microstructure (p. 82), which uses essentially the same formalism as HPSG (Figure 7):

[Figure 7: Attribute-value matrix for Generative Lexicon Theory.]



Dafydd Gibbon
Thu Nov 19 10:12:05 MET 1998