
 

Properties of a lexical knowledge representation language

The goal of this section is to exemplify the preceding discussion more concretely using a modern lexical formalism (DATR, cf. also [Cahill, Carson-Berndsen & Gazdar (this volume)]), which has a syntax, a procedural semantics and an informal denotational semantics; the latter takes the form of a toy-sized lexical model following the ILEX conventions. Not all technical details are explained; much is left to the reader to try out in practice and to search for on the Web.

The discussion in this section will necessarily be somewhat more technical than in other sections; it is clearly not possible to introduce a formalism in a wholly informal fashion. The reason for selecting this particular example is that both the modelling techniques and the formalism are typical of much contemporary work in computational lexicon theory and in computational lexicography, and the discussion can be treated as background to the kinds of lexicon discussed in other papers in this volume. In addition to [Cahill, Carson-Berndsen & Gazdar (this volume)], for more extensive applications of lexical formalisms reference should be made to the papers by [Daelemans & Durieux (this volume)] and [Bouma, van Eynde & Flickinger (this volume)]; there are also interesting implications for the theory of mental lexicon processing discussed by [Baayen, Schreuder & Sproat (this volume)].

The toy model follows the ILEX conventions for inheritance lexica and is visualised as an inheritance graph in Figure 4; the model is derived from a more complex model of English endocentric compound nouns.

Figure 4: Toy model of an inheritance lexicon 

The objects (nodes) Time and Table inherit generalisable properties from the node Noun; the complex object Timetable inherits its generalisable properties from the node Compound_Noun and further from Noun, and the properties of its parts from Time and Table, which it immediately dominates.

Two levels of generalisation are represented: lexicalised signs (lexical entries) are modelled as nodes in the graph, represented by dotted circles, and abstract generalisation classes are modelled as nodes represented by solid circles. Idiosyncratic properties are defined at the lexicalised sign nodes, the most regular generalisations are defined at the classes at the top of the graph, and partial generalisations are defined at intermediate levels. Properties (attached to the nodes, but not shown in Figure 4) are partitioned by attribute structures, as in many contemporary theories of phonology, morphology, syntax, semantics and the lexicon.

Two kinds of lexical relation are also represented: paradigmatic ISA relations of similarity are modelled as edges represented by solid lines, and syntagmatic PARTOF relations of composition are modelled as edges represented by dotted lines. While specific lexical composition relations are basically idiosyncratic, generalisable features, such as linear ordering (linear precedence, LP), are inherited from the abstract class nodes.

The graph model (minus the Noun node, which is not needed in the example used here) will be applied in order to interpret a theory in a modern lexical knowledge representation language, DATR. The theory may be said to describe the model which interprets it. Some of the modelling conventions involved in the interpretation of the DATR theory are summarised in Table 1; for fuller references to literature on DATR see [Cahill, Carson-Berndsen & Gazdar (this volume)].

 

     DATR theory                              Model
     ---------------------------------------  ------------------------------------
     local inheritance                        paradigmatic ISA inheritance
     global inheritance                       syntagmatic PARTOF (ID) inheritance
     non-hidden nodes                         lexical entry nodes
     hidden nodes                             abstract class nodes
     equation right-hand side sequence        linear (LP) order
     evaluation of right-hand side sequence   property
     attribute paths                          orthogonal inheritance
Table 1: Modelling conventions for the toy inheritance lexicon. 

As a lexical knowledge representation formalism, DATR has the following properties.

Syntax: A DATR theory consists of a set of sentences (or nodes), each of which starts with a nodename and a colon, contains at least one equation, and terminates with a dot `.'. Each equation consists of a left hand side (a path, either empty or a sequence of atomic symbols, delimited by `<' and `>'), a separator `==', and a right hand side (a sequence of evaluable expressions, i.e. atoms or inheritance descriptors, the latter being nodenames, or paths, or node-path pairs with or without double-quote delimiters).
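The shape of a DATR sentence described above can be illustrated with a small tokeniser and parser. This is only a sketch in Python, not part of DATR or Zdatr: the names `TOKEN` and `parse_sentence` are hypothetical, and the sketch covers just the subset of the syntax described in this section (it keeps right hand side items as raw tokens and does not handle comments, directives or nested descriptors).

```python
import re

# Hypothetical tokeniser for the DATR subset described in the text:
# quoted descriptors, paths, the '==' separator, the final dot,
# and names (a node header is a name followed by a colon).
TOKEN = re.compile(r'"[^"]*"|<[^<>]*>|==|\.|[A-Za-z_]\w*:?')


def parse_sentence(text):
    """Parse one sentence 'Node: <path> == rhs ... .' into (node, equations).

    equations maps each left hand side path (a tuple of atoms) to the list
    of raw right hand side tokens.
    """
    tokens = TOKEN.findall(text)
    if len(tokens) < 2 or not tokens[0].endswith(":") or tokens[-1] != ".":
        raise ValueError("expected 'Node: ... .'")
    node, body = tokens[0][:-1], tokens[1:-1]

    def starts_equation(i):
        # An equation starts with a path token immediately followed by '=='.
        return (body[i].startswith("<")
                and i + 1 < len(body) and body[i + 1] == "==")

    equations = {}
    i = 0
    while i < len(body):
        if not starts_equation(i):
            raise ValueError("expected '<path> ==' at token %d" % i)
        lhs = tuple(body[i][1:-1].split())
        i += 2
        rhs = []
        while i < len(body) and not starts_equation(i):
            rhs.append(body[i])
            i += 1
        equations[lhs] = rhs
    if not equations:
        raise ValueError("a sentence needs at least one equation")
    return node, equations


node, eqs = parse_sentence(
    'Timetable: <> == Compound_Noun <modifier> == "Time:<>" <head> == "Table:<>" .'
)
print(node, eqs)
```

Paths are represented as tuples of atoms, so the empty path `<>` becomes the empty tuple.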

Procedural semantics: A DATR query consists of a pair of a node and an atomic path (possibly empty). The query connects to the theory if both the node exists in the theory, and a left hand side path under this node is a prefix of the query path (the longest match wins in case of a clash, and there can only be one winner).
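The connect operation can be sketched as a longest-prefix match over the left hand side paths of a node. The following Python fragment is a minimal illustration under the assumption that paths are tuples of atoms; the function name `connect` and the dictionary layout are hypothetical and are not taken from any DATR implementation.

```python
def connect(equations, query_path):
    """Return (lhs, extension) for the longest left hand side path that is
    a prefix of query_path, or None if the query does not connect."""
    best = None
    for lhs in equations:
        if query_path[:len(lhs)] == lhs:
            if best is None or len(lhs) > len(best):
                best = lhs
    if best is None:
        return None
    # The non-connecting remainder of the query path is the extension.
    return best, query_path[len(best):]


# Left hand sides of a node with three equations; the right hand sides
# are irrelevant for connecting and are left as None here.
timetable = {(): None, ("modifier",): None, ("head",): None}

print(connect(timetable, ("surf",)))             # connects via the empty path
print(connect(timetable, ("modifier", "surf")))  # the longer match wins
```

Since the left hand sides of a node are distinct, two prefixes of the same query path cannot have equal length, so there is at most one winner.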

The remaining non-connecting suffix (the extension), possibly null, is suffixed to every path on the right hand side, however deeply embedded; procedurally the suffix represents constraints on inference which `percolate' up the inheritance graph. The resulting right hand side expressions, i.e. atoms and (possibly double-quoted) inheritance descriptors (which are node-path pairs, nodes or paths) are then treated in turn as queries.
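As a rough illustration of how the extension is suffixed to the paths on a right hand side, the following Python sketch represents atoms as strings and inheritance descriptors as a hypothetical `Desc` tuple (node, path, quoted flag). It is an assumption-laden simplification: descriptors nested inside paths are not modelled, so "however deeply embedded" is covered only for a flat right hand side.

```python
from typing import List, NamedTuple, Optional, Tuple, Union

Path = Tuple[str, ...]


class Desc(NamedTuple):
    """Hypothetical inheritance descriptor: node and/or path, maybe quoted."""
    node: Optional[str]
    path: Optional[Path]
    quoted: bool = False


def extend_rhs(rhs: List[Union[str, Desc]], extension: Path):
    """Suffix the extension to every path on the right hand side.

    Atoms and node-only descriptors carry no path and are left unchanged.
    """
    out = []
    for item in rhs:
        if isinstance(item, Desc) and item.path is not None:
            item = item._replace(path=item.path + extension)
        out.append(item)
    return out


# A right hand side consisting of two quoted path descriptors,
# extended by the non-connecting suffix <surf>.
rhs = [Desc(None, ("modifier",), True), Desc(None, ("head",), True)]
print(extend_rhs(rhs, ("surf",)))
```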

Right hand sides evaluate to atom sequences. Atoms evaluate to themselves (DATR inference rule 1), and inheritance descriptors evaluate to atom sequences (DATR inference rules 2-7). The presence or absence of quotes determines how inheritance descriptors are evaluated:

  1. Each query defines a query environment, a global environment, and a local environment, each consisting of a node-path pair.
  2. Initially these are the same; the query environment never changes. A double-quoted inheritance descriptor redefines both the global and the local environment, whereas an unquoted inheritance descriptor redefines only the local environment.
  3. If the descriptor only contains a node or a path, then only that part of the environment(s) is redefined.
  4. Only local paths enter into the connect operation.
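The four points above can be sketched as a small update function on environments. This Python fragment is an illustration only: `Env` and `redefine` are hypothetical names, the query environment is assumed to be simply the initial value kept by the caller, and the path passed in is assumed to have been extended already.

```python
from typing import NamedTuple, Optional, Tuple

Path = Tuple[str, ...]


class Env(NamedTuple):
    """Local and global environments, each a node-path pair."""
    local_node: str
    local_path: Path
    global_node: str
    global_path: Path


def redefine(env: Env, node: Optional[str] = None,
             path: Optional[Path] = None, quoted: bool = False) -> Env:
    """Apply an inheritance descriptor to the environments.

    A missing node or path leaves that part of the environment as it was.
    """
    if quoted:
        # Quoted: missing parts come from the global environment, and both
        # the global and the local environment are redefined.
        new_node = node if node is not None else env.global_node
        new_path = path if path is not None else env.global_path
        return Env(new_node, new_path, new_node, new_path)
    # Unquoted: only the local environment is redefined.
    new_node = node if node is not None else env.local_node
    new_path = path if path is not None else env.local_path
    return env._replace(local_node=new_node, local_path=new_path)


query = Env("Timetable", ("surf",), "Timetable", ("surf",))
step1 = redefine(query, node="Compound_Noun")                    # unquoted node
step2 = redefine(step1, path=("modifier", "surf"), quoted=True)  # quoted path
print(step1)
print(step2)
```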

Theory: Definitions of nodes and inheritance relations. Following the model in Figure 4, the node Timetable inherits all its ISA properties from the node Compound_Noun, and its PARTOF properties for the `modifier' attribute from the node Time and for the `head' attribute from the node Table. The node Timetable (simplifying, of course) has no surface or semantic properties of its own, and inherits its general compositional properties (e.g. surface linear order) from Compound_Noun and its specific interpretation properties from its two parts, the nodes Time and Table.

     Timetable:     <>         == Compound_Noun
                    <modifier> == "Time:<>"
                    <head>     == "Table:<>" .
     Time:          <surf>     == taIm 
                    <sem>      == tempus .
     Table:         <surf>     == teIbl
                    <sem>      == matrix .
     Compound_Noun: <>         == "<modifier>" "<head>" .

Theorems: Queries are pairs of nodes and atomic paths, and evaluate to theorems derived by standard DATR inference:

     Table:< surf > = teIbl .
     Table:< sem > = matrix .
     Time:< surf > = taIm .
     Time:< sem > = tempus .
     Timetable:< surf > = taIm teIbl .
     Timetable:< sem > = tempus matrix .
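The derivation of these theorems can be replayed with a very small evaluator. The Python sketch below is not Zdatr and makes no claim to implement full DATR inference: the theory is transcribed by hand into a dictionary, `Desc` and `evaluate` are hypothetical names, and only the constructs used in the toy theory (atoms, unquoted nodes, quoted paths and quoted node-path pairs) are exercised.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple, Union

Path = Tuple[str, ...]


class Desc(NamedTuple):
    """Hypothetical inheritance descriptor: node and/or path, maybe quoted."""
    node: Optional[str]
    path: Optional[Path]
    quoted: bool = False


# Hand transcription of the toy theory: node -> {lhs path -> rhs sequence}.
THEORY: Dict[str, Dict[Path, List[Union[str, Desc]]]] = {
    "Timetable": {
        (): [Desc("Compound_Noun", None)],
        ("modifier",): [Desc("Time", (), True)],
        ("head",): [Desc("Table", (), True)],
    },
    "Time": {("surf",): ["taIm"], ("sem",): ["tempus"]},
    "Table": {("surf",): ["teIbl"], ("sem",): ["matrix"]},
    "Compound_Noun": {
        (): [Desc(None, ("modifier",), True), Desc(None, ("head",), True)],
    },
}


def evaluate(node, path, gnode=None, gpath=None):
    """Evaluate the query node:path to a list of atoms."""
    if gnode is None:  # initially, the global environment equals the query
        gnode, gpath = node, path
    equations = THEORY[node]
    matches = [lhs for lhs in equations if path[:len(lhs)] == lhs]
    if not matches:
        raise KeyError("query does not connect: %s:%r" % (node, path))
    lhs = max(matches, key=len)      # longest match wins
    ext = path[len(lhs):]            # non-connecting suffix (extension)
    atoms = []
    for item in equations[lhs]:
        if isinstance(item, str):    # atoms evaluate to themselves
            atoms.append(item)
            continue
        new_path = None if item.path is None else item.path + ext
        if item.quoted:              # redefines global and local environment
            n = item.node if item.node is not None else gnode
            p = new_path if new_path is not None else gpath
            atoms.extend(evaluate(n, p, n, p))
        else:                        # redefines the local environment only
            n = item.node if item.node is not None else node
            p = new_path if new_path is not None else path
            atoms.extend(evaluate(n, p, gnode, gpath))
    return atoms


for q in [("Table", ("surf",)), ("Timetable", ("surf",)), ("Timetable", ("sem",))]:
    print(q, "=", " ".join(evaluate(*q)))
```

Concatenation of the values of the right hand side items in order is what yields the linear (LP) order of modifier before head in the compound.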

This theory thus describes the two kinds of generalisation in the toy model in terms of local and global inheritance, respectively:

  1. Paradigmatic: Similarities are expressed by inheritance of properties from the same class -- e.g. all Compound Nouns have certain properties in common.
  2. Syntagmatic: A distinction is made between two relations that constituents enter into -- the part-whole relation with the whole unit as head or modifier, and part-part relations such as `preceding', `following'; this relation is typical of `ID-LP Grammars' (immediate dominance, linear precedence grammars), a classification which covers many modern formal grammars.

Implementation: Using the operational semantics of the Zdatr 2.0 software (freeware, check Web sources), the following sequence of inference steps for the query Timetable:<surf> is automatically generated, in which the strategy is to evaluate Time:<surf> and Table:<surf>, and to concatenate the resulting values:

     =0,0,0> LOCAL Timetable:< || surf > == Compound_Noun 
             GLOBAL Timetable:< surf >
     RULE III.(NODE)
     =1,0,0> LOCAL Compound_Noun:< || surf > == "< modifier >" "< head >" 
             GLOBAL Timetable:< surf >
     RULE VII.(GPATH)
     =2,0,0> LOCAL Timetable:< modifier || surf > == "Time: <  > " 
             GLOBAL Timetable:< modifier surf >
     RULE V.(GNODE/GPATH)
     =3,0,0> LOCAL Time:< surf > == taIm
             GLOBAL Time:< surf >
     RULE I.(ATOM)
     taIm
     RULE VII.(GPATH)
     =2,0,1> LOCAL Timetable:< head || surf > == "Table: <  > " 
             GLOBAL Timetable:< head surf >
     RULE V.(GNODE/GPATH)
     =3,0,0> LOCAL Table:< surf > == teIbl
             GLOBAL Table:< surf >
     RULE I.(ATOM)
     teIbl
     [Query 5 (7 Inferences)] Timetable:< surf > = taIm teIbl .

The path suffix extension operator is denoted by `||'; the local and global environments for DATR inference are shown for each step, together with numbers indicating the depth of inference.

A computer implementation is more than an operationalisation of the procedural semantics, of course; it also has specifiable behaviour in time. Some practical values resulting from the present operationalisation using the Zdatr 2.0 software are:

          ---------------------------------------------
          Programs    : zdatrinf2.0, zdatrtok2.0
          Mode        : verbosity = 2,
                        crunch      ON,
                        maxdepth  = 200,
                        maxrec.   = 100
          Input Type  : declfile  'toynouns.dtr.dec'
          Date        : Mon Sep 28 03:23:18 1998
          Queries     : 1
          Inferences  : 7
          Active[sec] : 0.03
          Queries/sec : 28.98
          Inf./sec    : 202.83
          Inf./query  : 7.00
          ---------------------------------------------

Like other inheritance based representation regimes, the DATR lexical knowledge representation language enables lexicon microstructure to be integrated with lexicon macrostructure: generalisable microstructure properties are inherited from a hierarchical macrostructure. For lexica with a large number of entries, and a large quantity of generalisable information (e.g. `in English, with very few exceptions, noun plurals end in ``s'' or some predictable variant thereof'), the result is a considerable reduction in overall lexicon size.

The relatively recent development of generalising languages of this kind permits interesting comparisons with other kinds of lexicographic representation. In a traditional lexicon, there is little generalisation -- in fact, a lexicon is held to be a store of idiosyncratic, i.e. ungeneralisable information. But if a lexicon were indeed this, we would have nothing to say about it except to list the entries. In its extreme form, this thesis is absurdly wrong, since lexical entries evidently do have much in common. And traditional lexica do contain many generalisations, ranging from prefatory material containing general descriptions of typical classes of words, classes of irregular verbs, and the like, to pointers (`see', `cf.', `q.v.') within the lexical microstructure. These mark lexical relations (in themselves generalisations of different types) between lexical entries, thus constituting an implicit macrostructure containing both entries and generalisations.



Dafydd Gibbon
Thu Nov 19 10:12:05 MET 1998