
Extensional and intensional coverage criteria

An explicit fundamental distinction was made between extensional and intensional coverage:

Extensional coverage. The number of items (in the present case, fully inflected forms) in the lexical database (or lexical word list).
Intensional coverage. The number of attributes of items (in the case of the lexical word list, only orthography and pronunciation attributes).
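
The two measures can be illustrated with a minimal sketch. The entries, the field names and the pronunciation strings below are hypothetical placeholders and are not taken from the VERBMOBIL data; the sketch only assumes that each item is stored as a record of named attributes.

```python
# Hypothetical toy word list: one record per fully inflected form.
# "PRON-n" strings are placeholders, not real transcriptions.
toy_word_list = [
    {"orthography": "Termin", "pronunciation": "PRON-1"},
    {"orthography": "Termine", "pronunciation": "PRON-2"},
    {"orthography": "Montag", "pronunciation": "PRON-3"},
]


def extensional_coverage(entries):
    """Number of items (here: fully inflected forms) in the list."""
    return len(entries)


def intensional_coverage(entries):
    """Number of distinct attributes used across all items."""
    attributes = set()
    for entry in entries:
        attributes.update(entry.keys())
    return len(attributes)


print(extensional_coverage(toy_word_list))
print(intensional_coverage(toy_word_list))
```

Adding a further word form raises the extensional coverage only; adding a further field to the records raises the intensional coverage only.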

The result of discussions and negotiations was a compact Word Information Format for the lexical word list.

Note on extensional coverage

The main requirements posed by different groups were initially somewhat contradictory, and may be summarised as follows:

  1. From the point of view of the speech recognition groups, the wordlist had to be corpus-based.
  2. From the point of view of the linguistic groups, the wordlist had to contain sensible generalisations (for instance, full inflectional paradigms, or domain-specific closed word classes such as days of the week, months, or numbers for dates, calendar weeks, and years).
  3. For the speech groups and some of the language groups, it made sense to define the entry as a fully inflected word form, and this definition was adopted.
  4. For the morphology groups, and for medium term lexicon work, it made sense, however, to regard the uninflected stem as the basic lexical entry.
  5. In particular, the wordlist was technically defined as a submatrix of the full VERBMOBIL lexicon matrix.
  6. Finally, the ability of the speech recognition components to discriminate words in continuous spontaneous speech was estimated to set a limit of about 1300 words, which imposed an upper limit on the size of the wordlist.

These partially conflicting requirements made it necessary to define the lexicographic criteria explicitly, in order to avoid, as far as possible, misunderstandings between the speech-oriented and language-oriented groups arising from their different criteria.
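
The definition of the wordlist as a submatrix (requirement 5) can be pictured as a selection of rows (entries) and columns (attributes) from a larger table. The following is only a sketch under assumptions: the function name submatrix, the "category" attribute and the sample rows are hypothetical, and the actual layout of the VERBMOBIL lexicon matrix is not specified in this section.

```python
def submatrix(lexicon, forms, attributes):
    """Keep only the rows whose orthography is in `forms`,
    and only the columns named in `attributes`."""
    wanted = set(forms)
    return [
        {name: row[name] for name in attributes}
        for row in lexicon
        if row["orthography"] in wanted
    ]


# Hypothetical full lexicon with one attribute more than the word list needs.
# "PRON-n" strings are placeholders, not real transcriptions.
full_lexicon = [
    {"orthography": "Termin", "pronunciation": "PRON-1", "category": "noun"},
    {"orthography": "Termine", "pronunciation": "PRON-2", "category": "noun"},
    {"orthography": "vereinbaren", "pronunciation": "PRON-3", "category": "verb"},
]

selected = submatrix(
    full_lexicon,
    forms=["Termin", "Termine"],
    attributes=["orthography", "pronunciation"],
)
print(selected)
```

The row selection corresponds to the extensional coverage and the column selection to the intensional coverage.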

The extensional coverage criteria for the VERBMOBIL demonstrator are defined as follows:

  1. Words in an artificial reference dialogue
  2. Words from 10 selected dialogues
  3. Completion of function word sets
  4. Completion of closed semantic sets (calendar terms etc.)
  5. Initial limitation to 1300 entries for test purposes
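
The five criteria above can be read as a procedure for assembling the list. The sketch below is one possible reading, not the procedure actually used in the project: it assumes that the sources are merged in the order in which the criteria are listed, that duplicates are dropped, and that the 1300-entry limit is enforced by cutting off the merged list. The function name, the parameter names and the sample words are hypothetical.

```python
def build_word_list(reference_dialogue, selected_dialogues,
                    function_words, closed_sets, limit=1300):
    """Merge word forms from all sources, keeping the first occurrence
    of each form, then truncate to `limit` entries.

    Assumption: earlier sources take priority when the limit is reached.
    """
    sources = [reference_dialogue, *selected_dialogues,
               function_words, *closed_sets]
    ordered = []
    seen = set()
    for source in sources:
        for form in source:
            if form not in seen:
                seen.add(form)
                ordered.append(form)
    return ordered[:limit]


# Hypothetical miniature inputs.
words = build_word_list(
    reference_dialogue=["guten", "Tag"],
    selected_dialogues=[["Tag", "Montag"]],
    function_words=["und"],
    closed_sets=[["Montag", "Dienstag"]],
)
print(words)
```

Note that completing a closed set (criterion 4) adds forms such as "Dienstag" that need not occur in any dialogue, which is where the corpus-based and the generalisation-based requirements meet.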

Note on intensional coverage

Complementary to the extensional coverage criteria are the intensional coverage criteria, that is, the number and type of attributes assigned to each entry (or, in straightforward database terms, the number of fields in each entry record).

Initially, the intensional coverage was defined in terms of two attributes, orthography (spelling) and phonology (pronunciation). Each of these attributes is complex, permitting choices of detail. The intensional coverage is summarised separately for orthography and phonology.

In particular, the complexity of the phonology attribute was determined by the variety of requirements set by the different speech recognition subcomponents for word recognition, syllable handling, prosody, morphological word-structuring, and lexical (i.e. non-rule-based) pronunciation variants (heterophonous homographs).
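
One consequence of admitting lexical pronunciation variants is that the pronunciation field cannot be a single string. The following minimal sketch shows one way to model this; the class name Entry, its fields and the stress-marked strings are hypothetical and informal, and do not reproduce the transcription conventions of the actual word list.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One word-list record: a spelling plus one or more pronunciations."""
    orthography: str
    pronunciations: list = field(default_factory=list)

    def is_heterophonous(self):
        """True if the same spelling has more than one distinct pronunciation."""
        return len(set(self.pronunciations)) > 1


# German "modern" is spelled identically for two differently stressed words.
# Capital letters mark the stressed syllable informally (placeholder notation).
modern = Entry("modern", ["MO-dern", "mo-DERN"])
montag = Entry("Montag", ["MON-tag"])

print(modern.is_heterophonous())
print(montag.is_heterophonous())
```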

The spelling and pronunciation attributes are described separately below.






Dafydd Gibbon
Fri Sep 1 19:40:09 MET DST 1995