The orthographic conventions used in the lexical word list follow the recommendations of the VERBMOBIL standard orthographic transcription handbook developed by the VERBMOBIL Corpus Project (Teilprojekt 14, see [3]). That handbook introduces the term transliteration for orthographic transcription; however, since this usage is unorthodox and the term has other meanings (in particular, it standardly refers to the re-coding of one orthography in another, such as Cyrillic to Roman or Greek to Roman), it is avoided here.
The spelling conventions may be summarised as follows.
- A distinction was made between orthographic conventions and orthographic encoding.
- German standard orthographic conventions defined by the Duden Publishing Company were adopted.
- Some borderline cases (such as contracted preposition-article sequences) follow the Handbook of the VERBMOBIL corpus project, which should be consulted for details.
- In particular, the orthographic representation is case sensitive: spelling is lower case, with initial upper case characters for nouns.
Note: Effectively, this introduces the equivalent of a partial morphosyntactic tag set, which may force arbitrary decisions in some cases, but in general reduces ambiguity at the expense of vocabulary size. The distinction affects stochastic language models, but is less relevant for pronunciation-oriented recogniser training.
- The German standard orthographic encoding of umlaut (diaeresis) and Eszett (`scharfes S', sz-ligature) follows the operational norm defined by the de facto standard `german.sty' file commonly used as a LaTeX style option for German (see Table).
- The orthographic names for `spellings', i.e. the letter names A, ... Z, plus special characters, are encoded with a prefixed dollar sign, thus: $A, ... $Z, $Ä, $Ö, $Ü, $SZ. The words zwei, zwo, doppel, Umlaut are also used in spellings.
Note: Omission of the dollar sign would lead to pronunciation ambiguities for many acronyms, which may be pronounced either letter by letter or as word-like units; this would negatively affect speech recognition training.
- Spellings are treated as common nouns, either indeclinable (in the spelling context), or declinable with -s in other contexts.
- Spelling sequences (in acronyms such as USA and `uptake spelling' of names etc.) are treated as compound nouns and therefore linked by hyphen. Thus the word `USA-Reise', USA trip, is encoded as `$U-$S-$A-Reise' (see [2]).
Note: In older orthographic corpus transcriptions, and in the VERBMOBIL orthographic transcription handbook, the sequence is inconsistently encoded as `$U $S $A-Reise', which implies a sequence of two word-like units and a compound word.
- The following orthographic separators are used; a distinction is made between those used in general orthographic representations of phrasal idioms and compound words, and additional separators used in specially segmented orthographic representations:
  - _ (underscore): word separator in phrasal items (e.g. idioms, greetings)
  - - (hyphen): word separator in compound words
  - #+ (hash-plus; inflected stem boundary, additional segment separator): between word stems and (sequences of) inflectional affixes, but not within such sequences
  - + (plus sign; morph boundary, additional segment separator): between orthographic morphs, i.e. between affixes, or between derivational stems and affixes
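The encoding conventions listed above can be illustrated with a minimal Python sketch. It is not part of the lexicon tools. The function names (encode_umlauts, letter_name, encode_spelling, segment) are hypothetical. The umlaut mapping assumes the usual `german.sty' shorthands ("a, "o, "u, "s), since the table referred to above is not reproduced in this section. The segmented forms `Termin#+e' and `ver+einbar' are invented examples, not entries taken from the word list.

```python
import re

# Assumption: the usual german.sty shorthands; the lexicon's own table
# is not reproduced in this section and may differ in detail.
GERMAN_STY = {"ä": '"a', "ö": '"o', "ü": '"u',
              "Ä": '"A', "Ö": '"O', "Ü": '"U', "ß": '"s'}


def encode_umlauts(word):
    """Replace umlauts and Eszett by their assumed german.sty shorthands."""
    return "".join(GERMAN_STY.get(ch, ch) for ch in word)


def letter_name(ch):
    """Return the dollar-prefixed letter name, e.g. 'u' -> '$U', 'ß' -> '$SZ'."""
    return "$SZ" if ch == "ß" else "$" + ch.upper()


def encode_spelling(letters, tail=None):
    """Encode a spelled sequence as a hyphen-linked compound.

    `tail` is an optional ordinary word that completes the compound.
    """
    parts = [letter_name(ch) for ch in letters]
    if tail:
        parts.append(tail)
    return "-".join(parts)


# Separators from the list above; '#+' is listed before '+' in the pattern
# so that it is always read as one two-character separator.
_SEPARATOR_RE = re.compile(r"(#\+|[_+-])")


def segment(form):
    """Split an orthographic form into (units, separators)."""
    tokens = _SEPARATOR_RE.split(form)
    return tokens[0::2], tokens[1::2]


if __name__ == "__main__":
    print(encode_spelling("USA", "Reise"))
    print(encode_umlauts("Größe"))
    print(segment("guten_Tag"))
    print(segment("Termin#+e"))
```

Under these assumptions, encode_spelling("USA", "Reise") yields the compound encoding `$U-$S-$A-Reise' given above, and segment returns the units of a form together with the separators between them, so the kind of boundary (phrasal, compound, inflectional or morph) can be recovered from the separator at each position.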

Dafydd Gibbon
Fri Sep 1 19:40:09 MET DST 1995