A lexical pronunciation transcription was developed, based on the following phonological conventions and SAMPA-oriented encoding principles.
- The fundamental phonological transcription convention is defined as phonemic transcription according to the international SAMPA conventions for German.
Note: Some local SAMPA variants differ slightly, e.g. in using /Q/ for glottal stop; however, this is defined in international SAMPA as a variety of open rounded back vowel and was therefore considered unsuitable.
- Length marks for long vowels are included, as these are widely used; they are, strictly speaking, redundant, and are not used in the Aussprachewörterbuch (Pronunciation Dictionary) of the Duden Publishing Company.
- Word stress information is included: Primary and secondary lexical stress are encoded (to be distinguished from phonetic accent in context).
Note: The international SAMPA encoding with double quote (") and percent (%) was found to be inconvenient for a number of ASCII-oriented processing environments, and was replaced by a single quote (') for primary stress and two single quotes ('') for secondary stress. In LaTeX, the latter is generally indistinguishable from a double quote. This notation has the advantage of being iconic; for example, tertiary stress (in compound words) can simply be encoded as three single quotes (''').
- Conventions and encodings for morphological and phonological boundary markers are used as follows:
- # (hash, word boundary): Word boundaries of two kinds are included; the hash sign (#) is a standard notation in linguistics.
In compound words, word boundaries are encoded as a single hash sign (#).
In phrasal idioms, word boundaries are encoded as a double hash sign (##).
- . (period, point): Syllable boundaries which are not simultaneously word boundaries; word boundaries are simultaneously syllable boundaries.
- + (plus): Morph boundaries which are not simultaneously word boundaries; word boundaries are simultaneously morph boundaries.
Phonemic morphs (in contrast to orthographic morphs in orthography) are the phonemic representations of morphemes, as distinct from structural or semantic characterisations of morphemes.
Note: Morphs and syllables are frequently not co-extensive: morphs may contain more or less than one syllable, and morphs and syllables may overlap (cf. Verbindung /vE6.+b'In.d+UN/ for overlap of the morph /b'Ind/ with the syllable /dUN/).
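The stress and boundary conventions above can be illustrated with a few standard UNIX filters. The following is only a minimal sketch, not part of the VERBMOBIL tools: the function names sampa_to_quote_stress, syllables and morphs are hypothetical, and it is assumed that transcriptions arrive one per line on standard input.

```shell
# Hypothetical helper: rewrite international SAMPA stress marks in the
# single-quote notation (% becomes '', " becomes ').
sampa_to_quote_stress() {
    sed -e "s/%/''/g" -e "s/\"/'/g"
}

# Hypothetical helper: list syllables, one per line.
# Morph boundaries (+) are dropped; syllable (.) and word (#) boundaries split.
syllables() {
    tr -d '+' | tr '.#' '\n\n' | grep -v '^$'
}

# Hypothetical helper: list phonemic morphs, one per line.
# Syllable boundaries (.) are dropped; morph (+) and word (#) boundaries split.
morphs() {
    tr -d '.' | tr '+#' '\n\n' | grep -v '^$'
}

# The Verbindung example from the text:
printf '%s\n' "vE6.+b'In.d+UN" | syllables
printf '%s\n' "vE6.+b'In.d+UN" | morphs
```

For the Verbindung example, the syllable listing gives vE6, b'In and dUN, while the morph listing gives vE6, b'Ind and UN, which makes the overlap of the morph /b'Ind/ with the syllable /dUN/ directly visible.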
The strictly ASCII oriented database standard enables simple database access functions to be specified and emulated using UNIX tools, defining sets such as the following:
- the set of unstressed monosyllables
- the set of compounds
- the set of polysyllabic noninflected simplex words
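For example, 'the set of morphologically simple unstressed monosyllables with short vowels' is defined as the set of entries containing no syllable boundary (.), stress mark ('), morph boundary (+), word boundary (#) or length mark (:):
grep -v "[.'+#:]" <infile> > <outfile>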
A set of UNIX access tools has been provided for the VERBMOBIL database, based on these basic principles.
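Other sets from the list above can be sketched in the same style. The functions below are hypothetical and are not the VERBMOBIL access tools themselves; they assume one transcription per line on standard input, and they assume that an entry counts as a compound if it contains a single hash but no double hash. The sample entries are invented for illustration, apart from the Verbindung example.

```shell
# Hypothetical: unstressed monosyllables have no syllable boundary (.),
# no word boundary (#) and no stress mark (').
unstressed_monosyllables() {
    grep -v "[.'#]"
}

# Hypothetical: compounds contain a word boundary (#), but phrasal idioms
# (double hash, ##) are excluded. This is an assumption about the data.
compounds() {
    grep '#' | grep -v '##'
}

# Invented sample entries, one transcription per line.
sample_lexicon() {
    printf '%s\n' "Unt" "h'aUs" "h'aUs#t''y:6" "vE6.+b'In.d+UN" "g'u:.t@n##t'a:k"
}

sample_lexicon | unstressed_monosyllables
sample_lexicon | compounds
```

With a real lexicon file, the same filters would be applied by redirection, for example by reading the input file on standard input and writing the selected set to an output file.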
Dafydd Gibbon
Fri Sep 1 19:40:09 MET DST 1995