Filename:    README.bielefeld.lexdb.v2.0
Author:      Dafydd Gibbon
Date:        20 November 1994
Description: Brief overview of the Bielefeld lexicon database, Version 2.


Summary of features of the VERBMOBIL lexicon database, Version 2

Dafydd Gibbon, U Bielefeld
20 December 1994


Contents

1. Structure and coverage of the lexicon database
2. Description of attributes
3. Tools
4. Directory
5. Brief database report


1. Structure and coverage of the lexicon database

There has been considerable discussion with partners about the most
suitable structure for the database, and further discussion and
suggestions are welcome. The main problems discussed so far are the
following:

(1) Flat relational db vs. embedded relations for complex
    substructures.

(2) Lexical coverage of the database, both in terms of lemmas and in
    terms of inflected forms derived from these lemmas.

(3) Unique fully inflected forms as keys, one record per keyword, with
    vectors of dependent information, and disjunctive compression for
    alternative readings of these keys (mainly for the Bielefeld
    prosodic morphology, the Berlin STUF lexicon, and the Stuttgart
    transfer database).

Mainly for reasons of time, we have decided to keep the original
database structure and coverage of the Bielefeld Demonstrator Wordlist,
which may be regarded as Version 0 of the database. After
dissemination of Version 1 of the database on 30 November 1994, several
additional offers of information were made; it therefore became
necessary to define and implement additional format transformations,
which was fairly time-consuming in some cases.

The database is consequently still, at the top level, a flat relational
database, with strings (which may not contain blanks or tabs) separated
by a single blank. The strings are structured as follows:

(1) Disjunctive values: A semicolon (;) separates alternative values
    of a given attribute for a given keyword.
    In some of the input data, these alternative values were given in
    separate records (lexical entry definitions, etc.).

(2) The further internal structure of values is attribute-dependent.
    Further internal disjunctions are represented by a double
    semicolon (;;), and vectors (in general to be interpreted as
    conjunctions or more finely grained values of sub-attributes) are
    variously represented by ampersand (&), comma (,), or understroke
    (_), depending on preferences implicit in the input data.

The use of disjunctions of value vectors inevitably results in
considerable redundancy. So far, this is not a problem; the question
will be addressed in more detail for the next version of the database.

In the Bielefeld frequency lists for the CD-ROMs, the angle brackets
(< and >) around discourse particles were removed in order to conform
to linguistic practice and the Demonstrator Wordlist convention. This
resulted in one ambiguity (ah and <ah>); for cases such as this, values
were automatically summed. However, in later versions of the database
this may be changed.

The number of columns in each record, corresponding to the number of
top-level attributes, has increased to 20. The attributes are described
below. The internal structure of the values means that the effective
number of values per record is considerably greater, of the order of
40-50. In particular, the main items of the Stuttgart database are no
longer represented as separate attributes, but as a vector of values;
disjunctions of these vectors form the values of the Bielefeld
database. Similar considerations apply to the Bielefeld prosodic
morphology and to the Berlin STUF lexicon.

Coverage is still restricted in this version to the Demonstrator
Wordlist, with minor modifications:

(1) Spelling errors deriving from the early transliterations were
    removed.

(2) Formal second person plural pronouns were introduced with upper
    case initials at the request of a number of partners (Sie, Ihnen,
    Ihr, ...).
This results in a database length of 1292 records.

Restricting coverage to that of the Demonstrator Wordlist has the
advantage of providing a widely used common foundation for developing
the database structure and content. However, the coverage of the input
databases differed widely: some were restricted to the Demonstrator
Wordlist, some contained morphological generalisations to complete
paradigms, and some contained the entire contents of the 3 currently
available CD-ROMs. It is intended to take the UNION of these sets in
the next version of the lexicon database, rather than their
INTERSECTION.

The total size of the database is 1292 records * 20 fields, resulting
in a total of 25840 top-level field values. Where no information was
available for a given field, it was padded with the string '@empty@'.
Where the input data had their own padding indicator, such as '@LEER@'
in the Stuttgart database, this was retained.

The date of issue of Version 3 will depend on further input and
feedback.


2. Description of attributes

    Attribute       Source   Description

 1  Num             UBI      Serial number of entry
 2  BIorth          UBI      Orthographic inflected form (Demonstrator
                             Wordlist, May 1994, with minor corrections)
 3  BIorthseg       UBI      Morphologically segmented orthographic
                             inflected form (requested by speech groups)
 4  BImorpro        UBI      Morphoprosodic transcription (integrated
                             surface wordform interchange format for
                             phonology, prosody and morphology)
 5  BIlemma         UBI      Neutral lemma form (normal form)
 6  BIorthstem      UBI      Orthographic stem form (requested by
                             speech groups)
 7  BImorprostem    UBI      Morphoprosodic stem form (requested by
                             speech groups)

    Attributes 1-7 provided by Daniela Steinbrecher, Doris Bleiching,
    Dafydd Gibbon.

 8  BIflex          UBI      Disjunctive inflexion category vectors
                             (standardised inflectional morphology
                             reference)

    Provided by Doris Bleiching.

 9  BICD1           UBI      CDROM-1 word frequencies
10  BICD2           UBI      CDROM-2 word frequencies
11  BICD3           UBI      CDROM-3 word frequencies
12  BICDall         UBI      Sum of the CD-ROM word frequencies (the
                             Bielefeld analysis has been adopted since
                             other analyses were not available in time)

    Attributes 9-12 provided by Daniela Steinbrecher.

13  KIcanon         Kiel     TP 14 canonical corpus transcription
                             (based on TP 14 corpus annotation
                             conventions; blanks in phrasal items
                             recoverably coded as two understrokes
                             '__')

    Provided by Klaus Kohler.

14  BIdiscpart      UBI      Discourse particle information
                             (pragmatic/dialogue categories)

    Provided by Kerstin Fischer.

15  SIEMENScats     SIEMENS  Siemens lexicon (N, A, V, funcats,
                             zeitgramm)

    Provided by Stephanie Schachtl.

16  STUTTcats       IMS      Disjunction of value vectors
    STUTTsyntag     IMS      Stuttgart baseform
    STUTTsyntag     IMS      Part of Speech (Stuttgart tagger)
    STUTTsempred    IMS      Semantic predicate
    STUTTsempred    IMS      Semantic macro name
    STUTTtransfwho  IMS      Responsible Transfer partner
    STUTTengpred    IMS      English predicate
    STUTTenglem     IMS      English lemma

    Provided by Martin Emele.

The following attributes are currently kept in separate files because
of overlong value strings:

    Bsynsem         HUB      Berlin STUF-III synsem lexicon

    Provided by Johannes Heinecke, Gunter Gebhardi, Markus Duda in
    cooperation with Tibor Kiss, Stefan Geissler, Kai Lebeth, Scott
    McGlashan.

    BSlabel         TUBS     Braunschweig signal annotations

    Provided by Michael Lehning.

The database will be re-designed to cope with this; time did not permit
a more integrated treatment at this stage.


3. Tools

Currently, two tools are provided in the same directory:

    dbreport (no arguments)   - Writes a report to standard output.
    dbselect (>1 arguments)   - Selects attributes by name.

The dbselect tool currently does not select Bsynsem and BSlabel; these
will have to be processed with the standard UNIX tools.

The dbselect tool is not aware of the internal structure of value
strings below the top level.
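
Since dbselect stops at the top level, value strings have to be taken
apart separately. The following is a minimal sketch in Python, not part
of the distributed tools. It assumes the conventions of Section 1:
fields separated by a single blank, a single semicolon between
alternative values, a double semicolon between internal disjunctions,
and an attribute-dependent vector separator (&, comma or understroke)
that the caller must supply. The function names split_record and
split_value and the sample value string are hypothetical and chosen
only for illustration.

```python
import re

EMPTY = "@empty@"  # padding string for fields without information (Section 1)


def split_record(line):
    """Split one record into its top-level fields (single blank as separator)."""
    return line.rstrip("\n").split(" ")


def split_value(value, vector_sep=None):
    """Decompose one value string.

    Returns a list of alternatives; each alternative is a list of internal
    disjuncts; each disjunct is a list of vector components. If no vector
    separator is given, each disjunct is kept as a one-element list.
    """
    if value == EMPTY:
        return []
    # A lone ';' separates alternatives; ';;' is left intact for the next step.
    alternatives = re.split(r"(?<!;);(?!;)", value)
    result = []
    for alt in alternatives:
        disjuncts = alt.split(";;")
        if vector_sep is None:
            result.append([[d] for d in disjuncts])
        else:
            result.append([d.split(vector_sep) for d in disjuncts])
    return result


# Made-up value string, not taken from the database.
print(split_value("a,b;c;;d,e", vector_sep=","))
# → [[['a', 'b']], [['c'], ['d', 'e']]]
```

Which separator applies to which attribute has to be read off the data
for that attribute; the sketch deliberately does not guess it.
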
Further processing is attribute specific, and will need to be done by
the user. At a later date, more finely grained access tools will be
provided. Selection by keyword or by a specific feature is easily
performed with the standard UNIX tools.


4. Directory

    README.bielefeld.lexdb.v2.0   This file
    bielefeld.lexdb.v2.0          Bielefeld VM lexicon database
    braunschweig.label.att        Braunschweig subdatabase
    coverage.rprt*                Called by dbreport
    dbreport*                     Brief database report tool
    dbselect*                     Name-based attribute selection tool
    hub.synsem.att                Berlin subdatabase
    report.cat                    Report for current database
                                  (identical with Section 5 below)


5. Brief database report

========================================================
Report on bielefeld.lexdb.v2.0
Thu Dec 22 11:40:50 MET 1994
========================================================
The first (null) record should contain attribute names.
--------------------------------------------------------
Online record and field checks:
Number of true records: 1292
Number of fields per record: 16
--------------------------------------------------------
Attribute coverage for bielefeld.lexdb.v2.0
Num 100%
BIorth 100%
BIorthseg 100%
BImorpro 100%
BIlemma 100%
BIorthstem 100%
BImorprostem 100%
BIflex 100%
BICD1 100%
BICD2 100%
BICD3 100%
BICDall 100%
KICanon 100%
BIdiscpart 1%
SIEMENScats 95%
STUTTcats 99%
Attribute coverage for hub.synsem.att
NUM 100%
Bsynsem 100%
Attribute coverage for braunschweig.label.att
NUM 100%
BSlabel 61%
========================================================
-----------------End of file-------------------------------