HyprLex: VM Lexicographic Database
HyprLex Applications Menu

Link to:
Top Changes Patch level FAQ Menu Conventions Operation Bottom Reference Main Menu

VM-HyprLex: Lexicon Database Documentation

U Bielefeld Lexicon Group, December 1995




Changes in LexDB Version 4.0, 23.09.96

This XXL version of the LexDB contains the automatic maximal paradigm projection of all attested word forms in the VERBMOBIL German corpus, with morphological, orthographic, phonological and prosodic specifications, including morphological and syllable boundary markers. Only the 7 word form surface attributes are available for this extended set (XXL wordlist). The DB was generated by Guido Drexel from a morphological database developed by Doris Bleiching (checked by Harald Lüngen, Daniela Steinbrecher and Martin Matthiesen, with feedback from colleagues at U Hamburg and LMU Munich). The Web interface was automatically generated by Holger Nord's IKE hyperdocument generator. The Morphology DB, LexDB and HyprLex concepts were developed and prototyped by Dafydd Gibbon.

Changes in LexDB Version 4.0, 27.09.96

Main change: extension of corpus coverage.

Changes in LexDB Version 3.2, 11.12.95

Patch levels (last first)



Frequently asked questions / FAQ

  1. Why are there only 7 attributes in LexDB Version 4.0? Well, there are in fact only seven attributes which are available for all 41541 words, viz. the morphological and phonological attributes. The other attributes (syntax, semantics, statistics, etc.) are only specified for occurrences in the corpus, and for about 5000 words. Version 4.0 is a projection from the corpus to include the full orthographically, phonologically and prosodically specified paradigms for all attested stems, including morph and syllable boundaries. See Bleiching & al. in D. Gibbon, ed. (1996): Natural Language and Speech Technology: Results of the 3rd KONVENS Conference. Berlin: Mouton de Gruyter, pp. 237-248.
  2. When will the Word List for the Research Prototype be available? The first 2500 word version, FPWL1, based on the model designed by Harald Lüngen for the Word List Task Force, has been available since October 1995, and was discussed at the Project Steering Meeting (PLS) at the end of that month. The second version, FPWL2, based on a revised model also designed by the Task Force, is available as of December 1995. The second model (just under 2300 words) leaves just over 200 words to be filled in from later scenario oriented data as they become available.
  3. Graphics files are slow to download, so why are they used? In fact, the question is based on a false presupposition: with the exception of the small usage indicator in the footer, graphics files are not used, for precisely this reason, and consequently are not downloaded. This help file may take longer to load because it is intended to print as a complete formatted document, and is rather long. It may be split in a future version. But non-textual display is restricted to HTML interactive form menu elements and to HTML-3 table elements, which are interpreted locally by your browser and are therefore dependent on the speed of your browser as well as on network transfer time. Set your browser options to load and view text first, and ensure that you have an adequate cache. At the beginning of a session, documents should in general be re-loaded in order to cache updates.
  4. How can I `tune' VM-HyprLex response time? Three main factors are involved in VM-HyprLex response time: server processing time (dependent on task complexity and server load), download time, and client display time. You can influence the task complexity and client display factors. With this information in mind, you will be able to make your own observations about the major factors which influence your tasks. Continuous efforts are in progress to increase server throughput, but a significant increase in speed will be obtained by increasing client bandwidth to 2 Mbit/s. If you are using a modem, response times will generally be rather slow, though for many VM-HyprLex tasks a modem speed of at least 14400 bit/s will be tolerable. Relatively large static documents (such as interactive menus or this help file), which take several seconds (if not more) to download via modem, should be cached locally by the client at the start of each session.
  5. Can I select an ordered hitlist (frequency list) as well as the frequency attribute information? Click on the highlight on the menu page. Hitlist 1000 is currently the complete frequency ordered list of words in the CDROM 1-5 transliterations. It will be replaced soon by the hitlist for CDROM 1-7.
  6. Surely it doesn't make sense to specify, say, Blaubeuren WL and SEARCH: RQH-WL simultaneously? It may do, because it may be useful to know whether a given entry in the Blaubeuren WL is also in the RQH-WL, i.e. in the intersection of two word lists.
  7. Is there a way of searching for entries by attribute value? Yes and no: the SEARCH: Global operation uses a UNIX-like regular expression to search over all fields in the database. Consequently, if the value can be uniquely represented by a regular expression, the relevant set of entries will be selected. For example, entering NN in the input field will select the set of common nouns based on the IMS POS tags.
  8. Why are there restrictions on input parameters in the KWIC application? Because otherwise the response file size may become too large to handle. Large files, which are generally only required for automatic processing rather than visualisation, may easily be generated from the LexDB versions bielefeld.lexdb.vn.m using standard UNIX stream processing tools.
  9. Why do I sometimes get the same response to a different query? This may depend on the browser cache. Try a Reload, Refresh, etc. In any case, you should refresh your cache in this way regularly in order to get updates. If different links refer to the same visible page of a document, a reload may be avoided by the browser in order to reduce response time.
  10. What is the difference between the Demonstrator Word List, the Research Prototype Word List, and the LexDB?
    • Demonstrator Word List: The Demo-WL was defined at the Bielefeld Lexicon Workshop, March 1994, for the VERBMOBIL demonstrator, and constitutes the null version of the LexDB. It contains words from the Blaubeuren dialogue corpus, extensions of scenario-relevant closed sets (e.g. months), and inflexional paradigm extensions for non-attested forms.
    • Research Prototype Word List: There have been two versions of the FPWL.
      1. The first FP-WL is based on the Lüngen Model described in detail below. It contains the Demo-WL as a proper subset, extended by new words from the Reithinger-Quantz-Herweg (RQH) Word List, the spelling compounds (abbreviated proper names), and 994 otherwise most frequent words from the corpus, to make up 2500 words.
      2. The second FP-WL is based on a more complex model with more elaborate filtering of word sets, as defined by the Word List Task Force, using criteria from speech recognition, syntax, and the domain semantics of the scenario.
    • LexDB: The LexDB contains the Demo-WL and FP-WL as proper subsets, and in addition all new words from the transliterations in CDROMs 1.0.3 to 5, as obtained by the filterings trlfilter -wg and trlfilter -wpg. In addition, paradigm extensions to include 1st person singular and plural present indicative verb forms were included at the request of partners in the VERBMOBIL Transfer project. (Other morphological forms will be supplied on request; the basic corpus word list projects to approx. 21k inflected forms and 100k morphological mappings after resolution of syncretisms; they can be included in the next version of the LexDB on request.)
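The remark in FAQ 7 about values that can be "uniquely represented by a regular expression" can be illustrated with a minimal sketch. The attribute names BIorth and IMSpos are taken from this documentation; the record layout, the sample values and the function name `global_search` are assumptions made purely for illustration, not the actual LexDB format or server implementation.

```python
import re

# Hypothetical records: only the attribute names come from the documentation,
# the layout and values are illustrative assumptions.
ENTRIES = [
    {"BIorth": "Termin", "IMSpos": "NN"},
    {"BIorth": "absprechen", "IMSpos": "VVINF"},
    {"BIorth": "CNN", "IMSpos": "NE"},
]

def global_search(entries, pattern):
    """Return the entries in which any field value matches the regular expression."""
    rx = re.compile(pattern)
    return [e for e in entries if any(rx.search(str(v)) for v in e.values())]

# A bare "NN" also hits the spelling "CNN", because every field is searched.
hits = global_search(ENTRIES, "NN")

# Anchoring the expression restricts the match to fields whose whole value is "NN".
exact = global_search(ENTRIES, r"^NN$")

print([e["BIorth"] for e in hits])
print([e["BIorth"] for e in exact])
```

The sketch shows why the answer is "yes and no": a global search selects by value only to the extent that the expression cannot also match some other field.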


Menu Conventions

  1. There are two main SEARCH modes: the Key mode and the SubDB mode. In the Key mode, any of the following may be selected:
    1. Key type and key string specification. In this case, the entire database is searched. The following options, with characteristic examples, are available:


      Complete match of fully inflected form in standard orthography
      Substring match of fully inflected form in standard orthography
      Complete match of Morphological Lemma attribute (currently BIlemma)
      Complete match of Semantic Lemma attribute (currently IMSlemma)
      Substring match over entire database

      A new Boolean function currently permits matching of conjunctions of A-V pairs, where values may be atom disjunctions; further operations are planned. The conjunction operator is `&' and disjunction operator is `|'; no blanks are permitted. Definition:
         AVpairs ::= AVpair [ `&' AVpair ]*
         AVpair  ::= ATT `:' VAL [ `|' VAL ]*
         ATT     ::= string
         VAL     ::= string
      
      The selection of OrthString with Terminabsprache is equivalent to the selection of A-V-Pairs with BIorth:Terminabsprache (similarly for OrthString, MorLemma and SemLemma). However, A-V-Pair search is less efficient than the other options. In the following example, those items are selected which have the IMSpos value `VVINF' and a frequency of either 4 or 6:

      A default value is provided for demonstration purposes, which reappears when the RESET button is operated. The default value given is for illustration of the Attribute-Value-Pairs option.
    2. The Key/SubDB switch activates the Key matching and sub-database extraction search modes. The SubDB extraction mode permits selection of pre-defined subsets of entries, and of user-defined subsets of attributes for these. The currently pre-defined subdatabases are:
      All entries (caution: selecting `All' attributes generates the whole database)
      Demonstrator Word list subset
      Wordlist subset for Blaubeuren Dialogues
      Wordlist subset for the 6 CDROM dialogues in the Reithinger/Quantz/Herweg selection
      Current Top 1000 Hitlist subset of most frequent words in the CDROM corpus transliterations
      Word list for VM Research Prototype (according to current model)


  2. The DISPLAY criteria determine which attribute values of matching entries are to be displayed, for example:
    If the option `Marked' is selected, attribute checkboxes should be clicked; if no attributes are marked, an error message is generated.



  3. Attributes to restrict the search and response space contexts are selected by clicking on check boxes. Defaults are simply the null value. Attribute selection applies to String, Substring and anonymous full SubDB selections (see below). Access by specific attribute values is only available implicitly for the named sub-database selections (set automatically under the SEARCH function); otherwise global substring search over all fields may be performed.
    Example: Spelling: Pronunciation:

    The attributes are defined as follows:



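The A-V pair query grammar given under Menu Conventions (conjunction with `&', value disjunction with `|', no blanks) can be sketched as a small parser and matcher. This is an illustration under stated assumptions, not the HyprLex implementation: the function names, the dictionary record layout and the attribute name `freq` are hypothetical, and only IMSpos and BIorth are attribute names taken from this documentation.

```python
def parse_av_query(query):
    """Parse 'ATT:VAL|VAL&ATT:VAL' into a list of (attribute, set of values) pairs."""
    if not query or " " in query:
        raise ValueError("query must be non-empty and contain no blanks")
    pairs = []
    for pair in query.split("&"):          # AVpairs ::= AVpair [ '&' AVpair ]*
        att, sep, vals = pair.partition(":")
        values = vals.split("|")           # AVpair ::= ATT ':' VAL [ '|' VAL ]*
        if not sep or not att or not all(values):
            raise ValueError(f"malformed A-V pair: {pair!r}")
        pairs.append((att, set(values)))
    return pairs

def matches(entry, pairs):
    """True if, for every queried attribute, the entry has one of the listed values."""
    return all(att in entry and str(entry[att]) in values for att, values in pairs)

# Hypothetical entries; 'freq' is an assumed name for the frequency attribute.
ENTRIES = [
    {"BIorth": "absprechen", "IMSpos": "VVINF", "freq": 4},
    {"BIorth": "treffen", "IMSpos": "VVINF", "freq": 5},
    {"BIorth": "Termin", "IMSpos": "NN", "freq": 6},
]

pairs = parse_av_query("IMSpos:VVINF&freq:4|6")
selected = [e["BIorth"] for e in ENTRIES if matches(e, pairs)]
print(selected)
```

Because every A-V pair is tested against every entry, a search of this kind scans the whole database, which is consistent with the remark that A-V-Pair search is less efficient than the dedicated key types.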

Operation

  1. Click on the RESET button to set default values for all options.


  2. Click on the or button in order to start the search.

VM-HyprLex service 10.12.95