This XXL version of the LexDB contains the automatic maximal paradigm projection of all attested word forms in the VERBMOBIL German corpus, with morphological, orthographic, phonological and prosodic specifications, including morphological and syllable boundary markers. Only the 7 word form surface attributes are available for this extended set (XXL wordlist).
The DB was generated by Guido Drexel from a morphological database developed by Doris Bleiching (checked by Harald Lüngen, Daniela Steinbrecher and Martin Matthiesen, with feedback from colleagues at U Hamburg and LMU Munich).
The Web interface was automatically generated by Holger Nord's IKE hyperdocument generator. The Morphology DB, LexDB and HyprLex concepts were developed and prototyped by Dafydd Gibbon.
Changes in LexDB Version 4.0, 27.09.96
Main change: extension of corpus coverage.
Changes in LexDB Version 3.2, 11.12.95
Important: see current revisions (patches without version change)
Field separator changed from `@empty@' to `--' for compactness.
Minor changes in some field content specifications
(see detailed documentation below and context-sensitive help)
New attribute
The attribute BIortherror has been introduced to mark
words in the LexDB which do not have standard orthography, but which have
been automatically integrated in order not to lose information.
Additional fully inflected forms
The First Singular Present forms of all verbs have been included at the
request of the Transfer group. This has increased the size of the LexDB
by about 1000 entries.
Faulty entries
Fault: The `termin' case turned out to be one of a range of incorrect entries
caused by use of non-standard orthography in a subdatabase delivered by
a VM partner, and incomplete filtering of non-words from this database.
Remedy: The converter for this input has been redesigned and the database
has been remade.
Faulty entry `termin'
Fault: Incorrect orthography due to incomplete manual correction of non-standard orthography for Transfer Rule attribute IMSrule.
Remedy: Removal of entry and concatenation of the value of its IMSrule value with `;' to value of IMSrule for entry `Transfer'.
Corrected: 13 Dec 95
Why are there only 7 attributes in LexDB Version 4.0? Well, there are in fact only seven attributes which are available for all 41541 words, viz. the morphological and phonological attributes. The other attributes (syntax, semantics, statistics, etc.) are only specified for occurrences in the corpus, and for about 5000 words. Version 4.0 is a projection from the corpus to include the full orthographically, phonologically and prosodically specified paradigms for all attested stems, including morph and syllable boundaries. See Bleiching & al. in D. Gibbon, ed. (1996): Natural Language and Speech Technology: Results of the 3rd KONVENS Conference. Berlin: Mouton de Gruyter, pp. 237-248.
When will the Word List
for the Research Prototype be available?
The first 2500 word version, FPWL1,
based on the model designed by Harald Lüngen for
the Word List Task Force, has been available
since October 1995, and was discussed at the Project Steering Meeting (PLS)
at the end of that month. The second version, FPWL2, based
on a revised model also designed by the Task Force, is available as of
December 1995. The second model (just under 2300 words) leaves
just over 200 words to be filled in from later scenario oriented data as they
become available.
Graphics files are slow to download, so why are they used?
In fact, the question is based on a false presupposition:
with the exception of the small usage indicator in the footer, graphics files
are not used, for precisely this reason,
and consequently are not downloaded.
This help file may take longer to load because it is intended to print as a
complete formatted document, and is rather long. It may be split in a future
version. But non-textual display is restricted to HTML interactive form menu
elements and to HTML-3 table elements, which are interpreted locally by your
browser and are therefore dependent on the speed of your browser as well as
on network transfer time.
Set your browser options to load and view text first, and ensure that you
have an adequate cache. At the beginning of a session, documents should in
general be re-loaded in order to cache updates.
How can I `tune' VM-HyprLex response time?
Three main factors are involved in VM-HyprLex response time:
server processing time (dependent on task complexity and server load),
download time, and client display time; you can influence the task complexity
and client display factors:
When the first result appears, local processing is finished
and the rest is download time and client display time.
Tasks such as substring, global substring and subdatabase search,
or the construction of large concordances, are
more time consuming than plain entry string access.
Simultaneous access by several VERBMOBIL clients results in slower
processing and download, and local use of the server (e.g. database
restructuring, programme compilation) will also increase server load
and reduce throughput. The smaller the search space and the result
space (e.g. working with the FPWL subdatabase, not specifying all attributes),
the faster the response time.
Download time is dependent on the volume of results attained, and
is greatest with substring, global substring and concordance tasks; this
you can influence yourself, as outlined above.
Client WIN bandwidth below 2 MBit will result in suboptimal transfer rates.
This is a question for your system manager.
Client display time depends on client machine and browser software
properties, and on display size. The latter in turn depends
partly on display quality. In order to improve functionality
through display quality, tabular text patterns are coded in HTML-3.
Menu matrices are rendered in this way (note that on graphic browsers,
these also require the button graphics used in HTML forms).
For online concordance calculation, this also considerably improves server
processing and download times
(with slight extra cost in terms of client display time). Attention to
factors such as client processor speed, memory size, browser cache capacity
and browser version may prove beneficial. You may be able to influence some
of these factors yourself; otherwise they should be referred to your
system manager.
With the above information in mind, you will be able to make your own
observations about the major factors which influence your tasks.
Continuous efforts are in progress to increase server throughput, but
a significant increase in speed will be obtained by increasing
client bandwidth to 2 MBit. If you are using a modem,
response times will generally be rather slow, though for many VM-HyprLex
tasks a modem bandwidth >= 14400 Bit will be tolerable.
Relatively large static documents (such as interactive menus or this help file),
which take several seconds (if not more) to download via modem,
should be cached locally by the client at the start of each session.
Can I select an ordered hitlist (frequency list) as well as
the frequency attribute information?
Click on the highlight on the menu page. Hitlist 1000 is currently the
complete frequency ordered list of words in the CDROM 1-5 transliterations.
It will be replaced soon by the hitlist for CDROM 1-7.
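For readers who want to build a comparable frequency-ordered list from their own transliteration data, the idea can be sketched as follows. This is only an illustration: the function name hitlist and the sample tokens are hypothetical, and the sketch does not reproduce the procedure used to build the VM-HyprLex hitlist.

```python
from collections import Counter


def hitlist(tokens, n=None):
    """Return (word, count) pairs ordered by descending frequency.

    Hypothetical helper for illustration; not part of VM-HyprLex.
    """
    counts = Counter(tokens)
    # Sort by count (descending), then alphabetically, so ties are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked if n is None else ranked[:n]


# Invented sample tokens, standing in for a transliteration file.
tokens = "ja das geht ja gut das geht ja".split()
print(hitlist(tokens, 2))
```

Passing n=1000 would give the 1000 most frequent words, which is presumably what the name Hitlist 1000 refers to.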
Surely it doesn't make sense to specify, say, Blaubeuren WL and
SEARCH: RQH-WL simultaneously?
It may do, because it may be useful to know whether a given entry in the
Blaubeuren WL is also in the RQH-WL, i.e. in the intersection of two word lists.
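The combination amounts to a set intersection, which may be sketched as follows. The function name in_both and the two short word lists are invented for illustration and are not taken from the actual Blaubeuren WL or RQH-WL.

```python
def in_both(wordlist_a, wordlist_b):
    """Return the words that occur in both lists, sorted alphabetically."""
    return sorted(set(wordlist_a) & set(wordlist_b))


# Invented sample data; the real word lists are much larger.
blaubeuren = ["Termin", "Montag", "Uhr"]
rqh = ["Montag", "Dienstag", "Termin"]

print(in_both(blaubeuren, rqh))  # ['Montag', 'Termin']
```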
Is there a way of searching for entries by attribute value?
Yes and no:
the SEARCH: Global operation uses a UNIX-like regular expression
to search over all fields in the database. Consequently, if the value
can be uniquely represented by a regular expression, the relevant set
of entries will be selected. For example, entering NN in the
input field will select the set of common nouns based on the IMS POS tags.
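The behaviour described can be approximated by the following sketch, which applies one regular expression to every field of every entry. The field names orth and pos and the three entries are hypothetical, and the server's own implementation may well differ.

```python
import re


def global_search(entries, pattern):
    """Return the entries in which any field value matches the pattern."""
    regex = re.compile(pattern)
    return [
        entry for entry in entries
        if any(regex.search(value) for value in entry.values())
    ]


# Invented entries; "orth" and "pos" are assumed field names.
entries = [
    {"orth": "Termin", "pos": "NN"},
    {"orth": "gehen", "pos": "VVINF"},
    {"orth": "Hannover", "pos": "NE"},
]

# "NN" selects only the common noun here.
print([e["orth"] for e in global_search(entries, "NN")])
# A looser pattern such as "N" also matches the tags VVINF and NE.
print([e["orth"] for e in global_search(entries, "N")])
```

This is why the answer is "yes and no": the search selects the intended entries only if the expression cannot also match some other field or value.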
Why are there restrictions on input parameters in the KWIC application?
Because otherwise the response file size may become too large to handle.
Large files, which are generally only required for automatic processing
rather than visualisation, may easily be generated from the LexDB versions
bielefeld.lexdb.vn.m using standard UNIX stream processing tools.
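As a rough offline alternative, a keyword-in-context listing can be produced with a few lines of script. The sketch below assumes plain text with one utterance per line and whitespace-separated tokens; the function name kwic and the width parameter are hypothetical and do not correspond to the input parameters of the KWIC application.

```python
def kwic(lines, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence.

    width is the number of tokens kept on each side of the keyword.
    """
    hits = []
    for line in lines:
        tokens = line.split()
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, token, right))
    return hits


# Invented sample utterance.
lines = ["wir machen einen Termin am Montag aus"]
for left, word, right in kwic(lines, "Termin"):
    print(f"{left:>20}  {word}  {right}")
```

The output grows with both the number of occurrences and the context width, which is the reason for limiting these parameters in the online application.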
Why do I sometimes get the same response to a different query?
This may depend on the browser cache. Try a Reload, Refresh,
etc. In any case, you should refresh your cache in this way regularly in order
to get updates.
If different links refer to the same visible page of a document, a reload
may be avoided by the browser in order to reduce response time.
What is the difference between the Demonstrator Word List,
the Research Prototype Word List, and the LexDB?
Demonstrator Word List: The Demo-WL was defined
at the Bielefeld Lexicon Workshop, March 1994, for the VERBMOBIL demonstrator,
and constitutes the null version of the LexDB.
It contains words from the Blaubeuren dialogue corpus,
extensions of scenario-relevant closed sets (e.g. months), and inflexional
paradigm extensions for non-attested forms.
Research Prototype Word List: There have been two versions of the FPWL.
The first FP-WL is based on
the Lüngen Model described in detail below.
It contains the Demo-WL
as a proper subset, extended by new words from the Reithinger-Quantz-Herweg
(RQH) Word List, the spelling compounds (abbreviated proper names), and
994 otherwise most frequent words from the corpus, to make up 2500 words.
The second FP-WL is based on a more complex model,
with more complex filtering of word sets, as defined by the
Word List Task Force, on criteria from speech recognition, syntax, and
the domain semantics of the scenario.
LexDB: The LexDB contains the Demo-WL and FP-WL as proper subsets,
and in addition all new words from the transliterations in CDROMs 1.0.3 to 5,
as obtained by the filterings trlfilter -wg
and trlfilter -wpg. In addition, paradigm extensions to include
1st person singular and plural present indicative verb forms were included
at the request of partners in the VERBMOBIL Transfer project. (Other
morphological forms will be supplied on request; the basic corpus word list
projects to approx. 21k inflected forms and 100k morphological mappings after
resolution of syncretisms; they can be included in the next version of the
LexDB on request.)