Spectrum Lexicography Workshop 2004
Timetable, 15 June - 13 July 2004
First few weeks of term: Introduction to computational corpus lexicography:
- Terminology
- Concordancing (see below)
Tuesday 18 May 2004, 12:15h (Phonetics Lab)
- Nadine Borchardt: Shoebox - practical aspects.
Tuesday 25 May 2004, 12:15h (Phonetics Lab)
- (LREC conference period)
Tuesday 01 June 2004, 12:15h (Phonetics Lab)
- Nadine Borchardt: Shoebox - practical aspects.
Tuesday 08 June 2004, 12:15h (Phonetics Lab)
- Caroline Sporleder: Machine learning and lexicography.
Tuesday 15 June 2004, 12:15h (Phonetics Lab)
Shoebox (continuation; Nadine Borchardt and others):
- defining and modifying database structure
- microstructure definitions (types of lexical information)
Owing to illness, we instead discussed the following aspects of the
Ibibio lexicography work:
- Structure of the printed version of the Ibibio electronic dictionary
- Lexicographic procedure, from tabular capture of lexical information to
adding format markup and printing the dictionary
- UNIX script for format markup
Tuesday 22 June 2004 at 12:15h (Phonetics Lab)
Discussion of current lexicographic projects:
- Ibibio project (Eno-Abasi Urua, Moses Ekpenyong)
- Saba Amsalu
- Nadine Borchardt
- Anna Garbar
- Thorsten Trippel (state of the lexicon in the Modelex project)
Tuesday 29 June 2004 at 12:15h (Phonetics Lab)
Dafydd Gibbon: Formalising the computational lexicon (theory)
- generalisation as an operation over feature structures,
- generalisation in inheritance networks,
- the ILEX (Integrated Lexicon with Exceptions) lexicon model.
Tuesday 6 July at 12:15h (Phonetics Lab)
Dafydd Gibbon: Implementing the formal computational lexicon (practice)
- DATR resources (website)
- ZDATR web interface
- ZDATR UNIX and Windows command line interface
Tuesday 13 July at 12:15h (Phonetics Lab)
Dafydd Gibbon & Thorsten Trippel: The TAMINO XML lexical database.
Tuesday 20 July at 12:15h (Phonetics Lab)
Continuation of project-oriented discussions
(I will be at the EMELD conference in Detroit)
Tuesday 27 July at 12:15h (Phonetics Lab)
- Final meeting: coordination of summer months
- Plans for follow-up to the ModeLex project
Corpus lexicography notes
- The general rule in corpus-based lexicography is: the larger the corpus, the better.
- Higher order n-grams can be generated using the same techniques; the higher the value of n, the larger the corpus should be.
- Note that the script is not optimised in any way but is intended to illustrate the use of a fairly wide range of UNIX tool techniques (my excuse for hacking this script).
- For further information see:
- Gibbon, Dafydd, Roger Moore & Richard Winski, eds. (1997). Handbook of Standards and Resources for Spoken Language Systems. Berlin: Mouton de Gruyter. [Especially my chapter on the lexicon.]
- van Eynde, Frank & Dafydd Gibbon, eds. (2000). Lexicon Development for Speech and Language Processing. Dordrecht: Kluwer Academic Publishers. [Especially my introductory chapter.]
- Gibbon, Dafydd, Inge Mertins & Roger Moore, eds. (2000). Handbook of Multimodal and Spoken Dialogue Systems. New York &c.: Kluwer Academic Publishers. [Especially the chapter on terminology.]
- lexicography2004-04-27.txt (3.0k)
Frequency dictionary (using UNIX script)
INPUT: text
OUTPUT: wordlists and digram lists
METHOD: remove punctuation, put each word on a separate line, sort, count
doublets (duplicates), and work out frequency information.
- lexscript01.sh (5.0k; archaic HTML wrapper needs to be removed and file renamed)
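The method can be sketched with standard UNIX tools. This is an illustrative reconstruction, not lexscript01.sh itself; the sample text and output file names are invented:

```shell
#!/bin/sh
# Sketch of the frequency-dictionary pipeline (illustrative reconstruction).
printf 'The cat sat on the mat. The cat slept.\n' > corpus.txt

# Remove punctuation and case, putting each word on a separate line.
tr -sc 'A-Za-z' '\n' < corpus.txt | tr 'A-Z' 'a-z' | grep -v '^$' > words.txt

# Sort, count doublets, and rank by descending frequency.
sort words.txt | uniq -c | sort -rn > ranked.txt

# Digram list: pair each word with its successor, then count and rank.
# (In this simple sketch the final word pairs with an empty field.)
tail -n +2 words.txt | paste words.txt - | sort | uniq -c | sort -rn > digrams.txt

head -n 2 ranked.txt
```

The same paste trick extends to higher-order n-grams by offsetting further copies of the wordlist, which is one reason a larger n demands a larger corpus.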
INPUT
The input corpus from which the output files were generated is simply the
lexicography notes file linked above.
OUTPUT
Automatically generated word output files in order of generation:
- lex_out_wordlist.txt (3.0k)
- lex_out_alphasorted.txt (4.0k)
- lex_out_numericsorted.txt (4.0k)
- lex_out_ranked.txt (6.0k)
Automatically generated digram output files in order of generation:
- digram_out_list.txt (5.0k)
- digram_out_alphasorted.txt (8.0k)
- digram_out_numericsorted.txt (8.0k)
- digram_out_ranked.txt (12k)
KWIC concordance (using UNIX script)
INPUT: plain ASCII text
OUTPUT: normalised text, wordlist, concordance
METHOD: normalise the text into punctuation-delimited phrases, create a wordlist
(see above), then for each word list the phrases in which it occurs.
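A minimal sketch of this method with standard UNIX tools (an illustrative reconstruction, not the script that generated the files below; the sample text and file names are invented):

```shell
#!/bin/sh
# Sketch of a KWIC-style concordancer (illustrative reconstruction).
printf 'The cat sat. The dog ran. A cat slept.\n' > text.txt

# 1. Normalise the text into punctuation-delimited phrases, one per line.
tr '.!?;:,' '\n' < text.txt | sed 's/^ *//; s/ *$//' | grep -v '^$' > phrases.txt

# 2. Create the wordlist (unique lowercase words, as in the frequency pipeline).
tr -sc 'A-Za-z' '\n' < phrases.txt | tr 'A-Z' 'a-z' | grep -v '^$' | sort -u > words.txt

# 3. For each word, list the phrases in which it occurs.
while read -r w; do
  printf '== %s ==\n' "$w"
  grep -iw -- "$w" phrases.txt
done < words.txt > conc.txt
```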
INPUT
OUTPUT
WARNING: THESE FILES ARE LARGE, and the third, the concordance, is especially
large (over 30 pages). Do NOT print them in their entirety; print only what you
can see on a web page.
- conc_test-norm.txt (4.0k, 156 lines, 3 pages)
- conc_test-words.txt (4.0k, 423 lines, 9 pages)
- conc_test-conc.txt (73k, 1725 lines, 35 pages!)
The Shoebox lexical database management system (DBMS)
The Shoebox toolset for lexical database management will be introduced by
Nadine Borchardt.
Ibibio dictionary project
The first step in making the Ibibio dictionary was to create a concordance, i.e. a wordlist with associated contexts:
The Ibibio concordance PDF DRAFT VERSION (don't print without thinking: around 600 pages!).
The Ibibio electronic dictionary was developed along simplified
lexicographic lines to conform with the WELD (Workable Electronic Language
Documentation) principles:
The Ibibio dictionary PDF DRAFT VERSION (don't print without thinking: around 130 pages!).
The procedures followed are outlined in the front matter to the concordance and the dictionary.
Formalising lexical structure
Overview
- Lexical structures require formalisation for
- better scientific understanding of lexica,
- modelling the mental lexicon,
- acquisition of lexical information in the creation of databases,
- semi-automatic production of dictionaries from databases.
- Lexical structures are:
- Macrostructure (usually consisting of table and tree structures)
- Mesostructure: generalisations over lexical items, in the form of
- relations between lexical items, defined in terms of different elements of microstructures,
- definitions of categories used in microstructures,
- concordance-like contexts in which tokens of lexical items occur.
- Microstructure:
- types of lexical information,
- structured as a vector (n-tuple) or as a tree of finite depth.
- A suitable formalism for formalising lexical structure is the attribute-value logic.
Modelling conventions
Before coding a formal lexicon, a set of modelling conventions is
required. Modelling conventions define
- a set of objects to be modelled,
- a set of relations between these objects,
- a function which interprets the formalism in terms of these objects.
The initial formulation of modelling conventions is usually fairly informal;
the conventions are then formalised at a later stage.
We will discuss two sets of modelling conventions:
- standard lexical database conventions,
- inheritance knowledge base conventions.
Lexical database modelling conventions
The simplest set of modelling conventions for a lexical database is:
- Macrostructure: table (i.e. a vector of rows), where
- each row represents a lexical entry,
- each column represents a type of lexical information.
- Mesostructure: none.
- Microstructure: row of fields in a table, each field representing a lexical information value of the type defined by the column in which it occurs.
Example:
| Orthography | Pronunciation | Category | Definition |
| poodle | pu:dl | N | a dog with a haircut |
| terrier | tEri@ | N | a dog like a shoebrush |
| bulldog | bUldOg | N | a dog with a squashed face |
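Under these conventions the table above can be stored as a plain tab-separated file and queried with standard UNIX tools; a minimal sketch (the file name is invented):

```shell
#!/bin/sh
# The example table as a tab-separated lexical database: the macrostructure is
# the table, the microstructure a row of fields per entry. Illustrative sketch.
printf 'poodle\tpu:dl\tN\ta dog with a haircut\n' > lexicon.tsv
printf 'terrier\ttEri@\tN\ta dog like a shoebrush\n' >> lexicon.tsv
printf 'bulldog\tbUldOg\tN\ta dog with a squashed face\n' >> lexicon.tsv

# Look up an entry by orthography (column 1); print pronunciation and definition.
awk -F'\t' '$1 == "poodle" { print $2 ": " $4 }' lexicon.tsv
```

awk's column model mirrors the row-of-fields microstructure directly: one column per type of lexical information.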
Starting with this very basic set of modelling conventions, further conventions
need to be introduced which account for:
- homography,
- homophony,
- homonymy,
- polysemy,
- semantic relations (hyponymy, synonymy, antonymy, ...),
- definition of categories,
- provision of examples.
We will deal with Shoebox, a software toolbox for lexicography in
field linguistics, as a representative of lexical databases.
Lexical inheritance network modelling conventions
The lexicon modelling conventions which I introduced in the early 1990s are
called the Integrated LEXicon (ILEX) modelling conventions.
The main features of the ILEX modelling conventions are:
- The macrostructure is a set of nodes, partitioned into entry nodes representing lexical entries, and class nodes, representing lexical classes or fields.
- The mesostructure is a pair of relations over these nodes, class inheritance and entry inheritance, represented as edges between nodes in a graph. The relations represent generalisations over lexical entries.
- The microstructures of nodes are attribute-value structures; values are concatenations (or other structure-creating operations) of atomic or node references. The attribute-value structures represent types of lexical information.
Coding
DATR is a flexible formalism which has been designed for coding the kinds of
attribute-value structures and other network structures such as automata which
are needed for encoding lexica according to logical and algebraic principles.
Check my DATR (including ZDATR) pages on the web.
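As a minimal illustration of these ideas, a hypothetical DATR fragment (a sketch in the style of such lexica, not taken from the ILEX or ZDATR distributions):

```
% Class node: generalisations inherited by the entries below.
Noun:
    <cat> == n
    <plur> == "<orth>" s.

% Entry node: inherits the empty path from Noun and adds its own information.
Poodle:
    <> == Noun
    <orth> == poodle.

% Derivable theorems:
%   Poodle:<cat>  = n.
%   Poodle:<plur> = poodle s.
```

The quoted path "<orth>" is evaluated at the query node, so the inherited plural rule combines with each entry's own orthography.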
D. Gibbon, Thu Apr 29 19:24:07 CEST 2004