Next: Project: Production of language Up: 23 01 19 LINGUISTIK Previous: Fieldwork

Data archiving and processing

The notion of `Corpus Archive'

What is a `corpus' in the context of language documentation? In general, a corpus contains the following components:

  1. Set of signal recordings
  2. Set of transcriptions and time-stamped annotations (minimally: orthographic or phonemic, but possibly with morphological, syntactic, semantic categories) aligned with the signal recording times
  3. Corpus lexicon
  4. Corpus metadata
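
As an illustration only, the four components can be pictured as a small data structure. The following Python sketch is not taken from the Handbook or from any archive standard; the class names, field names and tier names are assumptions chosen for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One time-stamped label aligned with the signal (hypothetical layout)."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    tier: str     # e.g. "orthographic", "phonemic", "morphological"
    label: str


@dataclass
class Recording:
    """Component 1 (signal) together with component 2 (its annotations)."""
    signal_file: str
    annotations: list[Annotation] = field(default_factory=list)

    def tier(self, name: str) -> list[Annotation]:
        """Return the annotations of one tier, ordered by start time."""
        selected = [a for a in self.annotations if a.tier == name]
        return sorted(selected, key=lambda a: a.start)


@dataclass
class Corpus:
    recordings: list[Recording] = field(default_factory=list)
    lexicon: dict[str, dict] = field(default_factory=dict)   # component 3
    metadata: dict[str, str] = field(default_factory=dict)   # component 4


# Usage with invented values: one recording, two orthographic labels.
rec = Recording("session01.wav")
rec.annotations.append(Annotation(0.52, 0.91, "orthographic", "water"))
rec.annotations.append(Annotation(0.00, 0.52, "orthographic", "the"))
corpus = Corpus(recordings=[rec], metadata={"language": "unspecified"})
```

Keeping the time stamps on every annotation is what allows any tier to be related back to the signal and to the other tiers.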

See the following Handbook for a detailed specification of corpus design, construction and processing:

Gibbon, Dafydd, Roger Moore & Richard Winski, eds. (1997). Handbook of Standards and Resources for Spoken Language Systems. Berlin: Mouton de Gruyter.

The notion of `metadata' is relatively new; for suitable characterisations, you will need to browse the web, starting with the TalkBank, ISLE and DOBES projects.

Labelling, Annotation, Markup, ...

[ practical ]

Signal annotation

[ practical ]

Textual annotation

[ practical ]

Corpus Linguistics

What happens to the data when it has been appropriately annotated and archived? The main discipline concerned with linguistic data processing in this sense is corpus linguistics, which extracts information from corpora, usually with the aid of statistical methods. Other disciplines, in particular speech technology, also process speech corpora, often with similar methods.
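
A word frequency list is perhaps the simplest example of such information extraction. The following Python sketch is an illustration under simplifying assumptions: the tokeniser is naive (lower-cased runs of word characters), and the function names are invented for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into naive word tokens."""
    return re.findall(r"\w+", text.lower())


def frequency_list(tokens):
    """Return (word, count, relative frequency) triples, most frequent first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(word, n, n / total) for word, n in counts.most_common()]


# Usage with an invented two-sentence text.
for word, n, rel in frequency_list(tokenize("The cat saw the dog. The dog ran.")):
    print(f"{word}\t{n}\t{rel:.3f}")
```

Real transcriptions would need a tokeniser that respects the transcription conventions of the corpus, for instance phonemic symbols or annotation markup.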

Overviews of the methods and statistics of corpus linguistics can be found in the following books:

Barnbrook, Geoffrey (1996). Language and Computers. Edinburgh: Edinburgh University Press.

Butler, Christopher (1985). Statistics in Linguistics. Oxford: Basil Blackwell.

Manning, Christopher D. & Hinrich Schütze (1999). Foundations of Statistical Natural Language Processing. Cambridge, Mass.: The MIT Press.

McEnery, Tony & Andrew Wilson (1996). Corpus Linguistics. Edinburgh: Edinburgh University Press.

Oakes, Michael P. (1998). Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.

Basic statistical applications

The following two Perl programs calculate some of the basic statistical functions used in corpus linguistics:
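
The Perl programs themselves are not reproduced in this section. As a stand-in, and not as a reconstruction of them, the following Python sketch computes a few measures commonly treated as basic in the textbooks listed above: mean, sample variance, standard deviation and type-token ratio. The choice of measures and all function names are assumptions made for this example.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def sample_variance(values):
    """Sample variance (divisor n - 1); needs at least two values."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)


def standard_deviation(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)


# Usage with invented data: utterance lengths in words, and a short token list.
lengths = [1, 2, 3, 4, 5]
print(mean(lengths), sample_variance(lengths), standard_deviation(lengths))
print(type_token_ratio(["the", "cat", "the", "dog"]))
```

The sample variance divides by n - 1 rather than n, which is the usual choice when the corpus is treated as a sample from a larger population.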



Dafydd Gibbon, Thu Feb 15 15:07:15 MET 2001