What is a `corpus' in the context of language documentation? In general, a corpus contains the following components:
See the following handbook for detailed specifications of corpus design, construction and processing:
Gibbon, Dafydd, Roger Moore & Richard Winski, eds. (1997). Handbook of Standards and Resources for Spoken Language Systems. Berlin: Mouton de Gruyter.
The notion of `metadata' is new; for suitable characterisations, you will need to browse the web. Look up the TalkBank, ISLE and DOBES projects.
[ practical ]
What happens to the data when it has been appropriately annotated and archived? The main discipline concerned with linguistic data processing in this sense is corpus linguistics, which extracts information from corpora, usually with the aid of statistical methods. Other disciplines, in particular speech technology, also process speech corpora, often with similar methods.
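To make "extracting information with statistical methods" concrete, here is a minimal sketch in Python (not Perl, and not taken from any of the works cited in these notes). It computes a word frequency list, a type/token ratio, and the mean and standard deviation of a list of numbers. The sample sentence, the function names and the very crude tokenisation rule are all assumptions made for illustration; a real corpus would need far more careful tokenisation.

```python
import math
import re
from collections import Counter

# Hypothetical sample text; a real corpus would be read from archived files.
SAMPLE = "the cat sat on the mat and the dog sat on the log"


def tokenize(text):
    """Lowercase the text and return runs of the letters a-z as tokens.

    This is a deliberately crude assumption: it ignores digits,
    apostrophes, hyphens and all non-ASCII letters.
    """
    return re.findall(r"[a-z]+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def type_token_ratio(tokens):
    """Number of distinct word forms divided by number of running words."""
    if not tokens:
        raise ValueError("cannot compute a ratio for an empty token list")
    return len(set(tokens)) / len(tokens)


def mean_and_sd(values):
    """Return the mean and the population standard deviation of values."""
    if not values:
        raise ValueError("cannot compute statistics for an empty list")
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance)


tokens = tokenize(SAMPLE)
print(frequency_list(tokens)[:3])                # three most frequent words
print(type_token_ratio(tokens))                  # lexical variety measure
print(mean_and_sd([len(t) for t in tokens]))     # word length statistics
```

The standard deviation here divides by n (population form); for a sample drawn from a larger corpus, dividing by n - 1 is the usual alternative.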
Overviews of the methods and statistics of corpus linguistics can be found in the following books:
Barnbrook, Geoffrey (1996). Language and Computers. Edinburgh: Edinburgh University Press.
Butler, Christopher (1985). Statistics in Linguistics. Oxford: Basil Blackwell.
Manning, Christopher D. & Hinrich Schütze (1999). Foundations of Statistical Natural Language Processing. Cambridge, Mass.: The MIT Press.
McEnery, Tony & Andrew Wilson (1996). Corpus Linguistics. Edinburgh: Edinburgh University Press.
Oakes, Michael P. (1998). Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.
The following two Perl programs calculate some of the basic statistical functions used in corpus linguistics: