Presented at ALLC/ACH, Göteborg, June 2004
Presentation: Poster in PDF format
The objective of this paper is the restructuring of an existing corpus of historical Portuguese for multiple purposes.
The Tycho Brahe Parsed Corpus of Historical Portuguese (*) includes 40 texts (1,851,619 words in total) written by Portuguese authors born between 1496 and 1845. The initial goal of this corpus was to provide annotated texts for research on language change in European Portuguese. First, a morphological coding system was developed to facilitate automatic searches for lexical items. This coding system provided a mapping between the needs of linguistic description and the computational demands of an automatic morphological tagger (cf. BRITTO et al., 1999; GALVES & BRITTO 2003; FINGER, 1998). Another objective of this first system was to provide input for the following stage, syntactic annotation (BRITTO, 2002).
The texts in the corpus are made available in three formats:
The transcribed texts were prepared according to the requirements of part-of-speech tagging and syntactic annotation; that is, they were made machine-processable. Processability depended on normalizing the format of the original material, which included correcting typographical errors, resolving character encoding problems and converting the varying historical orthography into standard (modern) orthography. The original forms were preserved and marked up in this process; the modifications, unfortunately, were included without special markup. This method was adopted because the purpose of this first preparation was, crucially, to make the texts suitable for automated annotation, which would be applicable only to the modified material.
However, as the corpus became available to a wider range of researchers, other uses of the data became apparent. Many of the included texts are available only with restricted access or at a limited number of libraries (some sources are originals, manuscripts or single preserved copies), which made them highly attractive to historians and other researchers in the humanities. These purposes bring different requirements, such as readability, preservation of the original data and text design structures, and easy access; on the other hand, linguistic markup such as POS tags turns out to be irrelevant. Another field that opened up was lexicography: the rich corpus material was used for the creation of a lexicon of historical Portuguese, mining the information on the modifications made for automatic syntactic tagging.
All this indicated that the ideal structure of the corpus would be one that integrated different versions of the same original material. Remodelling the original corpus into such a structure is the aim of the present work.
Using the sources for all these purposes, and possibly more (typological analysis; philological, philosophical and philanthropic research; lexicography), requires that all available information on a source be included in one single source document. The single-source approach was chosen because it makes the resource easy to maintain (some of the texts were OCRed, with all the limitations and possible errors this entails) and easy to distribute in different output formats. What is more, the problems of property rights and archiving need to be solved only once, and the original source can be better preserved if researchers find the necessary information already in an electronic version.
Therefore, all available information needs to be made explicit and well structured in a machine-readable format. This enables automatic consistency checks of the resource, which uncover annotation errors or systematic problems, and machine processing that transforms the resource into the data format required for each intended purpose, such as output formatted for readability or output formatted for automatic tagging.
All this had to be done with a minimum of information loss, if any, and with the preservation of the original data. Additionally, inconsistencies needed to be editable so that they could be corrected for later purposes.
A simple strategy was taken for this purpose:
The method of choice for single-source representation is, of course, XML markup (BRAY, 2000): a document grammar such as a DTD or Schema is created, and the document is transformed into an XML document that can be validated against this grammar. In the source documents, only the original spelling variants were marked up, not the modernized versions. To preserve all available information and to mark up the information on the modernization of the orthography, a simple parser was implemented in Perl.
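The kind of automatic consistency check that a document grammar makes possible can be illustrated with a minimal sketch. It is not the project's DTD or its Perl parser: the element names (`text`, `p`, `var`, `orig`, `reg`, `lb`, `pb`) and the content model are assumptions made for illustration, and a small Python table stands in for real DTD or Schema validation, which the Python standard library does not provide.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model (an assumption, not the project's DTD):
# for each element, the set of child elements it may contain.
ALLOWED_CHILDREN = {
    "text": {"p"},
    "p": {"var", "lb", "pb"},
    "var": {"orig", "reg"},
    "orig": set(),
    "reg": set(),
    "lb": set(),
    "pb": set(),
}


def check(xml_string):
    """Return a list of problems found; an empty list means none were found."""
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]

    problems = []
    for elem in root.iter():
        allowed = ALLOWED_CHILDREN.get(elem.tag)
        if allowed is None:
            problems.append(f"unknown element <{elem.tag}>")
            continue
        for child in elem:
            if child.tag not in allowed:
                problems.append(f"<{child.tag}> not allowed inside <{elem.tag}>")
        # A variation group is only useful if both versions are present.
        if elem.tag == "var" and sorted(c.tag for c in elem) != ["orig", "reg"]:
            problems.append("<var> must contain exactly one <orig> and one <reg>")
    return problems


if __name__ == "__main__":
    good = "<text><p>Esta <var><orig>he</orig><reg>é</reg></var></p></text>"
    bad = "<text><p>Esta <var><orig>he</orig></var></p></text>"
    print(check(good))
    print(check(bad))
```

A check of this kind reports annotation errors one by one, so that they can be corrected in the single source document rather than in each derived output.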
Line breaks and page breaks, as well as other typographic variations, were marked up in a way similar to the TEI P4 standard.
The strategy for marking up editions was adopted from TEI P4 (see Section 6.5.2, Regularization and Normalization), with the exception that the original and the regularized version were both put into elements instead of attributes, and grouped in a further element for the variation. The reason for this was to express the comparable status of the regularized and original versions, and ease of implementation. Nevertheless, TEI-conformant markup could be produced with a simple transformation.
The metadata that was already available in the source documents was easy to categorize using the Dublin Core metadata set (see DUBLINCORE 2003); therefore this simple structure was used instead of a standard TEI header. A mapping between the TEI header and various other metadata standards is described in TRIPPEL and BAUMANN, 2003.
Once the data was available in this normalized XML format, XSL transformations (CLARK, 1999) into HTML could be used to create:
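The element-based grouping described above can be sketched as follows. The element names `var`, `orig` and `reg`, and the sample sentence, are invented for illustration and are not taken from the corpus or its document grammar. The sketch shows how one source yields either the original or the regularized reading, and how the group could be rewritten into an attribute-based form, assumed here to be `<orig reg="...">`.

```python
import xml.etree.ElementTree as ET

# Invented sample; <var> groups one <orig> and one <reg> (hypothetical names).
SAMPLE = (
    "<p>Esta <var><orig>he</orig><reg>é</reg></var> "
    "<var><orig>huma</orig><reg>uma</reg></var> carta.</p>"
)


def render(elem, version):
    """Return the text of elem, picking 'orig' or 'reg' inside every <var>."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "var":
            chosen = child.find(version)
            if chosen is not None:
                parts.append(render(chosen, version))
        else:
            parts.append(render(child, version))
        parts.append(child.tail or "")
    return "".join(parts)


def to_attribute_style(root):
    """Replace each <var> group by <orig reg="regularized">original</orig>."""
    for parent in list(root.iter()):
        for i, child in enumerate(list(parent)):
            if child.tag == "var":
                new = ET.Element("orig", {"reg": child.findtext("reg", "")})
                new.text = child.findtext("orig", "")
                new.tail = child.tail  # keep the text that followed the group
                parent[i] = new
    return root


if __name__ == "__main__":
    print(render(ET.fromstring(SAMPLE), "orig"))
    print(render(ET.fromstring(SAMPLE), "reg"))
    converted = to_attribute_style(ET.fromstring(SAMPLE))
    print(ET.tostring(converted, encoding="unicode"))
```

Because both versions sit side by side as sibling elements, neither is privileged: selecting one reading or the other is the same operation with a different element name.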
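As a rough illustration of how flat such a Dublin Core description is, the sketch below builds a record from simple name/value pairs. The wrapper element `metadata` and all field values are placeholders assumed for the example, not data from the corpus; only the fifteen element names and the namespace come from the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The fifteen elements of the simple Dublin Core set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dc_record(fields):
    """Build a <metadata> element (hypothetical wrapper) with dc:* children."""
    record = ET.Element("metadata")
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


if __name__ == "__main__":
    # Placeholder values only; they do not describe an actual corpus text.
    example = dc_record({
        "title": "PLACEHOLDER TITLE",
        "creator": "PLACEHOLDER AUTHOR",
        "date": "PLACEHOLDER DATE",
        "language": "pt",
    })
    print(ET.tostring(example, encoding="unicode"))
```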
The re-structured corpus will be available via a web interface soon, as well as a concordance for interactive search within the data.
This work was made possible through funding by:
The Tycho Brahe Parsed Corpus of Historical Portuguese is available to scholars without fee for educational and research purposes at http://www.ime.usp.br/~tycho/corpus.
The annotation scheme of the Tycho Brahe Parsed Corpus of Historical Portuguese was designed by a team led by Helena Britto and Charlotte Galves at the University of Campinas (Instituto dos Estudos da Linguagem, Universidade de Campinas IEL, UNICAMP); it is strongly inspired by the scheme designed by Anthony Kroch and Ann Taylor for the Penn-Helsinki Parsed Corpus of Middle English (http://www.ling.upenn.edu/mideng/).
Philological consulting was provided by Ivo Castro and Ana Maria Martins from the Classical University of Lisbon. The Institute of Mathematics and Statistics of the University of São Paulo (IME-USP) provides support and computational resources; the part-of-speech tagger used for the automatic annotation of the corpus was implemented by Marcelo Finger.
The construction of the corpus is part of the project Rhythmic patterns, parameter setting and language change, coordinated by Charlotte Galves.