| Encyclopaedia of Ivory Coast Languages | This report as a PostScript-document |
Thorsten Trippel, Nils Jahn, Dafydd Gibbon (U Bielefeld)
Soma Ouattara (U Cocody, Abidjan)
Status deliverable
printed February 5, 2001
The Portable Audio Concordance (PAC) is a basic tool required for spoken language lexicography. The present tool is tailored to the needs of training local corpus analysts and lexicographers concerned with the documentation of (endangered) languages in their own countries using low-end hardware.
The tool is specified as a deliverable in the DAAD project Encyclopaedia of languages of the Ivory Coast in close connection with the DOBES project Ega: a documentation model for an endangered Ivorian language.
It is anticipated that this tool will be used in a hybrid lingware development environment together with tools such as Praat or Transcriber and Shoebox. Compatibility with other tools is ensured by providing ASCII interfaces, in particular transcriptions in X-SAMPA and with XML markup, and by training local computer science personnel in automatic text processing techniques in order to interface tools in a hybrid environment.
The educational and local involvement aim has priority: the tool is not intended to be an industry standard implementation, but may of course be used as a specification and proof of concept by other implementers.
We present a requirements specification, a system design specification, a project design specification, an implementation description, and an outline of the planned evaluation, distribution and maintenance policies. Object specifications in a standard format, and source code of a proof-of-concept implementation in Perl are included in the Appendices.
Perl is used as the language for software development; it was selected because of built-in efficient regular expression processing capability, seamless integration with CGI and other interactive interfaces, and the portability of the code from UNIX/Linux to Windows and Macintosh environments, enabling rapid prototyping and efficient cyclical development.
The overall goal of the project was to present in relatively informal outline form
for a portable audio concordance for use in lexicography within the framework of the documentation of Ivorian languages. This does not exclude utility in other contexts, but the present minimal specification is specifically formulated to ensure efficient initial language documentation. By `Portable Audio Concordance' (PAC) we mean concordance shell software which can be used on as many platforms as possible, in particular in an offline laptop environment in the field or with low-end or older hardware.
The specific goals are
We proceed by specifying requirements in terms of first user groups, second application areas, third user needs.
The potential user groups for an audio concordance include:
The main intended application of PAC is in the lexicography of languages, including endangered languages. Consequently, the main consideration is ergonomic use by the language documenter rather than by other users:
Our implementation perspective leans strongly towards practice in computational linguistics, and involves
The present tool is straightforward and does not involve essentially new concepts; it takes up ideas and experience from the VerbMobil and EAGLES projects and applies them to the specific area of the documentation of languages.
The further technical requirements for PAC are summarised informally as follows:
A specific lingware requirement is that PAC should be interfaced with
Interfaces to standard ASCII based formats for text components of corpora and lexica need to be specified (e.g. widely used UNIX databases, archives with ASCII markup such as HTML, XML, Shoebox files).
By `corpus' we mean a set of related primary language data sources, following EAGLES recommendations (cf. Gibbon et al. (1997)), including the following minimum
The following options may also be included:
Finally, the audio concordance should be as language independent as possible. This means in particular that it should take into account more than one language and must be extensible to other languages as long as the data are available in some standard way. This requirement applies to the proof-of-concept implementation.
In this design specification we outline both system design and project design.
We define a Concordance System from a declarative point of view as a pair of functions
The acquisition function maps a corpus into a concordance consisting of a set of pairs of keyword and keyword-in-context set:
where
The acquisition function creates a list of keys from a given, possibly marked-up, text. This list of keys is used as the set of access criteria to the contexts of these keys, i.e. the Key Words In Context (KWIC).
The consultation function maps a pair of keys (often just one) and a corpus into a keyword-in-context concordance:
where
The main dependencies involving both functions are illustrated in condensed form in Figure 1.

Figure 1: Main concordance dependencies.
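The two functions can be illustrated with a minimal Perl sketch. It is not the project code: the corpus is a hypothetical in-memory hash of sentence identifiers and transcriptions, and the names `acquire` and `consult` are chosen here for illustration only.

```perl
use strict;
use warnings;

# Hypothetical in-memory corpus: sentence id => transcription (not project data).
my %corpus = (
    s1 => 'a b a',
    s2 => 'b c',
);

# Acquisition: map a corpus to a concordance, i.e. keyword => list of
# sentence ids (the contexts) in which the keyword occurs.
sub acquire {
    my ($corpus) = @_;
    my %concordance;
    for my $id (sort keys %$corpus) {
        my %seen;    # record each keyword at most once per sentence
        for my $word (split ' ', $corpus->{$id}) {
            push @{ $concordance{$word} }, $id unless $seen{$word}++;
        }
    }
    return \%concordance;
}

# Consultation: map a keyword and a corpus to its keyword-in-context set,
# returned as a list of [sentence id, sentence] pairs.
sub consult {
    my ($keyword, $corpus) = @_;
    my $ids = acquire($corpus)->{$keyword} || [];
    return map { [ $_, $corpus->{$_} ] } @$ids;
}

my @hits = consult('b', \%corpus);
print "$_->[0]: $_->[1]\n" for @hits;
```

In this sketch the consultation function simply recomputes the concordance on each call, which mirrors the dynamic concordance; a static concordance would store the result of `acquire` instead.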
Two different types of electronic concordances were taken into consideration from a procedural point of view:

Figure 2: System modules and interface types.
We note that there is a logical dependency between static and dynamic concordances: a static concordance is a subset of the output of a dynamic concordance. Consequently, PAC design starts with the dynamic concordance. The modules of the dynamic concordance are shown in Figure 2 and the architecture of a system designed to realise these functions is shown in Figure 3.

Figure 3: Text handling dataflow.
The hyperconcordance which is the output of the software should be in a browser-accessible format. A user interface event should result in a list of occurrences of the keyword in context and an audio rendering of the context, including the keyword.
It is intended to use the concordance as a source of contextual lexical information within a lexicon, a lexicon as described by Adouakou Schulte (2001). Further information on a microstructure for a suitable lexicon can be found in Gibbon (2001).
The lexicon microstructure required for Ivorian tone languages, putatively with vowel harmony, consists of the following items:
For the concordance a simpler microstructure subset is used:

Figure 4: Tree hierarchy for concordance tree tagging
The tagging hierarchy for use with the concordance is shown in Figure 4.
The system should require the following files:
Required scripts/modules are:
Generated, static files include:
Generated, not static files include:
Three user interfaces are being included:
In order to test the functionality, the PAC system was subjected to initial informal evaluation using two different Ivorian languages: Koulango (Gur/Senoufo), Anyi (Kwa/Tano). In the meantime this has been extended to the Ega language.

Figure 5: Time management bar chart.
| Task | Who |
| Specification and design for an audio-concordance | Trippel, Ouattara, Jahn |
| Design and definition of markup | Trippel, Ouattara, Schulte |
| Coordination, collation | Trippel, Schulte, Adouakou |
| Evaluation | Trippel, Adouakou |
| Module definition | Trippel, Ouattara, Jahn |
| ASCII to markup converter | Trippel, Ouattara, Jahn |
| Search function | Ouattara, Jahn |
| User interface design | Trippel, Jahn |
| CD-Rom production | Trippel, Adouakou |
The following main tasks have been defined:
The tasks are coordinated closely (and some shared) with the VW project EGA: A Documentation Model for an Endangered Ivorian Language until the end of this project.
The basic conditions for implementation are as follows:
The design of the `container tree' of elements and tag types was specified above (cf. Figure 4): a text element is the container element for sentences, which are in turn container elements for words and sentence-end punctuation such as periods, question marks, exclamation marks. These are included because they have a semantic function for orthographic texts. Tone-language-specific prosodic markup will be included at a later stage.
The DTD is deliberately minimal and subject to revision with respect to the distinction between elements and attribute-value pairs in consultation with other teams.
<!-- DTD for the concordance markup Concordance Tree Tagging (CTT) -->
<!-- Developed 2000 by Thorsten Trippel, Soma Ouattara, Nils Jahn at the University of Bielefeld, Germany -->

<!-- root element is ctt -->
<!ELEMENT ctt - - (concinf, conctext)>

<!-- head element with general information -->
<!ELEMENT concinf - - (title, author, date, language?, changes*)>
<!ELEMENT title - - (#PCDATA)>
<!ELEMENT author - - (#PCDATA)>
<!ELEMENT date - - (#PCDATA)>
<!ELEMENT language - - (#PCDATA)>
<!ELEMENT changes - - (#PCDATA)>

<!-- body element with marked up text -->
<!ELEMENT conctext - - (concsentence)* >

<!-- Element to tag single sentences with id attributes -->
<!ELEMENT concsentence - - ((concword+),concsentend) >
<!ATTLIST concsentence sentencenumber ID #REQUIRED>

<!-- Element to tag single words with id attributes -->
<!ELEMENT concword - - (#PCDATA)>
<!ATTLIST concword wordnumber ID #REQUIRED>

<!-- Element to tag sentence end punctuation such as . ! ? -->
<!ELEMENT concsentend - - (#PCDATA)>
The following sample text illustrates the CTT format:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE ctt public "-//UBI//DTD CONCORDANCE 0.1a//EN" >
<ctt>
<concinf>
<title>Testtext</title>
<author>Trippel</author>
<date>08 Oct 2000</date>
<language>English</language>
<changes>08 Oct 2000</changes>
</concinf>
<conctext>
<concsentence sentencenumber="sentence1">
<concword wordnumber="word1">This</concword>
<concword wordnumber="word2">is </concword>
<concword wordnumber="word3">sentence</concword>
<concword wordnumber="word4">1</concword>
<concsentend>.</concsentend>
</concsentence>
</conctext>
</ctt>
The normalisation function converts a SAMPA text into a marked-up text. Every sentence receives a unique identification number. Within the sentences each word receives an identification which is composed of the number of the current sentence and the number of its position in the sentence. The SAMPA text does not contain any punctuation and the end of a sentence is marked up with the line-feed symbol. The normalisation function will be invoked once per text.
Table 2: Pseudocode for normalisation function.
The normalisation function (cf. Table 2) expects two arguments: the name of the input file and the name of the output file. Each line read from the input file is stored in a string variable. The line is then split into an array, and two integer variables are used as index variables. The first index is the number of the current sentence and the second the number of the current word in the sentence.
The following algorithm first opens the input and the output file. The first thing which will be written into the output file is the markup header, i.e. root element and information about title, author, and date of the text. The two index variables are initialised and the input file is read line by line. Each line is split into an array and the sentence and the words of the sentence are marked up with tags and also receive identification numbers which are provided by the index variables sentence-number and word-index. When the end of the input file is reached the end tags are written into the output file and the files are closed.
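The core of this algorithm can be sketched in a few lines of Perl. The sketch below is an illustration under stated assumptions, not the Appendix B code: it works on a string instead of files, omits the markup header, follows the identifier style of the sample text for sentences (`sentence1`), and numbers words as sentence number followed by position. The helper names `xml_escape` and `normalise` are hypothetical. Escaping is included because X-SAMPA uses characters such as `&` and `<` which are special in XML.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML character data.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Turn line-separated SAMPA text into the CTT body (hypothetical helper;
# the header with title, author and date is omitted here).
sub normalise {
    my ($text) = @_;
    my $sentence = 0;                      # number of the current sentence
    my $out = "<conctext>\n";
    for my $line (split /\n/, $text) {
        my @words = split ' ', $line;
        next unless @words;                # skip empty lines
        $sentence++;
        $out .= qq{<concsentence sentencenumber="sentence$sentence">\n};
        my $pos = 0;                       # position of the word in the sentence
        for my $word (@words) {
            $pos++;
            $out .= sprintf qq{<concword wordnumber="word%d.%d">%s</concword>\n},
                $sentence, $pos, xml_escape($word);
        }
        $out .= "</concsentence>\n";
    }
    return $out . "</conctext>\n";
}

print normalise("a b\n\nc\n");
```

Only non-empty lines advance the sentence counter, so the sentence numbers in the output are consecutive even if the input contains blank lines.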
For the provisional proof-of-concept code see Appendix B.
normalisation(input, output)
array
Table 3: Pseudocode for keyword extraction function.
The acquisition module creates a list of the keywords which occur in the input texts. The list is alphabetically ordered and does not contain any duplicates. The list is in simple ASCII format, with the words separated by line-feed symbols.
The acquisition module searches the data directory and processes each file in this directory. The module first stores the files of the mentioned directory in an array. Then a file is opened, its contents are read line by line and the extracted words are stored in an array. After that the file is closed. This procedure is repeated until all files in the directory have been processed.
The array which contains the words of all files is sorted alphabetically. After that the first element of the array is copied into a second array. The module checks whether the next element is equal to the one just copied. If it is equal, the module moves on to the following element; if it is not equal, the element is copied to the second array. This is repeated until each element of the first array has been checked, so that the second array consists of unique words. The second array is joined into a string variable, with the words separated by new lines, and stored in the target file.
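The same result can be obtained more compactly with a hash, which is a common Perl idiom for removing duplicates. The following sketch is an alternative to the sort-and-compare procedure described above, not the Appendix D code; the function names `extract_words` and `keyword_list` and the sample markup are illustrative assumptions.

```perl
use strict;
use warnings;

# Collect the content of every <concword> element in a CTT string.
# Must be called in list context.
sub extract_words {
    my ($ctt) = @_;
    my @words = $ctt =~ m{<concword\b[^>]*>\s*(\S+?)\s*</concword>}g;
    return @words;
}

# Return the alphabetically sorted list of distinct keywords.
sub keyword_list {
    my %seen;
    my @unique = grep { !$seen{$_}++ } @_;   # keep first occurrence only
    my @sorted = sort @unique;
    return @sorted;
}

my $ctt = join "\n",
    '<concword wordnumber="word1">b</concword>',
    '<concword wordnumber="word2">a </concword>',
    '<concword wordnumber="word3">b</concword>';

# One keyword per line, as in the word list file.
print join("\n", keyword_list(extract_words($ctt))), "\n";
```

Removing duplicates before sorting keeps the array to be sorted as small as possible, which may matter on low-end hardware with larger corpora.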
For the provisional proof-of-concept code see Appendix D.
The consultation module searches a given text for a specific keyword. If the keyword is found, all sentences in which it occurs are shown. If there are no matches, the module returns a message saying that no matches were found. It is also possible to search more than one text at the same time; the results are then sorted according to the source texts.
The consultation module expects as arguments a regular expression (the keyword to be searched for), the output file where the results will be stored, and a list of input files, which can be as long as necessary.
The module first checks whether the number of arguments is less than 3. In that case it terminates with an error message which tells the user which arguments are expected. Otherwise the output file and the first input file are opened. The whole input file is read into one variable. This variable is searched for the sentence markup pattern, and the line number and the sentence are stored in variables.
The variable which contains the sentence is then searched for the keyword; if the keyword is found, the whole sentence is written to the output file together with its line number. This is repeated until the whole input file has been checked, and then the input file is closed. The next file from the input list is processed in the same manner, until the whole list has been processed. At the end the module checks whether there were any matches; if not, a message is printed to the screen. Finally the output file is closed.
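The search step can be sketched as a function over a CTT string. This is a simplified illustration, not the Appendix code: it returns the hits as a list instead of writing them to a file, accepts any sentence identifier, and the name `consult_ctt` and the sample data are assumptions made for the example.

```perl
use strict;
use warnings;

# Return [sentence id, plain sentence text] for every sentence of a CTT
# string that contains a word matching $pattern as a whole word.
sub consult_ctt {
    my ($pattern, $ctt) = @_;
    my @hits;
    while ($ctt =~ m{<concsentence sentencenumber="([^"]+)">(.*?)</concsentence>}gs) {
        my ($id, $body) = ($1, $2);          # copy before the next match
        my @words = $body =~ m{<concword\b[^>]*>\s*(\S+?)\s*</concword>}g;
        next unless grep { /^(?:$pattern)$/ } @words;
        push @hits, [ $id, join(' ', @words) ];
    }
    return @hits;
}

my $ctt = <<'END';
<concsentence sentencenumber="1">
<concword wordnumber="word1.1">a</concword>
<concword wordnumber="word2.1">ba</concword>
</concsentence>
<concsentence sentencenumber="2">
<concword wordnumber="word1.2">c</concword>
</concsentence>
END

my @hits = consult_ctt('b.*', $ctt);
if (@hits) { print "sentence $_->[0]: $_->[1]\n" for @hits; }
else       { print "No matches found\n"; }
```

Because the keyword is interpreted as a regular expression, a literal keyword containing characters such as `?` or `{` (both of which occur in X-SAMPA) would have to be quoted, for instance with `quotemeta`, before it is passed in.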
consultation(keyword,outputfile,list-of-inputfiles)
openfile(outputfile)
found
Table 4: Pseudocode for consultation function.
The TK-interface is an interface between the user and the audio concordance program package, namely the acquisition module, which creates keywords from given input texts, and the consultation function, which searches for the required words in a corpus. The user can select a language. All files which contain data in this language are displayed, as well as the keywords which occur in the corpus data. The user can select a keyword and a text, and each sentence containing the selected word is displayed. If desired it is also possible to listen to the audio data of a chosen sentence.
The TK-interface consists of two Perl-TK-modules. In the first module the user chooses a language for which the audio concordance will display data. The first module invokes the second module, which is the interface for the audio concordance itself.
The interface is a layer which accepts user input and interacts with the acquisition module and the consultation module.
When the second module is invoked it executes the acquisition module and reads in the file which contains the keywords.
All keywords are displayed in a listbox and the text data files appear in a second listbox. It is not possible to select more than one word or more than one text at a time. Each selection can be cancelled via the deselect word or deselect text button. The reset button cancels all the choices selected. The search button executes the consultation function; a new window is then opened in which all the hits for one text are displayed in a listbox. While the result window is open the search button is disabled. The user can select one hit at a time and click the play button, and the wav-file belonging to this sentence will be played. When the close button is clicked the window is closed and the search button is enabled again.
Module 1 (langgui)
define main-interface
language-listbox
Table 5: Pseudocode for TK-interface.

Figure 6: Basic user interface forms.
The consultation function is given a CGI front end for user access. After a language has been selected from a pick list on the introductory page (see figure 6), two language-specific pick lists enable users to select a word and the corpus from which contexts are to be taken. It is also possible to select all corpora at the same time.
After the consultation request, a list of occurrences with accompanying line numbers and contexts is returned. All of this is generated on the fly.
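Generating the result list on the fly amounts to turning each hit into a piece of HTML at request time. The sketch below shows one way this could be done; it is not the project's CGI script, the function names are hypothetical, and the audio file path in the example is an invented placeholder.

```perl
use strict;
use warnings;

# Escape text for safe inclusion in HTML content and attribute values.
sub html_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Render hits as an HTML list. Each hit is [line number, context, audio URL];
# the audio URL scheme is a hypothetical placeholder.
sub render_hits {
    my (@hits) = @_;
    return "<p>No matches found.</p>\n" unless @hits;
    my $html = "<ul>\n";
    for my $hit (@hits) {
        my ($line, $context, $audio) = map { html_escape($_) } @$hit;
        $html .= qq{<li>line $line: $context <a href="$audio">play</a></li>\n};
    }
    return $html . "</ul>\n";
}

print render_hits([ 3, 'a < b', 'audio/text1_3.wav' ]);
```

Escaping the contexts matters here because SAMPA transcriptions may contain characters which a browser would otherwise interpret as markup.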
Three user interfaces are being incorporated; design and implementation have been tested for command line access, web access and standalone TK-widget programs. The current implementation of the graphical user interface forms for web access is shown in figure 6.
For provisional proof-of-concept code see Appendix F.
The testing programme follows EAGLES recommendations for language and speech technology (Gibbon et al. (2000)) and involves
The software and documentation are distributed continuously within the cooperation between Bielefeld and Abidjan.
Initial object specifications follow for
PAC - Portable Audio Concordance
PAC is designed to aid corpus lexicographers with low-technology equipment
in the documentation of undocumented languages.
The present documentation covers an initial specification and
proof-of-concept implementation.
PAC has the following functionality:
Input: phrase-chunked speech files, (multi-tier) SAMPA annotations
Intermediate: keywords, normalised ASCII SAMPA text with XML markup
Output: Hypertext (XML/HTML) formatted text/audio query results
Text normalisation module
The module maps
The marked-up text serves as the input to the KWIC (KeyWord in Context)
search function.
Input: time-stamped phrase-chunked X-SAMPA transcription
Intermediate: --
Output: XML marked up normalised X-SAMPA transcription
Keyword extraction and formatting module
The keyword extraction module maps a normalised text into a set of keywords
for mapping to a lexicon and for consultation queries to the KWIC
consultation function.
The keywords are also intended to be generalised to regular expressions
for generalised search.
Input: XML Concordance Tree Tagging
Intermediate: Line separated ASCII SAMPA formatted keyword set
Output: XML marked up picklist widget
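How a keyword might be generalised to a regular expression is sketched below. The convention that `*` stands for any run of non-space characters is an assumption made for this example, as is the function name `keyword_to_regex`; all other characters are quoted with `quotemeta` so that SAMPA symbols such as `?` or `{` are matched literally.

```perl
use strict;
use warnings;

# Turn a keyword into an anchored regular expression. Every character is
# matched literally except '*', which (as an assumed convention) stands for
# any run of non-space characters.
sub keyword_to_regex {
    my ($keyword) = @_;
    # The limit -1 keeps empty trailing fields, so 'ab*' becomes ab\S*.
    my $pattern = join '\S*', map { quotemeta($_) } split /\*/, $keyword, -1;
    return qr/^(?:$pattern)\z/;
}

my $re = keyword_to_regex('k*lo');
print "match\n" if 'kulo' =~ $re;
```

A keyword without `*` yields a pattern which matches exactly that keyword, so the same function can serve both literal and generalised queries.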
Consultation query and response module
The consultation query module maps a keyword (set) and a normalised text
with Concordance Tree Tagging XML markup into a set of transcription and
audio context pairs in which the keyword occurs.
Input: XML formatted keyword list and XML formatted text
Intermediate: search optimised text format
Output: GUI formatted KWIC keyword-textcontext-audiocontext triples
User interface module
Four user interfaces are provided:
All (G)UI functions have service calls to all other modules.
Input: Outputs of all other modules
Intermediate: --
Output: Events to trigger all other modules; (G)UI forms
#!/vol/bin/perl -w
# normalisation.pl
# version: 0.9b
# N. Jahn, S. Ouattara, T. Trippel
# November 2000, University of Bielefeld, Germany
# [jahn,soma,ttrippel]@spectrum.uni-bielefeld.de
# Functionality: for a given line it enumerates the line,
# breaks it into words, gives every word a unique identifier.
# Syntax: normalisation.pl <INFILE> <OUTFILE>
# Additional information: the user will be prompted for
# title, author, date of and changes to the document
($input, $output) = @ARGV ; #store the arguments in $input and $output
if ($#ARGV < 1) { #if there are less than 2 arguments, tell the user and exit program
print "usage: normalisation.pl <input> <output>\n" ;
exit ;
}
open (IN, "< $input") #open $input for reading access
or
die "\n Input file couldn't be opened!!\n" ;
open (OUT, "> $output") #open $output for writing access
or
die "\n Output file couldn't be created!!\n" ;
print "Please give the title: \n";
chomp($title = <STDIN>);
print "Please give the authors name: \n" ;
chomp($author =<STDIN>);
print "Please give the date when the text was created:\n";
chomp($date = <STDIN>);
print "Please give the language of the text:\n";
chomp($language = <STDIN>);
print "Please give the changes to the text:\n";
chomp($changes = <STDIN>);
print OUT qq%<?xml version="1.0" standalone="no"?>\n% ; #print document markups into $output
print OUT "<!DOCTYPE ctt public \"-//UBI//DTD CONCORDANCE 0.1a//EN\" >\n\n";
print OUT "<ctt>\n\n" ;
print OUT "<concinf>\n";
print OUT "<title>$title</title>\n";
print OUT "<author>$author</author>\n";
print OUT "<date>$date</date>\n";
print OUT "<language>$language</language>\n";
print OUT "<changes>$changes</changes>\n";
print OUT "</concinf>\n\n";
print OUT "<conctext>\n";
$count = 1 ; # position of the current word within its sentence
$line = 1 ; # number of the current sentence
while(<IN>) {
    chomp ; # removes the trailing line feed, if there is one
    @array = split(" ", $_) ; # splits the line into an array of words
    if (@array) { # skips empty lines; one-word sentences are kept
        print OUT "<concsentence sentencenumber=\"$line\">\n" ;
        foreach $word (@array) { # for each word of the current line do
            print OUT "<concword wordnumber=\"word$count.$line\">$word<\/concword>\n";
            $count++ ;
        }
        print OUT "<\/concsentence>\n" ;
        $line++ ; # only non-empty lines advance the sentence number
    }
    $count = 1 ;
}
print OUT "<\/conctext>\n" ; # closes the element opened before the loop
print OUT "<\/ctt>\n" ;
close(IN) ;
close(OUT);
#!/vol/bin/perl
use Tk ;
sub openError {
$nf = MainWindow->new() ;
$nf->Frame(-label => "\nNo input or output file specified!\n")->pack() ;
$nf->Button(-text => "ok", -command =>sub {$nf->withdraw()})->pack() ;
}
sub norm {
$mw = shift ;
$input = $e5->get() ;
$output = $e6->get() ;
$title = $e1->get() ;
$author = $e2->get() ;
$date = $e3->get() ;
$changes = $e4->get() ;
if ($input eq "" or $output eq "") { # an empty entry field means no file was named
    openError() ;
    return ;
}
open (IN, "< $input")
or
die "Input file couldn't be opened!\n" ;
open (OUT, "> $output")
or
die "Output file couldn't be opened!\n" ;
print OUT qq%<?xml version="1.0" standalone="no"?>\n% ;
print OUT "<!DOCTYPE ctt public \"-//UBI//DTD CONCORDANCE 0.1a//EN\" >\n\n";
print OUT "<ctt>\n\n" ;
print OUT "<concinf>\n";
print OUT "<title>$title</title>\n";
print OUT "<author>$author</author>\n";
print OUT "<date>$date</date>\n";
print OUT "<changes>$changes</changes>\n";
print OUT "</concinf>\n\n";
print OUT "<conctext>\n";
$count = 1 ; # position of the current word within its sentence
$line = 1 ; # number of the current sentence
while(<IN>) {
    chomp ; # removes the trailing line feed, if there is one
    @array = split(" ", $_) ; # splits the line into an array of words
    if (@array) { # skips empty lines; one-word sentences are kept
        print OUT "<concsentence sentencenumber=\"$line\">\n" ;
        foreach $word (@array) { # for each word of the current line do
            print OUT "<concword wordnumber=\"word$count.$line\">$word<\/concword>\n";
            $count++ ;
        }
        print OUT "<\/concsentence>\n" ;
        $line++ ; # only non-empty lines advance the sentence number
    }
    $count = 1 ;
}
print OUT "<\/conctext>\n" ; # closes the element opened before the loop
print OUT "<\/ctt>\n" ;
close(IN) ;
close(OUT);
exit ;
}
$mw = MainWindow->new() ;
$mw->Label(-text => "Title")->pack() ;
$e1 = $mw->Entry()->pack() ;
$mw->Label(-text => "Author")->pack() ;
$e2 = $mw->Entry()->pack() ;
$mw->Label(-text => "Date")->pack() ;
$e3 = $mw->Entry()->pack() ;
$mw->Label(-text => "Last changes")->pack() ;
$e4 = $mw->Entry()->pack() ;
$mw->Label(-text => "")->pack() ;
$mw->Label(-text => "Input File")->pack() ;
$e5 = $mw->Entry()->pack() ;
$mw->Label(-text => "Output File")->pack() ;
$e6 = $mw->Entry()->pack() ;
$mw->Label(-text => "")->pack() ;
$mw->Button(-text => "Normalise Text", -command => \&norm)->pack() ;
$mw->Button(-text => "Quit", -command => sub {exit})->pack() ;
MainLoop;
#!/vol/bin/perl
#authors Jahn & Ouattara
#Program : acquire
#gets as input a <ctt> text, then creates a key wordlist and sorts it automatically
sub getDir {
opendir(ETC, "/project/langdoc/SOFTWARE/CONCORDANCE/DATA/")
or
die "Cannot open it!" ;
while ($toc = readdir(ETC)) {
if ($toc =~ m/\S+?\.ctt/g) {
push(@inh, $toc) ;
}
}
closedir(ETC);
return @inh ;
}
@array = () ;
@narray = () ;
@dir = getDir() ;
print @dir ;
foreach $datei (@dir) {
open(DATEI, "< /project/langdoc/SOFTWARE/CONCORDANCE/DATA/$datei") ;
while (<DATEI>) {
if (m/<concword wordnumber=\"word\d+\.\d+\">(\S+?)<\/concword>/g) {
push(@array, $1) ;
} #extracts a word and pushes it onto an array
}
close(DATEI) ;
}
@narray = sort(@array) ; #sorts the array
$index = 0 ;
$current = 0 ;
@rarray = () ;
while($index <= $#narray){ #loops over the sorted array
    push(@rarray, $narray[$index]) ; #pushes word onto the array
    $index++ ;
    while ($index <= $#narray and $rarray[$current] eq $narray[$index]){
        $index++ ; #skips equal words without reading past the end of the array
    }
    $current++;
}
open(OUT, "> /project/langdoc/SOFTWARE/CONCORDANCE/DATA/wortliste.wl") ;
print OUT join("\n", @rarray) ; # converts the array into string
close(OUT) ;
#!/vol/bin/perl -w
use CGI qw(:standard);
my $language = param("language"); #the language to be investigated
#my $language = "AGNI";
$defaultpath= "../html-data/DATA/"."$language"."/";
# $_=$DATEI;
#s/\/project\/langdoc\/SOFTWARE\/CONCORDANCE\/DATA\///;
#s/\.ctt//;
#$filename=$_;
%titlefile =();
@array = () ;
@narray = () ;
@dir = getDir() ;
# print @dir ;
foreach $datei (@dir) {
open(DATEI, "< /project/langdoc/SOFTWARE/CONCORDANCE/DATA/$language/$datei") ;
while (<DATEI>) {
if (m/<concword wordnumber=\"word\d+\.\d+\">(\S+?)<\/concword>/g) {
push(@array, $1) ;
} #extracts a word and pushes it onto an array
elsif (/<title>(\w.*)<\/title>/){ #extracts the title without tags or line feed
    $titlefile{$datei} = $1 ;
}
}
close(DATEI);
}
@narray = sort(@array) ; #sorts the array
$index = "0" ;
$current = "0" ;
@rarray = () ;
while($index <= $#narray){ #loops over the sorted array
    push(@rarray, $narray[$index]) ; #pushes word onto the array
    $index++ ;
    while ($index <= $#narray and $rarray[$current] eq $narray[$index]){
        $index++ ; #skips equal words without reading past the end of the array
    }
    $current++;
}
print header();
print <<END_of_HEAD;
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>AUDIO-Concordance Wordlist and Userinterface</title>
<link rev="MADE" href="mailto:ttrippel\@spectrum.uni-bielefeld.de" />
<base href="http://coral.lili.uni-bielefeld.de/langdoc/cgi-bin/acquisition.pl" />
<meta name="copyright" content="University of Bielefeld, Computational Linguistics and Spoken Language" />
<meta name="author" content="Thorsten Trippel" />
<meta name="description" content="Wordlist for the audio concordance and Userinterface" />
<meta name="date" content="23 Nov 2000" />
<link rel="stylesheet" href="../langdoc.css" />
<script language="JavaScript">
<!--
function doList() {
counter=0;
for(var i=0;i<document.forms["consultationstart"].infile.options.length;i++) {
if(document.forms["consultationstart"].infile.options[i].selected) {
counter++;
}
}
if(counter==0){
alert("Don't want to proceed?\\n There are no files selected!");
return;
}
else {document.forms["consultationstart"].submit();}
}
function select_all(formList) {
for (var i = 0; i < formList.options.length; i++) {
formList.options[i].selected =true;
}
}
function deselect_all(formList) {
for (var i = 0; i < formList.options.length; i++) {
formList.options[i].selected =false;
}
}
// -->
</script>
</head>
<body link="#ffffff" vlink="#fafafa" alink="#ff0000">
END_of_HEAD
@alltitles= values(%titlefile);
print <<HEAD_of_TABLE;
<form name="consultationstart" action="http://coral.lili.uni-bielefeld.de/langdoc/cgi-bin/consultation.pl" method="post">
<input type="hidden" name="language" value="$language" />
<table class="intern" >
<tr><!-- Row 1: empty, spacer images only -->
<td class="background" width="137">
<img src="../IMAGES/1pix.gif" width="1" height="1" alt="" hspace="68" vspace="1" /></td>
<td class="background" width="300%" colspan="3"><img src="../IMAGES/1pix.gif"
width="1" height="1" alt="" hspace="137" vspace="1" /></td>
<td class="background" width="137"><img src="../IMAGES/1pix.gif"
width="1" height="1" alt="" hspace="68" vspace="1" /></td>
</tr>
<tr><!-- Row 2: table headings -->
<td class="tablehead">Contents</td>
<td class="tablehead" colspan="3">Search for words in one or more text(s) <br /> in the language <b>$language</b>.</td>
<td class="tablehead">Links</td>
</tr>
<tr> <!-- Row 3: first row of the table content -->
<td class="content" rowspan="3">
<p><a href="../LangDoc/index.html">Language Documentation Notes</a></p>
<p><a href="../index.html">Introductory page</a></p>
<!-- <p><a href="../acquisition.pl">Search the concordance</a></p> -->
<p><a href="../SPECIFICATION/">Specification of the audio concordance</a></p>
<p>E-mail: <a href="mailto:langdoc\@spectrum.uni-bielefeld.de">langdoc\@spectrum.uni-bielefeld.de</a></p>
<p><a href="../about.html">About the project</a></p>
<p>Designed: November 2000</p>
<!-- <img src="../IMAGES/1pix.gif"
width="1" height="1" alt="" hspace="1" vspace="100" /> -->
</td>
<td class="body" rowspan="2">
<!-- <table class="intern" border="0" cellspacing="0" cellpadding="0"> -->
HEAD_of_TABLE
print <<SELECT_END;
<!-- <tr>
<td class="body" rowspan="2" > -->
<strong>Select word:</strong><br />
<select name="word" size="20" multiple="multiple">
SELECT_END
$word="0";
for ($word=0;$word<=$#rarray;$word++){
print "<option value=\"$rarray[$word]\">$rarray[$word]</option>\n"
}
print <<CONTENT_START;
</select>
</td>
<td align="center" class="body" colspan="2" rowspan="1">
<strong>Select corpus:</strong><br />
<select name="infile" size="3" multiple="multiple">
CONTENT_START
while (($file,$title) = each(%titlefile)){
print "<option value=\"$defaultpath$file\">$title</option>\n";
}
print <<CONTENT_END;
</select>
</td>
<td class="linklist" rowspan="3">
</td>
</tr>
<tr> <!-- Row 4: the contents and link-list columns are used up, as is column 2; columns 3 and 4 remain -->
<td align="center" class="body" colspan="2">
<input type="button" value="Select All Files" onclick="select_all(form.infile)" />
<!-- </td>
<td align="center" class="body" > -->
<input type="button" value="Deselect All Files" onclick="deselect_all(form.infile)" />
<br /><input type="button" value="Select All Words" onclick="select_all(form.word)" />
<!-- </td>
<td align="center" class="body" > -->
<input type="button" value="Deselect All Words" onclick="deselect_all(form.word)" /></td>
</tr>
<tr><!-- Row 5: the contents and link-list columns are used up, the rest is not -->
<td align="center" class="body" colspan="3">
<input type="button" value="Search for word" onclick="doList(form)" />
<!-- <input type="submit" value="Search for word" / >
</td>
<td align="center" class="body" >--> <input type="reset" value="Reset" /></td>
</tr>
CONTENT_END
print <<END_of_TABLE;
<!-- </table>
</td>
</tr> -->
<tr>
<td class="tablehead"> </td>
<td class="tablehead" colspan="3"> </td>
<td class="tablehead"> </td>
</tr>
</table>
</form>
</body>
</html>
END_of_TABLE
sub getDir {
opendir(ETC, "/project/langdoc/SOFTWARE/CONCORDANCE/DATA/$language/")
or
die "Cannot open the DATA directory!" ;
while ($toc = readdir(ETC)) {
if ($toc =~ m/\S+?\.ctt/g) {
push(@inh, $toc) ;
}
}
closedir(ETC);
return @inh ;
}
#!/vol/bin/perl -w
#authors Jahn & Ouattara
#Program : search.pl
#looks for a given word in a given text and prints out the results in multiple matching
#a result is composed of the line number and the contents of that line
undef $/ ;
if ($#ARGV < 2) {
print "Usage: consultation.pl search outputfile inputfile(1) ... inputfile(n)\n" ;
exit(1) ; #a non-zero status signals the usage error
}
print "@ARGV\n" ;
$word = $ARGV[0] ; #the word to be searched
@input = @ARGV[2..$#ARGV] ; #the input file to look through
$output = $ARGV[1] ; #the output file
$found = 0 ; #boolean variable which is 0 if there aren't any matches
open (OUT, "> $output")
or
die "\n Output file couldn't be created!!\n" ;
foreach $dat (@input) {
open(IN, "< $dat")
or
die "$dat couldn't be opened!!\n" ;
$text = <IN> ;
while ($text =~ m/<concsentence sentencenumber=\"(\d+)\">(.+?)<\/concsentence>/gs) { #matches the
#sentence number and its contents in standard variables
$zeile = $1 ;
$inhalt = $2 ;
print $inhalt ;
if ($inhalt =~ m/>\Q$word\E</) { #matches the word, taken literally, against the contents
$found = 1 ;
$inhalt =~ s/concword wordnumber/a name/g ;
$inhalt =~ s/concword>/a>/g ;
print OUT "line $zeile\n" ; #prints the sentence number into a file
print OUT "$inhalt\n" ; #prints the contents of the sentence into a file
}
}
close(IN) ;
}
if ($found == 0) {
print "No matches found !!\n" ;
}
close(OUT);
#!/vol/bin/perl -w
use CGI qw(:standard);
my @word = param("word"); #the word to be searched
my @input= param("infile"); #the input file to look through
# my $output= param("outfile"); #the output file
my $language = param("language"); #the language to be investigated
undef $/ ;
print header();
print <<END_of_HEAD;
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>AUDIO-Concordance output</title>
<link rev="MADE" href="mailto:ttrippel\@spectrum.uni-bielefeld.de" />
<base href="http://coral.lili.uni-bielefeld.de/langdoc/cgi-bin/test.pl" />
<meta name="copyright" content="University of Bielefeld, Computational Linguistics and Spoken Language" />
<meta name="author" content="Thorsten Trippel" />
<meta name="description" content="Results from the audio concordance query" />
<meta name="date" content="23 Nov 2000" />
<link rel="stylesheet" href="../langdoc.css" />
</head>
<body link="#ffffff" vlink="#fafafa" alink="#fa1340">
END_of_HEAD
print <<HEAD_of_TABLE;
<table class="intern" >
<tr>
<td class="background" width="137">
<img src="../IMAGES/1pix.gif" width="1" height="1" alt="" hspace="68" vspace="1" /></td>
<td class="background" width="300%"><img src="../IMAGES/1pix.gif"
width="1" height="1" alt="" hspace="137" vspace="1" /></td>
<td class="background" width="137"><img src="../IMAGES/1pix.gif"
width="1" height="1" alt="" hspace="68" vspace="1" /></td>
</tr>
<tr>
<td class="tablehead" bgcolor="#CCCCCC" >Contents</td>
<td class="tablehead">Hits: search for <b>@word</b> <br />in $language
<!-- text $filename as a corpus. -->
</td>
<td class="tablehead">Links</td>
</tr>
<tr>
<td class="content">
<p><a href="/LangDoc/index.html">Language Documentation Notes</a></p>
<p><a href="../index.html">Introductory page</a></p>
<!-- <p><a href="acquisition.pl">Search the concordance</a></p> -->
<p><a href="../SPECIFICATION/">Specification of the audio concordance</a></p>
<p>E-mail: <a href="mailto:langdoc\@spectrum.uni-bielefeld.de">langdoc\@spectrum.uni-bielefeld.de</a></p>
<p><a href="about.html">About the project</a></p>
<p>Designed: November 2000</p>
</td>
<td>
<table class="intern" border="0" cellspacing="0" cellpadding="0">
HEAD_of_TABLE
foreach $file (@input) {
open (IN, "< $file")
or
die "\n Input file couldn't be opened!!\n" ;
$text = <IN> ;
$_=$file;
s/\.\.\/html-data\/DATA\/$language\///;
s/\.ctt//;
$filename=$_;
while ($text =~ m/<concsentence sentencenumber=\"(\d+)\">(.+?)<\/concsentence>/gs) { #matches the
#sentence number and its contents in standard variables
$zeile = $1 ;
$inhalt = $2 ;
foreach $word (@word){
if ($inhalt =~ m/>\Q$word\E</) { #matches the word, taken literally, against the contents
#
# $found = 1 ;
$inhalt =~ s/concword wordnumber/a name/g ;
$inhalt =~ s/concword>/a>/g ;
$inhalt =~ s/>\Q$word\E</><b>$word<\/b></g ;
print #<<CONTENT_END;
("<tr><td class=\"body\">text: $filename, line $zeile: <br /> $inhalt\n</td><td class=\"body\"><a href=\"../CORPUS/AUDIO/$filename"."$zeile.wav\"><img src=\"../IMAGES/speaker.gif\" alt=\"Link to audio\" /></a>
</td>
</tr> ")
# CONTENT_END
# print p("text: $file <br /> line $zeile: <br /> $inhalt\n") ;
#prints the sentence number into a file
#prints the contents of the sentence into a file
}
}
}
}
print <<END_of_TABLE;
</table>
<!-- -->
</td>
<td class="linklist">
</td>
</tr>
<tr>
<td class="tablehead"> </td>
<td class="tablehead"> </td>
<td class="tablehead"> </td>
</tr>
</table>
</body>
</html>
END_of_TABLE
close(IN) ;
#print end_html();
The Portable Audio Concordance (PAC) was developed at the University of Bielefeld in 2000-2001 in cooperation between the University of Bielefeld, Germany, and the Universite de Cocody, Abidjan, Ivory Coast. This cooperation took place within the DAAD project Encyclopaedia of languages of the Ivory Coast and continues in the project EGA: A Documentation Model for an Endangered Ivorian Language, funded by the VW foundation.
The goal was to create a concordance, that is, a list of words with pointers to the contexts in which they actually occur. These contexts should be available not only in written form but also as audio, so that researchers of all kinds have access to the source data. For the technical and linguistic specification see Trippel et al. (2001).
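The idea of a concordance as a list of words with pointers to their contexts can be sketched in a few lines of Perl. This is only an illustration, not the PAC implementation: the markup (concsentence elements with a sentencenumber attribute, containing concword elements) is inferred from the patterns used in the search scripts, and the function name build_index and the two sample tokens are invented for the example.

```perl
use strict;
use warnings;

# Build a word -> list of sentence numbers index from normalised text.
# Assumption: sentences and words are marked up as in the search scripts'
# regular expressions (<concsentence sentencenumber="N">, <concword ...>).
sub build_index {
    my ($text) = @_;
    my %index;
    while ($text =~ m{<concsentence sentencenumber="(\d+)">(.+?)</concsentence>}gs) {
        my ($number, $content) = ($1, $2);
        # Collect every word of this sentence and point it to the sentence.
        while ($content =~ m{<concword[^>]*>([^<]+)</concword>}g) {
            push @{ $index{$1} }, $number;
        }
    }
    return %index;
}

# Invented sample data; "aba" and "ko" are placeholder tokens only.
my $sample = join '',
    '<concsentence sentencenumber="1">',
    '<concword wordnumber="1">aba</concword> ',
    '<concword wordnumber="2">ko</concword>',
    '</concsentence>',
    '<concsentence sentencenumber="2">',
    '<concword wordnumber="1">ko</concword>',
    '</concsentence>';

my %index = build_index($sample);
for my $word (sort keys %index) {
    print "$word: sentences @{ $index{$word} }\n";
}
```

Each entry of the index is a pointer to a sentence number, which is also what the query scripts report for a hit.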
This is the user's manual for the audio concordance. It is directed towards the user consulting a concordance, not the lexicographer adding corpora; the lexicographer's information needs are therefore addressed at the end of the manual.
Like most programs, PAC offers certain functions and lacks others. The next sections explain what is possible and what is not.
PAC is a language-independent software package for the creation of a concordance. Any language can be used as long as corpora meeting the following requirements exist:
If these requirements are met the PAC software package can work properly and carry out
If the above mentioned requirements are not met, PAC cannot work properly.
Further shortcomings in functionality, which may be changed in the process of development at a later stage, are:
Two different graphical user interfaces are available: a CGI interface (via the WWW or at least a web server) and a TK widget interface (windows popping up). Because the functionality is roughly the same, we distinguish between them only where they differ.
The introductory page requires the user to select the language of interest. On standard systems this is done by clicking on the name of the language in the list of languages. If the number of languages is greater than the number of lines available in the pick list, then a scroll bar enables scrolling up and down for more languages.
After selecting a language, click the choose button (TK version) or the consult concordance button (Web version).
The next window then appears automatically and can be used to continue with a query.
Two select lists appear after a language has been selected successfully. The left one shows the words that can be queried, the right one shows the available corpora. Select a word by clicking on it in the list (scroll the list if needed) and select a corpus. In the Web version you may also select more than one corpus, e.g. by clicking the button Select All Corpora. A word and/or corpus can be deselected by clicking it again (Web version) or by pressing the deselect file/word button (TK version). A complete reset can be done by clicking the reset button.
The TK version can be exited from this window by clicking the quit button.
In both versions, clicking the search for word button runs the actual query.
The result consists of the name of the corpus file (Web version only), the line number (which is also the sentence number) and the transcription of the sentence. To listen to the audio, select a sentence and click play (TK version), or click the speaker icon (Web version).
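How a hit is tied to its recording can be illustrated as follows. In the web script the link behind the speaker icon is built from the corpus name, the sentence number and the extension .wav; the sketch below follows that pattern. The function name audio_path and the sample path are invented for the example, and unlike the original script, which strips a fixed directory prefix, this version simply drops any directory part.

```perl
use strict;
use warnings;

# Derive the audio file path for one hit: corpus name (directory and
# .ctt extension removed) + sentence number + ".wav".
sub audio_path {
    my ($corpus_file, $sentence_number) = @_;
    (my $name = $corpus_file) =~ s{^.*/}{};   # drop any directory part
    $name =~ s/\.ctt$//;                      # drop the .ctt extension
    return "../CORPUS/AUDIO/" . $name . $sentence_number . ".wav";
}

# Hypothetical corpus file "story.ctt" in a hypothetical language directory.
print audio_path("../html-data/DATA/somelanguage/story.ctt", 12), "\n";
```

Under this scheme sentence 12 of a corpus called story would be expected in CORPUS/AUDIO as story12.wav.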
The audio files tested and used in the example implementation are in the common WAV format. They were recorded both at the University of Bielefeld and in field studies in the Ivory Coast with native speakers.
Transcriptions are human products and therefore subject to
non-systematic (and sometimes also systematic) errors. Special
characters that are not part of the ASCII-128 character set as well as
special non-letter characters such as
Inserting data and adding a language is shell-based; it was tested on UNIX and Linux machines. GUIs have not yet undergone sufficient testing.
To add a language, go to the concordance directory, change to the DATA directory and create a subdirectory named after the language. This will be the directory for the normalised texts and the wordlist for the language.
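The directory step can also be scripted. The following is a minimal sketch, assuming only the DATA/language layout described above; the function name add_language_dir and the language name newlanguage are placeholders, and the demonstration runs in a temporary directory rather than in a real PAC installation.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Create DATA/<language> below the concordance directory if it is missing
# and return the path of the language directory.
sub add_language_dir {
    my ($concordance_dir, $language) = @_;
    my $dir = File::Spec->catdir($concordance_dir, "DATA", $language);
    make_path($dir) unless -d $dir;
    return $dir;
}

# Demonstration in a throwaway directory; replace it with the real
# concordance directory of your installation.
my $base = tempdir(CLEANUP => 1);
my $language_dir = add_language_dir($base, "newlanguage");
print "Created $language_dir\n" if -d $language_dir;
```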
You may add the new corpus at this point of processing.
Copy the audio files of the language into the concordance subdirectory
CORPUS/AUDIO . The format of the filenames should be
corpusfilename
You may want to store your original corpus transcription in the CORPUS-directory.
After storing the new corpus in the CORPUS directory, change to the MODULES directory. There, call the script normalisation.pl with the arguments ../CORPUS/newcorpusfile.txt and ../DATA/language/newcorpusfile.ctt. Note that newcorpusfile is a placeholder for the filename of your corpus; txt denotes the ASCII input format and ctt the output format. For automatic use of the concordance the PAC software expects the ctt extension for the output format, so you should stick to that.
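The two arguments follow directly from the corpus name and the language name, so they can be derived mechanically. The sketch below only builds and prints the command line; the function name normalisation_args is invented, and the actual call is left as a comment because it needs a PAC installation and has to be started from the MODULES directory.

```perl
use strict;
use warnings;

# Build the input and output arguments for normalisation.pl from the
# corpus name (without extension) and the language name.
sub normalisation_args {
    my ($corpus, $language) = @_;
    return ("../CORPUS/$corpus.txt", "../DATA/$language/$corpus.ctt");
}

# "newcorpusfile" and "language" are the placeholders used in the text.
my @args = normalisation_args("newcorpusfile", "language");
print "perl normalisation.pl @args\n";

# To run the script for real (not executed in this sketch):
# system("perl", "normalisation.pl", @args) == 0
#     or die "normalisation.pl failed: $?";
```

Keeping the ctt extension in the generated output name matches what the PAC software expects for automatic use.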
After you have called the script on the shell, you will be asked for the name of the author, the title of the corpus, the date of creation, the language of the corpus and for changes (this means dates of changes to the normalised text). Please have this information at hand. Skip the question about changes by pressing return without any other input. This will be the regular case, as the changes tag should be used if and only if the normalised text has received changes that are neither due to a format change nor part of a (corrected and changed) original corpus.
As soon as this is finished, the new corpus is usable within the limitations of your version.
Please send error reports, questions and suggestions to langdoc@spectrum.uni-bielefeld.de with a brief explanation of the problem.