To automate parts of the language documentation process we will be
using Perl, a language designed for processing large quantities of text
in simple or sophisticated ways. Perl is used by many linguists,
computational linguists, and language and speech engineers to build
an efficient and practical laboratory working environment for automatic
text processing, in tasks such as wordlist and dictionary construction
and hypertext formatting.
The basic idea of automatic text processing is straightforward: in manual
text editing, whether on paper or on the keyboard, a number of basic
operations are performed, such as the following:
- Searching for a particular string of characters (or a set of similar strings) in a document.
- Marking certain strings (or sets of similar strings) in a document.
- Inserting certain strings (or sets of similar strings) in a document.
- Deleting certain strings (or sets of similar strings) in a document.
- Replacing particular strings (or sets of similar strings) by another string in a document.
- Extracting sections of a document which match a particular string and copying them into another document.
All of these operations can be written down, in the order in which they
are intended to be performed (perhaps repeatedly until some condition
is fulfilled) in a `script', and performed automatically.
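As a rough illustration of how the operations listed above map onto Perl, here is a minimal sketch using regular expression matching and substitution. The subroutine names and the sample sentences are assumptions made for this example only; they do not come from any script referred to in these notes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A tiny sample "document", held in memory as a list of lines (invented data).
my @document = (
    "The cat sat on the mat.",
    "A dog barked at the cat.",
    "The bird sang.",
);

# Searching / extracting: return the lines which match a pattern.
sub search_lines {
    my ($pattern, @lines) = @_;
    return grep { /$pattern/ } @lines;
}

# Marking: put square brackets around every match.
sub mark {
    my ($pattern, $line) = @_;
    (my $copy = $line) =~ s/($pattern)/[$1]/g;
    return $copy;
}

# Inserting: add a string directly after every match.
sub insert_after {
    my ($pattern, $addition, $line) = @_;
    (my $copy = $line) =~ s/($pattern)/$1$addition/g;
    return $copy;
}

# Deleting: remove every match.
sub delete_matches {
    my ($pattern, $line) = @_;
    (my $copy = $line) =~ s/$pattern//g;
    return $copy;
}

# Replacing: substitute another string for every match.
sub replace_matches {
    my ($pattern, $replacement, $line) = @_;
    (my $copy = $line) =~ s/$pattern/$replacement/g;
    return $copy;
}

my $line = $document[1];
print "$_\n" for search_lines('cat', @document);       # the two lines containing "cat"
print mark('cat', $line), "\n";                        # A dog barked at the [cat].
print insert_after('dog', ' (N)', $line), "\n";        # A dog (N) barked at the cat.
print delete_matches(' at the cat', $line), "\n";      # A dog barked.
print replace_matches('cat', 'mouse', $line), "\n";    # A dog barked at the mouse.
```

Copying the result of `search_lines` into a second list (or printing it to a second file) corresponds to the extraction operation; the other subroutines return a changed copy and leave the original line untouched.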
There are many scripting languages, of which Perl is one, which have been
designed specifically for this kind of task.
Others you may come across are Python, awk and
JavaScript.
JavaScript is used inside HTML pages on the
web, for example to create interactive animations.
Tk is a toolkit for creating graphical
user interfaces, which has been integrated with the scripting languages
Tcl, Python and Perl.
Practical things:
- If you use Linux, or another UNIX computing environment, you will find that Perl is either included in the software distribution or can easily be downloaded and installed.
- If you use a Windows computer, look for and install ActivePerl.
- The following books on Perl cover a range of levels of proficiency:
- Larry Wall, Tom Christiansen & Randal L. Schwartz (1991). Programming Perl. Cambridge &c.: O'Reilly. (`The Camel Book'.) (German translation also available; also check later editions.)
- Randal L. Schwartz & Tom Christiansen (1997, 2nd edn.). Learning Perl. Cambridge &c.: O'Reilly. (`The Llama Book'.)
- Randal L. Schwartz, Erik Olson & Tom Christiansen (1997). Learning Perl on Win32 Systems. Cambridge &c.: O'Reilly. (`The Gecko Book'.)
- Larry Wall, Tom Christiansen & Randal L. Schwartz (2000). Programmieren mit Perl. Cambridge &c.: O'Reilly.
- Paul Hoffman (1997). Perl for Dummies. Foster City: IDG Books. (German translation also available.)
- Dirk Louis (2000). Jetzt lerne ich Perl. München: Markt+Technik Verlag.
- Try out the following simple text processing scripts:
Word count script
Line matching script
Exercise:
For each of these two applications, write a detailed description with the following components:
- Specification of functionality (requirements)
- Specification of design (architecture, data structures, algorithms)
- Specification of implementation (specific techniques and programming constructs in Perl)
- For ideas on more advanced kinds of functionality which you may want to develop, see my HyprLex applications and Steven Bird's Hyperlex.
- For more advanced examples of linguistic Perl applications, check Jon Fernquest's Language Exploration and Manipulation Tools for Translators, Writers, and Language Students
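The word count and line matching scripts named in the list above are not reproduced in these notes. To give a rough idea of what scripts of this kind could look like, here is a minimal sketch. The subroutine names `count_words` and `matching_lines`, the definition of a word as a run of ASCII letters, and the sample text are all assumptions made for illustration; they are not taken from the original scripts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Word count: return a reference to a hash which maps each
# lower-cased word to the number of times it occurs.
sub count_words {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        # Assumption: a word is a run of ASCII letters,
        # optionally with internal apostrophes (e.g. "don't").
        $count{ lc $1 }++ while $line =~ /([A-Za-z]+(?:'[A-Za-z]+)*)/g;
    }
    return \%count;
}

# Line matching: return the lines which match a pattern
# (ignoring case), each prefixed with its line number.
sub matching_lines {
    my ($pattern, @lines) = @_;
    my @hits;
    my $number = 0;
    for my $line (@lines) {
        $number++;
        push @hits, "$number: $line" if $line =~ /$pattern/i;
    }
    return @hits;
}

# Invented sample text, one line per list element.
my @text = (
    "the cat saw the dog",
    "The dog saw a bird",
    "a bird sang",
);

my $freq = count_words(@text);

# List the words by descending frequency, then alphabetically.
for my $word (sort { $freq->{$b} <=> $freq->{$a} or $a cmp $b } keys %$freq) {
    printf "%-6s %d\n", $word, $freq->{$word};
}

print "$_\n" for matching_lines('dog', @text);
```

To process a real text instead of the sample, the list `@text` could be filled from standard input, for example with `chomp(my @text = <STDIN>);`.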
A task which can easily be implemented in Perl is the following:
- Requirements: Produce a semi-automatically labelled text with HTML markup, in which different parts of speech are marked in different colours.
- Design:
  - The text is to be re-formatted in the form of a list of words and punctuation signs on separate lines.
  - Each word is to be manually assigned a descriptive category such as a part of speech (e.g. Det, Adj, N, Pron, V, Adv, Prep, Conj, Interj).
  - A coding table is to be defined, in which the categories are paired with colours.
  - From the list and the coding table, an HTML file is to be generated in which words of particular categories appear in the appropriate colour.
- Implementation: ... this is up to you to specify.
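The implementation is left open above, so the following is only one possible starting point and not an expected solution. It is a minimal sketch which assumes that the manually labelled list has one token per line, with the category separated from the word by whitespace; the colour choices in the coding table, the subroutine names and the sample sentence are likewise assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Coding table: categories paired with colours (arbitrary choices).
my %colour_of = (
    Det  => 'gray',
    Adj  => 'green',
    N    => 'blue',
    V    => 'red',
    Prep => 'purple',
);

# Escape the characters which have a special meaning in HTML.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Design step 1: split a text into words and punctuation signs,
# ready to be written out one per line. Call in list context.
sub tokenise {
    my ($text) = @_;
    return $text =~ /(\w+|[^\w\s])/g;
}

# Design steps 3 and 4: generate HTML from the labelled list and the
# coding table. Tokens without a known category are left uncoloured.
sub to_html {
    my ($colours, @tagged) = @_;
    my @out;
    for my $entry (@tagged) {
        my ($word, $cat) = split ' ', $entry;
        my $html = escape_html($word);
        if (defined $cat and exists $colours->{$cat}) {
            $html = qq{<span style="color: $colours->{$cat}" title="$cat">$html</span>};
        }
        push @out, $html;
    }
    return "<html><body><p>\n" . join(" ", @out) . "\n</p></body></html>\n";
}

# Step 1: produce the token list for manual labelling.
my @tokens = tokenise("The old dog sleeps.");
print "$_\n" for @tokens;

# Step 2 is manual: each word is given a category by hand.
my @tagged = ("The Det", "old Adj", "dog N", "sleeps V", ".");

print to_html(\%colour_of, @tagged);
```

Keeping the coding table in a hash means that colours can be changed, or categories added, without touching the code which generates the HTML.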
The following example is designed for basic quantitative corpus linguistic
work:
- Statistical analysis for a single population: dgstats.pl
Example: for this input file:
# test.dat
12
13
14
15
14
13
12
the script produces the following output:
===========================================================
Input file: teststats.dat
Output file: teststats.dat.stats
Date: 2000/12/7 13:55:34
Program: dgstats.pl V 1.02, D. Gibbon, 2000/12/06
Description: Single population statistics calculator)
===========================================================
n: 7
Min: 12
Max: 15
Sum: 93
Mean: 13.2857142857143
Variance: 1.23809523809524
Std dev: 1.11269728052837
Std error: 0.420560041253707
Mean diff: 0.897959183673469
95% conf int: 12.461416604857...14.1100119665716 (+/-0.824297680857265)
99% conf int: 12.2023516194447...14.3690769519838 (+/-1.08336266626955)
===========================================================
Data and Z-scores:
1: 12 -1.15549332977947
2: 13 -0.256776295506548
3: 14 0.641940738766369
4: 15 1.54065777303929
5: 14 0.641940738766369
6: 13 -0.256776295506548
7: 12 -1.15549332977947
===========================================================
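The script dgstats.pl itself is not reproduced here. As a rough sketch, the figures in the sample output can be recomputed as shown below. The formulas are inferred from the numbers in that output and are therefore assumptions about what the script does: the variance is taken to be the sample variance (divisor n-1), the standard error to be the standard deviation divided by the square root of n, `Mean diff' to be the mean absolute deviation from the mean, and the confidence intervals to use the normal approximation with the factors 1.96 and 2.576. The subroutine name and the hash keys are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compute descriptive statistics for a single sample of numbers.
sub single_population_stats {
    my @x = @_;
    my $n = @x;
    die "Need at least two values\n" if $n < 2;

    my ($min, $max, $sum) = ($x[0], $x[0], 0);
    for my $v (@x) {
        $min = $v if $v < $min;
        $max = $v if $v > $max;
        $sum += $v;
    }
    my $mean = $sum / $n;

    my ($sum_sq, $sum_abs) = (0, 0);
    for my $v (@x) {
        $sum_sq  += ($v - $mean) ** 2;    # squared deviations
        $sum_abs += abs($v - $mean);      # absolute deviations
    }
    my $variance = $sum_sq / ($n - 1);    # sample variance (assumed)
    my $std_dev  = sqrt($variance);
    my $std_err  = $std_dev / sqrt($n);

    # Avoid division by zero when all values are identical.
    my $divisor = $std_dev || 1;
    my @z = map { ($_ - $mean) / $divisor } @x;

    my %stats = (
        n         => $n,
        min       => $min,
        max       => $max,
        sum       => $sum,
        mean      => $mean,
        variance  => $variance,
        std_dev   => $std_dev,
        std_err   => $std_err,
        mean_diff => $sum_abs / $n,       # mean absolute deviation (assumed)
        ci95      => 1.96  * $std_err,    # half-width, normal approximation
        ci99      => 2.576 * $std_err,
        z         => \@z,
    );
    return \%stats;
}

# The sample input from above; comment lines starting with "#" are skipped.
my @lines = ("# test.dat", 12, 13, 14, 15, 14, 13, 12);
my @data  = grep { !/^\s*#/ && /\S/ } @lines;

my $s = single_population_stats(@data);
for my $key (qw(n min max sum mean variance std_dev std_err mean_diff)) {
    printf "%-10s %s\n", "$key:", $s->{$key};
}
for my $level (qw(ci95 ci99)) {
    printf "%-10s %s...%s (+/-%s)\n", "$level:",
        $s->{mean} - $s->{$level}, $s->{mean} + $s->{$level}, $s->{$level};
}
for my $i (0 .. $#data) {
    printf "%d: %s %s\n", $i + 1, $data[$i], $s->{z}[$i];
}
```

For small samples such as this one, a confidence interval based on the t distribution would be wider than the normal approximation assumed here.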
Dafydd Gibbon, Thu Feb 15 15:07:15 MET 2001