% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
% %
% File: dtrstyle.txt %
% Purpose: style sheet for DATR .dtr files %
% Author: Gerald Gazdar, 25 September 1994 %
% Email: geraldg@cogs.sussex.ac.uk %
% Address: COGS, Sussex University, Brighton BN1 9QH, UK %
% Version: 1.01 %
% %
% Copyright (c) University of Sussex 1994. All rights reserved. %
% %
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
0. Introduction
This document contains a set of conventions, recommendations and
suggestions for the formatting and layout of theories/fragments
written in the lexical knowledge representation language DATR. Each
proposal is accompanied by an example and by a rationale for the
proposal, where one exists. The proposals are based on many hours
spent editing other people's dtr files for inclusion in the public
DATR archive and on comments and suggestions made by Roger Evans,
Dafydd Gibbon and Jim Kilbury.
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
1. Filenames
Adopt MS-DOS file name conventions even if working on a Unix system:
lower case characters only, up to eight characters in the prefix and a
three character suffix after the period. Many people use DATR on
MS-DOS systems and long Unix names will not be preserved. Make the
prefix as mnemonic as you can (for others as well as yourself) but
allow for the possibility of other files on the same topic -- hence
'estonia1.dtr' or 'eston_a.dtr' rather than 'estonian.dtr'. Check
the DATR dtr file archive to see if a name is already in use or if an
existing name can be used as a model.
Other recommended file suffixes:
estonia1.dec -- file containing declarations relating to estonia1.dtr
estonia1.dmp -- file containing a theorem dump from estonia1.dtr
estonia1.doc -- file containing text documentation for estonia1.dtr
estonia1.dto -- compiled version - internal code - of estonia1.dtr
estonia1.dtv -- file containing a dump of values from estonia1.dtr
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
2. Fonts
DATR fragments should always be printed or displayed using a
monospaced font.
Rationale: use of a proportional font will make it impossible to
preserve vertical character alignments.
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
3. Character sets
All existing implementations of DATR allow one to use all the
characters found on a US typewriter keyboard. At least one widely
used implementation (QDATR) will also allow you to use characters
found only on keyboards designed for European languages other than
English. You may want to bear this fact in mind when you decide how to
encode your analysis. If the orthographic representation is not the
main issue and there are conventions for representing the language in
7 bit ASCII, then you should simply embrace those conventions in the
interests of the portability of your analysis. But if many
non-ASCII characters are involved and you want your theorems to
deliver the language as it is actually written, then you may decide
to use those characters in your encoding. Whichever way you go, it
makes sense to include a comment near the top of the file which
addresses the character set issue.
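A comment of this kind might look like the following sketch. The
transliteration scheme it describes is an invented illustration and is
not taken from any file in the archive.

```datr
% Character set: 7 bit ASCII only.
% Transliteration (illustrative assumption, not an established scheme):
%     long vowels are written as doubled letters (aa, ee, ii, oo, uu);
%     no characters outside the US typewriter keyboard are used.
```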
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
4. Tabs or spaces
NEVER EVER USE TABS!
Rationale: the appearance of tabs depends on the output device and
thus the appearance of your fragment will vary across output devices.
The indentation you intend may not be the indentation that appears.
In addition, since tabs are invisible, subsequent editing is likely
to lead to the file containing a mixture of tabs and spaces used to
the same end (e.g., indentation) which will make the results of
global edits unpredictable.
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
5. Order of material in the file
1. The file header.
2. Opening comment explaining what the file is and any relevant
background material. For long files it may well be useful to
include a table of contents.
3. Opening declarations (see below).
4. Node definitions -- normally presented with the top of the
inheritance tree at the top, and the leaves at the bottom.
5. Closing declarations (see below).
6. The RCS Id comment (see below).
You should also include, within a comment, some examples of typical
theorems that can be derived from the theory: these will help users
to see what the "sensible" queries are (which may not otherwise be
obvious, especially if theorem dump declarations are not included).
These example theorems can be included in the opening comment or
immediately prior to the closing declarations.
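The ordering above can be sketched as a skeleton file. Everything in
it (the node names, the attributes, the example theorems) is a
hypothetical placeholder, the header is abbreviated to a single
comment, and the declarations shown are only accepted by those
implementations that support them.

```datr
% 1. File header (abbreviated here; see section 6 for the full format).

% 2. Opening comment: what the file is and any background material.
%
%    Example theorems (hypothetical):
%        Dog: <mor sing> = dog.
%        Dog: <mor plur> = dog s.

% 3. Opening declarations (this variable is unused below; it is
%    included only to show where such declarations go).
# vars $vowel: a e i o u.

% 4. Node definitions, top of the inheritance tree first.
Noun:
    <syn cat> == n
    <mor sing> == "<mor root>"
    <mor plur> == "<mor root>" s.

Dog:
    <> == Noun
    <mor root> == dog.

% 5. Closing declarations.
# hide Noun.
# show <mor sing> <mor plur>.

% 6. The RCS Id comment.
% The next line is the Revision Control System Archive Id: do not delete it.
% $Id$
```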
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
6. File headers
Use the following format:
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
% %
% File: estonia1.dtr %
% Purpose: morphophonemics of Estonian in the style of Chomsky 1952 %
% Author: Gerald Gazdar, 1 September 1994 %
% Email: geraldg@cogs.sussex.ac.uk %
% Address: COGS, Sussex University, Brighton BN1 9QH, UK %
% Documentation: estonia1.doc and paper cited below %
% Related files: estonia2.dtr, latvian1.dtr %
% Version: 2.01 %
% %
% Copyright (c) University of Sussex 1994. All rights reserved. %
% %
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
Rationale: in addition to providing useful information about the file
and its author, the header can be used as a formal database entry
that allows sets of dtr (and other) files to be indexed, catalogued
and sorted. The choice of format is pretty arbitrary for the first
purpose but choice of the *same* format is crucial for the second.
The File, Purpose, Author and Version lines should be considered
obligatory, the rest optional. For the purposes of indexing
programs, the Purpose line needs to be self-contained, but you can
always add further lines for human readers:
% Purpose: morphophonemics of Estonian in the style of Chomsky 1952 %
% in which it is shown that Estonian and Japanese are %
% alphabetic variants of the same language. %
The date following the author's name is intended to be the date the
file was first made publicly available. If you want to note the date
of the most recent revision, then this can be done on the Version line:
% Version: 2.01 (25th December 1994) %
The Version number X.YY relates to the version of the file, the X
part being for major revisions, and the YY part for minor revisions.
The version number is distinct from any automatically included RCS
ID number found at the end of the file -- the latter relates only
to the edition of the DATR archive from which the dtr file was taken.
Although the Address line is not currently used by the program that
indexes dtr files, it will make life easier if you can fit all of
your address onto the line, rather than continuing below.
If your file is one of many (estonia1.dtr, estonia2.dtr,
estonia3.dtr, estonia4.dtr, estonia5.dtr, ...) then you may want to
use wildcard patterns in the Related files line:
% Related files: estonia*.dtr, latvian*.dtr %
Whether you include a copyright line is entirely up to you. For all
practical purposes, the existing archive of dtr files, like the
existing implementations, is treated as "academic public domain",
i.e., can be freely copied, used and modified for the purposes of
teaching and research provided that any use properly credits the
original author of the material. If you don't want your file to be
circulated in the way other dtr files are, then make sure you put
something like:
% NOT FOR PUBLIC CIRCULATION !!
at the top of the file.
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
7. Opening declarations
A general rule: declarations that relate to the way the theory
provides theorems (e.g., '# show' and '# hide' in those
implementations that support them) should appear at the end of the
file; all other declarations should normally appear at the top of the
file, immediately following the header and any general comment that
the file includes. Implementations vary as to the declarations they
permit, but all the following (if they are allowed at all) should
normally appear immediately before any node definitions:
# load 'estonia2.dtr'.
# reset.
# vars $vowel: a e i o u.
# atom T F.
# node top.
# nc node: <nodename>.
# qnode Lexeme.
Rationale: the reader needs to know what these declarations are in
order to make sense of the code that follows. For example, they will
assume that 'T' and 'F' are nodes unless they have seen the '# atom'
declaration, and
top:
    <> == undef.
will look like it includes a typo unless the reader has seen the
relevant '# node' declaration. However, it may sometimes make sense
to locate a declaration lower in the file. For example, if the only
code to make use of a variable is that for a finite state transducer
defined near the end of the file, then it may well be more
perspicuous to locate the variable declaration immediately above the
transducer node definitions.
If you use '# load' then you should include a comment that indicates
what it is that is being loaded.
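For instance, a minimal sketch in which the description of the loaded
file is an invented placeholder:

```datr
% Load the companion file estonia2.dtr. (Hypothetical description: it is
% assumed here to define the nodes shared by all the Estonian fragments.)
# load 'estonia2.dtr'.
```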
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
8. DATR code style
Put the name of the node being defined on a line on its own, followed
by all the equations associated with it, each on its own line, and each
indented four spaces from the left margin. Put at least one blank
line between each node definition. For example:
V:
    <cat> == v
    <sem> == pred1.

Adjective:
    <> == V
    <cat> == a.
Rationale: if you put the first equation on the same line as the node
name then constant indentation of equations becomes highly
problematic. The only solution that looks good is to indent every
equation to that required by the longest node name in the theory.
But this is a global matter, not one you can decide at the time you
define a particular node. So it requires you to re-edit the whole
file when you have completed it. It also has at least two further
disadvantages: (i) it normally means that the indentation for
equations is much greater than four spaces and this cramps the space
available for material on the RHS and may make it necessary to split
the RHS across two lines; and (ii) it makes the appearance of the
file contingent on choice of node name -- you may subsequently want
to change "Very_Long_Node_Name" to "Foo" and thus destroy the basis
for the 20 space indentation that you were previously obliged to use.
The alternatives, to indent all non-first equations to some standard
depth, or (worse) to indent the equations for each node differently
depending on the length of the node name, both look awful.
Having two or more equations on the same line will normally make the
code hard to read, as will failing to separate node definitions with
blank lines. In exceptional circumstances, it may make sense to
violate some or all of these recommendations. For example:
N1: <a0> == 0 <a1> == 1 <a2> == 0 <a3> == 1.
N2: <a0> == 1 <a1> == 1 <a2> == 0 <a3> == 1.
N3: <a0> == 0 <a1> == 1 <a2> == 1 <a3> == 1.
N4: <a0> == 0 <a1> == 1 <a2> == 0 <a3> == 0.
But this kind of case is rare in practice.
DATR allows you to call nodes, attributes and values whatever you
like. However, it is worth giving your names some thought and
adopting some principles and/or conventions that will help others to
understand your code, either because they are self-explicating or
because you include a comment explaining what your conventions are.
A typical convention you might consider adopting is to represent
abstract non-terminal nodes in capitals (e.g., VERB) but to represent
the leaves of the inheritance tree (typically lexeme nodes) with
initial capitals only (e.g., Love).
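As an illustration of this convention, the following sketch uses
invented nodes and attributes; neither VERB nor Love is taken from an
archive file.

```datr
% Abstract non-terminal node: all capitals.
VERB:
    <syn cat> == verb
    <mor past> == "<mor root>" ed.

% Lexeme node (a leaf of the inheritance tree): initial capital only.
Love:
    <> == VERB
    <mor root> == love.
```

A reader can tell at a glance that Love is a lexical entry and that
VERB is part of the abstract hierarchy it inherits from.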
Give some thought also to the *length* of your attribute names. This
may sound a peculiar suggestion but it has a very visible bearing on
how the theorems of your theory will appear and, hence, on how
intelligible they will be. A good policy to adopt is to use the same
character length for all attributes that can appear in a given
position in a path. To see why this is a good policy, compare the
two following examples:
Puer:
    <mor nom sing> = puer
    <mor voc sing> = puer
    <mor acc sing> = puer um
    <mor nom plur> = puer ii
    <mor voc plur> = puer ii
    <mor acc plur> = puer oos.

Puer:
    <mor nomin sg> = puer
    <mor vc sg> = puer
    <mor accusative sg> = puer um
    <mor nomin plural> = puer ii
    <mor vc plural> = puer ii
    <mor accusative plural> = puer oos.
If your DATR analysis deals, however marginally, with more than one
level of linguistic description, then you should probably use
attributes like 'phn', 'mor', 'syn', 'sem', etc., as the first items
in all the relevant paths. This will make your code easier for
others to read and may well make it easier for you to develop and
maintain it.
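A node that touches several levels of description might then look
something like this sketch. The node name, the attributes after the
first position and the values are all hypothetical.

```datr
Cat:
    <phn root> == k a t
    <mor root> == cat
    <syn cat> == n
    <sem pred> == cat1.
```

The first attribute of each path tells the reader which level of
description the equation belongs to.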
There is a price to be paid for putting comments on the same line
as DATR code. If the % comment characters are not aligned throughout
the file, then it will look a mess. But keeping them aligned is
subject to all the problems associated with indentation discussed
above. In particular, global substitutions of node, attribute, or
value names will often destroy your alignment and require further
tedious editing to restore it.
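One way to sidestep the alignment problem, offered here as a
suggestion rather than as a rule of this style sheet, is to put
comments on lines of their own above the code they describe. The node
below is a hypothetical example.

```datr
% Nouns: by default the plural is the root followed by 's'.
Noun:
    <syn cat> == n
    <mor plur> == "<mor root>" s.
```

Because the comment starts at the left margin, renaming a node,
attribute or value cannot disturb its position.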
Aligning the '==' in equations is tempting and can sometimes improve
readability. But it is usually impossible to maintain the alignment
over more than a few nodes without leaving enormous gaps (that can
themselves make the code less readable) and such an alignment is
subject to the vagaries of subsequent changes to attribute names.
Such alignment should thus be used sparingly and only when it makes
an obvious contribution to the readability of the code.
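A small case where alignment arguably helps is a single node whose
paths differ little in length. The node and attributes below are
hypothetical.

```datr
Noun:
    <syn cat>    == n
    <syn number> == sing
    <syn gender> == neut.
```

The gaps here are at most three spaces wide. Once a path such as
<syn subcategorization> is added, every other line needs a long run of
padding, which is the problem described above.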
Be orderly (in Grice's sense) in your presentation of the nodes in
your theory. If the structure of the nodes approximates to a tree
then one kind of orderly presentation would start with the root node
and end with the leaves. But if the interest of the theory lies in
the leaves, then an alternative presentation would have the leaves
at the beginning. Give some thought to how the order of presentation
can assist the human reader of your file in making sense of your
analysis.
If your code contains utility nodes (like CASE or boolean connectives)
or other nodes that serve a specialist function, then you may want
to put them together near the end of the node definitions so that
they do not clutter or obscure the logic of the substantive content
of the file.
To separate sections of material in your file, use the sequence of
spaces and percent symbols that appears immediately below. It has
the advantage that you can simply copy it from your file header and
it is consistent with the style set by that header (which the
alternatives probably aren't).
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
9. Closing declarations
Declarations that relate to the choice of theorems to be derived
should be placed after all the node definitions. The DATR code
can be understood without reference to them and there are often
a great many of them, best hidden away near the end. If the
number of nodes being hidden, or paths being shown, is small,
then they can conveniently appear on one line:
# hide Cat Noun Verb Adjective.
# show <case> <number> <gender> <tense>.
If there are many (especially in the case of paths) then it is
likely to look better if you employ separate lines with
the standard indentation of four spaces:
# show
    <mor nom sing>
    <mor voc sing>
    <mor acc sing>
    <mor nom plur>
    <mor voc plur>
    <mor acc plur>.
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
10. The RCS Archive Id comment
Add the following to the very end of your file:
% The next line is the Revision Control System Archive Id: do not delete it.
% $Id$
RCS is a freely available version control system available on both
MS-DOS and Unix. By including this line, you ensure that a record
will be kept within the file in the event that the file is ever
maintained under RCS. That record will look something like this:
% $Id: estonia1.dtr 2.09 94/09/04 13:02:27 geraldg Exp $
and appear as the last line of the file, in place of '% $Id$'.
It is anticipated that the Sussex archive of dtr files will be
maintained under RCS.
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %