
5 DATR stylesheet

% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
%                                                                           %
% File:            dtrstyle.txt                                             %
% Purpose:         style sheet for DATR .dtr files                          %
% Author:          Gerald Gazdar, 25 September 1994                         %
% Email:           geraldg@cogs.sussex.ac.uk                                %
% Address:         COGS, Sussex University, Brighton BN1 9QH, UK            %
% Version:         1.01                                                     %
%                                                                           %
%      Copyright (c) University of Sussex 1994.  All rights reserved.       %
%                                                                           %
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %


0.  Introduction

    This document contains a set of conventions, recommendations and
    suggestions for the formatting and layout of theories/fragments
    written in the lexical knowledge representation language DATR.  Each
    proposal is accompanied by an example and by a rationale for the
    proposal, where one exists.  The proposals are based on many hours
    spent editing other people's dtr files for inclusion in the public
    DATR archive and on comments and suggestions made by Roger Evans,
    Dafydd Gibbon and Jim Kilbury.


% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %

1.  Filenames

    Adopt MS-DOS file name conventions even if working on a Unix system:
    lower case characters only, up to eight characters in the prefix and
    a three-character suffix after the period.  Many people use DATR on
    MS-DOS systems and long Unix names will not be preserved.  Make the
    prefix as mnemonic as you can (for others as well as yourself) but
    allow for the possibility of other files on the same topic -- hence
    'estonia1.dtr' or 'eston_a.dtr' rather than 'estonian.dtr'.  Check
    the DATR dtr file archive to see if a name is already in use or if an
    existing name can be used as a model.

    Other recommended file suffixes:

    estonia1.dec -- file containing declarations relating to estonia1.dtr
    estonia1.dmp -- file containing a theorem dump from estonia1.dtr
    estonia1.doc -- file containing text documentation for estonia1.dtr
    estonia1.dto -- compiled version - internal code - of estonia1.dtr
    estonia1.dtv -- file containing a dump of values from estonia1.dtr

% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %

2.  Fonts

    DATR fragments should always be printed or displayed using a
    monospaced font.

    Rationale: use of a proportional font will make it impossible to
    preserve vertical character alignments.


% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %

3.  Character sets

    All existing implementations of DATR allow one to use all the
    characters found on a US typewriter keyboard.  At least one widely
    used implementation (QDATR) will also allow you to use characters
    found only on keyboards designed for European languages other than
    English.  You may want to bear this fact in mind when you decide how to
    encode your analysis.  If the orthographic representation is not the
    main issue and there are conventions for representing the language in
    7 bit ASCII, then you should simply embrace those conventions in the
    interests of the portability of your analysis.  But if many
    non-ASCII characters are involved and you want your theorems to
    deliver the language as it is actually written, then you may decide
    to use those characters in your encoding.  Whichever way you go, it
    makes sense to include a comment near the top of the file which
    addresses the character set issue.
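
    As an illustration, such a comment might look like the following.
    The transliteration convention it describes (long vowels written as
    doubled letters) is only an assumed example, not a recommendation:

% Character set: 7 bit ASCII only.  Long vowels are written as doubled
% letters (e.g., 'oo' and 'ii') instead of using accented characters.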


% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %

4.  Tabs or spaces

    NEVER EVER USE TABS!

    Rationale: the appearance of tabs depends on the output device and
    thus the appearance of your fragment will vary across output devices.
    The indentation you intend may not be the indentation that appears.
    In addition, since tabs are invisible, subsequent editing is likely
    to lead to the file containing a mixture of tabs and spaces used to
    the same end (e.g., indentation) which will make the results of
    global edits unpredictable.


% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %

5.  Order of material in the file

    1. The file header.
    2. Opening comment explaining what the file is and any relevant
       background material.  For long files it may well be useful to
       include a table of contents.
    3. Opening declarations (see below).
    4. Node definitions -- normally presented with the top of the
       inheritance tree at the top, and the leaves at the bottom.
    5. Closing declarations (see below).
    6. The RCS Id comment (see below).

    You should also include, within a comment, some examples of typical
    theorems that can be derived from the theory: these will help users
    to see what the "sensible" queries are (which may not otherwise be
    obvious, especially if theorem dump declarations are not included).
    These example theorems can be included in the opening comment or
    immediately prior to the closing declarations.
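
    The following skeleton sketches one way the ordering might look in
    practice.  The nodes VERB and Love, and the attributes used, are
    hypothetical and serve only to make the skeleton concrete:

% 1. The file header goes here (see section 6).

% 2. Opening comment: a toy fragment with one verb class and one lexeme.
%    Typical theorems:
%
%        Love: <syn cat> = verb.
%        Love: <mor form> = love.

% 3. Opening declarations would go here (none are needed in this toy).

% 4. Node definitions.

VERB:
    <syn cat> == verb
    <mor form> == "<mor root>".

Love:
    <> == VERB
    <mor root> == love.

% 5. Closing declarations.

# hide VERB.

# show <syn cat> <mor form>.

% 6. The RCS Id comment.

% The next line is the Revision Control System Archive Id: do not delete it.
% $Id$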

% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %

6.  File headers

    Use the following format:

% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
%                                                                           %
% File:            estonia1.dtr                                             %
% Purpose:         morphophonemics of Estonian in the style of Chomsky 1952 %
% Author:          Gerald Gazdar, 1 September 1994                          %
% Email:           geraldg@cogs.sussex.ac.uk                                %
% Address:         COGS, Sussex University, Brighton BN1 9QH, UK            %
% Documentation:   estonia1.doc and paper cited below                       %
% Related files:   estonia2.dtr, latvian1.dtr                               %
% Version:         2.01                                                     %
%                                                                           %
%      Copyright (c) University of Sussex 1994.  All rights reserved.       %
%                                                                           %
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %

    Rationale: in addition to providing useful information about the file
    and its author, the header can be used as a formal database entry
    that allows sets of dtr (and other) files to be indexed, catalogued
    and sorted.  The choice of format is pretty arbitrary for the first
    purpose but choice of the *same* format is crucial for the second.
    The File, Purpose, Author and Version lines should be considered
    obligatory, the rest optional.  For the purposes of indexing
    programs, the Purpose line needs to be self-contained, but you can
    always add further lines for human readers:

% Purpose:         morphophonemics of Estonian in the style of Chomsky 1952 %
%                  in which it is shown that Estonian and Japanese are      %
%                  alphabetic variants of the same language.                %

    The date following the author's name is intended to be the date the
    file was first made publicly available.  If you want to note the date
    of the most recent revision, then this can be done on the Version line:

% Version:         2.01 (25th December 1994)                                %

    The Version number X.YY relates to the version of the file, the X
    part being for major revisions, and the YY part for minor revisions.
    The version number is distinct from any automatically included RCS
    ID number found at the end of the file -- the latter relates only
    to the edition of the DATR archive from which the dtr file was taken.

    Although the Address line is not currently used by the program that
    indexes dtr files, it will make life easier if you can fit all of
    your address onto the line, rather than continuing below.

    If your file is one of many (estonia1.dtr, estonia2.dtr,
    estonia3.dtr, estonia4.dtr, estonia5.dtr, ...) then you may want to
    use wildcard patterns in the Related files line:

% Related files:   estonia*.dtr, latvian*.dtr                               %

    Whether you include a copyright line is entirely up to you.  For all
    practical purposes, the existing archive of dtr files, like the
    existing implementations, is treated as "academic public domain",
    i.e., can be freely copied, used and modified for the purposes of
    teaching and research provided that any use properly credits the
    original author of the material.  If you don't want your file to be
    circulated in the way other dtr files are, then make sure you put
    something like:

% NOT FOR PUBLIC CIRCULATION !!

    at the top of the file.


% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %

7.  Opening declarations

    A general rule: declarations that relate to the way the theory
    provides theorems (e.g., '# show' and '# hide' in those
    implementations that support them) should appear at the end of the
    file; all other declarations should normally appear at the top of the
    file, immediately following the header and any general comment that
    the file includes.  Implementations vary as to the declarations they
    permit, but all the following (if they are allowed at all) should
    normally appear immediately before any node definitions:

# load 'estonia2.dtr'.
# reset.
# vars $vowel: a e i o u.
# atom T F.
# node top.
# nc node: <nodename>.
# qnode Lexeme.

    Rationale: the reader needs to know what these declarations are in
    order to make sense of the code that follows.  For example, they will
    assume that 'T' and 'F' are nodes unless they have seen the '# atom'
    declaration, and

top:
    <> == undef.

    will look like it includes a typo unless the reader has seen the
    relevant '# node' declaration.  However, it may sometimes make sense
    to locate a declaration lower in the file.  For example, if the only
    code to make use of a variable is that for a finite state transducer
    defined near the end of the file, then it may well be more
    perspicuous to locate the variable declaration immediately above the
    transducer node definitions.

    If you use # load then you should include a comment that indicates
    what it is that is being loaded.
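
    For instance (the description of what estonia2.dtr contains is
    purely hypothetical):

% Load the noun declension classes shared with estonia2.dtr.
# load 'estonia2.dtr'.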


% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %

8.  DATR code style

    Put the name of the node being defined on a line on its own, followed
    by all the equations associated with it, each on its own line and
    each indented four spaces from the left margin.  Put at least one
    blank line between node definitions.  For example:

V:
    <cat> == v
    <sem> == pred1.

Adjective:
    <> == V
    <cat> == a.

    Rationale: if you put the first equation on the same line as the node
    name then constant indentation of equations becomes highly
    problematic.  The only solution that looks good is to indent every
    equation to that required by the longest node name in the theory.
    But this is a global matter, not one you can decide at the time you
    define a particular node.  So it requires you to re-edit the whole
    file when you have completed it.  It also has at least two further
    disadvantages: (i) it normally means that the indentation for
    equations is much greater than four spaces and this cramps the space
    available for material on the RHS and may make it necessary to split
    the RHS across two lines; and (ii) it makes the appearance of the
    file contingent on choice of node name -- you may subsequently want
    to change "Very_Long_Node_Name" to "Foo" and thus destroy the basis
    for the 20 space indentation that you were previously obliged to use.
    The alternatives, to indent all non-first equations to some standard
    depth, or (worse) to indent the equations for each node differently
    depending on the length of the node name, both look awful.
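
    To make the problem concrete, here is a sketch of the layout being
    argued against, using the hypothetical node names mentioned above.
    Every equation in the file has to follow the longest name, and
    renaming that node would leave the indentation without a basis:

Very_Long_Node_Name: <cat> == v
                     <sem> == pred1.

Foo:                 <> == Very_Long_Node_Name
                     <cat> == a.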

    Having two or more equations on the same line will normally make the
    code hard to read, as will failing to separate node definitions with
    blank lines.  In exceptional circumstances, it may make sense to
    violate some or all of these recommendations.  For example:

N1: <a0> == 0    <a1> == 1    <a2> == 0    <a3> == 1.
N2: <a0> == 1    <a1> == 1    <a2> == 0    <a3> == 1.
N3: <a0> == 0    <a1> == 1    <a2> == 1    <a3> == 1.
N4: <a0> == 0    <a1> == 1    <a2> == 0    <a3> == 0.

    But this kind of case is rare in practice.

    DATR allows you to call nodes, attributes and values whatever you
    like.  However, it is worth giving your names some thought and
    adopting some principles and/or conventions that will help others to
    understand your code, either because they are self-explanatory or
    because you include a comment explaining what your conventions are.
    A typical convention you might consider adopting is to represent
    abstract non-terminal nodes in capitals (e.g., VERB) but to represent
    the leaves of the inheritance tree (typically lexeme nodes) with
    initial capitals only (e.g., Love).
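
    A minimal sketch of that convention, with hypothetical node and
    attribute names:

NOUN:
    <syn cat> == noun.

Dog:
    <> == NOUN
    <mor root> == dog.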

    Give some thought also to the *length* of your attribute names.  This
    may sound a peculiar suggestion but it has a very visible bearing on
    how the theorems of your theory will appear and, hence, on how
    intelligible they will be.  A good policy to adopt is to use the same
    character length for all attributes that can appear in a given
    position in a path.  To see why this is a good policy, compare the
    two following examples:

Puer:
    <mor nom sing> = puer
    <mor voc sing> = puer
    <mor acc sing> = puer um
    <mor nom plur> = puer ii
    <mor voc plur> = puer ii
    <mor acc plur> = puer oos.

Puer:
    <mor nomin sg> = puer
    <mor vc sg> = puer
    <mor accusative sg> = puer um
    <mor nomin plural> = puer ii
    <mor vc plural> = puer ii
    <mor accusative plural> = puer oos.

    If your DATR analysis deals, however marginally, with more than one
    level of linguistic description, then you should probably use
    attributes like 'phn', 'mor', 'syn', 'sem', etc., as the first items
    in all the relevant paths.  This will make your code easier for
    others to read and may well make it easier for you to develop and
    maintain it.
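
    For example, a lexeme that carries information at several levels
    might be laid out as follows.  The node VERB is assumed to be
    defined elsewhere, and the attributes after the level prefix and
    all the values are invented for illustration:

Love:
    <> == VERB
    <phn root> == l uh v
    <mor root> == love
    <syn args> == two
    <sem pred> == love2.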

    There is a price to be paid for putting comments on the same line
    as DATR code.  If the % comment characters are not aligned throughout
    the file, then it will look a mess.  But keeping them aligned is
    subject to all the problems associated with indentation discussed
    above.  In particular, global substitutions of node, attribute, or
    value names will often destroy your alignment and require further
    tedious editing to restore it.

    Aligning the '==' in equations is tempting and can sometimes improve
    readability.  But it is usually impossible to maintain the alignment
    over more than a few nodes without leaving enormous gaps (that can
    themselves make the code less readable) and such an alignment is
    subject to the vagaries of subsequent changes to attribute names.
    Such alignment should thus be used sparingly and only when it makes
    an obvious contribution to the readability of the code.

    Be orderly (in Grice's sense) in your presentation of the nodes in
    your theory.  If the structure of the nodes approximates to a tree
    then one kind of orderly presentation would start with the root node
    and end with the leaves.  But if the interest of the theory lies in
    the leaves, then an alternative presentation would have the leaves
    at the beginning.  Give some thought to how the order of presentation
    can assist the human reader of your file in making sense of your
    analysis.

    If your code contains utility nodes (like CASE or boolean connectives)
    or other nodes that serve a specialist function, then you may want
    to put them together near the end of the node definitions so that
    they do not clutter or obscure the logic of the substantive content
    of the file.
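
    As a sketch of what such a utility node might look like, here is
    one way to write a boolean conjunction.  It assumes that 'T' and 'F'
    are declared as atoms (as in section 7); the declaration is shown
    next to the node only for convenience, and the node name AND is an
    illustrative choice:

# atom T F.

AND:
    <> == F
    <T T> == T.

    By default inheritance, any path beginning with T T yields T and
    every other path falls back on <> and yields F.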

    To separate sections of material in your file, use the sequence of
    spaces and percent symbols that appears immediately below.  It has
    the advantage that you can simply copy it from your file header and
    it is consistent with the style set by that header (which the
    alternatives probably aren't).


% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %

9.  Closing declarations

    Declarations that relate to the choice of theorems to be derived
    should be placed after all the node definitions.  The DATR code
    can be understood without reference to them and there are often
    a great many of them, best hidden away near the end.  If the
    number of nodes being hidden, or paths being shown, is small,
    then they can conveniently appear on one line:

# hide Cat Noun Verb Adjective.

# show <case> <number> <gender> <tense>.

    If there are many (especially in the case of paths) then it is
    likely to look better if you employ separate lines with
    the standard indentation of four spaces:

# show
    <mor nom sing>
    <mor voc sing>
    <mor acc sing>
    <mor nom plur>
    <mor voc plur>
    <mor acc plur>.


% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %

10. The RCS Archive Id comment

    Add the following to the very end of your file:

% The next line is the Revision Control System Archive Id: do not delete it.
% $Id$

    RCS is a freely available version control system that runs on both
    MS-DOS and Unix.  By including this line, you ensure that a record
    will be kept within the file in the event that the file is ever
    maintained under RCS.  That record will look something like this:

% $Id: estonia1.dtr 2.09 94/09/04 13:02:27 geraldg Exp $

    and appear as the last line of the file, in place of '% $Id$'.
    It is anticipated that the Sussex archive of dtr files will be
    maintained under RCS.

% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %



Dafydd Gibbon
Tue Sep 17 16:45:21 MET DST 1996