
Describing hypertexts: `markup systems'

Hypertext document syntax

A text model contains specifications of a range of factors, such as those proposed by Bühler or Jakobson, speech act theorists or discourse analysts. One of these factors is the syntax of the text.

Text syntax covers a range of features, from the overall sequential organisation of texts and parts of texts to the `break-up' into media-specific organisational units such as pages (and, at a further level, files: the WWW convention is to have one page per file). Text syntax may be specified in any of a wide range of formal notations, including attribute grammars, or, if the text is `tree-structured', in terms of straightforward context-free phrase structure rules.

In the WWW environment, the syntax of individual documents is specified in Hypertext Markup Language (HTML). HTML is a specific application of the Standard Generalized Markup Language (SGML). Other terms for `markup' are `annotation', `tagging' and `labelling'. At the levels of morphology and sentence syntax, there exist `taggers', programmes which provide automatic grammatical or morphological markup, and for a number of languages standardised `tagsets' have been developed and agreed on for this purpose.

There is no general convention for specifying hypertext structures in the WWW, however. This is a complicated matter, and for completeness would need to take browser and server properties into account as well as the properties of the text.

A hypertext in the WWW environment may be a single page-file, with links between parts of the text on the page, or it may be a collection of pages linked as a tree or a graph, with links pointing to other page-files or to a graphics file or to a Java application. Page-files themselves are either HTML documents or plain text.

Starting from the bottom level of texts as inscriptions in an object language, a text markup (traditionally known in linguistics as a structural description or SD) is a statement about the object language text, made in a metalanguage such as HTML (LaTeX can also be seen as a markup language). The markup contains images of segments of the object text, and markup labels for these segments. The markup language HTML is described (defined) in SGML, a generic language for defining markup applications; HTML is an application designed for document markup in the WWW. SGML is thus a higher order metalanguage for describing (defining) markup languages. SGML itself is specified in a general formal definition language such as BNF (Backus-Naur Form). The hierarchy of metalanguages is shown in the Figure.

[Markuphierarchy]

A markup language hierarchy centring on SGML and HTML is not the only possible strategy for providing markup. There are many other conventions at Level 4, the Markup Level, for example for specifying sentence structures, or for dialogue transcription. Linguistic structural descriptions, for example, are markups.

Alternative techniques start with different specification conventions altogether and if a WWW document needs to be generated, translation into HTML is done automatically. Documents can be written in LaTeX, for example, and translated into a graph-structured set of HTML documents by means of the programme latex2html. I have experimented with representing hypertexts with a semantic network representation language (DATR) and translating these representations directly into HTML.

It is important to note that HTML does not prescribe possible structures above the individual document level, but allows arbitrary links between documents, and from positions within documents to positions within (possibly the same) documents. This is the most interesting area of hypertext syntax, however, and different genres of hypertext can be defined on the basis of different model-driven constraints on hypertext structure.
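The two kinds of link mentioned here (between documents, and between positions within documents) can be illustrated with a minimal Python sketch. The function names `href_for` and `render_page`, the page names, and the convention that a page called `soup` lives in a file `soup.html` are assumptions made for this illustration only; they are not part of HTML or of any tool mentioned in the text.

```python
from html import escape

def href_for(current, page, anchor=None):
    """Build a relative link target: another page-file, a position in
    another page-file, or a position in the current page-file."""
    # Assumed convention: the page called "soup" is stored as "soup.html".
    url = "" if (page == current and anchor) else page + ".html"
    return url + ("#" + anchor if anchor else "")

def render_page(name, sections, links):
    """Return one HTML document as a string.

    sections: list of (position_id, text) pairs, the linkable positions.
    links:    list of (label, target_page, target_position_or_None).
    """
    lines = ["<html>",
             "<head><title>" + escape(name) + "</title></head>",
             "<body>"]
    for anchor, text in sections:
        # The id attribute names a position that links may point to.
        lines.append('<p id="%s">%s</p>' % (escape(anchor), escape(text)))
    for label, page, anchor in links:
        target = escape(href_for(name, page, anchor))
        lines.append('<a href="%s">%s</a>' % (target, escape(label)))
    lines += ["</body>", "</html>"]
    return "\n".join(lines)

page = render_page(
    "index",
    [("top", "Soup recipes")],
    [("The recipe", "soup", "method"),   # position in another document
     ("Back to top", "index", "top")],   # position in the same document
)
print(page)
```

Nothing in the sketch restricts which pages may link to which: any constraint on the overall shape of the hypertext has to come from outside HTML, which is the point made above.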

Models for Hypertext syntax

A syntax defines an abstract structure, and it could perhaps be an interesting art form to design hypertexts on purely abstract structural grounds, by analogy with abstract painting or sculpture, or phonetic poetry. In order to assign content to a syntactic structure, it must be interpreted in terms of a model. Mathematically, a model is a structure of some kind, but for the interpretation of signs we require a different notion of a model, one which is an approximation to our experience of reality.

Our experience of reality may be approximated to varying degrees of granularity, and on the basis of various abstractions of properties from what we experience. For example, we may consider the following features of the syntax of texts (or sentences, or words, ...).

The simplest text syntax model for hypertext is the linear arrangement of parts of the text in time. This model is in fact the basic model for simple traditional texts such as the straightforward chapter structure of a novel: the parts of the text can be regarded as hypertext documents linked end-to-beginning. A hypertext built on this syntax model is a string hypertext.

A more interesting text syntax model is that of a text with a table of contents and footnotes, which may be modelled by a tree structure model superimposed on the basic string model. A hypertext which corresponds to this model is a tree hypertext.

More interesting still is the text syntax model of a text with table of contents and footnotes, and with bibliographical references. Since the items in the bibliography may be referenced from more than one position, the model is no longer a tree, but a graph with re-entrancy (known in computational linguistics as structure sharing). A hypertext built on this model is a DAG hypertext or `directed acyclic graph hypertext'. The simpler models described above are special cases of DAG hypertexts.

Finally, a text syntax model which does not apparently occur in traditional texts is that in which a link points from a position back to a previous position in the DAG hypertext, resulting in a DCG hypertext, a `directed cyclic graph hypertext'.
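The four models can be told apart mechanically from the link structure. The following Python sketch is one possible way of doing so, under stated assumptions: the hypertext is given as a dictionary from each document to the list of documents it links to, it forms a single connected whole, and the function name `classify` and the document names are invented for the illustration. Cycles are detected with the standard technique of repeatedly removing documents that have no remaining incoming links.

```python
def classify(links):
    """Classify a link structure as 'string', 'tree', 'DAG' or 'DCG'.

    links maps each document to the list of documents it points to.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    # Count incoming links for every document.
    indegree = {n: 0 for n in nodes}
    for targets in links.values():
        for t in targets:
            indegree[t] += 1

    # Remove documents with no remaining incoming links, one by one.
    # If some documents can never be removed, they lie on a cycle.
    remaining = dict(indegree)
    ready = [n for n in nodes if remaining[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for t in links.get(n, []):
            remaining[t] -= 1
            if remaining[t] == 0:
                ready.append(t)

    if removed < len(nodes):
        return "DCG"      # a link leads back to an earlier position
    if any(d > 1 for d in indegree.values()):
        return "DAG"      # re-entrancy: a document is shared
    if any(len(t) > 1 for t in links.values()):
        return "tree"     # branching, but no sharing
    return "string"       # a simple chain

novel = {"ch1": ["ch2"], "ch2": ["ch3"]}
manual = {"toc": ["ch1", "ch2"], "ch1": ["fn1"]}
paper = {"toc": ["ch1", "ch2"], "ch1": ["bib"], "ch2": ["bib"]}
web = {"toc": ["ch1"], "ch1": ["bib"], "bib": ["toc"]}

for name, h in [("novel", novel), ("manual", manual),
                ("paper", paper), ("web", web)]:
    print(name, classify(h))
```

The order of the tests mirrors the inclusion relation between the models: a string is a special case of a tree, and a tree a special case of a DAG, so the most specific label that still fits is returned last.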

But text syntax has further dimensions which are not yet covered by these distinctions. Three levels of ordering need to be distinguished:

  1. The text syntax of the hypertext, on the basis of a DAG or DCG text syntax model for a given genre of hypertext. This is known as ID ordering or `immediate dominance' ordering in linguistics. This level is not identical to, but maps closely to, the conceptual interpretation of the hypertext.
  2. The relative linear ordering imposed on individual documents by the syntax of natural language, which may extend to the relation between individual documents, as in the case of the string hypertext. This ordering is sometimes known in computational linguistics as LP ordering or QLP ordering, for `linear precedence ordering' or `quasi-linear precedence ordering'. This ordering is imposed by certain properties of the medium, and includes line-break, page-break, and document-break specifications. This level corresponds to the abstract layout which can be defined in a language such as HTML.
  3. The absolute linear (or higher dimensional) ordering imposed by the viewing situation, in terms of absolute position in space and time coordinates. These coordinate-based descriptions are also abstractions from our specific experiences, though they are intended to mirror these experiences. Coordinate-based ordering will be called CM ordering, for `coordinate matrix ordering'. This ordering corresponds, for example, to the structure imposed by the browser environment on a specific computer.
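
The three levels can be kept apart in a small Python sketch. Everything named here is an assumption made for the illustration: the recipe example, the representation of ID structure as unordered sets, the representation of LP constraints as `(before, after)` pairs, and the function names `linearise` and `place`. The standard library module `graphlib` (Python 3.9 or later) is used to find an order that satisfies the precedence constraints.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# ID ordering: which unit dominates which; daughters are unordered.
id_structure = {"recipe": {"title", "ingredients", "method"}}

# LP ordering: relative precedence between sister units.
lp_rules = [("title", "ingredients"), ("ingredients", "method")]

def linearise(daughters, lp_rules):
    """Order the daughters so that every LP rule (a before b) holds."""
    sorter = TopologicalSorter({d: set() for d in daughters})
    for before, after in lp_rules:
        if before in daughters and after in daughters:
            sorter.add(after, before)   # 'after' must follow 'before'
    return list(sorter.static_order())

def place(units, units_per_page):
    """CM ordering: give each unit absolute (page, slot) coordinates,
    as a particular viewing situation would."""
    return {u: (i // units_per_page, i % units_per_page)
            for i, u in enumerate(units)}

sequence = linearise(id_structure["recipe"], lp_rules)
print(sequence)
print(place(sequence, units_per_page=2))
```

The same ID structure and LP rules yield different coordinates as soon as `units_per_page` changes, in the same way that one HTML document is laid out differently by different browsers.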

The specification of the syntax of a hypertext document, for example in HTML, will be interpretable in terms of our own viewing experience by using a model of this kind. The model is very general, and applies to linguistic `microstructures' (where it has been known as the ILEX, or `Integrated Lexicon') model, as well as to hypertext `macrostructures'.

[Markuphierarchy]

Hypertext document traversal

Traversal of a graph is analogous to `browsing' or `navigating' in the World Wide Web. The user decides at which page to start and where to finish; a user who knows the absolute address of a given page may jump directly to this page. Otherwise, the structure of links within a document, and within the system of documents which forms a hypertext, determines how the hypertext may be traversed from a given starting point:

  1. Tree with tree-structured Contents index: arbitrary jumps within the tree are possible.
  2. Tree with no index: one consistent strategy would be left-right depth-first traversal to some maximum depth (depth-limited depth-first traversal), that is: start at the top, take the leftmost child node, from here take the leftmost child node, and so on, until a leaf of the tree or the maximum defined depth is reached; then backtrack to the parent, take the next child node to the right, and continue the left-right, depth-first strategy from there.
  3. This strategy may be used as a basis for defining Traversal Links between pages: the notion of Next may mean the next item in a left-right depth-first traversal strategy; Previous may be the inverse of Next.
  4. Combined Contents indexing and Traversal indexing may be used, with a jump to the Contents index included in every page. This is the linking strategy used in latex2html, for example.
  5. Other indexing strategies are possible, as a glance at different indices in books shows.
  6. Dynamic indexing: a browser will generally keep a history stack of pointers to nodes in the order in which they were actually visited. This will be a mirror of the link structure if a simple traversal linking system is used. However, if a mixture of traversal strategies is used (for instance via a Contents index and via another traversal strategy), then the path actually taken may not reflect the structure of any one strategy.
  7. If a document structure is not a tree but a general graph, then some documents will be pointed to by more than one other document. From the document concerned it will not be clear what the `previous' document was, but the `heuristic' for deciding this question is the history stack, which shows precisely from which document a given document was in fact reached.
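
The traversal strategy, the derived Next and Previous links, and the history stack described in the list above can be sketched in Python as follows. The tree, the page names, and the names `preorder`, `traversal_links` and `History` are assumptions made for the illustration; this is not the algorithm of any particular browser or of latex2html.

```python
def preorder(tree, root, max_depth):
    """Left-right, depth-first order of pages, down to max_depth.

    tree maps each page to the list of its child pages, left to right.
    """
    order = []

    def visit(node, depth):
        order.append(node)
        if depth < max_depth:
            for child in tree.get(node, []):
                visit(child, depth + 1)
        # Returning from visit() is the backtracking step.

    visit(root, 0)
    return order

def traversal_links(order):
    """Map each page to its (Previous, Next) pages in the traversal."""
    links = {}
    for i, page in enumerate(order):
        previous_page = order[i - 1] if i > 0 else None
        next_page = order[i + 1] if i + 1 < len(order) else None
        links[page] = (previous_page, next_page)
    return links

class History:
    """A history stack: pages in the order actually visited."""

    def __init__(self):
        self.visited = []

    def visit(self, page):
        self.visited.append(page)

    def came_from(self):
        """The page from which the current page was in fact reached."""
        return self.visited[-2] if len(self.visited) >= 2 else None

tree = {"contents": ["ch1", "ch2"],
        "ch1": ["s1.1", "s1.2"],
        "ch2": ["s2.1"]}

order = preorder(tree, "contents", max_depth=2)
links = traversal_links(order)
print(order)
print(links["s1.2"])

history = History()
for page in ["contents", "ch2", "s2.1"]:   # a jump via the Contents index
    history.visit(page)
print(history.came_from())
```

In the usage example the reader jumps from the Contents index straight to `ch2`, so the history stack reports `ch2` as the page before `s2.1`, which here happens to agree with the Previous link; after a jump from `contents` to `s2.1` the two would differ.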

Tasks:

  1. Think of tree diagrammes in linguistic structural descriptions. How would a diagramme of a hypertext resemble one of these?
  2. Think of the attribute-value structures in the HPSG or in DATR (if you know them). Could you use them to describe a hypertext?
  3. What would `nodes', `edges', and `node labels' and `edge labels' correspond to in a hypertext?

HTML

The syntax of individual documents in WWW hypertext is specified in Hypertext Markup Language (HTML). HTML is a specific application of the Standard Generalized Markup Language (SGML).

Introductions to HTML should be consulted for definitions of HTML syntax and, possibly, hints on HTML use.

There is no general convention for specifying larger hypertext structures in the WWW, however. This is a complicated matter, and for completeness would need to take browser and server properties into account as well as the properties of the text. HTML itself is located, within the model presented above, at the intermediate level of layout.

A hypertext in the WWW environment may be a single page-file containing a document whose layout syntax is specified in HTML (or is plain text), with links between parts of the text on the page, or it may be a collection of pages linked as a tree or a graph, with links pointing to other page-files or to a graphics file or to a Java application.

Tasks:

  1. Select a basic set of HTML tags which you consider will be useful for making a text.
  2. How would you characterise the various HTML constructs from a linguistic point of view?
  3. Do you agree with the point of view sometimes met in the HTML introductory literature that with HTML you can distinguish between the `logical structure' and the `layout' of a document? Is there a clear distinction?
  4. How is the layout of documents affected by the HTML description, and how far is it affected by the browser (and your computer)?
  5. What is the minimal HTML document?
  6. Describe your favourite recipe and describe your description in HTML. Maybe start with something like `Soup. Take the soup out of the fridge and warm it in the microwave.', and then make it into a more detailed conventional recipe.
  7. Make a description of the soup from the point of view of an experienced soup taster (but take pity on his taste buds).
  8. Design a soup advertisement and describe the advertisement in HTML.
  9. Is it possible to define `locutionary acts' and `illocutionary acts' (in the sense of either Austin or Searle) in HTML?

Text design and `HTML': A design procedure

The following steps will be distinguished on the basis of discussion in the preceding sections:

  1. Specifying the model. The general model is selected, for instance a Jakobsonian model, augmented by speech act types, and the parameters of the general model are instantiated to provide a specific model in terms of the desired finished product.
  2. Formulating the theory. The syntax of the model is specified at the different hierarchical and realisational levels. At the final stage, the theory is translated automatically or manually into an HTML hypertext structure.
  3. Evaluating the theory relative to the model. The result is checked with the model specifications, both in terms of the formal specifications and in terms of subjective assessment of the result.



Dafydd Gibbon, Thu Feb 15 15:07:15 MET 2001