A text model contains specifications of a range of factors, such as those proposed by Bühler or Jakobson, speech act theorists or discourse analysts. One of these factors is the syntax of the text.
Text syntax covers a range of features, from the overall sequential organisation of texts and parts of texts to the `break-up' into media-specific organisational units such as pages (and, at a further level, files: the WWW convention is to have one page per file). Text syntax may be specified in any of a wide range of formal notations, including attribute grammars, or, if the text is `tree-structured', in terms of straightforward context-free phrase structure rules.
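As a rough illustration of the tree-structured case, the following sketch writes a few phrase structure rules as a Python table and checks whether a given text tree is licensed by them. The category names (TEXT, BODY, CHAPTER and so on) and the function name are assumptions made for this example only; they are not taken from any standard.

```python
# Hypothetical context-free phrase structure rules for a tree-structured text.
# Each category maps to the list of child sequences it may expand to.
RULES = {
    "TEXT": [["TITLE", "BODY"]],
    "BODY": [["CHAPTER"], ["CHAPTER", "BODY"]],
    "CHAPTER": [["HEADING", "PARAGRAPH"]],
}


def licensed(tree):
    """Return True if every local tree in `tree` is licensed by RULES.

    A tree is a (label, children) pair; a terminal category has no children.
    """
    label, children = tree
    if not children:
        # A leaf must be a terminal category, i.e. one without expansion rules.
        return label not in RULES
    child_labels = [child[0] for child in children]
    return child_labels in RULES.get(label, []) and all(
        licensed(child) for child in children
    )


chapter = ("CHAPTER", [("HEADING", []), ("PARAGRAPH", [])])
good_text = ("TEXT", [("TITLE", []), ("BODY", [chapter])])
bad_text = ("TEXT", [("BODY", [chapter]), ("TITLE", [])])
```

Here `licensed(good_text)` holds, while `licensed(bad_text)` does not, because the rules require the title to precede the body.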
In the WWW environment for searching the Internet, the syntax of individual documents is specified in Hypertext Markup Language (HTML). HTML is a specific application of the Standard Generalized Markup Language (SGML). Other terms for `markup' are `annotation', `tagging' and `labelling'. At the levels of morphology and sentence syntax, there exist `taggers', programmes for providing automatic grammatical or morphological markup, and for a number of languages, standardised `tagsets' have been developed and agreed on for this purpose.
There is no general convention for specifying hypertext structures in the WWW, however. This is a complicated matter, and for completeness would need to take browser and server properties into account as well as the properties of the text.
A hypertext in the WWW environment may be a single page-file, with links between parts of the text on the page, or it may be a collection of pages linked as a tree or a graph, with links pointing to other page-files or to a graphics file or to a Java application. Page-files themselves are either HTML documents or plain text.
Starting from the bottom level of texts as inscriptions in an object language, a text markup (traditionally known in linguistics as a structural description or SD) is a statement in a metalanguage such as HTML (LaTeX can also be seen as a markup language) about the object language text. The markup contains images of segments of the object text, and markup labels for these segments. The markup language HTML is described (defined) in SGML, a generic language for defining markup applications; HTML is an application designed for document markup in the WWW. SGML is a higher order metalanguage for describing (defining) markup languages. SGML itself is specified in a general data type definition language such as BNF (Backus Naur Form). The hierarchy of metalanguages is shown in the Figure.
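The idea that a markup consists of images of text segments together with labels for those segments can be sketched in a few lines of Python. The function name `mark_up` and the sample segments are assumptions for illustration; `h1` and `p` are ordinary HTML element names, and `html.escape` is the standard library routine for protecting characters that are special in HTML.

```python
from html import escape


def mark_up(segments):
    """Render (label, text) pairs as labelled segments in HTML notation.

    Each object-language segment is copied into the markup (escaped where
    necessary) and enclosed in a start tag and an end tag carrying its label.
    """
    return "".join(
        f"<{label}>{escape(text)}</{label}>" for label, text in segments
    )


segments = [("h1", "Text syntax"), ("p", "Trees & graphs")]
marked = mark_up(segments)
```

For the two segments above, the result is `<h1>Text syntax</h1><p>Trees &amp; graphs</p>`.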
A markup language hierarchy centring on SGML and HTML is not the only possible strategy for providing markup. There are many other conventions at Level 4, the Markup Level, for example for specifying sentence structures, or for dialogue transcription. Linguistic structural descriptions, for example, are markups.
Alternative techniques start with different specification conventions altogether and if a WWW document needs to be generated, translation into HTML is done automatically. Documents can be written in LaTeX, for example, and translated into a graph-structured set of HTML documents by means of the programme latex2html. I have experimented with representing hypertexts with a semantic network representation language (DATR) and translating these representations directly into HTML.
It is important to note that HTML does not prescribe possible structures above the individual document level, but allows arbitrary links between documents, and from positions within documents to positions within (possibly the same) documents. This is the most interesting area of hypertext syntax, however, and different genres of hypertext can be defined on the basis of different model-driven constraints on hypertext structure.
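Since links are the only hypertext structure that HTML itself expresses, a first step in studying hypertext syntax is to collect them. The following is a minimal sketch using Python's standard `html.parser` and `urllib.parse` modules; the class name `LinkCollector`, the function name `extract_links` and the sample page are assumptions for this example. Each link is split into a target document and a target position within that document.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class LinkCollector(HTMLParser):
    """Collect the targets of all <a href="..."> links in one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []  # (target document, target position) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Split "chapter2.html#sec1" into ("chapter2.html", "sec1").
            document, position = urldefrag(href)
            self.links.append((document, position))


def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


page = (
    '<p><a href="#notes">see notes</a> '
    '<a href="chapter2.html#sec1">next</a> '
    '<a name="notes">Notes</a></p>'
)
links = extract_links(page)
```

An empty target document, as in the first link, stands for a position within the same document; the anchor without an `href` attribute is a link target rather than a link and is skipped.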
A syntax defines an abstract structure, and it could perhaps be an interesting art form to design hypertexts on purely abstract structural grounds, by analogy with abstract painting or sculpture, or phonetic poetry. In order to assign content to a syntactic structure, it must be interpreted in terms of a model. Mathematically, a model is a structure of some kind, but for the interpretation of signs we require a different notion of a model, one which is an approximation to our experience of reality.
Our experience of reality may be approximated to varying degrees of granularity, and on the basis of various abstractions of properties from what we experience. For example, we may consider the following features of the syntax of texts (or sentences, or words, ...).
The simplest text syntax model for hypertext is linear arrangement of parts of the text in time. This model is in fact the basic model for simple traditional texts such as the straightforward chapter structure of a novel: the parts of the text can be regarded as hypertext documents linked end-to-beginning. A hypertext built on this syntax model is a string hypertext.
A more interesting text syntax model is that of a text with a table of contents and footnotes, which may be modelled by a tree structure model superimposed on the basic string model. A hypertext which corresponds to this model is a tree hypertext.
More interesting still is the text syntax model of a text with table of contents and footnotes, and with bibliographical references. Since the items in the bibliography may be referenced from more than one position, the model is no longer a tree, but a graph with re-entrancy (known in computational linguistics as structure sharing). A hypertext built on this model is a DAG hypertext or `directed acyclic graph hypertext'. The simpler models described above are special cases of DAG hypertexts.
Finally, a text syntax model which does not apparently occur in traditional texts is that in which a position links back to a previous position in the DAG hypertext, resulting in a DCG hypertext, a `directed cyclic graph hypertext'.
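The four models can be told apart mechanically once the links of a hypertext are known. The sketch below is one possible formulation, not a definitive one: it assumes that the hypertext is given as a Python dictionary from page names to lists of link targets, that each link is listed once, and that the hypertext is connected (connectedness is not checked). The function name and the page names are hypothetical. Cycles are detected with the standard technique of repeatedly removing pages that have no incoming links.

```python
from collections import deque


def classify_hypertext(links):
    """Return 'string', 'tree', 'DAG' or 'DCG' for a page -> targets mapping."""
    # Collect every page, including pages that occur only as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    in_degree = {page: 0 for page in pages}
    for targets in links.values():
        for target in targets:
            in_degree[target] += 1

    # Remove pages without incoming links until none are left to remove;
    # any page that can never be removed lies on or behind a cycle.
    remaining = dict(in_degree)
    queue = deque(page for page in pages if remaining[page] == 0)
    removed = 0
    while queue:
        page = queue.popleft()
        removed += 1
        for target in links.get(page, []):
            remaining[target] -= 1
            if remaining[target] == 0:
                queue.append(target)
    if removed < len(pages):
        return "DCG"

    # Acyclic: re-entrancy (a page referenced more than once) gives a DAG.
    if any(count > 1 for count in in_degree.values()):
        return "DAG"
    # No re-entrancy: branching gives a tree, otherwise a string.
    if any(len(targets) > 1 for targets in links.values()):
        return "tree"
    return "string"


novel = {"ch1": ["ch2"], "ch2": ["ch3"], "ch3": []}
with_contents = {"toc": ["ch1", "ch2"], "ch1": ["fn1"], "ch2": []}
with_bibliography = {"toc": ["ch1", "ch2"], "ch1": ["bib"], "ch2": ["bib"]}
with_back_link = {"ch1": ["ch2"], "ch2": ["ch1"]}
```

The function reports the most specific model that fits, in keeping with the observation that the simpler models are special cases of DAG hypertexts: the four examples are classified as string, tree, DAG and DCG respectively.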
But text syntax has two dimensions which are not yet covered by these distinctions:
The specification of the syntax of a hypertext document, for example in HTML, will be interpretable in terms of our own viewing experience by using a model of this kind. The model is very general, and applies to linguistic `microstructures' (where it has been known as the ILEX, or `Integrated Lexicon', model) as well as to hypertext `macrostructures'.
Traversal of a graph is analogous to `browsing' or `navigating' in the World Wide Web. The user decides at which page to start and where to finish; if he knows the absolute address of a given page, he may jump directly to this page. But if not, the structure of links within a document and a system of documents which form a hypertext determines how it may be traversed, given a starting point:
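One way to picture how link structure constrains traversal is to compute which pages can be reached at all from a given starting point. The following breadth-first sketch assumes the same dictionary representation of links as before (page name to list of targets); the function name and the page names are hypothetical.

```python
from collections import deque


def reachable_pages(links, start):
    """List the pages reachable from `start` by following links, nearest first."""
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                visited.append(target)
                queue.append(target)
    return visited


site = {
    "index": ["ch1", "ch2"],
    "ch1": ["bib"],
    "ch2": ["bib", "index"],
    "orphan": ["index"],
}
from_index = reachable_pages(site, "index")
from_bib = reachable_pages(site, "bib")
```

Starting at `index`, every page except `orphan` can be reached; starting at `bib`, which has no outgoing links, the user can go no further without knowing the absolute address of another page.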
Tasks:
The syntax of individual documents in WWW hypertext is specified in Hypertext Markup Language (HTML). HTML is a specific application of the Standard Generalized Markup Language (SGML).
These introductions should be consulted for definitions of HTML syntax and, possibly, hints on HTML use.
There is no general convention for specifying larger hypertext structures in the WWW, however. This is a complicated matter, and for completeness would need to take browser and server properties into account as well as the properties of the text. HTML itself is located, within the model presented above, at the intermediate level of layout.
A hypertext in the WWW environment may be a single page-file containing a document whose layout syntax is specified in HTML (or is plain text), with links between parts of the text on the page, or it may be a collection of pages linked as a tree or a graph, with links pointing to other page-files or to a graphics file or to a Java application.
Tasks:
The following steps will be distinguished on the basis of discussion in the preceding sections: