A text model contains specifications of a range of factors, such as those proposed by Bühler or Jakobson, speech act theorists or discourse analysts. One of these factors is the syntax of the text.
Text syntax covers a range of features, from the overall sequential organisation of texts and parts of texts to the 'break-up' into media-specific organisational units such as pages (and, at a further level, files: the WWW convention is to have one page per file). Text syntax may be specified in any of a wide range of formal notations, including attribute grammars, or, if the text is 'tree-structured', in terms of straightforward context-free phrase structure rules.
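For a tree-structured text, phrase structure rules of this kind can be written down and checked in a few lines. The following is a minimal sketch in Python; the category names (text, title, body, section and so on) and the function conforms are illustrative assumptions, not part of any standard.

```python
# Hypothetical context-free phrase structure rules for a tree-structured text.
# Each category maps to the list of its permitted right-hand sides.
RULES = {
    "text": [["title", "body"]],
    "body": [["section"], ["section", "body"]],
    "section": [["heading", "paragraphs"]],
    "paragraphs": [["paragraph"], ["paragraph", "paragraphs"]],
}


def conforms(tree):
    """Return True if every local tree is licensed by a rule in RULES.

    A tree is a pair (label, content): content is a string for a terminal
    category and a list of subtrees for a non-terminal category.
    """
    label, content = tree
    if isinstance(content, str):
        # Terminal categories must not have expansion rules of their own.
        return label not in RULES
    child_labels = [child[0] for child in content]
    if child_labels not in RULES.get(label, []):
        return False
    return all(conforms(child) for child in content)


# A one-section text, written out as a labelled tree.
doc = ("text", [
    ("title", "Text syntax"),
    ("body", [
        ("section", [
            ("heading", "Introduction"),
            ("paragraphs", [("paragraph", "A text model contains ...")]),
        ]),
    ]),
])
```

Calling conforms(doc) accepts this tree; a tree whose text node lacks a title is rejected, because no rule licenses that expansion.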
In the WWW environment for searching the Internet, the syntax of individual documents is specified in Hypertext Markup Language (HTML). HTML is a specific application of the Standard Generalized Markup Language (SGML). Other terms for 'markup' are 'annotation', 'tagging' and 'labelling'. At the levels of morphology and sentence syntax, there exist 'taggers', programmes for providing automatic grammatical or morphological markup, and for a number of languages, standardised 'tagsets' have been developed and agreed on for this purpose.
There is no general convention for specifying hypertext structures in the WWW, however. This is a complicated matter, and for completeness would need to take browser and server properties into account as well as the properties of the text.
A hypertext in the WWW environment may be a single page-file, with links between parts of the text on the page, or it may be a collection of pages linked as a tree or a graph, with links pointing to other page-files or to a graphics file or to a Java application. Page-files themselves are either HTML documents or plain text.
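The distinction between links within a page and links between page-files can be made concrete by collecting the link targets of each page. The sketch below uses Python's standard html.parser and urllib.parse modules; the class LinkCollector, the function link_graph and the page names are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class LinkCollector(HTMLParser):
    """Collect the href values of all anchor elements in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def link_graph(pages):
    """Map each page name to a list of (target page, position) pairs.

    A link such as "#top" has no page part and therefore points back
    into the page that contains it.
    """
    graph = {}
    for name, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        targets = []
        for href in collector.links:
            target, fragment = urldefrag(href)
            targets.append((target or name, fragment))
        graph[name] = targets
    return graph


# Two hypothetical page-files: one link within a page, two between pages.
pages = {
    "a.html": '<a href="#top">up</a> <a href="b.html#s2">next</a>',
    "b.html": '<a href="a.html">back</a>',
}
graph = link_graph(pages)
```

A hypertext consisting of a single page-file yields a graph with one node whose links all point back to itself; a collection of pages yields a tree or a general graph, depending on the links.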
Starting from the bottom level of texts as inscriptions in an object language, a text markup (traditionally known in linguistics as a structural description or SD) is a statement in a metalanguage such as HTML (LaTeX can also be seen as a markup language) about the object language text. The markup contains images of segments of the object text, and markup labels for these segments. The markup language HTML is described (defined) in SGML, a generic language for defining markup applications; HTML is an application designed for document markup in the WWW. SGML is thus a higher order metalanguage for describing (defining) markup languages. SGML itself is specified in a general data type definition language such as BNF (Backus Naur Form). The hierarchy of metalanguages is shown in Figure 1.
Figure 1: Markup language hierarchy
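The relation between the markup and the object text at the lower levels of this hierarchy can be illustrated by separating the two again. The sketch below, using Python's standard html.parser module, recovers the object text and lists each labelled segment; the class name MarkupReader and the sample sentence are assumptions, and the sketch only handles properly nested tags.

```python
from html.parser import HTMLParser


class MarkupReader(HTMLParser):
    """Split a marked-up text into object text and labelled segments."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # [label, text parts] for elements not yet closed
        self.segments = []       # (label, segment of object text)
        self.object_text = []    # the text with all markup removed

    def handle_starttag(self, tag, attrs):
        self.open_elements.append([tag, []])

    def handle_data(self, data):
        self.object_text.append(data)
        # The data belongs to every element that is currently open.
        for entry in self.open_elements:
            entry[1].append(data)

    def handle_endtag(self, tag):
        if self.open_elements and self.open_elements[-1][0] == tag:
            label, parts = self.open_elements.pop()
            self.segments.append((label, "".join(parts)))


reader = MarkupReader()
reader.feed("<p>Text <em>syntax</em> matters.</p>")
reader.close()
text = "".join(reader.object_text)
```

Here text holds the object language inscription, while reader.segments pairs each markup label (p, em) with an image of the segment it describes.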
A markup language hierarchy centring on SGML and HTML is not the only possible strategy for providing markup. There are many other conventions at Level 4, the Markup Level, for example for specifying sentence structures, or for dialogue transcription. Linguistic structural descriptions, for example, are markups.
Alternative techniques start with different specification conventions altogether; if a WWW document needs to be generated, translation into HTML is done automatically. Documents can be written in LaTeX, for example, and translated into a graph-structured set of HTML documents by means of the programme latex2html. I have experimented with representing hypertexts with a semantic network representation language (DATR) and translating these representations directly into HTML.
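The general idea of generating HTML from a different source representation can be sketched as follows. This is not DATR and not the latex2html procedure: a plain Python dictionary stands in for the network representation, and the node names, the keys title, text and links, and the function to_html are all assumptions made for illustration.

```python
from html import escape


def to_html(network):
    """Translate each node of a small network into one HTML page-file."""
    pages = {}
    for node, spec in network.items():
        # One list item per outgoing link, labelled with the target's title.
        links = "".join(
            f'<li><a href="{escape(target)}.html">'
            f'{escape(network[target]["title"])}</a></li>'
            for target in spec["links"]
        )
        pages[f"{node}.html"] = (
            f"<html><head><title>{escape(spec['title'])}</title></head>"
            f"<body><p>{escape(spec['text'])}</p><ul>{links}</ul></body></html>"
        )
    return pages


# A hypothetical two-node network with links in both directions.
network = {
    "index": {"title": "Home", "text": "Models & texts", "links": ["syntax"]},
    "syntax": {"title": "Text syntax", "text": "Rules.", "links": ["index"]},
}
pages = to_html(network)
```

The result is one page-file per node, so the link structure of the generated hypertext mirrors the structure of the source representation.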
It is important to note that HTML does not prescribe possible structures above the individual document level, but allows arbitrary links between documents, and from positions within documents to positions within (possibly the same) documents. This is the most interesting area of hypertext syntax, however, and different genres of hypertext can be defined on the basis of different model-driven constraints on hypertext structure.
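One possible model-driven constraint is that a hypertext of a given genre must be tree-structured. Since HTML itself does not enforce this, the constraint has to be checked on the link structure. The sketch below is one way of doing so, under the assumption that the hypertext is given as a mapping from page names to lists of linked page names; the function is_tree and the page names are hypothetical.

```python
def is_tree(links, root):
    """Return True if the pages form a tree rooted at root.

    Every page must be reachable from the root, and no page may be
    reached by more than one link (which also rules out cycles).
    """
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    seen = {root}
    stack = [root]
    while stack:
        page = stack.pop()
        for target in links.get(page, []):
            if target in seen:
                return False  # a second path to the same page, or a cycle
            seen.add(target)
            stack.append(target)
    return seen == all_pages


# A strictly hierarchical hypertext and one with a link back to the root.
hierarchy = {"index": ["a", "b"], "a": ["c"], "b": [], "c": []}
with_back_link = {"index": ["a"], "a": ["index"]}
```

Here is_tree(hierarchy, "index") holds, while is_tree(with_back_link, "index") does not; a genre that permits 'back' links would need a weaker constraint than this one.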