- Single space: [ SP | NL ]+ (i.e. a sequence of at least one white space character)
- Atoms:
  - sequence of non-special alphanumeric characters bounded by separators
  - sequence of any characters except ", enclosed by "..." (or any characters except ', enclosed by '...')
- Tags: Start-tag | End-tag
- Start-tag: < Atom Property* >, e.g. <P ALIGN=CENTER>
- End-tag: < / Atom >, e.g. </P>, </TABLE>
- Property: Atom = Atom | Atom, e.g. ALIGN=CENTER, NOSHADE
- Upper/lower case: atoms in HTML tags are case-insensitive; the exceptions are addresses (e.g. URLs given as property values), whose case must be preserved.
- Text: sequence of atoms separated by white space
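
The token classes listed above can be illustrated with a minimal sketch in Python, using the standard `re` module. It is not the implementation used in these notes: the names `TOKEN_RE` and `tokenize` are hypothetical, the set of special characters (`<`, `>`, `/`, `=`) is an assumption read off the tag rules, and unquoted atoms are simplified to runs of any non-special, non-space characters rather than strictly alphanumeric ones.

```python
import re

# Hypothetical token pattern following the rules above (names are illustrative).
TOKEN_RE = re.compile(r"""
    (?P<SPACE>[ \t\r\n]+)        # Single space: one or more white space characters
  | (?P<DQUOTED>"[^"]*")         # Atom enclosed in double quotes
  | (?P<SQUOTED>'[^']*')         # Atom enclosed in single quotes
  | (?P<SPECIAL>[<>/=])          # assumed special characters, acting as separators
  | (?P<ATOM>[^\s<>/="']+)       # Atom: run of non-special, non-space characters
""", re.VERBOSE)


def tokenize(text):
    """Return a list of (kind, value) pairs for the given HTML text.

    Runs of white space collapse to one space, and quoted atoms are
    returned as ATOM tokens with the enclosing quotes removed.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            # Reached for input no rule covers, e.g. an unterminated quote.
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        kind, value = m.lastgroup, m.group()
        if kind == "SPACE":
            value = " "
        elif kind in ("DQUOTED", "SQUOTED"):
            kind, value = "ATOM", value[1:-1]
        tokens.append((kind, value))
        pos = m.end()
    return tokens


if __name__ == "__main__":
    for token in tokenize("<P ALIGN=CENTER>Hello   world</P>"):
        print(token)
```

The sketch stops at the level of atoms, separators and white space. Grouping `<`, atoms, properties and `>` into start-tags and end-tags, and folding the case of atoms inside tags, would be a further step on top of this token list.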
© Dafydd Gibbon
Mon Jul 13 18:34:24 MET DST 1998