- Characters:
  - smallest units (cf. phonemes)
  - in many document description languages, encoded as sequences of ASCII codes (exception: the internal representations used by word processors)
- Tokens, symbols:
  - smallest interpretable units (cf. morphemes)
  - identified by tokenisation, and
  - generally describable with a regular grammar (or a finite state automaton)
- Objects:
  - groups of tokens with complex meaning (cf. phrases)
  - identified by parsing, and
  - generally describable with a context-free grammar (or a push-down automaton)
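
The three levels above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not a full HTML analyser: tags carry no attributes, there are no comments or entities, and every opening tag must have a closing tag. The names `TOKEN_RE`, `tokenise` and `parse` are hypothetical and chosen for this sketch. Tokens are picked out of the character sequence with a regular expression (the regular-grammar level), and objects are built with an explicit stack (the push-down level).

```python
import re

# Token classes, each described by a regular expression.
# Assumption: simplified tags without attributes.
TOKEN_RE = re.compile(r"""
    (?P<CLOSE_TAG></[A-Za-z][A-Za-z0-9]*>)   # closing tag, e.g. </b>
  | (?P<OPEN_TAG><[A-Za-z][A-Za-z0-9]*>)     # opening tag, e.g. <b>
  | (?P<TEXT>[^<]+)                          # characters up to the next tag
""", re.VERBOSE)


def tokenise(chars):
    """Characters -> tokens: a list of (kind, value) pairs."""
    tokens = []
    pos = 0
    while pos < len(chars):
        m = TOKEN_RE.match(chars, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        kind = m.lastgroup
        value = m.group()
        if kind == "OPEN_TAG":
            value = value[1:-1]      # strip "<" and ">"
        elif kind == "CLOSE_TAG":
            value = value[2:-1]      # strip "</" and ">"
        tokens.append((kind, value))
        pos = m.end()
    return tokens


def parse(tokens):
    """Tokens -> objects: nested (tag, children) pairs, built with a stack."""
    root = ("root", [])
    stack = [root]                   # the push-down store
    for kind, value in tokens:
        if kind == "TEXT":
            stack[-1][1].append(value)
        elif kind == "OPEN_TAG":
            node = (value, [])
            stack[-1][1].append(node)
            stack.append(node)       # enter the new object
        else:                        # CLOSE_TAG
            if len(stack) == 1 or stack[-1][0] != value:
                raise ValueError(f"unmatched closing tag </{value}>")
            stack.pop()              # leave the current object
    if len(stack) != 1:
        raise ValueError(f"unclosed tag <{stack[-1][0]}>")
    return root[1]


if __name__ == "__main__":
    print(parse(tokenise("<p>Hi <b>there</b></p>")))
```

The regular expression alone cannot check that tags nest properly, since matching arbitrarily deep opening and closing pairs is beyond a finite state automaton; that check is what the stack in `parse` adds.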
© Dafydd Gibbon
Mon Jul 13 18:34:24 MET DST 1998