Tokenisation is the identification of the character groupings that constitute the basic units of text, such as words and punctuation. A common first step is to surround punctuation marks with blanks, so that they, like words, are delimited by whitespace. Given a normalised text data format of the kind already discussed, the procedure shown in Figure 10 will perform this task.

Figure 10: Simple tokenisation.
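A procedure of the kind shown in Figure 10 can be sketched as follows; this is only an illustrative version, assuming a `tokenise` function and the particular set of punctuation marks chosen here, not the original figure's code.

```python
import re

def tokenise(text):
    # Surround common punctuation marks with blanks, then split on
    # whitespace -- the primitive strategy described in the text.
    spaced = re.sub(r'([.,;:!?()"])', r' \1 ', text)
    return spaced.split()

print(tokenise("Hello, world! This is a test."))
# ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']
```

Because the punctuation marks are padded with blanks before splitting, each mark emerges as a token in its own right rather than remaining attached to the preceding word.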
This is, like the normalisation example, a rather primitive kind of tokenisation. A more sophisticated approach, known as tagging, not only identifies tokens but also assigns them to types, e.g. parts of speech (POS). Taggers (tagging software) may be based on algebraic linguistic formalisms, such as regular expressions (i.e. finite state descriptions), or on a combination of algebraic and numerical formalisms, as in stochastic taggers, which generally use a variety of probabilistic finite state machine called a Hidden Markov Model (HMM), though more complex parsing algorithms may also be involved. Stochastic taggers generate hypotheses about word classes on the basis of transitional probability estimates for word sequences; these estimates are derived from the observed frequencies of word combinations (e.g. bigrams, trigrams -- combinations of two, three words, and so on).
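The way a stochastic tagger derives transitional probabilities from bigram counts, and then selects the most likely tag sequence, can be sketched with a toy example. The corpus, tag set, and the Viterbi-style search below are all invented for illustration; a real tagger would be trained on a large annotated corpus and would smooth its probability estimates.

```python
from collections import defaultdict

# Toy tagged corpus -- invented for illustration only.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

# Estimate transition (tag bigram) and emission counts from the corpus.
trans = defaultdict(lambda: defaultdict(int))
emit = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    prev = "<s>"  # sentence-start marker
    for word, tag in sent:
        trans[prev][tag] += 1
        emit[tag][word] += 1
        prev = tag

def prob(table, key, item):
    # Relative-frequency estimate of P(item | key).
    total = sum(table[key].values())
    return table[key][item] / total if total else 0.0

def viterbi(words):
    """Most likely tag sequence under the bigram HMM (Viterbi search)."""
    tags = list(emit)
    # best[t] = (probability, tag path) for the best path ending in tag t
    best = {t: (prob(trans, "<s>", t) * prob(emit, t, words[0]), [t])
            for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            p, path = max(
                ((best[pt][0] * prob(trans, pt, t) * prob(emit, t, w),
                  best[pt][1]) for pt in tags),
                key=lambda x: x[0],
            )
            new[t] = (p, path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["the", "dog", "sleeps"]))
# ['DET', 'NOUN', 'VERB']
```

Even with this tiny corpus the tagger resolves the sequence correctly, because the DET-NOUN and NOUN-VERB transitions dominate the observed bigram frequencies; real systems apply the same principle to millions of observed word and tag combinations.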