Tokenization

Tokenization is the process of breaking a piece of text down into smaller units called tokens.
For example, "This is a sentence." is tokenized as: <This>, <is>, <a>, <sentence>, <.>


There are two types of text tokenization (both are illustrated in the sketch after this list):

  • Sentence Tokenization: splits text into individual sentences.
  • Word Tokenization: splits text into individual words and punctuation marks.
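
A short sketch of both, assuming NLTK is installed (depending on the NLTK version, the "punkt" or "punkt_tab" resource must be downloaded first):

    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize

    nltk.download("punkt", quiet=True)  # older NLTK; newer versions use "punkt_tab"

    text = "This is a sentence. Here is another one."

    print(sent_tokenize(text))
    # ['This is a sentence.', 'Here is another one.']

    print(word_tokenize(text))
    # ['This', 'is', 'a', 'sentence', '.', 'Here', 'is', 'another', 'one', '.']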

Types of tokenizers (see the sketch after this list):

  • Deterministic (rule-based): apply a fixed set of preset rules (e.g. splitting on whitespace and punctuation) and do not change with data.
  • Stochastic (trained/learned models): learned from data, so they can be trained or fine-tuned on a new text corpus to improve.
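
A minimal sketch of the contrast, assuming NLTK is available; NLTK's Punkt sentence tokenizer is used here as one example of a trainable (statistical) tokenizer, and the tiny training corpus is purely illustrative:

    import re
    from nltk.tokenize.punkt import PunktSentenceTokenizer

    # Deterministic (rule-based): a fixed regular expression, no training step.
    def rule_based_tokenize(text):
        return re.findall(r"\w+|[^\w\s]", text)

    print(rule_based_tokenize("Dr. Smith arrived."))
    # ['Dr', '.', 'Smith', 'arrived', '.']  -- the fixed rule treats "Dr." as two tokens

    # Stochastic (trained/learned): Punkt learns abbreviation and
    # sentence-boundary statistics from raw text, so it can be retrained
    # on a new corpus to improve on domain-specific text.
    training_corpus = (
        "Dr. Smith works at Acme Corp. in Springfield. "
        "He meets Dr. Jones at 9 a.m. every day. "
        "Dr. Jones runs the lab."
    )
    punkt = PunktSentenceTokenizer(training_corpus)  # training happens here
    print(punkt.tokenize("Dr. Smith arrived. The meeting began."))

The output of the last call depends on what the model learned from the training text; with a realistic corpus it would learn that "Dr." is an abbreviation rather than a sentence boundary.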