Token
The basic unit a model reads and writes. Often a subword, sometimes a word, sometimes a single character.
Tokenization is the first step in any language-model pipeline. Whitespace tokenizers simply split on spaces; byte-pair encoding (BPE) tokenizers settle on a granularity between words and characters by starting from individual characters and repeatedly merging the most frequent adjacent pairs seen in a training corpus.
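The BPE training loop can be sketched in a few lines of plain Python. This is a toy illustration, not any library's actual implementation; the corpus and the number of merges are arbitrary choices for the example.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from single characters and greedily learn merges.
tokens = list("low lower lowest")
for _ in range(3):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))

print(tokens)  # 'low' has become a single learned subword token
```

After a few merges the frequent substring "low" is a single token, while rarer suffixes like "est" remain split into characters — exactly the word-to-character spectrum described above.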