Tokenization (NLP)
Tokenization, in the context of Large Language Models (LLMs), is a fundamental preprocessing step that breaks a sequence of text into smaller units called tokens, which are then converted into numerical IDs the model can process. As the first step in the pipeline, tokenization directly impacts the model's performance, efficiency, fairness, and language understanding capabilities.
Main Concepts
Token
A token is a discrete unit of text that a language model processes. Depending on the chosen tokenization method, a token can represent:
- An entire word (e.g., "cat").
- A part of a word or subword (e.g., "un-", "-pack-", "-ing").
- An individual character (e.g., "a", "b", "c").
- A byte (in the case of byte-level tokenization).
Each unique token is assigned a specific index number from the tokenizer's vocabulary.
Tokenizer Vocabulary
The vocabulary is the complete set of all possible tokens that the model can recognize. The vocabulary size is an important hyperparameter:
- A large vocabulary allows more words to be represented as single tokens, which improves semantic coverage and shortens sequence lengths, but it enlarges the embedding layer and increases model size and training complexity.
- A small vocabulary is more compact, but it requires splitting rare or complex words into more subwords, which can lengthen sequences and make it harder to capture semantics.
Vocabulary size varies significantly across models: from ~50,000 tokens in GPT-2 to over 100,000 in modern models like GPT-4 (100,277) and LLaMA-3 (128,000).
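The token-to-ID mapping can be sketched with a toy vocabulary (the tokens, IDs, and the `<unk>` fallback below are illustrative, not taken from any real model):

```python
# Toy vocabulary: every known token gets a fixed integer ID.
# Real vocabularies hold tens of thousands of entries; these are made up.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
id_to_token = {i: t for t, i in vocab.items()}

def encode(tokens):
    """Map tokens to IDs, falling back to <unk> for unknown tokens."""
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

def decode(ids):
    """Map IDs back to their tokens."""
    return [id_to_token[i] for i in ids]

print(encode(["the", "cat", "sat", "on", "the", "mat"]))  # [1, 2, 3, 4, 1, 5]
print(encode(["the", "dog"]))  # [1, 0] -- "dog" is out of vocabulary
```

The `<unk>` fallback is exactly the OOV behavior discussed below: any word outside the vocabulary collapses to a single uninformative ID.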
Main Tokenization Methods
There are three main levels of tokenization granularity.
1. Word-level Tokenization
- Principle: The text is split into individual words based on delimiters (spaces, punctuation).
- Advantages: It is intuitive; token sequences are shorter, which reduces computational load.
- Disadvantages:
- Out-of-Vocabulary (OOV) problem: The model cannot process words absent from the training vocabulary, including typos and newly coined words.
- Large vocabulary size: Requires storing all unique words, which is particularly problematic for morphologically rich languages.
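A minimal word-level tokenizer can be sketched with a regular expression (the pattern here is illustrative; production tokenizers handle many more edge cases):

```python
import re

def word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Don't panic, it's fine."))
# ['Don', "'", 't', 'panic', ',', 'it', "'", 's', 'fine', '.']
```

Even this tiny example shows the design decisions involved: contractions like "Don't" are split apart, and every distinct surface form would need its own vocabulary entry.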
2. Character-level Tokenization
- Principle: The text is split into individual characters.
- Advantages:
- No OOV problem: Any word can be represented as a sequence of characters.
- Small vocabulary: Limited to the size of the alphabet and special characters.
- Disadvantages:
- Long sequences: Text is converted into very long token sequences, significantly increasing computational costs.
- Loss of semantics: It is harder for the model to capture meaning as it operates on individual characters rather than whole words.
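In Python, character-level tokenization is a one-liner, which also makes the sequence-length cost easy to see:

```python
text = "tokenization"

# Word-level: 1 token. Character-level: one token per character.
chars = list(text)
print(chars)
print(len(chars))  # 12 -- a single word already costs 12 tokens
```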
3. Subword Tokenization
This is an intermediate approach and the most popular one today, combining the advantages of the previous methods.
- Principle: Frequently used words remain whole tokens, while rare or unknown words are broken down into smaller, meaningful parts (subwords).
- Advantages:
- Effectively handles OOV words and morphological variations.
- Controlled vocabulary size.
- Captures the morphological structure of words.
- Main algorithms:
- Byte Pair Encoding (BPE): An iterative algorithm that starts with a set of characters and progressively merges the most frequent pairs into new tokens. Used in GPT models. Byte-level BPE, used in GPT-2 and RoBERTa, treats words as sequences of bytes, which completely solves the OOV problem.
- WordPiece: An algorithm similar to BPE, but it selects pairs for merging that maximize the likelihood of the training data. Used in BERT models.
- Unigram LM: Unlike BPE/WordPiece, this method starts with a large set of subwords and gradually prunes it by removing tokens that have the least impact on the overall corpus probability. This allows for multiple probable tokenizations for a single word (subword regularization).
- SentencePiece Toolkit: A library from Google that implements BPE and Unigram LM and treats text as a continuous stream of characters, making it universal for languages without explicit word delimiters (e.g., Chinese). Used in LLaMA and T5 models.
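The core BPE training loop described above can be sketched in a few lines (the toy corpus follows the classic example from Sennrich et al.; the word frequencies are illustrative, and the string-based merge is a simplification of real implementations):

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Merge every occurrence of `pair` into a single new symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in corpus.items()}

# Words as space-separated symbols (initially characters), with frequencies.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

# On this corpus the first three merges are ('e','s'), ('es','t'), ('l','o'),
# creating the subword tokens "es", "est", and "lo".
for step in range(3):
    best = get_pair_counts(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, best)
    print(step, best, list(corpus))
```

Each merge adds one token to the vocabulary, so the number of merge steps directly controls the final vocabulary size.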
Tokenization in Multimodal LLMs
In multimodal models, which work with more than just text, tokenization is extended to other data types:
- Visual tokenization: Images are broken down into small patches (e.g., 16x16 pixels), which are then converted into vector-tokens, similar to text tokens.
- Audio tokenization: Continuous audio signals are converted into a sequence of discrete tokens that represent short segments of sound.
- Unified approach (TEAL): A concept where data from any modality is first tokenized using a corresponding tokenizer, and then their embeddings are processed in a single joint space.
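Visual patch extraction can be sketched with NumPy (the image size and values below are made up; real models additionally project each flattened patch through a learned linear layer to obtain the token embedding):

```python
import numpy as np

# A toy 64x64 grayscale "image"; real models typically use e.g. 224x224 RGB.
image = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
patch = 16

# Cut the image into non-overlapping 16x16 patches and flatten each one.
patches = (image
           .reshape(64 // patch, patch, 64 // patch, patch)
           .transpose(0, 2, 1, 3)
           .reshape(-1, patch * patch))
print(patches.shape)  # (16, 256): 16 visual "tokens" of 256 values each
```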
Problems and Limitations
Tokenization, despite its importance, is a source of many problems in the operation of LLMs:
- Inconsistency and sensitivity: Small changes in the input data (a typo, capitalization, a trailing space) can drastically alter the tokenization, leading to unpredictable model behavior.
- Multilingual challenges: A single vocabulary for many languages is often inefficient for low-resource or morphologically rich languages, resulting in overly long token sequences.
- Impact on reasoning: Illogical splitting of numbers (e.g., "25,000" into "25", ",", "000") or symbols hinders the performance of arithmetic and symbolic tasks.
- Glitch Tokens: Anomalous or rare tokens from the training data (e.g., usernames from Reddit) that can trigger unpredictable or malicious model behavior.
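The number-splitting issue can be illustrated with a simplified pre-tokenization regex (hypothetical; real tokenizers such as GPT-2's use a far more elaborate pattern before applying subword merges):

```python
import re

# Letters, digit runs, and other symbols become separate pieces,
# so "25,000" is never seen by the model as a single quantity.
pattern = re.compile(r"[A-Za-z]+|\d+|[^A-Za-z\d\s]")

print(pattern.findall("25,000"))  # ['25', ',', '000']
print(pattern.findall("x+y=42"))  # ['x', '+', 'y', '=', '42']
```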
Evolving Landscape and Future Directions
Research in tokenization is actively pursuing the following directions:
- Tokenizer-free models: Developing models (CANINE, ByT5) that operate directly at the byte or character level to completely eliminate the explicit tokenization step and its associated problems.
- Adaptive and learnable tokenization: Creating tokenizers that can dynamically adapt to the language, domain, or even a specific input text, or are trained jointly with the main model.
- Cognitively-inspired approaches: Developing methods inspired by the cognitive science of human language processing (e.g., the "Principle of Least Effort") to create more semantically meaningful tokenizations.
Links
Literature
- Schuster, M.; Nakajima, K. (2012). Japanese and Korean Voice Search.
- Sennrich, R.; Haddow, B.; Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909.
- Kudo, T.; Richardson, J. (2018). SentencePiece: A Simple and Language-Independent Subword Tokenizer and Detokenizer for Neural Text Processing. arXiv:1808.06226.
- Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. arXiv:1804.10959.
- Song, X. et al. (2021). Fast WordPiece Tokenization.
- Mielke, S. J.; Dalmia, S.; Cotterell, R. (2021). A Brief History of Open-Vocabulary Modeling and Tokenization in NLP. arXiv:2112.10508.
- Xue, L. et al. (2022). ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. arXiv:2105.13626.
- Clark, J. H. et al. (2022). CANINE: Pre-Training an Efficient Tokenization-Free Encoder for Language Representation. arXiv:2103.06874.
- Limisiewicz, T.; Balhar, J.; Mareček, D. (2023). Tokenization Impacts Multilingual Language Modeling. arXiv:2305.17179.
- Pourmostafa Roshan Sharami, J.; Shterionov, D.; Spronck, P. (2023). A Systematic Analysis of Vocabulary and BPE Settings for Optimal Fine-Tuning of NMT. arXiv:2303.00722.
- Batsuren, K. et al. (2024). Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge. arXiv:2404.13292.
- Chai, Y. et al. (2024). Tokenization Falling Short: On Subword Robustness in Large Language Models. arXiv:2406.11687.