Token (LLM)

From Systems Analysis Wiki

Token — the smallest unit of text that a large language model (LLM) operates on. Before an LLM processes any text, the text is first converted into a sequence of tokens, which are then mapped to numerical identifiers that the model can analyze and compute with.

Depending on the tokenization strategy used, a token can represent:

  • A whole word (e.g., "house")
  • A part of a word or a root (a subword), such as "run" in "running"
  • A single character or punctuation mark (e.g., ",", "!")
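The subword case above can be illustrated with a toy greedy longest-match tokenizer. The vocabulary below is invented for illustration only; real tokenizers use vocabularies of tens of thousands of entries learned from data.

```python
# Toy illustration of subword tokenization: greedy longest-match
# against a small, hypothetical vocabulary (not any real model's vocab).
VOCAB = {"running", "run", "ning", "house", "un", "!", ","}

def tokenize(text: str) -> list[str]:
    """Split text into the longest matching vocabulary entries, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("running!"))   # ['running', '!']
print(tokenize("runhouse"))   # ['run', 'house']
```

Note how the same characters can yield a whole-word token ("running") or smaller pieces, depending on what the vocabulary contains.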

The use of tokens allows language models to learn and reproduce textual structures, identify patterns, and capture the semantics and syntax of text.

Tokenization

Tokenization is the process of breaking down source text into tokens and subsequently converting them into numerical identifiers that the model can understand.

This stage is fundamental to how large language models function. It enables an LLM to:

  • Analyze syntax — the structure of the text and the arrangement of its elements (words and phrases);
  • Extract semantics — the deeper meaning of the text and the relationships between its elements.
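The second half of tokenization — converting tokens into numerical identifiers — can be sketched as a vocabulary lookup. The vocabulary and the `<unk>` fallback token below are illustrative assumptions; real models use much larger learned vocabularies.

```python
# Minimal sketch of mapping tokens to numeric ids and back via a
# hypothetical vocabulary; unknown tokens map to a reserved <unk> id.
vocab = {"<unk>": 0, "the": 1, "house": 2, "run": 3, "ning": 4}

def encode(tokens: list[str]) -> list[int]:
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

def decode(ids: list[int]) -> list[str]:
    inverse = {i: t for t, i in vocab.items()}
    return [inverse[i] for i in ids]

ids = encode(["run", "ning"])
print(ids)          # [3, 4]
print(decode(ids))  # ['run', 'ning']
```

It is these integer sequences, not raw characters, that the model actually consumes and produces.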

There are several main methods of tokenization, including:

  • Byte Pair Encoding (BPE): An algorithm that iteratively merges the most frequent adjacent pairs of symbols into new tokens, allowing rare words and morphological variants to be handled efficiently.
  • WordPiece: Used in models like BERT, it breaks words into subword units, which helps in processing unknown words.
  • SentencePiece: A method that treats text as a sequence of raw characters and applies models based on BPE or Unigram for tokenization.
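The core BPE idea — repeatedly merging the most frequent adjacent pair — can be sketched as follows. The four-word corpus is a toy example; real tokenizers learn merges from large text corpora.

```python
from collections import Counter

# Minimal sketch of BPE vocabulary learning: repeatedly merge the most
# frequent adjacent symbol pair across a (toy) corpus of words.
def learn_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Represent each word as a tuple of symbols (initially single characters).
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with every occurrence of the best pair merged.
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_corpus[tuple(out)] += freq
        corpus = merged_corpus
    return merges

merges = learn_bpe(["low", "lower", "lowest", "low"], num_merges=3)
print(merges)  # first merges build up 'lo', then 'low', ...
```

Because "low" is the most frequent substring in this corpus, the learned merges quickly assemble it into a single token, which is exactly how frequent words end up as whole tokens in practice.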

The choice of tokenization method affects the model's performance, its ability to process different languages, and the efficiency of its training.

Special Tokens

In addition to regular tokens, models also use special tokens to denote functional elements of the text, such as:

  • [CLS] (class) — marks the start of a sequence; its final representation is often used as an aggregate summary of the input for text classification tasks;
  • [SEP] (separator) — separates different parts of the text (e.g., a question and an answer, sentences, or paragraphs);
  • [MASK] — a special token used to denote a word that the model must predict (used in BERT and other masked-language models);
  • [PAD] (padding) — used to align sequences to a uniform length.
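Putting these together, a BERT-style input for a question-answer pair might be assembled as below. The 8-token maximum length is an arbitrary choice for illustration; real models pad to much longer fixed lengths.

```python
# Sketch of framing model input with special tokens, in the style of
# BERT-like encoders: [CLS] opens the sequence, [SEP] separates the
# two segments, and [PAD] fills the sequence out to a fixed length.
def build_input(question: list[str], answer: list[str], max_len: int = 8) -> list[str]:
    tokens = ["[CLS]"] + question + ["[SEP]"] + answer + ["[SEP]"]
    # Pad so that all sequences in a batch have the same length.
    tokens += ["[PAD]"] * (max_len - len(tokens))
    return tokens

print(build_input(["who", "?"], ["me"]))
# ['[CLS]', 'who', '?', '[SEP]', 'me', '[SEP]', '[PAD]', '[PAD]']
```

Uniform lengths matter because models process inputs as rectangular batches of token ids.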

These special tokens help models more accurately perceive the structure and context of the text being processed.

Tokens and the Context Window

The context window is the maximum number of tokens that a model can simultaneously consider and process when generating text.

For example, the GPT-3 model has a context window of 2048 tokens: when generating text, it can attend to at most 2048 tokens of prompt and output combined. The size of the context window affects:

  • The maximum amount of information available to the model;
  • The quality and coherence of the generated responses;
  • The model's ability to comprehend long texts and maintain context over long distances between tokens.
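The practical effect of a fixed window can be sketched as simple truncation: anything beyond the most recent `window` tokens is invisible to the model. The 2048-token window follows the GPT-3 example above; the 3000-token "history" is an invented illustration.

```python
# Sketch of a fixed context window: only the most recent `window`
# token ids remain visible to the model; older ones are dropped.
def clip_to_window(token_ids: list[int], window: int = 2048) -> list[int]:
    return token_ids[-window:]

history = list(range(3000))          # pretend conversation history of 3000 tokens
visible = clip_to_window(history)
print(len(visible))                  # 2048
print(visible[0])                    # 952  (tokens 0..951 fell out of context)
```

Real systems use more elaborate strategies (summarizing or selectively retaining earlier text), but the underlying constraint is the same hard token limit.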
