BERT (language model)

From Systems Analysis Wiki

BERT (Bidirectional Encoder Representations from Transformers) is a language model for natural language understanding, developed by researchers at Google and introduced in 2018. BERT marked a new era in natural language processing (NLP) by demonstrating unprecedented performance across a wide range of tasks and establishing the "pre-train and fine-tune" paradigm as an industry standard.

The key innovation of BERT is its deeply bidirectional architecture, which allows the model to consider the context of a word from both the left and the right simultaneously across all layers of the network. This is achieved through a new pre-training task: Masked Language Modeling (MLM).

Name and Operating Principle

The acronym BERT stands for Bidirectional Encoder Representations from Transformers.

  • Bidirectional: This points to the model's core feature—the ability to process a word's context in both directions (left-to-right and right-to-left) simultaneously. Unlike unidirectional models (like GPT), which only see the preceding context when processing a word, BERT sees the entire sequence at once, allowing it to form a deeper and more accurate understanding of the word's meaning.
  • Encoder: This signifies that BERT uses only the encoder part of the Transformer architecture. The encoder's task is to read an input sequence of text and create a rich contextual representation (vector) for each token. BERT is not designed for free-form text generation like decoder-based models.
  • Representations: The model is trained to create high-quality numerical representations (vectors or embeddings) for words and sentences, which can then be used to solve various NLP tasks.
  • from Transformers: This indicates that the model's architecture is based entirely on the Transformer.

History

The development of BERT was the result of several key breakthroughs in NLP:

  1. Contextual Embeddings: Models like Word2vec and GloVe created static vectors for words, disregarding context. The ELMo model (2018) was a step forward, generating context-dependent representations using bidirectional LSTMs, but this bidirectionality was "shallow" (a concatenation of two unidirectional models).
  2. Transfer Learning and GPT: In mid-2018, OpenAI introduced the GPT model, which demonstrated the effectiveness of pre-training a large Transformer model on unlabeled data followed by fine-tuning on specific tasks. However, GPT was strictly unidirectional (left-to-right), limiting its capabilities in tasks that require understanding the full context.

Recognizing these limitations, Google researchers led by Jacob Devlin developed BERT to create a truly deeply bidirectional model. The paper on BERT was published on arXiv in October 2018, and the code and pre-trained models were released open-source, sparking an explosion of interest in the research community. BERT broke records on 11 key NLP benchmarks, including GLUE and SQuAD, and was dubbed the "ImageNet moment" for NLP, as one universal model could be easily adapted for a multitude of tasks.

Architecture

BERT is based entirely on the encoder part of the Transformer architecture. It consists of several identical layers stacked on top of each other. There are two main versions:

  • BERT-Base: 12 layers, 12 attention heads, hidden state size of 768, ~110 million total parameters.
  • BERT-Large: 24 layers, 16 attention heads, hidden state size of 1024, ~340 million total parameters.

Each layer contains two main sub-layers:

  1. Multi-Head Self-Attention mechanism: Allows each token in the input sequence to "attend" to all other tokens, weighing their importance to determine its own contextual meaning.
  2. Feed-Forward Network: Applied to each token individually.
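The two sub-layers above can be illustrated with a minimal NumPy sketch of a single encoder layer. This is a simplified illustration with random weights, a single attention head, and toy dimensions; the real model uses multiple heads, GELU activations, and layer normalization, all omitted here for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Every token attends to every other token; the softmax weights
    # say how much each other token matters for this one.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq, seq) attention scores
    return softmax(scores) @ V

def encoder_layer(X, Wq, Wk, Wv, W1, W2):
    # Sub-layer 1: self-attention with a residual connection.
    X = X + self_attention(X, Wq, Wk, Wv)
    # Sub-layer 2: position-wise feed-forward network, also residual.
    # (ReLU stands in for BERT's GELU; layer norm is omitted.)
    return X + np.maximum(0, X @ W1) @ W2

rng = np.random.default_rng(0)
d, seq = 8, 5                                  # toy sizes; BERT-Base uses d=768
X = rng.normal(size=(seq, d))
weights = [rng.normal(size=s) * 0.1 for s in [(d, d)] * 3 + [(d, 4 * d), (4 * d, d)]]
out = encoder_layer(X, *weights)
print(out.shape)                               # (5, 8): one contextual vector per token
```

Stacking 12 or 24 such layers (with the real sub-layer details restored) gives BERT-Base and BERT-Large respectively.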

Input Data

For proper operation, BERT requires a specific input data format. The token sequence fed into the model always begins with a special `[CLS]` (classification) token, which is used for text classification tasks. If a pair of sentences is provided as input (e.g., in question-answering tasks), they are separated by a `[SEP]` (separator) token.

The final representation of each input token is the sum of three embeddings:

  • Token Embedding: A vector corresponding to a specific token from the vocabulary (BERT uses WordPiece tokenization).
  • Segment Embedding: Indicates which sentence (the first or second) the token belongs to.
  • Positional Embedding: Indicates the token's position in the sequence, as the Transformer architecture itself does not account for word order.
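The three-way sum can be sketched as follows; the lookup tables are random stand-ins for learned parameters, and the token ids are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d = 100, 16, 8            # toy sizes; BERT-Base uses d=768

# Three learned lookup tables (random here for illustration).
token_emb    = rng.normal(size=(vocab_size, d))
segment_emb  = rng.normal(size=(2, d))         # sentence A = 0, sentence B = 1
position_emb = rng.normal(size=(max_len, d))

# "[CLS] w w [SEP] w w [SEP]" as made-up token ids.
token_ids   = np.array([1, 37, 42, 2, 55, 60, 2])
segment_ids = np.array([0, 0,  0,  0, 1,  1,  1])
positions   = np.arange(len(token_ids))

# The input representation is the element-wise sum of the three embeddings.
X = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(X.shape)   # (7, 8): one input vector per token
```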

Pre-training Tasks

To achieve deep bidirectionality, BERT is trained on two unique tasks simultaneously.

Masked Language Modeling (MLM)

This is the key innovation of BERT. Instead of predicting the next word, as in standard language models, BERT predicts randomly "masked" words within a sentence. The process is as follows:

  • 15% of the tokens from the input sequence are randomly selected.
  • Of these 15%:
    • 80% are replaced with a special `[MASK]` token.
    • 10% are replaced with a random token from the vocabulary.
    • 10% remain unchanged.
  • The model's task is to predict the original values of these 15% of tokens based on their surrounding (left and right) context.

This scheme forces the model to learn deep semantic and syntactic relationships between words and allows it to be truly bidirectional.
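The masking procedure can be sketched in a few lines. This is a simplified version: it samples each token independently at roughly 15% rather than selecting an exact fraction, and it works on whole words instead of WordPiece sub-tokens.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Apply BERT's 80/10/10 masking scheme; return masked tokens and targets."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:        # select ~15% of tokens
            targets[i] = tok                # the model must predict the original
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"        # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = rng.choice(vocab)  # 10%: replace with a random token
            # remaining 10%: leave the token unchanged
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, vocab=["dog", "tree", "ran"])
print(masked, targets)
```

The 10% random and 10% unchanged cases exist because `[MASK]` never appears at fine-tuning time; they force the model to maintain a useful representation for every input token, not just the masked ones.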

Next Sentence Prediction (NSP)

This task was designed to teach BERT to understand relationships between sentences, which is critical for tasks like question-answering or natural language inference (NLI). The model is given a pair of sentences (A and B) and must predict whether sentence B is a logical continuation of sentence A.

  • In 50% of cases, B is indeed the next sentence from the original text.
  • In 50% of cases, B is a random sentence taken from elsewhere in the corpus.

Later research (e.g., in the RoBERTa model) showed that the NSP task is less important than MLM and can be abandoned in favor of more efficient training schemes, but it played a significant role in the original BERT.
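Constructing the 50/50 NSP training pairs can be sketched like this (a simplification: the original implementation also samples at the segment level rather than single sentences):

```python
import random

def make_nsp_pairs(corpus, seed=0):
    """Build (sentence_A, sentence_B, is_next) examples for NSP pre-training."""
    rng = random.Random(seed)
    pairs = []
    for doc in corpus:
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                pairs.append((doc[i], doc[i + 1], True))   # 50%: true next sentence
            else:
                other = rng.choice(corpus)                 # 50%: random sentence
                pairs.append((doc[i], rng.choice(other), False))
    return pairs

corpus = [["A1.", "A2.", "A3."], ["B1.", "B2."]]
for a, b, is_next in make_nsp_pairs(corpus):
    print(a, b, is_next)
```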

Application and Fine-Tuning

The power of BERT lies in the transfer learning paradigm. After large-scale and costly pre-training on massive corpora (Wikipedia + BooksCorpus), the pre-trained model can be easily and quickly fine-tuned for a specific application.

The fine-tuning process typically looks like this:

  1. A small, untrained task-specific layer (e.g., a classifier for sentiment analysis) is added on top of the pre-trained BERT architecture.
  2. The entire model (BERT's weights together with the new layer) is trained on a small, labeled dataset for that specific task.

Examples of tasks for which BERT is adapted:

  • Text classification (sentiment analysis, spam filters): A classifier is added to the output of the `[CLS]` token.
  • Question-answering systems (e.g., SQuAD): The model is trained to predict the start and end tokens of the answer within a given text.
  • Named Entity Recognition (NER): A classifier is added to the output of each token to determine if it is part of a name, organization, date, etc.
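The first pattern above, attaching a classifier to the `[CLS]` output, can be sketched with a stand-in encoder. Here `bert_encoder` is a hypothetical stub returning random per-token vectors; in practice it would be the pre-trained model, and the new weight matrix `W` would be trained jointly with it.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, num_labels = 8, 2                     # toy sizes; BERT-Base uses d=768

def bert_encoder(token_ids):
    # Stand-in for the pre-trained encoder: one vector per input token.
    return rng.normal(size=(len(token_ids), d))

# The new, randomly initialised task-specific layer added on top of BERT.
W, b = rng.normal(size=(d, num_labels)) * 0.1, np.zeros(num_labels)

hidden = bert_encoder([101, 2023, 102])  # e.g. [CLS] ... [SEP]
cls_vec = hidden[0]                      # final representation of [CLS]
probs = softmax(cls_vec @ W + b)         # class probabilities for the sequence
print(probs)
```

Question answering and NER follow the same recipe, but the new layer reads every token's output vector rather than only the `[CLS]` position.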

Variants and Derivative Models

The success of BERT has led to the emergence of a whole family of models based on its ideas:

  • RoBERTa (from Facebook AI): "A Robustly Optimized BERT." It is not a new architecture but rather the result of more thorough and prolonged training of BERT: on more data, without the NSP task, and with dynamic masking. RoBERTa demonstrated that the original BERT was "undertrained" and surpassed it on all major benchmarks.
  • DistilBERT (from Hugging Face): A smaller version of BERT created using knowledge distillation. DistilBERT is 40% smaller, 60% faster, and retains 97% of BERT's performance, making it ideal for use in production and on resource-constrained devices.
  • ALBERT (A Lite BERT, from Google): A version optimized for reducing the number of parameters. It uses two key techniques: embedding factorization and cross-layer parameter sharing. This allows for the creation of much larger models with fewer parameters.
  • mBERT (Multilingual BERT): A version of BERT pre-trained on 104 languages simultaneously. It has shown a surprising ability for cross-lingual knowledge transfer.
  • Domain-specific models: Numerous models fine-tuned on data from specific fields, such as BioBERT (biomedicine), SciBERT (scientific texts), and FinBERT (finance).
  • ModernBERT (2024-2025): A new generation of BERT-like models from companies like Answer.AI and LightOn, incorporating modern architectural improvements such as RoPE (Rotary Position Embeddings) and support for longer contexts (up to 8192 tokens), while retaining BERT's core principles.

Comparison with Other Models

Comparison of BERT with other key architectures:

  • BERT (Google): encoder-only; bidirectional context; text understanding, classification, and extraction.
  • GPT (OpenAI): decoder-only; unidirectional (left-to-right) context; text generation and sequence continuation.
  • XLNet (Google / CMU): autoregressive, permutation-based; bidirectional context in effect; text understanding (an alternative to MLM).
  • T5 (Google): encoder-decoder; bidirectional encoder with unidirectional decoder; universal "text-to-text" transformation.

Impact

BERT brought about a true revolution in NLP and laid the foundation for many subsequent developments:

  1. Solidified the "pre-train and fine-tune" paradigm as the dominant approach in NLP.
  2. Proved the importance of deep bidirectional context for language understanding.
  3. Lowered the barrier to entry for creating high-performance NLP systems, as researchers and developers no longer needed to build complex architectures from scratch for each task.
  4. Was integrated into Google Search, which became one of the biggest updates to the search engine in its history and clearly demonstrated the model's practical utility.
  5. Spawned an entire ecosystem of derivative models, tools, and research ("BERTology"), becoming one of the most cited works in the field of AI.

Although newer and larger models like GPT-3 and GPT-4 have surpassed BERT on many benchmarks (especially in generative tasks), BERT and its variants remain powerful and widely used tools for tasks requiring deep text understanding.

Literature

  • Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.
  • Vaswani, A. et al. (2017). Attention Is All You Need. arXiv:1706.03762.
  • Peters, M. E. et al. (2018). Deep Contextualized Word Representations. arXiv:1802.05365.
  • Liu, Y. et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
  • Lan, Z. et al. (2020). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv:1909.11942.
  • Sanh, V. et al. (2020). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108.
  • Yang, Z. et al. (2019). XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv:1906.08237.
  • Raffel, C. et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683.
  • Lee, J. et al. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. arXiv:1901.08746.
  • Warner, B. et al. (2024). Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. arXiv:2412.13663.