GPT (OpenAI)

From Systems Analysis Wiki

GPT (Generative Pre-trained Transformer) is a family of large language models (LLMs) developed by OpenAI. GPT models are built on the transformer architecture and implement the generative pre-training paradigm: in the first stage, the model is trained on extensive text corpora without explicit labeling, and can then be fine-tuned for specific tasks. For later generations (starting with GPT‑5), OpenAI also uses the term unified system, as the product combines a fast response mode, a deep reasoning mode, and a router[1].

Name

The abbreviation GPT stands for Generative Pre-trained Transformer.

  • Generative: indicates that the model is capable of creating (generating) new content, such as text.
  • Pre-trained: indicates that the model undergoes an extensive initial training stage on a large dataset (e.g., texts from the internet). After pre-training, the model can often be additionally "fine-tuned" for more specific tasks.
  • Transformer: refers to a specific neural network architecture that is a key innovation underlying GPT and many other modern AI models.

The main characteristic of GPT is that training is autoregressive: the model predicts the next token from the preceding context. That is, the model is trained to maximize the probability of the next token given the sequence of previous tokens (equivalently, to minimize the next-token prediction error), which enables the generation of highly coherent and consistent text.
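The autoregressive objective described above can be stated compactly: for a token sequence x₁, …, x_T, training maximizes the log-likelihood

```latex
L(\theta) = \sum_{t=1}^{T} \log P_\theta\left(x_t \mid x_1, \ldots, x_{t-1}\right)
```

which is equivalent to minimizing the cross-entropy between the model's predicted next-token distribution and the token that actually follows.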

Text generation process in GPT

The GPT model generates text sequentially, token by token, according to the following iterative scheme:

  • Receives an initial text sequence (the prompt, or seed text) as input.
  • Computes a probability distribution over all tokens in the vocabulary for the next text element.
  • Selects the next token:
    • by the highest probability (greedy decoding),
    • by stochastic sampling from the distribution,
    • or via filtering strategies such as top-k or top-p (nucleus) sampling.
  • Appends the selected token to the current sequence.
  • Feeds the updated sequence back into the model to predict the following token.
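The iterative scheme above can be sketched in a few lines of Python. The "model" here is a hypothetical hard-coded lookup table standing in for a real transformer (which would compute the distribution from the full context); only the generation loop itself mirrors how GPT operates.

```python
import random

# Toy vocabulary and a stand-in for a trained model: given a token
# sequence, return a probability distribution over the vocabulary.
# A real GPT computes this with a transformer; here the preferences
# are hard-coded purely to drive the loop.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def next_token_probs(sequence):
    table = {
        "the": {"cat": 0.6, "mat": 0.4},
        "cat": {"sat": 0.9, ".": 0.1},
        "sat": {"on": 1.0},
        "on":  {"the": 1.0},
        "mat": {".": 1.0},
        ".":   {"the": 1.0},
    }
    prefs = table[sequence[-1]]
    return [prefs.get(tok, 0.0) for tok in VOCAB]

def generate(prompt, n_tokens, greedy=True, rng=None):
    """Extend `prompt` one token per step, feeding each result back in."""
    rng = rng or random.Random(0)
    seq = list(prompt)
    for _ in range(n_tokens):
        probs = next_token_probs(seq)            # distribution over vocabulary
        if greedy:                               # greedy selection
            tok = VOCAB[probs.index(max(probs))]
        else:                                    # stochastic sampling
            tok = rng.choices(VOCAB, weights=probs, k=1)[0]
        seq.append(tok)                          # append and iterate
    return seq

print(generate(["the"], 4))  # greedy: ['the', 'cat', 'sat', 'on', 'the']
```

Swapping `greedy=False` yields varied continuations from the same prompt, which is exactly the role sampling plays in real GPT decoding.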

Transformer architecture: text processing

The data processing inside the transformer for predicting the next token involves several main stages:

  • Tokenization. The input text is split into tokens — small text units that may be words, subwords, or punctuation marks. The GPT-3 vocabulary, for example, contains 50,257 tokens.
  • Token embeddings. Each token is converted into a fixed-length vector using an embedding matrix (W_E). These vectors encode token meaning: semantically similar tokens lie close together in the high-dimensional embedding space. In GPT-3, the embedding dimensionality is 12,288.
  • Processing in transformer layers.
    • Attention blocks: Each token interacts with other tokens in the sequence. The attention mechanism allows the model to account for context and correctly interpret word meanings.
    • Feed-forward layers: After attention, each token is processed individually through a two-layer neural network with nonlinear activation.
  • Reverse transformation and Softmax. After all layers, the processed vector is transformed back into token space using a matrix (W_U), which is often a transposed version of W_E. The resulting logits vector is normalized using the Softmax function to obtain a probability distribution over all tokens.
  • Next token selection (Sampling). The next token is selected based on the probability distribution. The temperature parameter controls the randomness of selection: at temperature 0, the most probable token is selected; at higher temperatures, the probability of selecting less likely options increases, which promotes greater text diversity.
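The softmax-with-temperature and top-k filtering steps described above can be sketched as follows (the logits are invented toy values, not outputs of a real model):

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn a logits vector into a probability distribution.
    Lower temperature sharpens the distribution; higher flattens it."""
    if temperature <= 0:
        # Temperature 0: all probability mass on the arg-max token.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_top_k(logits, k, temperature=1.0, rng=None):
    """Keep only the k most probable tokens, renormalize, then sample."""
    rng = rng or random.Random(0)
    probs = softmax(logits, temperature)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]
print(softmax(logits, temperature=1.0))
print(softmax(logits, temperature=0))   # -> [1.0, 0.0, 0.0, 0.0]
```

Top-p (nucleus) sampling works analogously, except that instead of a fixed count k it keeps the smallest set of tokens whose cumulative probability exceeds p.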

GPT models

  • GPT-1 (2018): the first model in the family; a 12-layer decoder-only transformer; two-stage training (pre-training + fine-tuning on NLP tasks).
  • GPT-2 (2019): 1.5 billion parameters; trained on the WebText corpus; the first model capable of generating long coherent texts; improved zero-shot generation quality. Announced on February 14, 2019; because of safety concerns it was released in stages, with the full 1.5B version arriving on November 5, 2019.
  • GPT-3 (2020): 175 billion parameters; large-scale training on a combination of Common Crawl, Books, and Wikipedia; strong development of few-shot and zero-shot capabilities.
  • GPT-3.5 (2022): an intermediate version between GPT-3 and GPT-4; improved instruction following through Reinforcement Learning from Human Feedback (RLHF) in the text-davinci-003 and gpt-3.5-turbo versions; context window up to 4,096 tokens in early versions and up to 16,385 tokens in later ones (gpt-3.5-turbo-16k and updated gpt-3.5-turbo).
  • GPT-4 (2023): a multimodal model with text and image input (image support was deployed later, after the text-only launch); context window of 8,192 tokens in the base version and 32,768 tokens in the GPT-4-32k variant; significant improvements in accuracy, robustness, and reasoning.
  • GPT-4 Turbo (2023): an optimized version of GPT-4; increased context window up to 128,000 tokens; lower latency and cost.
  • GPT-4o (2024): a next-generation multimodal model (text, image, audio) with a unified neural network architecture; very high response speed and accuracy; context window of 128,000 tokens.
  • GPT-4.5 (2025): a research preview; the OpenAI system card states that the model "builds on GPT-4o"[2][3]; improved understanding of user queries, reduced error rate; context window of 128,000 tokens. The API model gpt-4.5-preview was declared deprecated on April 14, 2025 and shut down on July 14, 2025[4].
  • GPT-4.1 (2025): an improved version of the GPT-4 family with a context window of up to 1 million tokens; accepts text and images as input, outputs text[5]. Released simultaneously in three variants: GPT-4.1, GPT-4.1 mini, GPT-4.1 nano.
  • GPT-5 (2025): a unified system with fast response and deep reasoning modes; context window of approximately 400,000 tokens; notable reduction in hallucinations on factual tasks.
  • GPT-5.1 (2025): adaptive reasoning, improvements in coding and long-context retention.
  • GPT-5.2 (2025): focus on professional work; Pro mode for frontier tasks; the agentic GPT-5.2-Codex was released based on GPT-5.2.
  • GPT-5.3-Codex (2026): an agentic coding model combining coding capabilities and reasoning; 25% faster than predecessors.
  • GPT-5.3 Instant (2026): an update to the most widely used conversational model in ChatGPT; released on March 3, 2026. Improved factual accuracy, web search quality, conversational flow, and reduced excessive refusals and unnecessary caveats. Available in the API as gpt-5.3-chat-latest[6].
  • GPT-5.4 (2026): OpenAI's frontier model for professional work, introduced on March 5, 2026; the first general-purpose OpenAI model with native computer-use capabilities. In the API, gpt-5.4 is recommended as the default model for a wide range of general-purpose and coding tasks[7][8].

GPT-1

The first model, GPT-1, was introduced by OpenAI in 2018 in the paper "Improving Language Understanding by Generative Pre-Training". It was a 12-layer decoder-only transformer[9]. GPT-1 training proceeded in two stages: an unsupervised generative pre-training stage, followed by a supervised fine-tuning stage.

During the pre-training stage, the model was trained on the BookCorpus, comprising over 7,000 unpublished books of various genres. A distinctive feature of this corpus was the presence of long continuous text passages, which was critically important for developing the model's ability to process complex and long-range textual dependencies.

During the fine-tuning stage, the model was adapted to solve specialized natural language processing tasks, including:

  • Question Answering (QA) — generating a correct answer based on a given textual context;
  • Natural Language Inference (NLI) — determining the logical relationship between two texts: entailment, contradiction, or neutrality;
  • Semantic Textual Similarity — measuring the degree of semantic closeness between two text sequences.

Thanks to this approach, GPT-1 demonstrated significant superiority over previous models on a number of standard benchmarks for text comprehension tasks.

The development of GPT-1 demonstrated several key achievements and discoveries in natural language processing (NLP):

  • Effectiveness of generative pre-training. It was empirically confirmed that pre-training on large corpora of unlabeled text enables the model to acquire universal language representations suitable for subsequent application in various tasks without requiring fundamental architectural changes.
  • Versatility of the transformer architecture. The use of a multi-layer decoder transformer enabled the model to successfully process long-range dependencies in text, which had previously been difficult for models based on recurrent neural networks.
  • Reduced dependence on labeled data. The work confirmed that large-scale pre-training on unlabeled data can significantly reduce the amount of labeled data needed to achieve high quality on target tasks.
  • Foundation for further development. The results of GPT-1 laid the conceptual and technical groundwork for subsequent versions of the GPT family (GPT-2, GPT-3, and beyond).

GPT-2

The GPT-2 model was announced by OpenAI on February 14, 2019. It significantly surpassed its predecessor in size: the full version of the model contained approximately 1.5 billion parameters. For safety reasons, OpenAI initially released only smaller variants of the model; the full version (1.5B parameters) was released on November 5, 2019. Unlike GPT-1, which was trained on the BookCorpus (~5 GB), GPT-2 was trained on a specially compiled WebText corpus of approximately 40 GB, comprising textual data from high-quality internet sources. The increase in both model size and training data volume enabled GPT-2 to significantly improve text generation quality: it demonstrated the ability to create substantive articles, stories, and even coherent passages of fiction.

GPT-2 employed an autoregressive decoder-only transformer architecture similar to GPT-1, without significant changes. The model consisted of 48 self-attention layers, had a hidden state size of 1,600, and included approximately 1.5 billion parameters. The number of attention heads was 25 (maintaining a head size of 64, inherited from GPT-1: 1,600 ÷ 64 = 25). Training was performed on the next-token prediction task based on the preceding context using masked attention.

One of the main distinctions of GPT-2 was that the model was the first to demonstrate high effectiveness in zero-shot learning — the ability to solve new tasks without undergoing explicit fine-tuning on examples for those tasks. The model was trained on a large corpus of general texts and did not undergo specialized training on task-specific data. Evaluation was conducted in a zero-shot regime, in which the model performed tasks solely based on knowledge acquired during pre-training. On a number of language modeling tasks, GPT-2 achieved quality comparable to or exceeding the results of models specifically trained on specialized datasets (e.g., Wikipedia, news texts, books).

GPT-3

The GPT-3 model was introduced by OpenAI in June 2020 (the arXiv paper appeared on May 28, 2020; beta API access opened on June 11, 2020). It was the next step in the development of generative transformers after GPT-2 and was distinguished by scaling the architecture to 175 billion parameters, making it the largest language model at the time.

The architecture of GPT-3 remained fundamentally the same — a multi-layer autoregressive decoder-only transformer without radical changes. The main performance improvements were achieved by increasing the number of layers, the width of hidden layers, and the scale of training. The model was trained on a combination of several large text corpora, including Common Crawl, WebText2, Books1, Books2, and Wikipedia; the filtered portion of Common Crawl, which dominated the training mixture, alone accounted for approximately 570 GB.

One of the main features of GPT-3 was its capability for few-shot learning and zero-shot learning: the model could perform a wide range of natural language processing tasks, including translation, summarization, question answering, essay writing, and even programming, based on just a few examples in the text prompt or none at all.
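Few-shot learning as described above supplies task demonstrations directly in the prompt text, with no fine-tuning involved. A schematic illustration (the prompt content is invented for illustration):

```python
# A schematic few-shot prompt for translation: the task is specified
# entirely through in-context examples, and the model is expected to
# continue the pattern for the final query.
few_shot_prompt = (
    "English: cheese -> French: fromage\n"
    "English: house -> French: maison\n"
    "English: cat -> French:"
)
# Two completed example pairs plus one open query.
print(few_shot_prompt.count("->"))  # -> 3
```

In the zero-shot variant, the examples are dropped and the prompt contains only a task description, such as "Translate 'cat' into French:".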

GPT-3.5

The GPT-3.5 model was introduced by OpenAI in late 2022 as part of the evolutionary development of the GPT family. It was built on the scaled autoregressive decoder-only transformer architecture used in GPT-3, with improvements in text generation quality, context processing, and the ability to follow complex instructions. The exact number of parameters in GPT-3.5 was not officially disclosed; the davinci versions are presumably comparable in size to GPT-3 (175B), while the parameter count of gpt-3.5-turbo is unknown.

The training of GPT-3.5 involved expanded use of Reinforcement Learning from Human Feedback (RLHF) methods in the text-davinci-003 and gpt-3.5-turbo versions. The earlier text-davinci-002 version was trained using supervised fine-tuning (SFT) rather than RLHF. The model was trained on expanded text corpora including Common Crawl, Books, WebText, and other high-quality sources. The context window in early popular versions (gpt-3.5-turbo) was 4,096 tokens; OpenAI subsequently released updated versions with a context of up to 16,385 tokens[10].

In practice, GPT-3.5 was adapted to solve a wide range of natural language processing tasks, such as:

  • Generating coherent and logical text;
  • Question answering (QA) and context understanding;
  • Following multi-step instructions;
  • Improved long-term context maintenance in dialogues.

Several key versions based on GPT-3.5 were released for different purposes:

  • text-davinci-002 — the first publicly available model based on GPT-3.5, optimized for generation and instruction following (trained using SFT).
  • text-davinci-003 — an improved version with even greater reasoning and complex text generation capability (trained using RLHF).
  • gpt-3.5-turbo — the most performant and cost-effective version of GPT-3.5, used in the ChatGPT service since late 2022.

GPT-4

The GPT-4 model was introduced by OpenAI on March 14, 2023 in the "GPT-4 Technical Report". It represented the next stage in the development of the language model family, offering significant improvements in text comprehension, generation of meaningful and creative responses, and processing of multimodal data. The exact number of parameters and architectural details of the model were not officially disclosed — the GPT-4 technical report explicitly states that information about the architecture, model size, hardware, training compute, and dataset construction is not published[11]. According to unofficial external estimates, GPT-4 may have used a Mixture of Experts (MoE) approach with a total scale on the order of ~1.8 trillion parameters; however, OpenAI has neither officially confirmed nor denied these figures[12].

GPT-4 is a multimodal model capable of accepting both text and images as input. It should be noted that at the time of the initial launch in March 2023, only the text modality was available; image input support was deployed later. The context window was 8,192 tokens in the base version and 32,768 tokens in the GPT-4-32k variant. The model used RLHF (Reinforcement Learning from Human Feedback) methods.

GPT-4 training was performed on a combination of large-scale textual and multimodal corpora. Specific details of the training data, hardware, and methodology are not disclosed in official OpenAI publications.

Training proceeded in several stages:

  • large-scale unsupervised pre-training on texts and images,
  • supervised fine-tuning on specialized tasks,
  • a final stage of Reinforcement Learning from Human Feedback (RLHF) to improve reliability, safety, and instruction interpretation quality.

Several main versions were released based on GPT-4:

  • GPT-4 (March 2023): the base version with text input support (image support added later); context window of 8,192 tokens; a GPT-4-32k variant with a 32,768-token context was also released.
  • GPT-4 Turbo (November 2023): an optimized modification of GPT-4 with an increased context window of up to 128,000 tokens[13]; reduced compute costs and accelerated generation; support for function calling and JSON output modes.
  • GPT-4o (May 2024): a next-generation multimodal version; in the launch announcement it was positioned as an omni-model capable of working with text, images, and audio in real time (unlike GPT-4 Turbo, where different modalities were served by separate modules); however, the base API model gpt-4o is described as text+image input, text output; context window of 128,000 tokens.
  • GPT-4.5 (February 2025): a research preview; the OpenAI system card explicitly states that the model "builds on GPT-4o"[3]; improved generation of complex texts, increased instruction-following accuracy, and reduced hallucination rate; context window of 128,000 tokens. It was described as "the last OpenAI model without chain-of-thought" (codename — Orion)[14]. The API model gpt-4.5-preview was declared deprecated on April 14, 2025 and shut down on July 14, 2025[4].
  • GPT-4.1 (April 2025): a stable version with a radical context expansion to 1,047,576 tokens; accepts text and images as input, outputs text[15]; released simultaneously in three variants (GPT-4.1, GPT-4.1 mini, GPT-4.1 nano); initially available only through the API, later deployed in ChatGPT.

GPT-5

On August 7, 2025, OpenAI introduced GPT‑5 as its then "smartest, fastest, and most useful" model, with a built-in deep reasoning mode (thinking) and a focus on practical scenarios — writing, programming, health-related work, and multimodal understanding. GPT‑5 gradually became the default model for most logged-in ChatGPT users, displacing the previously used GPT‑4/4o family and o-series models.[16]

GPT‑5 is implemented as a unified system with two main operating modes: fast, cost-effective responses for everyday queries (referred to as gpt‑5 main) and deep reasoning for complex tasks (referred to as gpt‑5 thinking). The mode is selected automatically by a router that takes into account the dialogue type, query complexity, the need for tools, and explicit user cues (e.g., "think step by step" or "analyze in depth"). In ChatGPT, users have access to Auto / Instant / Thinking / Pro modes; the mini and nano variants are primarily API models, and mini in the consumer product may be used as a fallback after rate limits are exhausted[17].

Several sizes and configurations of GPT‑5 are available through the API; in the OpenAI documentation, the main variants are listed as gpt‑5, gpt‑5‑mini, and gpt‑5‑nano (all supporting text and visual data). The maximum total context window for the GPT‑5 family in the API is approximately 400,000 tokens (with separate budgets for input and reasoning/output), although specific limits may vary depending on the chosen model variant and product[18].

On a number of web-search and factual benchmarks, GPT‑5 demonstrates a notable reduction in the frequency of hallucinations and errors compared to GPT‑4o and earlier OpenAI "thinking" models. In the official announcement, OpenAI reported error reductions of approximately 45% compared to GPT-4o and approximately 80% compared to o3 in thinking mode — these results were obtained under specific conditions: with web search enabled on anonymized prompts representative of ChatGPT production traffic[19].

GPT-5.1

The GPT-5.1 model was introduced by OpenAI on November 12, 2025 as the first significant iteration after the base GPT-5, aimed at improving everyday interaction, conversational quality, and adaptability. The model retains the unified system with a fast mode (GPT-5.1 Instant) and deep reasoning (GPT-5.1 Thinking), but introduces adaptive reasoning: the model dynamically determines the amount of computation depending on query complexity, making it notably faster on simple tasks without sacrificing quality on complex ones.

GPT-5.1 training was built on top of GPT-5 with an additional post-training stage that included expanded RLHF, a focus on natural tone, and reduced "coldness" of responses. The API context window is 400,000 tokens, with a maximum output of 128,000 tokens[20]. Extended prompt caching of up to 24 hours was introduced, significantly reducing cost and latency for multi-turn dialogues[21].

Key features:

  • GPT-5.1 Instant — the primary mode for everyday tasks; the first to use adaptive reasoning to determine when it is worth "thinking" before responding to a more complex query[21].
  • GPT-5.1 Thinking — adaptive allocation of reasoning time; according to OpenAI, on a representative distribution of ChatGPT tasks the model is approximately twice as fast on the simplest tasks and approximately twice as slow on the hardest compared to GPT-5 Thinking[21].
  • Improved multimodality (text + vision).
  • Improved coding and agentic scenarios, as well as efficiency on simple tasks through adaptive reasoning and extended prompt caching[22].

GPT-5.2

The GPT-5.2 model was released on December 11, 2025 as "the most capable model in the series for professional work and learning." It is an evolution of GPT-5.1 with an emphasis on economic value: generation of tables, presentations, complex code, and end-to-end tasks. It retains the unified architecture with Instant, Thinking, and a new Pro mode (for tasks requiring maximum compute and reasoning time).

Training included an updated corpus with a knowledge cutoff of August 2025, enhanced instruction-tuning, and RLHF to reduce errors in multi-step scenarios. Context window — 400K tokens (128K max output). The model became more reliable in professional scenarios, with improved factual accuracy and tool use.

On December 18, 2025, the specialized GPT-5.2-Codex was released based on GPT-5.2 — an agentic coding model with improved context compaction, Windows support, enhanced cybersecurity, and long-horizon reasoning (tasks lasting up to several hours).

As of February 13, 2026, following the retirement of several older models, GPT-5.2 temporarily became the default model in ChatGPT. However, by early March 2026, this role was taken over by GPT‑5.3 Instant and GPT‑5.4[17].

GPT-5.3-Codex

GPT-5.3-Codex was introduced on February 5, 2026 as "the most powerful agentic coding model to date." It combines the frontier-coding capabilities of GPT-5.2-Codex with the professional reasoning of GPT-5.2 in a single model that is 25% faster than its predecessors.

The model is capable of performing virtually any developer task: long-running workflows, research, tool use, code execution, and interactive steering (the user can intervene in real time without losing context). Early versions of the model were used by the OpenAI team to debug their own training, deployment, and evaluations.

Key results at the time of the February 5, 2026 announcement: Terminal-Bench ~77.3%, OSWorld-Verified ~64.7%, SWE-Bench Pro ~56.8%. In the later GPT-5.4 release on March 5, 2026, OpenAI reported an updated OSWorld-Verified result of 74.0% for GPT-5.3-Codex when using a new API parameter that preserves the original image resolution[23][7].

On February 12, 2026, OpenAI also released GPT-5.3-Codex-Spark — a compact ultra-fast version in partnership with Cerebras, optimized for real-time use: over 1,000 tokens per second, text-only, 128K context. At launch, this was a rollout for ChatGPT Pro users in Codex and a small number of API design partners, rather than a broadly available API model[24].

GPT-5.4

On March 5, 2026, OpenAI introduced GPT‑5.4 as its new frontier model for professional work. GPT‑5.4 combines the strengths of OpenAI's latest releases in reasoning, coding, and agentic workflows and was the first in the main product line to receive built-in computer use capabilities. Simultaneously, OpenAI released GPT‑5.4 Pro — a variant for the most complex tasks, using more compute and longer reasoning[7].

In the API, the gpt-5.4 model is described as the recommended default for a wide range of general-purpose and coding tasks; the context window is 1,050,000 tokens, with a maximum output of 128,000 tokens. The model accepts text and images as input and outputs text[8][25].

In ChatGPT, the Auto mode as of March 7, 2026 automatically switches between GPT‑5.3 Instant and GPT‑5.4 Thinking, while GPT‑5.4 Pro is available as a separate high-capability mode. For logged-in ChatGPT users, the default model is GPT‑5.3[17].

GPT model evolution

Generation | Release year | Parameter count | Training corpus size | Key features
GPT-1 | 2018 | ≈117–124M[26] | ≈5 GB (BooksCorpus) | Generative pre-training on large corpora; two-stage training (pre-training + fine-tuning)
GPT-2 | 2019 | 1.5B | ≈40 GB (WebText) | Substantially improved text generation; demonstration of strong zero-shot behavior; initially staged release of the model
GPT-3 | 2020 | 175B | ≈570 GB (Common Crawl, WebText2, et al.) | Large-scale in-context learning; strong few-shot and zero-shot capabilities without fine-tuning
GPT-3.5 | 2022 | Not disclosed (davinci versions presumably ~175B) | >570 GB + additional corpora and instruction tuning | Improved stability and instruction following; basis of early ChatGPT versions
GPT-4 | 2023 | Not disclosed[27] | Not disclosed | Multimodality (text + images); improved accuracy and hallucination resistance; 8k/32k-token context
GPT-4 Turbo | 2023 | Not disclosed | Based on GPT-4 training (details not disclosed) | Context increased to 128,000 tokens; optimized generation speed and cost
GPT-4o | 2024 | Not disclosed | Multimodal data (text, images, audio) | Unified neural multimodal processing; high response speed
GPT-4.5 | 2025 | Not disclosed | Expanded textual and multimodal corpora | Research preview based on GPT-4o; error reduction; deprecated by 2026
GPT-4.1 | 2025 | Not disclosed | Updated corpora | Context up to 1,047,576 tokens; text + images as input, text as output
GPT-5 | 2025 (August) | Not disclosed | Large-scale multimodal corpora | Unified system with fast-response and reasoning modes; ~400K-token context; reduced hallucinations
GPT-5.1 | 2025 (November) | Not disclosed | Expanded GPT-5 corpora + RLHF | Adaptive reasoning; 24h prompt caching; coding improvements
GPT-5.2 | 2025 (December) | Not disclosed | Knowledge cutoff August 2025 | Pro mode; professional knowledge work; GPT-5.2-Codex (agentic coding)
GPT-5.3-Codex | 2026 (February) | Not disclosed | Updated + self-improvement data | 25% faster; full-spectrum agent; interactive steering
GPT-5.3-Codex-Spark | 2026 (February) | Not disclosed | Compact | >1,000 t/s on Cerebras; real-time coding; 128K context
GPT-5.3 Instant | 2026 (March) | Not disclosed | Not disclosed | Update to the most widely used conversational ChatGPT model; improved factuality, web search, and conversational flow
GPT-5.4 | 2026 (March) | Not disclosed | Not disclosed | New frontier model for professional work; native computer use; default API model for general-purpose and most coding tasks
GPT-5.4 Pro | 2026 (March) | Not disclosed | Not disclosed | GPT-5.4 variant with more compute for the most complex tasks

Architectural parameters of GPT models

Model | Release year | Parameter count | Number of layers | Hidden state size | Number of attention heads | Context window | Training corpus size
GPT-1 | 2018 | ≈117–124M | 12 | 768 | 12 | 512 tokens | ≈5 GB (BooksCorpus)
GPT-2 | 2019 | 1.5B | 48 | 1,600 | 25 | 1,024 tokens | ≈40 GB (WebText)
GPT-3 | 2020 | 175B | 96 | 12,288 | 96 | 2,048 tokens | ≈570 GB (Common Crawl + WebText2 + others)
GPT-3.5 | 2022 | Not disclosed (davinci versions presumably ~175B) | (estimated close to GPT-3) | (estimated close to GPT-3) | (not disclosed) | Up to 4,096 tokens (early); up to 16,385 tokens (later) | Expanded Common Crawl + additional datasets and instruction tuning
GPT-4 | 2023 | Not disclosed | (not disclosed) | (not disclosed) | (not disclosed) | 8,192 tokens (base); 32,768 (GPT-4-32k) | Not disclosed
GPT-4 Turbo | 2023 | (not disclosed) | (not disclosed) | (not disclosed) | (not disclosed) | Up to 128,000 tokens | Optimized version of GPT-4 (corpus details not disclosed)
GPT-4o | 2024 | (not disclosed) | (not disclosed) | (not disclosed) | (not disclosed) | Up to 128,000 tokens | Multimodal data: text, images, audio
GPT-4.5 | 2025 | (not disclosed) | (not disclosed) | (not disclosed) | (not disclosed) | Up to 128,000 tokens | Updated textual and multimodal corpora
GPT-4.1 | 2025 | (not disclosed) | (not disclosed) | (not disclosed) | (not disclosed) | Up to 1,047,576 tokens | Multimodality; scaled training with emphasis on long contexts
GPT-5 | 2025 | (not disclosed) | (not disclosed) | (not disclosed) | (not disclosed) | Up to ≈400,000 tokens (total context) | Large-scale multimodal corpora (details not disclosed)
GPT-5.4 | 2026 | (not disclosed) | (not disclosed) | (not disclosed) | (not disclosed) | 1,050,000 tokens; 128,000 max output | Not disclosed

References

  1. OpenAI. "Introducing GPT-5" (August 7, 2025). https://openai.com/index/introducing-gpt-5/
  2. OpenAI. "Introducing GPT-4.5" (2025). https://openai.com/index/introducing-gpt-4-5/
  3. 3.0 3.1 OpenAI. GPT-4.5 System Card (February 27, 2025). https://cdn.openai.com/gpt-4-5-system-card-2272025.pdf
  4. 4.0 4.1 OpenAI Developers. Deprecations. https://developers.openai.com/api/docs/deprecations/
  5. OpenAI. "Introducing GPT-4.1 in the API" (2025).
  6. OpenAI. "GPT-5.3 Instant: Smoother, more useful everyday conversations" (March 3, 2026). https://openai.com/index/gpt-5-3-instant/
  7. 7.0 7.1 7.2 OpenAI. "Introducing GPT-5.4" (March 5, 2026). https://openai.com/index/introducing-gpt-5-4/
  8. 8.0 8.1 OpenAI Developers. "Using GPT-5.4". https://developers.openai.com/api/docs/guides/latest-model/
  9. Radford, A. et al. (2018). Improving Language Understanding by Generative Pre-Training. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
  10. OpenAI noted that the updated GPT-3.5 Turbo "now comes by default with 16k context."
  11. OpenAI. "GPT-4 Technical Report" (2023). arXiv:2303.08774.
  12. These estimates are based on data published by SemiAnalysis and corroborated by a number of independent sources.
  13. Announced at OpenAI DevDay on November 6, 2023; general availability from April 9, 2024.
  14. The codename Orion and the characterization "last model without chain-of-thought" appeared in Sam Altman's roadmap communications and several media publications (Reuters, The Verge), but not in the GPT-4.5 launch post itself.
  15. OpenAI. "Introducing GPT-4.1 in the API" (2025).
  16. OpenAI. "Introducing GPT-5" (August 7, 2025).
  17. 17.0 17.1 17.2 OpenAI Help Center. "GPT-5.3 and GPT-5.4 in ChatGPT". https://help.openai.com/en/articles/11909943-gpt-53-and-54-in-chatgpt
  18. OpenAI API documentation. Models: GPT-5.
  19. OpenAI. "Introducing GPT-5" (2025). Testing conditions: "with web search enabled on anonymized prompts representative of ChatGPT production traffic."
  20. OpenAI Developers. Models: GPT-5.1. https://developers.openai.com/api/docs/models/gpt-5.1
  21. 21.0 21.1 21.2 OpenAI. "GPT-5.1: A smarter, more conversational ChatGPT" (November 12, 2025). https://openai.com/index/gpt-5-1/
  22. OpenAI. "GPT-5.1 for developers" (2025). https://openai.com/index/gpt-5-1-for-developers/
  23. OpenAI. "Introducing GPT-5.3-Codex" (February 5, 2026). https://openai.com/index/introducing-gpt-5-3-codex/
  24. OpenAI. "Introducing GPT-5.3-Codex-Spark" (February 12, 2026). https://openai.com/index/introducing-gpt-5-3-codex-spark/
  25. OpenAI Developers. Models: GPT-5.4. https://developers.openai.com/api/docs/models/gpt-5.4
  26. The exact parameter count of GPT-1 varies across sources; the original publication does not state the number explicitly. The figure ≈117M is widely cited, while ≈124M appears in some later materials.
  27. According to unofficial external estimates (SemiAnalysis et al.), possibly MoE architecture with a total scale of ~1.8T parameters; OpenAI has not confirmed these figures.

Bibliography

  • Radford, A. et al. (2018). Improving Language Understanding by Generative Pre-Training. PDF.
  • Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners. PDF.
  • Brown, T. B. et al. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165.
  • Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
  • Chen, M. et al. (2021). Evaluating Large Language Models Trained on Code. arXiv:2107.03374.
  • Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback. arXiv:2203.02155.
  • Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models. arXiv:2203.15556.
  • Bai, Y. et al. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862.
  • OpenAI (2023). GPT-4 Technical Report. arXiv:2303.08774.
  • Bubeck, S. et al. (2023). Sparks of Artificial General Intelligence: Early Experiments with GPT-4. arXiv:2303.12712.