FlashAttention-3
FlashAttention-3 is an algorithm for optimizing the attention mechanism in transformer neural networks, designed to make full use of the hardware capabilities of the NVIDIA Hopper (H100) GPU architecture[1]. The algorithm was introduced in 2024 by a group of researchers from Colfax Research, Meta, NVIDIA, Georgia Tech, Princeton University, and Together AI. The paper was accepted to the NeurIPS 2024 conference as a spotlight paper[2].
FlashAttention-3 is the third iteration in a family of algorithms, following FlashAttention (2022) and FlashAttention-2 (2023). Its primary goal is to significantly accelerate the training and inference of large language models (LLMs) while maintaining computational accuracy.
Introduction and Background
The Problem with the Attention Mechanism
The key component of transformers is the self-attention mechanism, whose computational cost and memory consumption grow quadratically (O(n²)) with the input sequence length n[1]. The hardware compounds this bottleneck: modern GPUs are optimized for fast matrix multiplication, while the exponential functions used in softmax run at a throughput that is orders of magnitude lower. Furthermore, a naive implementation stores the full n × n intermediate attention matrix in GPU memory, which limits how far models can scale.
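To make the quadratic memory growth concrete, the following minimal sketch computes the size of a single n × n score matrix. The function name and the 2-byte (FP16) element size are illustrative assumptions, and the figure is per attention head and per sequence.

```python
def attention_matrix_bytes(seq_len, bytes_per_element=2):
    """Bytes needed for one seq_len x seq_len score matrix.

    Assumes 2-byte (FP16) elements, one head, one sequence.
    """
    return seq_len * seq_len * bytes_per_element


for n in (1024, 16384, 131072):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n = {n:>6}: {mib:>9,.0f} MiB")
# n =   1024:         2 MiB
# n =  16384:       512 MiB
# n = 131072:    32,768 MiB
```

Increasing the sequence length by a factor of 8 increases the matrix size by a factor of 64, which is why a naive implementation runs out of memory long before the model weights do.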
FlashAttention and FlashAttention-2
To address this problem, FlashAttention was proposed in 2022. It reduced the number of accesses to slow global memory (HBM) using two techniques:
- Tiling: Computations are broken down into blocks (tiles) that are processed in fast on-chip memory (SRAM).
- Kernel Fusion: All operations (matrix multiplication, Softmax) are performed within a single GPU kernel without writing intermediate results to global memory.
This reduced memory usage from quadratic to linear in the sequence length and made attention 2–4 times faster than the baseline implementations of the time.
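The tiling idea relies on a softmax that can be computed block by block, rescaling a running result as each block is processed. The sketch below shows this for a single query vector in plain Python; it is an illustration of the general technique, not the authors' GPU kernel, and the function names and `block_size` parameter are assumptions made for the example.

```python
import math


def naive_attention(q, K, V):
    """Reference: materializes every score before applying softmax."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]


def tiled_attention(q, K, V, block_size=2):
    """Processes K and V one block at a time with a running softmax.

    Only one block of scores exists at any moment, so memory no longer
    grows with the full sequence length.
    """
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (the factor is exp(-inf) = 0.0 for the first block).
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(weights)
        acc = [a * scale + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, K, V))
print(tiled_attention(q, K, V, block_size=2))
```

Both calls print the same values up to floating-point rounding, which is the sense in which the tiled computation is exact rather than approximate.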
In 2023, an improved version, FlashAttention-2, was introduced, which optimized the parallelization of computations. On GPUs with the NVIDIA Ampere (A100) architecture, it achieved ~70% of the peak theoretical performance[3]. However, on the newer NVIDIA Hopper (H100) architecture, its efficiency was significantly lower, at around 35%[1]. The algorithm did not use Hopper's new hardware capabilities, which prompted the development of FlashAttention-3.
New Hardware Capabilities of the Hopper GPU (H100)
The NVIDIA Hopper architecture introduced several new features that FlashAttention-3 utilizes to achieve maximum performance[4]:
- WGMMA (Warpgroup Matrix Multiply-Accumulate): A new type of instruction for Tensor Cores that performs matrix multiplications with nearly double the performance compared to the Ampere architecture.
- TMA (Tensor Memory Accelerator): A hardware unit that accelerates data transfer between global (HBM) and shared memory. TMA automatically handles address calculations, offloading the compute cores.
- FP8 Format: Hardware support for the 8-bit floating-point data format, which doubles theoretical performance compared to FP16 but carries the risk of precision loss due to its limited precision and dynamic range.
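The precision cost of FP8 can be illustrated by rounding a value to a given number of mantissa bits. The sketch below assumes the E4M3 variant of FP8 (3 explicit mantissa bits) and compares it with FP16 (10 explicit mantissa bits). The helper `round_to_mantissa_bits` is a hypothetical name, and it deliberately ignores exponent range, subnormals, and saturation.

```python
import math


def round_to_mantissa_bits(x, mantissa_bits):
    """Round x to the nearest value with the given explicit mantissa bits.

    Simplified model: exponent range, subnormals and saturation are ignored.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)               # x = m * 2**e with 0.5 <= |m| < 1
    steps = 2 ** (mantissa_bits + 1)   # implicit leading bit + explicit bits
    return math.ldexp(round(m * steps) / steps, e)


x = 0.3
for name, bits in (("FP8 (E4M3, assumed)", 3), ("FP16", 10)):
    y = round_to_mantissa_bits(x, bits)
    print(f"{name}: {y!r}, relative error {abs(y - x) / x:.2%}")
```

With 3 mantissa bits, 0.3 rounds to 0.3125, an error of about 4%. With 10 bits the error is a small fraction of a percent.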
Technical Innovations of FlashAttention-3
The algorithm implements three key optimization methods specifically designed for the Hopper architecture[4]:
1. Asynchronous Execution and Warp Specialization
FlashAttention-3 employs warp specialization, where different groups of threads (warps) on the GPU specialize in different tasks:
- Producer warps: Load data from global memory using TMA.
- Consumer warps: Perform matrix multiplications on the Tensor Cores.
Thanks to Hopper's hardware asynchrony, these operations overlap in time: while the consumer warps compute on the current block, the producer warps load the data for the next one. On top of this, ping-pong scheduling alternates two groups of consumer warps so that the softmax of one group runs while the other group performs its matrix multiplications. Together, these techniques hide the latency of slow operations (like Softmax) and keep the functional units of the GPU busy.
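The producer/consumer split can be illustrated with a CPU analogy. In the sketch below, which uses ordinary Python threads rather than GPU code, one thread stages tiles into a small bounded buffer while another thread consumes them. The function names, the `stages` parameter, and the toy "compute" step are all assumptions made for the example.

```python
import queue
import threading


def producer(tiles, buffer):
    """Stand-in for producer warps: stage tiles into the shared buffer."""
    for tile in tiles:
        buffer.put(tile)   # blocks while the buffer is full
    buffer.put(None)       # sentinel: no more tiles


def consumer(buffer, results):
    """Stand-in for consumer warps: compute on each staged tile."""
    while True:
        tile = buffer.get()
        if tile is None:
            break
        results.append(sum(x * x for x in tile))  # toy "computation"


def run_pipeline(tiles, stages=2):
    """Run both roles concurrently with a buffer of `stages` slots."""
    buffer = queue.Queue(maxsize=stages)
    results = []
    workers = [
        threading.Thread(target=producer, args=(tiles, buffer)),
        threading.Thread(target=consumer, args=(buffer, results)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


print(run_pipeline([[1, 2], [3, 4], [5, 6]]))  # [5, 25, 61]
```

The bounded buffer plays the role of a limited number of shared-memory stages: the producer can run ahead of the consumer, but only by as many tiles as there are free slots.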
2. Minimizing Memory Operations
The algorithm retains the tiling philosophy from previous versions but actively uses TMA to asynchronously load subsequent data blocks in parallel with current computations. The data transfer from slow HBM to fast SRAM effectively occurs in the background of the main computations, which reduces GPU idle time spent waiting for data.
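The benefit of loading in the background can be estimated with an idealized timing model. The sketch below assumes constant per-tile load and compute times and perfect overlap. The function names and the time units are illustrative, not measurements.

```python
def serial_time(num_tiles, load, compute):
    """Each tile is loaded, then computed, with no overlap."""
    return num_tiles * (load + compute)


def overlapped_time(num_tiles, load, compute):
    """Idealized pipeline: the next tile loads while the current one computes.

    Only the first load and the last compute are fully exposed; every
    step in between costs whichever of the two is slower.
    """
    if num_tiles == 0:
        return 0.0
    return load + (num_tiles - 1) * max(load, compute) + compute


n, load, compute = 8, 1.0, 3.0
print(serial_time(n, load, compute))      # 32.0
print(overlapped_time(n, load, compute))  # 25.0
```

In this model, once compute time exceeds load time, all loads except the first are hidden behind computation.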
3. Low Precision (FP8) with Quantization Error Reduction
Switching to FP8 roughly doubles theoretical throughput but can cause significant precision loss from quantization. To counter this, the authors use a technique called incoherent processing[4], which works as follows:
- Before the attention computation, the feature vectors (queries Q and keys K) are multiplied by a random orthogonal matrix M (in practice, a Hadamard matrix combined with random ±1 sign flips).
- This transformation "smears" values with anomalously large magnitudes (outliers) across all coordinates, evening out their distribution.
- Quantization to FP8 is then performed on the transformed values, which incurs less error.
- Since M is orthogonal, it does not change the attention scores: (QM)(KM)ᵀ = QMMᵀKᵀ = QKᵀ, so the effect of the matrix cancels out in the multiplication.
In the authors' evaluation, FP8 FlashAttention-3 showed approximately 2.6 times lower numerical error than a baseline FP8 attention, a result obtained with incoherent processing used together with block quantization[4].
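The two properties that make this transformation useful, unchanged dot products and smaller peak values, can be checked numerically. The sketch below is a small pure-Python illustration, not the authors' implementation. It assumes a Sylvester-type Hadamard matrix with random sign flips, and all helper names are hypothetical.

```python
import math
import random


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H


def random_orthogonal(n, rng):
    """M = D * H / sqrt(n), where D is a random diagonal matrix of +/-1."""
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    scale = 1.0 / math.sqrt(n)
    return [[s * h * scale for h in row] for s, row in zip(signs, hadamard(n))]


def rotate(x, M):
    """Row vector x multiplied by matrix M."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
d = 8
M = random_orthogonal(d, rng)
q = [0.1, -0.2, 50.0, 0.05, 0.3, -0.1, 0.2, 0.0]  # one large outlier
k = [rng.uniform(-1.0, 1.0) for _ in range(d)]
q_rot, k_rot = rotate(q, M), rotate(k, M)

print("score before:", dot(q, k))
print("score after: ", dot(q_rot, k_rot))
print("max |q| before:", max(abs(x) for x in q))
print("max |q| after: ", max(abs(x) for x in q_rot))
```

The score is preserved up to rounding, while the outlier of magnitude 50 is spread out so that no single coordinate exceeds roughly 18. A quantizer whose scale is set by the largest value therefore has more resolution left for the remaining coordinates.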
Performance and Significance
The application of these techniques has allowed FlashAttention-3 to achieve a significant performance advantage over previous versions on the H100 GPU:
- A 1.5–2x speedup compared to FlashAttention-2.
- High GPU utilization: Achieves ~75–85% of the H100's theoretical peak performance.
- Throughput: up to 740–840 TFLOPS for half-precision (FP16/BF16) and up to 1.2–1.3 PFLOPS (petaflops) when using 8-bit precision (FP8)[2].
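The utilization percentages can be cross-checked against the throughput figures by dividing achieved throughput by the hardware peak. The sketch below assumes dense peak values of about 989 TFLOPS (FP16/BF16) and 1,979 TFLOPS (FP8) for the H100 SXM5. These peaks come from NVIDIA's published specifications rather than from this article, and the names used are illustrative.

```python
# Assumed dense peak throughput of the H100 SXM5, in TFLOPS.
H100_PEAK_TFLOPS = {"fp16_bf16": 989.0, "fp8": 1979.0}


def utilization(achieved_tflops, peak_tflops):
    """Fraction of the theoretical peak that was actually achieved."""
    return achieved_tflops / peak_tflops


for achieved in (740.0, 840.0):
    share = utilization(achieved, H100_PEAK_TFLOPS["fp16_bf16"])
    print(f"FP16/BF16 {achieved:.0f} TFLOPS -> {share:.0%}")
# FP16/BF16 740 TFLOPS -> 75%
# FP16/BF16 840 TFLOPS -> 85%

for achieved in (1200.0, 1300.0):
    share = utilization(achieved, H100_PEAK_TFLOPS["fp8"])
    print(f"FP8 {achieved:.0f} TFLOPS -> {share:.0%}")
```

Under these assumed peaks, the half-precision figures correspond to the quoted 75–85% range, while the FP8 figures represent a smaller share of the FP8 peak.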
The high efficiency of FlashAttention-3 has a direct impact on the development and application of LLMs:
- Reduced training time: A 1.5–2x faster attention computation shortens model training, which can take weeks or months.
- Increased context window: Models can efficiently process longer sequences (hundreds of thousands of tokens), which is crucial for analyzing large documents or codebases[1].
- Efficient resource utilization: Allows for achieving the same performance with fewer GPUs or higher speed on the same hardware, thereby reducing the cost of model deployment.
Availability and Integration
The authors have released the FlashAttention-3 source code under an open-source license on GitHub[4]. It is expected to be integrated into leading deep learning frameworks like PyTorch and libraries such as Hugging Face Transformers, making the technology accessible to a wide range of developers and researchers. Previous versions have already become the de facto standard in the industry, and FlashAttention-3 is likely to continue this trend.
Links
Literature
- Shah, J. et al. (2024). FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. arXiv:2407.08608.
- Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691.
- Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135.
- Kwon, W. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv:2309.06180.
- Ye, Z. et al. (2025). FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving. arXiv:2501.01005.
- Chen, Y. et al. (2023). FlashDecoding++: Faster Large Language Model Inference on GPUs. arXiv:2311.01282.
- Liu, Y. et al. (2024). FastAttention: Extending FlashAttention-2 to NPUs and Low-Resource GPUs. OpenReview: 76NYyOrnfk.
- Dege, P. et al. (2025). FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs. arXiv:2506.01969.
- Wang, G. et al. (2024). FlashMask: Efficient and Rich Mask Extension of FlashAttention. arXiv:2410.01359.
- Abbott, V.; Zardini, G. (2025). FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness. arXiv:2412.03317.
Notes
- [1] "FlashAttention-3 unleashes the power of H100 GPUs for LLMs". VentureBeat.
- [2] Shah, Jay, et al. "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision". OpenReview.
- [3] Shah, Jay, et al. "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision". arXiv:2407.08608v2 [cs.LG], 15 July 2024.
- [4] Shah, Jay, et al. "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision". Together AI Blog.