FlashAttention-4 Hits 71% GPU Utilization on NVIDIA Blackwell B200

2026/03/05 22:04
Terrill Dicki Mar 05, 2026 14:04

Together AI's FlashAttention-4 achieves 1,605 TFLOPs/s on B200 GPUs, up to 2.7x faster than Triton. New pipelining overcomes asymmetric hardware scaling bottlenecks.

Together AI has released FlashAttention-4, achieving up to 1,605 TFLOPs/s on NVIDIA's Blackwell B200 GPUs. That corresponds to 71% hardware utilization and a speedup of up to 2.7x over Triton implementations. The release addresses a fundamental challenge in modern AI hardware: tensor core throughput is scaling far faster than other critical resources.

For context, NVIDIA's market cap sits at $4.49 trillion as of March 4, 2026, with shares trading at $179.86. The company released its own Flash Attention optimization guide for Blackwell GPUs just yesterday, signaling the growing importance of attention optimization in production AI workloads.

The Asymmetric Scaling Problem

Here's what makes this interesting. From Hopper H100 to Blackwell B200, BF16 tensor core throughput jumped from 1 to 2.25 PFLOPs/s. But special function units for exponential operations and shared memory bandwidth? Unchanged. As matrix multiplies get faster, those unscaled resources become the bottleneck.
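The 71% utilization figure follows from these numbers. A quick check, assuming the 2.25 PFLOPs quoted above is the BF16 peak that utilization is measured against (the variable names are illustrative, not from Together AI):

```python
# Figures quoted in the article; names are illustrative.
hopper_peak_tflops = 1000.0     # H100 BF16 peak: about 1 PFLOPs
blackwell_peak_tflops = 2250.0  # B200 BF16 peak: 2.25 PFLOPs
achieved_tflops = 1605.0        # FlashAttention-4 headline throughput

# Fraction of the B200 tensor core peak that the kernel reaches.
utilization = achieved_tflops / blackwell_peak_tflops

# Generation-over-generation growth in tensor core throughput.
tensor_core_scaling = blackwell_peak_tflops / hopper_peak_tflops

print(f"utilization: {utilization:.1%}")                   # 71.3%
print(f"tensor core scaling: {tensor_core_scaling:.2f}x")  # 2.25x
```

The 2.25x applies only to the tensor cores. With exponential units and shared memory bandwidth flat, any step that depends on them takes a larger share of the runtime than it did on Hopper.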

The Together AI team discovered that the forward pass isn't compute-bound at all on B200—it's bottlenecked by exponential calculations in softmax. The backward pass? Shared memory traffic dominates. Traditional attention optimization focused on the wrong constraints.

How FA4 Solves It

The forward pass uses a ping-pong schedule processing two query tiles per CTA, with dedicated warpgroups handling softmax while others issue matrix operations. The clever bit: software emulation of the exponential function using FMA units alongside hardware MUFU.EX2, effectively doubling exponential throughput.
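To make the exponential emulation concrete, here is a minimal sketch of the general technique: split the input into an integer part and a small fraction, approximate 2^fraction with a polynomial evaluated by Horner's rule (one multiply-add per step), then apply the integer part as an exponent shift. The article does not give FA4's polynomial degree or coefficients, so the degree-5 Taylor coefficients and the function name below are assumptions for illustration only.

```python
import math

LN2 = math.log(2.0)

# Taylor coefficients of 2**f = exp(f * ln 2), degree 0 through 5.
# Assumption: a real kernel would likely use tuned coefficients instead.
COEFFS = [LN2 ** k / math.factorial(k) for k in range(6)]


def exp2_emulated(x: float) -> float:
    """Approximate 2**x with multiply-adds only (no hardware exp unit)."""
    n = round(x)   # integer part, applied later as an exponent shift
    f = x - n      # fractional part, in [-0.5, 0.5]

    # Horner's rule: each step is one multiply-add, the kind of
    # operation an FMA unit executes.
    p = COEFFS[-1]
    for c in reversed(COEFFS[:-1]):
        p = p * f + c

    # Scale by 2**n; on hardware this is an exponent-bit adjustment.
    return math.ldexp(p, n)


if __name__ == "__main__":
    for x in (-3.7, 0.0, 0.5, 4.25):
        print(x, exp2_emulated(x), 2.0 ** x)
```

Because this path runs on FMA units rather than the special function unit, a kernel can send some exponentials through it while the rest go through MUFU.EX2, which is where the extra throughput described above comes from.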

Conditional online softmax rescaling skips small corrections entirely. If the max jump stays below a threshold, the kernel avoids unnecessary vector operations. Final normalization still produces correct results—but the critical path shrinks considerably.
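A simplified sketch of the idea, for a single query row with scalar values: the running sum and accumulator are rescaled only when the block maximum jumps past a threshold, and otherwise the kernel keeps exponentiating against a slightly stale reference maximum. The function name, block size, and threshold value here are hypothetical, and the real kernel works on tiles rather than Python lists.

```python
import math


def attention_row(scores, values, block=4, tau=8.0):
    """Single-query attention with lazy (conditional) max rescaling.

    Returns the attention output and how many rescales were performed.
    `tau` is a hypothetical threshold chosen for illustration.
    """
    m_ref = -math.inf  # reference max used inside exp(); may lag the true max
    l = 0.0            # running sum of exp(score - m_ref)
    acc = 0.0          # running sum of exp(score - m_ref) * value
    rescales = 0

    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]

        m_new = max(m_ref, max(s_blk))
        if m_new - m_ref > tau:
            # Large jump: rescale the running state to the new reference.
            scale = math.exp(m_ref - m_new)
            l *= scale
            acc *= scale
            m_ref = m_new
            rescales += 1
        # Small jump: skip the rescale. exp(s - m_ref) can exceed 1 but
        # stays below exp(tau), so it cannot overflow.

        for s, v in zip(s_blk, v_blk):
            p = math.exp(s - m_ref)
            l += p
            acc += p * v

    # The same reference scales both acc and l, so it cancels here.
    return acc / l, rescales


if __name__ == "__main__":
    scores = [0.1, 2.0, -1.0, 3.5, 4.0, 9.0, 6.0, 1.0]
    values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    print(attention_row(scores, values))
```

In this example the second block raises the maximum from 3.5 to 9.0, a jump of 5.5 that falls under the threshold, so only the initial rescale happens. The output still matches a standard softmax because the final division removes whatever reference value was used.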

The backward pass exploits Blackwell's new 2-CTA MMA mode, partitioning output accumulators across CTA pairs. Each CTA stages half of operand B while keeping only its accumulator slice, roughly halving shared memory traffic. Global atomic reductions for dQ gradients also drop by half.

Performance Numbers

Against cuDNN 9.13, FlashAttention-4 delivers 1.1-1.3x improvement on forward passes and consistent gains on backward passes at large sequence lengths. The Triton comparison shows the starkest difference—up to 2.7x faster forward performance.

Deterministic mode, which serializes global reductions for reproducible training, still achieves 85-90% of non-deterministic throughput. That's significant for teams requiring exact reproducibility across training runs.
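Why does ordering matter at all? Floating-point addition is not associative, so partial results accumulated in a different order can round differently. The toy example below is plain Python rather than GPU code, and the helper name is made up for illustration.

```python
def reduce_in_order(partials, order):
    """Accumulate partial sums in the given order, like serialized adds."""
    total = 0.0
    for i in order:
        total += partials[i]
    return total


# Three partial contributions to the same gradient element.
partials = [0.1, 0.2, 0.3]

# Two different arrival orders, as unordered atomic adds might produce.
run_a = reduce_in_order(partials, [0, 1, 2])  # (0.1 + 0.2) + 0.3
run_b = reduce_in_order(partials, [2, 1, 0])  # (0.3 + 0.2) + 0.1

print(run_a)           # 0.6000000000000001
print(run_b)           # 0.6
print(run_a == run_b)  # False
```

Fixing the order makes every run produce the same bits, at the cost of waiting for contributions to arrive in sequence. That waiting is the overhead behind the 85-90% figure.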

The Broader Picture

FlashAttention has evolved rapidly since its May 2022 debut. Version 1 achieved 25-40% utilization on A100s. FA2 pushed that to 50-73% in July 2023. FA3 targeted Hopper GPUs specifically, hitting 75% utilization with FP16 and nearly 1.2 PFLOPs/s with FP8.

FA4 represents a philosophical shift—algorithm and kernel co-design that accounts for asymmetric hardware evolution. The techniques have already been partially incorporated into cuDNN 9.13 and 9.14 through collaboration with NVIDIA's teams.

The implementation uses CuTe-DSL, CUTLASS's Python kernel DSL, cutting compile times by 20-30x versus C++ templates. For teams running large-scale training on Blackwell hardware, the efficiency gains compound across millions of attention operations daily.
