FlashAttention-3 unleashes the power of H100 GPUs for LLMs

Last updated: July 15, 2024 10:57 pm
Published July 15, 2024

Attention is a core component of the transformer architecture used in large language models (LLMs). But as LLMs grow larger and handle longer input sequences, the computational cost of attention becomes a bottleneck.

To address this challenge, researchers from Colfax Research, Meta, Nvidia, Georgia Tech, Princeton University, and Together AI have introduced FlashAttention-3, a new technique that significantly speeds up attention computation on Nvidia Hopper GPUs (H100 and H800).

FlashAttention-3 builds on the earlier FlashAttention and FlashAttention-2 and further optimizes the use of resources on Nvidia Hopper GPUs to maximize performance and efficiency for LLM training and inference.

The challenge of attention computation in LLMs

One of the key innovations of transformers is the attention mechanism, which enables the model to compute the relationships between different tokens in an input sequence.

While the attention mechanism is very effective, it is also computationally expensive. The cost of attention computation grows quadratically with the length of the input sequence. As LLMs are scaled to handle longer and longer input sequences, the attention mechanism becomes a major bottleneck.
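
To see why, here is a minimal sketch of standard scaled dot-product attention in PyTorch (an illustration with our own variable names and sizes, not the FlashAttention implementation). The intermediate score matrix holds one entry for every pair of tokens, so doubling the sequence length quadruples both the compute and the memory spent on it.

```python
import math

import torch

def naive_attention(q, k, v):
    # q, k, v: (seq_len, head_dim)
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = (q @ k.T) * scale              # (seq_len, seq_len): quadratic in seq_len
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                      # (seq_len, head_dim)

q, k, v = (torch.randn(4096, 64) for _ in range(3))
out = naive_attention(q, k, v)  # materializes a 4096 x 4096 score matrix
```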

Moreover, modern hardware accelerators such as GPUs are optimized for matrix multiplication (matmul) operations, which are the building blocks of deep learning models. These accelerators also have compute units for other types of operations, such as exponentiation, but those units are hundreds of times slower than the matmul units.


Attention computations use a mix of matrix multiplications and other special functions that are not as well optimized for GPUs.

For example, the softmax function, which is used to normalize the attention weights, is computationally more expensive than matrix multiplication. As a result, even though matrix multiplications account for most of the computation in attention, the overall computation can be slowed down by a small number of special functions.
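
The gap is easy to observe with a rough micro-benchmark (a sketch only: exact numbers depend on the GPU, sizes, and data types, and the exponentiation measurement is also limited by memory bandwidth). It compares the achieved throughput of a half-precision matmul against elementwise exponentiation on the same tensor.

```python
import time

import torch

assert torch.cuda.is_available()  # timing only makes sense on a GPU

def bench(fn, warmup=5, iters=20):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

n = 4096
x = torch.randn(n, n, device="cuda", dtype=torch.float16)

t_mm = bench(lambda: x @ x)          # runs on the tensor cores
t_exp = bench(lambda: torch.exp(x))  # runs on the special-function units
print(f"matmul: {2 * n**3 / t_mm / 1e12:.1f} TFLOP/s")  # 2*n^3 FLOPs per matmul
print(f"exp:    {n * n / t_exp / 1e12:.3f} Tops/s")     # one exp per element
```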

One of the important aspects of optimizing attention computation is scheduling the workloads so that operations do not block one another and make efficient use of the different types of memory on the chip.

Making better use of hardware resources

FlashAttention, introduced in 2022, addressed the challenges of computing attention by reducing the number of memory reads and writes between GPU high-bandwidth memory (HBM) and GPU on-chip static random-access memory (SRAM) during attention computation. Instead of computing the attention weights for the entire sequence at once, FlashAttention breaks the computation down into smaller chunks, called “tiles,” that can be processed more efficiently on GPUs.
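
The following is a minimal sketch of that tiling idea in plain PyTorch (for readability only; the real implementation is a fused CUDA kernel). Keys and values are processed one tile at a time, and a running “online” softmax rescales the partial results so that the output matches the naive computation without ever materializing the full score matrix.

```python
import math

import torch

def tiled_attention(q, k, v, tile=128):
    # q, k, v: (n, d)
    n, d = q.shape
    scale = 1.0 / math.sqrt(d)
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))  # running max per query row
    row_sum = torch.zeros(n, 1)                  # running softmax normalizer
    for start in range(0, n, tile):
        k_blk = k[start:start + tile]
        v_blk = v[start:start + tile]
        scores = (q @ k_blk.T) * scale           # (n, tile), never (n, n)
        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)
        # Rescale previously accumulated output and normalizer to the new max
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)
        out = out * correction + p @ v_blk
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(1024, 64) for _ in range(3))
reference = torch.softmax((q @ k.T) / math.sqrt(64), dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), reference, atol=1e-4)
```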

FlashAttention has been widely adopted and has contributed to growing the context window of LLMs from a few thousand tokens to hundreds of thousands and even millions of tokens.

However, as hardware has improved, so have the opportunities for optimizing LLM computations. FlashAttention-2, introduced in 2023, further optimized the use of GPU resources, reaching up to 70% of the declared maximum performance on Nvidia A100 GPUs. Those optimizations did not transfer to the newer H100 GPUs, however: FlashAttention-2 used only 35% of the H100's maximum capacity.


FlashAttention-3

FlashAttention-3 takes advantage of new features in Nvidia Hopper GPUs to maximize performance. These features enable higher throughput on matrix multiplication operations, faster data transfer across different memory segments, and better efficiency on low-precision operations.

FlashAttention-3 introduces several innovations to improve the performance of attention computation on H100 GPUs.

FlashAttention-3 schedules operations in a way that maximizes the overlap between computation and the movement of data between different memory segments of the GPU. This reduces the time the GPU spends idle while waiting for data transfers. It also interleaves the matrix multiplication and softmax operations to reduce potential bottlenecks in computing attention values.
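
FlashAttention-3 achieves this overlap inside a single fused kernel using Hopper-specific hardware features, but the general principle can be illustrated at a much coarser grain with CUDA streams in PyTorch. The following is a conceptual sketch (our own example, not the FlashAttention-3 implementation): while the GPU computes on one chunk of data, the next chunk is copied over on a separate stream.

```python
import torch

assert torch.cuda.is_available()  # this sketch needs a CUDA GPU

copy_stream = torch.cuda.Stream()
weight = torch.randn(2048, 2048, device="cuda")
chunks = [torch.randn(2048, 2048, pin_memory=True) for _ in range(8)]

def prefetch(t):
    # Enqueue the host-to-device copy on the dedicated copy stream
    with torch.cuda.stream(copy_stream):
        return t.to("cuda", non_blocking=True)

results = []
pending = prefetch(chunks[0])
for i in range(len(chunks)):
    # Wait only for copies enqueued so far, i.e. the copy of chunk i
    torch.cuda.current_stream().wait_stream(copy_stream)
    current = pending
    if i + 1 < len(chunks):
        pending = prefetch(chunks[i + 1])  # this copy overlaps the matmul below
    results.append(current @ weight)       # compute on chunk i
torch.cuda.synchronize()
```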

FlashAttention-3 also uses a special arrangement of operations for faster and more accurate attention computation in quantized models. Quantization is a popular technique that reduces the size of models by using low-bit numbers to store their weights. The tradeoff of quantization is a potential loss of accuracy. FlashAttention-3 addresses this problem by carefully arranging the computations to minimize the impact of quantization on accuracy.
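
One common ingredient of such arrangements is block-wise scaling, sketched below (an illustration of the general technique with our own helper names, not FlashAttention-3's actual FP8 kernel). Giving each block of values its own scale factor means a few large outliers cannot destroy the precision of the whole tensor.

```python
import torch

def quantize_blockwise(x, block=128, dtype=torch.float8_e4m3fn):
    # x: (n, d); returns an FP8 payload plus one float32 scale per row block
    n, d = x.shape
    x_blocks = x.view(n // block, block, d)
    # 448 is the largest value representable in the e4m3 format
    scales = x_blocks.abs().amax(dim=(1, 2), keepdim=True).clamp(min=1e-12) / 448.0
    return (x_blocks / scales).to(dtype), scales

def dequantize_blockwise(q, scales):
    return (q.to(torch.float32) * scales).view(-1, q.shape[-1])

x = torch.randn(1024, 64)
q, scales = quantize_blockwise(x)
err = (dequantize_blockwise(q, scales) - x).abs().max()
print(f"max abs error after FP8 round-trip: {err:.4f}")
```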

According to the researchers, FlashAttention-3 achieves up to 75% utilization of the H100 GPU's maximum capabilities. This translates into a 1.5–2x speedup over previous versions of FlashAttention for both training and running LLMs.

The benefits of FlashAttention-3

The faster attention computation offered by FlashAttention-3 has several implications for LLM development and applications.

Training LLMs is a computationally expensive process that can take weeks or even months. The fast attention computation provided by FlashAttention-3 can significantly reduce the time it takes to train LLMs, allowing researchers and developers to experiment with larger models and datasets.


FlashAttention-3 can also help extend the context window of LLMs by enabling them to process longer sequences more efficiently. This can unlock new applications for LLMs in areas such as long-form document understanding and many-shot in-context learning.

And by using a higher share of GPU capacity, FlashAttention-3 can reduce the number of accelerators required to run LLMs and slash the cost of running models in production.

The researchers have open-sourced FlashAttention-3 under a permissive license and plan to integrate it into popular deep learning libraries such as PyTorch and Hugging Face Transformers. This will make it easier for researchers and developers to take advantage of FlashAttention-3's performance benefits.
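
For developers who want to try it, the open-source flash-attn package already exposes a drop-in attention function. The sketch below uses the FlashAttention-2 interface, which the FlashAttention-3 release is expected to mirror on Hopper GPUs; the shapes and sizes are our own example.

```python
import torch
from flash_attn import flash_attn_func  # pip install flash-attn

batch, seqlen, n_heads, head_dim = 2, 4096, 16, 64
q = torch.randn(batch, seqlen, n_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal (decoder-style) attention; output has the same shape as q
out = flash_attn_func(q, k, v, causal=True)
```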
“We have seen that designing algorithms that take advantage of the hardware they run on can bring significant efficiency gains and unlock new model capabilities such as long context,” the researchers wrote in a blog post published by Together AI. “We look forward to future work on optimization for LLM inference, as well as generalizing our methods to other hardware architectures.”

