It’s a huge cost play, he pointed out, and it “has to happen everywhere, all the time, for all users.”
The next phase of inferencing
The new Groq 3 language processing units (LPUs) are based on intellectual property (IP) from Groq, which signed a $20 billion licensing agreement with Nvidia late last year. According to the chip company, a fleet of LPUs can function as a “giant single processor.”
While Rubin GPUs will continue to handle prefill (prompt processing), Groq’s LPX will now handle the latency-sensitive parts of decode (response). Together, they will deliver a “new class of inference performance,” Nvidia says.
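Conceptually, this splits serving into two pipelines: a compute-bound pass over the whole prompt, then a latency-bound loop that emits the response token by token. A minimal sketch of how such a prefill/decode handoff might look in code follows; the names and structure are hypothetical illustrations, not Nvidia’s or Groq’s software.

```python
# Conceptual sketch of disaggregated prefill/decode serving.
# Function names are illustrative stand-ins, not real APIs.

def prefill_on_gpu(prompt: str) -> list[str]:
    """Compute-bound phase: process the entire prompt at once.
    A real system would build the model's KV cache here."""
    return prompt.split()

def decode_on_lpu(kv_cache: list[str], max_tokens: int = 4) -> list[str]:
    """Latency-bound phase: emit tokens one at a time.
    A real decoder samples from the model; this is a placeholder loop."""
    return [f"token_{i}" for i in range(max_tokens)]

def serve(prompt: str) -> str:
    cache = prefill_on_gpu(prompt)   # Rubin GPUs: prompt processing
    tokens = decode_on_lpu(cache)    # Groq LPX: response generation
    return " ".join(tokens)

print(serve("Explain disaggregated inference"))
```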
Each LPX rack features 256 LPUs with 128 GB of on-chip static random-access memory (SRAM), 150 terabytes per second (TB/s) of bandwidth, chip-to-chip links, and high-speed connections to NVL72, Nvidia’s liquid-cooled AI supercomputer. Combined, these can reduce latency to “near zero,” Nvidia claims.
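A quick back-of-envelope check on those figures, treating the published numbers as rack-level aggregates split evenly per chip (an assumption; the article does not specify):

```python
# Back-of-envelope arithmetic from the published rack specs.
# Assumption: figures are rack-level aggregates, split evenly per LPU.
LPUS_PER_RACK = 256
RACK_SRAM_GB = 128       # on-chip SRAM
RACK_BW_TBPS = 150       # bandwidth, TB/s

sram_per_lpu_mb = RACK_SRAM_GB * 1024 / LPUS_PER_RACK
print(f"SRAM per LPU: {sram_per_lpu_mb:.0f} MB")        # 512 MB

# Time to read the entire rack's SRAM once at full bandwidth:
sweep_us = RACK_SRAM_GB / (RACK_BW_TBPS * 1024) * 1e6
print(f"Full SRAM sweep: {sweep_us:.0f} microseconds")  # ~833 us
```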
The LPX integration with Vera Rubin AI factories will be available in the second half of this year.
Training versus inferencing
Training and inference stress infrastructure in very different ways, noted Sanchit Vir Gogia, chief analyst at Greyhound Research. While training rewards “massive parallelism and brute-force scale,” inferencing (especially for long context and interactive reasoning) is far more sensitive to latency, memory movement, cache behavior, concurrency, and cost per delivered token.
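To make “cost per delivered token” concrete, here is a toy comparison; every number below is invented purely for illustration, not taken from any vendor:

```python
# Toy model of the tradeoff Gogia describes.
# All figures are illustrative assumptions, not vendor data.

def cost_per_token(hourly_cost: float, tokens_per_sec: float) -> float:
    """Dollars per delivered token at a given sustained throughput."""
    return hourly_cost / (tokens_per_sec * 3600)

# Throughput-oriented serving: big batches maximize tokens/s,
# but each request waits on the batch, hurting interactivity.
batched = cost_per_token(hourly_cost=98.0, tokens_per_sec=40_000)

# Latency-oriented serving: lower aggregate throughput, but each
# token arrives in milliseconds instead of waiting on a batch.
low_latency = cost_per_token(hourly_cost=98.0, tokens_per_sec=8_000)

print(f"batched:     ${batched:.8f}/token")
print(f"low-latency: ${low_latency:.8f}/token")
```

The gap between the two figures is the premium paid for responsiveness; the economics of inference hardware come down to narrowing it.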
