You’ve probably heard the term ‘AI factories’ thrown around, but what does it really mean? So far, the concept has been hyped more than defined, mostly by Nvidia. The company’s vision is data centers full of high-end AI accelerators, but is that vision realistic, or just strategic marketing?
Simply put, an AI factory is a specialized data center designed for AI processing rather than traditional workloads like hosting databases, file storage, business applications, or web services. An AI factory is built around GPUs, which outperform CPUs in speed and processing power when handling AI workloads.
AI factories are facilities designed to process vast amounts of data for generative AI, train AI models, generate outputs such as text, images, video, or audio, update AI systems, and control other systems such as robots or supercomputers.
Because GPUs run so hot and consume so much power, AI factories require more energy and cooling than traditional data centers. They are likely to be located where energy is cheap and there is a ready supply of water for liquid cooling.
One example is Elon Musk’s xAI data center, which houses 100,000 Nvidia H100 GPUs for advanced AI processing. At an estimated $40,000 per GPU, that represents an investment of roughly $4 billion from a single customer, perhaps illustrating why Nvidia CEO Jensen Huang continues to champion the concept of AI factories.
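As a rough check on that figure, the arithmetic can be sketched in a few lines of Python. The function and constant names are illustrative, the unit price is the estimate quoted above rather than a list price, and networking, power, and cooling are not included.

```python
# Back-of-the-envelope GPU spend, using the figures quoted in the article.
GPU_COUNT = 100_000          # Nvidia H100 GPUs reported for the xAI data center
PRICE_PER_GPU_USD = 40_000   # estimated unit price (an estimate, not a list price)


def gpu_spend_usd(gpu_count: int, price_per_gpu_usd: int) -> int:
    """Return the total GPU hardware spend in US dollars."""
    return gpu_count * price_per_gpu_usd


total = gpu_spend_usd(GPU_COUNT, PRICE_PER_GPU_USD)
print(f"Estimated GPU spend: ${total / 1e9:.1f} billion")
# prints "Estimated GPU spend: $4.0 billion"
```

Changing either constant shows how sensitive the headline number is to the per-GPU price estimate.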
Inside an AI factory: High-performance GPUs drive massive AI workloads, but can these facilities scale sustainably? Image: Alamy
AI Factories: Hype vs. Reality
While the concept is compelling, will we see the wave of AI factories that Huang is promising? Probably not at scale. AI hardware is not only costly to acquire and operate, it also doesn’t run continuously like a database server. Once a model is trained, it may not need updates for months, leaving this expensive infrastructure sitting idle.
For that reason, Alan Howard, senior analyst at Omdia specializing in infrastructure and data centers, believes most AI hardware deployments will occur in multipurpose data centers. These facilities will likely feature dedicated ‘AI zones’ alongside areas for standard compute and other workloads.
“It’s our feeling, really, that there will be some dedicated AI data centers, but unlikely it’s going to be as pervasive as we’re being led to believe,” Howard told DCN.
“If I have a 50,000 sq. ft. data hall in a data center, and I have ample power, then I can create an area or a suite that can meet those really high power demands for a deployment of AI equipment. You’re not going to see very many data centers just full of AI equipment… It’s going to be part of a bigger data center.”
Too Expensive for Most
Ram Palaniappan, chief technology officer with consultancy TEKsystems, agrees that dedicated AI data centers will remain limited, largely because of the high costs involved.
“Enterprises are doing a lot more inference than actually training with their data,” he said. “If you can partition within your data center where some portions are dedicated to AI, you can use that GPU capacity for training the model, and then the remaining CPUs will be leveraged for inferencing the model. That’s how we’re seeing how the data center world is tuning towards based on the enterprise consumption and the usage of the AI.”
Anthony Goonetilleke, group president and head of strategy and technology for telecom digital transformation provider Amdocs, believes that many of these next-generation AI factories will become available for customers to rent through an AI-as-a-Service model, which major cloud service providers like Amazon Web Services offer.
“People are trying to build out AI factories to essentially create a model where they can sell AI capacity as a service, as some of our customers want to do,” Goonetilleke told DCN. “At the end of the day, think of it as Gen AI Infrastructure-as-a-Service. I think AI as a service has got a lot of potential upsides because the investment in AI hardware is enormously expensive, and in many cases, you may not need it again, or you may not need to use it as much.”
AI tech advances rapidly, and keeping up with the competition is prohibitively expensive, Palaniappan added. “When you start looking at how much each of these GPUs cost, and it gets outdated pretty quickly, that becomes a bottleneck,” he said. “If you are trying to leverage a data center, you’re always looking for the latest chip in the facility, so many of these data centers are losing money because of these efforts.”
Don’t Forget the Network
In addition to the cost of the GPUs, significant investment is required for networking hardware, as all the GPUs need to communicate with each other efficiently. Tom Traugott, senior vice president of strategy at EdgeCore Digital Infrastructure, explains that in a typical eight-GPU Nvidia DGX system, the GPUs communicate via NVLink. However, to share data with GPUs in other systems, they rely on Ethernet or InfiniBand, requiring substantial networking hardware to support the connection.
“When you’re doing a training run, it’s like individuals on a team,” Traugott said. “They’re all working on the same project, and they collectively come back together periodically and kind of trade notes.”
In smaller clusters, networking costs are similar to those of traditional data centers. However, in clusters with 5,000, 10,000, or 20,000 GPUs, networking can account for around 15% of the overall CapEx, he said. The data sets are so enormous that a single NIC is easily saturated, so multiple NICs are needed to avoid bottlenecks, and the cost quickly adds up.
“Interestingly, that may be as high as 30% to 40% of the overall spend, which is disproportionate to prior generations,” Traugott told DCN. “So, from a data center standpoint, we may not see it, right, the power, space, cooling, if it’s all GPUs, there may be different densities.”
The Future of AI Factories
This is still very new tech. There is only one known AI factory currently in development: the xAI facility. Nvidia has only recently released blueprints for building AI factories, called enterprise reference designs, to help guide the building process. Much is subject to change, and some clarity is needed as the concept develops.
“So, is it going to be a small trend where there’s a handful of companies that do a handful of dedicated AI factories, or is it going to be bigger? My personal speculation is it’s going to probably be about a year before we get a better bead on whether new data center construction has essentially a new face to it in the world of AI factories,” said Howard.