Azure CTO Mark Russinovich’s annual Azure infrastructure sessions at Build are always fascinating as he explores the past, present, and future of the hardware that underpins the cloud. This year’s talk was no different, focusing on the same AI platform touted in the rest of the event.
Over the years it’s been clear that Azure’s hardware has grown increasingly complex. At the start, it was a prime example of utility computing, using a single standard server design. Now it’s many different server types, able to support all classes of workloads. GPUs were added, and now AI accelerators.
That last innovation, introduced in 2023, shows how much Azure’s infrastructure has evolved along with the workloads it hosts. Russinovich’s first slide showed how quickly modern AI models have been growing, from 110 million parameters with GPT in 2018 to over a trillion in today’s GPT-4o. That growth has led to the development of massive distributed supercomputers to train these models, along with the hardware and software to make them efficient and reliable.
Building the AI supercomputer
The scale of the systems needed to run these AI platforms is enormous. Microsoft’s first big AI-training supercomputer was detailed in May 2020. It had 10,000 Nvidia V100 GPUs and came in at number five in the global supercomputer rankings. Only three years later, in November 2023, the latest iteration had 14,400 H100 GPUs and ranked third.
As of June 2024, Microsoft has more than 30 similar supercomputers in data centers around the world. Russinovich talked about the open source Llama-3-70B model, which takes 6.4 million GPU hours to train. On one GPU that would take 730 years, but with one of Microsoft’s AI supercomputers, a training run takes roughly 27 days.
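Those figures are easy to sanity-check. A quick back-of-the-envelope calculation bears them out; the 10,000-GPU cluster size below is my own illustrative assumption, not a number quoted in the talk:

```python
# Sanity-checking the Llama-3-70B training figures from the talk.
GPU_HOURS = 6.4e6  # total GPU hours quoted for one training run

# On a single GPU: convert GPU hours into calendar years.
years_on_one_gpu = GPU_HOURS / (24 * 365)   # roughly 730 years

# On a supercomputer: assume ~10,000 GPUs working in parallel
# (an illustrative figure, not one given in the presentation).
gpus = 10_000
days_on_cluster = GPU_HOURS / gpus / 24     # roughly 27 days

print(f"{years_on_one_gpu:.1f} years on one GPU, "
      f"{days_on_cluster:.1f} days on {gpus:,} GPUs")
```

The two quoted numbers are consistent with each other: a cluster in the low tens of thousands of GPUs turns centuries of single-GPU work into weeks.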
Training is only half the problem. Once a model has been built, it needs to be used, and although inference doesn’t need the supercomputer levels of compute used for training, it still needs plenty of power. As Russinovich notes, a single floating-point parameter needs two bytes of memory, so a one-billion-parameter model needs 2GB of RAM, and a 175-billion-parameter model requires 350GB. That’s before you add in any necessary overhead, such as caches, which can add more than 40% to already-hefty memory requirements.
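That arithmetic is simple to reproduce. A minimal sketch of the estimate, using the two-bytes-per-parameter figure and the 40% cache overhead Russinovich mentions (the function name is mine):

```python
def inference_memory_gb(params: float, overhead: float = 0.40) -> tuple[float, float]:
    """Rough inference memory estimate: 2 bytes per FP16 parameter,
    plus a fractional overhead for caches and similar structures."""
    base_gb = params * 2 / 1e9              # weights alone, in GB
    return base_gb, base_gb * (1 + overhead)

base, with_cache = inference_memory_gb(175e9)   # a GPT-3-class model
print(f"weights: {base:.0f} GB, with 40% overhead: {with_cache:.0f} GB")
# weights alone are 350 GB; the overhead pushes that to roughly 490 GB
```

At those sizes a single model spans several GPUs’ worth of high-bandwidth memory, which is why the hardware requirements below follow directly.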
All this means that Azure needs a lot of GPUs with very specific characteristics to push through a lot of data as quickly as possible. Models like GPT-4 require significant amounts of high-bandwidth memory. Compute and memory also need substantial amounts of power. An Nvidia H100 GPU requires 700 watts, and with thousands in operation at any time, Azure data centers need to dump a lot of heat.
Beyond training, designing for inference
Microsoft has developed its own inference accelerator in the shape of its Maia hardware, which is pioneering a new directed-liquid cooling system. It sheathes the Maia accelerators in a closed loop, which has required a whole new rack design with a secondary cabinet that contains the cooling equipment’s heat exchangers.
Designing data centers for training has shown Microsoft how to provision for inference. Training quickly ramps up to 100% power draw and holds there for the duration of a run. Using the same power monitoring on an inferencing rack, it’s possible to see how power draw varies at different points during an inferencing operation.
Azure’s Project POLCA aims to use this information to increase efficiency. It allows multiple inferencing operations to run at the same time by provisioning for peak power draw, giving around 20% overhead. That lets Microsoft put 30% more servers in a data center by throttling both server frequency and power. The result is a more efficient and more sustainable approach to the compute, power, and thermal demands of an AI data center.
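Microsoft hasn’t published POLCA’s internals, but the provisioning idea is easy to sketch. In the toy numbers below, the rack budget and per-server peak draw are invented for illustration; only the 30%-more-servers outcome mirrors what Russinovich described:

```python
def servers_per_budget(budget_w: float, per_server_w: float) -> int:
    """How many servers fit under a fixed power budget."""
    return int(budget_w // per_server_w)

RACK_BUDGET_W = 40_000   # illustrative rack power budget
PEAK_W = 4_000           # illustrative per-server peak draw

# Conservative provisioning: assume every server can hit peak at once.
conservative = servers_per_budget(RACK_BUDGET_W, PEAK_W)

# POLCA-style provisioning: cap each server's draw through frequency
# and power throttling, so more servers fit under the same budget.
THROTTLE = 0.76          # cap draw at 76% of peak
oversubscribed = servers_per_budget(RACK_BUDGET_W, PEAK_W * THROTTLE)

print(conservative, oversubscribed)   # 10 servers vs. 13: 30% more
```

The bet, as with airline overbooking, is that not every server peaks at once; throttling is the safety valve when they do.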
Managing the data for training models brings its own set of problems; there’s a lot of data, and it needs to be distributed across the nodes of those Azure supercomputers. Microsoft has been working on what it calls Storage Accelerator to manage this data, distributing it across clusters with a cache that determines whether required data is available locally or needs to be fetched, using available bandwidth to avoid interfering with current operations. Using parallel reads to load data allows large amounts of training data to be loaded almost twice as fast as traditional file loads.
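The check-local-then-fetch pattern it describes is a classic read-through cache. A minimal sketch of the idea, with class and method names that are mine rather than Microsoft’s:

```python
import concurrent.futures
from pathlib import Path

class TrainingDataCache:
    """Toy read-through cache: serve training shards from local disk
    when present, otherwise fetch from remote storage and keep a copy."""

    def __init__(self, cache_dir: str, fetch):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch  # callable: shard name -> bytes from remote storage

    def read(self, shard: str) -> bytes:
        local = self.cache_dir / shard
        if local.exists():           # hit: the data is already on this node
            return local.read_bytes()
        data = self.fetch(shard)     # miss: pull it across the network
        local.write_bytes(data)      # ...and cache it for the next epoch
        return data

    def read_many(self, shards: list[str]) -> list[bytes]:
        # Parallel reads: the technique the article credits with
        # loading training data almost twice as fast as serial loads.
        with concurrent.futures.ThreadPoolExecutor() as pool:
            return list(pool.map(self.read, shards))
```

A real implementation would also meter its bandwidth use to stay out of the way of running jobs, which this sketch omits.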
AI needs high-bandwidth networks
Compute and storage are important, but networking remains critical, especially with massive data-parallel workloads running across many hundreds of GPUs. Here, Microsoft has invested significantly in high-bandwidth InfiniBand connections, with 1.2TBps of internal connectivity in its servers, linking 8 GPUs, and 400Gbps between individual GPUs in separate servers.
Microsoft has invested a lot in InfiniBand, both for its OpenAI training supercomputers and for its customer services. Interestingly, Russinovich noted that “really, the only difference between the supercomputers we build for OpenAI and what we make available publicly is the size of the InfiniBand domain. In the case of OpenAI, the InfiniBand domain covers the whole supercomputer, which is tens of thousands of servers.” For other customers who don’t have the same training demands, the domains are smaller, but still at supercomputer scale, “1,000 to 2,000 servers in size, connecting 10,000 to 20,000 GPUs.”
All that networking infrastructure requires some surprisingly low-tech solutions, such as 3D-printed sleds to efficiently pull large amounts of cable. They’re placed in the cable trays above the server racks and pulled along. It’s a simple way to cut cabling times significantly, a necessity when you’re building 30 supercomputers every six months.
Making AI reliable: Project Forge and One Pool
Hardware is only part of the Azure supercomputer story. The software stack provides the underlying platform orchestration and support tools. This is where Project Forge comes in. You can think of it as an equivalent to something like Kubernetes, a way of scheduling operations across a distributed infrastructure while providing essential resource management and spreading loads across different types of AI compute.
The Project Forge scheduler treats all the available AI accelerators in Azure as a single pool of virtual GPU capacity, something Microsoft calls One Pool. Loads have priority levels that control access to these virtual GPUs. A higher-priority load can evict a lower-priority one, moving it to a different class of accelerator or to another region altogether. The aim is to provide a consistent level of utilization across the whole Azure AI platform so Microsoft can better plan and manage its power and networking budget.
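Microsoft hasn’t detailed One Pool’s scheduler, but priority-based eviction is a well-understood pattern. A toy sketch under that assumption, with all names invented:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    name: str
    priority: int   # lower number = more important, like a Unix nice value

class OnePoolSketch:
    """Toy model of a fixed pool of virtual GPU slots in which a
    higher-priority load can evict a lower-priority one."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.running: list[Job] = []

    def submit(self, job: Job) -> Optional[Job]:
        """Place the job, returning whichever job lost out: an evicted
        lower-priority job (to be retried on another accelerator class
        or region), the submitted job itself if everything running
        outranks it, or None if a slot was free."""
        if len(self.running) < self.capacity:
            self.running.append(job)
            return None
        victim = max(self.running, key=lambda j: j.priority)  # least important
        if victim.priority > job.priority:
            self.running.remove(victim)
            self.running.append(job)
            return victim
        return job
```

The point of the indirection is the one the article makes: evicted work isn’t lost, it’s moved, which keeps utilization of the whole pool consistently high.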
Like Kubernetes, Project Forge is designed to help run a more resilient service, detecting failures, restarting jobs, and repairing the host platform. By automating these processes, Azure can avoid having to restart expensive and complex jobs, treating them instead as a set of batches that can be run individually, orchestrating inputs and outputs as needed.
Consistency and security: ready for AI applications
Once an AI model has been built, it needs to be used. Again, Azure needs a way of balancing usage across different types of models and different prompts within those models. With no orchestration (or lazy orchestration), it’s easy to get into a position where one prompt ends up blocking other operations. By taking advantage of its virtual, fractional GPUs, Azure’s Project Flywheel can guarantee performance, interleaving operations from multiple prompts across virtual GPUs, allowing consistent operations on the host physical GPU while still providing constant throughput.
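The mechanics of Flywheel aren’t public, but the interleaving it describes resembles round-robin scheduling of per-prompt generation steps. A toy sketch under that assumption, with invented names:

```python
from collections import deque

def interleave_prompts(prompts: dict[str, int], steps_per_turn: int = 1) -> list[str]:
    """Toy round-robin interleaver. Each prompt needs some number of
    generation steps; the physical GPU serves them a slice at a time,
    so no single long-running prompt can monopolize the device."""
    queue = deque(prompts.items())
    schedule = []                    # the order in which steps hit the GPU
    while queue:
        name, remaining = queue.popleft()
        schedule.append(name)
        remaining -= steps_per_turn
        if remaining > 0:
            queue.append((name, remaining))   # back of the line
    return schedule

order = interleave_prompts({"long-report": 3, "short-chat": 1})
print(order)  # ['long-report', 'short-chat', 'long-report', 'long-report']
```

Without the interleaving, the short chat prompt would sit behind all three of the report’s steps; with it, every prompt makes steady progress, which is the constant-throughput guarantee the article describes.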
Another low-level optimization is confidential computing capabilities for training custom models. You can run code and host data in trusted execution environments. Azure can now offer fully confidential VMs, including GPUs, with encrypted messages passing between the CPU and GPU trusted environments. You can use this for training or for securing the private data used for retrieval-augmented generation.
From Russinovich’s presentation, it’s clear that Microsoft is investing heavily in making its AI infrastructure efficient and responsive for both training and inference. The Azure infrastructure and platform teams have put a lot of work into building out hardware and software that can support training the largest models, while providing a secure and reliable place to use AI in your applications.
Running OpenAI on Azure has given these teams a lot of experience, and it’s good to see that experience paying off in providing the same tools and techniques for the rest of us, even if we don’t need our own TOP500 supercomputers.
Copyright © 2024 IDG Communications, Inc.