Inside today’s Azure AI cloud data centers

Last updated: June 24, 2024 6:32 pm
Published June 24, 2024

Azure CTO Mark Russinovich's annual Azure infrastructure presentations at Build are always fascinating as he explores the past, present, and future of the hardware that underpins the cloud. This year's talk was no different, focusing on the same AI platform touted in the rest of the event.

Over time it's been clear that Azure's hardware has grown increasingly complex. At first, it was a prime example of utility computing, using a single standard server design. Now it's many different server types, able to support all classes of workloads. GPUs were added, and now AI accelerators.

That last innovation, launched in 2023, shows how much Azure's infrastructure has evolved along with the workloads it hosts. Russinovich's first slide showed how quickly modern AI models have grown, from 110 million parameters with GPT in 2018 to over a trillion in today's GPT-4o. That growth has led to the development of massive distributed supercomputers to train these models, along with the hardware and software needed to make them efficient and reliable.

Building the AI supercomputer

The scale of the systems needed to run these AI platforms is enormous. Microsoft's first big AI-training supercomputer was detailed in May 2020. It had 10,000 Nvidia V100 GPUs and came in at number five in the global supercomputer rankings. Only three years later, in November 2023, the latest iteration had 14,400 H100 GPUs and ranked third.

As of June 2024, Microsoft has more than 30 similar supercomputers in data centers around the world. Russinovich talked about the open source Llama-3-70B model, which takes 6.4 million GPU hours to train. On one GPU that would take 730 years, but with one of Microsoft's AI supercomputers, a training run takes roughly 27 days.
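
Those figures are easy to sanity-check. A back-of-the-envelope sketch, assuming perfect scaling and a 10,000-GPU cluster like the 2020 machine described above:

```python
GPU_HOURS = 6.4e6  # reported training cost of Llama-3-70B

# On a single GPU, running around the clock:
years_single = GPU_HOURS / (24 * 365)
print(f"One GPU: {years_single:.0f} years")  # ~730 years

# On an assumed 10,000-GPU cluster with perfect scaling:
cluster_gpus = 10_000
days_cluster = GPU_HOURS / cluster_gpus / 24
print(f"{cluster_gpus} GPUs: {days_cluster:.0f} days")  # ~27 days
```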

Training is only half the problem. Once a model has been built, it needs to be used, and although inference doesn't need supercomputer levels of compute the way training does, it still needs a lot of memory and power. As Russinovich notes, a single floating-point parameter needs two bytes of memory, so a one-billion-parameter model needs 2GB of RAM, and a 175-billion-parameter model requires 350GB. That's before you add in any necessary overhead, such as caches, which can add more than 40% to already-hefty memory requirements.
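
That arithmetic is easy to capture in code. A minimal sketch, assuming two-byte (FP16) parameters and the 40% cache overhead figure quoted above:

```python
def inference_memory_gb(params_billions: float, overhead: float = 0.40) -> float:
    """Estimate serving memory: 2 bytes per FP16 parameter, plus cache overhead."""
    base_gb = params_billions * 2  # 1B params * 2 bytes/param = 2GB
    return base_gb * (1 + overhead)

print(inference_memory_gb(1))    # 2.8   (2GB base + 40% overhead)
print(inference_memory_gb(175))  # 490.0 (350GB base + 40% overhead)
```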

All of this means that Azure needs a lot of GPUs with very specific characteristics to push through a lot of data as quickly as possible. Models like GPT-4 require significant amounts of high-bandwidth memory. Compute and memory both need substantial amounts of power. An Nvidia H100 GPU requires 700 watts, and with thousands in operation at any time, Azure data centers have to dump a lot of heat.

Beyond training, design for inference

Microsoft has developed its own inference accelerator in the shape of its Maia hardware, which pioneers a new directed-liquid cooling system. The Maia accelerators are sheathed in a closed-loop cooling system that has required a whole new rack design, with a secondary cabinet containing the cooling equipment's heat exchangers.

Designing data centers for training has shown Microsoft how to provision for inference. Training quickly ramps up to 100% power draw and holds there for the duration of a run. Using the same power monitoring on an inferencing rack, it's possible to see how power draw varies at different points during an inferencing operation.

Azure's Project POLCA aims to use this information to increase efficiency. It allows multiple inferencing operations to run at the same time by provisioning for peak power draw with around 20% overhead. That lets Microsoft put 30% more servers in a data center by throttling both server frequency and power. The result is a more efficient and more sustainable approach to the compute, power, and thermal demands of an AI data center.
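
The oversubscription arithmetic works roughly as follows. In this sketch the rack budget and per-server peak are illustrative numbers; only the idea of packing roughly 30% more servers and throttling when needed comes from the talk:

```python
rack_power_budget_kw = 40.0  # illustrative rack power budget
peak_per_server_kw = 2.0     # illustrative peak draw per server

# Conventional provisioning: budget every server at its peak draw.
servers_conventional = int(rack_power_budget_kw / peak_per_server_kw)  # 20

# POLCA-style provisioning: inference rarely peaks everywhere at once,
# so pack ~30% more servers and throttle frequency/power on the rare
# occasions aggregate draw approaches the budget.
servers_oversubscribed = int(servers_conventional * 1.3)  # 26

print(servers_conventional, servers_oversubscribed)
```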

Managing the data for training models brings its own set of problems; there's a lot of data, and it needs to be distributed across the nodes of those Azure supercomputers. Microsoft has been working on what it calls Storage Accelerator to manage this data, distributing it across clusters with a cache that determines whether required data is available locally or needs to be fetched, using available bandwidth to avoid interfering with current operations. Using parallel reads to load data allows large amounts of training data to be loaded almost twice as fast as traditional file loads.
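
In outline, that read path amounts to "check the local cache, otherwise fetch in parallel." A minimal sketch; the function names, cache location, and chunking scheme are illustrative assumptions, not Storage Accelerator's actual interface:

```python
import concurrent.futures
import os

CACHE_DIR = "/local/cache"  # hypothetical node-local cache location

def cached_path(name: str) -> str | None:
    """Return the local copy of a training shard, if the cache holds it."""
    path = os.path.join(CACHE_DIR, name)
    return path if os.path.exists(path) else None

def load_shard(name: str, remote_fetch, chunks: int = 8) -> bytes:
    """Serve from the local cache when possible; otherwise fetch the
    shard from remote storage as several parallel range reads."""
    local = cached_path(name)
    if local:
        with open(local, "rb") as f:
            return f.read()
    with concurrent.futures.ThreadPoolExecutor(max_workers=chunks) as pool:
        parts = list(pool.map(lambda i: remote_fetch(name, i, chunks), range(chunks)))
    return b"".join(parts)
```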

AI needs high-bandwidth networks

Compute and storage are important, but networking remains crucial, especially with massive data-parallel workloads running across many hundreds of GPUs. Here, Microsoft has invested significantly in high-bandwidth InfiniBand connections, with 1.2TBps of internal connectivity in its servers, linking 8 GPUs, and 400Gbps between individual GPUs in separate servers.
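
Those figures matter because data-parallel training has to exchange gradients between GPUs at every step. A rough sketch of the scale involved, assuming FP16 gradients, a 70B-parameter model, and a naive one-shot transfer (real all-reduce algorithms move less data per link, so treat these as upper bounds):

```python
params = 70e9             # a Llama-3-70B-sized model (assumption)
payload = params * 2      # FP16 gradients, 2 bytes each: ~140 GB

intra_server = 1.2e12     # 1.2 TBps shared by the 8 GPUs in a server
inter_server = 400e9 / 8  # 400 Gbps is about 50 GB/s between servers

# Time to move one full gradient copy over each link type:
print(f"intra-server: {payload / intra_server:.2f} s")  # ~0.12 s
print(f"inter-server: {payload / inter_server:.1f} s")  # ~2.8 s
```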

Microsoft has invested a lot in InfiniBand, both for its OpenAI training supercomputers and for its customer offerings. Interestingly, Russinovich noted that "really, the only difference between the supercomputers we build for OpenAI and what we make available publicly is the size of the InfiniBand domain. In the case of OpenAI, the InfiniBand domain covers the whole supercomputer, which is tens of thousands of servers." For other customers who don't have the same training demands, the domains are smaller but still at supercomputer scale, "1,000 to 2,000 servers in size, connecting 10,000 to 20,000 GPUs."

All that networking infrastructure requires some surprisingly low-tech solutions, such as 3D-printed sleds used to pull large amounts of cable efficiently. They're placed in the cable trays above the server racks and pulled along. It's a simple way to cut cabling times significantly, a necessity when you're building 30 supercomputers every six months.

Making AI reliable: Project Forge and One Pool

Hardware is only part of the Azure supercomputer story. The software stack provides the underlying platform orchestration and support tools. This is where Project Forge comes in. You can think of it as an equivalent to something like Kubernetes, a way of scheduling operations across a distributed infrastructure while providing essential resource management and spreading loads across different types of AI compute.

The Project Forge scheduler treats all the available AI accelerators in Azure as a single pool of virtual GPU capacity, something Microsoft calls One Pool. Loads have priority levels that control access to these virtual GPUs. A higher-priority load can evict a lower-priority one, moving it to a different class of accelerator or to another region altogether. The aim is to provide a consistent level of utilization across the whole Azure AI platform so Microsoft can better plan and manage its power and networking budget.
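
A toy version of that priority-with-eviction policy is sketched below. It shows the general technique only; Project Forge's internals aren't public, and the class and job names are invented for illustration:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int              # lower number = lower priority
    name: str = field(compare=False)

class OnePoolToy:
    """Toy single-pool scheduler: admit jobs until capacity runs out,
    then evict the lowest-priority running job for a higher one."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.running: list[Job] = []  # min-heap keyed on priority

    def submit(self, job: Job) -> Job | None:
        """Run the job; return any job evicted to make room."""
        if len(self.running) < self.capacity:
            heapq.heappush(self.running, job)
            return None
        if self.running[0].priority < job.priority:
            evicted = heapq.heapreplace(self.running, job)
            return evicted  # would be requeued on another pool or region
        return job          # no room: the new job itself has to wait

pool = OnePoolToy(capacity=2)
pool.submit(Job(1, "batch-eval"))
pool.submit(Job(2, "fine-tune"))
print(pool.submit(Job(5, "prod-training")))  # evicts the batch-eval job
```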

Like Kubernetes, Project Forge is designed to help run a more resilient service, detecting failures, restarting jobs, and repairing the host platform. By automating these processes, Azure can avoid having to restart expensive and complex jobs, treating them instead as a set of batches that can run separately, orchestrating inputs and outputs as needed.
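
The underlying pattern, persisting progress after each batch so a failure only costs the work in flight, can be sketched in a few lines. The file name and helper functions here are hypothetical, not Project Forge's API:

```python
import json
import os

STATE_FILE = "train_state.json"  # hypothetical checkpoint location

def load_state() -> dict:
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"next_batch": 0}

def run_training(batches: int, train_one_batch) -> None:
    """Run batch-by-batch; after a crash, a restart resumes where it left off."""
    state = load_state()
    for i in range(state["next_batch"], batches):
        train_one_batch(i)           # may raise on hardware failure
        state["next_batch"] = i + 1  # persist progress after each batch
        with open(STATE_FILE, "w") as f:
            json.dump(state, f)
```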

Consistency and security: ready for AI applications

Once an AI model has been built, it needs to be used. Again, Azure needs a way of balancing usage across different types of models and different prompts within those models. With no orchestration (or lazy orchestration), it's easy to get into a position where one prompt ends up blocking other operations. By taking advantage of its virtual, fractional GPUs, Azure's Project Flywheel can guarantee performance, interleaving operations from multiple prompts across virtual GPUs, allowing consistent operation on the host physical GPU while still providing constant throughput.
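
The effect of interleaving can be illustrated with a simple round-robin sketch, in which each prompt gets a small slice of generation per turn so a long prompt can't starve a short one. This is illustrative only; Project Flywheel's actual scheduling isn't public:

```python
from collections import deque

def interleave(prompts: dict[str, int], slice_tokens: int = 4):
    """Round-robin token generation: each prompt gets a small slice of
    GPU time per turn, so long prompts can't block short ones."""
    queue = deque(prompts.items())  # (prompt_id, tokens_remaining)
    while queue:
        pid, remaining = queue.popleft()
        step = min(slice_tokens, remaining)
        yield pid, step             # generate `step` tokens for this prompt
        if remaining - step > 0:
            queue.append((pid, remaining - step))

for pid, n in interleave({"long": 10, "short": 3}):
    print(pid, n)
# long 4 / short 3 / long 4 / long 2: the short prompt finishes early
```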

Another low-level optimization is support for confidential computing when training custom models. You can run code and host data in trusted execution environments. Azure is now able to offer fully confidential VMs, including GPUs, with encrypted messages between CPU and GPU trusted environments. You can use this for training or for securing private data used for retrieval-augmented generation.

From Russinovich's presentation, it's clear that Microsoft is investing heavily in making its AI infrastructure efficient and responsive for both training and inference. The Azure infrastructure and platform teams have put a lot of work into building out hardware and software that can support training the largest models, while providing a secure and reliable place to use AI in your applications.

Running OpenAI on Azure has given these teams a lot of experience, and it's good to see that experience paying off in providing the same tools and techniques for the rest of us, even if we don't need our own TOP500 supercomputers.

Copyright © 2024 IDG Communications, Inc.
