Agentic AI scaling requires new memory architecture

Last updated: January 7, 2026 7:08 pm
Published January 7, 2026

Agentic AI represents a definitive evolution from stateless chatbots towards complex workflows, and scaling it requires a new memory architecture.

As foundation models scale towards trillions of parameters and context windows reach millions of tokens, the computational cost of remembering history is growing faster than the ability to process it.

Organisations deploying these systems now face a bottleneck where the sheer volume of “long-term memory” (technically known as the Key-Value (KV) cache) overwhelms current hardware architectures.

Current infrastructure forces a binary choice: store inference context in scarce, high-bandwidth GPU memory (HBM) or relegate it to slow, general-purpose storage. The former is prohibitively expensive for large contexts; the latter creates latency that renders real-time agentic interactions unviable.

To address this widening disparity that is holding back the scaling of agentic AI, NVIDIA has launched the Inference Context Memory Storage (ICMS) platform within its Rubin architecture, proposing a new storage tier designed specifically to handle the ephemeral and high-velocity nature of AI memory.

“AI is revolutionising the entire computing stack – and now, storage,” said NVIDIA CEO Jensen Huang. “AI is no longer about one-shot chatbots but intelligent collaborators that understand the physical world, reason over long horizons, stay grounded in facts, use tools to do real work, and retain both short- and long-term memory.”

The operational challenge lies in the specific behaviour of transformer-based models. To avoid recomputing an entire conversation history for every new word generated, models store previous states in the KV cache. In agentic workflows, this cache acts as persistent memory across tools and sessions, growing linearly with sequence length.
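
The scale this implies is easy to estimate. Below is a minimal back-of-the-envelope sketch; the model dimensions are illustrative assumptions for a 70B-class model, not figures from NVIDIA:

```python
# Back-of-the-envelope KV cache sizing for a hypothetical transformer.
# All model dimensions below are illustrative assumptions.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Bytes to cache keys and values for one sequence; the factor of 2
    covers storing both K and V at every layer. Growth is linear in seq_len."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed 70B-class model: 80 layers, 8 KV heads, head_dim 128, FP16 (2 bytes),
# holding a million-token agent session:
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                      seq_len=1_000_000)
print(f"~{size / 1e9:.0f} GB per session")  # ~328 GB, far beyond one GPU's HBM
```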

This creates a distinct data class. Unlike financial records or customer logs, the KV cache is derived data; it is essential for immediate performance but does not require the heavy durability guarantees of enterprise file systems. General-purpose storage stacks, running on standard CPUs, expend energy on metadata management and replication that agentic workloads do not require.


The current hierarchy, spanning from GPU HBM (G1) to shared storage (G4), is becoming inefficient:

(Image: the G1–G4 memory and storage hierarchy. Credit: NVIDIA)

As context spills from the GPU (G1) to system RAM (G2) and eventually to shared storage (G4), efficiency plummets. Moving active context to the G4 tier introduces millisecond-level latency and increases the power cost per token, leaving expensive GPUs idle while they wait for data.
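
A toy model makes the idle-GPU effect concrete by comparing effective decode throughput when every token stalls on a context fetch from a given tier. All latency figures here are illustrative assumptions, not published specifications:

```python
# Toy throughput model: decode stalls while context is fetched from a tier.
# All latencies below are illustrative assumptions.

TIER_FETCH_LATENCY_S = {
    "G1 (HBM)":      0.0,     # context already resident on the GPU
    "G2 (host RAM)": 0.0002,  # assumed ~200 microseconds over the host link
    "G4 (storage)":  0.005,   # assumed millisecond-level shared storage
}

DECODE_STEP_S = 0.01  # assumed GPU compute time per token

def effective_tps(fetch_latency_s: float) -> float:
    """Tokens per second if each decode step waits on one context fetch."""
    return 1.0 / (DECODE_STEP_S + fetch_latency_s)

for tier, latency in TIER_FETCH_LATENCY_S.items():
    print(f"{tier}: {effective_tps(latency):5.1f} tokens/s")
# The 5 ms G4 stall cuts throughput by a third while the GPU sits idle.
```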

For the enterprise, this manifests as a bloated Total Cost of Ownership (TCO), where power is wasted on infrastructure overhead rather than on active reasoning.

A new memory tier for the AI factory

The industry response involves inserting a purpose-built layer into this hierarchy. The ICMS platform establishes a “G3.5” tier: an Ethernet-attached flash layer designed explicitly for gigascale inference.

This approach integrates storage directly into the compute pod. By utilising the NVIDIA BlueField-4 data processor, the platform offloads management of this context data from the host CPU. The system provides petabytes of shared capacity per pod, boosting the scaling of agentic AI by allowing agents to retain vast amounts of history without occupying expensive HBM.

The operational benefit is quantifiable in throughput and energy. By keeping relevant context in this intermediate tier – faster than standard storage, but cheaper than HBM – the system can “prestage” memory back to the GPU before it is needed. This reduces the idle time of the GPU decoder, enabling up to 5x higher tokens-per-second (TPS) for long-context workloads.
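
The prestaging pattern itself is straightforward to sketch: start fetching the next KV block from the flash tier while the GPU is still decoding with the current one. Everything below (fetch_block, decode_step, the timings) is hypothetical, illustrating the overlap pattern rather than NVIDIA's actual interface:

```python
# Minimal sketch of KV-block prestaging from a flash (G3.5) tier.
import asyncio

async def fetch_block(block_id: str) -> bytes:
    await asyncio.sleep(0.001)  # stand-in for a ~1 ms network/flash read
    return b"kv-bytes-for-" + block_id.encode()

async def decode_step(block: bytes) -> None:
    await asyncio.sleep(0.002)  # stand-in for per-block GPU decode time

async def decode_with_prestage(block_ids: list) -> None:
    # Kick off the fetch for block N+1 before decoding block N, so the
    # fetch overlaps with compute and the decoder never stalls on storage.
    next_fetch = asyncio.create_task(fetch_block(block_ids[0]))
    for i in range(len(block_ids)):
        block = await next_fetch
        if i + 1 < len(block_ids):
            next_fetch = asyncio.create_task(fetch_block(block_ids[i + 1]))
        await decode_step(block)

asyncio.run(decode_with_prestage([f"blk-{i}" for i in range(4)]))
```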

From an energy perspective, the implications are equally measurable. Because the architecture removes the overhead of general-purpose storage protocols, it delivers 5x better power efficiency than traditional approaches.


Integrating the data plane

Implementing this architecture requires a change in how IT teams view storage networking. The ICMS platform relies on NVIDIA Spectrum-X Ethernet to provide the high-bandwidth, low-jitter connectivity required to treat flash storage almost as if it were local memory.

For enterprise infrastructure teams, the integration point is the orchestration layer. Frameworks such as NVIDIA Dynamo and the Inference Transfer Library (NIXL) manage the movement of KV blocks between tiers.

These tools coordinate with the storage layer to ensure that the right context is loaded into GPU memory (G1) or host memory (G2) exactly when the AI model requires it. The NVIDIA DOCA framework further supports this by providing a KV communication layer that treats the context cache as a first-class resource.
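
NVIDIA has not published these interfaces in full detail, but the placement logic can be pictured as a two-tier cache: hot KV blocks stay in HBM, cold ones are demoted to the shared flash pool and promoted back on demand. A minimal sketch, in which the class and its LRU eviction policy are assumptions rather than the Dynamo/NIXL API:

```python
# Conceptual two-tier KV-block cache: a small, fast HBM tier backed by a
# large shared flash pool. Illustrative only; not the Dynamo/NIXL API.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hbm_capacity: int):
        self.hbm = OrderedDict()  # G1: small and fast, LRU-evicted
        self.flash = {}           # G3.5: large shared context pool
        self.hbm_capacity = hbm_capacity

    def put(self, block_id: str, block: bytes) -> None:
        self.hbm[block_id] = block
        self.hbm.move_to_end(block_id)         # mark as most recently used
        while len(self.hbm) > self.hbm_capacity:
            old_id, old_block = self.hbm.popitem(last=False)
            self.flash[old_id] = old_block     # demote cold context to flash

    def get(self, block_id: str) -> bytes:
        if block_id in self.hbm:               # hit in GPU memory
            self.hbm.move_to_end(block_id)
            return self.hbm[block_id]
        block = self.flash.pop(block_id)       # promote back before decode
        self.put(block_id, block)
        return block
```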

Major storage vendors are already aligning with this architecture. Companies including AIC, Cloudian, DDN, Dell Technologies, HPE, Hitachi Vantara, IBM, Nutanix, Pure Storage, Supermicro, VAST Data, and WEKA are building platforms with BlueField-4. These solutions are expected to be available in the second half of this year.

Redefining infrastructure for scaling agentic AI

Adopting a dedicated context memory tier impacts capacity planning and datacentre design.

  • Reclassifying data: CIOs must recognise the KV cache as a new data type. It is “ephemeral but latency-sensitive,” distinct from “durable and cold” compliance data. The G3.5 tier handles the former, allowing durable G4 storage to focus on long-term logs and artifacts.
  • Orchestration maturity: Success depends on software that can intelligently place workloads. The system uses topology-aware orchestration (via NVIDIA Grove) to place jobs near their cached context, minimising data movement across the fabric; a simplified sketch follows this list.
  • Power density: By fitting more usable capacity into the same rack footprint, organisations can extend the life of existing facilities. However, this increases compute density per square metre, requiring adequate cooling and power distribution planning.
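
The topology-aware placement idea reduces to a simple preference: run a job where its context already lives. An illustrative sketch, not the actual NVIDIA Grove scheduler:

```python
# Illustrative topology-aware placement: prefer the pod already holding a
# job's cached context ("data gravity"), else fall back to the least-loaded pod.
from typing import Optional

def place_job(context_pod: Optional[str], pod_load: dict) -> str:
    if context_pod is not None and pod_load[context_pod] < 0.9:
        return context_pod                      # run next to the KV cache
    return min(pod_load, key=pod_load.get)      # least-loaded fallback

# The context lives on pod-a, so the job lands there even though pod-b
# is currently less loaded:
print(place_job("pod-a", {"pod-a": 0.5, "pod-b": 0.2}))  # -> pod-a
```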

The transition to agentic AI forces a physical reconfiguration of the datacentre. The prevailing model of separating compute entirely from slow, persistent storage is incompatible with the real-time retrieval needs of agents with photographic memories.

By introducing a specialised context tier, enterprises can decouple the growth of model memory from the cost of GPU HBM. This architecture for agentic AI lets multiple agents share a vast, low-power memory pool, reducing the cost of serving complex queries and boosting scaling through high-throughput reasoning.

As organisations plan their next cycle of infrastructure investment, evaluating the efficiency of the memory hierarchy will be as vital as selecting the GPU itself.

See also: 2025’s AI chip wars: What enterprise leaders learned about supply chain reality


Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is part of TechEx and is co-located with other leading technology events. Click here for more information.

AI News is powered by TechForge Media. Explore other upcoming enterprise technology events and webinars here.
