Storage is an overlooked aspect of AI that has been overshadowed by all of the emphasis on processors, particularly GPUs. Large language models (LLMs) measure in the terabytes, and all of that data must be moved around to be processed. So the faster you can move data, the better, so that the GPUs aren't sitting around waiting for data to be fed to them.
Nvidia says it has tested these Spectrum-4 features with its Israel-1 AI supercomputer. The testing process measured the read and write bandwidth generated by Nvidia HGX H100 GPU server clients accessing the storage, first with the network configured as a standard RoCE v2 fabric, and then with the adaptive routing and congestion control from Spectrum-X turned on, Nvidia stated.
Tests were run using a range of GPU servers as clients, from 40 to 800 GPUs. In every case, the enhanced Spectrum-X networking performed better than the standard version, with read bandwidth improving from 20% to 48% and write bandwidth improving from 9% to 41% over standard RoCE networking, according to Nvidia.
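Nvidia has not published its benchmark harness, but the kind of measurement described, timing how fast a client can pull data off network-attached storage, can be illustrated with a simple timed-read loop. The sketch below is a minimal single-client illustration only; the file path is a hypothetical placeholder, and real tests like Nvidia's run many GPU-server clients in parallel with direct I/O rather than a buffered read loop.

```python
import time

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB reads, a typical large-transfer size

def measure_read_bandwidth(path: str) -> float:
    """Return sequential read bandwidth in GB/s for a single client.

    A minimal sketch: production storage benchmarks use many
    parallel clients and bypass the OS page cache, which this
    simple loop does not.
    """
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e9

# Hypothetical mount point for the network-attached storage under test.
bw = measure_read_bandwidth("/mnt/remote-storage/testfile.bin")
print(f"Read bandwidth: {bw:.2f} GB/s")
```

Running the same measurement twice, once over a standard RoCE v2 fabric and once with Spectrum-X's adaptive routing and congestion control enabled, is how the percentage improvements above would be derived.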
Another technique for improving efficiency is checkpointing, where the state of the processing job is saved periodically so that if the training run fails for any reason, it can be restarted from a saved checkpoint state rather than starting over from the beginning.
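As a rough illustration of the idea, here is what periodic checkpointing commonly looks like in a PyTorch training loop. This is a generic sketch, not code from Nvidia or its partners; the checkpoint path and the five-epoch interval are arbitrary placeholders.

```python
import torch

# Placeholder path; real jobs write checkpoints to shared network storage.
CHECKPOINT_PATH = "checkpoint.pt"

def save_checkpoint(model, optimizer, epoch):
    # Persist everything needed to resume: weights, optimizer state, progress.
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, CHECKPOINT_PATH)

def load_checkpoint(model, optimizer):
    # Resume from the last saved state instead of restarting from epoch 0.
    ckpt = torch.load(CHECKPOINT_PATH)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1  # next epoch to run

# In the training loop, checkpointing every few epochs bounds how much
# work is lost to a failure:
# for epoch in range(start_epoch, num_epochs):
#     train_one_epoch(model, optimizer)
#     if epoch % 5 == 0:
#         save_checkpoint(model, optimizer, epoch)
```

The larger the model, the larger each checkpoint write, which is why checkpoint traffic is itself one of the storage workloads that fast networking is meant to absorb.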
Storage vendors DDN, VAST Data, and WEKA are partnering with Nvidia to integrate and optimize their solutions for Spectrum-X.