Because the GPUs are working in parallel, the implications are totally different than in a classical network.
“If we were on video and an error burst happens, TCP/IP does a pretty good job of bridging that and retransmitting,” Gartner said. “But in AI infrastructure, because the GPUs are working in parallel, it’s very sensitive to issues that can occur on one link. All these GPUs are exchanging information and synchronized, and so basically, you have to sort of stop the workload and back up to a checkpoint and restart the workload. And that can result in a 40% reduction in the performance of the cluster when you have these link errors occurring.”
“It really suggests to our customers that they need to be focused much more on reliability of the optic,” Gartner said.
Reliability testing reveals weaknesses
Cisco previously performed a reliability test for which it acquired 20 different optics from different suppliers, Gartner recalled. “These were 100G and 400G optics at the time,” and all were compliant with industry standards, and yet “none of those optics passed our stress test,” he said.
Cisco’s testing environments vary different conditions, such as the temperature or humidity level, the voltage level that the optic is seeing on the host, or the skew between the signals coming from the host. “We do all of those things in different combinations,” Gartner said.
While optics might technically comply with industry standards, “what we know is that if they were put into a stressful environment … they wouldn’t perform,” he said, “and so that’s the thing that we’re trying to raise awareness of for our customers.”
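To give a sense of what testing conditions “in different combinations” can look like in practice, here is a minimal Python sketch that builds a full test matrix from four stress dimensions (temperature, humidity, host voltage, and host signal skew). It is an illustration only: the condition names, the levels, and the `stress_matrix` helper are hypothetical and do not come from Cisco’s actual test plan.

```python
from itertools import product

# Hypothetical stress levels for illustration only.
# None of these values are taken from Cisco's test plan.
CONDITIONS = {
    "temperature_c": [0, 25, 70],
    "humidity_pct": [20, 85],
    "host_voltage_v": [3.135, 3.3, 3.465],
    "host_skew_ps": [0, 50],
}


def stress_matrix(conditions):
    """Return every combination of condition levels as a list of dicts."""
    names = list(conditions)
    return [
        dict(zip(names, levels))
        for levels in product(*conditions.values())
    ]


if __name__ == "__main__":
    matrix = stress_matrix(CONDITIONS)
    # 3 temperatures x 2 humidity levels x 3 voltages x 2 skews
    print(len(matrix), "test cases")
    print("first case:", matrix[0])
```

Even with only a few levels per dimension, the number of cases multiplies quickly, which is one reason an optic can meet each specification limit individually yet still fail when several conditions are stressed at once.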
