Datadog has launched GPU Monitoring, now available to customers globally. The product is designed to address the challenges organisations face in managing rising AI-related costs.
“GPU instances account for 14 percent of compute costs, which is a big challenge as companies are struggling to build AI-first technology in scalable and practical ways. While these companies can see their costs climbing, they can’t charge back GPU spend across business units, see workload context or identify clear next steps for improvement. As a result, it is very challenging to budget and plan in thoughtful ways,” said Yanbing Li, Chief Product Officer at Datadog.
The launch comes as companies seek more effective ways to manage GPU spending linked to AI workloads. Many organisations have difficulty allocating GPU costs across business units, and limited workload context can make budgeting and planning more complex.
GPU Monitoring aims to provide a unified view across AI infrastructure, linking GPU fleet health, cost, and performance to the teams using these resources. This supports faster troubleshooting of slow workloads and is intended to improve cost visibility.
As AI deployments scale, managing compute resources increasingly involves broader organisational planning, particularly where capacity is misallocated or where training and inference workloads are affected by cost or performance constraints. Many organisations currently work with fragmented visibility into GPU usage. GPU Monitoring is intended to consolidate this view.
Existing GPU monitoring tools often provide basic hardware health metrics but may not reveal cross-team resource contention, explain why workloads fail, or identify underused devices. This can slow investigations and lead to overprovisioning as a precaution, contributing to higher resource usage.
By connecting GPU fleet telemetry with workload data, GPU Monitoring provides a shared view for platform engineering and machine learning teams.
- Scale AI without overspending: Utilisation insights help guide capacity planning, support decisions on new GPU purchases versus reallocation, and improve cost predictability.
- Accelerate AI delivery: Linking performance issues to specific GPUs and processes helps identify bottlenecks more quickly.
- Avoid costly disruptions: Early detection of unhealthy GPUs can help reduce the risk of broader system failures.
- Maximise ROI on GPU spend: Visibility into utilisation allows teams to identify underused or overprovisioned resources and adjust allocation accordingly.
Overall, GPU Monitoring is positioned as a tool to improve visibility and resource management for AI workloads across organisations.
