GPU Observability

Overview

Observability provides live and historical metrics for your GPU clusters and nodes so you can monitor performance and troubleshoot workloads. Use the dashboard to view GPU metrics (utilization, power, temperature, DRAM, SM clocks), CPU metrics, memory usage, and time-series charts that can be zoomed or maximized for deeper inspection.

Access Observability

In the left navigation, open More → Observability.
On the Observability landing page, click View Observability on the GPU Clusters card.

The Observability dashboard opens and shows a set of metric panels and controls for selecting clusters, nodes, timeframes and refresh options.

Explore the dashboard

Once Observability is open, use these controls to get the data you need:

Control	What it does
`Select Cluster`	Choose a cluster ID from the dropdown. The metrics on the page reflect the selected cluster.
`Select Node`	Narrow metrics to a specific node within the cluster (if available).
Time range buttons (`6h`, `12h`, `24h`, `Custom`)	Adjust the time window shown by the charts. Use `Custom` for arbitrary ranges.
`Auto refresh` / refresh interval	Choose how frequently the dashboard refreshes the data.
`Aggregation` / data granularity	Select the aggregation interval (rollup / sampling) for the charts to control granularity vs. query volume.
Panel maximize	Click the expand / maximize control on any metric panel to focus on that single chart.
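
To make the `Aggregation` control more concrete, the following is a minimal sketch of what a rollup does conceptually: raw samples are grouped into fixed-width time buckets and each bucket is reduced to one value. The function name `rollup`, the sample data, and the choice of the mean as the reduction are assumptions for illustration; the dashboard's actual rollup method is not described here.

```python
from collections import defaultdict
from statistics import mean


def rollup(samples, interval_s):
    """Average (timestamp_s, value) samples into fixed-width buckets.

    Returns a list of (bucket_start_s, mean_value) sorted by time.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each timestamp to the start of its bucket.
        bucket_start = (ts // interval_s) * interval_s
        buckets[bucket_start].append(value)
    return [(start, mean(vals)) for start, vals in sorted(buckets.items())]


# Hypothetical GPU utilization samples (%), one every 15 seconds.
samples = [(0, 20), (15, 100), (30, 40), (45, 80), (60, 10), (75, 30)]

thirty_seconds = rollup(samples, 30)  # finer granularity, more points
one_minute = rollup(samples, 60)      # coarser granularity, fewer points
```

A larger interval yields fewer, smoother points at the cost of hiding short spikes, which is the granularity versus query volume trade-off the control exposes.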

Provided metrics

The dashboard includes the following metric panels. Use the table below to quickly find what each panel measures and its typical units.

Metric panel	What it measures	Typical units
`GPU Utilization`	Percentage of GPU processing resources that are active (compute and memory activity on the GPU).	`%`
`GPU Power`	Power draw of the GPU.	`W` (watts)
`GPU Temperature`	GPU die temperature.	`°C`
`GPU SM Clocks`	Streaming multiprocessor (SM) clock frequency.	`MHz`
`GPU DRAM`	GPU memory utilization and bandwidth activity.	`MiB`
`CPU Usage (Current)`	Host CPU usage (user + system).	`%`
`Memory Usage (Current)`	Host memory usage.	`%`
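
If you have shell access to a node, one way to sanity-check the GPU panels is to read comparable values locally with `nvidia-smi`. The sketch below is an illustration only: it assumes NVIDIA GPUs with `nvidia-smi` on the `PATH`, and the mapping from these query fields to the dashboard panels is an assumption, not a statement about how the dashboard collects its data. The helper names `parse_row` and `read_local_gpus` are hypothetical.

```python
import subprocess

# Query fields accepted by `nvidia-smi --query-gpu`. Mapping them to the
# dashboard panels is an assumption made for this illustration.
FIELDS = ["utilization.gpu", "power.draw", "temperature.gpu", "clocks.sm", "memory.used"]


def to_number(text):
    """Convert a CSV cell to float, or None if the GPU does not report it."""
    try:
        return float(text)
    except ValueError:
        return None  # for example "[N/A]"


def parse_row(line):
    """Turn one CSV row (noheader, nounits) into a {field: value} dict."""
    cells = [cell.strip() for cell in line.split(",")]
    if len(cells) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(cells)}")
    return {name: to_number(cell) for name, cell in zip(FIELDS, cells)}


def read_local_gpus():
    """Query every GPU on this host; requires nvidia-smi to be installed."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_row(line) for line in result.stdout.splitlines() if line.strip()]


# Example row in the shape nvidia-smi prints (the values are made up):
# utilization %, power W, temperature °C, SM clock MHz, memory used MiB.
sample = parse_row("87, 251.30, 64, 1410, 30512")
```

Calling `read_local_gpus()` on a GPU node returns one dictionary per GPU, which you can compare against the panel values for the same node and time.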

Troubleshooting & tips

If panels show No Data Available:
- Verify you selected the correct cluster and node.
- Ensure the time window includes the period when the workload ran.
- Check that cluster agents / exporters are running on the nodes (for on-prem or self-hosted configurations).

If metrics appear sparse or too noisy:
- Increase the aggregation interval in the dashboard to reduce noise.
- Use a longer time window to observe trends rather than instantaneous spikes.
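
The time-window check above can be sketched as a simple containment test. This example assumes the preset buttons select a window that ends at the current time, which is not confirmed here; `presets_covering` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Preset buttons from the dashboard, assumed to mean "the last N hours".
PRESETS = {
    "6h": timedelta(hours=6),
    "12h": timedelta(hours=12),
    "24h": timedelta(hours=24),
}


def presets_covering(run_start, run_end, now):
    """Return preset labels whose window [now - span, now] contains the whole run."""
    return [
        label
        for label, span in PRESETS.items()
        if run_start >= now - span and run_end <= now
    ]


# Hypothetical workload that ran from 9 hours ago to 8 hours ago.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
run_start = now - timedelta(hours=9)
run_end = now - timedelta(hours=8)

covering = presets_covering(run_start, run_end, now)
```

If no preset covers the run (an empty list), `Custom` is the option to use.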

Permissions: Observability data visibility depends on your account permissions. If you cannot see a cluster, confirm you have access to that project and cluster.
