Overview
Observability provides live and historical metrics for your GPU clusters and nodes so you can monitor performance and troubleshoot workloads. Use the dashboard to view GPU metrics (utilization, power, temperature, DRAM, SM clocks), CPU metrics, memory usage, and time-series charts that can be zoomed or maximized for closer inspection.
Access Observability
- In the left navigation, open More → Observability.
- On the Observability landing page, click View Observability on the GPU Clusters card.
Explore the dashboard
Once Observability is open, use these controls to get the data you need:

| Control | What it does |
|---|---|
| Select Cluster | Choose a cluster ID from the dropdown. The metrics on the page reflect the selected cluster. |
| Select Node | Narrow metrics to a specific node within the cluster (if available). |
| Time range buttons (6h, 12h, 24h, Custom) | Adjust the time window shown by the charts. Use Custom for arbitrary ranges. |
| Auto refresh / refresh interval | Choose how frequently the dashboard refreshes the data. |
| Aggregation / data granularity | Select the aggregation interval (rollup / sampling) for the charts to trade granularity against query volume. |
| Panel maximize | Click the expand / maximize control on any metric panel to focus on that single chart. |
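
To illustrate what an aggregation interval does to a chart, the following minimal Python sketch averages raw samples into fixed-width time buckets. It is not the dashboard's actual implementation: the `rollup` function, the choice of a mean as the aggregate, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def rollup(samples, interval_s):
    """Average (timestamp_s, value) samples into fixed-width buckets.

    Hypothetical helper: returns a list of (bucket_start_s, mean_value)
    pairs sorted by bucket start time.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each timestamp to the start of its bucket.
        bucket_start = (ts // interval_s) * interval_s
        buckets[bucket_start].append(value)
    return [(start, mean(vals)) for start, vals in sorted(buckets.items())]


# Made-up GPU utilization samples (%), one every 15 seconds.
samples = [
    (0, 90.0), (15, 94.0), (30, 20.0), (45, 96.0),
    (60, 92.0), (75, 88.0), (90, 91.0), (105, 89.0),
]

fine = rollup(samples, 15)    # 8 points, the dip at t=30 is fully visible
coarse = rollup(samples, 60)  # 2 points, the dip is averaged into its bucket
print(coarse)                 # [(0, 75.0), (60, 90.0)]
```

A wider interval returns fewer points per chart and smooths short spikes, while a narrower interval preserves them at the cost of more data per query.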
Provided metrics
The dashboard includes the following metric panels. Use the table below to quickly find what each panel measures and its typical units.

| Metric panel | What it measures | Typical units |
|---|---|---|
| GPU Utilization | Percent of GPU processing resources active (compute + memory usage by the GPU). | % |
| GPU Power | Power draw of the GPU. | W (watts) |
| GPU Temperature | GPU die temperature. | °C |
| GPU SM Clocks | Streaming multiprocessor (SM) clock frequency. | MHz |
| GPU DRAM | GPU memory utilization and bandwidth activity. | MiB |
| CPU Usage (Current) | Host CPU usage (user + system). | % |
| Memory Usage (Current) | Host memory usage. | % |
Troubleshooting & tips
- If panels show No Data Available:
  - Verify you selected the correct cluster and node.
  - Ensure the time window includes the period when the workload ran.
  - Check that cluster agents / exporters are running on the nodes (for on-prem or self-hosted configurations).
- If metrics appear sparse or too noisy:
  - Increase the aggregation interval in the dashboard to reduce noise.
  - Use a longer time window to observe trends rather than instantaneous spikes.
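
The time-window check can be reasoned about with a small Python sketch that picks the shortest preset range still covering a workload's start time. It assumes each preset button (6h, 12h, 24h) shows a window that ends at the current time and looks back that many hours. This is an assumption about the dashboard's behavior, and `smallest_covering_range` is a hypothetical helper, not part of any product API.

```python
from datetime import datetime, timedelta, timezone

# Preset time range buttons, expressed as look-back hours (assumed semantics).
PRESETS_HOURS = {"6h": 6, "12h": 12, "24h": 24}


def smallest_covering_range(run_start, now):
    """Return the shortest preset whose window [now - N hours, now]
    contains run_start, or "Custom" if no preset reaches back far enough."""
    age = now - run_start
    if age < timedelta(0):
        raise ValueError("run_start is in the future")
    for label, hours in sorted(PRESETS_HOURS.items(), key=lambda kv: kv[1]):
        if age <= timedelta(hours=hours):
            return label
    return "Custom"


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
print(smallest_covering_range(now - timedelta(hours=8), now))   # 12h
print(smallest_covering_range(now - timedelta(hours=30), now))  # Custom
```

If the result is Custom, none of the preset buttons reaches back to the workload, so the panels would show no data for it until a custom range is set.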
Permissions: Observability data visibility depends on your account permissions. If you cannot see a cluster, confirm you have access to that project and cluster.