
Overview

Observability provides live and historical metrics for your GPU clusters and nodes so you can monitor performance and troubleshoot workloads. Use the dashboard to view GPU metrics (utilization, power, temperature, DRAM, SM clocks), CPU metrics, memory usage, and time-series charts that can be zoomed or maximized for deeper inspection.

Access Observability

  1. In the left navigation, open More → Observability.
  2. On the Observability landing page, click View Observability on the GPU Clusters card.
The Observability dashboard opens and shows a set of metric panels and controls for selecting clusters, nodes, timeframes and refresh options.

Explore the dashboard

Once Observability is open, use these controls to get the data you need:
Control | What it does
Select Cluster | Choose a cluster ID from the dropdown. The metrics on the page reflect the selected cluster.
Select Node | Narrow metrics to a specific node within the cluster (if available).
Time range buttons (6h, 12h, 24h, Custom) | Adjust the time window shown by the charts. Use Custom for arbitrary ranges.
Auto refresh / refresh interval | Choose how frequently the dashboard refreshes the data.
Aggregation / data granularity | Select the aggregation interval (rollup / sampling) for the charts to trade granularity against query volume.
Panel maximize | Click the expand / maximize control on any metric panel to focus on that single chart.
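To make the granularity versus query volume trade-off concrete, the sketch below estimates how many data points one chart series needs for a given time window and aggregation interval. The function name and the example intervals (15 s, 60 s, 300 s) are hypothetical and chosen only for illustration; the intervals the dashboard actually offers may differ.

```python
import math


def points_per_series(window_hours: float, interval_seconds: int) -> int:
    """Estimate the number of data points in one series.

    Hypothetical helper: it assumes one point per aggregation interval
    across the whole time window, which may not match how the dashboard
    queries its backend.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return math.ceil(window_hours * 3600 / interval_seconds)


# A 6h window at a 15 s interval needs as many points as a 24h window at 60 s.
print(points_per_series(6, 15))    # 1440
print(points_per_series(24, 60))   # 1440
print(points_per_series(24, 300))  # 288
```

Under these assumptions, widening the time window while keeping a fine interval multiplies the number of points per panel, which is why a coarser aggregation interval is usually paired with longer time ranges.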

Provided metrics

The dashboard includes the following metric panels. The table below lists what each panel measures and its typical units.
Metric panel | What it measures | Typical units
GPU Utilization | Percent of GPU processing resources active (compute + memory usage by the GPU). | %
GPU Power | Power draw of the GPU. | W (watts)
GPU Temperature | GPU die temperature. | °C
GPU SM Clocks | Streaming multiprocessor (SM) clock frequency. | MHz
GPU DRAM | GPU memory utilization and bandwidth activity. | MiB
CPU Usage (Current) | Host CPU usage (user + system). | %
Memory Usage (Current) | Host memory usage. | %

Troubleshooting & tips

  • If panels show No Data Available:
    • Verify you selected the correct cluster and node.
    • Ensure the time window includes the period when the workload ran.
    • Check that cluster agents / exporters are running on the nodes (for on-prem or self-hosted configurations).
  • If metrics appear sparse or too noisy:
    • Increase the aggregation interval in the dashboard to reduce noise.
    • Use a longer time window to observe trends rather than instantaneous spikes.
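The effect of a larger aggregation interval on noisy data can be sketched as follows. This is a minimal illustration that assumes the rollup is a simple mean over fixed-size time buckets; the aggregation function the dashboard actually uses is not documented here, and `rollup_mean` and the sample values are hypothetical.

```python
from statistics import mean


def rollup_mean(samples, interval_seconds):
    """Average (timestamp_seconds, value) samples into fixed-size buckets.

    Assumption: a mean rollup. The dashboard may aggregate differently
    (for example max or last value).
    """
    buckets = {}
    for ts, value in samples:
        # Align each sample to the start of its bucket.
        bucket_start = (ts // interval_seconds) * interval_seconds
        buckets.setdefault(bucket_start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Made-up GPU utilization samples taken every 15 s, swinging between extremes.
samples = [(0, 10.0), (15, 90.0), (30, 20.0), (45, 80.0), (60, 50.0)]
print(rollup_mean(samples, 60))  # [(0, 50.0), (60, 50.0)]
```

With a 60 s interval the swings between 10% and 90% collapse into a steady 50%, which makes trends easier to read but hides short spikes. Switch back to a finer interval when you need to inspect individual spikes.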
Permissions: Observability data visibility depends on your account permissions. If you cannot see a cluster, confirm you have access to that project and cluster.