This guide covers GPU debugging, burn tests, NCCL tests, and using the observability dashboard to monitor and troubleshoot your cluster.


Observability Dashboard

The Observability dashboard provides real-time visualization of cluster health metrics, letting you track resource utilization across specific nodes and timeframes to ensure optimal performance. Access it from the left sidebar under MORE > Observability.

Selecting Nodes to Monitor

  1. Select Cluster: Choose your cluster from the dropdown (e.g., k8s-7xdu304ghh-v1)
  2. Select Nodes: Choose one or more nodes to monitor (e.g., k8s-7xdu304ghh-v1-node-1)
  3. Set Time Range: Use the time range buttons to select the monitoring window:
    • Quick options: 6h, 12h, 24h
    • Or use Custom for a specific range

GPU Metrics (Time Series)

Metric             Description
GPU Utilisation    Percentage of time the GPU was actively executing CUDA kernels (0-100%)
GPU Power          Power consumption in watts
GPU Temperature    Temperature in °C
GPU SM Clocks      Streaming Multiprocessor clock frequency in MHz
GPU DRAM           GPU memory usage in MiB
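If you want to cross-check the dashboard numbers from a node itself, the same metrics can be sampled with nvidia-smi's CSV query mode. A minimal sketch, with the parsing shown on illustrative sample output since running the query requires a GPU node:

```python
import csv
import io

# On a node, this text would come from:
#   nvidia-smi --query-gpu=utilization.gpu,power.draw,temperature.gpu,clocks.sm,memory.used \
#              --format=csv,noheader,nounits
def parse_gpu_sample(text):
    """Parse one nvidia-smi CSV sample into a list of per-GPU dicts
    mirroring the dashboard's time-series metrics."""
    fields = ["util_pct", "power_w", "temp_c", "sm_clock_mhz", "dram_mib"]
    rows = []
    for row in csv.reader(io.StringIO(text)):
        values = [float(v.strip()) for v in row]
        rows.append(dict(zip(fields, values)))
    return rows

# Illustrative sample for a two-GPU node (values are made up).
sample = "98, 312.45, 67, 2100, 40532\n97, 309.10, 66, 2100, 40530\n"
for gpu in parse_gpu_sample(sample):
    print(gpu["util_pct"], gpu["temp_c"], gpu["sm_clock_mhz"])
```

Polling this in a loop (for example with `-l 5` on nvidia-smi) gives a node-local view you can compare against the dashboard's charts.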

CPU Metrics (Gauges)

Metric                    Description
CPU Usage (Current)       Real-time CPU utilization percentage
Memory Usage (Current)    Real-time system RAM consumption
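The memory gauge corresponds to what you would compute on the node from /proc. A minimal sketch, assuming a Linux node; the calculation (used = MemTotal - MemAvailable) follows the convention used by the `free` utility, and the /proc/meminfo excerpt below is illustrative:

```python
def memory_usage_pct(meminfo_text):
    """Compute used-memory percentage from /proc/meminfo content:
    used = MemTotal - MemAvailable (the 'free' convention)."""
    info = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest:
            info[key.strip()] = int(rest.split()[0])  # values are in kB
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total

# Illustrative /proc/meminfo excerpt (kB). On a node you would read
# the real file: open("/proc/meminfo").read()
sample = """MemTotal:       131072000 kB
MemFree:         16384000 kB
MemAvailable:    65536000 kB"""
print(round(memory_usage_pct(sample), 1))  # 50.0
```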

GPU Debugging

This section gives systematic workflows to identify performance bottlenecks, diagnose thermal throttling, and resolve common training failures by correlating resource metrics.

Steps to Debug on the Observability Screen

  1. Navigate to Observability
    • Click Observability in the left sidebar under MORE
  2. Select Your Target
    • Choose the cluster from Select Cluster dropdown
    • Select the specific node(s) from Select Nodes dropdown
  3. Set Appropriate Time Range
    • For recent issues: Use 6h, 12h, or 24h
    • For historical analysis: Use Custom
  4. Analyze GPU Utilisation
    • Look for patterns correlating with your task progress
    • Sustained ~100% indicates the GPU is consistently executing CUDA kernels
    • Periodic dips suggest idle time between kernel launches (data loading, synchronization)
  5. Check GPU Temperature and SM Clocks
    • If temperature approaches 80-85°C and SM Clocks drop, thermal throttling is occurring
    • Normal operation: SM Clocks stable around 2100 MHz or as per GPU specs
  6. Monitor GPU DRAM
    • Gradual increase without release may indicate memory leak
    • Sudden spikes near capacity risk OOM errors
  7. Correlate with CPU Metrics
    • High CPU Usage + Low GPU Utilisation = Data pipeline bottleneck (GPU waiting for data)
    • Low CPU Usage + Low GPU Utilisation = Task may be idle or blocked
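The correlation rules in step 7 can be sketched as a small triage function. The thresholds below are illustrative, not values from the dashboard; tune them for your workload:

```python
def classify_bottleneck(cpu_util_pct, gpu_util_pct):
    """Rough triage per step 7: correlate average CPU and GPU utilisation.
    Thresholds (70% CPU, 40% GPU) are illustrative assumptions."""
    high_cpu = cpu_util_pct >= 70
    low_gpu = gpu_util_pct <= 40
    if high_cpu and low_gpu:
        return "data pipeline bottleneck (GPU waiting for data)"
    if not high_cpu and low_gpu:
        return "task may be idle or blocked"
    return "no obvious CPU/GPU correlation issue"

print(classify_bottleneck(90, 20))  # data pipeline bottleneck (GPU waiting for data)
print(classify_bottleneck(10, 15))  # task may be idle or blocked
```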

Common Workload Patterns

What You See                   GPU Utilisation          Other Metrics                           Likely Cause
Training running smoothly      High (80-100%), steady   Stable SM Clocks, temperature < 80°C    Workload is GPU-bound as expected
Training slower than expected  Periodic dips to 0%      High CPU usage                          Data loading can't keep up with the GPU
Job crashes with OOM           Drops before crash       GPU DRAM near max                       Batch size too large for GPU memory
Training stalls or hangs       Drops to 0%, stays low   Low CPU, low GPU Power                  Deadlock, NCCL timeout, or I/O wait
Inconsistent training speed    Fluctuating              Variable SM Clocks                      Contention from other workloads on the node
Gradual slowdown over time     Decreasing trend         Rising GPU DRAM                         Memory leak in training code
Multi-GPU scaling issues       Low on some GPUs         Uneven GPU DRAM usage                   Imbalanced data distribution or model parallelism
Job fails after hours          Was normal, then drops   Temperature was elevated                Potential hardware instability

Burn Test

Burn tests are stress-testing protocols that push GPUs to their thermal and power limits to verify hardware stability before production deployment.

Overview

Burn tests stress-test GPUs to verify stability under sustained high load. This helps identify hardware issues before deploying production workloads.

Steps to Perform Burn Test

  1. SSH into the Node
    ssh <node-address>
    
  2. Install gpu-burn
    git clone https://github.com/wilicc/gpu-burn.git
    cd gpu-burn
    make
    
  3. Run the Burn Test
    # Run for 60 seconds (adjust duration as needed)
    ./gpu_burn 60
    
  4. Monitor on the Observability Dashboard
    Open the Observability dashboard and watch for:

    Metric             Expected During Burn Test
    GPU Utilisation    ~100% sustained (continuous CUDA kernel execution)
    GPU Power          Near TDP (maximum rated power)
    GPU Temperature    Rising, but should stay < 85°C
    GPU SM Clocks      Stable at boost frequency (~2100 MHz, or per GPU specs)
    GPU DRAM           Elevated but stable
  5. Interpret Results
    • PASS: All metrics stable, no SM Clock drops, temperature under control
    • FAIL: SM Clocks dropping (throttling), temperature exceeding limits, or computation errors in terminal
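The pass/fail decision in step 5 can be automated by sampling temperature and SM clocks while ./gpu_burn runs. A hedged sketch: the thresholds mirror the table above, the samples would in practice come from polling nvidia-smi during the burn, and the sample data here is illustrative:

```python
def burn_test_verdict(samples, max_temp_c=85, min_clock_mhz=1900):
    """samples: list of (temperature_c, sm_clock_mhz) taken during the burn.
    FAIL if the GPU overheats or SM clocks drop (throttling).
    min_clock_mhz is an assumed floor below the ~2100 MHz boost clock."""
    for temp, clock in samples:
        if temp >= max_temp_c:
            return "FAIL: temperature exceeded limit"
        if clock < min_clock_mhz:
            return "FAIL: SM clocks dropped (throttling)"
    return "PASS: metrics stable"

# Illustrative runs: one healthy, one that throttles near the end.
healthy = [(65, 2100), (72, 2095), (78, 2100)]
throttling = [(66, 2100), (84, 2100), (84, 1500)]
print(burn_test_verdict(healthy))     # PASS: metrics stable
print(burn_test_verdict(throttling))  # FAIL: SM clocks dropped (throttling)
```

Note this only covers the dashboard-visible failure modes; computation errors reported in the gpu_burn terminal output are still a FAIL even if the metrics look stable.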

What to Capture

Record the Observability dashboard GPU metrics for the full duration of the test.

NCCL Test

These benchmarks validate inter-GPU communication bandwidth and latency (NVLink/PCIe), which are essential for distributed training performance.

Overview

NCCL (NVIDIA Collective Communications Library) tests verify GPU-to-GPU communication performance, which is essential for distributed training workloads.

Steps to Perform NCCL Test

  1. SSH into the Node
    ssh <node-address>
    
  2. Install nccl-tests
    git clone https://github.com/NVIDIA/nccl-tests.git
    cd nccl-tests
    make
    
  3. Run All-Reduce Test
    # Test with all available GPUs
    ./build/all_reduce_perf -b 8 -e 256M -f 2 -g <num_gpus>
    
  4. Monitor on Observability Dashboard
    Metric             Expected During NCCL Test
    GPU Utilisation    Periodic bursts during collective operations
    GPU DRAM           Active memory transfers visible
    CPU Usage          Low (communication is GPU-to-GPU)
  5. Analyze Terminal Output
    The test reports bandwidth numbers; compare the measured bus bandwidth against expected values for your interconnect:

    Interconnect    Expected Bandwidth
    NVLink 3.0      ~600 GB/s
    NVLink 4.0      ~900 GB/s
    PCIe 4.0        ~25 GB/s
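The terminal output can be turned into a pass/fail check against the table above by parsing the bus-bandwidth column of the nccl-tests report. A sketch, assuming the usual nccl-tests column layout (size, count, type, redop, root, then time/algbw/busbw for the out-of-place pass); verify the field positions against the header line printed by your version, and note the sample rows and threshold below are illustrative:

```python
def max_busbw(report_lines):
    """Extract the peak out-of-place busbw (GB/s) from nccl-tests output.
    Assumes busbw is the 8th whitespace-separated field on data rows."""
    best = 0.0
    for line in report_lines:
        parts = line.split()
        if len(parts) < 8 or not parts[0].isdigit():
            continue  # skip headers, comments, and summary lines
        try:
            best = max(best, float(parts[7]))
        except ValueError:
            continue
    return best

# Illustrative all_reduce_perf rows (values are made up).
sample = [
    "#       size  count  type  redop  root  time  algbw  busbw",
    "   134217728  33554432  float  sum  -1  1810.2  74.15  139.03",
    "   268435456  67108864  float  sum  -1  3590.8  74.76  140.18",
]
peak = max_busbw(sample)
print(peak)  # 140.18
threshold = 0.8 * 140  # e.g. accept >= 80% of the bandwidth you expect
print("PASS" if peak >= threshold else "FAIL")  # PASS
```

A common practice is to fail the node if peak bus bandwidth falls well below the expected value for its interconnect, since that often indicates a degraded NVLink or PCIe path.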