
DS01 Monitoring Operations Guide

Practical reference for day-to-day monitoring tasks on the DS01 observability stack.

Stack: Prometheus · Grafana · Alertmanager · cAdvisor · DCGM Exporter · node-exporter · ds01-exporter


1. Accessing Grafana

Grafana is localhost-only. Access via SSH tunnel:

ssh -L 3000:localhost:3000 datasciencelab@ds01

Then open: http://localhost:3000

Login: admin / ds01admin (override with GRAFANA_ADMIN_USER / GRAFANA_ADMIN_PASSWORD env vars in monitoring/.env)

Prometheus UI: http://localhost:9090 (same SSH tunnel session, or add -L 9090:localhost:9090)

Alertmanager UI: http://localhost:9093 (add -L 9093:localhost:9093 to reach it through the tunnel)
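
If all three UIs are needed at once, a single SSH session can forward every port. The sketch below only assembles and prints the command so it can be inspected first; the function name ds01_tunnel_cmd is hypothetical, while the ports and the default login are the ones listed above.

```shell
# Print an ssh command that forwards Grafana (3000), Prometheus (9090)
# and Alertmanager (9093) in one session.
# ds01_tunnel_cmd is a hypothetical helper name, not part of DS01.
ds01_tunnel_cmd() {
    target="${1:-datasciencelab@ds01}"
    # -N: forward ports only, do not start a remote shell
    cmd="ssh -N"
    for port in 3000 9090 9093; do
        cmd="$cmd -L $port:localhost:$port"
    done
    printf '%s %s\n' "$cmd" "$target"
}

# Usage:
#   ds01_tunnel_cmd            # show the command
#   eval "$(ds01_tunnel_cmd)"  # open the tunnel (Ctrl-C to close)
```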


2. Dashboard Guide

Four dashboards are provisioned automatically from monitoring/grafana/provisioning/dashboards/dashboards/.

DS01 Overview (ds01-overview)

Real-time operational view. Use this for day-to-day monitoring.

| Row | Panels | Purpose |
|---|---|---|
| Status at a Glance | GPU slots used/free, active users, containers, system GPU util %, highest temp, SSH sessions | Instant health check |
| GPU Device Metrics | Utilisation, memory %, power (W), temperature per device | Per-GPU deep dive |
| Container & Enforcement | Allocations by user, unmanaged containers, --gpus all usage, enforcement events 24h, SSH sessions by user | Access and policy compliance |
| System Health | CPU load average, memory available %, disk usage, scrape target status table | Infrastructure health |

Key indicators:

  • Unmanaged GPU Containers > 0: containers bypassing the DS01 wrapper (red alert)
  • Unrestricted GPU Access > 0: --gpus all usage (dark-red alert)
  • Scrape Targets table: any DOWN rows need investigation

DS01 User Detail (ds01-user)

Admin tool for inspecting a specific user's GPU usage. Select user from the dropdown at the top.

Shows: current GPU allocations, running containers, GPU utilisation, SSH sessions, GPU-hours (7d/30d), efficiency score, per-container allocation table.

DS01 Historical (ds01-historical)

Long-term trend analysis. Default range: last 7 days.

| Row | What to look for |
|---|---|
| GPU Utilisation Trends | Hourly avg/max trends, utilisation heatmap (orange = high demand) |
| GPU Cost Attribution | GPU-hours by user (stacked bar), total GPU-hours, efficiency bargauge, GPU-hours over time |
| Capacity & Demand | GPU slot allocation over time, active users over time, GPU waste % trend |
| Temperature & Health | Daily max GPU temp, enforcement activity over time (idle/runtime kills) |

NVIDIA DCGM GPU Metrics (nvidia-dcgm)

Raw NVIDIA DCGM metrics: temperature, power, utilisation, memory, SM clocks, tensor core utilisation. Use for GPU hardware debugging. Keep as-is — managed upstream by NVIDIA.


3. Alert Management

Viewing Active Alerts

  • Grafana: Alerting tab in left sidebar
  • Alertmanager UI: http://localhost:9093
  • Prometheus alerts: http://localhost:9090/alerts

Alert Delivery

Alerts route to Microsoft Teams via Power Automate webhook (configured in monitoring/alertmanager/alertmanager.yml).

Two receivers:

  • ds01-teams — warning/info alerts (group_wait: 5m, repeat: 4h)
  • ds01-teams-critical — critical alerts (group_wait: 30s, repeat: 1h)

Silencing Alerts

Via Alertmanager UI at http://localhost:9093 → Silences → New Silence.

Or via CLI (install amtool if not present):

# Silence DS01UserGPUWaste for 4 hours (user on planned downtime)
amtool silence add --alertmanager.url=http://localhost:9093 \
alertname=DS01UserGPUWaste user=alice \
--comment="Alice on annual leave" --duration=4h

# List active silences
amtool silence query --alertmanager.url=http://localhost:9093

# Expire a silence by ID
amtool silence expire --alertmanager.url=http://localhost:9093 <silence-id>
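
To avoid repeating --alertmanager.url on every call, amtool can read the URL from a config file (by default $HOME/.config/amtool/config.yml). A minimal sketch follows; the helper name write_amtool_config is hypothetical, and it overwrites any existing config.yml in the target directory.

```shell
# Write a minimal amtool config pointing at the local Alertmanager.
# write_amtool_config is a hypothetical helper; it OVERWRITES config.yml.
write_amtool_config() {
    conf_dir="${1:-$HOME/.config/amtool}"
    mkdir -p "$conf_dir"
    printf 'alertmanager.url: http://localhost:9093\n' > "$conf_dir/config.yml"
}

# Usage:
#   write_amtool_config
#   amtool silence query     # no --alertmanager.url needed any more
```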

Testing Alert Delivery

Send a test alert to verify Teams webhook is working:

curl -s -X POST http://localhost:9093/api/v2/alerts \
-H 'Content-Type: application/json' \
-d '[{
"labels": {"alertname": "DS01TestAlert", "severity": "warning", "job": "test"},
"annotations": {"summary": "Test alert", "description": "Manual test from admin"}
}]'

Check Alertmanager UI for delivery status. Alert rules live at: monitoring/prometheus/rules/ds01_alerts.yml
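
The test alert above only exercises the warning path. Assuming routing to ds01-teams-critical is keyed on the severity label (an assumption; check alertmanager.yml), the same payload can be parameterised to test both receivers. The function names here are hypothetical.

```shell
# Build an Alertmanager v2 alert payload for a given severity.
# Assumption: receivers are selected by the "severity" label.
test_alert_payload() {
    severity="${1:-warning}"
    printf '[{"labels":{"alertname":"DS01TestAlert","severity":"%s","job":"test"},' "$severity"
    printf '"annotations":{"summary":"Test alert (%s)","description":"Manual test from admin"}}]\n' "$severity"
}

# POST the payload to Alertmanager (requires the stack to be running).
send_test_alert() {
    test_alert_payload "$1" | curl -s -X POST http://localhost:9093/api/v2/alerts \
        -H 'Content-Type: application/json' -d @-
}

# Usage:
#   send_test_alert warning
#   send_test_alert critical
```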


4. Stack Management

Working directory for all commands: /opt/ds01-infra/monitoring

Restart All Services

cd /opt/ds01-infra/monitoring && docker compose restart

Restart Single Service

docker compose restart grafana
docker compose restart prometheus
docker compose restart alertmanager
docker compose restart cadvisor
docker compose restart node-exporter
docker compose restart dcgm-exporter

View Logs

docker logs ds01-grafana --tail 50
docker logs ds01-prometheus --tail 50
docker logs ds01-alertmanager --tail 50
docker logs ds01-cadvisor --tail 50
docker logs ds01-node-exporter --tail 50
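
When it is unclear which service is misbehaving, the per-container commands can be looped. This is a sketch only: the helper names are hypothetical, and the container list is taken from the names used in this guide (including ds01-dcgm-exporter from the troubleshooting section).

```shell
# List the container names used in this guide.
ds01_containers() {
    for svc in grafana prometheus alertmanager cadvisor node-exporter dcgm-exporter; do
        printf 'ds01-%s\n' "$svc"
    done
}

# Tail the last N log lines (default 50) of every container, with a header each.
tail_all_logs() {
    lines="${1:-50}"
    ds01_containers | while read -r name; do
        printf '===== %s =====\n' "$name"
        docker logs --tail "$lines" "$name" 2>&1
    done
}

# Usage:
#   tail_all_logs 20 | less
```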

Check Scrape Targets

http://localhost:9090/targets

All 7 targets should be UP: dcgm-exporter, ds01-exporter, node-exporter, prometheus, cadvisor, grafana, alertmanager.
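
The same check can be scripted against the Prometheus HTTP API (/api/v1/targets). The sketch below counts targets by health state with grep rather than a JSON parser, which assumes the compact JSON that Prometheus normally emits; the function names are hypothetical.

```shell
# Count targets with the given health (up | down | unknown) in the JSON
# returned by /api/v1/targets, read from stdin.
count_targets_with_health() {
    grep -o "\"health\":\"$1\"" | wc -l | tr -d ' '
}

# Query Prometheus and succeed only if all 7 expected targets are UP.
check_targets() {
    json=$(curl -s 'http://localhost:9090/api/v1/targets?state=active')
    up=$(printf '%s' "$json" | count_targets_with_health up)
    printf '%s of 7 targets UP\n' "$up"
    [ "$up" -eq 7 ]
}

# Usage:
#   check_targets || echo "investigate http://localhost:9090/targets"
```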

ds01-exporter runs as a systemd service (not docker):

sudo systemctl status ds01-exporter
sudo journalctl -u ds01-exporter -n 50

Reload Configuration (without restart)

# Hot-reload Prometheus + Alertmanager config (self-heals file permissions first)
monitoring-manage reload

Use after editing prometheus.yml, ds01_alerts.yml, ds01_recording.yml, or alertmanager.yml.

Always reload via monitoring-manage reload, not raw curl. The deploy account's umask (0077) makes git pull write updated config files with mode 600, which the Prometheus/Grafana container users cannot read. A raw curl -X POST .../-/reload then fails with HTTP 500 and silently keeps the old rules. monitoring-manage reload first re-asserts world-read permission on the config trees (sudo deploy does the same, via permissions-manifest.sh).
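
To see whether a git pull has left config files unreadable, the mode bits can be inspected directly. This is a diagnostic sketch, not what monitoring-manage does internally; the function name is hypothetical and the subdirectory list is an assumption based on the paths in this guide.

```shell
# List config files that lack the world-read bit (e.g. mode 600 after git pull).
# Prints nothing when every file is readable by the container users.
unreadable_configs() {
    root="${1:-/opt/ds01-infra/monitoring}"
    for sub in prometheus alertmanager grafana/provisioning; do
        [ -d "$root/$sub" ] || continue
        # -perm -004: the "other read" bit is set; negate to find files missing it
        find "$root/$sub" -type f ! -perm -004
    done
}

# Usage:
#   unreadable_configs      # any output means: run monitoring-manage reload
```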

Start/Stop Full Stack

cd /opt/ds01-infra/monitoring
docker compose up -d # Start all
docker compose down # Stop all (data preserved in named volumes)

5. Troubleshooting

Container crash-looping

Check permissions — config files must be readable (644):

ls -la /opt/ds01-infra/monitoring/prometheus/
ls -la /opt/ds01-infra/monitoring/alertmanager/
ls -la /opt/ds01-infra/monitoring/grafana/provisioning/
# Fix: chmod 644 <file>; chmod 755 <dir>  (no -R: a recursive 644 strips the execute bit from directories)
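
To apply the fix to a whole tree, files and directories need different modes, which find can handle in two passes. A minimal sketch, assuming the tree contains only config files (no scripts that need the execute bit); fix_config_perms is a hypothetical name, and sudo may be needed depending on file ownership.

```shell
# Set directories to 755 and regular files to 644 under each given path.
fix_config_perms() {
    for dir in "$@"; do
        find "$dir" -type d -exec chmod 755 {} +
        find "$dir" -type f -exec chmod 644 {} +
    done
}

# Usage:
#   fix_config_perms /opt/ds01-infra/monitoring/prometheus \
#                    /opt/ds01-infra/monitoring/alertmanager \
#                    /opt/ds01-infra/monitoring/grafana/provisioning
```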

View crash logs:

docker logs ds01-grafana --tail 100

Panel shows "No data"

  1. Check the metric exists in Prometheus: http://localhost:9090/graph
  2. Enter the metric name (e.g., ds01_gpu_allocated) and execute
  3. If empty — check the relevant scrape target is UP (/targets)
  4. If DCGM panels are empty — verify ds01-dcgm-exporter container is running:
    docker ps | grep dcgm
    docker logs --tail 50 ds01-dcgm-exporter
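
Steps 1 to 3 can also be run from a shell via the Prometheus query API (/api/v1/query). The sketch below distinguishes "metric has series", "metric is empty" and "query failed"; the function names are hypothetical, and the string matching assumes the compact JSON that Prometheus normally returns.

```shell
# Read a /api/v1/query response on stdin.
# Returns 0 if it contains at least one series, 1 if the result is empty,
# 2 if the response is not a successful query at all.
has_series() {
    resp=$(cat)
    case "$resp" in
        *'"status":"success"'*) ;;
        *) return 2 ;;
    esac
    case "$resp" in
        *'"result":[]'*) return 1 ;;
        *) return 0 ;;
    esac
}

# Run an instant query for the given metric name or PromQL expression.
metric_exists() {
    curl -s -G http://localhost:9090/api/v1/query --data-urlencode "query=$1" | has_series
}

# Usage:
#   metric_exists ds01_gpu_allocated && echo "has data" || echo "empty or query failed"
```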

Alerts not firing

  1. Check Prometheus alert state: http://localhost:9090/alerts
  2. Verify Alertmanager is healthy: http://localhost:9093/-/healthy
  3. Confirm Teams webhook URL is configured (not placeholder):
    grep PLACEHOLDER /opt/ds01-infra/monitoring/alertmanager/alertmanager.yml
    # Should return nothing if webhook is configured
  4. Check inhibition rules — a critical alert may be suppressing related warnings

Alerts firing but no Teams message

  1. Send a test alert (see section 3)
  2. Check Alertmanager logs: docker logs ds01-alertmanager --tail 50
  3. Verify Power Automate webhook URL is still valid (they expire)

High disk usage from Prometheus

Check volume sizes:

du -sh /var/lib/docker/volumes/ds01-prometheus-data/
df -h /

Prometheus retention: 90 days / 15GB cap (whichever comes first). If disk is critical, restart Prometheus with a lower retention:

# Edit docker-compose.yaml: lower --storage.tsdb.retention.time (e.g. to 30d), then recreate the container
cd /opt/ds01-infra/monitoring && docker compose up -d prometheus
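
Whether the disk counts as "critical" can be checked with a small script before touching retention. This is a sketch: the function names are hypothetical and the 90% default threshold is an assumption, not a DS01 policy.

```shell
# Read `df -P` output on stdin and print the used percentage of the
# first filesystem row, without the trailing % sign.
disk_used_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed if usage of the given mount point (default /) is at or above
# the threshold (default 90, an assumed value).
disk_is_critical() {
    pct=$(df -P "${1:-/}" | disk_used_pct)
    [ "$pct" -ge "${2:-90}" ]
}

# Usage:
#   disk_is_critical / 90 && echo "consider lowering Prometheus retention"
```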

ds01-exporter not scraped

sudo systemctl status ds01-exporter
curl -s http://127.0.0.1:9101/metrics | head -20
sudo journalctl -u ds01-exporter -n 50

6. Configuration File Reference

All monitoring config lives in /opt/ds01-infra/monitoring/.

| File | Purpose |
|---|---|
| docker-compose.yaml | Service definitions, image versions, resource limits, volume mounts |
| prometheus/prometheus.yml | Scrape targets, intervals, alertmanager endpoint |
| prometheus/rules/ds01_alerts.yml | Alert rules (24 rules across 5 groups) |
| prometheus/rules/ds01_recording.yml | Recording rules: pre-computed aggregates (10 groups, ~45 rules) |
| alertmanager/alertmanager.yml | Alert routing, inhibition rules, Teams webhook receivers |
| grafana/provisioning/datasources/ | Prometheus datasource auto-provisioning |
| grafana/provisioning/dashboards/ | Dashboard auto-provisioning config |
| grafana/provisioning/dashboards/dashboards/ | Dashboard JSON files (ds01_overview, ds01_historical, ds01_user, nvidia_dcgm) |

After editing any config file, reload the relevant service (see section 4). For docker-compose changes, run docker compose up -d.


7. Key Metrics Reference

| Metric | Source | Description |
|---|---|---|
| ds01_gpu_allocated | ds01-exporter | Per-slot GPU allocation (labels: gpu_slot, user, container) |
| ds01_ssh_sessions_active | ds01-exporter | SSH sessions per user (from who) |
| ds01_lifecycle_events_total | ds01-exporter | Enforcement events last 24h (label: action) |
| ds01:system_gpu_utilization_avg | recording rule | System-wide GPU utilisation avg (0–100 scale) |
| ds01:user_gpu_seconds | recording rule | Gauge proxy for GPU-hours attribution (GPU-seconds; integrate via sum_over_time) |
| DCGM_FI_DEV_GPU_TEMP | dcgm-exporter | Per-GPU temperature (°C) |
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | dcgm-exporter | GPU compute engine active ratio (0–1 scale) |
| container_cpu_usage_seconds_total | cAdvisor | Per-container CPU usage |
| container_memory_usage_bytes | cAdvisor | Per-container memory usage |
| node_load15 | node-exporter | System 15-minute load average |
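
As an illustration of how the labels on ds01_gpu_allocated can be used without Grafana, the sketch below counts allocated GPU slots per user straight from the exporter's text output. It assumes a sample value greater than 0 means "slot allocated" and that samples carry no trailing timestamp; the function name is hypothetical.

```shell
# Read Prometheus text-format metrics on stdin and print "<user> <slots>"
# for every user with at least one allocated GPU slot.
# Assumption: a ds01_gpu_allocated sample with value > 0 is an allocation.
gpus_by_user() {
    awk '
        /^ds01_gpu_allocated[{]/ && $NF + 0 > 0 {
            # Pull the value of the user="..." label out of the label set
            if (match($0, /[{,]user="[^"]*"/)) {
                user = substr($0, RSTART + 7, RLENGTH - 8)
                count[user]++
            }
        }
        END { for (u in count) printf "%s %d\n", u, count[u] }
    ' | sort
}

# Usage (on the DS01 host):
#   curl -s http://127.0.0.1:9101/metrics | gpus_by_user
```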