DS01 Monitoring Operations Guide

Practical reference for day-to-day monitoring tasks on the DS01 observability stack. Stack: Prometheus · Grafana · Alertmanager · cAdvisor · DCGM Exporter · node-exporter · ds01-exporter

1. Accessing Grafana

Grafana is localhost-only. Access via SSH tunnel:

ssh -L 3000:localhost:3000 datasciencelab@ds01

Then open: http://localhost:3000

Login: admin / ds01admin (override with GRAFANA_ADMIN_USER / GRAFANA_ADMIN_PASSWORD env vars in monitoring/.env)

Prometheus UI: http://localhost:9090 (same SSH tunnel session, or add -L 9090:localhost:9090) Alertmanager UI: http://localhost:9093

2. Dashboard Guide

Four dashboards are provisioned automatically from monitoring/grafana/provisioning/dashboards/dashboards/.

DS01 Overview (`ds01-overview`)

Real-time operational view. Use this for day-to-day monitoring.

Row	Panels	Purpose
Status at a Glance	GPU slots used/free, active users, containers, system GPU util %, highest temp, SSH sessions	Instant health check
GPU Device Metrics	Utilisation, memory %, power (W), temperature per device	Per-GPU deep dive
Container & Enforcement	Allocations by user, unmanaged containers, `--gpus all` usage, enforcement events 24h, SSH sessions by user	Access and policy compliance
System Health	CPU load average, memory available %, disk usage, scrape target status table	Infrastructure health

Key indicators:

Unmanaged GPU Containers > 0 — containers bypassing DS01 wrapper (red alert)
Unrestricted GPU Access > 0 — --gpus all usage (dark-red alert)
Scrape Targets table — any DOWN rows need investigation

DS01 User Detail (`ds01-user`)

Admin tool for inspecting a specific user's GPU usage. Select user from the dropdown at the top.

Shows: current GPU allocations, running containers, GPU utilisation, SSH sessions, GPU-hours (7d/30d), efficiency score, per-container allocation table.

DS01 Historical (`ds01-historical`)

Long-term trend analysis. Default range: last 7 days.

Row	What to look for
GPU Utilisation Trends	Hourly avg/max trends, utilisation heatmap (orange = high demand)
GPU Cost Attribution	GPU-hours by user (stacked bar), total GPU-hours, efficiency bargauge, GPU-hours over time
Capacity & Demand	GPU slot allocation over time, active users over time, GPU waste % trend
Temperature & Health	Daily max GPU temp, enforcement activity over time (idle/runtime kills)

NVIDIA DCGM GPU Metrics (`nvidia-dcgm`)

Raw NVIDIA DCGM metrics: temperature, power, utilisation, memory, SM clocks, tensor core utilisation. Use for GPU hardware debugging. Keep as-is — managed upstream by NVIDIA.

3. Alert Management

Viewing Active Alerts

Grafana: Alerting tab in left sidebar
Alertmanager UI: http://localhost:9093
Prometheus alerts: http://localhost:9090/alerts

Alert Delivery

Alerts route to Microsoft Teams via Power Automate webhook (configured in monitoring/alertmanager/alertmanager.yml).

Two receivers:

ds01-teams — warning/info alerts (group_wait: 5m, repeat: 4h)
ds01-teams-critical — critical alerts (group_wait: 30s, repeat: 1h)

Silencing Alerts

Via Alertmanager UI at http://localhost:9093 → Silences → New Silence.

Or via CLI (install amtool if not present):

# Silence DS01UserGPUWaste for 4 hours (user on planned downtime)
amtool silence add --alertmanager.url=http://localhost:9093 \
  alertname=DS01UserGPUWaste user=alice \
  --comment="Alice on annual leave" --duration=4h

# List active silences
amtool silence query --alertmanager.url=http://localhost:9093

# Expire a silence by ID
amtool silence expire --alertmanager.url=http://localhost:9093 <silence-id>

Testing Alert Delivery

Send a test alert to verify Teams webhook is working:

curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {"alertname": "DS01TestAlert", "severity": "warning", "job": "test"},
    "annotations": {"summary": "Test alert", "description": "Manual test from admin"}
  }]'

Check Alertmanager UI for delivery status. Alert rules live at: monitoring/prometheus/rules/ds01_alerts.yml

4. Stack Management

Working directory for all commands: /opt/ds01-infra/monitoring

Restart All Services

cd /opt/ds01-infra/monitoring && docker compose restart

Restart Single Service

docker compose restart grafana
docker compose restart prometheus
docker compose restart alertmanager
docker compose restart cadvisor
docker compose restart node-exporter
docker compose restart dcgm-exporter

View Logs

docker logs ds01-grafana --tail 50
docker logs ds01-prometheus --tail 50
docker logs ds01-alertmanager --tail 50
docker logs ds01-cadvisor --tail 50
docker logs ds01-node-exporter --tail 50

Check Scrape Targets

http://localhost:9090/targets

All 7 targets should be UP: dcgm-exporter, ds01-exporter, node-exporter, prometheus, cadvisor, grafana, alertmanager.

ds01-exporter runs as a systemd service (not docker):

sudo systemctl status ds01-exporter
sudo journalctl -u ds01-exporter -n 50

Reload Configuration (without restart)

# Hot-reload Prometheus + Alertmanager config (self-heals file permissions first)
monitoring-manage reload

Use after editing prometheus.yml, ds01_alerts.yml, ds01_recording.yml, or alertmanager.yml.

Always reload via monitoring-manage reload, not raw curl. The deploy account's umask (0077) makes git pull write updated config files mode 600, which the Prometheus/Grafana container users can't read — a raw curl -X POST .../-/reload then fails with HTTP 500 and silently keeps the old rules. monitoring-manage reload re-asserts world-read on the config trees first (sudo deploy also does, via permissions-manifest.sh).

Start/Stop Full Stack

cd /opt/ds01-infra/monitoring
docker compose up -d      # Start all
docker compose down       # Stop all (data preserved in named volumes)

5. Troubleshooting

Container crash-looping

Check permissions — config files must be readable (644):

ls -la /opt/ds01-infra/monitoring/prometheus/
ls -la /opt/ds01-infra/monitoring/alertmanager/
ls -la /opt/ds01-infra/monitoring/grafana/provisioning/
# Fix: chmod -R 644 <file>; chmod 755 <dir>

View crash logs:

docker logs ds01-grafana --tail 100

Panel shows "No data"

Check the metric exists in Prometheus: http://localhost:9090/graph
Enter the metric name (e.g., ds01_gpu_allocated) and execute
If empty — check the relevant scrape target is UP (/targets)
If DCGM panels are empty — verify ds01-dcgm-exporter container is running:
```
docker ps | grep dcgm
docker logs --tail 50 ds01-dcgm-exporter
```

Alerts not firing

Check Prometheus alert state: http://localhost:9090/alerts
Verify Alertmanager is healthy: http://localhost:9093/-/healthy

Confirm Teams webhook URL is configured (not placeholder):

grep PLACEHOLDER /opt/ds01-infra/monitoring/alertmanager/alertmanager.yml
# Should return nothing if webhook is configured

Check inhibition rules — a critical alert may be suppressing related warnings

Alerts firing but no Teams message

Send a test alert (see section 3)
Check Alertmanager logs: docker logs ds01-alertmanager --tail 50
Verify Power Automate webhook URL is still valid (they expire)

High disk usage from Prometheus

Check volume sizes:

du -sh /var/lib/docker/volumes/ds01-prometheus-data/
df -h /

Prometheus retention: 90 days / 15GB cap (whichever comes first). If disk is critical, restart Prometheus with a lower retention:

# Edit docker-compose.yaml to reduce --storage.tsdb.retention.time=30d
cd /opt/ds01-infra/monitoring && docker compose up -d prometheus

ds01-exporter not scraped

sudo systemctl status ds01-exporter
curl -s http://127.0.0.1:9101/metrics | head -20
sudo journalctl -u ds01-exporter -n 50

6. Configuration File Reference

All monitoring config lives in /opt/ds01-infra/monitoring/.

File	Purpose
`docker-compose.yaml`	Service definitions, image versions, resource limits, volume mounts
`prometheus/prometheus.yml`	Scrape targets, intervals, alertmanager endpoint
`prometheus/rules/ds01_alerts.yml`	Alert rules (24 rules across 5 groups)
`prometheus/rules/ds01_recording.yml`	Recording rules — pre-computed aggregates (10 groups, ~45 rules)
`alertmanager/alertmanager.yml`	Alert routing, inhibition rules, Teams webhook receivers
`grafana/provisioning/datasources/`	Prometheus datasource auto-provisioning
`grafana/provisioning/dashboards/`	Dashboard auto-provisioning config
`grafana/provisioning/dashboards/dashboards/`	Dashboard JSON files (ds01_overview, ds01_historical, ds01_user, nvidia_dcgm)

After editing any config file, reload the relevant service (see section 4). For docker-compose changes, run docker compose up -d.

7. Key Metrics Reference

Metric	Source	Description
`ds01_gpu_allocated`	ds01-exporter	Per-slot GPU allocation (labels: gpu_slot, user, container)
`ds01_ssh_sessions_active`	ds01-exporter	SSH sessions per user (from `who`)
`ds01_lifecycle_events_total`	ds01-exporter	Enforcement events last 24h (label: action)
`ds01:system_gpu_utilization_avg`	recording rule	System-wide GPU utilisation avg (0–100 scale)
`ds01:user_gpu_seconds`	recording rule	Gauge proxy for GPU-hours attribution (GPU-seconds; integrate via `sum_over_time`)
`DCGM_FI_DEV_GPU_TEMP`	dcgm-exporter	Per-GPU temperature (°C)
`DCGM_FI_PROF_GR_ENGINE_ACTIVE`	dcgm-exporter	GPU compute engine active ratio (0–1 scale)
`container_cpu_usage_seconds_total`	cAdvisor	Per-container CPU usage
`container_memory_usage_bytes`	cAdvisor	Per-container memory usage
`node_load15`	node-exporter	System 15-minute load average

1. Accessing Grafana​

2. Dashboard Guide​

DS01 Overview (ds01-overview)​

DS01 User Detail (ds01-user)​

DS01 Historical (ds01-historical)​

NVIDIA DCGM GPU Metrics (nvidia-dcgm)​

3. Alert Management​

Viewing Active Alerts​

Alert Delivery​

Silencing Alerts​

Testing Alert Delivery​

4. Stack Management​

Restart All Services​

Restart Single Service​

View Logs​

Check Scrape Targets​

Reload Configuration (without restart)​

Start/Stop Full Stack​

5. Troubleshooting​

Container crash-looping​

Panel shows "No data"​

Alerts not firing​

Alerts firing but no Teams message​

High disk usage from Prometheus​

ds01-exporter not scraped​

6. Configuration File Reference​

7. Key Metrics Reference​