Resource Management
Understanding quotas, fair sharing, and resource allocation in shared computing.
Part of Educational Computing Context - Career-relevant knowledge beyond DS01 basics.
Resource management is a core concept in cloud computing, HPC, and multi-tenant systems. This guide explains DS01's resource allocation and how these patterns apply to production systems.
Your Resource Limits
check-limits
This command shows your current limits and resource usage. Typical limits:
- Max GPUs: 1-2
- Max Containers: 2-3
- Memory: 32-128GB per container
- Idle Timeout: 30min-2h (varies by user)
- Max Runtime: 24h-72h (varies by user)
How Limits Work
Per-user limits prevent:
- One user monopolising all GPUs: Without limits, a user could grab all available GPUs. Everyone else waits indefinitely. Per-user caps (typically 1-2 GPUs) ensure resources are distributed across users.
- Resource exhaustion: A single runaway process could consume all system memory, crashing everyone's containers. Memory limits per container (cgroups) mean your container gets OOM-killed before it affects others.
- Unfair allocation: Users who request resources but don't use them (idle containers) block others. Idle timeouts automatically reclaim unused allocations, ensuring active users get priority over idle ones.
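The per-user cap idea can be sketched in a few lines of Python. This is a minimal illustration, not DS01's actual code: `UserLimits` and `can_allocate` are hypothetical names, and the numbers are taken from the typical ranges listed earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserLimits:
    """Hypothetical per-user limits (not DS01's real configuration schema)."""
    max_gpus: int = 1
    max_containers: int = 2


def can_allocate(gpus_in_use: int, containers_running: int,
                 limits: UserLimits) -> tuple[bool, str]:
    """Decide whether a user may start one more container holding one GPU.

    Returns (allowed, reason). The check is purely per-user: it does not
    look at how many GPUs are free on the machine.
    """
    if containers_running >= limits.max_containers:
        return False, "container limit reached"
    if gpus_in_use >= limits.max_gpus:
        return False, "GPU limit reached"
    return True, "ok"


limits = UserLimits(max_gpus=2, max_containers=3)
print(can_allocate(1, 1, limits))  # below both caps, so the request is allowed
print(can_allocate(2, 2, limits))  # already at the GPU cap, so it is denied
```

The point of the design is that the decision depends only on the requesting user's own usage, so no single user can exceed their share no matter how quickly they submit requests.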
System enforcement:
- GPU allocation (gpu_allocator.py): When you request a container, the allocator checks your current usage against your limits. Already at your GPU cap? Request denied. The allocator also tracks which GPU slot (a full GPU, or a MIG slice if enabled) each container uses.
- Memory limits (systemd cgroups): Every container runs inside a cgroup with a hard memory limit. If your container's limit is 64GB and it tries to use 65GB, the kernel's OOM killer terminates a process inside it. This isn't punitive; it protects other users' containers from your memory leak.
- Automatic cleanup (cron jobs): Background jobs check for idle containers (low CPU/GPU usage for extended periods) and for containers exceeding their max runtime. These get a warning first, then an automatic shutdown. Freed resources become available to waiting users.
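The warn-then-shutdown behaviour of the cleanup jobs can be sketched as a small decision function. This is an illustration under assumptions, not the real cron job: `ContainerState`, `cleanup_action`, and the `warned` flag are hypothetical, and how DS01 actually measures "low CPU/GPU usage" is not specified here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ContainerState:
    """Hypothetical snapshot of one container, as a cleanup job might see it."""
    name: str
    started_at: datetime
    last_active_at: datetime  # assumed: last time usage was above the idle threshold
    warned: bool = False      # assumed: set once the user has been warned


def cleanup_action(c: ContainerState, now: datetime,
                   idle_timeout: timedelta, max_runtime: timedelta) -> str:
    """Return "keep", "warn", or "stop" for a single container."""
    over_runtime = now - c.started_at > max_runtime
    idle = now - c.last_active_at > idle_timeout
    if not (over_runtime or idle):
        return "keep"
    # First offence gets a warning; a container that was already warned is stopped.
    return "stop" if c.warned else "warn"


now = datetime(2024, 1, 1, 12, 0)
idle_box = ContainerState("my-project", started_at=datetime(2024, 1, 1, 10, 0),
                          last_active_at=datetime(2024, 1, 1, 11, 0))
print(cleanup_action(idle_box, now, timedelta(minutes=30), timedelta(hours=24)))
```

Here the container has been idle for an hour against a 30-minute timeout, so the first pass returns "warn"; a later pass with `warned=True` would return "stop".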
Priority System
Priority levels (1-100):
- Default users: 50
- Power users: 75
- Admins: 100
Higher priority means:
- Allocated GPUs first when scarce: If 3 users request GPUs simultaneously and only 2 are available, higher-priority users get them first. Lower-priority users queue until resources free up.
- May pre-empt lower priority (rarely): In extreme cases, a high-priority job can terminate a lower-priority idle container to claim its GPU. This is rare - DS01 prefers waiting to pre-emption. But it means admin maintenance tasks can always run.
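The "3 users, 2 GPUs" case above can be worked through with a priority queue. This is a sketch only: `allocate_by_priority` is a hypothetical helper, the first-come-first-served tie-break is an assumption, and pre-emption is not modelled.

```python
import heapq
from itertools import count


def allocate_by_priority(requests, free_gpus):
    """Grant GPUs to the highest-priority requesters first.

    requests: list of (user, priority) in arrival order, priority 1-100.
    Returns (granted, queued), both as lists of user names.
    """
    arrival = count()
    # heapq is a min-heap, so negate the priority; the arrival counter
    # breaks ties in favour of whoever asked first (an assumption).
    heap = [(-priority, next(arrival), user) for user, priority in requests]
    heapq.heapify(heap)

    granted = []
    while heap and len(granted) < free_gpus:
        _, _, user = heapq.heappop(heap)
        granted.append(user)

    # Whoever is left waits, still ordered by priority and then arrival.
    queued = [user for _, _, user in sorted(heap)]
    return granted, queued


granted, queued = allocate_by_priority(
    [("alice", 50), ("bob", 75), ("carol", 100)], free_gpus=2
)
print(granted)  # ['carol', 'bob']
print(queued)   # ['alice']
```

With the default priority levels listed above, the admin and the power user receive the two GPUs and the default user waits until one is freed.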
Resource Efficiency
Be a good citizen:
# Done working? Free GPU
container-retire my-project
# Idle for lunch? Retire
container-retire my-project
# Switching projects? Retire
container-retire old-project
container-deploy new-project