GPU Container Guide

This guide explains how GPU access works in DS01 and what to expect when running containers with GPU.

Current state: MIG is disabled — the server runs 4 full A100-40GB GPUs, so one GPU-slot = one full GPU. Quotas are counted in GPU-equivalents (a full GPU = 1.0).

Quick Summary

Container Type	GPU?	Lifecycle
Dev container (no GPU)	No	Permanent - runs until you stop it
Dev container (with GPU)	Yes	Ephemeral - auto-stops when idle
ds01 container	Yes	Ephemeral - dedicated, auto-cleanup
docker-compose (with GPU)	Yes	Ephemeral - auto-stops when idle
docker run (with GPU)	Yes	Ephemeral - auto-stops when idle

The Golden Rule

GPU access = ephemeral container

If your container has GPU access, it WILL be stopped when:

GPU is idle for 30 minutes (no GPU activity)
Container exceeds its maximum runtime (varies by container type)

This ensures GPUs are available for all users and aren't hoarded by idle containers.

Container Lifecycle by Type

DS01 Native Containers (Recommended)

Created via container deploy:

Idle timeout: 30 minutes (configurable by group)
Max runtime: 24-72 hours (varies by user group)
Best for: Training runs, batch jobs, GPU-intensive work

# Deploy a GPU container
container deploy myproject

# When done
container retire myproject

Dev Containers (VS Code/Cursor)

Containers created by VS Code Remote Containers extension.

With GPU (--gpus in devcontainer.json):

Idle timeout: 30 minutes
Max runtime: 7 days
Lifecycle: Ephemeral - will be auto-stopped

Without GPU:

Idle timeout: None
Max runtime: None
Lifecycle: Permanent - runs until you stop it

Docker Compose Containers

With GPU:

Idle timeout: 30 minutes
Max runtime: 3 days
Lifecycle: Ephemeral

Direct docker run

With GPU:

Idle timeout: 30 minutes
Max runtime: 2 days
Lifecycle: Ephemeral

Recommendations

For Code Editing (No GPU Needed)

Use a dev container WITHOUT GPU:

Remove --gpus from your devcontainer.json
Container can run indefinitely
Perfect for: editing, git operations, lightweight work

Example devcontainer.json for permanent dev environment:

{
  "name": "My Project",
  "image": "python:3.11",
  "postCreateCommand": "pip install -r requirements.txt"
  // Note: No "runArgs": ["--gpus", "all"]
}

For GPU Work

Use ds01 ephemeral containers for best experience:

# Launch GPU container for training
container deploy training-run

# Do your training/inference
# ...

# When done, release GPU for others
container retire training-run

Data Safety

Always mount volumes for important data:

Your home directory is automatically mounted at /home/yourusername/
Work in your home directory to persist files
Container contents outside mounted volumes are lost on stop

Safe locations (persist across container restarts):

/home/yourusername/ (your home directory)
/workspace/ (if using ds01 containers)

Unsafe locations (lost on container stop):

/tmp/
/root/
Any path not mounted from host

GPU Allocation

When you request a GPU container (via any method), DS01:

Checks your quota - You have a maximum number of GPUs based on your group
Allocates a specific GPU - You get a dedicated GPU-slot (a full GPU today, or a MIG instance if MIG is enabled), not "all GPUs"
Tracks your container - For quota enforcement and cleanup
Enforces limits - Idle timeout and max runtime

If No GPU Available

If all GPUs are allocated:

The system waits up to 3 minutes for one to become available
If timeout is reached, container creation fails with a helpful message

GPU allocation failed after 3 minutes.

Suggestions:
  1. Stop an idle container: container stop <name>
  2. Wait for system cleanup (runs every 30 min)
  3. Check GPU status: gpu-status

Checking GPU Status

# See current GPU allocations
gpu-status

# See your containers
docker ps

# See your GPU usage
container ls

FAQ

Q: My dev container keeps getting stopped

A: Your container has GPU access, which makes it ephemeral. Options:

Remove GPU access if you don't need it (edit devcontainer.json)
Keep your GPU active (training running)
Use ds01 containers for GPU work, dev containers for editing

Q: How do I prevent idle timeout?

A: Keep your GPU active:

Running training/inference
Active CUDA processes

For ds01 containers, you can also:

# Create keep-alive file (prevents idle timeout)
touch /workspace/.keep-alive

Q: Can I have multiple GPU containers?

A: Yes, up to your group limit:

Students: up to 2 GPU-slots total (2.0 GPU-equivalents)
Researchers / Faculty: up to 4 GPU-slots (varies by group)
Admins: Unlimited

Q: What happens to my work when container stops?

Persisted: Files in /home/yourusername/, /workspace/
Lost: Files elsewhere, running processes, container state

Q: Why was my --gpus all rewritten?

A: DS01 allocates you a specific GPU-slot (a full GPU, or a MIG instance if enabled) instead of "all GPUs":

Ensures fair resource sharing
Enables quota tracking
Prevents GPU hoarding

Your container still has full GPU access - just to your allocated device.

Architecture Overview

                Container Creation
                       |
         +-------------+-------------+
         |                           |
    CLI-based                    API-based
    (docker run,                 (bypasses
     compose, etc.)               wrapper)
         |                           |
         v                           |
+------------------+                 |
| docker-wrapper   |                 |
| - Allocate GPU   |                 |
| - Inject labels  |                 |
| - Rewrite --gpus |                 |
+--------+---------+                 |
         |                           |
         +-------------+-------------+
                       |
                       v
              Docker Daemon
                       |
                       v
        +-----------------------------+
        | container-owner-tracker     |
        | - Track ALL containers      |
        | - Detect GPU usage          |
        | - Flag unmanaged containers |
        +-----------------------------+
                       |
                       v
        +-----------------------------+
        | Lifecycle Enforcement       |
        | - Idle timeout (cron)       |
        | - Max runtime (cron)        |
        | - Applies to ALL GPU        |
        |   containers                |
        +-----------------------------+

Summary

GPU access = ephemeral - GPU containers will be auto-stopped when idle
No GPU = permanent - Non-GPU containers can run indefinitely
Save work to mounted volumes - Home directory always persists
Use ds01 containers for GPU work - Best experience and quota tracking
Use dev containers for editing - Without GPU for permanent workspace

Quick Summary​

The Golden Rule​

Container Lifecycle by Type​

DS01 Native Containers (Recommended)​

Dev Containers (VS Code/Cursor)​

Docker Compose Containers​

Direct docker run​

Recommendations​

For Code Editing (No GPU Needed)​

For GPU Work​

Data Safety​

GPU Allocation​

If No GPU Available​

Checking GPU Status​

FAQ​

Q: My dev container keeps getting stopped​

Q: How do I prevent idle timeout?​

Q: Can I have multiple GPU containers?​

Q: What happens to my work when container stops?​

Q: Why was my --gpus all rewritten?​

Architecture Overview​

Summary​