AIME + DS01 Integration Strategy

Date: 2025-11-11 Purpose: Comprehensive integration plan to merge AIME framework with DS01 infrastructure Goal: Maximum AIME reuse + Minimal DS01 patches = Unified robust system

1. Executive Summary

Current State

DS01 (Current):
┌─────────────────────────────────────┐
│ image-create                         │
│   ↓                                  │
│ FROM pytorch/pytorch:2.5.1  ✗       │ Not using AIME!
│   ↓                                  │
│ Build custom Dockerfile              │
│   ↓                                  │
│ mlc-create-wrapper                   │
│   ↓                                  │
│ AIME mlc-create (fails on custom!) ✗ │
└─────────────────────────────────────┘

Target State

DS01 (Integrated):
┌──────────────────────────────────────────┐
│ Tier 1: AIME Framework (Base System)     │
│   • ml_images.repo (76 frameworks)       │
│   • mlc-create-patched NEW            │
│   • mlc-open (unchanged)                 │
|   * the rest of mlc-* commands           |
│   • aime.mlc.* labels                    │
├──────────────────────────────────────────┤
│ Tier 2: DS01 Modular Commands            │
│   • image-create (uses AIME base)      │
│   • container-create (calls patched)   │
│   • resource limits (get_resource_*.py)  │
│   • GPU allocation (gpu_allocator.py)    │
|   * all Tier 2 call on Tier 1 mlc-* commands
├──────────────────────────────────────────┤
│ Tier 3: DS01 Orchestrators               │
│   • project-init                         │
│   • user-setup                           │
│   • All existing workflows               │
└──────────────────────────────────────────┘
= Tier 1 is the engine: AIME `mlc-*` commands
= Tier 2 is light CLI layer over them
= Tier 3 is orchestration layer

TO CHANGE: make sure all Tier 1 AIME functionality is leveraged by ds01! Currently the AIME audit suggests it is not all being used at present.

Unified Workflow

USER: image-create my-cv-project
┌─────────────────────────────────────────────────────────┐
│ 1. Framework Selection → Looks up in ml_images.repo     │
│    Result: aimehub/pytorch-2.5.1-aime-cuda12.1.1 ✓     │
├─────────────────────────────────────────────────────────┤
│ 2. Generate Dockerfile                                  │
│    FROM aimehub/pytorch-2.5.1-aime-cuda12.1.1 ✓        │
│    RUN pip install jupyter pandas ... (DS01 3-tier)     │
├─────────────────────────────────────────────────────────┤
│ 3. Build Custom Image                                   │
│    docker build -t my-cv-project-john-image             │
│    Result: AIME base + DS01 customization ✓            │
└─────────────────────────────────────────────────────────┘

USER: container-create my-cv-project
┌─────────────────────────────────────────────────────────┐
│ 1. Detect Custom Image Exists                           │
│    docker images | grep my-cv-project-john-image ✓     │
├─────────────────────────────────────────────────────────┤
│ 2. Get Resource Limits                                  │
│    get_resource_limits.py john --docker-args            │
│    → --cpus=16 --memory=32g --shm-size=16g ...          │
├─────────────────────────────────────────────────────────┤
│ 3. Allocate GPU                                         │
│    gpu_allocator.py allocate john my-cv-project 1 10    │
│    → GPU 0:1 (MIG instance) ✓                          │
├─────────────────────────────────────────────────────────┤
│ 4. Create Container (mlc-create-patched)                │
│    mlc-create-patched my-cv-project pytorch \           │
│      --image=my-cv-project-john-image \                 │
│      --cpus=16 --memory=32g --shm-size=16g \            │
│      --gpu=0:1 --cgroup-parent=ds01-student.slice ✓    │
├─────────────────────────────────────────────────────────┤
│ Container Created! ✓                                    │
│   • Based on AIME framework                             │
│   • Customized with DS01 packages                       │
│   • Resource limits enforced                            │
│   • GPU allocated fairly                                │
└─────────────────────────────────────────────────────────┘

2. Integration Principles

Principle 1: Maximum AIME Reuse

✓ Use AIME for:

Framework catalog (ml_images.repo)
Base image selection
Container naming convention (name._.uid)
Label system (aime.mlc.*)
User environment setup (UID/GID matching)
Lifecycle management (mlc-open unchanged)

✗ Don't reinvent AIME:

No custom framework catalog
No custom naming scheme
No custom label namespace

Principle 2: Minimal DS01 Patches

✓ Add to AIME only what's essential:

Custom image support (bypass catalog)
Resource limit application (at creation)
GPU allocation integration

✗ Don't over-engineer:

Keep patches small (<100 lines)
Document every deviation
Maintain AIME compatibility

Principle 3: Preserve DS01 UX

✓ Keep what works:

3-tier package system (framework → base → use case → custom)
Interactive wizards (--guided mode)
Phase-based workflows (1/3, 2/3, 3/3)
Clear educational prompts

✗ Don't break existing:

All current DS01 commands work unchanged
No breaking changes to user workflows
Backward compatible (can recreate old containers)

3. Core Changes Required

Change 1: AIME Base Images in DS01

File: scripts/user/image-create

Before:

get_base_image() {
    case $framework in
        tensorflow|tf) echo "tensorflow/tensorflow:2.14.0-gpu" ;;
        pytorch|*)     echo "pytorch/pytorch:2.5.1-cuda11.8-cudnn9-runtime" ;;
    esac
}

After:

get_base_image() {
    local framework="$1"
    local version="$2"

    # Look up in AIME catalog
    local AIME_REPO="/opt/aime-ml-containers/ml_images.repo"

    if [ ! -f "$AIME_REPO" ]; then
        log_error "AIME catalog not found: $AIME_REPO"
        log_info "Falling back to Docker Hub images"
        case $framework in
            tensorflow|tf) echo "tensorflow/tensorflow:2.14.0-gpu" ;;
            pytorch|*)     echo "pytorch/pytorch:2.5.1-cuda11.8-cudnn9-runtime" ;;
        esac
        return
    fi

    # Parse AIME catalog (CSV format)
    # Framework, Version, Arch, Repo
    local framework_capital=$(echo "$framework" | sed 's/\b\(.\)/\u\1/')  # Capitalize
    local arch="CUDA_ADA"  # For Ada Lovelace GPUs (RTX 40xx, A100, etc.)

    local image=$(awk -F', ' \
        -v fw="$framework_capital" \
        -v ver="$version" \
        -v arch="$arch" \
        '$1 == fw && $2 == ver && $3 == arch {print $4; exit}' \
        "$AIME_REPO")

    if [ -n "$image" ]; then
        echo "$image"
        log_success "Using AIME image: $image"
    else
        # Fallback: get latest version for framework
        image=$(awk -F', ' \
            -v fw="$framework_capital" \
            -v arch="$arch" \
            '$1 == fw && $3 == arch {print $4; exit}' \
            "$AIME_REPO")

        if [ -n "$image" ]; then
            echo "$image"
            log_warning "Version $version not found, using: $image"
        else
            log_error "Framework '$framework' not found in AIME catalog"
            exit 1
        fi
    fi
}

Lines Changed: ~15 lines Risk: Low (fallback to Docker Hub if AIME unavailable)

Change 2: Create `mlc-create-patched`

File: scripts/docker/mlc-create-patched (NEW)

Strategy: Copy aime-ml-containers/mlc-create → scripts/docker/mlc-create-patched and make minimal edits

Patch Section 1: Add Custom Image Support

Lines 44-46 (add new flags):

CONTAINER_NAME=${ARGS[0]}
FRAMEWORK_NAME=${ARGS[1]}
FRAMEWORK_VERSION=${ARGS[2]}
CUSTOM_IMAGE=""          # NEW: Support --image flag
DS01_RESOURCE_LIMITS=""  # NEW: Support --cpus, --memory, etc.

Lines 20-42 (add argument parsing):

for i in "$@"
do
case $i in
    # ... existing AIME flags ...

    --image=*)                           # NEW
    CUSTOM_IMAGE="${i#*=}"
    ;;
    --cpus=*|--memory=*|--shm-size=*|--pids-limit=*|--cgroup-parent=*)  # NEW
    DS01_RESOURCE_LIMITS="$DS01_RESOURCE_LIMITS $i"
    ;;
esac
done

Lines 176-188 (modify image resolution):

# NEW: Check if custom image provided
if [ -n "$CUSTOM_IMAGE" ]; then
    log_info "[DS01] Using custom image: $CUSTOM_IMAGE"

    # Validate image exists
    if ! docker image inspect "$CUSTOM_IMAGE" &>/dev/null; then
        echo -e "\nError: Custom image not found: $CUSTOM_IMAGE"
        echo -e "Build it first with: image-create"
        exit -3
    fi

    IMAGE="$CUSTOM_IMAGE"
else
    # ORIGINAL AIME LOGIC: Look up in ml_images.repo
    IMAGE=$(find_image $FRAMEWORK_NAME $FRAMEWORK_VERSION)

    if [[ $IMAGE == "" ]]; then
        # ... existing error handling ...
    fi
fi

Patch Section 2: Apply DS01 Resource Limits

Lines 230-235 (modify docker create):

# Parse DS01 resource limits
RESOURCE_ARGS=""
if [ -n "$DS01_RESOURCE_LIMITS" ]; then
    RESOURCE_ARGS="$DS01_RESOURCE_LIMITS"
    log_info "[DS01] Applying resource limits: $RESOURCE_ARGS"
fi

# ORIGINAL AIME docker create + DS01 additions
OUT=$(docker create -it \
    $VOLUMES \
    -w $WORKSPACE \
    --name=$CONTAINER_TAG \

    # AIME labels (keep all)
    --label=$CONTAINER_LABEL=$USER \
    --label=$CONTAINER_LABEL.NAME=$CONTAINER_NAME \
    --label=$CONTAINER_LABEL.USER=$USER \
    --label=$CONTAINER_LABEL.VERSION=$MLC_VERSION \
    --label=$CONTAINER_LABEL.WORK_MOUNT=$WORKSPACE_MOUNT \
    --label=$CONTAINER_LABEL.DATA_MOUNT=$DATA_MOUNT \
    --label=$CONTAINER_LABEL.FRAMEWORK=$FRAMEWORK_NAME-$FRAMEWORK_VERSION \
    --label=$CONTAINER_LABEL.GPUS=$GPUS \

    # DS01 labels (add these)
    --label=$CONTAINER_LABEL.DS01_MANAGED=true \          # NEW
    --label=$CONTAINER_LABEL.CUSTOM_IMAGE=$CUSTOM_IMAGE \ # NEW (if used)

    # User & security (AIME)
    --user $USER_ID:$GROUP_ID \
    --tty --privileged --interactive \
    --group-add video \
    --group-add sudo \

    # GPU & devices (AIME)
    --gpus=$GPUS \
    --device /dev/video0 \
    --device /dev/snd \

    # Networking & IPC (AIME)
    --network=host \
    --ipc=host \
    -v /tmp/.X11-unix:/tmp/.X11-unix \

    # Resource limits (AIME defaults)
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \

    # DS01 resource limits (NEW - applied at creation!)
    $RESOURCE_ARGS \

    # Image & command
    $IMAGE:$CONTAINER_TAG \
    bash -c "echo \"export PS1='[$CONTAINER_NAME] \\\`whoami\\\`@\\\`hostname\\\`:\\\${PWD#*}$ '\" >> ~/.bashrc; bash")

Documentation Header

Lines 1-30:

#!/bin/bash
#
# mlc-create-patched - DS01-enhanced AIME container creation
#
# BASED ON: aime-ml-containers/mlc-create v3
# MODIFIED BY: DS01 Infrastructure
# DATE: 2025-11-11
#
# ==================== DS01 MODIFICATIONS ====================
#
# This script is a MINIMALLY MODIFIED version of AIME's mlc-create.
# Original AIME code is preserved wherever possible.
#
# DS01 ADDITIONS:
# 1. Custom image support (--image flag)
# - Allows using user-built Docker images
# - Falls back to AIME catalog if not specified
#
# 2. Resource limits (--cpus, --memory, --shm-size, etc.)
# - Applied at container creation (not post-update)
# - Integrated with resource-limits.yaml
#
# 3. GPU allocation integration
# - Receives specific GPU ID from gpu_allocator.py
# - Supports MIG instances (e.g., --gpus=device=0:1)
#
# COMPATIBILITY:
# - 100% backward compatible with original mlc-create
# - All AIME commands (mlc-open, etc.) work unchanged
# - Uses same naming convention (name._.uid)
# - Uses same label system (aime.mlc.*)
#
# DEVIATIONS FROM AIME:
# Lines 108-128: Custom image support added
# Lines 176-195: Image resolution modified
# Lines 230-250: Resource limits added to docker create
#
# TOTAL CHANGES: ~65 lines added/modified out of 238
#
# UPSTREAM: To contribute back to AIME, these changes could be
# upstreamed as optional flags (--image, --cpus, etc.)
#
# ===========================================================

# Original AIME copyright notice
# AIME MLC - Machine Learning Container Management
# Copyright (c) AIME GmbH and affiliates
# MIT LICENSE

Total Patch Size:

Header documentation: ~30 lines
Custom image support: ~25 lines
Resource limits: ~10 lines
Total: ~65 lines added to 238-line script = 27% addition

Change 3: Integrate GPU Allocation

File: scripts/docker/mlc-create-wrapper.sh → REPLACE with simpler version

Current: 426 lines (too complex!)

New: ~150 lines (streamlined)

Responsibilities:

Parse user input
Get resource limits (via get_resource_limits.py)
Allocate GPU (via gpu_allocator.py)
Call mlc-create-patched with all arguments
Handle errors gracefully

New Workflow:

#!/bin/bash
# mlc-create-wrapper.sh - Simplified DS01 wrapper for mlc-create-patched

# 1. Get resource limits
LIMITS=$(python3 get_resource_limits.py $USER --docker-args)
# → --cpus=16 --memory=32g --shm-size=16g --cgroup-parent=ds01-student.slice

# 2. Allocate GPU (if needed)
if [ "$CPU_ONLY" != true ]; then
    GPU_ID=$(python3 gpu_allocator.py allocate $USER $CONTAINER_TAG 1 $PRIORITY)
    GPU_ARG="--gpus=device=$GPU_ID"
else
    GPU_ARG=""
fi

# 3. Call mlc-create-patched with everything
if [ -n "$CUSTOM_IMAGE" ]; then
    # Using custom image
    bash mlc-create-patched $CONTAINER_NAME pytorch \
        --image=$CUSTOM_IMAGE \
        $LIMITS \
        $GPU_ARG \
        -w=$WORKSPACE_DIR
else
    # Using AIME catalog
    bash mlc-create-patched $CONTAINER_NAME $FRAMEWORK $VERSION \
        $LIMITS \
        $GPU_ARG \
        -w=$WORKSPACE_DIR
fi

# 4. Done! (no post-processing needed)

Lines Changed: Entire file rewritten (simpler!) Risk: Medium (but old wrapper didn't work for custom images anyway)

Change 4: Update `container-create`

File: scripts/user/container-create

Minimal changes needed:

Line 29 (update MLC wrapper path):

MLC_WRAPPER="$INFRA_ROOT/scripts/docker/mlc-create-wrapper.sh"
MLC_PATCHED="$INFRA_ROOT/scripts/docker/mlc-create-patched"  # NEW: Can call directly

Lines 651-665 (update wrapper call):

# Determine if using custom image or framework
if [ -n "$IMAGE_NAME" ]; then
    # Custom image exists - pass it to wrapper
    WRAPPER_ARGS+=("--image=$IMAGE_NAME")
fi

# Call wrapper (wrapper will call mlc-create-patched)
if bash "$MLC_WRAPPER" "${WRAPPER_ARGS[@]}"; then
    # Success!
else
    # Error handling
fi

Lines Changed: ~10 lines Risk: Low (just passing image name)

Change 5: Standardize Labels

Files: Multiple

Strategy: Remove all ds01.* labels, use aime.mlc.* only

Changes:

image-create Dockerfile generation (lines 443-448):

# BEFORE:
LABEL maintainer="$USERNAME"
LABEL maintainer.id="$USER_ID"
LABEL ds01.image="$image_name"

# AFTER:
LABEL aime.mlc.MAINTAINER="$USERNAME"
LABEL aime.mlc.MAINTAINER_ID="$USER_ID"
LABEL aime.mlc.CUSTOM_IMAGE="$image_name"

container-list (line 109):

# BEFORE:
docker ps -a --filter "label=maintainer=$USERNAME"

# AFTER:
docker ps -a --filter "label=aime.mlc.USER=$USERNAME"

All monitoring scripts: Replace ds01.* label checks with aime.mlc.*

Lines Changed: ~30 lines across 6 files Risk: Low (just label renaming)

4. Implementation Roadmap

Phase 1: Foundation (1-2 hours)

✓ Task 1.1: Create mlc-create-patched

Copy aime-ml-containers/mlc-create to scripts/docker/mlc-create-patched
Add header documentation
Test: bash mlc-create-patched test pytorch 2.5.1 (should work like original)

✓ Task 1.2: Add custom image support to mlc-create-patched

Add --image flag parsing
Modify image resolution logic
Test: bash mlc-create-patched test pytorch --image=pytorch/pytorch:2.5.1

✓ Task 1.3: Add resource limits to mlc-create-patched

Add --cpus, --memory, etc. flag parsing
Pass to docker create
Test: bash mlc-create-patched test pytorch --cpus=8 --memory=16g

Deliverable: Working mlc-create-patched that accepts custom images + resource limits

Phase 2: Integration (2-3 hours)

✓ Task 2.1: Update image-create to use AIME catalog

Modify get_base_image() function
Add AIME catalog lookup
Test: image-create test-img (should use AIME base)

✓ Task 2.2: Simplify mlc-create-wrapper

Rewrite to call mlc-create-patched directly
Integrate GPU allocator
Test: bash mlc-create-wrapper test-container pytorch

✓ Task 2.3: Update container-create

Pass custom image to wrapper
Test full workflow: container-create test

Deliverable: End-to-end workflow works (image-create → container-create)

Phase 3: Polish (1-2 hours)

✓ Task 3.1: Standardize labels

Update all scripts to use aime.mlc.*
Remove ds01.* labels
Test: docker inspect shows correct labels

✓ Task 3.2: Update documentation

README.md
Command help text

✓ Task 3.3: Test matrix

Deliverable: Production-ready integrated system

Phase 4: Validation (1 hour)

✓ Task 4.1: User acceptance testing

Run through user-setup wizard
Create project with custom image
Verify resource limits
Check GPU allocation

✓ Task 4.2: Backward compatibility

Test that old containers still work
Verify mlc-open works on old + new containers
Check monitoring scripts work

Deliverable: Validated, production-ready system

5. Testing Matrix

Test Case 1: AIME Framework (Catalog Only)

# Should work like original AIME
mlc-create-patched pytorch-test Pytorch 2.5.1

# Verify:
docker inspect pytorch-test._.$(id -u) | grep -i "aime.mlc"
docker inspect pytorch-test._.$(id -u) | grep -i "aimehub/pytorch"

Expected: Container created from AIME base image

Test Case 2: Custom Image (DS01 Workflow)

# Create custom image
image-create my-cv-project -f pytorch -t cv

# Verify base image
docker history my-cv-project-john-image | grep aimehub

# Create container from custom image
container-create my-cv-project

# Verify:
docker inspect my-cv-project._.$(id -u) | grep "aime.mlc.CUSTOM_IMAGE"

Expected: Container created from custom image, still uses AIME labels

Test Case 3: Resource Limits

# Create container
container-create test-limits

# Verify:
docker inspect test-limits._.$(id -u) --format '{{.HostConfig.NanoCpus}}'
# Should show: 16000000000 (16 CPUs)

docker inspect test-limits._.$(id -u) --format '{{.HostConfig.Memory}}'
# Should show: 34359738368 (32 GB)

docker inspect test-limits._.$(id -u) --format '{{.HostConfig.ShmSize}}'
# Should show: 17179869184 (16 GB) - set at creation!

docker inspect test-limits._.$(id -u) --format '{{.HostConfig.CgroupParent}}'
# Should show: ds01-student.slice

Expected: All limits applied correctly at creation time

Test Case 4: GPU Allocation

# Create container
container-create gpu-test

# Check allocation state
python3 scripts/docker/gpu_allocator.py status
# Should show: gpu-test._.1001 → GPU 0:1

# Verify in container
docker exec gpu-test._.$(id -u) nvidia-smi
# Should show only allocated GPU

# Release
docker stop gpu-test._.$(id -u)
python3 scripts/docker/gpu_allocator.py release gpu-test._.$(id -u)

Expected: GPU allocated, tracked, and released properly

Test Case 5: mlc-open (Compatibility)

# Create via new system
container-create compat-test

# Open via AIME command (unchanged)
mlc-open compat-test

# Verify:
# - Container starts
# - Shell opens
# - User is correct UID/GID
# - Workspace mounted

Expected: AIME's mlc-open works with DS01-created containers

Test Case 6: Full Workflow (End-to-End)

# User onboarding workflow
user-setup

# Follow prompts:
# - Create project: my-thesis
# - Framework: PyTorch
# - Use case: Computer Vision
# - Additional: wandb

# Verify:
# 1. Image created from AIME base
docker images | grep my-thesis-john-image

# 2. Container created with limits
docker inspect my-thesis._.$(id -u)

# 3. GPU allocated
python3 scripts/docker/gpu_allocator.py user-status john

# 4. Can enter container
container-run my-thesis

# 5. Packages installed
pip list | grep -E "torch|timm|wandb"

Expected: Entire workflow works seamlessly

6. Risk Assessment

High Risk (must test thoroughly)

mlc-create-patched stability
- Risk: Breaks existing AIME workflows
- Mitigation: Keep original logic intact, add only optional features
- Test: Both AIME catalog and custom image paths
Resource limits at creation
- Risk: shm-size not applied correctly
- Mitigation: Pass ALL limits to docker create, not docker update
- Test: Verify shm-size in containers
GPU allocator integration
- Risk: GPU not allocated or tracked
- Mitigation: Extensive logging, state file validation
- Test: Full allocation/release cycle

Medium Risk

Label migration
- Risk: Scripts break when looking for old labels
- Mitigation: Comprehensive grep for all label references
- Test: All container-list, container-stats, etc.
Image catalog lookup
- Risk: Framework name mismatch (pytorch vs Pytorch)
- Mitigation: Case-insensitive lookup, clear error messages
- Test: Various framework name formats

Low Risk

Documentation updates
- Risk: Docs out of sync
- Mitigation: Update docs incrementally with code changes
Backward compatibility
- Risk: Old containers don't work
- Mitigation: AIME labels preserved, same naming
- Test: Create old-style container, verify mlc-open works

7. Rollback Plan

If integration fails:

Restore scripts

git checkout main -- scripts/docker/mlc-create-wrapper.sh
git checkout main -- scripts/user/image-create
git checkout main -- scripts/user/container-create

Remove mlc-create-patched
```
rm scripts/docker/mlc-create-patched
```

Restore symlinks

# If any symlinks were changed
sudo rm /usr/local/bin/mlc-create
sudo ln -s /opt/aime-ml-containers/mlc-create /usr/local/bin/

Existing containers continue to work
- They use docker directly, not dependent on scripts

9. Success Criteria

✓ Integration successful if:

AIME base images used
- All new containers use aimehub/* images
- 76 framework versions available
Custom images work
- Users can build on AIME base
- Dockerfile workflow preserved
Resource limits enforced
- All limits from YAML applied
- shm-size set at creation (not after)
GPU allocation integrated
- gpu_allocator.py called before container creation
- State tracking works
Backward compatible
- mlc-open works on all containers
- Old and new containers coexist
Labels standardized
- All use aime.mlc.* namespace
- No ds01.* labels remain
Documentation updated
- README reflects new workflow
All tests pass
- Test matrix completed
- User acceptance successful

QUESTIONS FROM HENRY

does mlc-create-patched pass to mlc.py?
how does the custom image vs AIME images workflow differ
can i use image inheritance to make it cleaner? so the parent AIME images are handled according to AIME workflow, then ds01 takes over and inherits the image to add custom packages, resource allocation? or is that already happening?
- Build image chains: AIME -> ds01

1. Executive Summary​

Current State​

Target State​

Unified Workflow​

2. Integration Principles​

Principle 1: Maximum AIME Reuse​

Principle 2: Minimal DS01 Patches​

Principle 3: Preserve DS01 UX​

3. Core Changes Required​

Change 1: AIME Base Images in DS01​

Change 2: Create mlc-create-patched​

Patch Section 1: Add Custom Image Support​

Patch Section 2: Apply DS01 Resource Limits​

Documentation Header​

Change 3: Integrate GPU Allocation​

Change 4: Update container-create​

Change 5: Standardize Labels​

4. Implementation Roadmap​

Phase 1: Foundation (1-2 hours)​

Phase 2: Integration (2-3 hours)​

Phase 3: Polish (1-2 hours)​

Phase 4: Validation (1 hour)​

5. Testing Matrix​

Test Case 1: AIME Framework (Catalog Only)​

Test Case 2: Custom Image (DS01 Workflow)​

Test Case 3: Resource Limits​

Test Case 4: GPU Allocation​

Test Case 5: mlc-open (Compatibility)​

Test Case 6: Full Workflow (End-to-End)​

6. Risk Assessment​

High Risk (must test thoroughly)​

Medium Risk​

Low Risk​

7. Rollback Plan​

9. Success Criteria​