AIME + DS01 Integration Strategy
Date: 2025-11-11 Purpose: Comprehensive integration plan to merge AIME framework with DS01 infrastructure Goal: Maximum AIME reuse + Minimal DS01 patches = Unified robust system
1. Executive Summary
Current State
DS01 (Current):
┌─────────────────────────────────────┐
│ image-create │
│ ↓ │
│ FROM pytorch/pytorch:2.5.1 ✗ │ Not using AIME!
│ ↓ │
│ Build custom Dockerfile │
│ ↓ │
│ mlc-create-wrapper │
│ ↓ │
│ AIME mlc-create (fails on custom!) ✗ │
└─────────────────────────────────────┘
Target State
DS01 (Integrated):
┌──────────────────────────────────────────┐
│ Tier 1: AIME Framework (Base System) │
│ • ml_images.repo (76 frameworks) │
│ • mlc-create-patched NEW │
│ • mlc-open (unchanged) │
| * the rest of mlc-* commands |
│ • aime.mlc.* labels │
├──────────────────────────────────────────┤
│ Tier 2: DS01 Modular Commands │
│ • image-create (uses AIME base) │
│ • container-create (calls patched) │
│ • resource limits (get_resource_*.py) │
│ • GPU allocation (gpu_allocator.py) │
| * all Tier 2 call on Tier 1 mlc-* commands
├──────────────────────────────────────────┤
│ Tier 3: DS01 Orchestrators │
│ • project-init │
│ • user-setup │
│ • All existing workflows │
└──────────────────────────────────────────┘
= Tier 1 is the engine: AIME `mlc-*` commands
= Tier 2 is light CLI layer over them
= Tier 3 is orchestration layer
TO CHANGE: make sure all Tier 1 AIME functionality is leveraged by ds01! Currently the AIME audit suggests it is not all being used at present.
Unified Workflow
USER: image-create my-cv-project
┌─────────────────────────────────────────────────────────┐
│ 1. Framework Selection → Looks up in ml_images.repo │
│ Result: aimehub/pytorch-2.5.1-aime-cuda12.1.1 ✓ │
├─────────────────────────────────────────────────────────┤
│ 2. Generate Dockerfile │
│ FROM aimehub/pytorch-2.5.1-aime-cuda12.1.1 ✓ │
│ RUN pip install jupyter pandas ... (DS01 3-tier) │
├─────────────────────────────────────────────────────────┤
│ 3. Build Custom Image │
│ docker build -t my-cv-project-john-image │
│ Result: AIME base + DS01 customization ✓ │
└─────────────────────────────────────────────────────────┘
USER: container-create my-cv-project
┌─────────────────────────────────────────────────────────┐
│ 1. Detect Custom Image Exists │
│ docker images | grep my-cv-project-john-image ✓ │
├─────────────────────────────────────────────────────────┤
│ 2. Get Resource Limits │
│ get_resource_limits.py john --docker-args │
│ → --cpus=16 --memory=32g --shm-size=16g ... │
├─────────────────────────────────────────────────────────┤
│ 3. Allocate GPU │
│ gpu_allocator.py allocate john my-cv-project 1 10 │
│ → GPU 0:1 (MIG instance) ✓ │
├─────────────────────────────────────────────────────────┤
│ 4. Create Container (mlc-create-patched) │
│ mlc-create-patched my-cv-project pytorch \ │
│ --image=my-cv-project-john-image \ │
│ --cpus=16 --memory=32g --shm-size=16g \ │
│ --gpu=0:1 --cgroup-parent=ds01-student.slice ✓ │
├─────────────────────────────────────────────────────────┤
│ Container Created! ✓ │
│ • Based on AIME framework │
│ • Customized with DS01 packages │
│ • Resource limits enforced │
│ • GPU allocated fairly │
└─────────────────────────────────────────────────────────┘
2. Integration Principles
Principle 1: Maximum AIME Reuse
✓ Use AIME for:
- Framework catalog (
ml_images.repo) - Base image selection
- Container naming convention (
name._.uid) - Label system (
aime.mlc.*) - User environment setup (UID/GID matching)
- Lifecycle management (
mlc-openunchanged)
✗ Don't reinvent AIME:
- No custom framework catalog
- No custom naming scheme
- No custom label namespace
Principle 2: Minimal DS01 Patches
✓ Add to AIME only what's essential:
- Custom image support (bypass catalog)
- Resource limit application (at creation)
- GPU allocation integration
✗ Don't over-engineer:
- Keep patches small (<100 lines)
- Document every deviation
- Maintain AIME compatibility
Principle 3: Preserve DS01 UX
✓ Keep what works:
- 3-tier package system (framework → base → use case → custom)
- Interactive wizards (
--guidedmode) - Phase-based workflows (1/3, 2/3, 3/3)
- Clear educational prompts
✗ Don't break existing:
- All current DS01 commands work unchanged
- No breaking changes to user workflows
- Backward compatible (can recreate old containers)
3. Core Changes Required
Change 1: AIME Base Images in DS01
File: scripts/user/image-create
Before:
get_base_image() {
case $framework in
tensorflow|tf) echo "tensorflow/tensorflow:2.14.0-gpu" ;;
pytorch|*) echo "pytorch/pytorch:2.5.1-cuda11.8-cudnn9-runtime" ;;
esac
}
After:
get_base_image() {
local framework="$1"
local version="$2"
# Look up in AIME catalog
local AIME_REPO="/opt/aime-ml-containers/ml_images.repo"
if [ ! -f "$AIME_REPO" ]; then
log_error "AIME catalog not found: $AIME_REPO"
log_info "Falling back to Docker Hub images"
case $framework in
tensorflow|tf) echo "tensorflow/tensorflow:2.14.0-gpu" ;;
pytorch|*) echo "pytorch/pytorch:2.5.1-cuda11.8-cudnn9-runtime" ;;
esac
return
fi
# Parse AIME catalog (CSV format)
# Framework, Version, Arch, Repo
local framework_capital=$(echo "$framework" | sed 's/\b\(.\)/\u\1/') # Capitalize
local arch="CUDA_ADA" # For Ada Lovelace GPUs (RTX 40xx, A100, etc.)
local image=$(awk -F', ' \
-v fw="$framework_capital" \
-v ver="$version" \
-v arch="$arch" \
'$1 == fw && $2 == ver && $3 == arch {print $4; exit}' \
"$AIME_REPO")
if [ -n "$image" ]; then
echo "$image"
log_success "Using AIME image: $image"
else
# Fallback: get latest version for framework
image=$(awk -F', ' \
-v fw="$framework_capital" \
-v arch="$arch" \
'$1 == fw && $3 == arch {print $4; exit}' \
"$AIME_REPO")
if [ -n "$image" ]; then
echo "$image"
log_warning "Version $version not found, using: $image"
else
log_error "Framework '$framework' not found in AIME catalog"
exit 1
fi
fi
}
Lines Changed: ~15 lines Risk: Low (fallback to Docker Hub if AIME unavailable)
Change 2: Create mlc-create-patched
File: scripts/docker/mlc-create-patched (NEW)
Strategy: Copy aime-ml-containers/mlc-create → scripts/docker/mlc-create-patched and make minimal edits
Patch Section 1: Add Custom Image Support
Lines 44-46 (add new flags):
CONTAINER_NAME=${ARGS[0]}
FRAMEWORK_NAME=${ARGS[1]}
FRAMEWORK_VERSION=${ARGS[2]}
CUSTOM_IMAGE="" # NEW: Support --image flag
DS01_RESOURCE_LIMITS="" # NEW: Support --cpus, --memory, etc.
Lines 20-42 (add argument parsing):
for i in "$@"
do
case $i in
# ... existing AIME flags ...
--image=*) # NEW
CUSTOM_IMAGE="${i#*=}"
;;
--cpus=*|--memory=*|--shm-size=*|--pids-limit=*|--cgroup-parent=*) # NEW
DS01_RESOURCE_LIMITS="$DS01_RESOURCE_LIMITS $i"
;;
esac
done
Lines 176-188 (modify image resolution):
# NEW: Check if custom image provided
if [ -n "$CUSTOM_IMAGE" ]; then
log_info "[DS01] Using custom image: $CUSTOM_IMAGE"
# Validate image exists
if ! docker image inspect "$CUSTOM_IMAGE" &>/dev/null; then
echo -e "\nError: Custom image not found: $CUSTOM_IMAGE"
echo -e "Build it first with: image-create"
exit -3
fi
IMAGE="$CUSTOM_IMAGE"
else
# ORIGINAL AIME LOGIC: Look up in ml_images.repo
IMAGE=$(find_image $FRAMEWORK_NAME $FRAMEWORK_VERSION)
if [[ $IMAGE == "" ]]; then
# ... existing error handling ...
fi
fi
Patch Section 2: Apply DS01 Resource Limits
Lines 230-235 (modify docker create):
# Parse DS01 resource limits
RESOURCE_ARGS=""
if [ -n "$DS01_RESOURCE_LIMITS" ]; then
RESOURCE_ARGS="$DS01_RESOURCE_LIMITS"
log_info "[DS01] Applying resource limits: $RESOURCE_ARGS"
fi
# ORIGINAL AIME docker create + DS01 additions
OUT=$(docker create -it \
$VOLUMES \
-w $WORKSPACE \
--name=$CONTAINER_TAG \
# AIME labels (keep all)
--label=$CONTAINER_LABEL=$USER \
--label=$CONTAINER_LABEL.NAME=$CONTAINER_NAME \
--label=$CONTAINER_LABEL.USER=$USER \
--label=$CONTAINER_LABEL.VERSION=$MLC_VERSION \
--label=$CONTAINER_LABEL.WORK_MOUNT=$WORKSPACE_MOUNT \
--label=$CONTAINER_LABEL.DATA_MOUNT=$DATA_MOUNT \
--label=$CONTAINER_LABEL.FRAMEWORK=$FRAMEWORK_NAME-$FRAMEWORK_VERSION \
--label=$CONTAINER_LABEL.GPUS=$GPUS \
# DS01 labels (add these)
--label=$CONTAINER_LABEL.DS01_MANAGED=true \ # NEW
--label=$CONTAINER_LABEL.CUSTOM_IMAGE=$CUSTOM_IMAGE \ # NEW (if used)
# User & security (AIME)
--user $USER_ID:$GROUP_ID \
--tty --privileged --interactive \
--group-add video \
--group-add sudo \
# GPU & devices (AIME)
--gpus=$GPUS \
--device /dev/video0 \
--device /dev/snd \
# Networking & IPC (AIME)
--network=host \
--ipc=host \
-v /tmp/.X11-unix:/tmp/.X11-unix \
# Resource limits (AIME defaults)
--ulimit memlock=-1 \
--ulimit stack=67108864 \
# DS01 resource limits (NEW - applied at creation!)
$RESOURCE_ARGS \
# Image & command
$IMAGE:$CONTAINER_TAG \
bash -c "echo \"export PS1='[$CONTAINER_NAME] \\\`whoami\\\`@\\\`hostname\\\`:\\\${PWD#*}$ '\" >> ~/.bashrc; bash")
Documentation Header
Lines 1-30:
#!/bin/bash
#
# mlc-create-patched - DS01-enhanced AIME container creation
#
# BASED ON: aime-ml-containers/mlc-create v3
# MODIFIED BY: DS01 Infrastructure
# DATE: 2025-11-11
#
# ==================== DS01 MODIFICATIONS ====================
#
# This script is a MINIMALLY MODIFIED version of AIME's mlc-create.
# Original AIME code is preserved wherever possible.
#
# DS01 ADDITIONS:
# 1. Custom image support (--image flag)
# - Allows using user-built Docker images
# - Falls back to AIME catalog if not specified
#
# 2. Resource limits (--cpus, --memory, --shm-size, etc.)
# - Applied at container creation (not post-update)
# - Integrated with resource-limits.yaml
#
# 3. GPU allocation integration
# - Receives specific GPU ID from gpu_allocator.py
# - Supports MIG instances (e.g., --gpus=device=0:1)
#
# COMPATIBILITY:
# - 100% backward compatible with original mlc-create
# - All AIME commands (mlc-open, etc.) work unchanged
# - Uses same naming convention (name._.uid)
# - Uses same label system (aime.mlc.*)
#
# DEVIATIONS FROM AIME:
# Lines 108-128: Custom image support added
# Lines 176-195: Image resolution modified
# Lines 230-250: Resource limits added to docker create
#
# TOTAL CHANGES: ~65 lines added/modified out of 238
#
# UPSTREAM: To contribute back to AIME, these changes could be
# upstreamed as optional flags (--image, --cpus, etc.)
#
# ===========================================================
# Original AIME copyright notice
# AIME MLC - Machine Learning Container Management
# Copyright (c) AIME GmbH and affiliates
# MIT LICENSE
Total Patch Size:
- Header documentation: ~30 lines
- Custom image support: ~25 lines
- Resource limits: ~10 lines
- Total: ~65 lines added to 238-line script = 27% addition
Change 3: Integrate GPU Allocation
File: scripts/docker/mlc-create-wrapper.sh → REPLACE with simpler version
Current: 426 lines (too complex!)
New: ~150 lines (streamlined)
Responsibilities:
- Parse user input
- Get resource limits (via
get_resource_limits.py) - Allocate GPU (via
gpu_allocator.py) - Call
mlc-create-patchedwith all arguments - Handle errors gracefully
New Workflow:
#!/bin/bash
# mlc-create-wrapper.sh - Simplified DS01 wrapper for mlc-create-patched
# 1. Get resource limits
LIMITS=$(python3 get_resource_limits.py $USER --docker-args)
# → --cpus=16 --memory=32g --shm-size=16g --cgroup-parent=ds01-student.slice
# 2. Allocate GPU (if needed)
if [ "$CPU_ONLY" != true ]; then
GPU_ID=$(python3 gpu_allocator.py allocate $USER $CONTAINER_TAG 1 $PRIORITY)
GPU_ARG="--gpus=device=$GPU_ID"
else
GPU_ARG=""
fi
# 3. Call mlc-create-patched with everything
if [ -n "$CUSTOM_IMAGE" ]; then
# Using custom image
bash mlc-create-patched $CONTAINER_NAME pytorch \
--image=$CUSTOM_IMAGE \
$LIMITS \
$GPU_ARG \
-w=$WORKSPACE_DIR
else
# Using AIME catalog
bash mlc-create-patched $CONTAINER_NAME $FRAMEWORK $VERSION \
$LIMITS \
$GPU_ARG \
-w=$WORKSPACE_DIR
fi
# 4. Done! (no post-processing needed)
Lines Changed: Entire file rewritten (simpler!) Risk: Medium (but old wrapper didn't work for custom images anyway)
Change 4: Update container-create
File: scripts/user/container-create
Minimal changes needed:
Line 29 (update MLC wrapper path):
MLC_WRAPPER="$INFRA_ROOT/scripts/docker/mlc-create-wrapper.sh"
MLC_PATCHED="$INFRA_ROOT/scripts/docker/mlc-create-patched" # NEW: Can call directly
Lines 651-665 (update wrapper call):
# Determine if using custom image or framework
if [ -n "$IMAGE_NAME" ]; then
# Custom image exists - pass it to wrapper
WRAPPER_ARGS+=("--image=$IMAGE_NAME")
fi
# Call wrapper (wrapper will call mlc-create-patched)
if bash "$MLC_WRAPPER" "${WRAPPER_ARGS[@]}"; then
# Success!
else
# Error handling
fi
Lines Changed: ~10 lines Risk: Low (just passing image name)
Change 5: Standardize Labels
Files: Multiple
Strategy: Remove all ds01.* labels, use aime.mlc.* only
Changes:
-
image-create Dockerfile generation (lines 443-448):
# BEFORE:LABEL maintainer="$USERNAME"LABEL maintainer.id="$USER_ID"LABEL ds01.image="$image_name"# AFTER:LABEL aime.mlc.MAINTAINER="$USERNAME"LABEL aime.mlc.MAINTAINER_ID="$USER_ID"LABEL aime.mlc.CUSTOM_IMAGE="$image_name" -
container-list (line 109):
# BEFORE:docker ps -a --filter "label=maintainer=$USERNAME"# AFTER:docker ps -a --filter "label=aime.mlc.USER=$USERNAME" -
All monitoring scripts: Replace
ds01.*label checks withaime.mlc.*
Lines Changed: ~30 lines across 6 files Risk: Low (just label renaming)
4. Implementation Roadmap
Phase 1: Foundation (1-2 hours)
✓ Task 1.1: Create mlc-create-patched
- Copy
aime-ml-containers/mlc-createtoscripts/docker/mlc-create-patched - Add header documentation
- Test:
bash mlc-create-patched test pytorch 2.5.1(should work like original)
✓ Task 1.2: Add custom image support to mlc-create-patched
- Add
--imageflag parsing - Modify image resolution logic
- Test:
bash mlc-create-patched test pytorch --image=pytorch/pytorch:2.5.1
✓ Task 1.3: Add resource limits to mlc-create-patched
- Add
--cpus,--memory, etc. flag parsing - Pass to
docker create - Test:
bash mlc-create-patched test pytorch --cpus=8 --memory=16g
Deliverable: Working mlc-create-patched that accepts custom images + resource limits
Phase 2: Integration (2-3 hours)
✓ Task 2.1: Update image-create to use AIME catalog
- Modify
get_base_image()function - Add AIME catalog lookup
- Test:
image-create test-img(should use AIME base)
✓ Task 2.2: Simplify mlc-create-wrapper
- Rewrite to call
mlc-create-patcheddirectly - Integrate GPU allocator
- Test:
bash mlc-create-wrapper test-container pytorch
✓ Task 2.3: Update container-create
- Pass custom image to wrapper
- Test full workflow:
container-create test
Deliverable: End-to-end workflow works (image-create → container-create)
Phase 3: Polish (1-2 hours)
✓ Task 3.1: Standardize labels
- Update all scripts to use
aime.mlc.* - Remove
ds01.*labels - Test:
docker inspectshows correct labels
✓ Task 3.2: Update documentation
- README.md
- Command help text
✓ Task 3.3: Test matrix
- AIME catalog workflow (pytorch, tensorflow)
- Custom image workflow
- Resource limits applied
- GPU allocation works
- All DS01 commands work
Deliverable: Production-ready integrated system
Phase 4: Validation (1 hour)
✓ Task 4.1: User acceptance testing
- Run through
user-setupwizard - Create project with custom image
- Verify resource limits
- Check GPU allocation
✓ Task 4.2: Backward compatibility
- Test that old containers still work
- Verify
mlc-openworks on old + new containers - Check monitoring scripts work
Deliverable: Validated, production-ready system
5. Testing Matrix
Test Case 1: AIME Framework (Catalog Only)
# Should work like original AIME
mlc-create-patched pytorch-test Pytorch 2.5.1
# Verify:
docker inspect pytorch-test._.$(id -u) | grep -i "aime.mlc"
docker inspect pytorch-test._.$(id -u) | grep -i "aimehub/pytorch"
Expected: Container created from AIME base image
Test Case 2: Custom Image (DS01 Workflow)
# Create custom image
image-create my-cv-project -f pytorch -t cv
# Verify base image
docker history my-cv-project-john-image | grep aimehub
# Create container from custom image
container-create my-cv-project
# Verify:
docker inspect my-cv-project._.$(id -u) | grep "aime.mlc.CUSTOM_IMAGE"
Expected: Container created from custom image, still uses AIME labels
Test Case 3: Resource Limits
# Create container
container-create test-limits
# Verify:
docker inspect test-limits._.$(id -u) --format '{{.HostConfig.NanoCpus}}'
# Should show: 16000000000 (16 CPUs)
docker inspect test-limits._.$(id -u) --format '{{.HostConfig.Memory}}'
# Should show: 34359738368 (32 GB)
docker inspect test-limits._.$(id -u) --format '{{.HostConfig.ShmSize}}'
# Should show: 17179869184 (16 GB) - set at creation!
docker inspect test-limits._.$(id -u) --format '{{.HostConfig.CgroupParent}}'
# Should show: ds01-student.slice
Expected: All limits applied correctly at creation time
Test Case 4: GPU Allocation
# Create container
container-create gpu-test
# Check allocation state
python3 scripts/docker/gpu_allocator.py status
# Should show: gpu-test._.1001 → GPU 0:1
# Verify in container
docker exec gpu-test._.$(id -u) nvidia-smi
# Should show only allocated GPU
# Release
docker stop gpu-test._.$(id -u)
python3 scripts/docker/gpu_allocator.py release gpu-test._.$(id -u)
Expected: GPU allocated, tracked, and released properly
Test Case 5: mlc-open (Compatibility)
# Create via new system
container-create compat-test
# Open via AIME command (unchanged)
mlc-open compat-test
# Verify:
# - Container starts
# - Shell opens
# - User is correct UID/GID
# - Workspace mounted
Expected: AIME's mlc-open works with DS01-created containers
Test Case 6: Full Workflow (End-to-End)
# User onboarding workflow
user-setup
# Follow prompts:
# - Create project: my-thesis
# - Framework: PyTorch
# - Use case: Computer Vision
# - Additional: wandb
# Verify:
# 1. Image created from AIME base
docker images | grep my-thesis-john-image
# 2. Container created with limits
docker inspect my-thesis._.$(id -u)
# 3. GPU allocated
python3 scripts/docker/gpu_allocator.py user-status john
# 4. Can enter container
container-run my-thesis
# 5. Packages installed
pip list | grep -E "torch|timm|wandb"
Expected: Entire workflow works seamlessly
6. Risk Assessment
High Risk (must test thoroughly)
-
mlc-create-patched stability
- Risk: Breaks existing AIME workflows
- Mitigation: Keep original logic intact, add only optional features
- Test: Both AIME catalog and custom image paths
-
Resource limits at creation
- Risk:
shm-sizenot applied correctly - Mitigation: Pass ALL limits to docker create, not docker update
- Test: Verify shm-size in containers
- Risk:
-
GPU allocator integration
- Risk: GPU not allocated or tracked
- Mitigation: Extensive logging, state file validation
- Test: Full allocation/release cycle
Medium Risk
-
Label migration
- Risk: Scripts break when looking for old labels
- Mitigation: Comprehensive grep for all label references
- Test: All container-list, container-stats, etc.
-
Image catalog lookup
- Risk: Framework name mismatch (pytorch vs Pytorch)
- Mitigation: Case-insensitive lookup, clear error messages
- Test: Various framework name formats
Low Risk
-
Documentation updates
- Risk: Docs out of sync
- Mitigation: Update docs incrementally with code changes
-
Backward compatibility
- Risk: Old containers don't work
- Mitigation: AIME labels preserved, same naming
- Test: Create old-style container, verify mlc-open works
7. Rollback Plan
If integration fails:
-
Restore scripts
git checkout main -- scripts/docker/mlc-create-wrapper.shgit checkout main -- scripts/user/image-creategit checkout main -- scripts/user/container-create -
Remove mlc-create-patched
rm scripts/docker/mlc-create-patched -
Restore symlinks
# If any symlinks were changedsudo rm /usr/local/bin/mlc-createsudo ln -s /opt/aime-ml-containers/mlc-create /usr/local/bin/ -
Existing containers continue to work
- They use docker directly, not dependent on scripts
9. Success Criteria
✓ Integration successful if:
-
AIME base images used
- All new containers use
aimehub/*images - 76 framework versions available
- All new containers use
-
Custom images work
- Users can build on AIME base
- Dockerfile workflow preserved
-
Resource limits enforced
- All limits from YAML applied
- shm-size set at creation (not after)
-
GPU allocation integrated
gpu_allocator.pycalled before container creation- State tracking works
-
Backward compatible
mlc-openworks on all containers- Old and new containers coexist
-
Labels standardized
- All use
aime.mlc.*namespace - No
ds01.*labels remain
- All use
-
Documentation updated
- README reflects new workflow
-
All tests pass
- Test matrix completed
- User acceptance successful
QUESTIONS FROM HENRY
- does mlc-create-patched pass to mlc.py?
- how does the custom image vs AIME images workflow differ
- can i use image inheritance to make it cleaner? so the parent AIME images are handled according to AIME workflow, then ds01 takes over and inherits the image to add custom packages, resource allocation? or is that already happening?
- Build image chains: AIME -> ds01