Complete Dockerfile Guide

Advanced Docker image building for DS01. This guide covers everything from basic Dockerfiles to optimization strategies.

Basic Dockerfile Structure

FROM henrycgbaker/aime-pytorch:2.8.0-cuda12.4-ubuntu22.04

WORKDIR /workspace

# Install system packages
RUN apt-get update && apt-get install -y \
    package1 \
    package2 \
    && rm -rf /var/lib/apt/lists/*

# Install Python packages
RUN pip install --no-cache-dir \
    transformers \
    datasets \
    wandb

# Optional: Copy configuration
COPY config/ /workspace/config/

CMD ["/bin/bash"]

Layer Optimization

Combine Commands

Each RUN instruction creates a layer. Combining commands into single RUN statements reduces layer count and final image size.

# Bad: Multiple layers
RUN pip install numpy
RUN pip install pandas
RUN pip install scikit-learn

# Good: Single layer
RUN pip install --no-cache-dir \
    numpy \
    pandas \
    scikit-learn

Order by Change Frequency

Place rarely-changing commands early (they cache), frequently-changing commands late:

# Rarely changes (cache hit)
FROM henrycgbaker/aime-pytorch:2.8.0-cuda12.4-ubuntu22.04
RUN apt-get update && apt-get install -y system-deps \
    && rm -rf /var/lib/apt/lists/*

# Sometimes changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Frequently changes (put last)
COPY src/ /workspace/src/

Reducing Image Size

Clean Up in Same Layer

Package manager caches only disappear if cleanup is in the same RUN layer:

# Bad: Cleanup in separate layer (doesn't reduce size)
RUN apt-get update && apt-get install -y build-essential
RUN rm -rf /var/lib/apt/lists/*

# Good: Clean in same layer
RUN apt-get update && apt-get install -y build-essential \
    && rm -rf /var/lib/apt/lists/*

Use --no-cache-dir

Pip caches downloaded wheels. Disabling this saves space in the image (packages are already installed):

RUN pip install --no-cache-dir transformers

Exclude Unnecessary Files with .dockerignore

Create .dockerignore in the same directory as your Dockerfile:

# Large files that don't belong in image
data/
models/
*.csv
*.pt
*.zip

# Development artifacts
__pycache__/
.git/
.pytest_cache/
*.pyc

Multi-Stage Builds

For packages requiring compilation, use multi-stage builds to keep final image small:

# Build stage: Compile packages
FROM python:3.10 AS builder
RUN pip wheel --no-deps --wheel-dir /wheels some-compiled-package

# Final stage: Use pre-built wheels
FROM henrycgbaker/aime-pytorch:2.8.0-cuda12.4-ubuntu22.04
COPY --from=builder /wheels /wheels
RUN pip install --no-cache /wheels/*

Benefit: Final image only contains compiled wheels, not build tools.

Caching & Performance

Requirements File Strategy

# Copy requirements first (caches if unchanged)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Then copy code (changes often, doesn't invalidate pip cache)
COPY src/ /workspace/src/

This way, if only your code changes, Docker reuses the pip-install layer from cache.

Pin Versions for Reproducibility

# requirements.txt
torch==2.0.1
transformers==4.30.0
datasets==2.14.0

Pinned versions ensure:

Same image every time you rebuild
Other team members get identical environment
Production deployments are reproducible

Common Installation Patterns

Installing PyPI Packages

RUN pip install --no-cache-dir \
    package1==1.0.0 \
    package2>=2.0.0 \
    package3

Installing from Git

RUN pip install --no-cache-dir \
    git+https://github.com/user/repo.git@v1.0.0

System Dependencies

RUN apt-get update && apt-get install -y \
    libffi-dev \
    libssl-dev \
    && rm -rf /var/lib/apt/lists/*

Environment Variables

# Prevent Python buffering (see logs immediately)
ENV PYTHONUNBUFFERED=1

# Disable parallel tokenization (can cause issues with large files)
ENV TOKENIZERS_PARALLELISM=false

# Specify GPU (optional, can override at runtime)
ENV CUDA_VISIBLE_DEVICES=0

Debugging Builds

Build with Progress Output

docker build --progress=plain -f my-project.Dockerfile .

Shows all build steps and output in real-time (useful for debugging).

Interactive Debugging

# Build up to failing layer
docker build -t debug-image .

# Enter container and debug interactively
docker run -it debug-image /bin/bash

# Try commands until you find fix
pip install fixed-package

Once you find the fix, add it to your Dockerfile.

Check Layer Sizes

docker history ds01-user/my-project:latest

Shows size of each layer, helps identify which layers are bloated.

DS01 Base Images

Available Bases

henrycgbaker/aime-pytorch:2.8.0-cuda12.4-ubuntu22.04
henrycgbaker/aime-tensorflow:2.16-cuda12.4-ubuntu22.04
henrycgbaker/aime-jax:0.4.23-cuda12.4-ubuntu22.04

What's Included

CUDA drivers & toolkit: GPU drivers pre-installed and version-matched. You don't need to figure out CUDA compatibility - it's already tested and working.
cuDNN: NVIDIA's deep learning library for GPU operations. Required by PyTorch/TensorFlow for GPU training. Version-matched to CUDA.
ML framework: PyTorch, TensorFlow, or JAX pre-compiled with GPU support. torch.cuda.is_available() returns True immediately.
Jupyter Lab: Pre-installed and configured. Run jupyter lab --ip=0.0.0.0 for notebook-based development with GPU access.
Common tools: Git, wget, curl, vim, htop, and utilities. Plus pip and conda ready to use.

Example: ML Project

Training models with transformers, datasets, wandb tracking:

FROM henrycgbaker/aime-pytorch:2.8.0-cuda12.4-ubuntu22.04

WORKDIR /workspace

# System dependencies
RUN apt-get update && apt-get install -y \
    git-lfs \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies (cached layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Environment
ENV PYTHONUNBUFFERED=1
ENV TOKENIZERS_PARALLELISM=false

Example: Data Science

Analysis and visualization with pandas, scikit-learn, plotting libraries:

FROM henrycgbaker/aime-pytorch:2.8.0-cuda12.4-ubuntu22.04

WORKDIR /workspace

# Visualization system dependencies
RUN apt-get update && apt-get install -y \
    libcairo2-dev \
    && rm -rf /var/lib/apt/lists/*

# Data science stack
RUN pip install --no-cache-dir \
    pandas \
    numpy \
    scikit-learn \
    matplotlib \
    seaborn \
    plotly \
    jupyter

Building & Testing

Build Your Image

# Basic build
docker build -t ds01-$(whoami)/my-project:latest -f ~/dockerfiles/my-project.Dockerfile .

# Rebuild from scratch (ignore cache)
docker build --no-cache -t ds01-$(whoami)/my-project:latest -f ~/dockerfiles/my-project.Dockerfile .

Test the Image

# Run container to test
docker run -it ds01-$(whoami)/my-project:latest

# Inside container, verify packages installed
python -c "import torch; print(torch.cuda.is_available())"
pip list | grep transformers

Basic Dockerfile Structure​

Layer Optimization​

Combine Commands​

Order by Change Frequency​

Reducing Image Size​

Clean Up in Same Layer​

Use --no-cache-dir​

Exclude Unnecessary Files with .dockerignore​

Multi-Stage Builds​

Caching & Performance​

Requirements File Strategy​

Pin Versions for Reproducibility​

Common Installation Patterns​

Installing PyPI Packages​

Installing from Git​

System Dependencies​

Environment Variables​

Debugging Builds​

Build with Progress Output​

Interactive Debugging​

Check Layer Sizes​

DS01 Base Images​

Available Bases​

What's Included​

Example: ML Project​

Example: Data Science​

Building & Testing​

Build Your Image​

Test the Image​

See Also​