Industry Parallels

How DS01 prepares you for AWS, Kubernetes, and production ML.

Part of Educational Computing Context - Career-relevant knowledge beyond DS01 basics.

DS01 is designed around industry-standard practices used in production ML systems, cloud platforms, and enterprise software. This guide shows how your DS01 skills transfer to professional work.

The Big Picture

DS01 skills transfer directly to:

Cloud platforms (AWS, Google Cloud, Azure)
Container orchestration (Kubernetes, Docker Swarm)
ML platforms (SageMaker, Vertex AI, Databricks)
Production ML systems (MLOps, model serving)
Software engineering (microservices, CI/CD)

Core Industry Concepts in DS01

1. Containerisation (Docker)

Industry usage:

Development: Dev containers, consistent environments
CI/CD: Build and test in containers
Production: Deploy applications in containers
ML: Training jobs, model serving APIs

DS01 parallel:

# DS01
container-deploy my-project

# AWS Elastic Container Service
aws ecs run-task --task-definition my-task

# Kubernetes
kubectl run my-pod --image=my-image

Skills you're learning:

Docker images and containers
Dockerfiles (reproducible environments)
Container lifecycle management
Volume mounts (persistent storage)
Resource limits (CPU, memory, GPU)

Where you'll use this:

Data Scientist: Deploy models as containerised APIs
ML Engineer: Build training pipelines in containers
Software Engineer: Microservices architecture
DevOps: Container orchestration

2. Ephemeral Compute

Industry principle: Compute is temporary and cheap; storage is permanent.

DS01:

container-deploy my-project  # Spin up compute
# Do work
container-retire my-project  # Terminate compute
# Workspace persists

AWS EC2:

aws ec2 run-instances        # Launch instance
# Do work
aws ec2 terminate-instances  # Stop paying for compute
# EBS volumes persist only if DeleteOnTermination is false (root volumes are deleted by default)

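Kubernetes (abridged; metadata and containers omitted):

apiVersion: batch/v1
kind: Job                            # Ephemeral pod
spec:
  template:
    spec:
      restartPolicy: Never           # Run once, terminate
      volumes:
        - name: workspace            # Hypothetical volume name
          persistentVolumeClaim:     # Data persists
            claimName: my-workspace  # Hypothetical claim name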
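
Ephemeral compute only works if the job can pick up where a previous container left off. The sketch below shows one common way to do that: write progress to the persistent workspace after every step and read it back on start. The file name `checkpoint.json` and the `stop_after` parameter (which simulates a container being retired mid-run) are assumptions made for illustration, not part of DS01.

```python
import json
from pathlib import Path
from typing import Optional


def train(workspace: Path, total_steps: int, stop_after: Optional[int] = None) -> int:
    """Run up to total_steps, resuming from a checkpoint in workspace if present.

    stop_after simulates the container being retired part-way through.
    Returns the last completed step.
    """
    checkpoint = workspace / "checkpoint.json"  # hypothetical file name
    # Resume from persisted state if an earlier container left some behind.
    step = json.loads(checkpoint.read_text())["step"] if checkpoint.exists() else 0

    while step < total_steps:
        if stop_after is not None and step >= stop_after:
            break  # container retired: in-memory state is lost at this point
        step += 1  # stand-in for one real training step
        # Persist progress to storage that outlives the container.
        checkpoint.write_text(json.dumps({"step": step}))
    return step
```

Calling `train(ws, 10, stop_after=4)` and then `train(ws, 10)` against the same directory finishes the remaining six steps rather than starting again from zero.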

3. Infrastructure as Code

Industry principle: Define infrastructure in version-controlled files

DS01 Dockerfile:

FROM henrycgbaker/aime-pytorch:2.8.0-cuda12.4-ubuntu22.04
WORKDIR /workspace
RUN pip install transformers datasets accelerate
EXPOSE 8888

An image built from this Dockerfile is what you'd reference in:

AWS ECS task definitions
Kubernetes pod specs
Google Cloud Run
Azure Container Instances

Benefits:

Reproducible: Same Dockerfile = same environment
Version controlled: Track changes in Git
Collaborative: Share with team
Automated: CI/CD can rebuild automatically
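
One way to make "same Dockerfile = same environment" checkable is to derive the image tag from the Dockerfile's contents, so an unchanged file always maps to the same tag. The following is a minimal sketch of that idea; the `image_tag` helper and the 12-character tag length are assumptions for illustration, not a DS01 feature.

```python
import hashlib


def image_tag(dockerfile_text: str, length: int = 12) -> str:
    """Derive a short, deterministic tag from Dockerfile contents."""
    # Normalise line endings and trailing whitespace so cosmetic
    # differences between editors don't change the tag.
    normalised = "\n".join(line.rstrip() for line in dockerfile_text.splitlines())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()[:length]


# The Dockerfile from the section above.
DOCKERFILE = """\
FROM henrycgbaker/aime-pytorch:2.8.0-cuda12.4-ubuntu22.04
WORKDIR /workspace
RUN pip install transformers datasets accelerate
EXPOSE 8888
"""

if __name__ == "__main__":
    print(f"my-project:{image_tag(DOCKERFILE)}")
```

A matching tag only proves the Dockerfile is identical. Because the `pip install` line does not pin versions, two builds of the same file on different days can still resolve to different package versions; pinning them (for example in requirements.txt) closes that gap.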

4. Resource Management & Fair Scheduling

DS01 quotas:

# config/resource-limits.yaml
defaults:
  max_gpus: 2
  max_cpus: 16
  memory: 64GB
  priority: 50

Kubernetes equivalent:

# ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: user-quota
spec:
  hard:
    requests.nvidia.com/gpu: "2"
    requests.cpu: "16"
    requests.memory: "64Gi"

AWS equivalent:

Service quotas (max instances, max vCPUs)
Budget alerts
Cost allocation tags
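
Whatever the platform, quota enforcement comes down to comparing a request against configured limits before any compute is allocated. The sketch below illustrates that check using the DS01 default values shown above; the `Limits` class and `violations` function are hypothetical names and do not reflect how DS01 or Kubernetes implement it internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Limits:
    max_gpus: int
    max_cpus: int
    memory_gb: int


# Values taken from the DS01 defaults shown above.
DEFAULTS = Limits(max_gpus=2, max_cpus=16, memory_gb=64)


def violations(gpus: int, cpus: int, memory_gb: int,
               limits: Limits = DEFAULTS) -> List[str]:
    """Return human-readable quota violations (empty if the request fits)."""
    problems = []
    if gpus > limits.max_gpus:
        problems.append(f"gpus: requested {gpus}, limit {limits.max_gpus}")
    if cpus > limits.max_cpus:
        problems.append(f"cpus: requested {cpus}, limit {limits.max_cpus}")
    if memory_gb > limits.memory_gb:
        problems.append(f"memory: requested {memory_gb}GB, limit {limits.memory_gb}GB")
    return problems
```

A request for 1 GPU, 8 CPUs and 32 GB passes with an empty list, while a request for 4 GPUs is rejected with a message naming the limit that was exceeded.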

5. Separation of Concerns

DS01 architecture:

├── Compute (Container)     ← Ephemeral, scalable
├── Storage (Workspace)     ← Persistent, backed up
├── Images (Environment)    ← Versioned, reproducible
└── Orchestration (DS01)    ← Scheduling, limits

Cloud architecture:

├── Compute (EC2, Lambda)   ← Ephemeral, scalable
├── Storage (S3, EBS)       ← Persistent, backed up
├── Images (ECR, Artifact Registry) ← Versioned
└── Orchestration (ECS, K8s) ← Scheduling, auto-scaling

Benefits:

Scale independently: Add storage without changing compute
Cost optimise: Different tiers for compute vs storage
Resilience: Compute fails? Recreate. Data always safe.

Real-World Workflow Parallels

ML Model Training

DS01 workflow:

# 1. Build environment
image-create

# 2. Launch training container
container-deploy training --background

# 3. Monitor progress
container-stats
docker logs training._.$(whoami)

# 4. Training complete, free resources
container-retire training

# 5. Model saved to workspace
ls ~/workspace/training/models/

AWS SageMaker workflow:

# 1. Define environment
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    image_uri="pytorch-training-image",
    ...
)

# 2. Launch training job
estimator.fit(inputs)  # inputs: S3 location(s) of the training data
# Job runs on ephemeral instances

# 3. Monitor
estimator.logs()

# 4. Training complete, instances terminated
# (automatic)

# 5. Model saved to S3
model_artifacts = estimator.model_data

Same conceptual workflow!

Model Serving (API Deployment)

DS01 approach (simplified):

# Build image with model serving code
docker build -t model-api .

# Run API container
container-deploy model-api --background

# Inside container
python api.py  # Flask/FastAPI app
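
The `api.py` mentioned above is not shown in this guide. As a rough idea of its shape, here is a minimal sketch that uses only Python's standard library in place of Flask or FastAPI, so it runs without extra packages. The `/predict` route, port 8000, and the `predict` function (which just sums its inputs) are all placeholders for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def predict(features):
    """Hypothetical stand-in for a real model: returns the sum of the inputs."""
    return sum(features)


def handle_predict(body: bytes) -> Tuple[int, bytes]:
    """Turn a JSON request body into an (HTTP status, JSON response body) pair."""
    try:
        features = json.loads(body)["features"]
        result = predict(features)
    except (ValueError, KeyError, TypeError):
        error = {"error": 'expected JSON like {"features": [1, 2]}'}
        return 400, json.dumps(error).encode()
    return 200, json.dumps({"prediction": result}).encode()


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    # Bind to all interfaces so the port is reachable from outside the container.
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Keeping the request logic in `handle_predict`, separate from the HTTP handler, means it can be unit-tested without starting a server. A real service would call `serve()` at the bottom of the file and load a trained model instead of summing inputs.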

Production (Kubernetes):

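apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-api
spec:
  replicas: 3  # Fixed at three pods; add a HorizontalPodAutoscaler to auto-scale
  selector:
    matchLabels:
      app: model-api  # Must match the pod template labels below
  template:
    metadata:
      labels:
        app: model-api
    spec:
      containers:
      - name: api
        image: model-api:latest
        resources:
          limits:
            nvidia.com/gpu: 1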

CI/CD Pipeline

DS01 testing:

# Build test environment (tag the image so it can be run by name)
docker build -f test.Dockerfile -t test-image .

# Run tests in a throwaway container
docker run --rm test-image pytest

GitHub Actions (Production):

name: Test
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    container:
      image: pytorch:2.8.0  # Placeholder; use a full image reference from your registry
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest

Same principle: Isolated test environment, reproducible results

Industry Tools & Platforms

Cloud Platforms

Amazon Web Services (AWS):

ECS/EKS: Container orchestration (like DS01 + Kubernetes)
SageMaker: ML platform (training, serving, monitoring)
EC2 + GPUs: Elastic compute (on-demand, spot instances)
S3: Object storage (like persistent workspace)

Google Cloud Platform (GCP):

GKE: Kubernetes engine
Vertex AI: ML platform
Compute Engine + GPUs: VM instances
Cloud Storage: Object storage

Microsoft Azure:

AKS: Azure Kubernetes Service
Azure ML: ML platform
Azure VMs + GPUs: Compute
Blob Storage: Object storage

DS01 skills apply to all of them.

Container Orchestration

Kubernetes:

The most widely used container orchestrator
Runs in production at Google, Amazon, Microsoft, etc.
Concepts map closely to DS01:
- Pods ≈ Containers (a pod wraps one or more containers)
- PersistentVolumes = Workspace
- ResourceQuotas = DS01 limits
- Jobs = Ephemeral compute

Docker Swarm:

Simpler orchestrator
Good for smaller deployments
Uses Docker Compose files to define multi-container apps

Learning DS01 = Foundation for Kubernetes

ML Platforms

AWS SageMaker:

Managed ML platform
Notebook instances (like Jupyter containers)
Training jobs (ephemeral compute)
Model registry (like Docker images)
Endpoints (model serving)

Google Vertex AI:

Similar to SageMaker
Training jobs in containers
Model deployment
Feature store

Databricks:

Data + ML platform
Notebooks in containers
Spark clusters (ephemeral)
MLflow integration

Azure ML:

Microsoft's ML platform
Compute instances/clusters
Training jobs
Model deployment

All use containerisation and ephemeral compute.

MLOps Tools

Experiment Tracking:

Weights & Biases: Log metrics from containers
MLflow: Track experiments, models, parameters
TensorBoard: Visualise training

Model Serving:

TorchServe: PyTorch model serving
TensorFlow Serving: TensorFlow models
Seldon Core: Multi-framework serving on K8s
BentoML: Package models as APIs

Workflow Orchestration:

Airflow: Data pipeline orchestration
Kubeflow: ML workflows on Kubernetes
Prefect: Modern workflow engine

All deploy workloads in containers.

Best Practices from Industry

1. Immutable Infrastructure

Principle: Never modify running systems. Replace with new version.

DS01:

# Don't: Modify running container
container-run my-project
pip install new-package  # Bad: Not reproducible

# Do: Update image, recreate container
image-update                  # Add package via interactive GUI
container-retire my-project
container-deploy my-project

Production equivalent:

Blue/green deployments: Run two identical environments ("blue" = current, "green" = new). Deploy changes to green, test it, then switch traffic. If something breaks, instantly switch back to blue. You never modify blue while it's serving users.
Rolling updates: Instead of updating all servers at once (risky), update them one-by-one. If server #3 fails after update, stop the rollout - servers #4-10 still run the old version. Users experience zero downtime.
No SSH into production: Never log into a running server to "fix" something. That fix isn't reproducible, isn't tracked, and will vanish when the server restarts. Instead, fix it in code, build a new image, and deploy that image.
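
The rolling-update behaviour described above can be sketched as a short loop: update one server, check its health, and halt at the first failure. This is a toy simulation under stated assumptions, not how any particular orchestrator implements it. The `healthy` callback is hypothetical, and rolling the failed server back to its previous version is a choice made for this sketch.

```python
from typing import Callable, List


def rolling_update(servers: List[str], new_version: str,
                   healthy: Callable[[int], bool]) -> List[str]:
    """Update servers one at a time; halt at the first failed health check.

    servers holds the version each server currently runs.
    healthy(i) reports whether server i passes its check after updating.
    Returns the resulting versions; servers after a failure are untouched.
    """
    result = list(servers)
    for i in range(len(result)):
        previous = result[i]
        result[i] = new_version
        if not healthy(i):
            result[i] = previous  # roll the failed server back
            break                 # stop the rollout; later servers keep the old version
    return result
```

With ten servers on `"v1"` and a health check that fails for the third one, the first two end up on the new version and the other eight stay on `"v1"`.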

2. Everything in Version Control

DS01:

~/workspace/my-project/
├── .git/               # Code in Git
├── Dockerfile          # Environment in Git
├── requirements.txt    # Dependencies in Git
├── src/                # Source code in Git
└── README.md           # Documentation in Git

Production equivalent:

Application code in Git: Every line of code is tracked. You can see who changed what, when, and why. You can revert bad changes. Multiple people can work on the same codebase without overwriting each other.
Infrastructure as Code (Terraform, CloudFormation): Your servers, databases, and networks are defined in code files, not clicked together in a web console. Need 10 identical servers? Change count = 10 and run terraform apply. The entire infrastructure is reproducible and auditable.
CI/CD configs: Your build and deployment process is also in Git. A .github/workflows/deploy.yml file defines exactly how code goes from commit to production. No manual steps, no tribal knowledge, no "ask Dave how to deploy".
Documentation: READMEs, architecture diagrams, and runbooks live alongside the code. When the code changes, the docs change in the same commit. Documentation that lives in a wiki gets stale; documentation in the repo stays current.

3. Observability

DS01:

# Logs
docker logs container-name

# Metrics
container-stats

# GPU monitoring
nvidia-smi

Production:

Logging (ELK stack, CloudWatch): Every application writes structured logs (JSON, not plain text). These flow to a central system where you can search across thousands of containers: "show me all errors from the payment service in the last hour". ELK = Elasticsearch (storage) + Logstash (processing) + Kibana (visualisation).
Metrics (Prometheus, Datadog): Numeric measurements collected every few seconds: request latency, error rates, CPU usage, queue depth. Displayed on dashboards so you can see trends. "Response time increased 50% after yesterday's deploy" - you'd never spot this in logs.
Tracing (Jaeger, X-Ray): Follow a single request as it travels through multiple services. User clicks "buy" → API gateway → auth service → payment service → inventory service → notification service. Tracing shows you exactly where the 2-second delay is happening.
Alerting (PagerDuty, Opsgenie): Automated systems that wake you up at 3am when something breaks. "Error rate > 5% for 5 minutes" → page the on-call engineer. Connects to your phone, escalates if you don't respond, tracks incident resolution.
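
To make the "structured logs (JSON, not plain text)" point concrete, here is a minimal sketch using Python's standard `logging` module. The `JsonFormatter` class, the `get_logger` helper, and the choice of fields are assumptions for illustration; production setups typically add request IDs, service names, and similar context.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def get_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on repeat calls
        handler = logging.StreamHandler()  # writes to stderr by default
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Because each line is a self-contained JSON object, a log collector can filter on fields such as `level` or `logger` instead of pattern-matching free text. Inside a container, these lines appear in `docker logs`, which captures both stdout and stderr.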

4. Least Privilege

DS01:

User namespace isolation (not root on host)
Resource limits (can't monopolise)
GPU pinning (can't access others' GPUs)

Production:

IAM roles (minimal permissions): Your ML training job needs to read from S3 and write to a model registry - nothing else. It can't delete databases, can't access other teams' data, can't spin up Bitcoin miners. Every service gets exactly the permissions it needs and no more.
Network policies (limit connectivity): Your frontend can talk to the API. The API can talk to the database. But the frontend cannot talk directly to the database. Even if an attacker compromises one service, they can't reach everything else.
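Pod security standards: Containers can't run as root, can't mount the host filesystem, can't use privileged mode. (Kubernetes enforces these through Pod Security Admission, which replaced PodSecurityPolicy when it was removed in v1.25.) A compromised container is contained - it is much harder for it to escape to the host machine or other containers.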
Secrets management: Database passwords, API keys, and certificates are never hard-coded in source or committed config files. They're stored in a vault (HashiCorp Vault, AWS Secrets Manager) and injected at runtime. Secrets can be rotated automatically. If someone leaks a config file, they don't get your credentials.
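
As an illustration of "exactly the permissions it needs and no more", the sketch below builds an AWS IAM policy document that grants read-only access to a single S3 bucket. The helper name `read_only_bucket_policy` and the bucket name in the usage note are hypothetical; the policy structure and the `s3:ListBucket` / `s3:GetObject` actions are standard IAM.

```python
import json


def read_only_bucket_policy(bucket: str) -> dict:
    """Build an IAM policy document allowing read-only access to one S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


def to_json(policy: dict) -> str:
    """Serialise the policy for attaching to an IAM role."""
    return json.dumps(policy, indent=2)
```

`read_only_bucket_policy("training-data")` yields a policy with no write or delete actions and no wildcard actions, so a training job running under it cannot modify the data or reach any other bucket.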

5. Cost Optimisation

DS01 equivalent:

# Free resources when done
container-retire my-project

# Use appropriate resources
# (Don't request max if you need less)

Production:

Spot instances (up to 90% savings): Cloud providers sell spare capacity at large discounts (often 60-90% off). The catch: the provider can reclaim your instance at short notice (2 minutes on AWS). This suits training jobs - if interrupted, just restart from a checkpoint. Many ML teams use spot for training and on-demand only for serving.
Auto-scaling (pay for what you use): At 3am, your API gets 10 requests/minute - you need 1 server. At 3pm, it gets 10,000 requests/minute - you need 50 servers. Auto-scaling adds/removes servers based on load. You pay for 1 server at night, 50 during peaks, not 50 all the time.
Reserved instances (commitment discounts): If you know you'll need a GPU server for a year, pay upfront for 30-50% off. Like buying a gym membership vs paying per visit. Good for baseline capacity (the minimum you always need), bad for spiky workloads.
S3 lifecycle policies (move to cheaper tiers): New data stays in "standard" storage (about $0.023/GB/month). Objects older than 30 days move to "infrequent access" (about $0.0125/GB/month). After 90 days, they move to "glacier" (about $0.004/GB/month). Old training runs don't need instant access - automatic tiering can cut storage costs by 80%+.
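
The savings quoted above are easy to check with a little arithmetic. The sketch below uses the per-GB prices from the storage-tier item and the 1-versus-50 server example from the auto-scaling item. The 8-hour peak window is an assumption added for illustration, and real prices vary by region and over time.

```python
# Per-GB monthly prices as quoted in the text above (approximate).
PRICE_PER_GB_MONTH = {
    "standard": 0.023,
    "infrequent_access": 0.0125,
    "glacier": 0.004,
}


def monthly_cost(gb: float, tier: str) -> float:
    """Monthly storage cost in dollars for gb gigabytes in the given tier."""
    return gb * PRICE_PER_GB_MONTH[tier]


def saving_vs_standard(tier: str) -> float:
    """Fractional saving of a tier relative to standard storage."""
    return 1 - PRICE_PER_GB_MONTH[tier] / PRICE_PER_GB_MONTH["standard"]


def daily_server_hours(peak_hours: int, peak_servers: int, base_servers: int) -> int:
    """Server-hours per day when scaling between a base and a peak fleet size."""
    return peak_hours * peak_servers + (24 - peak_hours) * base_servers
```

For 1,000 GB, standard storage costs about $23 a month against about $4 in the glacier tier, a saving of roughly 83%. With an assumed 8-hour peak, auto-scaling between 1 and 50 servers uses 416 server-hours a day instead of the 1,200 needed to run 50 servers around the clock.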

Bridging to Production

From DS01 to AWS

Training job:

# DS01
container-deploy training --background

# AWS SageMaker
import sagemaker
estimator = sagemaker.estimator.Estimator(...)
estimator.fit()

Model serving:

# DS01 (simple API)
container-deploy api
# Inside: Flask app

# AWS
# Deploy to SageMaker Endpoint or ECS

Skills transfer:

Dockerfile → SageMaker image
Workspace → S3 bucket
container-deploy → boto3 API calls

From DS01 to Kubernetes

Single container:

# DS01
container-deploy my-app

# Kubernetes
kubectl run my-app --image=my-image

With resources:

# DS01 (configured in YAML)
# max_gpus: 1, memory: 64GB

# Kubernetes
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app
    image: my-image
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: "64Gi"

Concepts map directly.
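
To show how mechanical the mapping is, the sketch below builds the Kubernetes Pod manifest above from DS01-style limits. The `pod_manifest` function is hypothetical. It also assumes DS01's "64GB" is meant as 64 GiB, which Kubernetes writes as `64Gi`; strictly, 64 GB (decimal) is about 7% smaller than 64 GiB.

```python
import json


def pod_manifest(name: str, image: str, gpus: int, memory_gib: int) -> dict:
    """Build a minimal Kubernetes Pod manifest with GPU and memory limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {
                        "limits": {
                            "nvidia.com/gpu": gpus,
                            "memory": f"{memory_gib}Gi",
                        }
                    },
                }
            ]
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML manifests.
    print(json.dumps(pod_manifest("my-app", "my-image", gpus=1, memory_gib=64), indent=2))
```

Writing the output to a file and passing it to `kubectl apply -f` would create the same Pod as the YAML shown above.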

Learning Path to Production

Phase 1: DS01 (You are here)

Learn containerisation
Practice resource management
Build reproducible environments
Understand ephemeral compute

Phase 2: Cloud Fundamentals

AWS/GCP free tier accounts
Deploy simple container to ECS/Cloud Run
Use S3/GCS for storage
Try SageMaker/Vertex AI free tier

Phase 3: Kubernetes

Local K8s (minikube, kind)
Deploy pods, services
Learn resource management
Try managed K8s (EKS, GKE)

Phase 4: MLOps

Experiment tracking (W&B, MLflow)
Model serving (TorchServe, Seldon)
CI/CD (GitHub Actions)
Monitoring (Prometheus, Grafana)

Phase 5: Production Scale

Multi-region deployments
Auto-scaling
Cost optimisation
Security hardening

DS01 is your foundation for all of this.

Industry Jargon Decoder

Terms you'll hear in jobs:

Industry Term	DS01 Equivalent
Container orchestration	DS01 management layer
Pod (K8s)	Container
Persistent Volume	Workspace
Job (K8s)	Ephemeral container
Image registry (ECR, Artifact Registry)	Docker images
Resource quota	Resource limits YAML
Node (K8s)	DS01 server
Deployment	Container deploy/retire
Service mesh	(Advanced, not in DS01)
Ingress	(Network routing, not in DS01)

Next Steps

→ Ephemeral Containers - Understand the philosophy
→ Daily Usage Patterns - Put skills into practice
