
Creating Projects

How to set up new data science projects on DS01.


Quick Start

# Interactive wizard
project init

# Or specify name directly
project init my-new-project

# With framework type hint
project init my-thesis --type=rl

What is a Project?

A project in DS01 includes:

On disk:

  • Workspace directory (~/workspace/my-project/)
  • Git repository
  • requirements.txt (used to build the Dockerfile)
  • Dockerfile (defines environment)
  • README, .gitignore, directory structure

In Docker:

  • Custom image (ds01-username/my-project:latest)
  • Built from your Dockerfile
  • Contains your chosen packages and tools

The workflow:

  • Create project once (project init)
  • Launch containers from it anytime (project launch)

Interactive Wizard

project init # add --guided for Guided Mode

The wizard asks:

1. Project Name

What would you like to name your project?
Example: my-thesis, research-2024, cv-experiments

Rules:

  • Lowercase letters, numbers, hyphens
  • No spaces or special characters
  • Short and memorable
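The character rules above can be checked before running project init. The following is a minimal sketch, assuming the rules amount to "only a-z, 0-9 and hyphens"; the function name is_valid_project_name is hypothetical and not part of DS01, and the wizard's own validation may be stricter.

```shell
# Hypothetical helper, not part of DS01: check a candidate name against the
# documented rules (lowercase letters, numbers, hyphens; nothing else).
is_valid_project_name() {
    case "$1" in
        # Reject the empty string and any character outside the allowed set.
        # Letters are spelled out so the check does not depend on the locale.
        "" | *[!abcdefghijklmnopqrstuvwxyz0123456789-]*) return 1 ;;
        *) return 0 ;;
    esac
}

for candidate in "my-thesis" "research-2024" "My Thesis" "cv_experiments"; do
    if is_valid_project_name "$candidate"; then
        echo "ok:       $candidate"
    else
        echo "rejected: $candidate"
    fi
done
```

"Short and memorable" is a matter of judgment and is not checked here.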

2. Project Type

Select project type:
1) Machine Learning (general)
2) Computer Vision (CV)
3) Natural Language Processing (NLP)
4) Reinforcement Learning (RL)
5) Time Series Analysis
6) Large Language Models (LLM)
7) Custom

What this does: Pre-selects common packages for that domain.

Example: CV includes OpenCV, Pillow, torchvision

3. Framework

Select framework:
1) PyTorch 2.8.0 (recommended for research)
2) TensorFlow 2.16.1
3) JAX 0.4.23

Not sure? Choose PyTorch.

4. Additional Packages

Common data science packages: [y/n]
pandas, numpy, matplotlib, scipy

Jupyter Lab: [y/n]

Project-specific packages: [package names or Enter to skip]

Tip: You can add more packages later with image-update.

5. Build Image Now?

Build Docker image now? [Y/n]:

The initial build takes 5-10 minutes while the cache is populated; subsequent builds are faster.


Direct Mode (Faster)

If you know what you want:

# Create with defaults
project init my-thesis

# Specify type
project init my-thesis --type=cv

# Skip interactive questions
project init my-thesis --type=ml --quick

Defaults:

  • Framework: PyTorch
  • Packages: pandas, numpy, matplotlib, scikit-learn, jupyter
  • Builds image automatically

What Gets Created

After project init my-thesis completes:

On DS01 Host

~/workspace/my-thesis/
├── Dockerfile          # Environment definition
├── requirements.txt    # Python packages (optional)
├── pyproject.toml      # Project metadata
├── README.md           # Project documentation
├── .git/               # Git repository
├── .gitignore          # Ignore data, models, etc.
├── .gitattributes      # Git LFS for large files
├── data/               # Datasets
├── notebooks/          # Jupyter notebooks
├── src/                # Source code
├── tests/              # Unit tests
└── models/             # Saved models
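To confirm that a workspace matches the layout above, a small check like the following may help. It is a sketch only: check_project_layout is a hypothetical helper, not a DS01 command, and it skips requirements.txt because that file is listed as optional.

```shell
# Hypothetical helper, not part of DS01: report which of the documented
# files and directories are missing from a project workspace.
check_project_layout() {
    local project_dir="$1" missing=0 entry
    for entry in Dockerfile pyproject.toml README.md .git .gitignore \
                 .gitattributes data notebooks src tests models; do
        if [ ! -e "$project_dir/$entry" ]; then
            echo "missing: $entry"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every expected entry was found.
    [ "$missing" -eq 0 ]
}

check_project_layout "$HOME/workspace/my-thesis" || echo "layout incomplete"
```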

In Docker

Docker image: ds01-username/my-thesis:latest

Built from Dockerfile, includes:

  • Base image with CUDA
  • Framework (PyTorch/TensorFlow/JAX)
  • Your selected packages
  • Jupyter Lab (if selected)

Editing the Dockerfile

The Dockerfile lives in your project:

vim ~/workspace/my-thesis/Dockerfile

Example Dockerfile:

FROM aime/pytorch:2.8.0-cuda12.4

# Install additional packages (quote version specifiers so the shell
# does not treat ">" as output redirection)
RUN pip install --no-cache-dir \
    "transformers>=4.30.0" \
    "datasets>=2.12.0" \
    "accelerate>=0.20.0"

# Install system packages
RUN apt-get update && apt-get install -y \
    ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# Configure Jupyter (if needed)
RUN pip install jupyterlab ipywidgets

WORKDIR /workspace

After editing:

# Rebuild image after manual Dockerfile edit
image-update my-thesis --rebuild

# Recreate containers
project launch my-thesis
# Or
container-deploy my-thesis

Tip: For simpler package management, use image-update (no arguments) to open the interactive GUI.

See the Custom environments guide for details.


Creating Multiple Projects

You can have many projects:

# Different research areas
project init thesis-cv --type=cv
project init thesis-nlp --type=nlp

# Different experiments
project init baseline-model
project init improved-model

# Teaching vs research
project init cs231n-homework
project init research-2024

Each project has:

  • Own workspace directory
  • Own Docker image
  • Own git repository
  • Own environment/packages

Switching between projects:

# Work on project A
project launch thesis-cv --open
# ... work ...
exit # keep running in background, or retire

# Switch to project B
project launch thesis-nlp --open
# ... work ...
exit # keep running in background, or retire

Project Metadata (pyproject.toml)

DS01 uses pyproject.toml for project metadata:

[project]
name = "my-thesis"
description = "Computer vision thesis project"
version = "0.1.0"
authors = [{name = "Your Name", email = "you@example.com"}]

[tool.ds01]
type = "cv"
created = "2024-12-09"
author = "h_baker"
image = "ds01-12345/my-thesis:latest"
framework = "pytorch"

Automatically created during project init.

Why this matters:

  • Tracks project metadata
  • Stores image name for project launch
  • Version-controlled with your code
  • Python ecosystem standard
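Because the image name is stored under [tool.ds01], scripts can read it back from pyproject.toml. The sketch below shows one way to do that with awk. It assumes the simple one-key-per-line layout shown above, is not a full TOML parser, and uses a hypothetical function name (ds01_image_name) that DS01 does not provide.

```shell
# Hypothetical helper, not part of DS01: print the image recorded under
# [tool.ds01] in a pyproject.toml (line-oriented, not a real TOML parser).
ds01_image_name() {
    awk '
        /^\[/ { in_ds01 = ($0 == "[tool.ds01]"); next }   # track the current table
        in_ds01 && /^image[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")    # drop the key and the equals sign
            gsub(/"/, "")               # strip the surrounding quotes
            print
            exit
        }
    ' "$1"
}

# Demonstrate on a temporary copy of the example metadata.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[project]
name = "my-thesis"

[tool.ds01]
type = "cv"
image = "ds01-12345/my-thesis:latest"
framework = "pytorch"
EOF
ds01_image_name "$sample"   # prints ds01-12345/my-thesis:latest
rm -f "$sample"
```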

Git Integration

Projects automatically initialise git:

cd ~/workspace/my-thesis

# Check status
git status

# Add changes
git add .

# Commit
git commit -m "Initial project setup"

Pre-configured .gitignore:

# Ignore large files
data/
models/
*.pt
*.pth
*.h5
*.ckpt

# Ignore outputs
outputs/
__pycache__/
*.pyc

Git LFS for large files:

# Automatically configured for:
*.pt
*.pth
*.h5
*.safetensors
*.ckpt
*.bin
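To see whether a given file would go through Git LFS, you can query its attributes with git check-attr. This sketch assumes DS01 writes the standard filter=lfs entries to .gitattributes (the form that git lfs track produces); is_lfs_tracked is a hypothetical helper name, and the file paths are illustrative.

```shell
# Hypothetical helper, not part of DS01: succeed if .gitattributes routes the
# given path through the LFS filter. Run it inside the project repository.
is_lfs_tracked() {
    local attr
    # `git check-attr` prints "<path>: filter: <value>"; keep only the value.
    attr=$(git check-attr filter -- "$1" 2>/dev/null | sed 's/.*: //')
    [ "$attr" = "lfs" ]
}

for path in models/best.pt models/weights.safetensors src/train.py; do
    if is_lfs_tracked "$path"; then
        echo "LFS:     $path"
    else
        echo "regular: $path"
    fi
done
```

git check-attr only reads attribute rules, so the files do not need to exist for the check to work.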

Add remote (optional):

cd ~/workspace/my-thesis
git remote add origin git@github.com:username/my-thesis.git
git push -u origin main

→ Requires SSH keys from user-setup


Building Images Later

Skipped image build during init?

# Build it now
image-create my-thesis

# Or build when first launching
project launch my-thesis
# Detects missing image and offers to build
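If you want to check for the image yourself before launching, docker image inspect exits with a non-zero status when the image is not present locally. The following is a sketch under that assumption only; image_exists is a hypothetical helper, and the image name follows the ds01-$(id -u)/... form used in the Troubleshooting section.

```shell
# Hypothetical helper, not part of DS01: succeed if a local Docker image with
# the given name exists.
image_exists() {
    docker image inspect "$1" >/dev/null 2>&1
}

image="ds01-$(id -u)/my-thesis:latest"
if image_exists "$image"; then
    echo "image present: $image"
else
    echo "image missing: build it with image-create my-thesis"
fi
```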

Project Templates

Different project types include different packages:

Note: these package lists may change in future releases.

  • Machine Learning (General): scikit-learn, xgboost, lightgbm, pandas, numpy, matplotlib, seaborn, plotly

  • Computer Vision: torchvision, Pillow, OpenCV, albumentations (augmentation), timm (model zoo)

  • NLP: transformers, datasets, tokenizers, sentencepiece, spacy

  • Reinforcement Learning: gym, stable-baselines3, tensorboard

  • Time Series: statsmodels, prophet, pmdarima, tslearn

  • LLM: transformers, accelerate, bitsandbytes, peft, flash-attention

  • Custom: Only base packages; add your own in Dockerfile


Troubleshooting

"Project already exists"

Error: Directory ~/workspace/my-thesis already exists

Fix: Choose a different name or remove old project:

# Back up if needed
mv ~/workspace/my-thesis ~/workspace/my-thesis-old

# Then create new
project init my-thesis
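If you back up projects more than once, a fixed name such as my-thesis-old will eventually collide with an earlier backup. One option is a timestamped backup name, sketched below. backup_project is a hypothetical helper, not a DS01 command, and the naming scheme is an assumption.

```shell
# Hypothetical helper, not part of DS01: move an existing project directory
# aside under a timestamped name and print the new location.
backup_project() {
    local src="${1%/}" dest
    dest="$src-backup-$(date +%Y%m%d-%H%M%S)"
    if [ ! -d "$src" ]; then
        echo "nothing to back up: $src" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        # Refuse to continue: mv would otherwise move src *into* dest.
        echo "backup target already exists: $dest" >&2
        return 1
    fi
    mv -- "$src" "$dest" && echo "$dest"
}

# Usage: backup_project ~/workspace/my-thesis && project init my-thesis
```

This moves only the workspace directory; it makes no assumption about what happens to the existing Docker image.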

"Image build failed"

Common causes:

  • Network timeout
  • Package version conflict
  • Invalid Dockerfile syntax

Debug:

# Check Dockerfile
cat ~/workspace/my-thesis/Dockerfile

# Try building manually
cd ~/workspace/my-thesis
docker build -t ds01-$(id -u)/my-thesis:latest .

"Cannot push - no remote"

# Add GitHub remote first
cd ~/workspace/my-thesis
git remote add origin git@github.com:username/my-thesis.git
git push -u origin main

Requires SSH keys configured during user-setup.


Best Practices

1. Meaningful Names

# Good
project init thesis-chapter3
project init baseline-resnet50
project init ablation-study

# Less helpful
project init project1
project init test
project init asdf

2. One Project Per Research Question

# Separate experiments
project init baseline-model
project init improved-model
project init final-model

Each project has its own environment, which makes experiments easier to track.

3. Version Control from Day 1

# Commit early and often
git add .
git commit -m "Initial setup"

git add train.py
git commit -m "Add training script"

git push

4. Document in README

# Edit README immediately
vim ~/workspace/my-thesis/README.md

Suggested contents:

  • What this project does
  • How to reproduce experiments
  • Dataset locations
  • Key results

5. Keep Dockerfiles Simple

# Good - clear, minimal
RUN pip install transformers datasets

# Avoid - overly complex
RUN pip install transformers && \
wget https://... && \
tar -xzf ... && \
cd ... && \
python setup.py install && \
...

Next Steps

Launch your project:

project launch my-thesis --open