Troubleshooting Guide
Common issues and solutions for DS01. Check here before asking for help.
Note: In the examples below, replace <project-name> with your actual project name. The $(whoami) part auto-substitutes your username.
Container Issues
"No GPUs available"
Symptoms:
$ container-deploy my-project
Error: No GPUs available for allocation
Causes:
- All GPUs currently allocated
- System maintenance
- GPU hardware issues
Solutions:
-
Check availability:
dashboard
# or
python3 /opt/ds01-infra/scripts/docker/gpu_allocator.py status
-
Wait for GPU to free up - Other users may be finishing soon
-
Check for idle containers:
# Retire your idle containers
container-list
container-retire old-project
-
Contact admin if GPUs show available but allocation fails
"Container won't start"
Symptoms:
$ container-start my-project
Error: Container failed to start
Causes:
- GPU no longer exists (was reallocated)
- Resource limits exceeded
- Container configuration issue
Solutions:
-
Check container status:
docker ps -a | grep my-project
docker logs <project-name>._.$(whoami)
-
Recreate container:
container-remove my-project
container-create my-project
-
Check resource limits:
check-limits
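The docker commands in this guide address a container as <project-name>._.$(whoami). For scripting, that name can be built in one place rather than retyped. The sketch below is a minimal illustration; `container_name` is a hypothetical helper, and it assumes the "<project-name>._.<username>" pattern seen in the examples is the whole naming rule.

```python
import getpass
from typing import Optional


def container_name(project: str, user: Optional[str] = None) -> str:
    """Build the container name used by the docker commands in this guide."""
    # Assumption: the "<project-name>._.<username>" pattern from the examples
    # is the complete rule; the real DS01 scripts may sanitise names further.
    if user is None:
        user = getpass.getuser()  # same idea as $(whoami) in the shell examples
    return f"{project}._.{user}"


print(container_name("my-project", "alice"))  # my-project._.alice
```

Calling `container_name("my-project")` without a user falls back to the account running the script.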
"Container stopped unexpectedly"
Symptoms:
- Container was running, now shows as stopped
- Processes terminated
Causes:
- Idle timeout reached (30min-2h, varies by user)
- Out of memory (OOM killer)
- Max runtime exceeded
- Code crashed
Solutions:
-
Check logs:
docker logs <project-name>._.$(whoami) | tail -100
-
Check for OOM:
docker inspect <project-name>._.$(whoami) | grep OOMKilled
-
Prevent idle timeout:
touch ~/workspace/<project-name>/.keep-alive
Contact DSL First: The .keep-alive workaround is available but should be a last resort, as it can disrupt the system for other users. Please open an issue on DS01 Hub first to find a better solution together.
-
Restart container:
container-start my-project
# or recreate
container-retire my-project
container-deploy my-project
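For the "Check for OOM" step, grepping for OOMKilled works, but the exit code and status sit in the same part of the `docker inspect` output and are worth reading together. The sketch below parses that JSON; `summarize_state` is a hypothetical helper and the sample string is made up for illustration, not real DS01 output.

```python
import json


def summarize_state(inspect_output: str) -> dict:
    """Extract the fields that help explain why a container stopped."""
    # `docker inspect <name>` prints a JSON array with one object per container.
    container = json.loads(inspect_output)[0]
    state = container["State"]
    return {
        "status": state.get("Status"),
        "exit_code": state.get("ExitCode"),
        "oom_killed": state.get("OOMKilled", False),
    }


# Trimmed, invented sample of inspect output, for illustration only.
sample = '[{"State": {"Status": "exited", "ExitCode": 137, "OOMKilled": true}}]'
print(summarize_state(sample))
```

To use it on a real container, pass in the text printed by `docker inspect <project-name>._.$(whoami)`.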
"Can't enter container"
Symptoms:
$ container-run my-project
Error: Container not found
Causes:
- Container was removed
- Wrong project name
- Container never created
Solutions:
-
List containers:
container-list --all
docker ps -a --filter "name=._.$(whoami)"
-
Recreate if needed:
container-deploy my-project
-
Check project name spelling
File & Storage Issues
"Can't find my files"
Symptoms:
- Files missing from container
- Workspace appears empty
Causes:
- Looking in wrong location
- Files saved outside workspace
- Permissions issue
Solutions:
-
Check both locations:
# On host
ls ~/workspace/<project-name>/
# In container
docker exec <project-name>._.$(whoami) ls /workspace/
-
Verify workspace mount:
docker inspect <project-name>._.$(whoami) | grep -A 5 "Mounts"
-
Remember the mapping:
- Host: ~/workspace/<project-name>/
- Container: /workspace/
"Permission denied" on files
Symptoms:
$ touch /workspace/file.txt
Permission denied
Causes:
- Directory ownership issue
- Incorrect mount
- Filesystem full
Solutions:
-
Check ownership:
ls -ld ~/workspace/<project-name>/
# Should be owned by you
-
Fix permissions (on host):
sudo chown -R $(whoami):$(whoami) ~/workspace/<project-name>/
-
Check disk space:
df -h | grep home
"Out of disk space"
Symptoms:
$ cp large-file.dat /workspace/
No space left on device
Causes:
- Workspace full
- Docker images consuming space
- Quota exceeded
Solutions:
-
Check usage:
# Workspace
du -sh ~/workspace/*
# Docker
docker system df
# Quota (if enforced)
quota -s
-
Clean up:
# Remove old projects
rm -rf ~/workspace/old-project/
# Clean Docker
docker image prune
docker system prune
# Remove old checkpoints
find ~/workspace -name "checkpoint-*.pt" -mtime +30 -delete
-
Contact admin if quota too small
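Because the `find ... -delete` command in the clean-up step removes files without asking, it can help to list the candidates first. The sketch below is a dry run that roughly mirrors that command: it collects checkpoint files not modified for more than 30 days and reports their sizes, deleting nothing. `stale_checkpoints` and `report` are hypothetical helper names, and the age cut-off is an approximation of `-mtime +30`, not an exact match.

```python
import time
from pathlib import Path


def stale_checkpoints(root, days=30, pattern="checkpoint-*.pt"):
    """Return files under root matching pattern, last modified > `days` days ago."""
    cutoff = time.time() - days * 24 * 60 * 60
    matches = [
        path
        for path in Path(root).expanduser().rglob(pattern)
        if path.is_file() and path.stat().st_mtime < cutoff
    ]
    return sorted(matches)


def report(paths):
    """Format one line per file with its size in MiB; nothing is deleted."""
    return [f"{p.stat().st_size / 2**20:8.1f} MiB  {p}" for p in paths]
```

For example, `print("\n".join(report(stale_checkpoints("~/workspace"))))` prints the list to review before running the `find` command.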
Image Issues
"Image build fails"
Symptoms:
$ image-create my-project
Error: Failed to build image
Common causes and solutions:
1. Network issues (downloading base image)
# Retry build
image-create my-project
# Or use cached base
docker images | grep aime-pytorch
2. Package installation fails
# Use interactive GUI to fix packages
image-update # Select image, fix package names/versions
# Or check Dockerfile manually
cat ~/dockerfiles/my-project.Dockerfile
vim ~/dockerfiles/my-project.Dockerfile
image-update my-project --rebuild
3. Disk space
df -h
docker system df
docker system prune # Free space
4. Invalid Dockerfile syntax
# Validate Dockerfile
docker build --no-cache -f ~/dockerfiles/my-project.Dockerfile . 2>&1 | less
"Package not found in container"
Symptoms:
$ python -c "import transformers"
ModuleNotFoundError: No module named 'transformers'
Causes:
- Package not in image
- Package name typo
Note: DS01 containers ARE your Python environment - you don't need venv or conda. See Python Environments.
Solutions:
-
Check if installed:
pip list | grep transformers
-
Temporary install:
pip install transformers
-
Permanent fix (add to image):
exit # Exit container
image-update # Select image, add package
container-retire <project-name>
container-deploy <project-name>
GPU Issues
"nvidia-smi: command not found"
Symptoms:
$ nvidia-smi
bash: nvidia-smi: command not found
Cause: Not inside container, or container not started with GPU
Solutions:
-
Enter container:
container-run my-project
nvidia-smi # Should work now
-
Check GPU allocation:
docker inspect <project-name>._.$(whoami) | grep -i gpu
-
Recreate container:
container-retire my-project
container-deploy my-project
"CUDA out of memory"
Symptoms:
RuntimeError: CUDA out of memory. Tried to allocate X.XX GiB
Causes:
- Model too large for GPU
- Batch size too large
- Memory leak in code
Solutions:
-
Reduce batch size:
batch_size = 32 # Try 16, 8, or smaller
-
Use gradient accumulation:
# Effective batch size = batch_size * accumulation_steps
accumulation_steps = 4
-
Use mixed precision:
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
with autocast():
    output = model(input)
-
Clear cache:
import torch
torch.cuda.empty_cache()
-
Check for memory leaks:
# Monitor memory usage
print(torch.cuda.memory_summary())
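The gradient accumulation tip relies on one piece of arithmetic: if each micro-batch gradient is divided by accumulation_steps before being summed, the result equals the gradient of one large batch. The sketch below checks this in plain Python on a toy one-parameter model, so it runs without a GPU or PyTorch. The function names are hypothetical, and the check assumes equally sized micro-batches and a mean-reduced loss.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w for one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_gradient(w, xs, ys, accumulation_steps):
    """Split one batch into equal micro-batches and accumulate scaled gradients."""
    micro = len(xs) // accumulation_steps
    assert micro * accumulation_steps == len(xs), "batch must split evenly"
    total = 0.0
    for step in range(accumulation_steps):
        chunk = slice(step * micro, (step + 1) * micro)
        # Dividing by accumulation_steps keeps the sum equal to the full-batch mean.
        total += mse_gradient(w, xs[chunk], ys[chunk]) / accumulation_steps
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
full = mse_gradient(0.5, xs, ys)
accumulated = accumulated_gradient(0.5, xs, ys, accumulation_steps=4)
print(abs(full - accumulated) < 1e-9)  # True
```

In a PyTorch training loop the same idea is usually written as calling backward on `loss / accumulation_steps` for every micro-batch and stepping the optimizer only once every accumulation_steps micro-batches.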
"GPU not showing in nvidia-smi"
Symptoms:
$ nvidia-smi
No devices found
Causes:
- Not allocated GPU
- Container created without GPU
Solutions:
-
Check allocation:
python3 /opt/ds01-infra/scripts/docker/gpu_allocator.py status
-
Recreate container:
container-retire my-project
container-deploy my-project
Network Issues
"Can't access Jupyter"
Symptoms:
- Jupyter running but can't access in browser
Solutions:
-
Check Jupyter is running:
docker exec <project-name>._.$(whoami) ps aux | grep jupyter
-
Check port:
docker port <project-name>._.$(whoami)
-
Set up SSH tunnel:
# On your laptop
ssh -L 8888:localhost:8888 ds01
# Without SSH keys: ssh -L 8888:localhost:8888 <your-username>@hertie-school.lan@10.1.23.20
# Then access: http://localhost:8888
-
Start Jupyter correctly:
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser
"Can't download datasets"
Symptoms:
$ wget https://example.com/data.zip
Connection refused
Solutions:
-
Check network from container:
ping -c 3 google.com # -c 3 stops after three packets
curl https://google.com
-
Check proxy settings (if your network requires proxy)
-
Try alternative download method:
# Instead of wget
curl -O https://example.com/data.zip
# Or Python
python -c "import urllib.request; urllib.request.urlretrieve('url', 'file')"
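When a download fails only intermittently, the Python route can be extended with a few retries. The sketch below is one possible shape for that, using only the standard library; `download` is a hypothetical helper, and the retry count, delay, and timeout are arbitrary defaults rather than DS01 recommendations.

```python
import shutil
import time
import urllib.error
import urllib.request


def download(url, dest, retries=3, delay=2.0):
    """Stream url to dest, retrying on network errors; returns dest."""
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as response, \
                    open(dest, "wb") as out:
                # Copies in blocks, so large files are never held in memory.
                shutil.copyfileobj(response, out)
            return dest
        except (urllib.error.URLError, TimeoutError) as error:
            if attempt == retries:
                raise  # give up and surface the last error
            print(f"attempt {attempt} failed ({error}); retrying in {delay}s")
            time.sleep(delay)
```

A call such as `download("https://example.com/data.zip", "data.zip")` would then replace the one-liner. If every attempt fails with the same error, the cause is more likely proxy or firewall configuration than a flaky connection.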
Permission Issues
"Permission denied" for Docker
Symptoms:
$ docker ps
Permission denied while trying to connect to the Docker daemon socket
Cause: Not in docker group
Solution:
# Check groups
groups | grep docker
# If not in docker group, ask admin:
# sudo usermod -aG docker your-username
# Then log out and back in
"Permission denied" for commands
Symptoms:
$ container-deploy my-project
bash: container-deploy: Permission denied
Causes:
- Commands not in PATH
- Commands not executable
Solutions:
-
Check PATH:
echo $PATH | grep ds01
-
Use full path:
/opt/ds01-infra/scripts/user/container-deploy.sh my-project
-
Ask admin to update symlinks:
sudo /opt/ds01-infra/scripts/system/deploy-commands.sh
Resource Limit Issues
"At maximum container limit"
Symptoms:
$ container-deploy new-project
Error: Maximum containers reached (3/3)
Solution:
# Check current containers
container-list
# Retire unused containers
container-retire old-project-1
container-retire old-project-2
# Now can deploy new one
container-deploy new-project
"Memory limit exceeded"
Symptoms:
- Container killed
- OOMKilled in logs
Solutions:
-
Check limits:
check-limits
# Shows your memory limit and current usage
-
Reduce memory usage:
- Process data in chunks
- Use data generators
- Clear variables when done
-
Request limit increase (contact admin)
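The "process data in chunks" and "use data generators" suggestions can be combined in a few lines. The sketch below shows one way to do it in plain Python: a generator hands out fixed-size chunks, and the consumer keeps only running totals. `batched` and `running_mean` are hypothetical names chosen for illustration, and the chunk size is arbitrary.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without materialising the whole input."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def running_mean(values, chunk_size=1000):
    """Mean computed chunk by chunk, so only one chunk is in memory at a time."""
    total, count = 0.0, 0
    for chunk in batched(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
        del chunk  # drop the reference once the chunk has been used
    return total / count if count else 0.0


print(list(batched(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
print(running_mean(range(1, 101), chunk_size=10))  # 50.5
```

The same pattern applies to reading a large file line by line: pass the open file object to `batched` instead of loading the file into a list first.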
Git Issues
"Can't push to GitHub"
Symptoms:
$ git push
Permission denied (publickey)
Solutions:
-
Check SSH key:
ls ~/.ssh/
cat ~/.ssh/id_ed25519.pub
-
Add key to GitHub:
- Copy public key
- GitHub → Settings → SSH and GPG keys → New SSH key
-
Test connection:
ssh -T git@github.com
-
Use HTTPS instead:
git remote set-url origin https://github.com/user/repo.git
Getting Help
Before Asking for Help
-
Check this guide - Most issues are common
-
Check logs:
docker logs <project-name>._.$(whoami)
-
Check system status:
dashboard
container-list
container-stats
-
Try recreating:
container-retire my-project
container-deploy my-project
How to Ask for Help
Include this information:
-
What you tried:
container-deploy my-project
-
Error message:
Error: No GPUs available
-
System state:
container-list
check-limits
-
Relevant logs:
docker logs <project-name>._.$(whoami) | tail -50
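Collecting the items above by hand is easy to get wrong when several commands are involved. The sketch below gathers command output into one block of text that can be pasted into a help request. `run` and `build_report` are hypothetical helpers, the 30-second timeout is an arbitrary choice, and a command that is missing is noted in the report rather than treated as a failure.

```python
import subprocess


def run(cmd, timeout=30):
    """Run one command; return its combined output, or a note if it cannot run."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except OSError as error:  # e.g. command not found or not executable
        return f"(could not run {cmd[0]}: {error})"
    except subprocess.TimeoutExpired:
        return f"(timed out after {timeout}s)"
    return (result.stdout + result.stderr).strip()


def build_report(commands):
    """Join each command line and its output into one pasteable text block."""
    sections = []
    for cmd in commands:
        sections.append("$ " + " ".join(cmd) + "\n" + run(cmd))
    return "\n\n".join(sections)
```

A typical call would be `build_report([["container-list"], ["check-limits"], ["docker", "logs", "--tail", "50", name]])`, where `name` is the container name in the <project-name>._.<username> form used throughout this guide.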
Contact Points
- System administrator - Account issues, quotas, system problems
- Documentation - This guide, Command Reference
- Colleagues - Often have encountered same issues
Preventive Measures
Best Practices to Avoid Issues
-
Save frequently to workspace:
# Always work in /workspace (inside container)
cd /workspace
-
Commit code regularly:
git commit -m "Progress checkpoint"
git push
-
Retire containers when done:
container-retire my-project
-
Monitor resource usage:
container-stats
nvidia-smi # Inside container
-
Keep environments updated:
image-update # Add packages as needed
-
Check limits before starting:
check-limits
container-list # How many running?
Emergency Recovery
"I lost all my work!"
Don't panic. Check:
-
Workspace (most likely safe):
ls ~/workspace/<project-name>/
-
Git (if you pushed):
cd ~/workspace/my-project
git log
git pull
-
Previous checkpoints:
ls ~/workspace/<project-name>/models/
If truly lost:
- Learn from the mistake
- Implement a better backup strategy
- Use Git religiously going forward
"Container won't stop"
Symptoms:
- container-stop hangs
- Container stuck in stopping state
Solutions:
-
Force stop:
docker stop -t 1 <project-name>._.$(whoami)
-
Force kill:
docker kill <project-name>._.$(whoami)
-
Remove forcefully:
container-remove my-project --force
"System seems broken"
Symptoms:
- Multiple commands failing
- Unusual errors
Steps:
-
Check system status:
dashboard
-
Check your account:
groups
quota -s
df -h
-
Try minimal operation:
docker ps
-
Contact administrator with details
Common Error Messages Decoded
| Error | Meaning | Solution |
|---|---|---|
| No GPUs available | All GPUs allocated | Wait or retire old containers |
| OOMKilled | Out of memory | Reduce memory usage |
| Permission denied | Not in docker group or file permissions | Check groups, fix permissions |
| Container not found | Container removed or wrong name | Recreate or check name |
| Image not found | Image doesn't exist | Build image first |
| Network unreachable | Network issue | Check network, retry |
| Quota exceeded | Hit disk quota | Clean up old files |
Next Steps
Understand the system:
Learn best practices: →
Command reference:
Most issues have simple solutions. Check logs, try recreating, and remember: your workspace is always safe!