
Long-Running Jobs

Running overnight training, preventing timeouts, and managing extended workloads.


Quick Start

# Deploy in background
container-deploy my-project --background

# Inside container, use nohup or tmux
nohup python train.py > training.log 2>&1 &

# Prevent idle timeout
touch /workspace/.keep-alive

Running Training Overnight

Method 1: nohup (Simple)

# Start training that continues after disconnect
nohup python train.py > training.log 2>&1 &

# Check progress
tail -f training.log

# Exit container safely
exit

Method 2: tmux

# Create named session
tmux new -s training

# Run training
python train.py

# Detach: Ctrl+B, then D

# Later, reattach
tmux attach -t training

Method 3: Screen

# Create session
screen -S training

# Run training
python train.py

# Detach: Ctrl+A, then D

# Reattach
screen -r training

Preventing Idle Timeout

DS01 automatically stops containers that are idle, meaning GPU activity is low. The timeout varies by user (typically 30 min to 2 h) and is adjusted dynamically. Run check-limits to see your current limits.

Contact DSL First

The workarounds below (.keep-alive, nohup, etc.) are available but should be a last resort, because they can hold GPUs longer than necessary and disrupt the system for other users.

Please open an issue on DS01 Hub first to discuss your requirements with the Data Science Lab team. We can often find better solutions together (adjusted limits, scheduled runs, checkpointing strategies).

Option 1: .keep-alive File

touch /workspace/.keep-alive

This tells DS01 your job is intentionally long-running.

Option 2: Active Training

Active GPU/CPU usage doesn't count as idle. Training jobs naturally prevent timeout.


Pausing Jobs Temporarily

For short breaks, you can pause containers instead of stopping them.

# Pause container (freeze all processes)
container-pause my-project

# Resume container
container-unpause my-project

Docker Commands (L1)

Replace <project-name> with your actual project name.

# Pause container
docker pause <project-name>._.$(id -u)

# Resume container
docker unpause <project-name>._.$(id -u)

What happens when paused:

  • All processes freeze instantly (training pauses mid-batch)
  • GPU remains allocated but idle
  • Memory state preserved
  • Resume instantly where you left off

When to use pause:

  • Debugging (freeze state for inspection)
  • Testing before checkpoints

When NOT to use pause:

  • Long breaks (use container-stop instead)
  • Freeing GPU for others (pause keeps GPU allocated)

Monitoring Progress

From Outside Container

Replace <project-name> with your actual project name.

# Check container is running
container-list

# View logs
docker logs <project-name>._.$(whoami) --tail 100

# Follow logs
docker logs <project-name>._.$(whoami) -f

Inside Container

# Attach to running container
container-attach my-project

# Check GPU
nvidia-smi

# Check training logs
tail -f /workspace/training.log

Checkpointing

Save progress regularly so you can resume if interrupted:

PyTorch

import torch

# Save checkpoint (one file per epoch)
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, f'/workspace/checkpoints/checkpoint_{epoch}.pt')

# Load checkpoint (assumes the newest checkpoint is also kept as checkpoint_latest.pt)
checkpoint = torch.load('/workspace/checkpoints/checkpoint_latest.pt')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
# Continue with the next epoch (assumes the checkpoint was saved at the end of an epoch)
start_epoch = checkpoint['epoch'] + 1
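
The save example writes one file per epoch, while the load example reads checkpoint_latest.pt. One possible way to connect the two is to pick the highest-numbered file at startup. The sketch below uses only the standard library; find_latest_checkpoint is a hypothetical helper name, and it assumes the checkpoint_{epoch}.pt naming used in the save example.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme: checkpoint_<epoch>.pt, as in the save example.
_CKPT_PATTERN = re.compile(r"^checkpoint_(\d+)\.pt$")


def find_latest_checkpoint(directory: str) -> Optional[Path]:
    """Return the checkpoint file with the highest epoch number, or None."""
    root = Path(directory)
    if not root.is_dir():
        return None
    best_epoch = -1
    best_path = None
    for path in root.iterdir():
        match = _CKPT_PATTERN.match(path.name)
        # Files that do not follow the naming scheme are ignored.
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = path
    return best_path
```

The returned path can be passed to torch.load in place of the fixed checkpoint_latest.pt name. Comparing epoch numbers as integers avoids the string-sort trap where checkpoint_9.pt would rank above checkpoint_10.pt.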

TensorFlow/Keras

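import tensorflow as tf

# Save callback
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    '/workspace/checkpoints/model_{epoch:02d}.keras',
    monitor='loss',  # fit() below gets no validation data, so the default 'val_loss' is never available
    save_best_only=True
)

model.fit(x, y, callbacks=[checkpoint])

# Resume (model_10.keras exists only if epoch 10 improved the monitored loss)
model = tf.keras.models.load_model('/workspace/checkpoints/model_10.keras')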

Resource Limits

Check your limits before long runs:

check-limits

Key limits:

  • Max Runtime: 24h-72h (varies by user)
  • Idle Timeout: 30min-2h (varies by user)
  • Memory: Per-container limit

Run check-limits to see your current values.
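
Because the maximum runtime is a hard limit, a training loop can watch its own elapsed time and checkpoint shortly before the limit is reached. The following is a minimal sketch under assumptions: should_stop and train are hypothetical names, run_epoch and save_checkpoint stand in for your own functions, and the 24-hour default should be replaced by the value that check-limits reports for you.

```python
import time


def should_stop(start, max_runtime_s, margin_s, now=None):
    """Return True once elapsed time is within margin_s of max_runtime_s."""
    if now is None:
        now = time.monotonic()
    return (now - start) >= (max_runtime_s - margin_s)


def train(num_epochs, run_epoch, save_checkpoint,
          max_runtime_s=24 * 3600, margin_s=15 * 60):
    """Run epochs until finished or until the runtime budget is nearly spent.

    Returns the number of epochs completed.
    """
    start = time.monotonic()
    for epoch in range(num_epochs):
        run_epoch(epoch)
        # Checked once per epoch, so margin_s should exceed one epoch's duration.
        if should_stop(start, max_runtime_s, margin_s):
            save_checkpoint(epoch)
            return epoch + 1
    return num_epochs
```

The check runs only between epochs, so the safety margin needs to be longer than your slowest epoch plus the time it takes to write a checkpoint.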


Best Practices

1. Checkpoint Early and Often

# Every N epochs
if epoch % checkpoint_interval == 0:
    save_checkpoint(model, optimizer, epoch)
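
A container that is stopped in the middle of a write can leave a truncated checkpoint behind. One common way to guard against that is to write to a temporary file and rename it into place. The sketch below shows the idea with a hypothetical save_atomically helper; pickle is used only so the example runs without PyTorch, and in a real job the pickle.dump call would be replaced by torch.save(state, f).

```python
import os
import pickle
import tempfile
from pathlib import Path


def save_atomically(state, path):
    """Write state to path so readers never see a half-written file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)  # stand-in for torch.save(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace swaps the file in atomically on POSIX filesystems.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If the process dies before os.replace runs, the previous checkpoint at that path is left untouched and only a stray .tmp file remains.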

2. Log to File

import logging
logging.basicConfig(
    filename='/workspace/training.log',
    level=logging.INFO
)
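
If you also want messages to appear in the terminal (for example inside a tmux session) while they are written to the file, a logger with two handlers is one option. This is a sketch, not a DS01 requirement: setup_logging and the logger name "training" are hypothetical, and the log path is whatever you pass in.

```python
import logging
import sys


def setup_logging(log_path, level=logging.INFO):
    """Return a logger that writes each message to log_path and to stdout."""
    logger = logging.getLogger("training")  # hypothetical logger name
    logger.setLevel(level)
    logger.propagate = False
    # Drop handlers from earlier calls so messages are not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (logging.FileHandler(log_path),
                    logging.StreamHandler(sys.stdout)):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```

When the script is started with nohup and stdout is redirected, point log_path at a different file than the redirect target; otherwise two writers append to the same file and every message appears twice.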

3. Monitor GPU Memory

# In training loop
if epoch % 10 == 0:
    print(f"GPU Memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

4. Set Up Alerts (Optional)

# At end of training
import os
os.system('echo "Training complete" | mail -s "DS01 Alert" you@email.com')
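
An alert is often most useful when training fails, not only when it finishes. The sketch below wraps a training function so that a message is sent in both cases. It assumes, as the example above does, that a mail command is available inside the container; notify and run_with_alert are hypothetical names, and the recipient address is yours to supply.

```python
import subprocess


def notify(subject, body, recipient, runner=subprocess.run):
    """Send body via the mail command (assumed to be installed)."""
    return runner(["mail", "-s", subject, recipient],
                  input=body, text=True, check=False)


def run_with_alert(train_fn, recipient, notify_fn=notify):
    """Run train_fn and report success or failure by mail."""
    try:
        train_fn()
    except Exception as exc:
        notify_fn("DS01 Alert: training failed",
                  f"{type(exc).__name__}: {exc}", recipient)
        raise  # keep the traceback in the log
    notify_fn("DS01 Alert: training complete", "Training complete", recipient)
```

Passing the arguments as a list avoids shell quoting problems that can arise when a subject or message is interpolated into an os.system string.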

Troubleshooting

Replace <project-name> with your actual project name in commands below.

Job Stopped Unexpectedly

  1. Check logs:

    docker logs <project-name>._.$(whoami) | tail -100
  2. Check for OOM:

    docker inspect <project-name>._.$(whoami) | grep OOMKilled
  3. Resume from checkpoint:

    container-deploy my-project --open
    python train.py --resume /workspace/checkpoints/latest.pt
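
The --resume flag in step 3 is something your own train.py has to implement. A minimal sketch of how it might be wired up with argparse follows; the --epochs option and the first_epoch helper are illustrative assumptions, and the checkpoint dictionary is assumed to carry an 'epoch' key saved at the end of that epoch.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line of a hypothetical train.py."""
    parser = argparse.ArgumentParser(description="Training entry point (sketch)")
    parser.add_argument("--resume", metavar="PATH", default=None,
                        help="checkpoint file to resume from")
    parser.add_argument("--epochs", type=int, default=10,
                        help="total number of epochs to train")
    return parser.parse_args(argv)


def first_epoch(checkpoint):
    """Return the epoch to start from; checkpoint is a loaded dict or None."""
    if checkpoint is None:
        return 0
    # The saved epoch is already finished, so continue with the next one.
    return checkpoint["epoch"] + 1
```

In train.py, the path in args.resume would be loaded with your framework's loader (for example torch.load) and the result passed to first_epoch to set the starting point of the loop.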

Container Removed

Your workspace is safe. Recreate and resume:

container-deploy my-project --open
# Resume training from checkpoint
