GPU Issues
Solutions for GPU allocation, availability, and CUDA problems.
No GPUs Available
Symptoms:
$ container-deploy my-project
Error: No GPUs available for allocation
Causes:
- All GPUs currently allocated
- System maintenance
- GPU hardware issues
Solutions:
- Check availability:
    dashboard
- Wait for a GPU to free up. Other users may be finishing soon.
- Retire your idle containers:
    container-list
    container-retire old-project
- Contact the DSL admin if GPUs show as available but allocation fails.
CUDA Out of Memory
Symptoms:
RuntimeError: CUDA out of memory. Tried to allocate X.XX GiB
Causes:
- Model too large for GPU
- Batch size too large
- Memory leak in code
Solutions:
- Reduce batch size:
    batch_size = 32  # Try 16, 8, or smaller
- Use gradient accumulation:
    # Effective batch size = batch_size * accumulation_steps
    accumulation_steps = 4
- Use mixed precision:
    from torch.cuda.amp import autocast, GradScaler
    scaler = GradScaler()
    with autocast():
        output = model(input)
- Clear cache:
    import torch
    torch.cuda.empty_cache()
- Check for memory leaks:
    print(torch.cuda.memory_summary())
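The first two suggestions can be combined: halve the batch size whenever CUDA runs out of memory and double the accumulation steps, so the effective batch size stays the same. The following is a minimal sketch, not part of any DS01 tooling. The names `run_with_backoff` and `fake_step` are hypothetical, and `fake_step` only simulates an out-of-memory error so the example runs without a GPU.

```python
def run_with_backoff(train_step, batch_size, min_batch_size=1):
    """Call train_step(batch_size, accumulation_steps), backing off on CUDA OOM.

    Each time an out-of-memory error is raised, the batch size is halved and
    the accumulation steps are doubled, so batch_size * accumulation_steps
    stays constant when batch_size is a power of two.
    """
    accumulation_steps = 1
    while batch_size >= min_batch_size:
        try:
            train_step(batch_size, accumulation_steps)
            return batch_size, accumulation_steps
        except RuntimeError as err:
            # PyTorch reports CUDA OOM as a RuntimeError with this text.
            if "out of memory" not in str(err):
                raise
            # In real PyTorch code, also call torch.cuda.empty_cache() here.
            batch_size //= 2
            accumulation_steps *= 2
    raise RuntimeError("Out of memory even at the minimum batch size")


def fake_step(batch_size, accumulation_steps):
    """Hypothetical stand-in for a training step: fails above batch size 8."""
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory. Tried to allocate 1.00 GiB")


print(run_with_backoff(fake_step, batch_size=32))  # (8, 4)
```

In a real training loop, `train_step` would run the forward and backward pass on `accumulation_steps` micro-batches before each optimizer step.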
GPU Not Detected by Framework
Symptoms:
>>> torch.cuda.is_available()
False
Solutions:
- Check nvidia-smi first:
    nvidia-smi
- Check CUDA version compatibility:
    nvcc --version
    python -c "import torch; print(torch.version.cuda)"
- Reinstall PyTorch with correct CUDA:
    pip install torch --index-url https://download.pytorch.org/whl/cu124
- For TensorFlow:
    import tensorflow as tf
    print(tf.config.list_physical_devices('GPU'))
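The PyTorch checks above can be gathered into one report, which is handy to paste into a support request. This is a minimal sketch; the function name `gpu_diagnostics` and the report keys are hypothetical. It only uses calls that exist in the standard library and in PyTorch, and it skips the PyTorch checks if PyTorch is not installed.

```python
import importlib.util
import shutil


def gpu_diagnostics():
    """Collect basic facts about the GPU tooling visible to this Python."""
    report = {
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "nvcc_on_path": shutil.which("nvcc") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }
    if report["torch_installed"]:
        import torch

        report["torch_version"] = torch.__version__
        # None here means a CPU-only build, which can never see a GPU.
        report["torch_cuda_build"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in gpu_diagnostics().items():
        print(f"{key}: {value}")
```

If `torch_cuda_build` is `None`, the installed wheel is CPU-only and reinstalling with a CUDA index URL is the likely fix.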
GPU Shows No Devices
Symptoms:
$ nvidia-smi
No devices found
Causes:
- No GPU allocated to you
- Container created without GPU
Solutions:
- Check allocation:
    python3 /opt/ds01-infra/scripts/docker/gpu_allocator.py status
- Recreate container:
    container-retire my-project
    container-deploy my-project
GPU Memory Not Releasing
Symptoms:
- nvidia-smi shows memory used even after training ends
- "CUDA out of memory" after successful run
Solutions:
- Delete tensors and clear cache:
    del model, optimizer
    torch.cuda.empty_cache()
- Exit Python and restart:
    exit()  # Python
    python  # Restart
- Restart container:
    exit
    container-stop my-project
    container-start my-project
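To confirm whether memory was actually released after one of these steps, it can help to read per-GPU memory usage programmatically. The sketch below relies on the standard `nvidia-smi --query-gpu` option; the helper names `parse_gpu_memory` and `read_gpu_memory` are hypothetical, and the parser assumes numeric values in every field.

```python
import subprocess

# Standard nvidia-smi query: one CSV row per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' rows into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append(
            {"index": int(index), "used_mib": int(used), "total_mib": int(total)}
        )
    return rows


def read_gpu_memory():
    """Run nvidia-smi and return the parsed rows (needs an allocated GPU)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Sample output for illustration only, not real measurements.
sample = "0, 1536, 40960\n1, 3, 40960\n"
for gpu in parse_gpu_memory(sample):
    print(gpu)
```

Calling `read_gpu_memory()` before and after `torch.cuda.empty_cache()` shows whether the step freed anything.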
Multi-GPU Not Working
Symptoms:
- Only one GPU visible
- DataParallel/DistributedDataParallel fails
Solutions:
- Check allocation:
    nvidia-smi  # Should show all allocated GPUs
- Verify all GPUs visible to PyTorch:
    import torch
    print(torch.cuda.device_count())
    for i in range(torch.cuda.device_count()):
        print(torch.cuda.get_device_name(i))
- Request more GPUs (via container-create):
    container-retire my-project
    container-create my-project --num-gpus 2
    container-run my-project
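If nvidia-smi lists every allocated GPU but PyTorch reports fewer, one possible cause is the standard CUDA environment variable `CUDA_VISIBLE_DEVICES`, which limits the devices a process can see. Whether this platform sets it is an assumption, not something stated above. The helper name `visible_devices` is hypothetical.

```python
import os


def visible_devices(environ=os.environ):
    """Return the device entries in CUDA_VISIBLE_DEVICES, or None if unset.

    None means the variable places no restriction on visible GPUs.
    An empty list means the variable is set but hides every GPU.
    """
    value = environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return [entry.strip() for entry in value.split(",") if entry.strip()]


print(visible_devices({"CUDA_VISIBLE_DEVICES": "0,1"}))  # ['0', '1']
print(visible_devices({}))  # None
```

Call `visible_devices()` with no arguments inside the container to inspect the real environment.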