GPU Issues

Solutions for GPU allocation, availability, and CUDA problems.

No GPUs Available

Symptoms:

```
$ container-deploy my-project
Error: No GPUs available for allocation
```

Causes:

All GPUs currently allocated
System maintenance
GPU hardware issues

Solutions:

Check availability:
```
dashboard
```
Wait for a GPU to free up. Other users may be finishing soon.

Retire your idle containers:

```
container-list
container-retire old-project
```

Contact the DSL admin if the dashboard shows GPUs as available but allocation still fails.

CUDA Out of Memory

Symptoms:

```
RuntimeError: CUDA out of memory. Tried to allocate X.XX GiB
```

Causes:

Model too large for GPU
Batch size too large
Memory leak in code

Solutions:

Reduce batch size:

```
batch_size = 32  # Try 16, 8, or smaller
```

Use gradient accumulation:

```
# Effective batch size = batch_size * accumulation_steps
accumulation_steps = 4
```
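
The setting above only fixes the step count. As a rough sketch of how accumulation fits into a training loop, the helper below divides each loss by `accumulation_steps`, calls `backward()` on every micro-batch, and steps the optimizer once per group. The name `train_with_accumulation` and its parameters are illustrative assumptions; `model`, `loss_fn`, and `optimizer` are expected to behave like their PyTorch counterparts, and the loss is assumed to be mean-reduced.

```python
def train_with_accumulation(model, optimizer, loss_fn, batches, accumulation_steps=4):
    """Run one pass over `batches`, stepping the optimizer once per group of
    `accumulation_steps` micro-batches. Returns the number of optimizer steps."""
    optimizer_steps = 0
    pending = 0  # micro-batches whose gradients have not been applied yet
    optimizer.zero_grad()
    for inputs, targets in batches:
        # Divide the loss so the accumulated gradient matches one large batch
        # (assumes a mean-reduced loss).
        loss = loss_fn(model(inputs), targets) / accumulation_steps
        loss.backward()  # gradients add up across calls until zero_grad()
        pending += 1
        if pending == accumulation_steps:
            optimizer.step()
            optimizer.zero_grad()
            optimizer_steps += 1
            pending = 0
    if pending:
        # Apply leftover micro-batches; they were scaled as if the group were
        # full, so this final update is slightly smaller than the others.
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1
    return optimizer_steps
```

Only one micro-batch of activations is held in memory at a time, which is why this lowers peak usage compared with a single large batch.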

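Use mixed precision (recent PyTorch releases deprecate torch.cuda.amp in favor of torch.amp, for example torch.amp.autocast("cuda")):

```
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # Scales the loss so float16 gradients do not underflow
with autocast():  # Forward pass runs in mixed precision
    output = model(inputs)
    loss = criterion(output, targets)  # criterion, targets: from your training loop

scaler.scale(loss).backward()
scaler.step(optimizer)  # optimizer: from your training loop
scaler.update()
```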

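Clear cache (this returns cached blocks that PyTorch no longer uses; it does not free memory held by live tensors):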
```
import torch
torch.cuda.empty_cache()
```
Check for memory leaks:
```
print(torch.cuda.memory_summary())
```
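
`memory_summary()` is a snapshot, so a leak is often easier to spot as a trend. One approach, sketched here, is to record `torch.cuda.memory_allocated()` at the same point in every iteration and check whether the readings keep climbing. The helper name `looks_like_leak` and the 1 MiB threshold are assumptions chosen for illustration.

```python
def looks_like_leak(samples, min_growth_bytes=1 << 20):
    """Return True if every reading is higher than the one before it and the
    total growth is at least `min_growth_bytes`.

    `samples` is a list of byte counts, for example one
    torch.cuda.memory_allocated() reading per training iteration.
    """
    if len(samples) < 3:
        return False  # too few readings to call it a trend
    always_rising = all(
        later > earlier for earlier, later in zip(samples, samples[1:])
    )
    return always_rising and (samples[-1] - samples[0]) >= min_growth_bytes
```

A frequent culprit is keeping tensors that are still attached to the autograd graph, for example `total_loss += loss` instead of `total_loss += loss.item()`.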

GPU Not Detected by Framework

Symptoms:

```
>>> torch.cuda.is_available()
False
```

Solutions:

Check nvidia-smi first:
```
nvidia-smi
```

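Check CUDA version compatibility:

```
nvcc --version  # CUDA toolkit version, if the toolkit is installed
python -c "import torch; print(torch.version.cuda)"  # CUDA version PyTorch was built with; None means a CPU-only build
```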

Reinstall PyTorch with a build that matches your CUDA version (cu124 is the build for CUDA 12.4):

```
pip install torch --index-url https://download.pytorch.org/whl/cu124
```
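
The `cu124` suffix in the index URL encodes CUDA 12.4. As a small sketch of that naming pattern, the hypothetical helper `wheel_index_url` below builds the URL from a version string. PyTorch publishes wheels only for selected CUDA versions, so confirm that the resulting index exists before relying on it.

```python
PYTORCH_WHEEL_BASE = "https://download.pytorch.org/whl"

def wheel_index_url(cuda_version):
    """Map a CUDA version such as "12.4" to an index URL ending in "cu124"."""
    parts = cuda_version.strip().split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError(f"expected a version like '12.4', got {cuda_version!r}")
    major, minor = parts[0], parts[1]  # any patch component is ignored
    return f"{PYTORCH_WHEEL_BASE}/cu{major}{minor}"
```

The version string can come from the `torch.version.cuda` check above or from the CUDA version your driver supports.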

For TensorFlow, list the GPUs it can see (an empty list means none were detected):

```
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))
```

GPU Shows No Devices

Symptoms:

```
$ nvidia-smi
No devices found
```

Causes:

No GPU allocated to your container
Container created without a GPU

Solutions:

Check allocation:

```
python3 /opt/ds01-infra/scripts/docker/gpu_allocator.py status
```

Recreate container:

```
container-retire my-project
container-deploy my-project
```

GPU Memory Not Releasing

Symptoms:

nvidia-smi shows memory in use even after training ends
"CUDA out of memory" on the next run, even though the previous run completed successfully

Solutions:

Delete tensors and clear cache:

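```
del model, optimizer      # Drop the references that keep the tensors alive
torch.cuda.empty_cache()  # Return the now-unused cached blocks to the driver
```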

Exit Python and restart:
```
exit()  # Leave the Python interpreter; this releases its GPU memory
python  # Start a fresh interpreter from the shell
```

Restart container:

```
exit  # Leave the container shell
container-stop my-project
container-start my-project
```
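
Before restarting, it can help to see which processes still hold GPU memory. The sketch below assumes `nvidia-smi` is on the PATH and that process IDs are visible from inside the container, which may not hold in every setup. It queries the compute processes in CSV form and parses the rows; the function names are illustrative.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader,nounits",
]

def parse_compute_apps(text):
    """Parse "pid, used_memory" CSV rows into a list of (pid, mib) tuples.

    Rows with non-numeric fields (such as "[N/A]") are skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) == 2 and fields[0].isdigit() and fields[1].isdigit():
            rows.append((int(fields[0]), int(fields[1])))
    return rows

def gpu_processes():
    """Return (pid, mib) for each compute process, or [] if the query fails."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return []  # nvidia-smi missing, not executable, or hung
    return parse_compute_apps(result.stdout) if result.returncode == 0 else []
```

If a stale training process appears in the list, ending that process frees its memory without a container restart.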

Multi-GPU Not Working

Symptoms:

Only one GPU visible
DataParallel/DistributedDataParallel fails

Solutions:

Check allocation:

```
nvidia-smi  # Should show all allocated GPUs
```

Verify all GPUs visible to PyTorch:

```
import torch
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(i))
```
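
If `device_count()` reports fewer GPUs than `nvidia-smi`, one thing worth checking is the `CUDA_VISIBLE_DEVICES` environment variable, which restricts the devices a CUDA process can see. Whether it is set in your container is an assumption to verify; the helper name `visible_device_ids` is illustrative.

```python
import os

def visible_device_ids(env=None):
    """Return the entries of CUDA_VISIBLE_DEVICES, or None if it is unset.

    An empty list means the variable is set but empty, which hides all GPUs.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # no restriction from this variable
    return [item.strip() for item in raw.split(",") if item.strip()]

print(visible_device_ids())
```

A result with a single entry would explain why only one GPU is visible to the framework even though more are allocated.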

Request more GPUs (via container-create):

```
container-retire my-project
container-create my-project --num-gpus 2
container-run my-project
```
