This setup allows you to run ML experiments on RunPod with automatic pod termination and persistent storage for artifacts.
- ✅ Automatic pod termination after experiment completion
- ✅ Persistent storage for artifacts and logs
- ✅ Both Docker and Dockerless (template) deployment options
- ✅ Comprehensive logging and metrics tracking
- ✅ Cost-efficient (pay only for actual compute time)
- `train_mnist.py` - Main training script with comprehensive logging
- `run_experiment.sh` - Dockerless runner with auto-termination
- `deploy_runpod.py` - Python script to deploy experiments
- `Dockerfile` & `entrypoint.sh` - For Docker-based deployment
- `requirements.txt` - Python dependencies
Dockerless (template) deployment. Pros: Faster setup, no Docker knowledge needed, direct SSH access. Cons: Less reproducible, manual dependency management.
Docker deployment. Pros: Fully reproducible, version-locked dependencies, CI/CD ready. Cons: Requires Docker setup, slower initial deployment.
- RunPod account with API key
- Python 3.8+ installed locally
- (Optional) Docker for building custom images
- Go to RunPod Settings
- Generate an API key
- Save it securely
pip install runpod
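If you want to sanity-check the key before deploying, here is a minimal sketch. It assumes the `runpod` Python SDK is configured through its module-level `runpod.api_key` attribute, and the `RUNPOD_API_KEY` environment variable name is just a convention used here:

```python
# check_key.py -- hypothetical helper, not part of the current scripts.
import os

import runpod  # pip install runpod

# Read the key from the environment so it never lands in version control.
api_key = os.environ.get("RUNPOD_API_KEY")
if not api_key:
    raise SystemExit("Set RUNPOD_API_KEY first (see RunPod Settings).")

# Assumption: the SDK is configured via this module-level attribute.
runpod.api_key = api_key

# Listing available GPU types is a cheap way to confirm the key works.
print(runpod.get_gpus())
```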
# Deploy using PyTorch template
python deploy_runpod.py \
--api-key YOUR_API_KEY \
--experiment-name "mnist_test_1" \
--mode template \
--gpu-type "NVIDIA GeForce RTX 3090" \
--volume-size 50 \
--epochs 10 \
--batch-size 64 \
--learning-rate 0.001
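If you are writing `deploy_runpod.py` yourself, the core of template mode can be a single `runpod.create_pod` call. The sketch below is an outline under assumptions: the flag names match the command above, the image tag is only an example, and the exact `create_pod` keyword arguments (and its return value) may differ across SDK versions:

```python
# Sketch of deploy_runpod.py's template mode (not the full script).
import argparse
import os

import runpod

parser = argparse.ArgumentParser()
parser.add_argument("--api-key", default=os.environ.get("RUNPOD_API_KEY"))
parser.add_argument("--experiment-name", required=True)
parser.add_argument("--mode", choices=["template", "docker"], default="template")
parser.add_argument("--gpu-type", default="NVIDIA GeForce RTX 3090")
parser.add_argument("--volume-size", type=int, default=50)
parser.add_argument("--epochs", type=int, default=10)
parser.add_argument("--batch-size", type=int, default=64)
parser.add_argument("--learning-rate", type=float, default=0.001)
args = parser.parse_args()

runpod.api_key = args.api_key

# Template mode: start from a stock PyTorch image and pass hyperparameters
# through environment variables (assumption: run_experiment.sh reads them).
pod = runpod.create_pod(
    name=args.experiment_name,
    image_name="runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04",  # example tag
    gpu_type_id=args.gpu_type,
    volume_in_gb=args.volume_size,      # persistent /workspace volume
    volume_mount_path="/workspace",
    env={
        "EXPERIMENT_NAME": args.experiment_name,
        "EPOCHS": str(args.epochs),
        "BATCH_SIZE": str(args.batch_size),
        "LEARNING_RATE": str(args.learning_rate),
    },
)
print("Created pod:", pod["id"])  # assumption: create_pod returns a dict with an "id"
```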
- Go to RunPod Console
- Click "Deploy" → "GPU Pod"
- Select PyTorch template
- Choose GPU (e.g., RTX 3090)
- Set Network Volume to 50GB (for persistent storage)
- Deploy the pod
- SSH into the pod and run:
# Upload files to pod (from local machine)
scp train_mnist.py run_experiment.sh root@POD_IP:/workspace/scripts/
# SSH into pod
ssh root@POD_IP
# Inside the pod, run experiment
cd /workspace
bash scripts/run_experiment.sh "my_experiment" 10 64 0.001
# Build image
docker build -t yourusername/mnist-runpod:latest .
# Push to Docker Hub
docker push yourusername/mnist-runpod:latest
# Deploy with Docker
python deploy_runpod.py \
--api-key YOUR_API_KEY \
--experiment-name "mnist_docker_1" \
--mode docker \
--docker-image yourusername/mnist-runpod:latest \
--gpu-type "NVIDIA GeForce RTX 3090" \
--volume-size 50
All experiments save the following to `/workspace/artifacts/EXPERIMENT_NAME/`:
artifacts/
└── mnist_experiment_20240315_120000/
├── models/
│ ├── best_model.pth
│ ├── final_model.pth
│ └── checkpoint_epoch_5.pth
├── logs/
│ └── training_20240315_120000.log
├── metrics/
│ └── training_metrics.json
└── summary.txt
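Once an experiment directory is downloaded, the metrics file can be inspected with a few lines of Python. This assumes `training_metrics.json` holds a list of per-epoch records with `epoch` and `val_accuracy` keys, which is an assumption about the schema, not a guarantee:

```python
# Hypothetical post-run inspection of a downloaded experiment directory.
import json
from pathlib import Path

run_dir = Path("artifacts/mnist_experiment_20240315_120000")

# Assumption: a list of per-epoch dicts such as
# {"epoch": 1, "train_loss": 0.42, "val_accuracy": 0.97}.
metrics = json.loads((run_dir / "metrics" / "training_metrics.json").read_text())
best = max(metrics, key=lambda m: m.get("val_accuracy", 0.0))
print(f"Best epoch: {best['epoch']} (val accuracy {best['val_accuracy']:.4f})")

# The human-readable summary is saved alongside the metrics.
print((run_dir / "summary.txt").read_text())
```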
- Use Spot Instances: 50-80% cheaper than on-demand
- Auto-termination: Ensures you don't pay for idle time
- Right-size GPU: Use smaller GPUs for testing (e.g., RTX 3070)
- Persistent Volume: Keep data between runs without re-downloading
- Go to Pods section to see status
- Check logs in pod terminal
python deploy_runpod.py \
--api-key YOUR_API_KEY \
--experiment-name "test" \
--mode template \
--monitor # Enables monitoring
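A `--monitor` implementation can be as simple as polling the pod until it exits or disappears. The sketch assumes the SDK exposes `runpod.get_pod(pod_id)` and that the returned dict carries a `desiredStatus` field; both the field name and the status values are assumptions:

```python
# Hypothetical --monitor loop inside deploy_runpod.py.
import time

import runpod  # runpod.api_key must already be set


def monitor_pod(pod_id: str, poll_seconds: int = 60) -> None:
    while True:
        try:
            pod = runpod.get_pod(pod_id)
        except Exception:
            # Pod already terminated, or a transient API error.
            print(f"Pod {pod_id} is no longer reachable; assuming it terminated.")
            return
        status = (pod or {}).get("desiredStatus", "UNKNOWN")
        print(f"Pod {pod_id}: {status}")
        if pod is None or status in ("EXITED", "TERMINATED"):
            return
        time.sleep(poll_seconds)
```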
# SSH into running pod
ssh root@POD_IP
# Check logs
tail -f /workspace/artifacts/*/logs/*.log
# Check GPU usage
nvidia-smi
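To capture GPU usage in the training log itself (so you don't have to stay attached to `nvidia-smi`), a small helper like the following could be called from `train_mnist.py` each epoch. This is an optional addition, not something the current script is known to do:

```python
# Hypothetical helper for train_mnist.py: log GPU memory alongside metrics.
import logging

import torch


def log_gpu_memory(logger: logging.Logger) -> None:
    """Record current and peak CUDA memory so the log mirrors nvidia-smi."""
    if not torch.cuda.is_available():
        logger.info("CUDA not available; skipping GPU memory logging")
        return
    allocated = torch.cuda.memory_allocated() / 1024**2      # MiB held by tensors now
    peak = torch.cuda.max_memory_allocated() / 1024**2        # MiB high-water mark
    logger.info("GPU memory: %.0f MiB allocated, %.0f MiB peak", allocated, peak)
```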
Edit `train_mnist.py` to add more hyperparameters:
- Optimizer types
- Learning rate schedules
- Model architectures
- Data augmentation
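For example, the optimizer could be exposed as a CLI flag. This is a sketch only; the `--optimizer` flag and its choices are not part of the current script:

```python
# Hypothetical addition to train_mnist.py: select the optimizer from the CLI.
import argparse

import torch


def build_optimizer(name: str, params, lr: float) -> torch.optim.Optimizer:
    optimizers = {
        "sgd": lambda: torch.optim.SGD(params, lr=lr, momentum=0.9),
        "adam": lambda: torch.optim.Adam(params, lr=lr),
        "adamw": lambda: torch.optim.AdamW(params, lr=lr),
    }
    return optimizers[name]()


parser = argparse.ArgumentParser()
parser.add_argument("--optimizer", choices=["sgd", "adam", "adamw"], default="adam")
parser.add_argument("--learning-rate", type=float, default=0.001)
# args = parser.parse_args()  # merge with the script's existing arguments
# optimizer = build_optimizer(args.optimizer, model.parameters(), args.learning_rate)
```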
Run parallel experiments with different configs:
# Experiment 1
python deploy_runpod.py --experiment-name "lr_0.01" --learning-rate 0.01 &
# Experiment 2
python deploy_runpod.py --experiment-name "lr_0.001" --learning-rate 0.001 &
# Experiment 3
python deploy_runpod.py --experiment-name "batch_128" --batch-size 128 &
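The same sweep can be driven from a short Python script instead of backgrounded shell jobs. This assumes `deploy_runpod.py` returns once the pod is created when `--monitor` is not passed, so the pods themselves run in parallel:

```python
# Hypothetical sweep driver: launch one pod per config via deploy_runpod.py.
import subprocess

CONFIGS = [
    {"experiment-name": "lr_0.01", "learning-rate": "0.01"},
    {"experiment-name": "lr_0.001", "learning-rate": "0.001"},
    {"experiment-name": "batch_128", "batch-size": "128"},
]

for cfg in CONFIGS:
    cmd = ["python", "deploy_runpod.py", "--api-key", "YOUR_API_KEY", "--mode", "template"]
    for flag, value in cfg.items():
        cmd += [f"--{flag}", value]
    # Each call returns after pod creation; the experiments then run concurrently.
    subprocess.run(cmd, check=True)
```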
Network volumes persist across pods. To reuse data:
- Create a network volume in RunPod
- Mount it to `/workspace/data`
- Download datasets once, reuse across experiments (see the sketch below)
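Pointing the dataset root at the mounted volume makes the "download once" step automatic, because torchvision skips the download when the files already exist. A sketch, assuming `train_mnist.py` loads MNIST through torchvision:

```python
# Hypothetical data loading for train_mnist.py: reuse the network volume.
from torchvision import datasets, transforms

DATA_ROOT = "/workspace/data"  # mounted network volume

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),  # standard MNIST statistics
])

# download=True only fetches MNIST the first time; later pods reuse the files.
train_set = datasets.MNIST(DATA_ROOT, train=True, download=True, transform=transform)
test_set = datasets.MNIST(DATA_ROOT, train=False, download=True, transform=transform)
```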
- Problem: `sign_and_send_pubkey: no mutual signature supported`
- Cause: SSH client compatibility (especially Termux)
- Solution: Use the Jupyter interface instead (`http://POD_IP:8888`, password: `rp12345`)
- Alternative: Try RSA keys instead of ed25519
- Details: See `runpod_ssh_troubleshooting.md`
- Check if the `RUNPOD_POD_ID` environment variable is set
- Ensure `runpodctl` is available in the pod
- Check logs for errors
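When debugging termination, it helps to see the shape of the self-termination step the pod is expected to run. The helper below is a sketch of that idea in Python, assuming `runpodctl remove pod <pod_id>` is the command that deletes the current pod and that RunPod injects `RUNPOD_POD_ID` into the environment:

```python
# Hypothetical self-termination helper (the step run_experiment.sh performs last).
import os
import subprocess


def terminate_self() -> None:
    pod_id = os.environ.get("RUNPOD_POD_ID")
    if not pod_id:
        print("RUNPOD_POD_ID not set; are we running inside a RunPod pod?")
        return
    # Assumption: runpodctl ships with RunPod's official images, and
    # `remove pod` deletes the current pod, which stops billing.
    subprocess.run(["runpodctl", "remove", "pod", pod_id], check=False)
```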
- Verify the persistent volume is mounted at `/workspace`
- Check that the volume isn't full
- Ensure write permissions
- Check GPU memory with `nvidia-smi`
- Reduce batch size if OOM
- Check Python dependencies installed correctly
- Always use persistent volumes for artifacts
- Set reasonable timeouts to avoid runaway costs
- Test locally first with small epochs
- Version control your experiment configs
- Monitor GPU usage to optimize batch sizes
- Use structured logging for easy debugging
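To make the structured-logging point concrete, one possible setup is shown below: timestamped, level-tagged lines written both to the console and to the artifacts directory. This is a sketch of the idea, not necessarily how `train_mnist.py` configures logging today:

```python
# One way to get structured, timestamped logs in train_mnist.py.
import logging


def setup_logging(log_path: str) -> logging.Logger:
    """Write identical log lines to stdout and to the artifacts directory."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
        handlers=[logging.FileHandler(log_path), logging.StreamHandler()],
    )
    return logging.getLogger("mnist")
```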
- Develop and test locally (1-2 epochs)
- Deploy to RunPod with small GPU (test run)
- Verify artifacts are saved correctly
- Scale up to larger GPU and full training
- Download results from persistent volume
- Pod auto-terminates, saving costs
- RunPod Documentation: https://docs.runpod.io
- RunPod Discord: https://discord.gg/runpod
- GitHub Issues: Create an issue in your repo
MIT License - Feel free to adapt for your needs!