@slopp
Created December 23, 2024 20:17
Torch Experiment

Goal

  • First attempt at using Torch for some type of "deep" learning
  • Take advantage of Modal to access serverless Python compute, including GPUs

Approach

I find the Torch for R documentation excellent at explaining concepts, so my goal was to follow its getting started guide, but using Python instead of R.

Because I don't have ready access to a GPU, and I wanted my work to be fast despite my slow internet, I decided to use Modal for execution.

Code

The structure of the code is a Modal app, which consists of:

  • an image, where the Python dependencies are specified
  • an entrypoint, which calls the various Modal functions
  • the functions we want to evaluate remotely
  • a Modal volume, which is a persistent disk
The intent of the code is to:

  • download a bunch of images of x/y scatterplots from Kaggle along with some metadata about the true correlation shown in each image
  • cache the downloaded images in a modal volume
  • create a simple CNN following the R tutorial
  • train the CNN on both a CPU and GPU

Results

  • running on the CPU: took ~300s
  • running on the GPU: took ~40s
  • the loss was jumping all over the place, which isn't surprising since the model structure is garbage ... but at least it runs!
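
One detail that may be worth checking alongside the jumpy loss: the script's last layer is `nn.Linear(32 * 148 * 148, 64)`, so the model returns a `[64, 64]` tensor per batch while the labels have shape `[64]`. `nn.MSELoss` broadcasts the two (with a warning) rather than comparing one prediction per image. The sketch below uses a made-up batch of 4 to show the effect; the tensors are illustrative and not taken from the actual run.

```python
import warnings

import torch
import torch.nn as nn

criterion = nn.MSELoss()
targets = torch.tensor([0.0, 1.0, 2.0, 3.0])  # made-up labels, shape [4]

# Pretend every image's prediction equals its own label, repeated over 4 outputs
wide_outputs = targets.unsqueeze(1).repeat(1, 4)  # shape [4, 4]

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the size-mismatch UserWarning
    # The target is broadcast to [4, 4], so image i is compared with every label j
    broadcast_loss = criterion(wide_outputs, targets)

# One output per image, squeezed so its shape matches the labels
single_outputs = targets.unsqueeze(1)  # shape [4, 1]
matched_loss = criterion(single_outputs.squeeze(1), targets)

print(broadcast_loss.item(), matched_loss.item())
```

Even with "perfect" per-image predictions the broadcast loss is nonzero, while the matched version is zero. Under that reading, a layer such as `nn.Linear(32 * 148 * 148, 1)` followed by `squeeze(1)` would give one correlation estimate per image.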

Example Output

(.venv) .venvlopp@Seans-MacBook-Pro torch_experiments % modal run r_get_started_to_py.py
✓ Initialized. View run at https://modal.com/apps/slopp/main/ap-ORU5nv0uFhIhCDqs4AHtRW
✓ Created objects.
├── 🔨 Created mount /Users/lopp/Projects/torch_experiments/r_get_started_to_py.py
├── 🔨 Created mount PythonPackage:_remote_module_non_scriptable
├── 🔨 Created function get_dataset.
├── 🔨 Created function unzip_train.
├── 🔨 Created function unzip.
└── 🔨 Created function train_my_cnn.
Getting started
Using device: cuda
Test model output with no training: tensor([[-0.3558, 0.1729, -0.2098, ..., -0.2338, 0.0097, 0.2568],
        [-0.3577, 0.2050, -0.2150, ..., -0.2148, -0.0254, 0.1179],
        [-0.3023, 0.2556, -0.1702, ..., -0.2177, 0.0264, 0.1928],
        ...,
        [-0.3042, 0.1326, -0.1455, ..., -0.2130, 0.0060, 0.1976],
        [-0.3430, 0.2401, -0.1778, ..., -0.1850, -0.0759, 0.1300],
        [-0.3457, 0.1536, -0.1847, ..., -0.1420, -0.0363, 0.2074]], device='cuda:0', grad_fn=)
True values: tensor([-0.4578, -0.5231, -0.1790, 0.2515, 0.3540, 0.8361, -0.3141, -0.1900,
        -0.0079, 0.5127, -0.6961, -0.6385, 0.3890, 0.7433, 0.3033, 0.5869,
        0.4751, 0.6581, -0.2687, 0.1978, 0.4256, 0.2940, -0.3028, -0.4231,
        0.0289, 0.1835, 0.6971, 0.1370, 0.6549, 0.2446, -0.2694, -0.4043,
        -0.6028, 0.5621, -0.3291, 0.6700, -0.2683, -0.0634, 0.0409, 0.2124,
        -0.2689, 0.4382, 0.1959, -0.5694, 0.4116, -0.5908, 0.2020, -0.5625,
        -0.1433, -0.5540, 0.3926, -0.2108, 0.8185, -0.5357, 0.2938, 0.4767,
        0.3512, -0.1851, -0.5930, 0.0535, 0.4559, 0.4891, 0.3569, 0.1135], device='cuda:0')
MSE with no training: 0.24444007873535156
BATCH: 1
Current loss is: 0.24444007873535156
BATCH: 2
Current loss is: 21744.98828125
BATCH: 3
Current loss is: 1392.033447265625
BATCH: 4
Current loss is: 4206.7177734375
BATCH: 5
Current loss is: 11732.6806640625
BATCH: 6
Current loss is: 9478.251953125
BATCH: 7
Current loss is: 3385.68115234375
BATCH: 8
Current loss is: 88.81295776367188
BATCH: 9
Current loss is: 1193.997802734375
BATCH: 10
Current loss is: 4003.126220703125
BATCH: 11
Current loss is: 5196.607421875
BATCH: 12
Current loss is: 4032.54931640625
BATCH: 13
Current loss is: 1814.0806884765625
BATCH: 14
Current loss is: 245.66452026367188
BATCH: 15
Current loss is: 123.73826599121094
BATCH: 16
Current loss is: 1057.9072265625
BATCH: 17
Current loss is: 2032.654541015625
BATCH: 18
Current loss is: 2302.048828125
BATCH: 19
Current loss is: 1725.1231689453125
BATCH: 20
Current loss is: 806.3592529296875
BATCH: 21
Current loss is: 139.1751251220703
BATCH: 22
Current loss is: 26.688827514648438
BATCH: 23
Current loss is: 370.3856201171875
BATCH: 24
Current loss is: 812.04833984375
BATCH: 25
Current loss is: 1035.07373046875
BATCH: 26
Current loss is: 882.0335693359375
BATCH: 27
Current loss is: 512.5167236328125
BATCH: 28
Current loss is: 161.7069091796875
BATCH: 29
Current loss is: 3.03468918800354
Took 20.191120147705078 seconds
BATCH: 30
Current loss is: 79.19509887695312
MSE at start: 0.24444007873535156
MSE at end: 283.2290344238281
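
The jump from a loss of about 0.24 at batch 1 to about 21745 at batch 2 can be sanity-checked with a back-of-envelope estimate. This is a rough sketch, not a diagnosis. It relies on the fact that Adam's first update moves each weight by roughly the learning rate, and it ignores the convolution's own update and the bias. The variable names are made up for this note.

```python
import math

lr = 0.001                      # Adam learning rate used in the script
fan_in = 32 * 148 * 148         # inputs feeding each output of the linear layer
batch_2_loss = 21744.98828125   # batch 2 loss from the example output above

# If every weight moves by about lr, one output can shift by at most
# roughly lr * sum(activations); with all activations at 1.0 that is lr * fan_in
shift_if_all_ones = lr * fan_in

# Typical prediction error implied by the observed loss
rms_error = math.sqrt(batch_2_loss)

# Average activation that would be enough to produce an error of that size
implied_mean_activation = rms_error / shift_if_all_ones

print(round(shift_if_all_ones, 1), round(rms_error, 1), round(implied_mean_activation, 2))
```

Under these assumptions, a single optimizer step on a layer this wide is enough to push the outputs far outside the [-1, 1] range of a correlation, so a smaller learning rate or a narrower layer may calm the loss down.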

import modal as md
import time
import pathlib
import requests
import zipfile
from PIL import Image
import os
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
import pandas as pd
import torch.nn as nn
import torch.optim as optim
from torch import relu, Tensor, tensor, mean, cuda
import torch

# Setup modal stuff
app = md.App("torch-get-started-py")
my_image = md.Image.debian_slim(python_version="3.10").pip_install(
    "requests", "torch==2.5.1", "torchvision==0.20.1", "pandas"
)
# pro-tip: modal volume ls my-volume extracted/training to interact locally
volume = md.Volume.from_name("my-volume", create_if_missing=True)
p = pathlib.Path("/root/data/")

# functions to download and unzip from Posit's CDN copy of the kaggle dataset
@app.function(volumes={"/root/data": volume}, image=my_image)
def get_dataset():
    """Download the raw zip to the volume"""
    dataset = "https://torch-cdn.mlverse.org/datasets/guess-the-correlation.zip"
    try:
        response = requests.get(dataset, stream=True)
        response.raise_for_status()
        with open(p.joinpath(pathlib.Path("data.zip")), "wb") as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
        print("ZIP file downloaded successfully and saved to data.zip")
        volume.commit()
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")


@app.function(volumes={"/root/data": volume}, image=my_image)
def unzip():
    """Unzip the top-level dataset which creates metadata files and a zip image dir"""
    with zipfile.ZipFile(p.joinpath(pathlib.Path("data.zip"))) as zip_ref:
        zip_ref.extractall(p.joinpath(pathlib.Path("extracted")))
    volume.commit()


@app.function(volumes={"/root/data": volume}, image=my_image)
def unzip_train():
    """Unzip the image dir into individual training images"""
    with zipfile.ZipFile(
        p.joinpath(
            pathlib.Path("extracted"),
            pathlib.Path("train_imgs.zip"),
        )
    ) as zip_ref:
        zip_ref.extractall(p.joinpath(pathlib.Path("extracted/training")))
    volume.commit()


class GuessCorrelationDataset(Dataset):
    def __init__(self, p: pathlib.Path):
        self.p = p  # the directory to the dataset extract, eg /root/data/extracted
        self.img_path = self.p.joinpath(pathlib.Path("training/train_imgs"))
        self.files = os.listdir(self.img_path)
        self.train_metadata = pd.read_csv(self.p.joinpath(pathlib.Path("train.csv")))

    def __len__(self):
        return 64 * 30  # limit to 30 batches of 64 images

    def __getitem__(self, idx):
        """
        Load an individual image as a tensor, plus the correlation label into a dict
        idx: numeric id, eg get the 1st value from the list of image paths
        Returns a dict with:
        x: the tensor representation of the grayscale image
        id: the actual image id (here, the image file name)
        corr: the true correlation depicted in the image as a tensor
        """
        img = Image.open(self.img_path.joinpath(pathlib.Path(self.files[idx]))).convert(
            "RGB"
        )
        transform = transforms.ToTensor()
        rgb_img = transform(img)
        to_grayscale = transforms.Grayscale(num_output_channels=1)
        img_grayscale = to_grayscale(rgb_img)
        id = self.files[idx]
        # NOTE: the image is the idx-th entry of os.listdir (arbitrary order) while the
        # label is row idx of train.csv, so they only match if the two orders agree
        corr = float(self.train_metadata.iloc[idx]["corr"])
        corr = tensor(corr)
        return {"x": img_grayscale, "id": id, "corr": corr}


class myCNN(nn.Module):
    def __init__(self):
        super(myCNN, self).__init__()
        # 1 convolutional layer, input channels = 1 (grayscale), output channels = 32, kernel size = 3
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)
        # a fully connected layer for the output
        # The input image size is 150 x 150; the 3 x 3 kernel with no padding shrinks it to 148 x 148.
        # The second arg (64) is out_features, the number of outputs per image, not the batch size,
        # so the model returns a [batch_size, 64] tensor.
        self.fc = nn.Linear(32 * 148 * 148, 64)

    def forward(self, x):
        x = self.conv1(x)  # Apply convolution
        x = relu(x)  # Apply ReLU activation
        # print(f"The shape after the convolution layer is: {x.shape}")
        x = x.view(
            x.size(0),  # batch size, read from the input instead of hard-coding 64
            -1,  # -1 here means "view figure out what size this needs to be"
        )  # flatten it to the shape [batch_size, linear input of 148x148x32]
        # print(f"The shape after the reshaping is: {x.shape}")
        x = self.fc(x)  # Fully connected layer
        return x


@app.function(
    volumes={"/root/data": volume}, image=my_image, gpu="L40S"
)  # remove the gpu argument here to run on the CPU instead
def train_my_cnn():
    device = torch.device("cuda" if cuda.is_available() else "cpu")
    print(f"Using device: {device}")
    dataset = GuessCorrelationDataset(p=pathlib.Path("/root/data/extracted"))
    dataloader = DataLoader(dataset=dataset, batch_size=64)
    model = myCNN().to(device=device)

    # Just some testing / printing going on here
    first_batch = next(iter(dataloader))
    first_tensors: Tensor = first_batch["x"].to(device)
    true_corrs: Tensor = first_batch["corr"].to(device)
    model_output_no_training = model(first_tensors)
    print(f"Test model output with no training: {model_output_no_training}")
    print(f"True values: {true_corrs}")
    mse = mean((model_output_no_training - true_corrs) ** 2)
    print(f"MSE with no training: {mse}")

    # Begin Training
    criterion = nn.MSELoss()  # use MSE since this is basically a regression problem
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    model.train()  # Set the model to training mode
    for b, batch in enumerate(dataloader, start=1):
        print(f"BATCH: {b}")
        inputs = batch["x"].to(device)
        true_values = batch["corr"].to(device)
        # Zero the gradients
        optimizer.zero_grad()
        # Forward pass
        outputs = model(inputs)
        # Calculate the loss
        loss = criterion(outputs, true_values)
        # Backward pass and optimize
        loss.backward()
        optimizer.step()
        # Print the statistics after each batch
        print(f"Current loss is: {loss.item()}")

    # Compare the first batch before and after training
    print(f"MSE at start: {mse}")
    model_output_with_training = model(first_tensors)
    mse = mean((model_output_with_training - true_corrs) ** 2)
    print(f"MSE at end: {mse}")


@app.local_entrypoint()
def main():
    print("Getting started")
    start_time = time.time()
    # get_dataset.remote()
    # unzip.remote()
    # unzip_train.remote()
    train_my_cnn.remote()
    end_time = time.time()
    print(f"Took {end_time - start_time} seconds")
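
One thing to watch in `GuessCorrelationDataset`: the image comes from `os.listdir`, whose order is arbitrary, while the label comes from row `idx` of `train.csv`, so the two may not refer to the same scatterplot. A minimal sketch of a lookup keyed on the image id follows. It assumes `train.csv` has an `id` column that matches the file name without its extension (the script itself only confirms the `corr` column); the CSV content, file names, and `label_for` helper are made up for illustration.

```python
import csv
import io
from pathlib import Path

# Stand-in for train.csv; the "id" column is an assumption about the real file
fake_csv = io.StringIO("id,corr\nabc,0.25\nxyz,-0.5\n")
corr_by_id = {row["id"]: float(row["corr"]) for row in csv.DictReader(fake_csv)}

# Stand-in for os.listdir output, deliberately in a different order than the CSV
files = ["xyz.png", "abc.png"]


def label_for(file_name: str) -> float:
    """Look up the correlation by image id instead of by position."""
    return corr_by_id[Path(file_name).stem]


labels = [label_for(f) for f in files]
print(labels)
```

With pandas, the same idea could be expressed by indexing the metadata on that column once in `__init__`, for example `self.train_metadata.set_index("id")["corr"]`, again assuming the column is named `id`.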