Ferdinand Mom (3outeille) · GitHub Gists
@3outeille
3outeille / pipeline_parallel.py
Last active November 15, 2024 19:37
Self-contained example of how pipeline parallelism works (AFAB and 1F1B) in 200 LOC
# VERBOSE=0 torchrun --nproc_per_node 3 self_contained_pp_LOC.py
import os, random, numpy as np, torch, torch.nn as nn, torch.distributed as dist, torch.nn.functional as F
from torch.optim import AdamW
from torch.utils.data import DataLoader, DistributedSampler
from datasets import load_dataset
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
STEP, local_rank, world_size, verbose = 0, int(os.environ["LOCAL_RANK"]), int(os.environ["WORLD_SIZE"]), os.environ.get("VERBOSE", "0") == "1"
def set_all_seed(seed):
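The preview cuts off inside set_all_seed. A plausible completion, assuming the usual PyTorch seeding pattern (not verified against the full gist):

def set_all_seed(seed):
    random.seed(seed)                 # Python's RNG
    np.random.seed(seed)              # NumPy's RNG
    torch.manual_seed(seed)           # torch CPU generator
    torch.cuda.manual_seed_all(seed)  # every CUDA device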
@3outeille
3outeille / pipeline-model-parallel-visualization.ipynb
Created June 14, 2024 19:58 — forked from sighingnow/pipeline-model-parallel-visualization.ipynb
Visualizing various pipeline model parallel scheduling algorithms: GPipe, PipeDream (1F1B), PipeDream-2BW (async, no flushes), and eager-1F1B
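The notebook itself does not render here, but the two classic schedules it visualizes are easy to sketch. A minimal stand-in that prints the per-stage operation order for GPipe (all-forward-all-backward) and 1F1B, ignoring communication latency (all names illustrative, not code from the notebook):

S, M = 4, 8  # stages, microbatches

def gpipe(stage):
    # every forward first, then every backward
    return [f"F{m}" for m in range(M)] + [f"B{m}" for m in range(M)]

def one_f_one_b(stage):
    warmup = min(S - 1 - stage, M)                  # later stages warm up less
    ops = [f"F{m}" for m in range(warmup)]
    for m in range(warmup, M):                      # steady state: 1 forward, 1 backward
        ops += [f"F{m}", f"B{m - warmup}"]
    ops += [f"B{m}" for m in range(M - warmup, M)]  # cooldown backwards
    return ops

for name, sched in (("GPipe", gpipe), ("1F1B", one_f_one_b)):
    print(name)
    for s in range(S):
        print(f"  stage {s}: " + " ".join(sched(s)))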
@3outeille
3outeille / test_hf.py
Last active December 11, 2023 11:48
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
import torch
from torch.nn import functional as F
from torch import distributed as dist
import os
import numpy as np
import random
def set_random_seed(seed: int):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)  # plausible completion; the preview truncates here

# (the duplicate imports below suggest a second file in the same gist)
from copy import deepcopy
import torch
from datasets import load_dataset
from torch.optim import SGD
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer
import random
import os
import numpy as np
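The title's error reproduces in two lines: in-place ops on a leaf tensor that requires grad are rejected because autograd could no longer compute its gradient. A minimal demo (not from the gist):

import torch

x = torch.ones(3, requires_grad=True)  # leaf tensor tracked by autograd
# x += 1  # RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

with torch.no_grad():
    x += 1       # fine: the mutation is hidden from autograd
y = x + 1        # also fine: out-of-place op creates a non-leaf result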
@3outeille
3outeille / full_cpu.py
Last active May 11, 2023 09:37
RWKV perplexity measure
import torch
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
# Model
device = "cpu"
device_map = {
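The preview truncates at the device_map. For context, a minimal sketch of the perplexity loop such a script typically runs; the checkpoint name, dataset, and stride are assumptions, not the gist's actual values:

import math, torch
from tqdm import tqdm
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RWKV/rwkv-4-169m-pile"  # illustrative RWKV checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids
stride, nll_sum, n_tokens = 1024, 0.0, 0
for i in tqdm(range(0, ids.size(1) - 1, stride)):
    chunk = ids[:, i : i + stride]
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss  # HF shifts labels internally
    nll_sum += loss.item() * (chunk.size(1) - 1)
    n_tokens += chunk.size(1) - 1
print("perplexity:", math.exp(nll_sum / n_tokens))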
@3outeille
3outeille / README.md
Last active April 14, 2023 13:55
Triton Matmul Group-ordering vs Row-major ordering
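The README body is not shown here. For context, the comparison is the launch-order swizzle from the Triton matmul tutorial: row-major ordering walks output tiles one row at a time, while group-ordering processes GROUP_M rows of tiles together so consecutive programs reuse the same slices of A and B in L2. A pure-Python sketch of the two index mappings (names follow the tutorial; illustrative only):

def row_major_pid(pid, num_pid_m, num_pid_n):
    return pid // num_pid_n, pid % num_pid_n

def grouped_pid(pid, num_pid_m, num_pid_n, GROUP_M=8):
    num_pid_in_group = GROUP_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_M)  # last group may be short
    pid_m = first_pid_m + (pid % group_size_m)
    pid_n = (pid % num_pid_in_group) // group_size_m
    return pid_m, pid_n

for pid in range(16):  # 4x4 grid of tiles, groups of 2 rows
    print(pid, row_major_pid(pid, 4, 4), grouped_pid(pid, 4, 4, GROUP_M=2))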
@3outeille
3outeille / README.md
Last active March 28, 2023 19:22
race condition fuck my life
  • Problem: Some blocks are scheduled later than others, which means we won't have the "true max value" at the time we need it.
  • Direction: We need a way to wait for all threads of all blocks to finish.
  • Solution (option 1 is sketched after this list):
      1. Split into 2 kernels
      2. Use cooperative groups: https://numba.readthedocs.io/en/stable/cuda/cooperative_groups.html
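A hedged sketch of option 1 in Numba (illustrative names; sizes chosen so the second pass fits in a single block): each block writes its local max to a partial array, and the kernel-launch boundary provides the grid-wide barrier that no in-kernel sync can.

import numpy as np
from numba import cuda, float32

TPB = 256  # threads per block

@cuda.jit
def block_max(x, out):
    sm = cuda.shared.array(TPB, float32)
    tid = cuda.threadIdx.x
    i = cuda.grid(1)
    sm[tid] = x[i] if i < x.shape[0] else -3.4e38  # pad out-of-range threads
    cuda.syncthreads()
    s = TPB // 2
    while s > 0:  # standard shared-memory tree reduction
        if tid < s and sm[tid + s] > sm[tid]:
            sm[tid] = sm[tid + s]
        cuda.syncthreads()
        s //= 2
    if tid == 0:
        out[cuda.blockIdx.x] = sm[0]

x = cuda.to_device(np.random.rand(TPB * TPB).astype(np.float32))
partial = cuda.device_array(TPB, np.float32)
result = cuda.device_array(1, np.float32)
block_max[TPB, TPB](x, partial)     # pass 1: one max per block
block_max[1, TPB](partial, result)  # pass 2: reduce the partials
print(result.copy_to_host()[0], x.copy_to_host().max())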
@3outeille
3outeille / CMakeLists.txt
Last active November 21, 2022 00:25
CUDA experiment bank conflict shared memory (with a CMakeLists)
# To run
# mkdir build && cd build
# cmake ..
# make -j && ./bank_conflict <offset> <is_debug>
cmake_minimum_required(VERSION 3.0)
set(CMAKE_CXX_FLAGS "-O3 -std=c++14")
set(CUDA_NVCC_FLAGS -arch=compute_52 -code=sm_75)
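For context, a rough Python analogue of what the experiment measures (Numba here, not the gist's CUDA source; illustrative only): each thread in a warp reads shared memory at a configurable stride. With 32 four-byte banks, stride 1 touches 32 distinct banks (conflict-free), while stride 32 sends every thread to bank 0, a 32-way conflict that shows up as a slowdown when the kernel is timed.

import numpy as np
from numba import cuda, float32

SMEM = 32 * 32

@cuda.jit
def strided_read(out, stride):
    sm = cuda.shared.array(SMEM, float32)
    tid = cuda.threadIdx.x
    sm[tid] = tid
    cuda.syncthreads()
    # bank = word_index % 32; the stride decides how many threads collide
    out[tid] = sm[(tid * stride) % SMEM]

out = cuda.device_array(32, np.float32)
strided_read[1, 32](out, 1)   # conflict-free
strided_read[1, 32](out, 32)  # 32-way conflict on bank 0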
@3outeille
3outeille / c-cpp-oops.md
Created October 2, 2022 06:37 — forked from ayan-b/c-cpp-oops.md
C, C++ & OOPS for Interviews