@f0ster
f0ster / deepseek-v3-tech-dive-ptx.ipynb
Last active February 18, 2025 17:09
💥 Smashing the Tariffs for Fun and Profit: How DeepSeek v3 Outsmarted the AI Ban 🧠🚀
@f0ster
f0ster / deepseek-v3-tech-dive.ipynb
Last active February 18, 2025 01:57
Smashing the Tariffs for Fun and Profit: How DeepSeek v3 Outsmarted the AI Ban
@f0ster
f0ster / deepseek-v3-tech-dive.md
Created February 17, 2025 20:03
Smashing the Tariffs for Fun and Profit: How DeepSeek v3 Outsmarted the AI Ban

1. CUDA and PTX Optimizations

DeepSeek-V3’s engineers optimized GPU performance at a low level by tailoring kernels and memory access patterns to NVIDIA’s hardware. A key strategy was warp specialization: they dedicated a subset of GPU warps (groups of threads) to communication tasks, allowing compute to overlap with data transfers (DeepSeek-V3 Technical Report). In practice, only about 20 of the GPU’s Streaming Multiprocessors (SMs) were reserved to handle all cross-node communication, enough to saturate both InfiniBand (IB) and NVLink bandwidth, while the remaining SMs focused purely on computation ([DeepSeek-V3 Technical Report](https://arx
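
The role split described above (a few workers dedicated to moving data, the rest dedicated to computing) can be sketched with a CPU-thread analogy in Python. This is not CUDA and not DeepSeek's kernel code; `run_pipeline`, `communication_worker`, and `compute_worker` are hypothetical names chosen for illustration. Because of Python's global interpreter lock, it shows only the separation of roles, not a real speedup.

```python
import queue
import threading


def run_pipeline(chunks, num_compute_workers=3):
    """Square-sum each chunk on compute threads while one dedicated
    thread 'transfers' finished results (hypothetical analogy only)."""
    outbox = queue.Queue()
    sent = []  # results the communication worker has "transferred"

    def communication_worker():
        # Dedicated role: only moves data, never computes
        # (loosely analogous to the warps/SMs reserved for communication).
        while True:
            item = outbox.get()
            if item is None:  # sentinel: no more results are coming
                break
            sent.append(item)

    def compute_worker(my_chunks):
        # Dedicated role: only computes, then hands the result off
        # and moves on to the next chunk without waiting for the transfer.
        for index, chunk in my_chunks:
            outbox.put((index, sum(x * x for x in chunk)))

    comm = threading.Thread(target=communication_worker)
    comm.start()

    indexed = list(enumerate(chunks))
    workers = [
        threading.Thread(
            target=compute_worker,
            args=(indexed[i::num_compute_workers],),
        )
        for i in range(num_compute_workers)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()

    outbox.put(None)  # tell the communication worker to stop
    comm.join()

    # Restore the original chunk order before returning the values.
    return [value for _, value in sorted(sent)]


if __name__ == "__main__":
    print(run_pipeline([[1, 2], [3], [4, 5, 6]]))
```

The compute threads never block on the "transfer" itself; they enqueue a result and continue, which is the overlap the paragraph describes. On a GPU the same idea is expressed by branching on the warp or block index inside a kernel rather than by spawning threads.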

@f0ster
f0ster / add_arrays.cu
Created January 7, 2025 18:48
Custom CUDA Kernel Example
#include <iostream>
#include <cuda.h>
// CUDA Kernel
__global__ void add_arrays(float *a, float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
@f0ster
f0ster / .tmux.conf
Last active May 12, 2024 01:22
modern osx tmux conf
# ~/.tmux.conf
# General settings
set -g default-terminal "screen-256color" # Use 256-color terminal
set -g history-limit 5000 # Increase scrollback buffer size
set -g base-index 0 # Start window indexes at 0
set -g mouse on # Enable mouse control (pane selection, resizing, scrolling)
# Restore original prefix key
set-option -g prefix C-b # Set prefix to Ctrl-b
@f0ster
f0ster / inspect_git_repos.py
Last active May 11, 2024 18:46
Summarize git repositories
# script to provide a summary of the repositories by only listing each one's name
# along with its status (public, public with changes, or private)
import os
import subprocess
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
def execute_command(command, cwd):
    """Executes a shell command in a specified directory and returns the output."""
@f0ster
f0ster / mixtral_demo.py
Created April 28, 2024 15:30
Running mistralai mixtral locally
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
def load_model_and_tokenizer(model_id):
    """
    Load the tokenizer and model based on the specified model ID.
    Model is set to use float16 for computation to reduce memory usage and improve performance.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_id)
@f0ster
f0ster / accelerate_presharder.py
Created April 28, 2024 14:08
CLI for sharding and publishing models to huggingface
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from accelerate import Accelerator
import os
import argparse
def main():
    # Parse command line arguments
    args = parse_args()
@f0ster
f0ster / big_sharder.py
Created April 28, 2024 04:28
Shard Large LLM models
import os
import json
import sys
import torch
import glob
def load_parameters(directory):
    """ Load model parameters from a JSON file. """
    with open(os.path.join(directory, 'params.json'), 'r') as file:
        return json.load(file)
@f0ster
f0ster / finetune_dance_diffusion.ipynb
Created February 19, 2024 01:44
Finetune_Dance_Diffusion.ipynb