Ryan Foster f0ster

Smashing the Tariffs for Fun and Profit: How DeepSeek v3 Outsmarted the AI Ban

1. CUDA and PTX Optimizations

DeepSeek-V3’s engineers optimized GPU performance at a low level by tailoring kernels and memory access patterns to NVIDIA’s hardware. A key strategy was warp specialization: they dedicated a subset of GPU warps (groups of threads) to communication tasks, allowing computation to overlap with data transfers (DeepSeek-V3 Technical Report). In practice, only about 20 of the GPU’s Streaming Multiprocessors (SMs) were reserved to handle all cross-node communication, which was enough to saturate both InfiniBand (IB) and NVLink bandwidth, while the remaining SMs focused purely on computation ([DeepSeek-V3 Technical Report](https://arx
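
The overlap idea can be illustrated with a small CPU-side analogy. This is only a sketch under stated assumptions, not DeepSeek's implementation: `time.sleep` stands in for both a compute kernel and a network transfer, a Python thread stands in for the warps dedicated to communication, and the name `run_chunks` and all timings are hypothetical.

```python
import queue
import threading
import time


def run_chunks(num_chunks, compute_s, comm_s, overlap):
    """Return wall-clock seconds to process and 'send' num_chunks chunks.

    Assumption: time.sleep is a stand-in for real compute and transfer work.
    """
    start = time.perf_counter()

    if not overlap:
        # Baseline: the same worker computes, then waits for the transfer.
        for _ in range(num_chunks):
            time.sleep(compute_s)
            time.sleep(comm_s)
        return time.perf_counter() - start

    outbox = queue.Queue()

    def comm_worker():
        # Dedicated communication worker: drains the outbox until it sees None.
        while True:
            chunk = outbox.get()
            if chunk is None:
                break
            time.sleep(comm_s)

    sender = threading.Thread(target=comm_worker)
    sender.start()
    for chunk_id in range(num_chunks):
        time.sleep(compute_s)   # compute the next chunk...
        outbox.put(chunk_id)    # ...while earlier chunks are still being sent
    outbox.put(None)            # sentinel: no more chunks
    sender.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    serial = run_chunks(5, 0.02, 0.02, overlap=False)
    overlapped = run_chunks(5, 0.02, 0.02, overlap=True)
    print(f"serial: {serial:.3f}s, overlapped: {overlapped:.3f}s")
```

With a dedicated sender, the total time approaches the compute time plus one trailing transfer instead of the sum of both. That is the same reason a small, fixed communication budget can hide transfer latency behind computation.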

	<!DOCTYPE html>
	<html lang="en">
	<head>
	<meta charset="UTF-8">
	<meta name="viewport" content="width=device-width, initial-scale=1.0">
	<title>Horizontal Binary Clock</title>
	<style>
	body {
	    background-color: #1a1a1a;
	    /* Center the main wrapper on the page */


	#include <iostream>
	#include <cuda.h>

	// CUDA Kernel
	__global__ void add_arrays(const float *a, const float *b, float *c, int n) {
	    // Each thread handles one element; the guard skips threads past the end.
	    int idx = blockIdx.x * blockDim.x + threadIdx.x;
	    if (idx < n) {
	        c[idx] = a[idx] + b[idx];
	    }
	}

	# ~/.tmux.conf

	# General settings
	set -g default-terminal "screen-256color" # Use 256-color terminal
	set -g history-limit 5000 # Increase scrollback buffer size
	set -g base-index 0 # Start window indexes at 0
	set -g mouse on # Enable mouse control (pane selection, resizing, scrolling)

	# Restore original prefix key
	set-option -g prefix C-b # Set prefix to Ctrl-b

	# script to provide a summary of the repositories by only listing each one's name
	# along with its status (public, public with changes, or private)

	import os
	import subprocess
	import json
	from concurrent.futures import ThreadPoolExecutor, as_completed

	def execute_command(command, cwd):
	    """Executes a shell command in a specified directory and returns the output."""

	import time
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	def load_model_and_tokenizer(model_id):
	    """
	    Load the tokenizer and model based on the specified model ID.
	    Model is set to use float16 for computation to reduce memory usage and improve performance.
	    """
	    tokenizer = AutoTokenizer.from_pretrained(model_id)

	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM
	from accelerate import Accelerator
	import os
	import argparse

	def main():
	    # Parse command line arguments
	    args = parse_args()

	import os
	import json
	import sys
	import torch
	import glob

	def load_parameters(directory):
	    """ Load model parameters from a JSON file. """
	    with open(os.path.join(directory, 'params.json'), 'r') as file:
	        return json.load(file)