sudo apt update
sudo apt install software-properties-common -y
import torch
from torch.utils.flop_counter import FlopCounterMode
from triton.testing import do_bench

def get_flops_achieved(f):
    flop_counter = FlopCounterMode(display=False)
    with flop_counter:
        f()  # run once under the FLOP counter to record the work done
    total_flops = flop_counter.get_total_flops()
    ms_per_iter = do_bench(f)  # median wall-clock time per iteration, in ms
    iters_per_second = 1e3 / ms_per_iter
    print(f"{iters_per_second * total_flops / 1e12} TF/s")
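A hypothetical usage of the helper above on a plain bf16 matmul (the shapes and dtype here are assumptions, not from the original):

a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
get_flops_achieved(lambda: torch.mm(a, b))  # prints achieved TF/s for the matmul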
# Benchmark relative performance of torch.mm and torch.bmm with a single batch
import torch
import time

def benchmark_fn(fn, args, warmup=5, cycles=300, use_kineto=False) -> float:
    if use_kineto:
        with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) as p:
            fn(*args)
        return sum(e.cuda_time for e in p.key_averages())  # total CUDA time, in us
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(cycles):
        fn(*args)
    torch.cuda.synchronize()
    return (time.time() - start) / cycles  # seconds per iteration
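A minimal sketch of how benchmark_fn might drive the mm/bmm comparison the comment describes (matrix sizes are assumptions):

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
mm_s = benchmark_fn(torch.mm, (a, b))
bmm_s = benchmark_fn(torch.bmm, (a.unsqueeze(0), b.unsqueeze(0)))
print(f"mm: {mm_s * 1e6:.1f} us/iter, bmm (batch=1): {bmm_s * 1e6:.1f} us/iter")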
--------------------------------------------------------------------------
# ofed_info -s
--------------------------------------------------------------------------
Find Mellanox Adapter Type and Firmware/Driver version
ConnectX-3 card:
# lspci | grep Mellanox
0a:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
# lspci -vv -s 0a:00.0 | grep "Part number" -A 3
# lspci | grep Mellanox | awk '{print $1}' | xargs -i -r mstvpd {}
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache
import torch
from typing import Optional

device = "cuda"

# Copied from the gpt-fast repo
def multinomial_sample_one_no_sync(probs_sort):  # Does multinomial sampling without a cuda synchronization
    q = torch.empty_like(probs_sort).exponential_(1)
    return torch.argmax(probs_sort / q, dim=-1, keepdim=True).to(dtype=torch.int)
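This helper uses the exponential-race (Gumbel-max style) trick: dividing the probabilities by i.i.d. Exponential(1) noise and taking the argmax selects index i with probability proportional to probs_sort[i], without the device synchronization that torch.multinomial incurs. A hypothetical usage (vocab size and temperature are assumed values):

logits = torch.randn(1, 32000, device=device)
probs = torch.softmax(logits / 0.8, dim=-1)  # temperature 0.8 is an assumption
next_token = multinomial_sample_one_no_sync(probs)  # shape (1, 1), dtype int32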
import os
import asyncio
import subprocess
import time
from typing import List, Dict

import torch
from openai import AsyncOpenAI
from tqdm.asyncio import tqdm
import logging
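These imports are typically combined along the following lines; everything below (the helper name run_benchmark, the endpoint URL, and the model name) is an assumed sketch, not code from the original:

async def run_benchmark(prompts: List[str]) -> Dict[str, float]:
    # Assumed OpenAI-compatible endpoint, e.g. a local vLLM server
    client = AsyncOpenAI(base_url="http://localhost:9999/v1", api_key="EMPTY")

    async def one_request(prompt: str) -> float:
        start = time.perf_counter()
        await client.chat.completions.create(
            model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model
            messages=[{"role": "user", "content": prompt}],
        )
        return time.perf_counter() - start

    # tqdm.asyncio's gather shows a progress bar over the concurrent requests
    latencies = await tqdm.gather(*(one_request(p) for p in prompts))
    return {"mean_latency_s": sum(latencies) / len(latencies)}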
git clone https://github.com/vllm-project/vllm
cd vllm/benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
mkdir results
python benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --port 9999
This doc serves as a quick reference for the _scaled_mm
API and how it has changed over time across major versions of PyTorch.
NOTE The leading underscore is intentional, and we make no FC/BC (forward/backward compatibility) guarantees on this API. That said, it is currently the only op with native support for FP8 matmuls in the PyTorch library. We are planning to add an official public API for this; until then, the API is subject to change, but you can use this doc as a reference.
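For orientation, a hedged sketch of one call against the recent (PyTorch 2.4-style) signature, where _scaled_mm returns a single tensor rather than the earlier (out, amax) pair; the shapes, scales, and dtypes below are assumptions:

import torch

# FP8 matmul: matrix dims must be multiples of 16, and the second operand
# must be column-major, which is why b is created transposed
a = torch.randn(16, 32, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(16, 32, device="cuda").to(torch.float8_e4m3fn).t()
scale_a = torch.tensor(1.0, device="cuda")  # assumed per-tensor scales
scale_b = torch.tensor(1.0, device="cuda")
out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)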
import torch
torch.manual_seed(42)

def torch_sdpa(query, key, value):
    out, lse, cum_seq_q, cum_seq_k, max_q, max_k, philox_seed, philox_offset, debug_attn_mask = (
        torch.ops.aten._scaled_dot_product_cudnn_attention(
            query=query,
            key=key,
            value=value,
            attn_bias=None,  # trailing args assume the recent aten op signature
            compute_log_sumexp=False,
        )
    )
    return out
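A hypothetical invocation with assumed shapes (batch=2, heads=8, seq_len=128, head_dim=64) in fp16 on CUDA, which the cuDNN attention backend expects:

q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
out = torch_sdpa(q, k, v)
print(out.shape)  # torch.Size([2, 8, 128, 64])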