152334H 152334H

@drisspg
drisspg / scaled_mm_api.md
Last active February 8, 2025 16:03
Scaled MM API

Summary

This doc serves as a quick reference for the _scaled_mm API and how it has changed over time across major versions of PyTorch.


NOTE: The leading underscore is intentional, and we currently make no FC/BC guarantees for this API. That said, it is currently the only op with native support for FP8 matmuls within the PyTorch library. We are planning to make an official public API for this. Until then it is subject to change, but you can use this doc as a reference.


@gd3kr
gd3kr / script.js
Created February 15, 2024 06:30
Download a JSON List of twitter bookmarks
/*
the twitter api is stupid. it is stupid and bad and expensive. hence, this.
Literally just paste this in the JS console on the bookmarks tab and the script will automatically scroll to the bottom of your bookmarks and keep a track of them as it goes.
When finished, it downloads a JSON file containing the raw text content of every bookmark.
for now it stores just the text inside the tweet itself, but if you're reading this why don't you go ahead and try to also store other information (author, tweetLink, pictures, everything). come on. do it. please?
*/
@Chillee
Chillee / mfu_compute.py
Last active March 2, 2025 22:10
Compute Flop Utilization in PyTorch
import torch
from torch.utils.flop_counter import FlopCounterMode
from triton.testing import do_bench

def get_flops_achieved(f):
    flop_counter = FlopCounterMode(display=False)
    with flop_counter:
        f()
    total_flops = flop_counter.get_total_flops()
    ms_per_iter = do_bench(f)
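
The preview above stops once the FLOP count and the per-iteration time are known. A minimal sketch of the arithmetic that would typically follow, assuming `ms_per_iter` is in milliseconds as the name suggests. The helper names and the `peak_tflops` parameter are hypothetical and not taken from the gist; the peak has to come from the hardware's spec sheet.

```python
def achieved_tflops(total_flops: float, ms_per_iter: float) -> float:
    """Convert a FLOP count and a per-iteration time in milliseconds to TFLOP/s."""
    iters_per_second = 1e3 / ms_per_iter
    return iters_per_second * total_flops / 1e12


def flop_utilization(total_flops: float, ms_per_iter: float, peak_tflops: float) -> float:
    """Fraction of an assumed hardware peak (in TFLOP/s) that was achieved."""
    return achieved_tflops(total_flops, ms_per_iter) / peak_tflops


if __name__ == "__main__":
    # Hypothetical numbers: 2e12 FLOPs per call, 10 ms per call, 400 TFLOP/s peak.
    print(achieved_tflops(2e12, 10.0))          # 200.0
    print(flop_utilization(2e12, 10.0, 400.0))  # 0.5
```

Keeping the unit conversion separate from the ratio makes it easy to report raw TFLOP/s when no trustworthy peak figure is available.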

Typical approaches to training, and sampling from, denoising diffusion models yield results whose per-item means match the initial input - i.e. zero when using i.i.d. samples from a standard normal distribution. This has major implications for what outputs can be obtained from popular text-to-image generative models, see e.g. https://twitter.com/apeoffire/status/1624884816851206145 and https://www.crosslabs.org/blog/diffusion-with-offset-noise.

It also means we can reliably produce dark, bright, or tinted images by shifting the input to a desired color.

Now, I was curious what would happen if I made Stable Diffusion denoise an "impossible" image whose mean color exceeds the [0,1] valid RGB range:

init_latent = vae_encode(tensor([1.5, 1.5, 1.5])[None,:,None,None].tile(1,1,512,512)) + sigma_max * randn(1,4,64,64)
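
The mean-preservation claim can be checked numerically without a diffusion model. The sketch below is a toy illustration only: `noised_mean` is a hypothetical helper, a flat list stands in for one 64x64 channel, and the VAE encoding step from the line above is skipped entirely.

```python
import random
from statistics import fmean


def noised_mean(init_value: float, sigma: float, n: int = 64 * 64, seed: int = 0) -> float:
    """Mean of a constant 'channel' after adding i.i.d. Gaussian noise scaled by sigma."""
    rng = random.Random(seed)
    noised = [init_value + sigma * rng.gauss(0.0, 1.0) for _ in range(n)]
    return fmean(noised)


if __name__ == "__main__":
    # The last value lies outside the valid [0, 1] range, like the "impossible" image.
    for value in (0.0, 0.5, 1.5):
        print(value, round(noised_mean(value, sigma=1.0), 3))
```

Because the noise averages out to roughly zero over many elements, the mean of the noised channel stays close to the initial value, including for the out-of-range 1.5. Larger noise scales need more elements for the mean to stay equally tight.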
@agnostic-apollo
agnostic-apollo / Android-Phantom,Cached-And-Empty-Processes.md
Last active February 23, 2025 01:52
Android Phantom, Cached And Empty Processes
@sheepymeh
sheepymeh / README.md
Last active March 4, 2022 06:11
google-drive-restricted-download

Downloads restricted PDFs from Google Drive that cannot normally be downloaded. Uses A4 paper size and scales accordingly. Copy and paste script.js into the console and press Enter.

Ensure that your system has enough RAM before starting.
Thanks to this comment for zoom-in code.

@joshlk
joshlk / 0_nvidia_benchmark.md
Last active January 22, 2025 07:40
Benchmark bandwidth and latency of P2P NVIDIA GPUs (NVLINK vs PCI)

NVIDIA GPU P2P Benchmark bandwidth/throughput and latency

Using https://github.com/NVIDIA/cuda-samples

You can also view the GPU topology using nvidia-smi topo -m

  1. Download the repo: git clone https://github.com/NVIDIA/cuda-samples.git
  2. Check out the tag that corresponds to your CUDA version: git checkout tags/v11.1
  3. You might need to install some additional packages: sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev libglfw3-dev libgles2-mesa-dev
  4. Either build everything by running make in the root dir, or build only the benchmark: cd Samples/p2pBandwidthLatencyTest; make