152334H

Summary

This doc serves as a quick reference for the _scaled_mm API and how it has changed over time for each major version of PyTorch.

NOTE The leading underscore is intentional, and we currently make no FC/BC guarantees for this API. That said, it is currently the only op with native support for FP8 matmuls in the PyTorch library. We plan to make an official public API for this. Until then it is subject to change, but you can use this doc as a reference.

Typical approaches to training, and sampling from, denoising diffusion models yield results whose per-item means match the initial input - i.e. zero when using i.i.d. samples from a standard normal distribution. This has major implications for what outputs can be obtained from popular text-to-image generative models, see e.g. https://twitter.com/apeoffire/status/1624884816851206145 and https://www.crosslabs.org/blog/diffusion-with-offset-noise.

It also means we can reliably produce dark, bright, or tinted images by shifting the input to a desired color.
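The zero-mean property of the starting noise is easy to check numerically. The sketch below uses only the Python standard library and assumes a latent shape of 4×64×64 (the shape passed to `randn` in this note). The `channel_means` helper and the offset of 0.5 are hypothetical, chosen only for illustration. It shows the statistics of the noise itself and does not run any diffusion model.

```python
import random
import statistics

def channel_means(channels=4, height=64, width=64, offset=0.0, seed=0):
    """Per-channel mean of i.i.d. standard normal noise plus a constant offset."""
    rng = random.Random(seed)
    n = height * width
    return [
        statistics.fmean(rng.gauss(0.0, 1.0) + offset for _ in range(n))
        for _ in range(channels)
    ]

# Plain noise: each channel mean sits close to zero (std of the mean is 1/64).
plain = channel_means()
# Shifted noise: each channel mean sits close to the chosen offset instead.
shifted = channel_means(offset=0.5)

print("plain  :", [round(m, 3) for m in plain])
print("shifted:", [round(m, 3) for m in shifted])
```

With 4096 samples per channel the mean deviates from its target by roughly 0.016 on average, which is why a constant shift of the input dominates the per-item mean.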

Now, I was curious what would happen if I made Stable Diffusion denoise an "impossible" image whose mean color exceeds the [0,1] valid RGB range:

init_latent = vae_encode(tensor([1.5, 1.5, 1.5])[None,:,None,None].tile(1,1,512,512)) + sigma_max * randn(1,4,64,64)

Android Phantom, Cached And Empty Processes

This has been moved to https://github.com/agnostic-apollo/Android-Docs/blob/master/en/docs/apps/processes/phantom-cached-and-empty-processes.md

Some important headings are kept so that users who land here can follow them to the new link.

Downloads restricted PDFs from Google Drive that cannot normally be downloaded. Uses A4 paper size and scales accordingly. Copy and paste script.js into the console and press Enter.

Ensure that your system has enough RAM before starting.
Thanks to this comment for zoom-in code.

NVIDIA GPU P2P Benchmark bandwidth/throughput and latency

Using https://github.com/NVIDIA/cuda-samples

You can also view the GPU topology using nvidia-smi topo -m

Download the repo: git clone https://github.com/NVIDIA/cuda-samples.git
Check out the tag that corresponds to your CUDA version, e.g.: git checkout tags/v11.1
You might need to install some additional packages: sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev libglfw3-dev libgles2-mesa-dev
Either build everything by running make in the root dir, or build only this sample: cd Samples/p2pBandwidthLatencyTest; make
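
If you want to capture the topology from a script before running the benchmark, a minimal sketch could look like the following. It only wraps the `nvidia-smi topo -m` command mentioned above. The `gpu_topology` helper name is hypothetical, and the function returns None on machines where `nvidia-smi` is missing or fails.

```python
import shutil
import subprocess

def gpu_topology():
    """Return the output of `nvidia-smi topo -m`, or None if it cannot be run."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # no NVIDIA driver tools on PATH
    result = subprocess.run([exe, "topo", "-m"], capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else None

topology = gpu_topology()
print(topology if topology is not None else "nvidia-smi not available")
```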


	/*
	the twitter api is stupid. it is stupid and bad and expensive. hence, this.

	Literally just paste this in the JS console on the bookmarks tab and the script will automatically scroll to the bottom of your bookmarks and keep track of them as it goes.

	When finished, it downloads a JSON file containing the raw text content of every bookmark.

	for now it stores just the text inside the tweet itself, but if you're reading this why don't you go ahead and try to also store other information (author, tweetLink, pictures, everything). come on. do it. please?
	*/

	import torch
	from torch.utils.flop_counter import FlopCounterMode
	from triton.testing import do_bench

	def get_flops_achieved(f):
	    flop_counter = FlopCounterMode(display=False)
	    with flop_counter:
	        f()
	    total_flops = flop_counter.get_total_flops()
	    ms_per_iter = do_bench(f)