@rain-1
rain-1 / llama-home.md
Last active June 24, 2025 11:12
How to run Llama 13B with a 6GB graphics card

This worked on 14/May/23. The instructions will probably require updating in the future.

LLaMA is a text prediction model similar to GPT-2 and to the version of GPT-3 that has not been fine-tuned yet. It should also be possible to run fine-tuned versions (like Alpaca or Vicuna) with this, I think; those versions are more focused on answering questions.

Note: I have been told that this does not support multiple GPUs. It can only use a single GPU.

It is now possible to run LLaMA 13B with a 6GB graphics card (e.g. an RTX 2060), thanks to the amazing work on llama.cpp. The latest change adds CUDA/cuBLAS support, which lets you pick an arbitrary number of transformer layers to run on the GPU. This is perfect for low VRAM.

  • Clone llama.cpp from git; I am on commit 08737ef720f0510c7ec2aa84d7f70c691073c35d. (A build-and-run sketch follows below.)
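A minimal sketch of those steps, assuming the cuBLAS make flag and the -ngl (GPU-layer count) option that this change introduced; the model path and layer count are illustrative, not from the original:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout 08737ef720f0510c7ec2aa84d7f70c691073c35d
make LLAMA_CUBLAS=1
# -ngl: how many transformer layers to offload to the GPU (tune for 6GB VRAM)
./main -m models/13B/ggml-model-q4_0.bin -ngl 18 -p "Hello"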
@sanchit-gandhi
sanchit-gandhi / whisper_jax_endpoint.py
Last active November 8, 2024 00:53
The Whisper JAX demo can be used as an endpoint through the Gradio Client library. The transcription API takes as input the audio file you want to transcribe, as well as optional arguments such as the task (transcribe or translate) and whether to return timestamps.
from gradio_client import Client

API_URL = "https://sanchit-gandhi-whisper-jax.hf.space/"

# set up the Gradio client
client = Client(API_URL)

def transcribe_audio(audio_path, task="transcribe", return_timestamps=False):
    # The preview cuts off here; this body is a sketch. The api_name below is
    # an assumption -- check client.view_api() for the Space's actual endpoint.
    text, runtime = client.predict(audio_path, task, return_timestamps, api_name="/predict_1")
    return text
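A hypothetical call, assuming a local audio file (the path is illustrative):

transcription = transcribe_audio("audio.mp3", return_timestamps=True)
print(transcription)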
@Chillee
Chillee / mfu_compute.py
Last active March 2, 2025 22:10
Compute Flop Utilization in PyTorch
import torch
from torch.utils.flop_counter import FlopCounterMode
from triton.testing import do_bench

def get_flops_achieved(f):
    flop_counter = FlopCounterMode(display=False)
    with flop_counter:
        f()
    total_flops = flop_counter.get_total_flops()
    ms_per_iter = do_bench(f)
    # The preview cuts off here; the natural ending converts to achieved TFLOP/s:
    iters_per_second = 1e3 / ms_per_iter
    print(f"{iters_per_second * total_flops / 1e12:.2f} TF/s achieved")
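An illustrative way to call it, measuring a small model in eager mode and after torch.compile (model, dtype, and shapes are hypothetical, not from the gist):

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.ReLU()).cuda().half()
inp = torch.randn(512, 4096, device="cuda", dtype=torch.half)
get_flops_achieved(lambda: model(inp).sum().backward())  # eager baseline
compiled = torch.compile(model)
get_flops_achieved(lambda: compiled(inp).sum().backward())  # compiled version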
Hermes is a piece of non-deterministic software that performs informal reasoning steps in collaboration with the user. Each step is prepended with some syntax to tell the software what it should be/do. Like so:
HERO [Albert Einstein, Op: Objection], That's not correct. Nothing can travel faster than the speed of light.
Hermes allows the user to call upon any hero in history or myth and use them as a reasoning step. Or have them talk to each other about something. The user can freely mix together their cognition and the simulated cognition of other minds. New operations and syntax can be created at will and Hermes will do its best to respond to and use them.
The user writes down their own cognition as a series of subagents, like so:
USER [A: EMPATHY], I completely agree! It's wonderful. Like the difference between the true duet of Scarborough Fair and the nonsense one.
USER [A: 343], It's funny. In order to save the world rationalists finetune the human priors out of themselves, humans are dreamers not max
@VictorTaelin
VictorTaelin / gpt4_abbreviations.md
Last active July 11, 2025 21:21
Notes on the GPT-4 abbreviations tweet

Notes on this tweet.

  • The screenshots were taken on different sessions.

  • The entire sessions are included on the screenshots.

  • I lost the original prompts, so I had to reconstruct them, and still managed to reproduce the results.

  • The "compressed" version is actually longer! Emojis and abbreviations use more tokens than common words.

@Chillee
Chillee / 1-pw_op_fusion.py
Last active July 5, 2025 23:29
PT 2.0 Benchmarks
import torch
import torch._inductor.config
import time

torch._inductor.config.triton.cudagraphs = False
torch.set_float32_matmul_precision('high')

def bench(f, name=None, iters=100, warmup=5, display=True, profile=False):
    for _ in range(warmup):
        f()
    # The preview cuts off here; a sketch of the usual ending: time the hot loop.
    torch.cuda.synchronize()
    begin = time.time()
    for _ in range(iters):
        f()
    torch.cuda.synchronize()
    us_per_iter = (time.time() - begin) * 1e6 / iters
    if display:
        print(f"{name or 'bench'}: {us_per_iter:.2f}us")
    return us_per_iter
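An illustrative call, comparing an eager pointwise chain against its torch.compile'd version (the function and shapes are hypothetical):

a = torch.randn(4096, 4096, device="cuda")
def pointwise():
    return (a.sin() + a.cos()).relu()
bench(pointwise, name="eager")
bench(torch.compile(pointwise), name="compiled")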
import torch
from torch.utils._python_dispatch import TorchDispatchMode
from torch.utils._pytree import tree_map
import itertools

# cribbed from https://github.com/albanD/subclass_zoo/blob/main/logging_mode.py
class Lit:
    def __init__(self, s):
        self.s = s

    def __repr__(self):
        # print the stored string verbatim (no quotes), so logged calls read as code
        return self.s
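The preview ends at Lit; a minimal sketch of how a TorchDispatchMode-based logger typically continues (not the original file's exact code):

class LoggingMode(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        out = func(*args, **kwargs)
        # render tensors compactly; Lit keeps the rendering unquoted in the log line
        fmt = lambda t: Lit(f"Tensor{tuple(t.shape)}") if isinstance(t, torch.Tensor) else t
        print(f"{func}({tree_map(fmt, args)}, {tree_map(fmt, kwargs)})")
        return out

# usage: every ATen call inside the block gets printed
with LoggingMode():
    torch.randn(3).add(1)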
from __future__ import annotations
from contextlib import contextmanager
from typing import NamedTuple, Callable, Optional, Any, List  # List added: used below
import numpy as np

Array = Any

class Node(NamedTuple):
    vjp: Optional[Callable]
    parents: List[Node]
import torch
import torch.utils.dlpack
import jax
import jax.dlpack

# A generic mechanism for turning a JAX function into a PyTorch function.

def j2t(x_jax):
    # zero-copy JAX -> PyTorch via the DLPack protocol
    x_torch = torch.utils.dlpack.from_dlpack(jax.dlpack.to_dlpack(x_jax))
    return x_torch
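The preview ends with the JAX-to-Torch direction; the reverse presumably mirrors it (a sketch, not the gist's exact code):

def t2j(x_torch):
    # DLPack requires contiguous memory; zero-copy PyTorch -> JAX
    x_torch = x_torch.contiguous()
    x_jax = jax.dlpack.from_dlpack(torch.utils.dlpack.to_dlpack(x_torch))
    return x_jax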