Kaiyu Shi (Stonesjtu)

  • NIO
  • Shanghai, China
@Stonesjtu
Stonesjtu / nsight.sh
Created September 10, 2024 11:50 — forked from mcarilli/nsight.sh
Favorite nsight systems profiling commands for Pytorch scripts
# This isn't supposed to run as a bash script; I named it with ".sh" for syntax highlighting.
# https://developer.nvidia.com/nsight-systems
# https://docs.nvidia.com/nsight-systems/profiling/index.html
# My preferred nsys (command line executable used to create profiles) commands
#
# In your script, write
# torch.cuda.nvtx.range_push("region name")
# ...
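The excerpt stops at the range_push call. A minimal sketch of how such NVTX ranges are typically wrapped around a training step and launched under nsys; the model, tensor shapes, and the particular nsys flags below are illustrative assumptions, not taken from the gist:

# Launch under Nsight Systems, for example:
#   nsys profile -t cuda,nvtx -o my_profile python train.py
# (a representative invocation; see the gist and the Nsight Systems docs for the full flag set)
import torch

model = torch.nn.Linear(1024, 1024).cuda()       # requires a CUDA-capable GPU
x = torch.randn(64, 1024, device="cuda")

for step in range(10):
    torch.cuda.nvtx.range_push(f"step {step}")   # open a named NVTX region
    torch.cuda.nvtx.range_push("forward")
    y = model(x)
    torch.cuda.nvtx.range_pop()                  # close "forward"
    torch.cuda.nvtx.range_push("backward")
    y.sum().backward()
    torch.cuda.nvtx.range_pop()                  # close "backward"
    torch.cuda.nvtx.range_pop()                  # close the step region
torch.cuda.synchronize()                         # make sure all kernels show up in the profile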
@Stonesjtu
Stonesjtu / tmux-cheats.md
Created October 11, 2019 06:39 — forked from Starefossen/tmux-cheats.md
My personal tmux cheat sheet for working with sessions, windows, and panes. `NB` I have remapped the command prefix to `ctrl` + `a`.

Sessions

New Session

  • tmux new [-s name] [cmd] (:new) - new session

Switch Session

  • tmux ls (:ls) - list sessions
  • tmux switch [-t name] (:switch) - switches to an existing session
@Stonesjtu
Stonesjtu / latency.txt
Created June 15, 2019 22:05 — forked from understeer/latency.txt
HPC-oriented Latency Numbers Every Programmer Should Know
Latency Comparison Numbers
--------------------------
L1 cache reference/hit                          1.5 ns    4 cycles
Floating-point add/mult/FMA operation           1.5 ns    4 cycles
L2 cache reference/hit                            5 ns    12 ~ 17 cycles
Branch mispredict                                 6 ns    15 ~ 20 cycles
L3 cache hit (unshared cache line)               16 ns    42 cycles
L3 cache hit (shared line in another core)       25 ns    65 cycles
Mutex lock/unlock                                25 ns
L3 cache hit (modified in another core)          29 ns    75 cycles
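The cycle column lines up with the nanosecond column at a core clock of roughly 2.7 GHz; a small check (the clock frequency is an assumption, the table does not state its reference machine):

# Rough check of the ns <-> cycles columns, assuming a ~2.7 GHz core clock.
freq_ghz = 2.7
rows = [("L1 cache hit", 1.5), ("L2 cache hit", 5.0),
        ("Branch mispredict", 6.0), ("L3 cache hit (unshared)", 16.0)]
for name, ns in rows:
    print(f"{name}: {ns} ns ~ {ns * freq_ghz:.0f} cycles")
# e.g. 1.5 ns * 2.7 GHz ~ 4 cycles, matching the first two rows.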
@Stonesjtu
Stonesjtu / mpi4py_pycuda_demo.py
Created July 18, 2018 07:28 — forked from lebedov/mpi4py_pycuda_demo.py
Demo of how to pass GPU memory managed by pycuda to mpi4py.
#!/usr/bin/env python
"""
Demo of how to pass GPU memory managed by pycuda to mpi4py.
Notes
-----
This code can be used to perform peer-to-peer communication of data via
NVIDIA's GPUDirect technology if mpi4py has been built against a
CUDA-enabled MPI implementation.
"""
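The excerpt above is only the module docstring. A minimal sketch of the idea, assuming a CUDA-aware MPI build and mpi4py's MPI.memory.fromaddress helper; the ranks, sizes, and device handling below are illustrative, not the gist's actual code:

# Minimal sketch, not the gist itself: ship a pycuda GPUArray between two MPI
# ranks. Assumes mpi4py built against a CUDA-aware MPI and that
# MPI.memory.fromaddress is available. Run with: mpirun -np 2 python demo.py
import numpy as np
import pycuda.autoinit              # creates a CUDA context in each rank (device selection omitted)
import pycuda.gpuarray as gpuarray
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
N = 16

if rank == 0:
    x_gpu = gpuarray.to_gpu(np.arange(N, dtype=np.float64))
    buf = MPI.memory.fromaddress(x_gpu.ptr, x_gpu.nbytes)   # wrap the raw device pointer
    comm.Send([buf, MPI.DOUBLE], dest=1)                    # device-to-device if GPUDirect is active
elif rank == 1:
    y_gpu = gpuarray.empty(N, dtype=np.float64)
    buf = MPI.memory.fromaddress(y_gpu.ptr, y_gpu.nbytes)
    comm.Recv([buf, MPI.DOUBLE], source=0)
    print(rank, y_gpu.get())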
from graphviz import Digraph
import re
import torch
import torch.nn.functional as F
from torch.autograd import Variable
import torchvision.models as models
def make_dot(var):
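The listing cuts off at the make_dot signature. A hedged usage sketch, assuming make_dot walks var.grad_fn and returns a graphviz Digraph (which the imports above suggest); the model and file names are illustrative:

# Hypothetical usage; make_dot's body is not shown above, it is assumed to
# return a graphviz.Digraph of the autograd graph reachable from its argument.
model = models.resnet18()
x = Variable(torch.randn(1, 3, 224, 224), requires_grad=True)
y = model(x)
dot = make_dot(y)
dot.render("resnet18_graph", format="pdf")  # writes resnet18_graph.pdf next to the script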