tmux new [-s name] [cmd]   (:new)    - new session
tmux ls                    (:ls)     - list sessions
tmux switch [-t name]      (:switch) - switch to an existing session
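For example, to create a detached session named train, list sessions, and switch to it from inside another session (the name "train" is just illustrative):

tmux new -s train -d
tmux ls
tmux switch -t train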
import gc
import torch

## MEM utils ##

def mem_report():
    '''Report the memory usage of tensor storage in PyTorch.
    Both CPU and GPU tensors are reported.'''
    def _mem_report(tensors, mem_type):
        # numel * element_size approximates the underlying storage size
        total = sum(t.numel() * t.element_size() for t in tensors)
        print('%s: %d tensors, %.2f MB' % (mem_type, len(tensors), total / 1024 ** 2))
    # every live tensor the garbage collector can see
    tensors = [obj for obj in gc.get_objects() if torch.is_tensor(obj)]
    _mem_report([t for t in tensors if t.is_cuda], 'GPU')
    _mem_report([t for t in tensors if not t.is_cuda], 'CPU')
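# A quick usage sketch (hypothetical; just gives the report something to count):
if __name__ == '__main__':
    x = torch.randn(1024, 1024)                        # CPU tensor
    if torch.cuda.is_available():
        y = torch.randn(256, 256, device='cuda')      # GPU tensor, if available
    mem_report()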
from graphviz import Digraph
import torch
import torchvision.models as models

def make_dot(var):
    '''Return a graphviz Digraph of the autograd graph that produced var.'''
    dot, seen = Digraph(node_attr=dict(shape='box')), set()
    def add(fn):  # walk the grad_fn graph, one node per backward function
        if fn is None or fn in seen:
            return
        seen.add(fn)
        dot.node(str(id(fn)), type(fn).__name__)
        for nxt, _ in fn.next_functions:
            if nxt is not None:
                dot.edge(str(id(nxt)), str(id(fn)))
                add(nxt)
    add(var.grad_fn)
    return dot
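# A usage sketch in the spirit of the imports above (the model choice is illustrative):
model = models.resnet18()
x = torch.randn(1, 3, 224, 224)
make_dot(model(x)).render('resnet18_graph', format='pdf')  # writes resnet18_graph.pdf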
#!/usr/bin/env python

"""
Demo of how to pass GPU memory managed by pycuda to mpi4py.

Notes
-----
This code can be used to perform peer-to-peer communication of data via
NVIDIA's GPUDirect technology if mpi4py has been built against a
CUDA-enabled MPI implementation.
"""
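# A minimal sketch of the pattern described above -- not the original demo's
# code. It assumes mpi4py was built against a CUDA-aware MPI and that this
# PyCUDA version exposes __cuda_array_interface__ on GPUArray, so device
# buffers can be handed directly to Send/Recv. Run with: mpiexec -n 2 python demo.py
from mpi4py import MPI
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context on import
import pycuda.gpuarray as gpuarray

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    x_gpu = gpuarray.to_gpu(np.arange(8, dtype=np.float64))
    comm.Send(x_gpu, dest=1, tag=0)   # device buffer handed directly to MPI
else:
    x_gpu = gpuarray.empty(8, np.float64)
    comm.Recv(x_gpu, source=0, tag=0)
    print('received on GPU:', x_gpu.get())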
"""A simple script to test the biLSTM type that pytorch uses. | |
The gradients are computed only w.r.t the output of one single direction, | |
so gradient of the reverse direction in layer 1 should be zero if type1. | |
In my tests, it's type2 | |
""" | |
import torch | |
from torch import nn |
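# A sketch of the experiment described in the docstring (sizes are arbitrary):
# backprop from only the forward-direction slice of the top layer's output,
# then inspect a reverse-direction weight gradient in the first layer.
lstm = nn.LSTM(input_size=4, hidden_size=5, num_layers=2, bidirectional=True)
out, _ = lstm(torch.randn(3, 1, 4))   # out: (seq_len, batch, 2 * hidden_size)
out[..., :5].sum().backward()         # gradient w.r.t. the forward half only
# Nonzero here means layer 2 consumed both directions of layer 1 (type 2):
print(lstm.weight_ih_l0_reverse.grad.abs().sum())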
Latency Comparison Numbers
--------------------------
L1 cache reference/hit                        1.5 ns          4 cycles
Floating-point add/mult/FMA operation         1.5 ns          4 cycles
L2 cache reference/hit                          5 ns    12 ~ 17 cycles
Branch mispredict                               6 ns    15 ~ 20 cycles
L3 cache hit (unshared cache line)             16 ns         42 cycles
L3 cache hit (shared line in another core)     25 ns         65 cycles
Mutex lock/unlock                              25 ns
L3 cache hit (modified in another core)        29 ns         75 cycles
# This isn't supposed to run as a bash script; I named it ".sh" for syntax highlighting.
# https://developer.nvidia.com/nsight-systems
# https://docs.nvidia.com/nsight-systems/profiling/index.html
# My preferred nsys (the command-line executable used to create profiles) commands
#
# In your script, write
#   torch.cuda.nvtx.range_push("region name")
#   ...
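#
# A sketch of the push/pop pairing and a basic profiling invocation
# (the region name, output name, and script name are illustrative):
#
#   torch.cuda.nvtx.range_push("forward")   # opens an NVTX range visible in nsys
#   out = model(x)
#   torch.cuda.nvtx.range_pop()             # closes the most recently opened range
#
# then profile with:
#   nsys profile -t cuda,nvtx -o my_report python train.py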