CHANG-NING TSAI (crazyguitar)
@dideler
dideler / bot.rb
Last active October 23, 2025 16:33
Sending a notification message to Telegram using its HTTP API via cURL
# Use this script to test that your Telegram bot works.
#
# Install the dependency
#
# $ gem install telegram_bot
#
# Run the bot
#
# $ ruby bot.rb
#
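For reference, the same check can be made straight against Telegram's HTTP API without installing the gem. A minimal Python sketch; BOT_TOKEN and CHAT_ID are placeholders, not values from the gist:

# Send one message through Telegram's sendMessage endpoint.
import json
import urllib.parse
import urllib.request

BOT_TOKEN = "123456:REPLACE-ME"  # placeholder: token from @BotFather
CHAT_ID = "123456789"            # placeholder: target chat id

url = f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage"
data = urllib.parse.urlencode({"chat_id": CHAT_ID, "text": "bot works"}).encode()
with urllib.request.urlopen(url, data=data) as resp:
    print(json.load(resp))  # Telegram returns the sent message as JSON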
@mcarilli
mcarilli / commands.md
Last active October 16, 2025 08:53
Single- and multiprocess profiling workflow with nvprof and NVVP (Nsight Systems coming soon...)

Ordinary launch commands (no profiling):

Single-process:

python main_amp.py -a resnet50 --b 224 --deterministic --workers 4 --opt-level O1 ./bare_metal_train_val/

Multi-process:

python -m torch.distributed.launch --nproc_per_node=2 main_amp.py -a resnet50 --b 224 --deterministic --workers 4 --opt-level O1 ./bare_metal_train_val/
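When profiling with nvprof's --profile-from-start off, the script itself decides which iterations get captured. A minimal sketch of that instrumentation, not taken from main_amp.py; the loop, bounds, and names are illustrative:

# Illustrative only (not main_amp.py): capture iterations 10 through 20
# so the nvprof/NVVP timeline stays small and representative.
import torch

def train_with_profiling(loader, model, criterion, optimizer):
    for i, (x, y) in enumerate(loader):
        if i == 10:
            torch.cuda.synchronize()
            torch.cuda.profiler.start()      # begin capture
        torch.cuda.nvtx.range_push(f"iteration {i}")
        loss = criterion(model(x.cuda()), y.cuda())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        torch.cuda.nvtx.range_pop()
        if i == 20:
            torch.cuda.synchronize()
            torch.cuda.profiler.stop()       # end capture
            break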
@crazyguitar
crazyguitar / mapread.c
Created January 17, 2020 16:43 — forked from marcetcheverry/mapread.c
mmap and read/write string to file
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <unistd.h>
int main(int argc, const char *argv[])
{
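The preview above is cut off by the page. As a self-contained reference for the same round trip, here is a sketch using Python's stdlib mmap module; the path and payload are arbitrary:

# Size a file, map it, write a string through the mapping, read it back.
import mmap

text = b"hello mmap\n"
with open("/tmp/mmap.txt", "wb") as f:
    f.truncate(len(text))                 # size the file before mapping
with open("/tmp/mmap.txt", "r+b") as f:
    with mmap.mmap(f.fileno(), len(text)) as m:
        m[:] = text                       # write through the mapping
        print(m[:].decode(), end="")      # read back: hello mmap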
@mcarilli
mcarilli / nsight.sh
Last active October 29, 2025 07:13
Favorite nsight systems profiling commands for Pytorch scripts
# This isn't supposed to run as a bash script; I named it with ".sh" for syntax highlighting.
# https://developer.nvidia.com/nsight-systems
# https://docs.nvidia.com/nsight-systems/profiling/index.html
# My preferred nsys (command line executable used to create profiles) commands
#
# In your script, write
# torch.cuda.nvtx.range_push("region name")
# ...
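The preview stops at the "..."; each range_push needs a matching range_pop. A minimal sketch of the pairing, with a toy model for illustration:

# Each push/pop pair appears as a named range on the Nsight Systems timeline.
import torch

model = torch.nn.Linear(8, 8).cuda()
x = torch.randn(4, 8, device="cuda")

torch.cuda.nvtx.range_push("region name")
y = model(x)
torch.cuda.nvtx.range_pop()
torch.cuda.synchronize()  # wait for the region's kernels before exiting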
@ih2502mk
ih2502mk / list.md
Last active November 2, 2025 23:34
Quantopian Lectures Saved
@crazyguitar
crazyguitar / m256.cc
Created October 5, 2021 20:18 — forked from kaityo256/m256.cc
//----------------------------------------------------------------------
#include <stdio.h>
#include <emmintrin.h>
#include <immintrin.h>
//----------------------------------------------------------------------
/* Print the four double lanes of a __m256d register, low lane first. */
void
printm256(__m256d r){
double *a = (double*)(&r);
printf("%f %f %f %f\n",a[0],a[1],a[2],a[3]);
}
@ruvnet
ruvnet / MoE.py
Last active October 20, 2025 19:50
A PyTorch implementation of a Mixture of Experts (MoE) model resembling the Mixtral 8x7B architecture, with detailed inline comments. This model combines transformer layers with an MoE layer consisting of 8 experts, aiming for high efficiency by activating only 2 experts per token. It's configured with dimensions reflecting the operational effic…
"""
This model integrates the MoE concept within a Transformer architecture. Each token's
representation is processed by a subset of experts, determined by the gating mechanism.
This architecture allows for efficient and specialized handling of different aspects of the
data, aiming for the adaptability and efficiency noted in the Mixtral 8x7B model's design
philosophy. The model activates only a fraction of the available experts for each token,
significantly reducing the computational resources needed compared to activating all experts
for all tokens.
"""