A Map for Studying Pre-training in LLMs
- Data Collection
  - General Text Data
  - Specialized Data
- Data Preprocessing
  - Quality Filtering
  - Deduplication (a minimal hash-based sketch follows this list)
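To make the Deduplication item concrete, here is a minimal, illustrative sketch of exact-match deduplication via content hashing. The function name and normalization are my own assumptions, not a reference pipeline; real pre-training pipelines usually add fuzzy matching (e.g. MinHash) on top of this.

import hashlib

def deduplicate_exact(documents):
    """Drop exact duplicates by hashing normalized document text (illustrative sketch)."""
    seen = set()
    unique_docs = []
    for doc in documents:
        # Normalize whitespace and case so trivially different copies collapse to one hash
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique_docs.append(doc)
    return unique_docs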
I wrote these instructions as part of "installing PyTorch with CUDA 12.1.1".
I extracted them into this separate gist, because I realised there's a much easier way to install magma for CUDA 12.1.1:
https://anaconda.org/pytorch/magma-cuda121
There's a conda package!
conda install -c pytorch magma-cuda121
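If you want to double-check that the package actually landed in your active environment, `conda list` will confirm it (the exact version string depends on what the pytorch channel currently ships):

conda list magma-cuda121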
The CUDA 12.1.1 toolkit is gonna offer to install Nvidia driver 530 for us. That driver comes from the New Feature branch, so it's likely to be newer than the default Nvidia driver you would've installed via apt-get (apt would prefer to give you 525, i.e. the Production Branch).
If you're confident that you already have a new enough Nvidia driver for CUDA 12.1.1, and you'd like to keep your driver: feel free to skip this "uninstall driver" step.
But if you're not sure, or you know your driver is too old: let's uninstall it. CUDA will install a new driver for us later.
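As a rough sketch of that uninstall step, assuming your existing driver was installed through Ubuntu's package manager (if you used a .run installer instead, uninstall it the same way you installed it):

# Remove any apt-installed Nvidia driver packages, then clean up leftovers
sudo apt-get purge '^nvidia-.*'
sudo apt-get autoremove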
# See https://gist.github.com/BramVanroy/f78530673b1437ed0d6be7c61cdbdd7c
import sys

from transformers import HfArgumentParser, TrainingArguments

# ModelArguments, DataTrainingArguments and HyperOptArguments are custom dataclasses defined elsewhere in the script
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments, HyperOptArguments))
try:
    # Assumes that the first .json file on the command line is the config file (if any)
    config_file = next(arg for arg in sys.argv if arg.endswith(".json"))
except StopIteration:
    config_file = None

run_name_specified = False
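For context, here is a hedged sketch of how that `config_file` is typically consumed with HfArgumentParser: parse the JSON file when one was passed, otherwise fall back to regular command-line parsing. Unpacking into these four argument objects mirrors the tuple above and is an assumption about the rest of the script.

if config_file is not None:
    # Populate all four dataclasses from the JSON config
    model_args, data_args, training_args, hyperopt_args = parser.parse_json_file(json_file=config_file)
else:
    # Fall back to normal CLI parsing (e.g. --output_dir ... --learning_rate ...)
    model_args, data_args, training_args, hyperopt_args = parser.parse_args_into_dataclasses()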
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, RepeatedKFold, GridSearchCV, cross_val_score
from sklearn.metrics import make_scorer, brier_score_loss
from sklearn.utils import resample
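As a quick illustration of what these imports are typically combined into, here is a minimal, self-contained sketch (my own example, not the original script): a scaled logistic regression evaluated with repeated cross-validation on the Brier score. It uses the built-in "neg_brier_score" scorer rather than wiring up make_scorer, to stay version-agnostic.

# Minimal sketch: probability quality (Brier score) of a scaled logistic regression
X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = cross_val_score(pipe, X, y, scoring="neg_brier_score", cv=cv)
print(f"Brier score: {-scores.mean():.4f} +/- {scores.std():.4f}")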
Tested on:
Lenovo Legion 5i with the specs below:
AMD® Ryzen 7 4800H with Radeon graphics × 16
NVIDIA Corporation / NVIDIA GeForce RTX 2060/PCIe/SSE2
nvidia-driver-470 - HDMI output may not work out of the box
nvidia-driver-495 - HDMI works out of the box, but unstable (random reboots)
# So now you want to finetune that GPT-J-6B on a 3090/TITAN GPU ... okay
# More exploratory coding. It uses the Huggingface model port and DeepSpeed, and reads all text/md files from a target directory
# It is a fragment of a larger system with remote editing, but that's another story
# This is the raw training tester. Items to look out for:
# - uses DeepSpeed and has a DS config
# - uses SGD instead of ADAM to save space
# - uses gradient checkpointing
# - freezes 25% of the layers to fit
# Assumes you can already run https://gist.github.com/kinoc/2d636a68876cd3de7b6e9c9452b61089
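A hedged sketch of the memory-saving tricks named above (gradient checkpointing plus freezing the first 25% of transformer blocks), written against the Huggingface GPT-J port. The SGD wiring and the choice of which quarter to freeze are my assumptions, not a copy of the actual script.

import torch
from transformers import GPTJForCausalLM

model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)
model.gradient_checkpointing_enable()  # trade compute for activation memory

# Freeze the first 25% of the transformer blocks so their gradients and optimizer state aren't kept
n_freeze = model.config.n_layer // 4
for block in model.transformer.h[:n_freeze]:
    for param in block.parameters():
        param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-4)  # SGD keeps no per-parameter moment buffers, unlike Adam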
brew install pandoc
brew tap homebrew/cask
brew install --cask basictex
eval "$(/usr/libexec/path_helper)"
# Update $PATH to include `/usr/local/texlive/2022basic/bin/universal-darwin`
sudo tlmgr update --self
sudo tlmgr install texliveonfly
sudo tlmgr install xelatex
sudo tlmgr install adjustbox
sudo tlmgr install tcolorbox
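Once those packages are in place, a quick smoke test (my own example, assuming the goal is Markdown-to-PDF through pandoc's xelatex engine) looks like this:

pandoc notes.md -o notes.pdf --pdf-engine=xelatex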
; /usr/share/pulseaudio/alsa-mixer/profile-sets/astro-a50-gen4.conf
[General]
auto-profiles = yes

[Mapping analog-voice]
description = Voice
device-strings = hw:%f,0,0
channel-map = left,right
paths-output = steelseries-arctis-output-chat-common
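To try a profile set like this, the usual pattern (a sketch, not taken from the original notes) is to drop the file into PulseAudio's profile-set directory and restart the daemon; depending on your setup you may also need a udev rule that points the device at this profile via the PULSE_PROFILE_SET environment variable.

sudo cp astro-a50-gen4.conf /usr/share/pulseaudio/alsa-mixer/profile-sets/
pulseaudio -k   # kill the daemon; it is restarted automatically (or by systemd --user)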
This is a companion piece to my instructions on building TensorFlow from source. The aim is to install the following pieces of software
on an Ubuntu Linux system, specifically Ubuntu 20.04.