A Map for Studying Pre-training in LLMs
- Data Collection
  - General Text Data
  - Specialized Data
- Data Preprocessing
  - Quality Filtering
  - Deduplication (a minimal hash-based sketch follows this list)
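To make the Deduplication item concrete, here is a minimal, illustrative sketch of exact-match deduplication via content hashing. The function name and normalization are my own assumptions, not a reference pipeline; real pre-training pipelines usually add fuzzy matching (e.g. MinHash) on top of this.

import hashlib

def deduplicate_exact(documents):
    """Drop exact duplicates by hashing normalized document text (illustrative sketch)."""
    seen = set()
    unique_docs = []
    for doc in documents:
        # Normalize whitespace and case so trivially different copies collapse to one hash
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique_docs.append(doc)
    return unique_docs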
I wrote these instructions as part of "installing PyTorch with CUDA 12.1.1".
I extracted them into this separate gist, because I realised there's a much easier way to install magma for CUDA 12.1.1:
https://anaconda.org/pytorch/magma-cuda121
There's a conda package!
conda install -c pytorch magma-cuda121
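If you want to double-check that the package actually landed in your active environment, `conda list` will confirm it (the exact version string depends on what the pytorch channel currently ships):

conda list magma-cuda121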
The CUDA 12.1.1 toolkit is gonna offer to install Nvidia driver 530 for us. That driver comes from the New Feature branch, so it's likely to be newer than the default Nvidia driver you would've installed via apt-get (apt would prefer to give you 525, i.e. the Production Branch).
If you're confident that you already have a new enough Nvidia driver for CUDA 12.1.1, and you'd like to keep your driver: feel free to skip this "uninstall driver" step.
But if you're not sure, or you know your driver is too old: let's uninstall it. CUDA will install a new driver for us later.
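As a rough sketch of that uninstall step, assuming your existing driver was installed through Ubuntu's package manager (if you used a .run installer instead, uninstall it the same way you installed it):

# Remove any apt-installed Nvidia driver packages, then clean up leftovers
sudo apt-get purge '^nvidia-.*'
sudo apt-get autoremove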
# See https://gist.github.com/BramVanroy/f78530673b1437ed0d6be7c61cdbdd7c
import sys

from transformers import HfArgumentParser, TrainingArguments

# ModelArguments, DataTrainingArguments and HyperOptArguments are custom dataclasses defined elsewhere in the script
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments, HyperOptArguments))
try:
    # Assumes that the first .json file on the command line is the config file (if any)
    config_file = next(arg for arg in sys.argv if arg.endswith(".json"))
except StopIteration:
    config_file = None

run_name_specified = False
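For context, here is a hedged sketch of how that `config_file` is typically consumed with HfArgumentParser: parse the JSON file when one was passed, otherwise fall back to regular command-line parsing. Unpacking into these four argument objects mirrors the tuple above and is an assumption about the rest of the script.

if config_file is not None:
    # Populate all four dataclasses from the JSON config
    model_args, data_args, training_args, hyperopt_args = parser.parse_json_file(json_file=config_file)
else:
    # Fall back to normal CLI parsing (e.g. --output_dir ... --learning_rate ...)
    model_args, data_args, training_args, hyperopt_args = parser.parse_args_into_dataclasses()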
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, RepeatedKFold, GridSearchCV, cross_val_score
from sklearn.metrics import make_scorer, brier_score_loss
from sklearn.utils import resample
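As a quick illustration of what these imports are typically combined into, here is a minimal, self-contained sketch (my own example, not the original script): a scaled logistic regression evaluated with repeated cross-validation on the Brier score. It uses the built-in "neg_brier_score" scorer rather than wiring up make_scorer, to stay version-agnostic.

# Minimal sketch: probability quality (Brier score) of a scaled logistic regression
X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = cross_val_score(pipe, X, y, scoring="neg_brier_score", cv=cv)
print(f"Brier score: {-scores.mean():.4f} +/- {scores.std():.4f}")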
Tested on:
Lenovo Legion 5i with the specs below:
AMD® Ryzen 7 4800H with Radeon graphics × 16
NVIDIA Corporation / NVIDIA GeForce RTX 2060/PCIe/SSE2
nvidia-driver-470 - HDMI output may not work out of the box
nvidia-driver-495 - HDMI works out of the box, but unstable (random reboots)
# So now you want to finetune that GPT-J-6B on a 3090/TITAN GPU ... okay
# More exploratory coding. It uses the Huggingface model port and DeepSpeed, and reads all text/md files from a target directory
# It is a fragment of a larger system with remote editing, but that's another story
# This is the raw training tester. Items to look out for:
# - uses DeepSpeed and has a DS config
# - uses SGD instead of ADAM to save space
# - uses gradient checkpointing
# - freezes 25% of the layers to fit
# Assumes you can already run https://gist.github.com/kinoc/2d636a68876cd3de7b6e9c9452b61089
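A hedged sketch of the memory-saving tricks named above (gradient checkpointing plus freezing the first 25% of transformer blocks), written against the Huggingface GPT-J port. The SGD wiring and the choice of which quarter to freeze are my assumptions, not a copy of the actual script.

import torch
from transformers import GPTJForCausalLM

model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)
model.gradient_checkpointing_enable()  # trade compute for activation memory

# Freeze the first 25% of the transformer blocks so their gradients and optimizer state aren't kept
n_freeze = model.config.n_layer // 4
for block in model.transformer.h[:n_freeze]:
    for param in block.parameters():
        param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-4)  # SGD keeps no per-parameter moment buffers, unlike Adam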
brew install pandoc
brew tap homebrew/cask
brew install --cask basictex
eval "$(/usr/libexec/path_helper)"
# Update $PATH to include `/usr/local/texlive/2022basic/bin/universal-darwin`
sudo tlmgr update --self
sudo tlmgr install texliveonfly
sudo tlmgr install xelatex
sudo tlmgr install adjustbox
sudo tlmgr install tcolorbox
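Once those packages are in place, a quick smoke test (my own example, assuming the goal is Markdown-to-PDF through pandoc's xelatex engine) looks like this:

pandoc notes.md -o notes.pdf --pdf-engine=xelatex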
; /usr/share/pulseaudio/alsa-mixer/profile-sets/astro-a50-gen4.conf
[General]
auto-profiles = yes

[Mapping analog-voice]
description = Voice
device-strings = hw:%f,0,0
channel-map = left,right
paths-output = steelseries-arctis-output-chat-common
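To try a profile set like this, the usual pattern (a sketch, not taken from the original notes) is to drop the file into PulseAudio's profile-set directory and restart the daemon; depending on your setup you may also need a udev rule that points the device at this profile via the PULSE_PROFILE_SET environment variable.

sudo cp astro-a50-gen4.conf /usr/share/pulseaudio/alsa-mixer/profile-sets/
pulseaudio -k   # kill the daemon; it is restarted automatically (or by systemd --user)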
This is a companion piece to my instructions on building TensorFlow from source. The aim is to install the following pieces of software
on an Ubuntu Linux system, specifically Ubuntu 20.04.