Sagor Sarker sagorbrur

@sagorbrur
sagorbrur / tokens_count_large_dataset.py
Last active April 28, 2024 07:40
Token counts using hf tokenizer and large datasets
import glob
import json
import multiprocessing
from tqdm import tqdm
from transformers import AutoTokenizer
model_id = "tokenizer_model"  # placeholder: replace with a real model ID or local path
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
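The preview stops once the tokenizer is loaded, so the counting step itself is not visible. What follows is only a minimal sketch of how it might continue, not the gist's actual code. It assumes the dataset is a set of JSONL files with one JSON object per line and the text under a `"text"` key. The helper names `count_file_tokens` and `count_tokens` and the `text_key` parameter are hypothetical, and `encode` stands for any picklable callable that returns a sequence of tokens, such as `tokenizer.encode`.

```python
import json
from functools import partial
from multiprocessing import Pool


def count_file_tokens(path, encode, text_key="text"):
    """Sum token counts over one JSONL file (assumed: one JSON object per line)."""
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            total += len(encode(json.loads(line)[text_key]))
    return total


def count_tokens(paths, encode, processes=1):
    """Sum token counts over many files, optionally using a process pool."""
    worker = partial(count_file_tokens, encode=encode)
    if processes <= 1:
        return sum(worker(p) for p in paths)
    # Each worker process handles whole files; results arrive in any order.
    with Pool(processes) as pool:
        return sum(pool.imap_unordered(worker, paths))
```

With the tokenizer loaded above, a call could look like `count_tokens(glob.glob("data/*.jsonl"), tokenizer.encode, processes=4)`, where `data/*.jsonl` is a made-up path. Note that `tokenizer.encode` adds special tokens by default, so counts include them. When using more than one process, run the call under an `if __name__ == "__main__":` guard, and check that the tokenizer pickles cleanly on your setup.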
@sagorbrur
sagorbrur / rename_shards.py
Created April 25, 2024 11:24
Rename files to hf shards format
# Given list of filenames
file_names = ['chunk_1.jsonl', 'chunk_2.jsonl']
# Function to convert file names to the required format
def convert_filenames(filenames):
    total_files = len(filenames)
    new_file_names = []
    for i, filename in enumerate(filenames, start=1):
        # Extract the base name without extension and the chunk number
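The preview is cut off before the renaming logic, so the target pattern the gist produces is not visible. As a hedged, stand-alone sketch, the function below assumes the `<split>-<index>-of-<total>` pattern with five-digit zero padding that is commonly seen for dataset shards on the Hugging Face Hub. The function name `to_shard_names` and the `split` and `start` parameters are hypothetical and do not come from the gist.

```python
import os
import re


def to_shard_names(filenames, split="train", start=0):
    """Map chunk files to assumed shard names, e.g. 'train-00000-of-00002.jsonl'."""

    def chunk_number(name):
        # Use the first run of digits in the file name as the chunk number.
        match = re.search(r"(\d+)", os.path.basename(name))
        if match is None:
            raise ValueError(f"no chunk number in {name!r}")
        return int(match.group(1))

    # Sort numerically so chunk_10 comes after chunk_2.
    ordered = sorted(filenames, key=chunk_number)
    total = len(ordered)
    mapping = {}
    for index, name in enumerate(ordered, start=start):
        ext = os.path.splitext(name)[1]
        mapping[name] = f"{split}-{index:05d}-of-{total:05d}{ext}"
    return mapping
```

The function only computes the old-to-new mapping; applying it with `os.rename` is left to the caller so the result can be inspected first. Pass `start=1` to mirror the one-based `enumerate` in the snippet above.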
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt install git-lfs
git lfs install

Problem

A lot of GitHub projects need to have pretty math formulas in READMEs, wikis or other markdown pages. The desired approach would be to just write inline LaTeX-style formulas like this:

$e^{i \pi} = -1$

Unfortunately, GitHub does not support inline formulas. The issue is tracked here.

Investigation

import matplotlib.pyplot as plt
from matplotlib import font_manager
import seaborn as sns
font_path = './font/IPAGothic_24302.ttf' # Replace with your own font path; this example uses a Japanese font
font_manager.fontManager.addfont(font_path)
prop = font_manager.FontProperties(fname=font_path)
plt.rcParams['font.family'] = 'sans-serif'
@sagorbrur
sagorbrur / simple_pytorch.ipynb
Created September 24, 2021 17:34
simple_pytorch.ipynb
@sagorbrur
sagorbrur / loading_custom_dataset_in_huggingface_datasets.ipynb
Created May 31, 2021 16:27
loading_custom_dataset_in_huggingface_datasets.ipynb
@sagorbrur
sagorbrur / bert_medium_training_in_colab.ipynb
Created May 23, 2021 06:35
bert_medium_training_in_colab.ipynb
@sagorbrur
sagorbrur / download_glue_data.py
Last active May 23, 2021 06:39 — forked from W4ngatang/download_glue_data.py
Script for downloading data of the GLUE benchmark (gluebenchmark.com)
"""Script for downloading all GLUE data.
Modified by: Sagor Sarker
Dependency:
pip install wget
pip install wasabi
"""
import os