Zengzhi Wang (SinclairCoder), GitHub Gists
import argparse
import os
import random
from datasets import Dataset
from datatrove.pipeline.readers import JsonlReader, ParquetReader
from tqdm import tqdm
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from data_gen.configs import GentaskConfig
from utils.data_utils import get_adapter_func, split_into_batches
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 1000  # resolution
class ComplexRadar:
    """Create a complex radar chart with a different scale for each variable."""
@SinclairCoder
SinclairCoder / README_hfd.md
Created November 11, 2023 12:02 — forked from padeoe/README_hfd.md
Command-line Tool for Easy Downloading of Huggingface Models

🤗Huggingface Model Downloader

Update: the previous version had a bug that could leave files incomplete when resuming from a breakpoint. Please update to the latest version!

Because the official huggingface-cli lacks multi-threaded downloads and hf_transfer handles errors poorly, this command-line tool uses wget or aria2 for LFS files and git clone for everything else.

Features

  • ⏯️ Resume from breakpoint: interrupt with Ctrl+C and re-run at any time; the download picks up where it left off.
  • 🚀 Multi-threaded download: use multiple threads to speed up the download.
  • 🚫 File exclusion: use --exclude to skip specific files and save time when a model ships weights in duplicate formats (e.g., .bin and .safetensors).
  • 🔐 Auth support: for gated models that require a Huggingface login, authenticate with --hf_username and --hf_token.
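The resume-from-breakpoint feature ultimately rests on HTTP Range requests, which wget (via `-c`) and aria2 support natively. A minimal Python sketch of the underlying idea; the helper name is hypothetical and not part of hfd itself:

```python
import os

def resume_range_header(path):
    """Build an HTTP Range header asking the server for the bytes
    not yet on disk, so an interrupted download can continue.
    Returns None when nothing has been downloaded yet."""
    # Hypothetical helper for illustration; hfd delegates this to wget/aria2.
    if os.path.exists(path):
        done = os.path.getsize(path)
        if done > 0:
            return {"Range": f"bytes={done}-"}
    return None  # no partial file: start from the beginning
```

A partially downloaded file of 1024 bytes would yield `{"Range": "bytes=1024-"}`, telling the server to send everything from byte 1024 onward.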
@SinclairCoder
SinclairCoder / demo_hf_t5.py
Last active March 30, 2022 15:05
Uncomment L77-78 to test `max_input_len`
from pytorch_lightning import seed_everything
from transformers import AdamW, T5ForConditionalGeneration, T5Tokenizer, AutoConfig, BartTokenizer, BartForConditionalGeneration
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import pytorch_lightning as pl
import os
class MyDataset(Dataset):
@SinclairCoder
SinclairCoder / demo_hf.py
Last active March 31, 2022 01:46
Uncomment L71-75 to test different PTMs, and L79-80 to test `max_input_len`
from pytorch_lightning import seed_everything
from transformers import AdamW, T5ForConditionalGeneration, T5Tokenizer, AutoConfig, BartTokenizer, BartForConditionalGeneration
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import pytorch_lightning as pl
import os
class MyDataset(Dataset):
@SinclairCoder
SinclairCoder / demo_pl_t5.py
Last active March 30, 2022 14:56
Uncomment L109 to test different values of `max_input_len`
from pytorch_lightning import seed_everything
from transformers import AdamW, T5ForConditionalGeneration, T5Tokenizer, AutoConfig, BartTokenizer, BartForConditionalGeneration
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import pytorch_lightning as pl
import os
class MyDataset(Dataset):