Skip to content

Instantly share code, notes, and snippets.

View BramVanroy's full-sized avatar

Bram Vanroy BramVanroy

View GitHub Profile
#!/usr/bin/env python
# coding=utf-8
# Copyright 2020 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
@BramVanroy
BramVanroy / benchmark.py
Last active May 29, 2024 11:33
Fast method of "first-fit-decreasing" packing benchmark. Around 5x faster than baseline. Baseline taken from https://huggingface.co/DiscoResearch/Llama3-German-8B#document-packing. Note that memory usage will be higher in the optimized version.
import gc
import numpy as np
import time
import pandas as pd
from tqdm import tqdm
def pack_documents_original(tokenized_documents, block_size: int = 8192, use_tqdm=True):
@BramVanroy
BramVanroy / embed.py
Last active October 9, 2024 13:54
Getting word embeddings
from dataclasses import dataclass, field
import torch
from torch import LongTensor, Tensor
from transformers import (
AutoTokenizer,
AutoModel,
PreTrainedModel,
PreTrainedTokenizer,
BatchEncoding,
@BramVanroy
BramVanroy / fw2-not-unique-ids.py
Created January 18, 2025 20:16
FineWeb-2 IDs are not unique
# Check whether document IDs in FineWeb-2 (Dutch split) are unique by
# counting occurrences of every "id" value in the train split.
#
# Fix: import Counter from `collections`, not `typing` — `typing.Counter`
# is a deprecated generic alias (deprecated since Python 3.9) and should
# not be used as a runtime constructor.
from collections import Counter

from datasets import load_dataset
from tqdm import tqdm

# NOTE(review): this streams/downloads the full split — expect significant
# time and disk usage on first run.
ds = load_dataset("HuggingFaceFW/fineweb-2", "nld_Latn", split="train")
ds_size = len(ds)
print(f"Dataset size: {ds_size:,}")
# Frequency of each document ID; any count > 1 indicates a duplicate.
counts = Counter(ds["id"])