
@thomwolf
Last active January 18, 2024 14:04
Load full English Wikipedia dataset in HuggingFace nlp library
import os
import timeit

import psutil
from datasets import load_dataset

# Measure resident memory (in MB) before and after loading the dataset
mem_before = psutil.Process(os.getpid()).memory_info().rss >> 20
wiki = load_dataset("wikipedia", "20200501.en", split='train')
mem_after = psutil.Process(os.getpid()).memory_info().rss >> 20
print(f"RAM memory used: {(mem_after - mem_before)} MB")

# Time a full pass over the dataset in batches of 1000 examples
s = """batch_size = 1000
for i in range(0, len(wiki), batch_size):
    batch = wiki[i:i + batch_size]
"""
time = timeit.timeit(stmt=s, number=1, globals=globals())
size = wiki.dataset_size / 2**30
print(f"Iterated over the {size:.1f} GB dataset in {time:.1f} s, i.e. {size * 8/time:.1f} Gbit/s")
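The batched-iteration pattern timed above can be exercised on its own. A minimal sketch follows, using a plain Python list as a stand-in for `wiki` (the list and its size are assumptions for illustration; only the slicing and `timeit` pattern come from the gist — slicing a `datasets.Dataset` with `wiki[i:i + batch_size]` likewise yields one batch at a time rather than materializing everything at once):

```python
import timeit

# Stand-in for the memory-mapped dataset: one million dummy "examples"
data = list(range(1_000_000))

stmt = """
batch_size = 1000
for i in range(0, len(data), batch_size):
    batch = data[i:i + batch_size]  # one batch per iteration, as in the gist
"""
# number=1: a single full pass over the data, like the gist's benchmark
elapsed = timeit.timeit(stmt=stmt, number=1, globals=globals())
print(f"Iterated over {len(data)} items in {elapsed:.3f} s")
```

With the real dataset the same loop is dominated by Arrow's zero-copy reads, which is why the reported throughput is expressed in Gbit/s rather than examples/s.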

thomwolf commented Jun 15, 2020

Install the requirements for running this gist with:

pip install datasets psutil

And check the details at https://github.com/huggingface/datasets

@sugatoray

I think the API has changed a little bit. See here: https://github.com/huggingface/datasets#usage.

# For loading datasets
from datasets import list_datasets, load_dataset

# To see all available dataset names
print(list_datasets()) 

# To load a dataset
wiki = load_dataset("wikipedia", "20200501.en", split='train')

For installation:

pip install datasets
