I spent quite a while searching for the exact sizes of the different SQuAD2.0 datasets (dev, val and train).
Sizes of the different datasets as of the 30th of March 2025 (I could not find a validation dataset):
Dev Dataset Summary:
Number of categories: 35
Total number of questions: 11873
Train Dataset Summary:
Number of categories: 442
Total number of questions: 130319
The script I used looks like this:
from pathlib import Path

import pandas as pd

if __name__ == "__main__":
    df = pd.read_json(Path("data/train-v2.0.json"))

    # Display a summary of the dataset
    print("Dataset Summary:")
    print(f"Number of categories: {len(df)}")
    print(f"Columns: {df.columns.tolist()}")

    # Count the number of questions in the dataset
    num_questions = sum(
        len(paragraph["qas"])
        for entry in df["data"]
        for paragraph in entry["paragraphs"]
    )
    print(f"Total number of questions: {num_questions}")
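If you want both files summarized in one run, here is a minimal sketch of the same counting logic using only the standard library. The helper name summarize_squad is hypothetical, and the data/dev-v2.0.json location is an assumption (only the train path appears in my script), so adjust the paths to wherever you saved the files.

```python
import json
from pathlib import Path


def summarize_squad(path):
    """Return (number of top-level entries, total number of questions)."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)["data"]
    # Each entry holds a list of paragraphs, each with its own "qas" list
    num_questions = sum(
        len(paragraph["qas"])
        for entry in entries
        for paragraph in entry["paragraphs"]
    )
    return len(entries), num_questions


if __name__ == "__main__":
    # Assumed file locations; missing files are skipped rather than raising
    for name in ("train-v2.0.json", "dev-v2.0.json"):
        path = Path("data") / name
        if not path.exists():
            print(f"{path}: not found, skipping")
            continue
        num_entries, num_questions = summarize_squad(path)
        print(f"{name}: {num_entries} entries, {num_questions} questions")
```

This avoids the pandas dependency, since the script only needs to walk the nested "data" list and never uses the DataFrame itself.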
Feel free to modify, improve and share again.