Exact size of the SQuAD2.0 dataset

Motivation

I spent dozens of minutes looking for just the exact sizes of the different SQuAD2.0 datasets (dev, val and train).

Results

Sizes of the different datasets as of 30 March 2025 (I did not find a validation dataset):

Dev Dataset Summary:
Number of entries: 35
Total number of questions: 11873

Train Dataset Summary:
Number of entries: 442
Total number of questions: 130319

The script I used looks like this:

from pathlib import Path

import pandas as pd

if __name__ == "__main__":
    df = pd.read_json(Path("data/train-v2.0.json"))

    # Display a summary of the dataset; each row is one top-level entry
    # of the "data" list in the JSON file
    print("Dataset Summary:")
    print(f"Number of entries: {len(df)}")
    print(f"Columns: {df.columns.tolist()}")

    # Count the number of questions in the dataset
    num_questions = sum(
        len(paragraph["qas"])
        for entry in df["data"]
        for paragraph in entry["paragraphs"]
    )
    print(f"Total number of questions: {num_questions}")
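
A possible variation, as a minimal sketch rather than the script I actually ran: the same count using only the standard library, wrapped in a function so it can be run on both files. The function name `summarize_squad` is hypothetical, the `dev-v2.0.json` path is an assumption mirroring the train path above, and the `is_impossible` field is assumed to mark unanswerable questions.

```python
import json
from pathlib import Path


def summarize_squad(path):
    """Count articles, paragraphs and questions in a SQuAD-style JSON file."""
    with open(path, encoding="utf-8") as f:
        articles = json.load(f)["data"]

    paragraphs = [p for article in articles for p in article["paragraphs"]]
    questions = [qa for p in paragraphs for qa in p["qas"]]
    # Assumption: "is_impossible" flags unanswerable questions; default to
    # False so the function also runs on files without that field.
    unanswerable = sum(1 for qa in questions if qa.get("is_impossible", False))

    return {
        "articles": len(articles),
        "paragraphs": len(paragraphs),
        "questions": len(questions),
        "unanswerable": unanswerable,
    }


if __name__ == "__main__":
    # Assumed file locations; adjust to wherever the files were downloaded.
    for name in ("train-v2.0.json", "dev-v2.0.json"):
        path = Path("data") / name
        if path.exists():
            print(name, summarize_squad(path))
```

Returning a dictionary instead of printing directly makes it easier to compare the splits or reuse the numbers elsewhere.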

Feel free to modify, improve and share again.
