@SH2282000
Last active March 30, 2025 01:13
Exact size of the SQuAD2.0 dataset

Motivation

I spent dozens of minutes searching for just the exact sizes of the different SQuAD2.0 datasets (dev, val and train).

Results

Sizes of the different datasets as of 30 March 2025 (I did not find a separate validation dataset):

Dev Dataset Summary:
Number of articles (top-level entries): 35
Total number of questions: 11873

Train Dataset Summary:
Number of articles (top-level entries): 442
Total number of questions: 130319

The script I used looks like this:

from pathlib import Path

import pandas as pd

if __name__ == "__main__":
    df = pd.read_json(Path("data/train-v2.0.json"))

    # Display a summary of the dataset.
    # Each row of df is one item of the top-level "data" list (one article).
    print("Dataset Summary:")
    print(f"Number of articles: {len(df)}")
    print(f"Columns: {df.columns.tolist()}")

    # Count the number of questions in the dataset
    num_questions = sum(
        len(paragraph["qas"])
        for entry in df["data"]
        for paragraph in entry["paragraphs"]
    )
    print(f"Total number of questions: {num_questions}")
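
The same counts can also be obtained without pandas. The following is only a minimal sketch using the standard `json` module. It assumes the files follow the nested layout the script above relies on (`data` → `paragraphs` → `qas`), and that each question may carry the `is_impossible` flag SQuAD2.0 uses for unanswerable questions. The function name `summarize_squad` and the `data/dev-v2.0.json` path are illustrative assumptions, not part of the original script.

```python
import json
from pathlib import Path


def summarize_squad(dataset: dict) -> dict:
    """Count articles, paragraphs and questions in a SQuAD-style dict."""
    articles = dataset["data"]
    paragraphs = [p for article in articles for p in article["paragraphs"]]
    questions = [qa for p in paragraphs for qa in p["qas"]]
    return {
        "articles": len(articles),
        "paragraphs": len(paragraphs),
        "questions": len(questions),
        # Questions without the flag are treated as answerable (assumption)
        "unanswerable": sum(1 for qa in questions if qa.get("is_impossible", False)),
    }


if __name__ == "__main__":
    # Assumed file locations; adjust to wherever the JSON files were saved
    for name in ("train-v2.0.json", "dev-v2.0.json"):
        path = Path("data") / name
        if not path.exists():
            print(f"{name}: not found, skipping")
            continue
        with path.open(encoding="utf-8") as f:
            print(name, summarize_squad(json.load(f)))
```

Running it on both files should reproduce the question totals listed in the results, and the `unanswerable` count shows how many of those questions have no answer in the paragraph.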

Feel free to modify, improve and share again.
