@Houssem96
Created August 22, 2021 13:11
Class distribution and text lengths for each class
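import matplotlib.pyplot as plt
from datasets import load_dataset

# Assumption: the original leaves dataset_name undefined; "ag_news" is the Hub ID implied by the variable names
dataset_name = "ag_news"
AG_news_dataset = load_dataset(dataset_name)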
AG_news_dataset.set_format(type="pandas")
df = AG_news_dataset["train"][:]
def label_int2str(row, split):
    # Convert an integer class label to its string name
    return AG_news_dataset[split].features["label"].int2str(row)
df["label_name"] = df["label"].apply(label_int2str, split="train")
df["label_name"].value_counts(ascending=True).plot.barh()
plt.title("Category Counts")
df["Words Per Text"] = df["text"].str.split().apply(len)
# The column created above is "Words Per Text", so the box plot must use that name
df.boxplot("Words Per Text", by="label_name", grid=False, showfliers=False,
           color="black")
plt.suptitle("")
plt.xlabel("")
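
The plots above summarize class balance and text length visually. As a minimal sketch, the same two summaries can be checked numerically on a small, hypothetical DataFrame (`toy_df` and its four rows are invented for illustration and are not taken from the dataset), which avoids downloading anything:

```python
import pandas as pd

# Hypothetical toy data standing in for the training split (illustrative only)
toy_df = pd.DataFrame({
    "text": [
        "Stocks rally as markets reopen",
        "Team wins the final",
        "New chip doubles battery life for phones",
        "Leaders meet for peace talks",
    ],
    "label_name": ["Business", "Sports", "Sci/Tech", "World"],
})

# Same word-count definition as above: whitespace-separated tokens
toy_df["Words Per Text"] = toy_df["text"].str.split().apply(len)

# Numeric counterpart of the bar chart: number of texts per class
class_counts = toy_df["label_name"].value_counts()

# Numeric counterpart of the box plot: median word count per class
median_words = toy_df.groupby("label_name")["Words Per Text"].median()

print(class_counts)
print(median_words)
```

Swapping `toy_df` for `df` should give the per-class counts and medians behind the two figures, assuming `df` has the `text` and `label_name` columns built above.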