Claude Code command to group data, count group sizes, identify and display high- and low-count superkeys, and sample and display grouped records.

allowed-tools: pyspark-mcp, WebFetch, WebSearch, Bash(python:*), Bash(poetry:*), Bash(pyspark:*)
description: The groupage command groups data, counts group sizes, plots a histogram, and displays both the keys of the largest groups and the keys of groups in a middle range. Its arguments are the column to group by and the DataFrame to analyze.

Groupage Command

Description

The unique values of fields in real-world datasets often follow long-tail, log-scale distributions. This creates 'superkeys' that can cause problems in downstream code. The groupage command identifies and mitigates these superkeys.
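
As a minimal illustration of the problem (assuming an active PySpark session and a DataFrame bound to a variable called df with a hypothetical user_id column; these names are illustrative, not part of the command), comparing the median group size to the maximum exposes a long-tail distribution:

```python
# Hedged sketch: `df` and the `user_id` column are hypothetical stand-ins.
from pyspark.sql import functions as F

group_counts = df.groupBy("user_id").agg(F.count("*").alias("group_size"))

# A maximum several orders of magnitude above the median indicates superkeys
# with a long-tail, log-scale distribution of group sizes.
group_counts.agg(
    F.expr("percentile_approx(group_size, 0.5)").alias("median_size"),
    F.max("group_size").alias("max_size"),
).show()
```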

Program

Use the pyspark-mcp tool to execute the following steps:

  1. Consider the variable names in our ongoing pyspark-mcp session, then group the pyspark.sql.DataFrame $2 by the $1 field and count the size of each group (a hedged sketch of steps 1-4 appears after this list).
  2. Print a table showing the $1 keys with the 5 largest and the 5 smallest group counts in $2. This will help me evaluate the nature of the problem.
  3. Compute a log-scale histogram of the group sizes and display it using the 'plotext' library to see how many large keys and how many small keys I have.
  4. Display a 5-record sample of the keys and values of groups in the middle range (e.g., groups with sizes between 10 and 100) to better understand the distribution. Limit each group to 5 values for readability.
  5. Think about the top and bottom most frequent keys, the sampled groups, and the histogram of group sizes, then analyze the data. Create a strategy for handling superkeys in the 10 largest groups in downstream processing that minimizes their impact on the overall analysis. Use the name of the DataFrame, the field names, and the values to understand the context of the data.
  6. Write a summary of your findings and the strategy you plan to implement to handle superkeys in downstream processing. Provide multiple options if applicable.
  7. If any of these operations fail due to memory constraints or other issues, try to mitigate these problems by filtering or sampling the data appropriately.
  8. Return and display the data tables, charts and summary as the output of the command.
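
The following is a minimal sketch of steps 1-4, not the command's actual implementation. It assumes an active SparkSession named spark, the $2 DataFrame bound to a variable here called df, and the $1 column name in group_col; the table name "events" and column "user_id" are hypothetical placeholders.

```python
import plotext as plt
from pyspark.sql import functions as F

group_col = "user_id"        # stands in for $1
df = spark.table("events")   # stands in for $2

# Step 1: group by the column and count the size of each group.
group_counts = df.groupBy(group_col).agg(F.count("*").alias("group_size"))

# Step 2: keys with the 5 largest and 5 smallest group sizes.
group_counts.orderBy(F.desc("group_size")).show(5, truncate=False)
group_counts.orderBy(F.asc("group_size")).show(5, truncate=False)

# Step 3: log-scale histogram of group sizes rendered in the terminal.
# collect() assumes the number of distinct keys fits in driver memory;
# sample group_counts first if it does not (see step 7).
sizes = [row["group_size"] for row in group_counts.select("group_size").collect()]
plt.hist(sizes, bins=50)
plt.xscale("log")
plt.yscale("log")
plt.title(f"Distribution of group sizes for '{group_col}'")
plt.show()

# Step 4: sample 5 mid-range groups (sizes between 10 and 100) and show at
# most 5 records per group for readability.
mid_keys = [
    row[group_col]
    for row in group_counts.filter(F.col("group_size").between(10, 100)).limit(5).collect()
]
(
    df.filter(F.col(group_col).isin(mid_keys))
    .groupBy(group_col)
    .agg(F.slice(F.collect_list(F.struct(*df.columns)), 1, 5).alias("sample_records"))
    .show(truncate=False)
)
```

Steps 5-8 are carried out by the model itself, using the tables, chart, and samples produced above as context.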

Arguments

  • $1: The name of the column to group by (string).
  • $2: The name of the pyspark.sql.DataFrame to analyze (string).
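
Assuming this is saved as a Claude Code custom slash command (for example as .claude/commands/groupage.md; the path and argument values below are illustrative), it might be invoked with the two positional arguments like:

```
/groupage user_id events_df
```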