---
allowed-tools: pyspark-mcp, WebFetch, WebSearch, Bash(python:*), Bash(poetry:*), Bash(pyspark:*)
description: Group data by a column, count group sizes, plot a histogram, and display the keys of the largest groups and of groups in a middle range. Arguments are the column to group by ($1) and the DataFrame to analyze ($2).
---
Note that the unique values of fields in real-world datasets often follow long-tail, log-scale distributions. This creates 'superkeys' that can cause problems in downstream code. The groupage command identifies and mitigates these superkeys.
Use the pyspark-mcp tool to execute the following steps:
- Consider the variable names of our ongoing pyspark-mcp session. Group the pyspark.sql.DataFrame $2 by the $1 column and count the size of each group (an illustrative sketch of these steps follows the list).
- Print a table of $2 showing the keys with the top 5 and bottom 5 group counts for $1. This will help me evaluate the nature of the problem.
- Compute a log-scale histogram of the group sizes and display it with the 'plotext' library, so I can see how many large and how many small keys there are.
- Display a 5-record sample of the keys and values of the groups in the middle range (e.g., groups with sizes between 10 and 100) to understand the distribution better. Limit each group to 5 values for readability.
- Review the top/bottom most frequent keys, the sampled groups, and the histogram of group sizes, then analyze the data. Create a strategy for handling superkeys in downstream processing for the top 10 largest groups that minimizes their impact on the overall analysis. Use the name of the DataFrame, the field names, and the values to understand the context of the data.
- Write a summary of your findings and the strategy you plan to implement to handle superkeys in downstream processing. Provide multiple options if applicable.
- If any of these operations fail due to memory constraints or other issues, mitigate the problem by filtering or sampling the data appropriately (see the mitigation sketch after the argument list).
- Return and display the data tables, charts and summary as the output of the command.
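
The sketch below shows roughly what the first four steps could look like in the session. It is illustrative only: `df` and `col_name` stand in for $2 and $1, the bin count and the 10-100 middle range are placeholder defaults, and it assumes the 'plotext' package is installed.

```python
# Illustrative sketch only; `df` and `col_name` are placeholders for $2 and $1.
from pyspark.sql import functions as F
import plotext as plt

col_name = "user_id"  # $1 (hypothetical)
# df = ...            # $2, already bound in the pyspark-mcp session

# Group by the target column and count group sizes.
counts = df.groupBy(col_name).count()

# Keys with the top-5 and bottom-5 group sizes.
counts.orderBy(F.desc("count")).show(5)
counts.orderBy(F.asc("count")).show(5)

# Log-scale histogram of group sizes, rendered in the terminal.
# Collects one integer per distinct key; sample first if the key space is huge.
sizes = [row["count"] for row in counts.select("count").collect()]
plt.hist(sizes, bins=30)
plt.xscale("log")
plt.title(f"Group sizes for '{col_name}'")
plt.show()

# Sample 5 middle-range groups (sizes 10..100), showing at most 5 rows each.
middle_keys = [
    row[col_name]
    for row in counts.filter(F.col("count").between(10, 100)).limit(5).collect()
]
for key in middle_keys:
    df.filter(F.col(col_name) == key).show(5)
```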
Arguments:
- $1: The name of the column to group by (string).
- $2: The name of the pyspark.sql.DataFrame to analyze (string).
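
If memory becomes a constraint, or once superkeys are confirmed, two standard mitigations are sampling before grouping and salting the hot keys. The sketch below reuses the placeholder `df` and `col_name` from above; the 1% fraction, the top-10 cutoff, and the 16 salt buckets are illustrative choices, not fixed parts of the command.

```python
from pyspark.sql import functions as F

# Option A: estimate the distribution from a sample when full counts
# fail on memory; sampled counts scale up by roughly 1/fraction.
approx_counts = (
    df.sample(fraction=0.01, seed=42)
      .groupBy(col_name)
      .count()
      .withColumn("approx_full_count", F.col("count") * 100)
)

# Option B: salt the top-10 superkeys so downstream aggregations and joins
# spread their rows across partitions instead of skewing a single one.
top10 = [
    row[col_name]
    for row in df.groupBy(col_name).count()
                 .orderBy(F.desc("count")).limit(10).collect()
]
salted = df.withColumn(
    "salted_key",
    F.when(
        F.col(col_name).isin(top10),
        F.concat(
            F.col(col_name).cast("string"),
            F.lit("_"),
            (F.rand(seed=7) * 16).cast("int").cast("string"),
        ),
    ).otherwise(F.col(col_name).cast("string")),
)
# Aggregate on the salted key first, then roll up to the real key.
partial = salted.groupBy("salted_key", col_name).count()
final = partial.groupBy(col_name).agg(F.sum("count").alias("count"))
```

Salting is a standard skew-mitigation technique; the two-phase aggregation at the end recovers exact per-key counts from the salted partials.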