Skip to content

Instantly share code, notes, and snippets.

@ghl3
Last active August 29, 2015 13:55
Show Gist options
  • Select an option

  • Save ghl3/8760132 to your computer and use it in GitHub Desktop.

Select an option

Save ghl3/8760132 to your computer and use it in GitHub Desktop.
pandas tricks

See gist as a notebook here: http://nbviewer.ipython.org/gist/ghl3/8760132/notebook.ipynb

Create a simple DataFrame:

df = pandas.DataFrame({'user_id':[1, 1, 2, 2, 3, 3],
                       'type':['A','A','A','A','B','B'],
                       'val':[100, 200, 150, 30, 300, 75]})
df.head()
type user_id val
0 A 1 100
1 A 1 200
2 A 2 150
3 A 2 30
4 B 3 300

To give this context, let's imagine that we have multiple users, and each user can have a type. Each row represents, for example, the value of a transaction for each user.

Let's say we want a mean and sum of all values for users. Further, we want to retain "type" information so we can later slice this data up by the type. This is how we could do this:

First, let's group by user_id and type and get a list of the values:

df.groupby(('user_id', 'type'))['val'].head()
user_id type
1 A 0 100
1 200
2 A 2 150
3 30
3 B 4 300
5 75

Now, let's aggregate over users and get the sum and mean of the values by user and type:

df.groupby(('user_id', 'type'))['val'].aggregate([np.sum, np.mean]).head()
sum mean
user_id type
1 A 300 150.0
2 A 180 90.0
3 B 375 187.5

This is pretty much what we want. We now have the sum and mean of transactions by each user as well as the corresponding type for that user.

Further, let's say that we want to compare what the sums and means look like across types. We can do this by further grouping this dataframe by type and doing some plotting:

df.groupby(('user_id', 'type'))['val'].aggregate([np.sum, np.mean]).groupby(level='type').head()
sum mean
type user_id type
A 1 A 300 150.0
2 A 180 90.0
B 3 B 375 187.5

We can do aggregates on this. Let's say that we want to take the mean transaction size of each user and find the standard deviation of that value within each "type" category. We can do that as follows:

df.groupby(('user_id', 'type'))['val'].aggregate([np.sum, np.mean]).groupby(level='type')['mean'].std()
type
A 42.426407
B NaN

(It is NaN because the standard deviation uses N-1 degrees of freedom)

We can also get the variance which has more sane values:

df.groupby(('user_id', 'type'))['val'].aggregate([np.sum, np.mean]).groupby(level='type')['mean'].apply(np.var)
type
A 900
B 0

And finally, we can do some plotting:

grouped = df.groupby(('user_id', 'type'))['val'].aggregate([np.sum, np.mean]).groupby(level='type')

fig = plt.figure(figsize=(12,8))
color_cycle = gca()._get_lines.color_cycle
for (color, (a, grp)) in zip(color_cycle, grouped):
    plt.scatter(grp['sum'], grp['mean'], label=a, color=color)

plt.legend(loc='best')
plt.show()
Display the source blob
Display the rendered blob
Raw
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment