See gist as a notebook here: http://nbviewer.ipython.org/gist/ghl3/8760132/notebook.ipynb
Create a simple DataFrame:
df = pandas.DataFrame({'user_id':[1, 1, 2, 2, 3, 3],
'type':['A','A','A','A','B','B'],
'val':[100, 200, 150, 30, 300, 75]})
df.head()| type | user_id | val | |
|---|---|---|---|
| 0 | A | 1 | 100 |
| 1 | A | 1 | 200 |
| 2 | A | 2 | 150 |
| 3 | A | 2 | 30 |
| 4 | B | 3 | 300 |
To give this context, let's imagine that we have multiple users, and each user can have a type. Each row represents, for example, the value of a transaction for each user.
Let's say we want a mean and sum of all values for users. Further, we want to retain "type" information so we can later slice this data up by the type. This is how we could do this:
First, let's group by user_id and type and get a list of the values:
df.groupby(('user_id', 'type'))['val'].head()| user_id | type | ||
|---|---|---|---|
| 1 | A | 0 | 100 |
| 1 | 200 | ||
| 2 | A | 2 | 150 |
| 3 | 30 | ||
| 3 | B | 4 | 300 |
| 5 | 75 |
Now, let's aggregate over users and get the sum and mean of the values by user and type:
df.groupby(('user_id', 'type'))['val'].aggregate([np.sum, np.mean]).head()| sum | mean | ||
|---|---|---|---|
| user_id | type | ||
| 1 | A | 300 | 150.0 |
| 2 | A | 180 | 90.0 |
| 3 | B | 375 | 187.5 |
This is pretty much what we want. We now have the sum and mean of transactions by each user as well as the corresponding type for that user.
Further, let's say that we want to compare what the sums and means look like across types. We can do this by further grouping this dataframe by type and doing some plotting:
df.groupby(('user_id', 'type'))['val'].aggregate([np.sum, np.mean]).groupby(level='type').head()| sum | mean | |||
|---|---|---|---|---|
| type | user_id | type | ||
| A | 1 | A | 300 | 150.0 |
| 2 | A | 180 | 90.0 | |
| B | 3 | B | 375 | 187.5 |
We can do aggregates on this. Let's say that we want to take the mean transaction size of each user and find the standard deviation of that value within each "type" category. We can do that as follows:
df.groupby(('user_id', 'type'))['val'].aggregate([np.sum, np.mean]).groupby(level='type')['mean'].std()| type | |
|---|---|
| A | 42.426407 |
| B | NaN |
(It is NaN because the standard deviation uses N-1 degrees of freedom)
We can also get the variance which has more sane values:
df.groupby(('user_id', 'type'))['val'].aggregate([np.sum, np.mean]).groupby(level='type')['mean'].apply(np.var)| type | |
|---|---|
| A | 900 |
| B | 0 |
And finally, we can do some plotting:
grouped = df.groupby(('user_id', 'type'))['val'].aggregate([np.sum, np.mean]).groupby(level='type')
fig = plt.figure(figsize=(12,8))
color_cycle = gca()._get_lines.color_cycle
for (color, (a, grp)) in zip(color_cycle, grouped):
plt.scatter(grp['sum'], grp['mean'], label=a, color=color)
plt.legend(loc='best')
plt.show()