pandas tricks

See gist as a notebook here: http://nbviewer.ipython.org/gist/ghl3/8760132/notebook.ipynb

Create a simple DataFrame:

df = pandas.DataFrame({'user_id':[1, 1, 2, 2, 3, 3],
                       'type':['A','A','A','A','B','B'],
                       'val':[100, 200, 150, 30, 300, 75]})
df.head()

	type	user_id	val
0	A	1	100
1	A	1	200
2	A	2	150
3	A	2	30
4	B	3	300

To give this context, let's imagine that we have multiple users, and each user can have a type. Each row represents, for example, the value of a transaction for each user.

Let's say we want a mean and sum of all values for users. Further, we want to retain "type" information so we can later slice this data up by the type. This is how we could do this:

First, let's group by user_id and type and get a list of the values:

df.groupby(('user_id', 'type'))['val'].head()

user_id	type
1	A	0	100
		1	200
2	A	2	150
		3	30
3	B	4	300
		5	75

Now, let's aggregate over users and get the sum and mean of the values by user and type:

df.groupby(('user_id', 'type'))['val'].aggregate([np.sum, np.mean]).head()

		sum	mean
user_id	type
1	A	300	150.0
2	A	180	90.0
3	B	375	187.5

This is pretty much what we want. We now have the sum and mean of transactions by each user as well as the corresponding type for that user.

Further, let's say that we want to compare what the sums and means look like across types. We can do this by further grouping this dataframe by type and doing some plotting:

df.groupby(('user_id', 'type'))['val'].aggregate([np.sum, np.mean]).groupby(level='type').head()

			sum	mean
type	user_id	type
A	1	A	300	150.0
	2	A	180	90.0
B	3	B	375	187.5

We can do aggregates on this. Let's say that we want to take the mean transaction size of each user and find the standard deviation of that value within each "type" category. We can do that as follows:

df.groupby(('user_id', 'type'))['val'].aggregate([np.sum, np.mean]).groupby(level='type')['mean'].std()

type
A	42.426407
B	NaN

(It is NaN because the standard deviation uses N-1 degrees of freedom)

We can also get the variance which has more sane values:

df.groupby(('user_id', 'type'))['val'].aggregate([np.sum, np.mean]).groupby(level='type')['mean'].apply(np.var)

type
A	900
B	0

And finally, we can do some plotting:

grouped = df.groupby(('user_id', 'type'))['val'].aggregate([np.sum, np.mean]).groupby(level='type')

fig = plt.figure(figsize=(12,8))
color_cycle = gca()._get_lines.color_cycle
for (color, (a, grp)) in zip(color_cycle, grouped):
    plt.scatter(grp['sum'], grp['mean'], label=a, color=color)

plt.legend(loc='best')
plt.show()

ghl3/grouping.md

Select an option

No results found

Select an option

No results found