add grouped cumulative sum column to pandas dataframe
""" | |
add grouped cumulative sum column to pandas dataframe | |
Add a new column to a pandas dataframe which holds the cumulative sum for a given grouped window | |
Desired output: | |
user_id,day,session_minutes,cumulative_minutes | |
516530,0,NaN,0 | |
516530,1,0,0 | |
516532,0,5,5 | |
516532,1,33,38 | |
516534,0,32,32 | |
516534,1,104,136 | |
516535,0,5,5 | |
516535,1,0,5 | |
""" | |
import pandas as pd | |
from StringIO import StringIO | |
data = """ | |
user_id,day,session_minutes | |
516530,0,NaN | |
516530,1,0 | |
516532,0,5 | |
516532,1,33 | |
516534,0,32 | |
516534,1,104 | |
516535,0,5 | |
516535,1,0 | |
""" | |
df = pd.read_csv(StringIO(data)) | |
cumsums = df.groupby(['user_id', 'day']).sum().fillna(0).groupby(level=0).cumsum() | |
df.set_index(['user_id', 'day'], inplace=True) | |
df['cumulative_minutes'] = cumsums | |
df.reset_index(inplace=True) | |
print(df) | |
# user_id day session_minutes cumulative_minutes | |
# 0 516530 0 NaN 0 | |
# 1 516530 1 0 0 | |
# 2 516532 0 5 5 | |
# 3 516532 1 33 38 | |
# 4 516534 0 32 32 | |
# 5 516534 1 104 136 | |
# 6 516535 0 5 5 | |
# 7 516535 1 0 5 |
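
For comparison, the running total can also be taken directly on the column with groupby().cumsum(). A minimal sketch, assuming (as in the sample above) there is at most one row per (user_id, day) pair, so no per-day aggregation is needed first:

# running total per user; NaN treated as 0 to match the fillna(0) step above
df['cumulative_minutes'] = (
    df['session_minutes']
    .fillna(0)
    .groupby(df['user_id'])
    .cumsum()
)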
Hello,
Thank you for sharing. I tried your example with the following data:
data="""
PATTERN_NAME | PROD_INJ_DATE | OIL_PROD
_CRCU_1101 | 1/31/1973 | 4752
_CRCU_1101 | 2/28/1973 | 4428
_CRCU_1101 | 3/31/1973 | 5316
_CRCU_1101 | 4/30/1973 | 4776
_CRCU_1101 | 5/31/1973 | 4860
_CRCU_1101 | 6/30/1973 | 4905
_CRCU_1101 | 7/31/1973 | 5208
_CRCU_1101 | 8/31/1973 | 5034
_CRCU_1101 | 9/30/1973 | 3240
_CRCU_1103 | 1/31/1973 | 372
_CRCU_1103 | 2/28/1973 | 345
_CRCU_1103 | 3/31/1973 | 397
_CRCU_1103 | 4/30/1973 | 373
_CRCU_1103 | 5/31/1973 | 432
_CRCU_1103 | 6/30/1973 | 454
_CRCU_1103 | 7/31/1973 | 483
_CRCU_1103 | 8/31/1973 | 467
_CRCU_1103 | 9/30/1973 | 458
_CRCU_1104 | 1/31/1973 | 447
_CRCU_1104 | 2/28/1973 | 414
_CRCU_1104 | 3/31/1973 | 488
_CRCU_1104 | 4/30/1973 | 449
_CRCU_1104 | 5/31/1973 | 432
_CRCU_1104 | 6/30/1973 | 454
_CRCU_1104 | 7/31/1973 | 482
_CRCU_1104 | 8/31/1973 | 467
_CRCU_1104 | 9/30/1973 | 431
"""
Using the code below:
import pandas as pd

df = pd.read_csv("ProdByPattern.csv", delimiter="\t")
cumsums = df.groupby(['PATTERN_NAME', 'PROD_INJ_DATE']).sum().fillna(0).groupby(level=0).cumsum()
df.set_index(['PATTERN_NAME', 'PROD_INJ_DATE'], inplace=True)
df['cum_oil'] = cumsums
df.reset_index(inplace=True)
print(df)
I got the expected results.
PATTERN_NAME PROD_INJ_DATE OIL_PROD cum_oil
0 _CRCU_1101 1/31/1973 4752 4752
1 _CRCU_1101 2/28/1973 4428 9180
2 _CRCU_1101 3/31/1973 5316 14496
3 _CRCU_1101 4/30/1973 4776 19272
4 _CRCU_1101 5/31/1973 4860 24132
5 _CRCU_1101 6/30/1973 4905 29037
6 _CRCU_1101 7/31/1973 5208 34245
7 _CRCU_1101 8/31/1973 5034 39279
8 _CRCU_1101 9/30/1973 3240 42519
9 _CRCU_1103 1/31/1973 372 372
10 _CRCU_1103 2/28/1973 345 717
11 _CRCU_1103 3/31/1973 397 1114
12 _CRCU_1103 4/30/1973 373 1487
13 _CRCU_1103 5/31/1973 432 1919
14 _CRCU_1103 6/30/1973 454 2373
15 _CRCU_1103 7/31/1973 483 2856
16 _CRCU_1103 8/31/1973 467 3323
17 _CRCU_1103 9/30/1973 458 3781
18 _CRCU_1104 1/31/1973 447 447
19 _CRCU_1104 2/28/1973 414 861
20 _CRCU_1104 3/31/1973 488 1349
21 _CRCU_1104 4/30/1973 449 1798
22 _CRCU_1104 5/31/1973 432 2230
23 _CRCU_1104 6/30/1973 454 2684
24 _CRCU_1104 7/31/1973 482 3166
25 _CRCU_1104 8/31/1973 467 3633
26 _CRCU_1104 9/30/1973 431 4064
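
One note on the sample above: the inline data string is pipe-separated, while the snippet reads a tab-delimited ProdByPattern.csv from disk. To run the same steps directly on the inline string, a minimal sketch (the regex separator, which strips the whitespace around the pipes, is an assumption about how the sample is formatted):

import pandas as pd
from io import StringIO

# parse the pipe-separated sample; engine="python" is needed for a regex separator
df = pd.read_csv(StringIO(data), sep=r"\s*\|\s*", engine="python")
cumsums = df.groupby(['PATTERN_NAME', 'PROD_INJ_DATE']).sum().fillna(0).groupby(level=0).cumsum()
df.set_index(['PATTERN_NAME', 'PROD_INJ_DATE'], inplace=True)
df['cum_oil'] = cumsums['OIL_PROD']
df.reset_index(inplace=True)
print(df)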