@jeremy-rutman
Last active June 24, 2020 10:21
import numpy as np
import pandas as pd

# select all columns except one
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 3, 4]})
df2 = df.loc[:, df.columns != 'b']
print(df)
print(df2)
# shuffle df, then split by position: train = first 90%, validate = 90% to 95%, test = 95% to 100%
train, validate, test = np.split(df.sample(frac=1), [int(.9*len(df)), int(.95*len(df))])
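# A hedged sketch of the same 90/5/5 split using positional slicing with a fixed seed,
# so the shuffle is reproducible. Assumptions (not in the original): the helper name
# split_frame, its parameters, the seed value, and the demo frame are illustrative only.
import pandas as pd

def split_frame(frame, train_frac=0.9, val_frac=0.05, seed=0):
    """Shuffle `frame`, then cut it into train/validate/test by row position."""
    shuffled = frame.sample(frac=1, random_state=seed)
    train_end = int(train_frac * len(shuffled))
    val_end = int((train_frac + val_frac) * len(shuffled))
    return (shuffled.iloc[:train_end],
            shuffled.iloc[train_end:val_end],
            shuffled.iloc[val_end:])

demo = pd.DataFrame({'a': range(100)})
demo_train, demo_val, demo_test = split_frame(demo)
print(len(demo_train), len(demo_val), len(demo_test))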
# FILTERING
# filtering groups and accessing group info
# assumes df_pd has a 'time_bucket' column and n_rows is an int threshold defined elsewhere
g = df_pd.groupby('time_bucket')
sizes = g.size().to_numpy()  # number of rows in each group
# keep groups with at most n_rows rows as-is; otherwise sample 10% of the group
df_out = g.apply(lambda x: x if (len(x) <= n_rows) else x.sample(frac=0.1))
# OR
# keep only groups with fewer than n_rows rows (larger groups are dropped entirely)
df_out = g.filter(lambda x: len(x) < n_rows)
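# A small illustration of group sizes and filter() on a tiny frame. The frame, its
# values, and the threshold of 3 are hypothetical stand-ins, not the author's data.
import pandas as pd

demo_df = pd.DataFrame({'time_bucket': [0, 0, 0, 0, 1, 1, 2], 'value': range(7)})
demo_g = demo_df.groupby('time_bucket')
demo_sizes = demo_g.size()  # rows per group, indexed by time_bucket
# filter() keeps or drops whole groups: bucket 0 has 4 rows, so it is dropped
demo_small = demo_g.filter(lambda x: len(x) < 3)
print(demo_sizes)
print(demo_small)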
# one-liner: summary statistics of the number of whitespace-separated tokens per row of a text column
df.mycol.str.split().str.len().describe()