Vibhu Jawa (VibhuJawa), Nvidia, Santa Clara
VibhuJawa / non_persisted_df_xgboost.ipynb
Last active September 23, 2019 18:40
This gist shows the error I get when trying to train a non-persisted df with Dask.
VibhuJawa / persisted_df_xgboost_error.ipynb
Created September 23, 2019 18:33
This gist shows the error I get with a persisted DF.
VibhuJawa / dask_run_non_persisited.ipynb
Created August 15, 2019 00:58
I can't complete the Dask persisted run, as I get a CommClosedError.
VibhuJawa / dask_run_persisited.ipynb
Last active August 15, 2019 00:51
Here I am trying to train a persisted dataframe using XGBoost on the `cudf-interoperabilty` (cudf-interop) branch, but I can't send persisted dataframes for training.
print("Length of df = {:,}".format(len(gdf)))
%time cleaned_df = extract_columns_without_regex(gdf)
import pandas as pd

def extract_columns_small_regex_pd(df):
    # Extract the first four fields plus a catch-all "suffix" from each log line.
    p1 = r"\[haproxy@([0-9.]*)\]\s\S*([A-Z][\S\s]*) ([\S]*)\[([0-9]*)\]:([\S\s]*)"
    df1 = df['logline'].str.extract(p1)
    temp_cols = cols[:4]  # `cols` holds the target column names (defined elsewhere in the notebook)
    temp_cols.append("suffix")
    df1.columns = temp_cols
    # Split the suffix into client IP, port, frontend name, and the remainder.
    p2 = r"\s([0-9.]*):([0-9]*)\s\[([\S]*)\]([\S\s]*)"
    extract_p2_df = df1['suffix'].str.extract(p2)
    df2 = pd.concat([df1, extract_p2_df], axis=1)
    return df2
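To make it easier to see what the two patterns capture, here is a quick check with Python's stdlib `re` on a single haproxy-style log line (the sample line itself is invented for illustration; the real log format may differ slightly):

```python
import re

p1 = r"\[haproxy@([0-9.]*)\]\s\S*([A-Z][\S\s]*) ([\S]*)\[([0-9]*)\]:([\S\s]*)"
p2 = r"\s([0-9.]*):([0-9]*)\s\[([\S]*)\]([\S\s]*)"

# Hypothetical log line, made up to exercise both patterns.
line = "[haproxy@10.0.0.1] Jan 10 10:00:00 haproxy[1234]: 192.168.0.1:5000 [frontend] message"

# p1 captures: log IP, timestamp, process name, PID, and the unparsed suffix.
m1 = re.search(p1, line)
print(m1.groups())
# ('10.0.0.1', 'Jan 10 10:00:00', 'haproxy', '1234', ' 192.168.0.1:5000 [frontend] message')

# p2 then splits the suffix into client IP, port, frontend name, and the rest.
m2 = re.search(p2, m1.group(5))
print(m2.groups())
# ('192.168.0.1', '5000', 'frontend', ' message')
```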
def extract_columns_custom_pd(df):
    ### added expand=True for split
    ### changed drop_column to drop
    clean_df = df['logline'].str.split(' ', expand=True)
    # log_ip column
    clean_df['log_ip'] = clean_df[0].str.lstrip('[haproxy@').str.rstrip(']')
    clean_df.drop(columns=[0], inplace=True)
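One thing worth noting about the cleanup above: `lstrip('[haproxy@')` treats its argument as a set of characters to strip, not as a literal prefix. It happens to work here only because the IP address starts with a digit, which is outside that set. A quick stdlib check (the sample token is invented):

```python
token = "[haproxy@10.0.0.1]"

# lstrip strips any leading character in the set {'[', 'h', 'a', 'p', 'r', 'o', 'x', 'y', '@'},
# stopping at the first character outside the set (the digit '1').
ip = token.lstrip('[haproxy@').rstrip(']')
print(ip)  # 10.0.0.1

# A prefix-exact alternative (Python 3.9+) that avoids the character-set pitfall:
ip2 = token.removeprefix('[haproxy@').removesuffix(']')
print(ip2)  # 10.0.0.1
```

If a field ever began with one of the stripped characters (e.g. a hostname starting with "p"), the `lstrip` version would silently eat it, while `removeprefix` would not.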