# RMSLE error function
import math

def rmsle_error(y, y_pred):
    """Root Mean Squared Logarithmic Error between true values y and predictions y_pred."""
    assert len(y) == len(y_pred)
    to_sum = [(math.log(pred + 1) - math.log(actual + 1)) ** 2.0 for actual, pred in zip(y, y_pred)]
    return (sum(to_sum) * (1.0 / len(y))) ** 0.5
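A quick sanity check on toy values (illustrative only, not part of the original code):

y_true = [3.0, 5.0, 2.5]
y_hat = [2.5, 5.0, 3.0]
print(rmsle_error(y_true, y_hat))  # ~0.11, small because the predictions are close to the targets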
def num_feature(df, vectorizer_nums=None, scale=None, training=True):
    """
    During EDA and manual review of the text I found that a lot of numerical information
    is available in it, e.g. "10ml", "2 packs", "10 lipsticks", "512gb", and that this
    numerical data has an impact on the price of an item. This function vectorizes it:
    we first find patterns like "10ml", "160gb", "2 packs", etc. For a sentence such as
    "250ml 2 packs of xyz company", the "ml" column gets the value 250 and the "packs"
    column gets the value 2. The output is a sparse matrix.
    """
    def get_featuers(phrase):
        """
        Finds all possible numerical patterns in the training data and updates the
        feature dictionary accordingly.
        """
# ref - https://www.kaggle.com/gspmoreira/cnn-glove-single-model-private-lb-0-41117-35th
import numpy as np

def generate_cbs_stats(train, test):
    # Aggregate log-price statistics per category/brand/shipping (cbs) group.
    df_group = train.groupby('cat_brand_ship', as_index=False).agg(
        {"shipping": len,
         "log_price": [np.median, np.mean, np.std, np.min, np.max]})
    df_group.columns = ['cat_brand_ship', 'cbs_count', 'cbs_log_price_median', 'cbs_log_price_mean',
                        'cbs_log_price_std', 'cbs_log_price_min', 'cbs_log_price_max']
    # Groups with a single row have NaN std; treat them as zero spread.
    df_group['cbs_log_price_std'] = df_group['cbs_log_price_std'].fillna(0)
    df_group['cbs_log_price_conf_variance'] = df_group['cbs_log_price_std'] / df_group['cbs_log_price_mean']
    df_group['cbs_log_count'] = np.log1p(df_group['cbs_count'])
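These group statistics are only useful once they are merged back onto the individual rows. Below is a rough sketch of how that could look; the column names category_name, brand_name and shipping used to build the cat_brand_ship key are assumptions for Mercari-style data, and the original key-building/merge code is not part of the snippet above.

import pandas as pd

def add_cbs_key(df):
    # Build the grouping key from category, brand and shipping flag (assumed column names).
    df['cat_brand_ship'] = (df['category_name'].fillna('missing') + '_' +
                            df['brand_name'].fillna('missing') + '_' +
                            df['shipping'].astype(str))
    return df

# train = add_cbs_key(train); test = add_cbs_key(test)
# train = train.merge(df_group, on='cat_brand_ship', how='left')
# test = test.merge(df_group, on='cat_brand_ship', how='left')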
import numpy as np
from hyperopt import hp, fmin, tpe, Trials, STATUS_OK

# Hyperparameter search space for the LightGBM regressor.
lgb_reg_params = {
    'learning_rate': hp.uniform('learning_rate', 0.1, 1),
    'max_depth': hp.choice('max_depth', np.arange(2, 100, 1, dtype=int)),
    'min_child_weight': hp.choice('min_child_weight', np.arange(1, 50, 1, dtype=int)),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.4, 1),
    'subsample': hp.uniform('subsample', 0.6, 1),
    'num_leaves': hp.choice('num_leaves', np.arange(1, 200, 1, dtype=int)),
    'min_split_gain': hp.uniform('min_split_gain', 0, 1),
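For context, here is a hedged sketch of how such a hyperopt search space is typically consumed with fmin and a LightGBM regressor. It assumes the lgb_reg_params dictionary above is closed after the last entry shown, and it uses toy data; the objective function, metric and max_evals value are illustrative, not the original tuning code.

import lightgbm as lgb
import numpy as np
from hyperopt import fmin, tpe, Trials, STATUS_OK
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Toy data standing in for the real feature matrix and log-price target.
X, y = make_regression(n_samples=2000, n_features=20, noise=0.1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

def objective(params):
    params = dict(params)
    params['num_leaves'] = max(2, int(params['num_leaves']))  # LightGBM requires num_leaves >= 2
    model = lgb.LGBMRegressor(n_estimators=200, **params)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    rmse = float(np.sqrt(np.mean((preds - y_valid) ** 2)))  # RMSE on the (log) target
    return {'loss': rmse, 'status': STATUS_OK}

trials = Trials()
best = fmin(fn=objective, space=lgb_reg_params, algo=tpe.suggest,
            max_evals=25, trials=trials)
print(best)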
# This is a sample build configuration for a Docker app.
# Check our guides at https://confluence.atlassian.com/x/e8YWN for more examples.
# Only use spaces to indent your .yml configuration.
# -----
# You can specify a custom Docker image from Docker Hub as your build environment.
image: ubuntu

pipelines:
  default:
    - step:
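        # NOTE: the original snippet is truncated after "- step:". The lines below
        # are an illustrative completion only; the step name and script commands
        # are assumptions, not the project's actual pipeline.
        name: Build and test
        script:
          - apt-get update && apt-get install -y python3 python3-pip
          - pip3 install -r requirements.txt
          - python3 -m pytest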