Ashish Dutt duttashi

@duttashi
duttashi / solving_smartgiterror.txt
Last active August 5, 2021 12:15
smartgit push error: unable to read askpass response from 'askpass.cmd' could not read Username for 'https://github.com': terminal prompts disabled
Environment
SmartGit version 21.1
Windows 10
Error:
unable to read askpass response from 'askpass.cmd'
could not read Username for 'https://github.com': terminal prompts disabled
also unable to launch command prompt
Solution:
duttashi / cleaning_text_data_using_regex.py
Last active May 25, 2021 07:13
common text data preprocessing regex implementations
import re

# Suppose the text data is loaded in a dataframe called df.
# Use regular expressions to clean the text data.
# Remove Twitter handles
df.text = df.text.apply(lambda x: re.sub(r'@[^\s]+', '', x))
# Remove hashtags
df.text = df.text.apply(lambda x: re.sub(r'\B#\S+', '', x))
# Remove URLs
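
The preview above cuts off before the URL step. As a minimal sketch of the same cleaning idea on plain strings, the block below reuses the handle and hashtag patterns from the snippet; `URL_PATTERN`, `clean_text`, and the sample sentence are assumptions added for illustration and are not taken from the gist.

```python
import re

# Handle and hashtag patterns follow the snippet above.
HANDLE_PATTERN = re.compile(r'@[^\s]+')
HASHTAG_PATTERN = re.compile(r'\B#\S+')
# Assumption: a simple URL pattern, not the one used in the original gist.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')


def clean_text(text):
    """Strip URLs, handles, and hashtags, then collapse leftover whitespace."""
    # URLs go first so that an '@' inside a URL is not treated as a handle.
    text = URL_PATTERN.sub('', text)
    text = HANDLE_PATTERN.sub('', text)
    text = HASHTAG_PATTERN.sub('', text)
    return ' '.join(text.split())


sample = "Thanks @user for the tip #regex see https://example.com/page now"
cleaned = clean_text(sample)
print(cleaned)
```

With a dataframe, the same function could be applied as `df.text = df.text.apply(clean_text)`.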

Problem statement

I did some data analysis that generated a huge data file, larger than 100 MB. I accidentally tried to push it to GitHub, and the nightmares began! I kept getting error messages: the push was rejected because large files were detected.

Solution

The following solution worked for me:

  1. Open command shell in the repo
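
The remaining steps are cut off in this preview, so the sketch below is not the author's solution. It is a small helper, under stated assumptions, for spotting oversized files in a working tree before committing. The name `find_large_files` is hypothetical, and the 100 MB limit from the problem statement is assumed to mean 100 * 1024 * 1024 bytes.

```python
import os

# Assumption: the 100 MB limit is treated as 100 * 1024 * 1024 bytes.
SIZE_LIMIT_BYTES = 100 * 1024 * 1024


def find_large_files(repo_root, limit_bytes=SIZE_LIMIT_BYTES):
    """Return (path, size) pairs for files under repo_root larger than limit_bytes."""
    large = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Do not descend into Git's own object store.
        if '.git' in dirnames:
            dirnames.remove('.git')
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file
            if size > limit_bytes:
                large.append((path, size))
    # Largest files first
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Calling `find_large_files(".")` from the repository root lists candidates to exclude. It only inspects the working tree: a large file that is already part of a commit still has to be removed from that commit before the push can succeed.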
duttashi / train_validate_test_split.py
Created February 18, 2021 07:53
Function to split a dataset into 3 sets (train, validate, test)
# create a custom function to split data into 3 sets
import numpy as np
def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    np.random.seed(seed)
    # shuffle the index labels of the dataframe
    perm = np.random.permutation(df.index)
    m = len(df.index)
    # positions in the shuffled index where the train and validate sets end
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
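
The preview stops before the function returns anything. One plausible way to finish it, offered as a sketch rather than the author's actual code, is to slice the shuffled labels at `train_end` and `validate_end`. The three `df.loc` lines, the `return`, and the `demo` frame are assumptions, and the approach assumes the index labels are unique.

```python
import numpy as np
import pandas as pd


def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    """Shuffle the rows of df and split them into train, validate, and test sets."""
    np.random.seed(seed)
    perm = np.random.permutation(df.index)  # shuffled index labels
    m = len(df.index)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    # Assumed completion: slice the shuffled labels into three groups.
    # Label-based selection requires a unique index.
    train = df.loc[perm[:train_end]]
    validate = df.loc[perm[train_end:validate_end]]
    test = df.loc[perm[validate_end:]]
    return train, validate, test


# Hypothetical 10-row frame used only for demonstration
demo = pd.DataFrame({"x": range(10), "y": list("abcdefghij")})
train, validate, test = train_validate_test_split(demo, seed=42)
print(len(train), len(validate), len(test))
```

Whatever fraction is left after the train and validate shares goes to the test set, so the default arguments give a 60/20/20 split.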
# load required libraries
library(tidyverse)
# read data into memory
df_train <- read.csv("kaggle_fake_job_prediction/data/fake_job_postings.csv",
                     header = TRUE, na.strings = c(" ", "NA"), stringsAsFactors = FALSE, strip.white = TRUE)
# create copy
df <- df_train
# coerce character vars to factor for data cleanup
df <- df %>%
  mutate_if(is.character, as.factor)

There are three options for converting categorical features to numerical ones:

  • Use OneHotEncoder. This transforms the categorical feature into four new columns, where each row contains a single 1 and 0 everywhere else. The problem here is that the difference between "morning" and "afternoon" is the same as the difference between "morning" and "evening".

  • Use OrdinalEncoder. This transforms the categorical feature into just one column: "morning" to 1, "afternoon" to 2, etc. The difference between "morning" and "afternoon" will be smaller than that between "morning" and "evening", which is good, but the difference between "morning" and "night" will be the greatest, which might not be what you want.

  • Use a transformation that I call two_hot_encoder. It is similar to OneHotEncoder, except that there are two 1s in each row. The difference between "mor
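
The distance claims for the first two options can be checked with a small sketch. It does not call scikit-learn: `one_hot` and `ordinal` are hand-rolled stand-ins written for illustration, and the category order is an assumption. The two_hot_encoder is left out because its description is cut off above.

```python
import math

# Assumed category order for the four times of day named in the text
categories = ["morning", "afternoon", "evening", "night"]


def one_hot(value):
    """One column per category: a single 1, and 0 everywhere else."""
    return [1 if value == c else 0 for c in categories]


def ordinal(value):
    """A single column: "morning" -> 1, "afternoon" -> 2, and so on."""
    return categories.index(value) + 1


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# One-hot: every pair of distinct categories is the same distance apart.
onehot_distances = {
    other: euclidean(one_hot("morning"), one_hot(other))
    for other in categories[1:]
}

# Ordinal: the distance grows with the position in the list.
ordinal_distances = {
    other: abs(ordinal("morning") - ordinal(other))
    for other in categories[1:]
}

print(onehot_distances)
print(ordinal_distances)
```

Under the ordinal encoding, "night" ends up farthest from "morning", even though night wraps around to morning on a clock. That is the drawback the second bullet points out.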

duttashi / ggplotRegression.R
Created December 19, 2018 23:15
Plot the linear regression results
# create function to plot linear regression results
# adapted from https://sejohnston.com/2012/08/09/a-quick-and-easy-function-to-plot-lm-results-in-r/
ggplotRegression <- function (fit) {
  lmdf <- data.frame(fitted_values = fit$fitted.values, actual_values = fit$model[, 1])
  print(names(lmdf))
  ggplot(lmdf, aes(x = actual_values, y = fitted_values)) +
    geom_point() +
    geom_abline(slope = 1, intercept = 0) +
    labs(title = paste("Adj R2 = ", signif(summary(fit)$adj.r.squared, 4),
                       "Intercept =", signif(fit$coef[[1]], 5),
duttashi / separate_categorical_continuous_variables.r
Created August 17, 2018 03:29
Easy way to separate categorical and continuous variables from a data frame in R
# Ensure the data is read as a data frame and that categorical variables are read as factors, not characters.
# A minimal reprex is given below.
# Load the adult dataset from the UCI ML repo.
library(data.table)
dt<- fread("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
header = FALSE, sep = ",", stringsAsFactors = TRUE)
# coerce data table to data frame
dt<- as.data.frame(dt)
head(dt)
duttashi / download_data_from_url.r
Created August 17, 2018 01:47
I was trying to download a UCI ML dataset in R using read.csv() but kept getting the error "Error in file(file, "rt") : cannot open the connection In addition: Warning message: In file(file, "rt") : InternetOpenUrl failed: 'An error occurred in the secure channel support'"
# Apparently the problem lies with https: read.csv() failed on this URL. I tried RCurl's getURL() and got the same error.
# Then I tried fread() from library(data.table), and it worked.
# Below is a minimal reproducible example that downloads data from an https URL.
# load the adult dataset
library(data.table)
dt<- fread("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",header = FALSE, sep=",")
head(dt)
V1 V2 V3 V4 V5 V6