@oneryalcin
oneryalcin / sparkify_4_null_values_stats.py
Created September 23, 2019 21:13
sparkify_4_null_values_stats
# First, let's check whether we have any NaN values in our dataset
data.select([count(when(isnan(c), c)).alias(c) for c in data.columns]).head().asDict()
>> {'artist': 0,
'auth': 0,
'firstName': 0,
'gender': 0,
'itemInSession': 0,
'lastName': 0,
'length': 0,
'level': 0,
@oneryalcin
oneryalcin / sparkify_3_read_data.py
Last active September 23, 2019 21:07
3 Sparkify Read Data
# Read data into Spark.
# Note: ideally, data should be in a format with schema support, like Parquet,
# which also supports partitioning, something very important when ingesting big data.
# Data may also be placed on a distributed filesystem like HDFS, or in a cloud
# provider's storage bucket like AWS S3 / Google Cloud Storage, for faster reads.
# Here we only read from local disk.
data = spark.read.json('mini_sparkify_event_data.json')
# How many user activity rows do we have?
data.count()
@oneryalcin
oneryalcin / sparkify_2_spark_context.py
Last active September 23, 2019 20:58
2 Sparkify create spark context
# Create a Spark session, or get an existing one if there already is one
spark = SparkSession \
.builder \
.appName("Sparkify The music streaming platform churn detection") \
.getOrCreate()
# Check the current Spark Config
spark.sparkContext.getConf().getAll()
>> [('spark.app.id', 'local-1569248217329'),
@oneryalcin
oneryalcin / sparkify_1_import_libs.py
Created September 23, 2019 20:42
1 Sparkify Import libs
# import libraries
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import count, when, isnan, isnull, desc_nulls_first, desc, \
from_unixtime, col, dayofweek, dayofyear, hour, to_date, month
import pyspark.sql.functions as F
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler, StandardScaler, MinMaxScaler
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier
# sc = SparkContext(appName="Project_workspace")
@oneryalcin
oneryalcin / gridsearchcv_pipeline.py
Last active August 6, 2019 13:19
Sklearn's GridSearchCV with Pipelines
import nltk
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])
import re
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import confusion_matrix
@oneryalcin
oneryalcin / custom_transformer.py
Last active August 6, 2019 12:09
sklearn's FeatureUnion in action. We add a custom-defined StartingVerbExtractor and run it in our pipeline in parallel with the other feature extractors.
import nltk
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])
import re
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.base import BaseEstimator, TransformerMixin
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
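The StartingVerbExtractor itself is cut off in this preview; as an illustration of the same FeatureUnion pattern, here is a minimal custom transformer (a hypothetical TextLengthExtractor, not the gist's class) run in parallel with a bag-of-words extractor:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer

class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """Toy stand-in for StartingVerbExtractor: emits one numeric
    feature (character count) per document."""

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn
        return self

    def transform(self, X):
        # One row per document, one column
        return np.array([[len(text)] for text in X])

features = FeatureUnion([
    ("bow", CountVectorizer()),           # sparse token counts
    ("length", TextLengthExtractor()),    # dense length column
])

docs = ["spark streams events", "pandas frames data"]
out = features.fit_transform(docs)  # bag-of-words columns + 1 length column
```

FeatureUnion stacks the two outputs column-wise, so the custom feature lands to the right of the CountVectorizer vocabulary columns.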
@oneryalcin
oneryalcin / corporate_messaging.py
Last active August 6, 2019 11:27
Example NLTK (Corporate messaging - Udacity example)
import nltk
nltk.download(['punkt', 'wordnet'])
import re
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
@oneryalcin
oneryalcin / add_deploy_user.yml
Created February 22, 2019 14:17
Ansible 2.7 compatible playbook that deploys a new user with sudo rights and passwordless login to a remote server. It also disables root login and password authentication.
######### ansible.cfg FILE ############
[defaults]
inventory = ./dev
######### DEV FILE ############
# Dev file
[servers]
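The playbook body itself is not shown in this preview; a sketch of the kind of tasks such a playbook typically contains (user name, key path, and handler names are illustrative, not from the gist):

```yaml
# add_deploy_user.yml (illustrative sketch)
- hosts: servers
  become: true
  tasks:
    - name: Create the deploy user with passwordless sudo
      user:
        name: deploy
        groups: sudo
        append: true
        shell: /bin/bash

    - name: Install the deploy user's SSH public key
      authorized_key:
        user: deploy
        key: "{{ lookup('file', '~/.ssh/id_rsa.pub') }}"

    - name: Disable root login and password authentication
      lineinfile:
        path: /etc/ssh/sshd_config
        regexp: "{{ item.regexp }}"
        line: "{{ item.line }}"
      loop:
        - { regexp: '^#?PermitRootLogin', line: 'PermitRootLogin no' }
        - { regexp: '^#?PasswordAuthentication', line: 'PasswordAuthentication no' }
      notify: restart sshd

  handlers:
    - name: restart sshd
      service:
        name: sshd
        state: restarted
```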
@oneryalcin
oneryalcin / pandas_diff.py
Last active May 30, 2023 09:55
Asymmetric difference of two pandas Dataframes
def diff_df(df1, df2, how="left"):
    """
    Find the difference of rows between two given dataframes.
    This function is not symmetric, meaning that
        diff(x, y) != diff(y, x)
    however
        diff(x, y, how='left') == diff(y, x, how='right')
    Ref: https://stackoverflow.com/questions/18180763/set-difference-for-pandas/40209800#40209800
    """
@oneryalcin
oneryalcin / description.md
Created April 4, 2017 13:28 — forked from mangecoeur/description.md
Pandas PostgreSQL support for loading to the DB using the fast COPY FROM method

This small subclass of Pandas' SQLAlchemy-based SQL support for reading/storing tables uses the Postgres-specific COPY FROM method to insert large amounts of data into the database. It is much faster than using INSERT. To achieve this, the table is created in the normal way using SQLAlchemy, but no data is inserted. Instead, the data is saved to a temporary CSV file (using Pandas' mature CSV support) and then read back into Postgres using Psycopg2's support for COPY FROM STDIN.
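The forked subclass itself is not shown; the core of the technique can be sketched with an in-memory buffer instead of a temporary file (function names are illustrative, and `conn` is assumed to be an open psycopg2 connection):

```python
import io
import pandas as pd

def df_to_csv_buffer(df):
    """Serialize a DataFrame to an in-memory CSV buffer (no header, no index)."""
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    buf.seek(0)  # rewind so COPY reads from the start
    return buf

def copy_from_df(conn, df, table):
    """Stream the CSV buffer into Postgres via psycopg2's COPY FROM STDIN.

    Assumes `table` already exists with columns matching the frame.
    """
    with conn.cursor() as cur:
        cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT CSV)",
                        df_to_csv_buffer(df))
    conn.commit()
```

Bypassing per-row INSERT statements this way lets Postgres ingest the rows in a single streaming operation, which is where the speedup comes from.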