@seahrh
seahrh / spark.sh
Created February 8, 2019 06:46
spark-submit airflow jinja template - optional params
#!/usr/bin/env bash
spark-submit --master yarn \
--deploy-mode cluster \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--class {{ params.class }} {{ params.jar_path }} \
--sink_db {{ params.sink_db }} \
--sink_table {{ params.sink_table }} \
--sink_partition_column_ds {{ ds_nodash }} \
{% if params.sink_partition_column_post_date is defined %}--sink_partition_column_post_date {{ params.sink_partition_column_post_date }} \{% else %}\{% endif %}
{% if params.foo is defined %}--foo {{ params.foo }} \{% else %}\{% endif %}
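For comparison, the same optional-flag pattern can be sketched in plain Python: required arguments are always emitted, optional ones only when the key is present (the counterpart of `is defined` in the template). The function name `build_spark_submit_args` and the sample values are hypothetical, not part of the gist.

```python
def build_spark_submit_args(params, ds_nodash):
    """Build a spark-submit argument list, skipping optional params that are not set."""
    args = [
        "spark-submit", "--master", "yarn",
        "--deploy-mode", "cluster",
        "--conf", "spark.serializer=org.apache.spark.serializer.KryoSerializer",
        "--class", params["class"], params["jar_path"],
        "--sink_db", params["sink_db"],
        "--sink_table", params["sink_table"],
        "--sink_partition_column_ds", ds_nodash,
    ]
    # Optional flags: appended only when the key exists in params
    for key in ("sink_partition_column_post_date", "foo"):
        if key in params:
            args += [f"--{key}", str(params[key])]
    return args


# Hypothetical values, for illustration only
example_params = {
    "class": "com.example.Main",
    "jar_path": "app.jar",
    "sink_db": "db",
    "sink_table": "tbl",
}
print(build_spark_submit_args(example_params, "20190208"))
```

Building a list instead of a single string avoids the trailing-backslash bookkeeping that the template needs for its line continuations.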
@seahrh
seahrh / README.mkd
Created April 25, 2019 02:09 — forked from vrillusions/README.mkd
Generate gpg key via batch file

Introduction

This is how to create a gpg key without any user interaction or password. This can be used in cases where the primary goal is to secure the data in transit but the gpg key can/must be stored locally without a password. An example of this is the hiera-gpg plugin which doesn't support passwords.

The genkey-batch file below uses the defaults, which are currently RSA/RSA with a 2048-bit key length. See the reference link to change them.
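As a rough illustration of what such a batch file can contain, here is a minimal Python sketch that builds one as a string. It assumes the parameter names from GnuPG's unattended key generation format (`Key-Type`, `Subkey-Type`, `Name-Real`, `Name-Email`, `Expire-Date`, `%no-protection`, `%commit`); `%no-protection` needs GnuPG 2.1 or later, so older versions may need a different approach. The function name and the sample identity are hypothetical.

```python
def build_genkey_batch(name, email, comment=None):
    """Return the text of a GnuPG unattended key-generation parameter file."""
    lines = [
        "%no-protection",      # create the key without a passphrase (GnuPG 2.1+)
        "Key-Type: default",   # let gpg pick its default algorithm
        "Subkey-Type: default",
        f"Name-Real: {name}",
    ]
    if comment:
        lines.append(f"Name-Comment: {comment}")
    lines += [
        f"Name-Email: {email}",
        "Expire-Date: 0",      # 0 means the key never expires
        "%commit",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical identity, for illustration only
print(build_genkey_batch("Joe Tester", "joe@example.com"))
```

Saving the output as `genkey-batch` and running `gpg --batch --gen-key genkey-batch` should then generate the key without prompting.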

References

@seahrh
seahrh / QueryReader.scala
Created April 25, 2019 05:53
spark read jdbc
import org.apache.spark.sql._

final case class QueryReader(
    spark: SparkSession,
    query: String,
    url: String,
    driver: String,
    user: String,
    password: String
) {
@seahrh
seahrh / .gitconfig
Created July 19, 2019 08:11 — forked from Kovrinic/.gitconfig
git global url insteadOf setup
# one or the other, NOT both
[url "https://github"]
    insteadOf = git://github
# or
[url "git@github.com:"]
    insteadOf = git://github.com/
@seahrh
seahrh / uniques.py
Created September 11, 2019 13:18
pandas: find top n unique values in each column
def df_column_unique_values(df, top_n=5):
    # DataFrame.iteritems() was removed in pandas 2.0; items() is the equivalent
    for col_name, values in df.items():
        col_value_counts = values.value_counts()
        print(f"{col_name} : {len(col_value_counts)}")
        col_value_count_list = [
            "'" + str(c) + "'" + ":" + str(n) for c, n in sorted(
                col_value_counts.items(),
                key=lambda kv: kv[1],
                reverse=True
            )
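A self-contained sketch of the same idea, assuming the goal is the number of distinct values per column plus the `top_n` most frequent ones. It returns a dict instead of printing, and the name `top_unique_values` is hypothetical. It relies on `value_counts()` sorting by count in descending order by default, so no explicit sort is needed.

```python
import pandas as pd


def top_unique_values(df, top_n=5):
    """Map each column to (number of distinct values, top_n most frequent values with counts)."""
    summary = {}
    for col_name, values in df.items():
        counts = values.value_counts()  # most frequent first; NaN excluded by default
        summary[col_name] = (len(counts), counts.head(top_n).to_dict())
    return summary


# Small made-up frame, for illustration only
example = pd.DataFrame({"a": ["x", "x", "y"], "b": [1, 2, 3]})
print(top_unique_values(example, top_n=1))
```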
@seahrh
seahrh / permutation_importance.py
Created October 16, 2019 05:38
Permutation importance function
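import numpy as np
from sklearn.metrics import roc_auc_score


def permutation_importance(X, y, model):
    perm = {}
    # Baseline AUC on the unshuffled data
    y_pred = model.predict_proba(X)[:, 1]
    baseline = roc_auc_score(y, y_pred)
    for col in X.columns:
        original = X[col].copy()
        # Shuffle one column to break its relationship with the target
        X[col] = np.random.permutation(X[col].values)
        y_pred = model.predict_proba(X)[:, 1]
        # A negative value means the AUC dropped, i.e. the model relied on this feature
        perm[col] = roc_auc_score(y, y_pred) - baseline
        # Restore the column before moving on
        X[col] = original
    return perm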
@seahrh
seahrh / drop_corr_features.py
Created October 16, 2019 05:58
Remove features that are highly correlated with each other, keeping the one that correlates better with the target.
to_drop = list()
# Iterate over rows starting from the second one, because position [0, 0] is self-correlation, which is always 1
for i in range(1, len(corr_matrix)):
    # Iterate over the columns of the row, staying below the diagonal
    for j in range(i):
        # Check whether the correlation between the two features reaches the chosen threshold
        if corr_matrix.iloc[i, j] >= 0.98:
            # Then keep whichever of the two correlates better with the target
            if abs(pd.concat([X[corr_matrix.index[i]], y], axis=1).corr().iloc[0, 1]) > abs(pd.concat([X[corr_matrix.columns[j]], y], axis=1).corr().iloc[0, 1]):
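A self-contained sketch of the same approach, under a few assumptions: `X` is a numeric DataFrame, `y` is a Series with the same index, and for each highly correlated pair the feature with the weaker absolute correlation to `y` is the one dropped. The function name `find_correlated_to_drop` is hypothetical. Like the snippet above, it only flags positive correlations at or above the threshold.

```python
import pandas as pd


def find_correlated_to_drop(X, y, threshold=0.98):
    """Return the features to drop so that no remaining pair reaches the threshold."""
    corr_matrix = X.corr()
    # Absolute correlation of every feature with the target, computed once
    target_corr = X.corrwith(y).abs()
    cols = list(corr_matrix.columns)
    to_drop = set()
    # Walk the lower triangle so each pair is visited once
    for i in range(1, len(cols)):
        for j in range(i):
            if corr_matrix.iloc[i, j] >= threshold:
                first, second = cols[i], cols[j]
                # Drop whichever of the pair is less correlated with the target
                weaker = second if target_corr[first] > target_corr[second] else first
                to_drop.add(weaker)
    return sorted(to_drop)
```

Computing the target correlations once with `corrwith` avoids rebuilding a two-column frame for every pair.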
@seahrh
seahrh / pystack.py
Created October 25, 2019 10:10 — forked from JettJones/pystack.py
performance of various ways to get the callers name in python
import sys
import inspect
import time
import traceback


def deeper(func, depth):
    # Recurse `depth` times before calling func, so func runs on a deeper call stack
    if depth > 0:
        return deeper(func, depth-1)
    else:
        return func()
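Two common ways to get the caller's name, as a minimal sketch. Which approaches the full gist benchmarks is an assumption here, and the function names are hypothetical. `sys._getframe` is a CPython implementation detail, while `inspect.stack()` builds a record for every frame on the stack and is generally the slower of the two.

```python
import inspect
import sys


def caller_name_getframe():
    # Frame 1 is the function that called this one (CPython-specific API)
    return sys._getframe(1).f_code.co_name


def caller_name_inspect():
    # inspect.stack() returns FrameInfo records for the whole stack; index 1 is the caller
    return inspect.stack()[1].function


def example_caller():
    return caller_name_getframe(), caller_name_inspect()


print(example_caller())
```

Wrapping each helper in `timeit.timeit` at different stack depths is one way to compare their cost.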
@seahrh
seahrh / logging.py
Created January 25, 2020 04:39 — forked from kingspp/logging.py
Python Comprehensive Logging using YAML Configuration
import os
import yaml
import logging.config
import logging
import coloredlogs


def setup_logging(default_path='logging.yaml', default_level=logging.INFO, env_key='LOG_CFG'):
    """
    | **@author:** Prathyush SP
    | Logging Setup
@seahrh
seahrh / Dockerfile
Created April 6, 2020 03:58
dataflow python app Dockerfile
FROM python:3.7-slim-buster
WORKDIR /app
COPY app app
COPY *.py ./
COPY *.ini ./
RUN python --version \
&& python -m ensurepip --default-pip \
&& python -m pip install --upgrade pip setuptools wheel \
&& pip install . \
&& pip list