Skip to content

Instantly share code, notes, and snippets.

@tomron
tomron / spark_knn_approximation.py
Created November 19, 2015 16:47
A naive approximation of k-nn algorithm (k-nearest neighbors) in pyspark. Approximation quality can be controlled by number of repartitions and number of repartition
from __future__ import print_function
import sys
from math import sqrt
import argparse
from collections import defaultdict
from random import randint
from pyspark import SparkContext
@tomron
tomron / spark_aws_lambda.py
Created February 27, 2016 12:57
Example of python code to submit spark process as an emr step to AWS emr cluster in AWS lambda function
import sys
import time
import boto3
def lambda_handler(event, context):
conn = boto3.client("emr")
# chooses the first cluster which is Running or Waiting
# possibly can also choose by name or already have the cluster id
clusters = conn.list_clusters()
@tomron
tomron / parquet_to_json.py
Created November 17, 2016 10:53
Converts parquet file to json using spark
# impor spark, set spark context
from pyspark import SparkContext, SparkConf
from pyspark.sql.context import SQLContext
import sys
import os
if len(sys.argv) == 1:
sys.stderr.write("Must enter input file to convert")
sys.exit()
input_file = sys.argv[1]
@tomron
tomron / MergeMapUDAF.java
Created July 20, 2017 06:49
Merge Map Spark User Defined Aggregation function - merge two maps of type <String, Long> to one Map.
package com.tomron;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
@tomron
tomron / MergeMapUDAF.java
Created July 20, 2017 06:49
Merge Map Spark User Defined Aggregation function - merge two maps of type <String, Long> to one Map.
package com.tomron;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
@tomron
tomron / welchtest.py
Created August 6, 2018 12:11
welchtest
import pandas as pd
import numpy as np
from scipy import stats
input_file='advertisement_clicks.csv'
df = pd.read_csv(input_file)
a = df[df['advertisement_id']== 'A']['action'].tolist()
@tomron
tomron / welchtest.py
Created August 6, 2018 12:13
welchtest.py - based on the lazy programmer ttest implementation (https://github.com/lazyprogrammer/machine_learning_examples/blob/master/ab_testing/ttest.py). Numbers are not exactly the same but I suspect it have to do with rounding issues
import pandas as pd
import numpy as np
from scipy import stats
input_file='advertisement_clicks.csv'
df = pd.read_csv(input_file)
a = df[df['advertisement_id']== 'A']['action'].tolist()
@tomron
tomron / sprt.py
Created April 30, 2019 09:38
Sequential probability ratio test
import numpy as np
"""
Implements Sequential probability ratio test
https://en.wikipedia.org/wiki/Sequential_probability_ratio_test
"""
class SPRT:
def __init__(self, alpha, beta, mu0, mu1):
@tomron
tomron / sprt.py
Last active April 30, 2019 09:58
Sequential probability ratio test implementation (https://en.wikipedia.org/wiki/Sequential_probability_ratio_test) for exponential distribution. Usage - `t = sprt.SPRT(0.05, 0.8, 1, 2); t.test([1, 2, 3, 4, 5])`
import numpy as np
"""
Implements Sequential probability ratio test
https://en.wikipedia.org/wiki/Sequential_probability_ratio_test
"""
class SPRT:
def __init__(self, alpha, beta, mu0, mu1):
import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Multigraph example
G = nx.MultiGraph()
G.add_nodes_from([1, 2, 3])