Saswata Dutta (saswata-dutta)

@saswata-dutta
saswata-dutta / all_aws_lambda_modules_python.md
Created December 18, 2020 13:19 — forked from gene1wood/all_aws_lambda_modules_python.md
AWS Lambda function to list all available Python modules for Python 2.7, 3.6, and 3.7

This gist contains lists of the modules available in the Python 2.7, 3.6, and 3.7 runtimes in AWS Lambda.

It also contains the code to run in Lambda to generate these lists. In addition, there is a less_verbose module in the code that you can call to get a list of the top-level modules installed and the versions of those modules (if they contain a version).

# Luke's config for the Zoomer Shell
# Enable colors and change prompt:
autoload -U colors && colors
PS1="%B%{$fg[red]%}[%{$fg[yellow]%}%n%{$fg[green]%}@%{$fg[blue]%}%M %{$fg[magenta]%}%~%{$fg[red]%}]%{$reset_color%}$%b "
# History in cache directory:
HISTSIZE=10000
SAVEHIST=10000
HISTFILE=~/.cache/zsh/history
// read the raw contacts JSON (one record per account, with a nested contacts array)
val df = spark.read.json("concats-head.json")
// explode the contacts array: one row per (acc, contact), keeping just the contact's email string
val df1 = df.withColumn("elements", explode($"contacts")).select($"acc", col("elements.contactEmail").as("emails"))
// an email string may hold several addresses separated by "," or ";"; keep only the domain after "@"
val emailDomains = udf((s: String) => s.split(",").flatMap(_.split(";")).map(_.split("@")(1)))
// one row per (acc, domain)
val df2 = df1.withColumn("domains", emailDomains($"emails")).withColumn("domain", explode($"domains")).select("acc", "domain")
@saswata-dutta
saswata-dutta / seeds.md
Created November 12, 2020 05:32 — forked from non/seeds.md
Simple example of using seeds with ScalaCheck for deterministic property-based testing.

introduction

ScalaCheck 1.14.0 was just released with support for deterministic testing using seeds. Some folks have asked for examples, so I wanted to produce a Gist to help people use this feature.

simple example

These examples will assume the following imports:
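The import block itself is cut off in this preview, so the snippet below is an illustrative stand-in rather than the gist's original code. It shows deterministic generation with an explicit seed:

import org.scalacheck.Gen
import org.scalacheck.rng.Seed

object SeedSketch extends App {
  // any generator will do; the point is that the same seed yields the same value
  val gen: Gen[List[Int]] = Gen.listOf(Gen.choose(0, 100))

  val seed = Seed(12345L)

  // Gen#apply takes generator parameters plus a seed and returns Option[T]
  val first  = gen(Gen.Parameters.default, seed)
  val second = gen(Gen.Parameters.default, seed)

  assert(first == second) // deterministic: same seed, same generated value
}

Running the generator twice with the same Seed produces identical output, which is what makes a failing property reproducible.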

@saswata-dutta
saswata-dutta / clean.scala
Created October 30, 2020 16:05
spark data cleaning
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.DataFrame
val schema = StructType(Array(
  StructField("ID", LongType, false),
  StructField("ASIN", StringType, false),
  StructField("ASIN_STATIC_ITEM_NAME", StringType, false),
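The schema preview stops short here. As a sketch of how such a schema might be put to work (the closing of the array, the cleaning helper, and the input path below are my assumptions, not the gist's actual code):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._

// close off the (assumed) schema and use it to load and clean the data
val fullSchema = StructType(Array(
  StructField("ID", LongType, false),
  StructField("ASIN", StringType, false),
  StructField("ASIN_STATIC_ITEM_NAME", StringType, false)
))

// trim whitespace and collapse blank strings to null
val cleanString: UserDefinedFunction = udf { s: String =>
  Option(s).map(_.trim).filter(_.nonEmpty).orNull
}

def cleanFrame(df: DataFrame): DataFrame =
  df.withColumn("ASIN", cleanString(col("ASIN")))
    .withColumn("ASIN_STATIC_ITEM_NAME", cleanString(col("ASIN_STATIC_ITEM_NAME")))
    .na.drop(Seq("ID"))

// usage, with an illustrative path:
// val cleaned = cleanFrame(spark.read.schema(fullSchema).json("items.json"))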
@saswata-dutta
saswata-dutta / java_threaddump.sh
Created October 29, 2020 09:28
Periodically creates thread dump of a running java process to debug unresponsive/blocked threads
#!/bin/bash
# get process_id using jps -v or ps
# ps -mo pid,lwp,stime,time,cpu -C java
# your/path/to/jstack `ps aux | grep java | grep -v grep | awk '{print $2}'` >> threaddump.log
readonly process_id=${1}
readonly times=${2:-10}
readonly pause=${3:-10}

# take $times thread dumps of $process_id, $pause seconds apart
for ((i = 0; i < times; i++)); do
  jstack "${process_id}" >> "threaddump_${process_id}.log"
  sleep "${pause}"
done
@saswata-dutta
saswata-dutta / postgres_process_id.java
Created October 27, 2020 10:39
log postgres server-side pid for debugging
private void logServerPID(Connection connection) throws SQLException {
  if (connection.isWrapperFor(PGConnection.class)) {
    // WARNING: this PGConnection may leak some memory, but we can't close it here, so use it only for debugging
    PGConnection pgConnection = connection.unwrap(PGConnection.class);
    log.info("Server PID {}", pgConnection.getBackendPID());
  } else {
    log.warn("Connection instance is not of pg instance type");
  }
}
@saswata-dutta
saswata-dutta / Schema2CaseClass.scala
Created October 24, 2020 08:18 — forked from yoyama/Schema2CaseClass.scala
Generate case class from spark DataFrame/Dataset schema.
/**
* Generate Case class from DataFrame.schema
*
* val df:DataFrame = ...
*
* val s2cc = new Schema2CaseClass
* import s2cc.implicit._
*
* println(s2cc.schemaToCaseClass(df.schema, "MyClass"))
*
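The preview cuts off before the implementation. As a rough illustration of the idea (my own sketch, not the forked gist's code), a minimal schema-to-case-class generator could look like:

import org.apache.spark.sql.types._

object SchemaToCaseClassSketch {
  // map a handful of Spark SQL types to Scala type names; anything else falls back to its simpleString
  private def scalaType(dt: DataType): String = dt match {
    case LongType    => "Long"
    case IntegerType => "Int"
    case DoubleType  => "Double"
    case BooleanType => "Boolean"
    case StringType  => "String"
    case other       => other.simpleString
  }

  // render one field per schema column, wrapping nullable columns in Option
  def render(schema: StructType, className: String): String = {
    val fields = schema.fields.map { f =>
      val t = scalaType(f.dataType)
      s"  ${f.name}: ${if (f.nullable) s"Option[$t]" else t}"
    }
    s"case class $className (\n${fields.mkString(",\n")}\n)"
  }
}

// println(SchemaToCaseClassSketch.render(df.schema, "MyClass"))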
@saswata-dutta
saswata-dutta / ExecutorTest.java
Created October 15, 2020 07:09
Java batched executor sample
package org.saswata;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
public class ExecutorTest {
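The class body is truncated in this preview. The general pattern (submit work to a fixed thread pool in batches and block until each batch drains before starting the next) might be sketched in Scala roughly like this; the Runnable tasks and pool sizes are illustrative:

import java.util.concurrent.{Executors, TimeUnit}

object BatchedExecutorSketch {
  // run tasks in fixed-size batches; each batch must finish before the next is submitted
  def runInBatches(tasks: Seq[Runnable], batchSize: Int, threads: Int): Unit = {
    val pool = Executors.newFixedThreadPool(threads)
    try {
      tasks.grouped(batchSize).foreach { batch =>
        val futures = batch.map(t => pool.submit(t))
        futures.foreach(_.get()) // block until every task in the batch has completed
      }
    } finally {
      pool.shutdown()
      pool.awaitTermination(1, TimeUnit.MINUTES)
    }
  }
}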
@saswata-dutta
saswata-dutta / spark_reatain_latest_in_group.scala
Last active October 4, 2020 16:59
drop duplicate rows by id, keeping the one with the latest timestamp
// https://www.datasciencemadesimple.com/distinct-value-of-dataframe-in-pyspark-drop-duplicates/
// https://stackoverflow.com/questions/38687212/spark-dataframe-drop-duplicates-and-keep-first
// to deal with ties within window partitions, a tiebreaker column is added
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val byId = Window.partitionBy("id").orderBy(col("last_updated").desc, col("tiebreak"))
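The preview ends at the window spec. The usual completion of this pattern, reusing the snippet's column names and assuming the input frame is called df (the tiebreaker below is one common choice, not necessarily the gist's), is:

// add a tiebreaker so rows sharing the same last_updated still get a stable ordering
val withTiebreak = df.withColumn("tiebreak", monotonically_increasing_id())

val latestPerId = withTiebreak
  .withColumn("rn", row_number().over(byId)) // rank rows within each id, newest first
  .filter(col("rn") === 1)                   // keep only the latest row per id
  .drop("rn", "tiebreak")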