Saswata Dutta (saswata-dutta)

@saswata-dutta
saswata-dutta / all_aws_lambda_modules_python.md
Created December 18, 2020 13:19 — forked from gene1wood/all_aws_lambda_modules_python.md
AWS Lambda function to list all available Python modules for Python 2.7, 3.6, and 3.7

This gist contains lists of the modules available in the Python 2.7, 3.6, and 3.7 runtimes in AWS Lambda.

It also contains the code to run in Lambda to generate these lists. In addition, there is a less_verbose module in the code that you can call to get a list of the top-level modules installed and the versions of those modules (if they contain a version).

# Luke's config for the Zoomer Shell
# Enable colors and change prompt:
autoload -U colors && colors
PS1="%B%{$fg[red]%}[%{$fg[yellow]%}%n%{$fg[green]%}@%{$fg[blue]%}%M %{$fg[magenta]%}%~%{$fg[red]%}]%{$reset_color%}$%b "
# History in cache directory:
HISTSIZE=10000
SAVEHIST=10000
HISTFILE=~/.cache/zsh/history
// read the raw contacts JSON (one record per account, with a nested contacts array)
val df = spark.read.json("concats-head.json")
// explode the contacts array: one row per (acc, contact), keeping just the contact's email string
val df1 = df.withColumn("elements", explode($"contacts")).select($"acc", col("elements.contactEmail").as("emails"))
// an email string may hold several addresses separated by "," or ";"; keep only the domain after "@"
val emailDomains = udf((s: String) => s.split(",").flatMap(_.split(";")).map(_.split("@")(1)))
// one row per (acc, domain)
val df2 = df1.withColumn("domains", emailDomains($"emails")).withColumn("domain", explode($"domains")).select("acc", "domain")
@saswata-dutta
saswata-dutta / seeds.md
Created November 12, 2020 05:32 — forked from non/seeds.md
Simple example of using seeds with ScalaCheck for deterministic property-based testing.

introduction

ScalaCheck 1.14.0 was just released with support for deterministic testing using seeds. Some folks have asked for examples, so I wanted to produce a Gist to help people use this feature.

simple example

These examples will assume the following imports:
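The import block itself is cut off in this preview, so the snippet below is an illustrative stand-in rather than the gist's original code. It shows deterministic generation with an explicit seed:

import org.scalacheck.Gen
import org.scalacheck.rng.Seed

object SeedSketch extends App {
  // any generator will do; the point is that the same seed yields the same value
  val gen: Gen[List[Int]] = Gen.listOf(Gen.choose(0, 100))

  val seed = Seed(12345L)

  // Gen#apply takes generator parameters plus a seed and returns Option[T]
  val first  = gen(Gen.Parameters.default, seed)
  val second = gen(Gen.Parameters.default, seed)

  assert(first == second) // deterministic: same seed, same generated value
}

Running the generator twice with the same Seed produces identical output, which is what makes a failing property reproducible.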

@saswata-dutta
saswata-dutta / clean.scala
Created October 30, 2020 16:05
spark data cleaning
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.DataFrame
val schema = StructType(Array(
  StructField("ID", LongType, false),
  StructField("ASIN", StringType, false),
  StructField("ASIN_STATIC_ITEM_NAME", StringType, false),
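The schema preview stops short here. As a sketch of how such a schema might be put to work (the closing of the array, the cleaning helper, and the input path below are my assumptions, not the gist's actual code):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._

// close off the (assumed) schema and use it to load and clean the data
val fullSchema = StructType(Array(
  StructField("ID", LongType, false),
  StructField("ASIN", StringType, false),
  StructField("ASIN_STATIC_ITEM_NAME", StringType, false)
))

// trim whitespace and collapse blank strings to null
val cleanString: UserDefinedFunction = udf { s: String =>
  Option(s).map(_.trim).filter(_.nonEmpty).orNull
}

def cleanFrame(df: DataFrame): DataFrame =
  df.withColumn("ASIN", cleanString(col("ASIN")))
    .withColumn("ASIN_STATIC_ITEM_NAME", cleanString(col("ASIN_STATIC_ITEM_NAME")))
    .na.drop(Seq("ID"))

// usage, with an illustrative path:
// val cleaned = cleanFrame(spark.read.schema(fullSchema).json("items.json"))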
@saswata-dutta
saswata-dutta / java_threaddump.sh
Created October 29, 2020 09:28
Periodically creates thread dump of a running java process to debug unresponsive/blocked threads
#!/bin/bash
# get process_id using jps -v or ps
# ps -mo pid,lwp,stime,time,cpu -C java
# your/path/to/jstack `ps aux | grep java | grep -v grep | awk '{print $2}'` >> threaddump.log
readonly process_id=${1}
readonly times=${2:-10}
readonly pause=${3:-10}

# take $times thread dumps of $process_id, $pause seconds apart
for ((i = 0; i < times; i++)); do
  jstack "${process_id}" >> "threaddump_${process_id}.log"
  sleep "${pause}"
done
@saswata-dutta
saswata-dutta / postgres_process_id.java
Created October 27, 2020 10:39
log postgres server-side pid for debugging
private void logServerPID(Connection connection) throws SQLException {
  if (connection.isWrapperFor(PGConnection.class)) {
    // WARNING: this PGConnection may leak some memory, but we can't close it here, so use it only for debugging
    PGConnection pgConnection = connection.unwrap(PGConnection.class);
    log.info("Server PID {}", pgConnection.getBackendPID());
  } else {
    log.warn("Connection instance is not of pg instance type");
  }
}
@saswata-dutta
saswata-dutta / Schema2CaseClass.scala
Created October 24, 2020 08:18 — forked from yoyama/Schema2CaseClass.scala
Generate case class from spark DataFrame/Dataset schema.
/**
* Generate Case class from DataFrame.schema
*
* val df:DataFrame = ...
*
* val s2cc = new Schema2CaseClass
* import s2cc.implicit._
*
* println(s2cc.schemaToCaseClass(df.schema, "MyClass"))
*
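The preview cuts off before the implementation. As a rough illustration of the idea (my own sketch, not the forked gist's code), a minimal schema-to-case-class generator could look like:

import org.apache.spark.sql.types._

object SchemaToCaseClassSketch {
  // map a handful of Spark SQL types to Scala type names; anything else falls back to its simpleString
  private def scalaType(dt: DataType): String = dt match {
    case LongType    => "Long"
    case IntegerType => "Int"
    case DoubleType  => "Double"
    case BooleanType => "Boolean"
    case StringType  => "String"
    case other       => other.simpleString
  }

  // render one field per schema column, wrapping nullable columns in Option
  def render(schema: StructType, className: String): String = {
    val fields = schema.fields.map { f =>
      val t = scalaType(f.dataType)
      s"  ${f.name}: ${if (f.nullable) s"Option[$t]" else t}"
    }
    s"case class $className (\n${fields.mkString(",\n")}\n)"
  }
}

// println(SchemaToCaseClassSketch.render(df.schema, "MyClass"))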
@saswata-dutta
saswata-dutta / ExecutorTest.java
Created October 15, 2020 07:09
Java batched executor sample
package org.saswata;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
public class ExecutorTest {
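The class body is truncated in this preview. The general pattern (submit work to a fixed thread pool in batches and block until each batch drains before starting the next) might be sketched in Scala roughly like this; the Runnable tasks and pool sizes are illustrative:

import java.util.concurrent.{Executors, TimeUnit}

object BatchedExecutorSketch {
  // run tasks in fixed-size batches; each batch must finish before the next is submitted
  def runInBatches(tasks: Seq[Runnable], batchSize: Int, threads: Int): Unit = {
    val pool = Executors.newFixedThreadPool(threads)
    try {
      tasks.grouped(batchSize).foreach { batch =>
        val futures = batch.map(t => pool.submit(t))
        futures.foreach(_.get()) // block until every task in the batch has completed
      }
    } finally {
      pool.shutdown()
      pool.awaitTermination(1, TimeUnit.MINUTES)
    }
  }
}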
@saswata-dutta
saswata-dutta / spark_reatain_latest_in_group.scala
Last active October 4, 2020 16:59
drop duplicate rows by id, keeping the one with the latest timestamp
// https://www.datasciencemadesimple.com/distinct-value-of-dataframe-in-pyspark-drop-duplicates/
// https://stackoverflow.com/questions/38687212/spark-dataframe-drop-duplicates-and-keep-first
// to deal with ties within window partitions, a tiebreaker column is added
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val byId = Window.partitionBy("id").orderBy(col("last_updated").desc, col("tiebreak"))
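The preview ends at the window spec. The usual completion of this pattern, reusing the snippet's column names and assuming the input frame is called df (the tiebreaker below is one common choice, not necessarily the gist's), is:

// add a tiebreaker so rows sharing the same last_updated still get a stable ordering
val withTiebreak = df.withColumn("tiebreak", monotonically_increasing_id())

val latestPerId = withTiebreak
  .withColumn("rn", row_number().over(byId)) // rank rows within each id, newest first
  .filter(col("rn") === 1)                   // keep only the latest row per id
  .drop("rn", "tiebreak")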