
Denny Lee (dennyglee)

@dennyglee
dennyglee / Spark 1.4,Java7
Last active October 21, 2015 21:53
Spark 1.4 PermGenSize Error (ssimeonov)
/* Spark Shell Executed */
./bin/spark-shell --master spark://servername:7077 --driver-class-path $CLASSPATH
/* Output */
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
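(The preview is truncated before the PermGen error output itself. A common workaround for Spark 1.4 on Java 7, offered here as a suggestion rather than as part of the gist, is to enlarge the driver's PermGen, e.g. by launching spark-shell with --driver-java-options "-XX:MaxPermSize=256m".)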
@dennyglee
dennyglee / define_window_specification.py
Created March 24, 2016 18:55
Introducing Window Functions in Spark SQL: Define window specification
from pyspark.sql.window import Window
# Defines partitioning specification and ordering specification.
windowSpec = \
  Window \
    .partitionBy(...) \
    .orderBy(...)
# Defines a Window Specification with a ROW frame.
windowSpec.rowsBetween(start, end)
# Defines a Window Specification with a RANGE frame.
windowSpec.rangeBetween(start, end)
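A window specification like this is then applied to an aggregate over a DataFrame. A minimal usage sketch (the DataFrame df and its column names are assumptions, not part of the gist):

from pyspark.sql import functions as func
# "df", "category", and "amount" are assumed names for illustration only
avg_over_window = func.avg("amount").over(windowSpec)
df.select("category", "amount", avg_over_window.alias("avg_amount"))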
@dennyglee
dennyglee / accessing_dataframe_with_vector_double_schema.py
Created November 8, 2016 17:20
Accessing DataFrame with [('features', 'vector'), ('label', 'double')] schema
from pyspark.mllib.linalg import Vectors
# Sample dataset
data = sc.parallelize([
    (0.0, [0.0, 1.0, 2.0]),
    (1.0, [1.0, 2.0, 3.0]),
    (3.0, [2.0, 3.0, 4.0]),
    (2.0, [3.0, 4.0, 5.0])
])
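The preview cuts off before the RDD becomes a DataFrame. A minimal sketch of the likely continuation, assumed from the schema in the title rather than taken from the gist:

# Convert to a DataFrame matching [('features', 'vector'), ('label', 'double')]
df = data.map(lambda r: (Vectors.dense(r[1]), r[0])).toDF(["features", "label"])
df.printSchema()
# Individual vector elements can then be read back via, e.g., row.features[0]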
@dennyglee
dennyglee / azure-cosmosdb-spark_to_mongo.scala
Created June 7, 2017 22:09
Spark Connector for Cosmos DB to Mongo container
//
// Spark Connector for Cosmos DB to Mongo container
// This gist provides an example of how to connect the Spark Connector for Cosmos DB to a Mongo container
//
// How to start spark-shell
// spark-shell --master yarn --jars /home/sshuser/jars/0.0.3c_1.12/azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar,/home/sshuser/jars/0.0.3c_1.12/azure-documentdb-1.12.0-SNAPSHOT.jar
//
// Import Necessary Libraries
//
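The preview ends at the imports. For orientation, here is a rough PySpark sketch of the kind of read this connector performs (the gist itself is Scala; endpoint, key, and names below are placeholders, and the connector JARs must be on the classpath as shown in the spark-shell line above):

cosmosConfig = {
    "Endpoint": "https://<account>.documents.azure.com:443/",
    "Masterkey": "<master-key>",
    "Database": "<database>",
    "Collection": "<collection>"
}
df = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**cosmosConfig).load()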
// Spark 2.0 to SQL Server via External Data Source API and SQL JDBC
//
// References:
// - https://docs.databricks.com/spark/latest/data-sources/sql-databases.html
// - https://blogs.msdn.microsoft.com/bigdatasupport/2015/10/22/how-to-allow-spark-to-access-microsoft-sql-server/
// - https://docs.microsoft.com/en-us/sql/connect/jdbc/using-the-jdbc-driver
// Run spark-shell
// - Get the SQL Server JDBC JAR from the above "Using the JDBC driver" link
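The preview stops before the actual read. A minimal sketch of a Spark 2.0 JDBC read against SQL Server (shown in PySpark for consistency with the other examples here; server, database, table, and credentials are placeholders):

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://<server>:1433;database=<database>") \
    .option("dbtable", "dbo.<table>") \
    .option("user", "<user>") \
    .option("password", "<password>") \
    .load()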
@dennyglee
dennyglee / ru_su_splits.md
Last active May 17, 2019 04:48
Request Units, Storage Utilization, Splits....oh my!

The Unofficial Throughput and Capacity Guestimate Guide for Azure Cosmos DB

Introduction

I have had a lot of great questions about how to estimate throughput and storage capacity for Azure Cosmos DB. To get up and running, the key best-practices references are:

@dennyglee
dennyglee / cqlsh-CosmosDB-Cassandra-API-macos.md
Last active January 12, 2018 19:40
Connecting cqlsh to Cosmos DB Cassandra API on macOS

As noted in the Introduction to Apache Cassandra API for Azure Cosmos DB, you can connect to the Cosmos DB Cassandra API using cqlsh. The instructions included in the Quick Start are set up for Windows (not macOS), and there may be a versioning issue: the default cassandra-driver (installed via pip install cassandra-driver) is for 3.3.1 instead of 3.4, which is what the Cosmos DB Cassandra API needs.

Install Cassandra via brew

This will ensure that you have the latest Cassandra driver for CQL 3.4:

brew install cassandra

Cassandra will be installed in the /usr/local/Cellar/cassandra/$version folder
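Once a suitable driver version is in place, the same endpoint can also be reached programmatically. A minimal sketch using the Python cassandra-driver package, offered as an alternative to cqlsh rather than as part of this gist (account name, key, and host are placeholders):

import ssl
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# The Cosmos DB Cassandra API requires TLS 1.2 and listens on port 10350
auth = PlainTextAuthProvider(username="<account>", password="<primary-key>")
cluster = Cluster(["<account>.cassandra.cosmosdb.azure.com"], port=10350,
                  auth_provider=auth,
                  ssl_options={"ssl_version": ssl.PROTOCOL_TLSv1_2})
session = cluster.connect()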

@dennyglee
dennyglee / NGB-Genome-Browser-Docker-E2E-Script.md
Last active January 27, 2018 05:38
The NGB Genome Browser is a web-based NGS data viewer with structural variations (SVs) visualization capabilities. This gist provides end-to-end Docker installation instructions and a demo script.

NGB Genome Browser Docker End-to-End Demo Script

The NGB Genome Browser is a web-based NGS data viewer with structural variations (SVs) visualization capabilities. This gist provides end-to-end Docker installation instructions and a demo script. This is an e2e version that includes downloading the sample VCF and BAM files.
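(As a hint of what the full instructions cover: NGB ships as a Docker image, and a typical single-container launch looks like docker run -p 8080:8080 lifescience/ngb. The image name is my assumption, not something stated in this preview.)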

Note: these instructions are derived from the following sources:

@dennyglee
dennyglee / spark-to-sql-validation-sample.py
Created April 4, 2018 18:54
Validate Spark DataFrame data and schema prior to loading into SQL
'''
Example Schema Validation
Assumes the DataFrame `df` is already populated with schema:
{id : int, day_cd : 8-digit code representing date, category : varchar(24), type : varchar(10), ind : varchar(1), purchase_amt : decimal(18,6) }
Runs various checks to ensure data is valid (e.g. no NULL id and day_cd fields) and schema is valid (e.g. [category] cannot be larger than varchar(24))
'''
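The preview truncates before the checks themselves. A minimal sketch of the kind of validation the docstring describes (column names and thresholds are taken from the docstring above; the exact checks in the full gist may differ):

from pyspark.sql import functions as F

# Data check: no NULL id or day_cd values
assert df.filter(F.col("id").isNull() | F.col("day_cd").isNull()).count() == 0
# Schema check: category must fit in varchar(24)
assert df.filter(F.length(F.col("category")) > 24).count() == 0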