
Denny Lee (dennyglee)

@dennyglee
dennyglee / Spark 1.4,Java7
Last active October 21, 2015 21:53
Spark 1.4 PermGenSize Error (ssimeonov)
/* Spark Shell Executed */
./bin/spark-shell --master spark://servername:7077 --driver-class-path $CLASSPATH
/* Output */
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
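(The preview is truncated before the PermGen error output itself. A common workaround for Spark 1.4 on Java 7, offered here as a suggestion rather than as part of the gist, is to enlarge the driver's PermGen, e.g. by launching spark-shell with --driver-java-options "-XX:MaxPermSize=256m".)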
@dennyglee
dennyglee / define_window_specification.py
Created March 24, 2016 18:55
Introducing Window Functions in Spark SQL: Define window specification
from pyspark.sql.window import Window
# Defines partitioning specification and ordering specification.
windowSpec = \
  Window \
    .partitionBy(...) \
    .orderBy(...)
# Defines a Window Specification with a ROW frame.
windowSpec.rowsBetween(start, end)
# Defines a Window Specification with a RANGE frame.
windowSpec.rangeBetween(start, end)
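A window specification like this is then applied to an aggregate over a DataFrame. A minimal usage sketch (the DataFrame df and its column names are assumptions, not part of the gist):

from pyspark.sql import functions as func
# "df", "category", and "amount" are assumed names for illustration only
avg_over_window = func.avg("amount").over(windowSpec)
df.select("category", "amount", avg_over_window.alias("avg_amount"))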
@dennyglee
dennyglee / accessing_dataframe_with_vector_double_schema.py
Created November 8, 2016 17:20
Accessing DataFrame with [('features', 'vector'), ('label', 'double')] schema
from pyspark.mllib.linalg import Vectors
# Sample dataset
data = sc.parallelize([
    (0.0, [0.0, 1.0, 2.0]),
    (1.0, [1.0, 2.0, 3.0]),
    (3.0, [2.0, 3.0, 4.0]),
    (2.0, [3.0, 4.0, 5.0])
])
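The preview cuts off before the RDD becomes a DataFrame. A minimal sketch of the likely continuation, assumed from the schema in the title rather than taken from the gist:

# Convert to a DataFrame matching [('features', 'vector'), ('label', 'double')]
df = data.map(lambda r: (Vectors.dense(r[1]), r[0])).toDF(["features", "label"])
df.printSchema()
# Individual vector elements can then be read back via, e.g., row.features[0]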
@dennyglee
dennyglee / azure-cosmosdb-spark_to_mongo.scala
Created June 7, 2017 22:09
Spark Connector for Cosmos DB to Mongo container
//
// Spark Connector for Cosmos DB to Mongo container
// This gist provides an example of how to connect the Spark Connector for Cosmos DB to a Mongo container
//
// How to start spark-shell
// spark-shell --master yarn --jars /home/sshuser/jars/0.0.3c_1.12/azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar,/home/sshuser/jars/0.0.3c_1.12/azure-documentdb-1.12.0-SNAPSHOT.jar
//
// Import Necessary Libraries
//
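The preview ends at the imports. For orientation, here is a rough PySpark sketch of the kind of read this connector performs (the gist itself is Scala; endpoint, key, and names below are placeholders, and the connector JARs must be on the classpath as shown in the spark-shell line above):

cosmosConfig = {
    "Endpoint": "https://<account>.documents.azure.com:443/",
    "Masterkey": "<master-key>",
    "Database": "<database>",
    "Collection": "<collection>"
}
df = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**cosmosConfig).load()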
// Spark 2.0 to SQL Server via External Data Source API and SQL JDBC
//
// References:
// - https://docs.databricks.com/spark/latest/data-sources/sql-databases.html
// - https://blogs.msdn.microsoft.com/bigdatasupport/2015/10/22/how-to-allow-spark-to-access-microsoft-sql-server/
// - https://docs.microsoft.com/en-us/sql/connect/jdbc/using-the-jdbc-driver
// Run spark-shell
// - Get the SQL Server JDBC JAR from the above "Using the JDBC driver" link
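The preview stops before the actual read. A minimal sketch of a Spark 2.0 JDBC read against SQL Server (shown in PySpark for consistency with the other examples here; server, database, table, and credentials are placeholders):

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://<server>:1433;database=<database>") \
    .option("dbtable", "dbo.<table>") \
    .option("user", "<user>") \
    .option("password", "<password>") \
    .load()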
@dennyglee
dennyglee / ru_su_splits.md
Last active May 17, 2019 04:48
Request Units, Storage Utilization, Splits....oh my!

The Unofficial Throughput and Capacity Guestimate Guide for Azure Cosmos DB

Introduction

I have had a lot of great questions about how to estimate throughput and storage capacity for Azure Cosmos DB. To get up and running, the key best-practices references are:

@dennyglee
dennyglee / cqlsh-CosmosDB-Cassandra-API-macos.md
Last active January 12, 2018 19:40
Connecting cqlsh to Cosmos DB Cassandra API on macOS

As noted in the Introduction to Apache Cassandra API for Azure Cosmos DB, you can connect to the Cosmos DB Cassandra API using cqlsh. The instructions included in the Quick Start are set up for Windows (not macOS), and there may be a versioning issue: the default cassandra-driver (installed via pip install cassandra-driver) is for 3.3.1 instead of 3.4, which is what the Cosmos DB Cassandra API needs.

Install Cassandra via brew

This will ensure that you have the latest Cassandra driver for CQL 3.4:

brew install cassandra

Cassandra will be installed in the /usr/local/Cellar/cassandra/$version folder
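Once a suitable driver version is in place, the same endpoint can also be reached programmatically. A minimal sketch using the Python cassandra-driver package, offered as an alternative to cqlsh rather than as part of this gist (account name, key, and host are placeholders):

import ssl
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# The Cosmos DB Cassandra API requires TLS 1.2 and listens on port 10350
auth = PlainTextAuthProvider(username="<account>", password="<primary-key>")
cluster = Cluster(["<account>.cassandra.cosmosdb.azure.com"], port=10350,
                  auth_provider=auth,
                  ssl_options={"ssl_version": ssl.PROTOCOL_TLSv1_2})
session = cluster.connect()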

@dennyglee
dennyglee / NGB-Genome-Browser-Docker-E2E-Script.md
Last active January 27, 2018 05:38
The NGB Genome Browser is a web-based NGS data viewer with structural variations (SVs) visualization capabilities. This gist provides end-to-end Docker installation instructions and a demo script.

NGB Genome Browser Docker End-to-End Demo Script

The NGB Genome Browser is a web-based NGS data viewer with structural variations (SVs) visualization capabilities. This gist provides end-to-end Docker installation instructions and a demo script. This is an e2e version that includes downloading the sample VCF and BAM files.
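(As a hint of what the full instructions cover: NGB ships as a Docker image, and a typical single-container launch looks like docker run -p 8080:8080 lifescience/ngb. The image name is my assumption, not something stated in this preview.)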

Note: these instructions are derived from the following sources:

@dennyglee
dennyglee / spark-to-sql-validation-sample.py
Created April 4, 2018 18:54
Validate Spark DataFrame data and schema prior to loading into SQL
'''
Example Schema Validation
Assumes the DataFrame `df` is already populated with schema:
{id : int, day_cd : 8-digit code representing date, category : varchar(24), type : varchar(10), ind : varchar(1), purchase_amt : decimal(18,6) }
Runs various checks to ensure data is valid (e.g. no NULL id and day_cd fields) and schema is valid (e.g. [category] cannot be larger than varchar(24))
'''
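The preview truncates before the checks themselves. A minimal sketch of the kind of validation the docstring describes (column names and thresholds are taken from the docstring above; the exact checks in the full gist may differ):

from pyspark.sql import functions as F

# Data check: no NULL id or day_cd values
assert df.filter(F.col("id").isNull() | F.col("day_cd").isNull()).count() == 0
# Schema check: category must fit in varchar(24)
assert df.filter(F.length(F.col("category")) > 24).count() == 0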