João Neves jonsnowseven

Spark Tips & Tricks

Misc. Tips & Tricks

If a column's values are integers in [0, 255], Parquet will automatically store them as 1-byte unsigned integers, decreasing the size of the saved DataFrame by up to a factor of 8 relative to 8-byte integers.
Partition DataFrames to have evenly-distributed, ~128MB partition sizes (empirical finding). Always err on the higher side w.r.t. number of partitions.
Pay particular attention to the number of partitions when using flatMap, especially if the following operation will result in high memory usage. The flatMap op usually produces a DataFrame with a much larger number of rows, yet the number of partitions remains the same. Thus, if a subsequent op causes a large expansion of memory usage (e.g., converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the
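
The partition-sizing advice above can be turned into simple arithmetic. The following is a minimal sketch, not part of the original notes: the helper name `suggest_num_partitions`, the per-row size, and the row count are all hypothetical, and the size estimate is assumed to come from your own knowledge of the data. It rounds up, in line with erring toward more partitions.

```python
# Hypothetical helper: pick a partition count so that each partition
# holds roughly 128 MB, rounding up (more partitions rather than fewer).
TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # ~128 MB target from the tips above


def suggest_num_partitions(estimated_total_bytes, target_bytes=TARGET_PARTITION_BYTES):
    """Return the smallest partition count keeping each partition <= target_bytes."""
    if estimated_total_bytes <= 0:
        return 1
    # Integer ceiling division avoids floating-point rounding surprises.
    return (estimated_total_bytes + target_bytes - 1) // target_bytes


# Assumed example: after flatMap there are 10 million rows, and the next op
# turns each row into a vector of 1000 doubles (about 8000 bytes per row).
rows_after_flatmap = 10 * 1000 * 1000
bytes_per_expanded_row = 1000 * 8
num_partitions = suggest_num_partitions(rows_after_flatmap * bytes_per_expanded_row)
print(num_partitions)
```

With a count like this in hand, the output of flatMap could be passed through `repartition(num_partitions)` before the memory-heavy step. The estimate ignores object overhead and skew, so treat it as a lower bound.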

"Plain JSON" encoding for Apache Avro

April 2024, by Clemens Vasters, Microsoft Corp.

Notational Conventions
Interoperability issues of the Avro JSON Encoding with common JSON usage
The "Plain JSON" encoding

The Apache Avro project defines a JSON Encoding, which is optimized for encoding data in JSON, but primarily aimed at exchanging data between implementations of

	// Just before switching jobs:
	// Add one of these.
	// Preferably into the same commit where you do a large merge.
	//
	// This started as a tweet with a joke of "C++ pro-tip: #define private public",
	// and then it quickly escalated into more and more evil suggestions.
	// I've tried to capture interesting suggestions here.
	//
	// Contributors: @r2d2rigo, @joeldevahl, @msinilo, @_Humus_,
	// @YuriyODonnell, @rygorous, @cmuratori, @mike_acton, @grumpygiant,

	def get_profile_credentials(profile_name):
	    from ConfigParser import ConfigParser
	    from ConfigParser import ParsingError
	    from ConfigParser import NoOptionError
	    from ConfigParser import NoSectionError
	    from os import path
	    config = ConfigParser()
	    config.read([path.join(path.expanduser("~"),'.aws/credentials')])
	    try:
	        aws_access_key_id = config.get(profile_name, 'aws_access_key_id')

	import boto3
	import datetime
	import json
	from requests_aws4auth import AWS4Auth
	import requests

	boto3.setup_default_session(region_name='us-east-1')
	identity = boto3.client('cognito-identity', region_name='us-east-1')

	account_id='XXXXXXXXXXXXXXX'

	'Update or create a stack given a name and template + params'
	from __future__ import division, print_function, unicode_literals

	from datetime import datetime
	import logging
	import json
	import sys

	import boto3
	import botocore

	{
	    // Use IntelliSense to learn about possible attributes.
	    // Hover to view descriptions of existing attributes.
	    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
	    "version": "0.2.0",
	    "configurations": [
	        {
	            "name": "Python: Flask",
	            "type": "python",
	            "request": "launch",