Mike Sukmanowsky (msukmanowsky)
🥳 Building the future of how companies work with elvex!
2016-05-27 21:26:23,209 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Process tree for container: container_1444365544629_1373_01_115320 has processes older than 1 iteration running over the configured limit. Limit=17716740096, current usage = 17761673216
2016-05-27 21:26:23,211 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Container [pid=8812,containerID=container_1444365544629_1373_01_115320] is running beyond physical memory limits. Current usage: 16.5 GB of 16.5 GB physical memory used; 31.4 GB of 82.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1444365544629_1373_01_115320 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 19475 8930 8930 8812 (python) 122020 9482 648155136 27838 python -m pyspark.daemon
|- 22219 8930 8930 8812 (python) 112689 1
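For context (not part of the original gist): this log shows YARN killing a container because resident memory, including the pyspark.daemon worker processes, exceeded the container's physical limit. A minimal, assumed sketch of the kind of setting typically tuned in that situation, spark.yarn.executor.memoryOverhead, with placeholder values rather than anything recommended by the gist:

from pyspark import SparkConf, SparkContext

# Reserve extra, non-heap room in each YARN container for the Python worker
# processes (pyspark.daemon); 4096 MB is an illustrative value only.
conf = (SparkConf()
        .setAppName("memory-overhead-example")
        .set("spark.yarn.executor.memoryOverhead", "4096"))
sc = SparkContext(conf=conf)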
@msukmanowsky
msukmanowsky / install-forked-conda-env.sh
Created March 31, 2016 14:24
Install a forked version of conda-env which falls back to PyPI for requirements and supports -e editable requirements.
# Clone Dan's fork of conda-env
git clone https://github.com/dan-blanchard/conda-env.git
# Install the fork of conda-env
cd conda-env
git checkout feature/pip_requirements.txt
conda uninstall --yes conda-env
python setup.py develop
@msukmanowsky
msukmanowsky / wordpress-plugin-svn-to-git.md
Created November 26, 2015 15:17 — forked from kasparsd/wordpress-plugin-svn-to-git.md
Using Git with Subversion Mirroring for WordPress Plugin Development

Keep in mind, our use case is largely timeseries analytics, but these are the broad themes of issues we encountered:

  1. Realtime indexing + querying is tough. It required us to throw beefed-up, dedicated hardware at that problem while serving historical queries from nodes with a different config (the typical hot/warm/cold node setup).
  2. As always, skewed data sets require special consideration in index and document schema modelling.
  3. JVM heap, aggregation query and doc mapping optimization are needed or you'll easily hit OOMs on nodes, which can lead to...
  4. Bad failure scenarios where an entire cluster is brought to a halt and no queries can be served. Literally one bad, greedy query can put a node and the whole cluster in a very bad state.
  5. Depending on your document mapping, disk storage requirements can easily bite you, but they're made better by https://www.elastic.co/blog/store-compression-in-lucene-and-elasticsearch (see the sketch after these notes).

+1 to the ES team though, they do listen to and fix issues quickly. Moving to doc values as the d
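Following up on point 5: a minimal sketch, assuming Elasticsearch 2.x+ and the elasticsearch-py client (neither is stated in the notes), of opting an index into the denser stored-field compression described in that blog post. The index name and host are illustrative.

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # hypothetical cluster address

# index.codec=best_compression trades some CPU at index/merge time for
# smaller stored fields on disk; the index name is illustrative.
es.indices.create(
    index="pageviews-2016.05",
    body={"settings": {"index": {"codec": "best_compression"}}},
)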

import datetime as dt
import pprint
import pytz

print(pytz.__version__)
# '2015.4'
timezone = pytz.timezone('Europe/London')
tmsp = dt.datetime(2015, 3, 29, 1, tzinfo=pytz.utc)

# The preview cuts off here. 2015-03-29 01:00 UTC is the instant Europe/London
# switches to BST, so converting the timestamp shows the DST jump (assumed
# completion, not the original gist):
print(tmsp.astimezone(timezone))
# 2015-03-29 02:00:00+01:00

Spark / PySpark aggregateByKey Example

The existing examples for this are good, but they miss a pretty critical observation: the number of partitions and how it affects the results.

Assume we have the following script, aggregate_by_key.py:

import pprint
from pyspark.context import SparkContext
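The preview cuts off after the imports. Here is a minimal sketch of the partitioning point, not the original aggregate_by_key.py, assuming a simple sum with a deliberately non-identity zero value so that the partition count visibly changes the output:

import pprint
from pyspark.context import SparkContext


def seq_op(acc, value):
    # Folds one value into a partition-local accumulator.
    return acc + value


def comb_op(acc1, acc2):
    # Merges accumulators from different partitions.
    return acc1 + acc2


sc = SparkContext(appName="aggregate_by_key")
pairs = [("a", 1), ("a", 2), ("a", 3), ("b", 10)]

# The zero value (100 here) is applied once per partition that holds data for
# a key, so the same input yields different sums as the partition count grows.
for num_partitions in (1, 2, 4):
    rdd = sc.parallelize(pairs, num_partitions)
    result = rdd.aggregateByKey(100, seq_op, comb_op).collectAsMap()
    pprint.pprint((num_partitions, result))

sc.stop()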

Basics

Sort the output of a command

By 3rd column (1-indexed) in reverse order

sort -k3 -r
# note: -k3 keys on everything from field 3 to the end of the line;
# use -k3,3 to sort on just that column, and add -n for numeric values
#!/usr/bin/env bash
# Hitting CTRL-C kills the Django server as well as all tunnels that were created
TUNNEL_PIDS=()

function kill_tunnels() {
    for tunnel_pid in "${TUNNEL_PIDS[@]}"; do
        kill "$tunnel_pid"
    done
}

# (Assumed wiring, not shown in this preview: the script presumably registers
# the cleanup so that exiting or hitting CTRL-C tears the tunnels down.)
trap kill_tunnels EXIT
install.packages("jsonlite", dependencies = TRUE)
install.packages("RCurl", dependencies = TRUE)
library("jsonlite")
library("RCurl")
base_url <- "https://api.parsely.com/v2"
apikey <- "computerworld.com"
api_secret <- "YOUR SECRET KEY"
@msukmanowsky
msukmanowsky / spark_gzip.py
Created November 14, 2014 01:32
Example of how to save Spark RDDs to disk using GZip compression in response to https://twitter.com/rjurney/status/533061960128929793.
from pyspark import SparkContext


def main():
    sc = SparkContext(appName="Test Compression")
    # RDD has to be key, value pairs
    data = sc.parallelize([
        ("key1", "value1"),
        ("key2", "value2"),
        ("key3", "value3"),
    ])
    # The preview cuts off here; one assumed way to finish (the path is
    # illustrative) is to write the pairs as a gzip-compressed SequenceFile:
    data.saveAsSequenceFile(
        "hdfs:///tmp/compressed_pairs",
        compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")


if __name__ == "__main__":
    main()