Update: please note that I have since switched to using a set of bash scripts instead of polluting the Git repository with git svn.
Author: Kaspars Dambis
kaspars.net / @konstruktors
2016-05-27 21:26:23,209 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Process tree for container: container_1444365544629_1373_01_115320 has processes older than 1 iteration running over the configured limit. Limit=17716740096, current usage = 17761673216
2016-05-27 21:26:23,211 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Container [pid=8812,containerID=container_1444365544629_1373_01_115320] is running beyond physical memory limits. Current usage: 16.5 GB of 16.5 GB physical memory used; 31.4 GB of 82.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1444365544629_1373_01_115320:
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 19475 8930 8930 8812 (python) 122020 9482 648155136 27838 python -m pyspark.daemon
|- 22219 8930 8930 8812 (python) 112689 1
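For context, YARN is killing this container for exceeding its physical memory limit, and the process tree shows that much of the usage comes from pyspark.daemon worker processes rather than the executor JVM itself. A minimal sketch of one common mitigation, assuming Spark 1.x on YARN (the setting name is real, the value below is purely illustrative):

from pyspark import SparkConf, SparkContext

# Reserve extra headroom in the YARN container for memory that lives
# outside the executor JVM heap, such as the pyspark.daemon workers.
conf = (
    SparkConf()
    .setAppName("memory-overhead-sketch")
    # Illustrative value in MB; tune it to your workload.
    .set("spark.yarn.executor.memoryOverhead", "4096")
)
sc = SparkContext(conf=conf)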
# Clone Dan's fork of conda-env
git clone https://github.com/dan-blanchard/conda-env.git
# Install the fork of conda-env
cd conda-env
git checkout feature/pip_requirements.txt
conda uninstall --yes conda-env
python setup.py develop
Keep in mind, our use case is largely time-series analytics, but these were the broad themes of the issues we encountered:
+1 to the ES team, though; they do listen to and fix issues quickly. Moving to doc values as the d
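To make the doc values point concrete, here is a rough sketch only, with a hypothetical index and field names that are not from the original discussion, of how a mapping enabled doc values in the Elasticsearch 1.x era so that aggregations read columnar data from disk instead of building fielddata on the JVM heap:

import json
import requests

# Hypothetical mapping sketch: the index ("analytics") and type ("event")
# are placeholders.
mapping = {
    "properties": {
        "timestamp": {"type": "date", "doc_values": True},
        "visitors": {"type": "long", "doc_values": True},
    }
}

resp = requests.put(
    "http://localhost:9200/analytics/_mapping/event",
    data=json.dumps(mapping),
    headers={"Content-Type": "application/json"},
)
print(resp.json())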
import datetime as dt
import pprint
import pytz

print(pytz.__version__)
# '2015.4'

timezone = pytz.timezone('Europe/London')
tmsp = dt.datetime(2015, 3, 29, 1, tzinfo=pytz.utc)
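The snippet is cut off at this point; a plausible continuation (an assumption, not the original code) shows why this particular instant is interesting: 2015-03-29 01:00 UTC is the exact moment London springs forward from GMT to BST.

# Converting the UTC instant into the London zone lands on the DST
# boundary: 01:00 UTC becomes 02:00 BST (+01:00).
local = tmsp.astimezone(timezone)
pprint.pprint(local)
# datetime.datetime(2015, 3, 29, 2, 0, tzinfo=<DstTzInfo 'Europe/London' BST+1:00:00 DST>)

# The classic pytz pitfall is passing the zone straight to tzinfo=, which
# silently uses the zone's earliest historical offset (LMT, -00:01).
wrong = dt.datetime(2015, 3, 29, 1, tzinfo=timezone)
pprint.pprint(wrong.utcoffset())
# datetime.timedelta(-1, 86340)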
The existing examples for this are good, but they miss a pretty critical observation: the number of partitions and how it affects the result.
Assume we have the following script, aggregate_by_key.py:
import pprint
from pyspark.context import SparkContext
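The rest of aggregate_by_key.py is not shown above, so here is a minimal sketch (data and names are placeholders, not the original script, and it reuses the two imports already shown) that runs the same aggregateByKey call with different partition counts to make the observation visible: seqFunc folds values into a per-partition accumulator that starts from the zero value, and combFunc then merges those per-partition accumulators.

sc = SparkContext(appName="aggregate_by_key_sketch")

data = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("b", 5)]

def seq_func(acc, value):
    # Runs within a single partition, folding each value into that
    # partition's accumulator.
    return acc + value

def comb_func(left, right):
    # Runs when the per-partition accumulators for a key are merged.
    return left + right

for num_partitions in (1, 2, 4):
    rdd = sc.parallelize(data, num_partitions)
    # With a neutral zero value (0) the result is identical for every
    # partition count; a non-neutral zero value is applied once per
    # partition that holds the key, which makes the result depend on
    # how the data happens to be partitioned.
    result = rdd.aggregateByKey(0, seq_func, comb_func).collectAsMap()
    pprint.pprint((num_partitions, result))

sc.stop()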
#!/usr/bin/env bash
# Hitting CTRL-C kills the Django server as well as all tunnels that were created
TUNNEL_PIDS=()

function kill_tunnels() {
    for tunnel_pid in "${TUNNEL_PIDS[@]}"
    do
        kill "$tunnel_pid"
    done
}
install.packages("jsonlite", dependencies = TRUE) | |
install.packages("RCurl", dependencies = TRUE) | |
library("jsonlite") | |
library("RCurl") | |
base_url <- "https://api.parsely.com/v2" | |
apikey <- "computerworld.com" | |
api_secret <- "YOUR SECRET KEY" |
from pyspark import SparkContext

def main():
    sc = SparkContext(appName="Test Compression")
    # RDD has to be key, value pairs
    data = sc.parallelize([
        ("key1", "value1"),
        ("key2", "value2"),
        ("key3", "value3"),