Update: please note that I have since switched to using a set of bash scripts instead of polluting the Git repository with git svn.
Author: Kaspars Dambis
kaspars.net / @konstruktors
2016-05-27 21:26:23,209 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Process tree for container: container_1444365544629_1373_01_115320 has processes older than 1 iteration running over the configured limit. Limit=17716740096, current usage = 17761673216
2016-05-27 21:26:23,211 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Container [pid=8812,containerID=container_1444365544629_1373_01_115320] is running beyond physical memory limits. Current usage: 16.5 GB of 16.5 GB physical memory used; 31.4 GB of 82.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1444365544629_1373_01_115320 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 19475 8930 8930 8812 (python) 122020 9482 648155136 27838 python -m pyspark.daemon
|- 22219 8930 8930 8812 (python) 112689 1
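The container above is killed because its whole process tree, the executor JVM plus the pyspark.daemon worker processes, has grown past the physical memory YARN granted it. One common knob for this class of failure (my illustration, not something the log prescribes) is to leave more off-heap headroom for the Python workers via the executor memory overhead setting:

    from pyspark import SparkConf, SparkContext

    # Illustrative values only; spark.yarn.executor.memoryOverhead is the
    # Spark 1.x name (later releases call it spark.executor.memoryOverhead),
    # and the overhead is specified in megabytes.
    conf = (SparkConf()
            .set("spark.executor.memory", "14g")
            .set("spark.yarn.executor.memoryOverhead", "2560"))
    sc = SparkContext(appName="example-job", conf=conf)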
# Clone Dan's fork of conda-env
git clone https://github.com/dan-blanchard/conda-env.git
# Install the fork of conda-env
cd conda-env
git checkout feature/pip_requirements.txt
conda uninstall --yes conda-env
python setup.py develop
Keep in mind that our use case is largely timeseries analytics, but we ran into a few broad themes of issues.
+1 to the ES team, though: they do listen and fix issues quickly. Moving to doc values as the default is a big step in the right direction.
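Doc values are configured per field in the index mapping. As a minimal sketch only (the index, type, and field names are made up, and this uses the 2.x-era mapping syntax with the official Python client), enabling them for a timestamp field might look like:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    # Hypothetical index/type/field names, purely for illustration.
    es.indices.create(index="metrics", body={
        "mappings": {
            "event": {
                "properties": {
                    "timestamp": {"type": "date", "doc_values": True}
                }
            }
        }
    })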
import datetime as dt
import pprint
import pytz

print(pytz.__version__)
# '2015.4'

timezone = pytz.timezone('Europe/London')
tmsp = dt.datetime(2015, 3, 29, 1, tzinfo=pytz.utc)
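The snippet above is cut off. As a hedged continuation of my own rather than the original post's code, converting that UTC timestamp into Europe/London shows it falls exactly on the moment of the 2015 spring clock change:

    # Convert the UTC timestamp into local London time.
    local = tmsp.astimezone(timezone)
    pprint.pprint(local)
    # Roughly: datetime.datetime(2015, 3, 29, 2, 0, tzinfo=<DstTzInfo 'Europe/London' BST+1:00:00 DST>)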
The existing examples for this are good, but they miss a pretty critical observation: the number of partitions, and how it affects the result.
Assume we have the following script, aggregate_by_key.py:
import pprint
from pyspark.context import SparkContext
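The rest of aggregate_by_key.py is not reproduced above, so the following continuation is only a sketch of the kind of script the point about partitions needs; the sample data, the seq_op/comb_op helpers, and the partition count are my own illustrative choices, not the original code.

    def seq_op(acc, value):
        # Runs inside a single partition: fold one value into that partition's accumulator.
        return acc + value

    def comb_op(left, right):
        # Runs across partitions: merge the per-partition accumulators.
        return left + right

    if __name__ == "__main__":
        sc = SparkContext(appName="aggregate_by_key")
        pairs = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("b", 5)]
        # The second argument to parallelize() fixes the number of partitions,
        # which decides how many per-partition accumulators comb_op must merge.
        rdd = sc.parallelize(pairs, 2)
        pprint.pprint(rdd.aggregateByKey(0, seq_op, comb_op).collectAsMap())
        sc.stop()

With a neutral zero value and a plain sum the answer is the same however the data is partitioned, but because the zero value is applied once per key per partition, a non-neutral or mutable zero value makes the result depend on the partition count, which is presumably the effect the original script was built to expose.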
#!/usr/bin/env bash
# Hitting CTRL-C kills the Django server as well as all tunnels that were created

TUNNEL_PIDS=()

function kill_tunnels() {
    for tunnel_pid in "${TUNNEL_PIDS[@]}"
    do
        kill "$tunnel_pid"
    done
}
| install.packages("jsonlite", dependencies = TRUE) | |
| install.packages("RCurl", dependencies = TRUE) | |
| library("jsonlite") | |
| library("RCurl") | |
| base_url <- "https://api.parsely.com/v2" | |
| apikey <- "computerworld.com" | |
| api_secret <- "YOUR SECRET KEY" |
from pyspark import SparkContext

def main():
    sc = SparkContext(appName="Test Compression")
    # RDD has to be key, value pairs
    data = sc.parallelize([
        ("key1", "value1"),
        ("key2", "value2"),
        ("key3", "value3"),