🥳
Building the future of how companies work with elvex!

Mike Sukmanowsky msukmanowsky


Keep in mind that our use case is largely timeseries analytics, but these are the broad themes of the issues we encountered:

  1. Realtime indexing plus querying is tough. It required us to throw beefed-up dedicated hardware at the problem, while historical queries were served from nodes with a different config (the typical hot, warm, cold node configuration).
  2. As always, skewed data sets require special consideration in index and document schema modelling.
  3. JVM heap, aggregation queries and doc mappings all need optimizing, or you'll easily hit OOM on nodes, which can lead to...
  4. Bad failure scenarios where an entire cluster is brought to a halt and no queries can be served. Literally one bad, greedy query can put your node and cluster in a very bad state.
  5. Depending on your document mapping, disk storage requirements can easily bite you, but are made better by https://www.elastic.co/blog/store-compression-in-lucene-and-elasticsearch
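
Per-request limits are one way to soften the "one greedy query" failure in item 4. The following is a minimal sketch, not the setup described above: it only builds a search body that sets the `timeout` and `terminate_after` request options and keeps the terms aggregation small. The `@timestamp` and `url` field names and the helper name `bounded_search_body` are assumptions for illustration. Neither option caps heap use on its own, so this is one guard among several rather than a fix.

```python
import json


def bounded_search_body(field, top_n=10, timeout="10s", max_docs_per_shard=100000):
    """Build a search body that limits how much work a single query may do.

    The field names used here are hypothetical; adapt them to your mapping.
    """
    return {
        "size": 0,                              # aggregations only, no hits
        "timeout": timeout,                     # best-effort time budget
        "terminate_after": max_docs_per_shard,  # stop collecting per shard
        "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
        "aggs": {"top_values": {"terms": {"field": field, "size": top_n}}},
    }


body = bounded_search_body("url")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be passed as the request body to whichever client you use.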

+1 to the ES team though, they do listen to and fix issues quickly. Moving to doc values as the d

@msukmanowsky
msukmanowsky / wordpress-plugin-svn-to-git.md
Created November 26, 2015 15:17 — forked from kasparsd/wordpress-plugin-svn-to-git.md
Using Git with Subversion Mirroring for WordPress Plugin Development
@msukmanowsky
msukmanowsky / install-forked-conda-env.sh
Created March 31, 2016 14:24
Install a forked version of conda-env which falls back to PyPI for requirements and supports -e editable requirements.
# Clone Dan's fork of conda-env
git clone https://github.com/dan-blanchard/conda-env.git
# Install the fork of conda-env
cd conda-env
git checkout feature/pip_requirements.txt
conda uninstall --yes conda-env
python setup.py develop
2016-05-27 21:26:23,209 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Process tree for container: container_1444365544629_1373_01_115320 has processes older than 1 iteration running over the configured limit. Limit=17716740096, current usage = 17761673216
2016-05-27 21:26:23,211 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Container [pid=8812,containerID=container_1444365544629_1373_01_115320] is running beyond physical memory limits. Current usage: 16.5 GB of 16.5 GB physical memory used; 31.4 GB of 82.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1444365544629_1373_01_115320 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 19475 8930 8930 8812 (python) 122020 9482 648155136 27838 python -m pyspark.daemon
|- 22219 8930 8930 8812 (python) 112689 1
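
The byte counts in the first WARN line are easier to read after conversion. This short sketch uses only the numbers printed in the log above: the limit works out to exactly 16.5 GiB and the container was roughly 43 MiB over it when it was killed. The 82.5 GB virtual limit is five times the physical one, which would be consistent with a `yarn.nodemanager.vmem-pmem-ratio` of 5, though the cluster configuration itself is not shown here.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

LIMIT_BYTES = 17716740096  # "Limit=" from the first WARN line
USAGE_BYTES = 17761673216  # "current usage =" from the same line

limit_gib = LIMIT_BYTES / GIB
overage_mib = (USAGE_BYTES - LIMIT_BYTES) / MIB
vmem_to_pmem = 82.5 / 16.5  # virtual vs. physical limit from the second line

print("limit: %.1f GiB" % limit_gib)
print("over by: %.1f MiB" % overage_mib)
print("vmem/pmem ratio: %.1f" % vmem_to_pmem)
```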
@msukmanowsky
msukmanowsky / numpy_unique.py
Last active June 14, 2016 16:39
To run, `pip install numpy psutil` and then `time python <script> <set size>`.
import uuid
import sys
import numpy as np
import psutil
num_users = int(sys.argv[1])
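# Generate num_users random UUID strings and deduplicate them with numpy.
# range() and print() are used so the script also runs on Python 3.
users = np.unique([str(uuid.uuid4()) for _ in range(num_users)])
print(len(users))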
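
For a baseline to compare against, the same deduplication can be done with a plain `set`. This is a sketch under assumptions, not part of the original script: it swaps `np.unique` for a set comprehension and uses the standard library's `tracemalloc` (Python 3) rather than `psutil`, so it reports peak Python-level allocations, not process RSS. The function name `count_unique` is made up for illustration.

```python
import tracemalloc
import uuid


def count_unique(num_users):
    """Return (number of unique UUID strings, peak traced bytes)."""
    tracemalloc.start()
    users = {str(uuid.uuid4()) for _ in range(num_users)}
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return len(users), peak


unique_count, peak_bytes = count_unique(1000)
print(unique_count, peak_bytes)
```

Running both versions with the same set size gives a rough sense of how much of the cost comes from numpy's sort-based `unique` versus simply holding the strings.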
@msukmanowsky
msukmanowsky / 1-driver.py
Last active August 5, 2016 02:21
Testing out some handy dandy pyspark debugging. Note that the output script displays file, line number and code being executed as well as a sample of the result.
import pyspark
from sparkbuddy.monkey import patch_pyspark
patch_pyspark()
def is_even(val):
    if val % 2 == 0:
        return [val]
    return []
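
`is_even` returns a one-element list for even input and an empty list otherwise, which is the shape a flat-map style transformation expects. Assuming that is how the rest of the driver uses it (the remainder of the file is not shown), here is a plain-Python sketch of the flattening step. The `flat_map` helper is hypothetical and stands in for the Spark operation, so no cluster is needed to try it.

```python
from itertools import chain


def is_even(val):
    if val % 2 == 0:
        return [val]
    return []


def flat_map(func, values):
    """Apply func to each value and concatenate the resulting lists."""
    return list(chain.from_iterable(func(v) for v in values))


evens = flat_map(is_even, range(10))
print(evens)  # [0, 2, 4, 6, 8]
```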
{
  "session_initial_url": "http://www.example.com/",
  "surl_fragment": "",
  "display_pixel_depth": 24,
  "sref_scheme": "https",
  "surl_clean": "http://www.example.com/",
  "visitor": true,
  "url_scheme": "http",
  "surl_domain": "example.com",
  "sref_netloc": "www.google.ca",
{
  "action": "pageview",
  "apikey": "mashable.com",
  "display": true,
  "display_avail_height": 735,
  "display_avail_width": 1280,
  "display_pixel_depth": 24,
  "display_total_height": 800,
  "display_total_width": 1280,
  "engaged_time_inc": null,
<script src="http://content.jwplatform.com/libraries/2SJ3KAcS.js"></script>
<div id="embed_container">Loading the player...</div>
<script type="text/javascript">
var player = jwplayer("embed_container");
player.setup({
  file: "http://content.jwplatform.com/videos/xxxxxxxx.mp4",
  image: "http://content.jwplatform.com/thumbs/xxxxxxxx.jpg",
  mediaid: "myvideoid12345",
  width: 640,
  height: 360,
#!/usr/bin/env python
'''Build up a set of URLs using the common crawl index. See
http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/ for more info.
'''
from __future__ import print_function
import gzip
import logging
import os
import random