Viacheslav Rodionov (bepcyc) · Qualcomm · Germany
# list the hostnames of all YARN nodes (strip the port from the Node-Http-Address column)
yarn node -list -all 2>/dev/null | cut -f3 | grep -v "Total Nodes" | grep -P ":\d{2,}$" | cut -d':' -f1
# distribute a local file to every node: publish it to HDFS once, then pull it on each host
# (${tmp_dir} and ${dest} are placeholders for a scratch directory name and the file to ship)
hadoop fs -mkdir /tmp/${tmp_dir}
hadoop fs -put ${dest} /tmp/${tmp_dir}/
pdsh hadoop fs -get /tmp/${tmp_dir}/${dest}
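Pieced together, a minimal end-to-end sketch, assuming passwordless SSH to the nodes; the hosts, tmp_dir, and dest variables below are hypothetical (the original pdsh call presumably takes its host list from the WCOLL environment variable instead of -w):
hosts=$(yarn node -list -all 2>/dev/null | cut -f3 | grep -v "Total Nodes" | grep -P ":\d{2,}$" | cut -d':' -f1 | paste -sd, -)
tmp_dir="distribute.$$"   # hypothetical scratch directory name
dest="myfile.bin"         # hypothetical file to distribute
hadoop fs -mkdir "/tmp/${tmp_dir}"
hadoop fs -put "${dest}" "/tmp/${tmp_dir}/"
pdsh -w "${hosts}" "hadoop fs -get /tmp/${tmp_dir}/${dest}"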
# add these lines to .bashrc or another startup script
export SEARCH_MR_JOB_JAR="/opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-job.jar"
alias dfsFind="hadoop jar ${SEARCH_MR_JOB_JAR} org.apache.solr.hadoop.HdfsFindTool"
#alias MapReduceIndexerTool="hadoop jar ${SEARCH_MR_JOB_JAR} org.apache.solr.hadoop.MapReduceIndexerTool"
# use it like regular find:
# dfsFind / -name "*.snappy" | grep flume
bepcyc / clean_sbt_mvn.sh
Last active March 15, 2016 17:12
Clean all mvn and sbt projects
# maven
find . -name pom.xml -type f | xargs -L1 sh -c 'dirname "$0"' | xargs -L1 sh -c 'cd "$0" && mvn clean'
# sbt
find . -name build.sbt -type f | xargs -L1 sh -c 'dirname "$0"' | xargs -L1 sh -c 'cd "$0" && sbt clean'
# Hint: put 'git pull' as the last command and all your repos get updated too; see the sketch below
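A minimal sketch of that hint for the maven case, assuming each project directory is also a git repository:
find . -name pom.xml -type f | xargs -L1 sh -c 'dirname "$0"' | xargs -L1 sh -c 'cd "$0" && mvn clean && git pull'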
# insert this somewhere in a function that works with sc directly
sc.stop()  # stop the running context before re-creating it with new settings
from pyspark import SparkContext
SparkContext.setSystemProperty('spark.executor.memory', '6g') # not sure which one works, so set both
SparkContext.setSystemProperty('spark.python.worker.memory', '6g') # not sure which one works, so set both
SparkContext.setSystemProperty('spark.shuffle.spill', 'false')
SparkContext.setSystemProperty('spark.driver.memory', '2g')
SparkContext.setSystemProperty('spark.io.compression.codec', 'snappy') # just to be sure
sc = SparkContext("local[8]", "Simple App") # set to your number of cores
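The same settings can also be passed at submit time instead of in code; a minimal sketch, assuming the script is saved as app.py (hypothetical name):
spark-submit \
  --master "local[8]" \
  --driver-memory 2g \
  --conf spark.executor.memory=6g \
  --conf spark.python.worker.memory=6g \
  --conf spark.shuffle.spill=false \
  --conf spark.io.compression.codec=snappy \
  app.py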
# set 8 cores and 6GB of RAM
Vagrant.configure(2) do |config|
  config.vm.define "myvm" do |master|
    master.vm.provider :virtualbox do |v|
      v.customize ["modifyvm", :id, "--ioapic", "on"] # important: without IO APIC, VirtualBox ignores multiple cores
      v.customize ["modifyvm", :id, "--cpus", 8]
      v.customize ["modifyvm", :id, "--memory", 6144]
    end
  end
end
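To verify the settings took effect, the standard Vagrant commands suffice (ssh -c runs a single command inside the VM):
vagrant up
vagrant ssh -c 'nproc && free -m'   # expect 8 cores and ~6GB of RAM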
bepcyc / Vagrantfile
Created June 28, 2015 18:42
edX CS 100.1X. Vagrant on steroids.
# -*- mode: ruby -*-
# vi: set ft=ruby :
ipythonPort = 8001 # IPython port to forward (also set in the IPython notebook config)
Vagrant.configure(2) do |config|
  config.ssh.insert_key = true
  config.vm.define "sparkvm" do |master|
    master.vm.box = "sparkmooc/base"
    master.vm.box_download_insecure = true
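    # assumed continuation (the excerpt is cut off here): the port declared above
    # is presumably forwarded with something like
    #   master.vm.network "forwarded_port", guest: ipythonPort, host: ipythonPort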
package com.avira.ds.sparser.spark
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{SparkContext, SparkConf}
import scala.language.implicitConversions
// a sealed ADT of input events; the MultipleTextOutputFormat import above suggests
// each event type gets routed to its own output path
sealed trait Event
case class ClickEvent(blaBla: String) extends Event
case class ViewEvent(blaBla: String) extends Event
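// a minimal sketch of the pattern those imports point at (assumed, not from the gist):
// an output format that drops the key and writes each key's records under its own directory
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
}
// hypothetical usage: events.map(e => (e.getClass.getSimpleName, e.toString))
//   .saveAsHadoopFile("out", classOf[String], classOf[String], classOf[KeyBasedOutput])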
bepcyc / extract_table.py
Created December 27, 2015 13:52
A Python script to OCR a table from a PDF. Source: http://craiget.com/blog/extracting-table-data-from-pdfs-with-ocr/
from PIL import Image, ImageOps  # Pillow-style import (the original used the legacy "import Image")
import subprocess, sys, os, glob
# minimum run of adjacent pixels to call something a line
H_THRESH = 300
V_THRESH = 300
def get_hlines(pix, w, h):
    """Get start/end pixels of lines containing horizontal runs of at least H_THRESH black pixels."""
    hlines = []
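The script operates on raster images, so a PDF page has to be rasterized first; a minimal sketch assuming poppler's pdftoppm is installed (the 300 dpi value is an assumption, not from the gist):
pdftoppm -png -r 300 input.pdf page   # writes page-1.png, page-2.png, ...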
bepcyc / opus_encoding_4quality.sh
Last active March 28, 2016 13:34
Encode FLAC (also WAV or AIFF) audio into the Opus format while preserving as much quality as possible.
#!/usr/bin/env bash
# USAGE: opus_encoding_4quality.sh input_file.flac
INPUT_FILE="${1}"
# 512 kbps VBR, maximum encoder complexity (10), 60 ms frames (better compression efficiency at the cost of latency)
opusenc --bitrate 512 --vbr --comp 10 --framesize 60 "${INPUT_FILE}" "${INPUT_FILE%.*}.opus"
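To convert a whole directory, a plain shell loop over the script works; assuming the script is on the PATH:
for f in *.flac; do opus_encoding_4quality.sh "$f"; done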
bepcyc / get_bitrate.sh
Last active June 3, 2016 15:08
Determine an audio file's overall bit rate in kbps.
# sudo apt-get install mediainfo
# input: a file name or a glob pattern (the first matching file is used)
# output: a single number, the file's overall bit rate in kbps; zero if no file is found
get_bitrate() {
    F="${1}"
    FOUND=$(compgen -G "${F}" | head -n1)
    if [ -z "${FOUND}" ]
    then
        echo 0
    else
        # assumed completion (the excerpt is cut off here): mediainfo reports the
        # overall bit rate in bits per second; divide by 1000 to get kbps
        mediainfo --Inform="General;%OverallBitRate%" "${FOUND}" | awk '{printf "%d\n", $1/1000}'
    fi
}
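Usage, assuming the function has been sourced into the current shell:
get_bitrate track.flac
get_bitrate "music/*.mp3"   # glob: the first matching file is measured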