praveen-symphony’s gists

praveen-symphony / virtualenvwrapper.md

Created February 17, 2017 09:45 — forked from guyhughes/virtualenvwrapper.md

virtualenvwrapper quickstart

virtualenvwrapper has broken dependencies right now on macOS, so install like this:

	sudo pip install pbr
	sudo pip install --no-deps stevedore
	sudo pip install --no-deps virtualenvwrapper

To keep the mkvirtualenv and workon commands available in every new shell session, add the following to your shell startup file (if you use zsh or another shell, change the filename accordingly):

praveen-symphony / emr_spark_thrift_on_yarn

Created February 17, 2017 04:55 — forked from elliottcordo/emr_spark_thrift_on_yarn

EMR Spark Thrift server

	#on cluster
	thrift /spark/sbin/start-thriftserver.sh --master yarn-client
	#ssh tunnel, direct 10000 to unused 8157
	ssh -i ~/caserta-1.pem -N -L 8157:ec2-54-221-27-21.compute-1.amazonaws.com:10000 [email protected]
	#see this for JDBC config on client http://blogs.aws.amazon.com/bigdata/post/TxT7CJ0E7CRX88/Using-Amazon-EMR-with-SQL-Workbench-and-other-BI-Tools

praveen-symphony / hello_analytics_api_v3_10krows_nosampling_ryanpraski_single_csv.py

Created October 28, 2016 09:25 — forked from ryanpraski/hello_analytics_api_v3_10krows_nosampling_ryanpraski_single_csv.py

Export more than 10,000 rows and work around the sampling limitations of Google Analytics using Python and the Google Analytics API. Includes functionality to pull data from multiple Google Analytics profiles. This version writes the data for all profiles into a single CSV file.

	#!/usr/bin/python
	# -*- coding: utf-8 -*-
	#
	# Copyright 2012 Google Inc. All Rights Reserved.
	#
	# Licensed under the Apache License, Version 2.0 (the "License");
	# you may not use this file except in compliance with the License.
	# You may obtain a copy of the License at
	#
	# http://www.apache.org/licenses/LICENSE-2.0
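The preview above stops before the script body, so here is a rough sketch of one common way to get past both limits. It is not necessarily what the gist does. It assumes the Core Reporting API v3 cap of 10,000 rows per request (paged with a 1-based start index) and queries one day at a time, which tends to keep each request small enough to avoid sampling. `fetch_page` is a hypothetical callable standing in for the real API request.

```python
from datetime import date, timedelta

PAGE_SIZE = 10000  # per-request row cap in Core Reporting API v3


def daily_ranges(start, end):
    """Yield every day in [start, end] as a one-day (start, end) range."""
    day = start
    while day <= end:
        yield day, day
        day += timedelta(days=1)


def fetch_all_rows(fetch_page, start, end, page_size=PAGE_SIZE):
    """Collect all rows for a date range, one day and one page at a time.

    fetch_page(start_date, end_date, start_index, max_results) is a
    hypothetical helper that returns a list of rows for a single request.
    """
    rows = []
    for day_start, day_end in daily_ranges(start, end):
        start_index = 1  # the API's start index is 1-based
        while True:
            page = fetch_page(day_start, day_end, start_index, page_size)
            rows.extend(page)
            if len(page) < page_size:
                break  # a short page means this day is exhausted
            start_index += page_size
    return rows
```

To cover several profiles, the same loop could be run once per profile ID and the rows appended to a single CSV file.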

praveen-symphony / move_github_repos.sh

Created October 27, 2016 02:01 — forked from subelsky/move_github_repos.sh

How to archive old GitHub private projects on Dropbox

	# http://stackoverflow.com/questions/1960799/using-gitdropbox-together-effectively/1961515#1961515

	export REPONAME=????
	take ~/Dropbox/git/$REPONAME.git
	git init --bare

	cd ~/code/$REPONAME
	git remote rm origin
	git remote add origin ~/Dropbox/git/$REPONAME.git
	git push -u origin master
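The same steps as a small Python sketch, for anyone who wants to script this over several repos. It assumes `take` is a make-directory-and-cd helper (such as the one shipped with Oh My Zsh) and replaces it with `os.makedirs`. The function names and default directories are illustrative, and `archive_repo` needs `git` on the PATH.

```python
import os
import subprocess


def archive_steps(reponame, dropbox_git="~/Dropbox/git", code_dir="~/code"):
    """Return (working_directory, argv) pairs mirroring the shell steps above."""
    bare = os.path.expanduser(os.path.join(dropbox_git, reponame + ".git"))
    work = os.path.expanduser(os.path.join(code_dir, reponame))
    return [
        (bare, ["git", "init", "--bare"]),
        (work, ["git", "remote", "rm", "origin"]),
        (work, ["git", "remote", "add", "origin", bare]),
        (work, ["git", "push", "-u", "origin", "master"]),
    ]


def archive_repo(reponame, **kwargs):
    """Run the steps; stops at the first git command that fails."""
    steps = archive_steps(reponame, **kwargs)
    # Stand-in for `take`: make sure the bare repo directory exists.
    os.makedirs(steps[0][0], exist_ok=True)
    for cwd, argv in steps:
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling `archive_steps("myrepo")` first is a cheap way to check the paths before running anything.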

praveen-symphony / aliases.sh

Created October 27, 2016 01:59 — forked from subelsky/aliases.sh

Useful aliases for Ruby and Rails development and git maintenance

	alias a='ack'
	alias a?='alias | grep -i'
	alias adx='rake db:drop && rake db:create && heroku pg:transfer --from black --to postgres://postgres@localhost/staq_development --confirm staqweb --app staqweb && rails r "User.all.each { |u| u.update_attribute(:password,%q(password)) }" && rake db:test:prepare'
	alias b='bundle'
	alias bb='bundle install --binstubs=.bundle/bin --path=.bundle/gems && bundle package --all && reload ; sd'
	alias bc='bin/console'
	alias be='bundle exec'
	alias bea='bundle exec annotate'
	alias bu='bundle update'
	alias bus='bundle update staq_extraction'

praveen-symphony / large_redshift_tables.sql

Created October 27, 2016 01:58 — forked from subelsky/large_redshift_tables.sql

Quick SQL command to find large tables in Redshift

	-- based on http://stackoverflow.com/questions/21767780/how-to-find-size-of-database-schema-table-in-redshift
	SELECT name AS table_name, ROUND((COUNT(*) / 1024.0),2) as "Size in Gigabytes"
	FROM stv_blocklist
	INNER JOIN
	(SELECT DISTINCT id, name FROM stv_tbl_perm) names
	ON names.id = stv_blocklist.tbl
	GROUP BY name
	ORDER BY "Size in Gigabytes" DESC
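Why dividing by 1024 gives gigabytes: Redshift stores data in 1 MB blocks and stv_blocklist has one row per block, so the row count per table is its size in megabytes. The sketch below runs the same query against an in-memory SQLite database to show the arithmetic. The two tables are cut-down stand-ins with hypothetical rows, not real Redshift system tables, and SQLite is used only because it needs no cluster.

```python
import sqlite3

QUERY = """
SELECT name AS table_name, ROUND((COUNT(*) / 1024.0), 2) AS "Size in Gigabytes"
FROM stv_blocklist
INNER JOIN
(SELECT DISTINCT id, name FROM stv_tbl_perm) names
ON names.id = stv_blocklist.tbl
GROUP BY name
ORDER BY "Size in Gigabytes" DESC
"""


def demo_sizes():
    """Run the query on hypothetical data and return (table_name, size_gb) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stv_tbl_perm (id INTEGER, name TEXT, slice INTEGER)")
    conn.execute("CREATE TABLE stv_blocklist (tbl INTEGER, blocknum INTEGER)")
    # stv_tbl_perm repeats each table once per slice, hence the DISTINCT above.
    conn.executemany(
        "INSERT INTO stv_tbl_perm VALUES (?, ?, ?)",
        [(1, "events", 0), (1, "events", 1), (2, "users", 0)],
    )
    # Hypothetical block counts: 2048 blocks for events, 512 for users.
    conn.executemany(
        "INSERT INTO stv_blocklist VALUES (?, ?)",
        [(1, n) for n in range(2048)] + [(2, n) for n in range(512)],
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows
```

With these made-up counts, 2048 blocks come out as 2.0 GB and 512 blocks as 0.5 GB.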

praveen-symphony / spark-sql_error.log

Last active October 4, 2016 01:02

Inconsistent Hive versions on EMR 5.0.0 Cluster

	chgrp: '' does not match expected pattern for group
	Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
	16/10/04 00:30:44 WARN AvroSerDe: Encountered exception determining schema. Returning signal schema to indicate problem
	java.lang.NullPointerException
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
	at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:131)
	at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:112)
	at org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:167)
	at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:103)
	at org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:161)

praveen-symphony / hive_error_stack_trace.log

Created October 4, 2016 00:19

Hive on Spark on EMR 5.0.0 is not working?

	Warning: Master yarn-cluster is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
	Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
	at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createTimelineClient(YarnClientImpl.java:181)
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:168)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1119)
	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
	at org.apache.spark.deploy.yarn.Client.main(Client.scala)

praveen-symphony / spark_ide.py

Created September 21, 2016 23:54 — forked from bigaidream/spark_ide.py

To enable IDE (PyCharm) syntax support for Apache Spark, adapted from http://www.abisen.com/spark-from-ipython-notebook.html

	#!/public/spark-0.9.1/bin/pyspark

	import os
	import sys

	# Set the path for spark installation
	# this is the path where you have built spark using sbt/sbt assembly
	os.environ['SPARK_HOME'] = "/public/spark-0.9.1"
	# os.environ['SPARK_HOME'] = "/home/jie/d2/spark-0.9.1"
	# Append to PYTHONPATH so that pyspark could be found
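The preview is cut off at this point. As a rough sketch of what "append to PYTHONPATH" usually amounts to, the helper below adds Spark's `python` directory, plus any bundled Py4J zip found under `python/lib`, to `sys.path`. The function name is made up for illustration and the directory layout is an assumption about a typical Spark distribution, not something taken from the gist.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Set SPARK_HOME and make the pyspark package importable.

    Returns the list of entries that were newly added to sys.path.
    """
    os.environ["SPARK_HOME"] = spark_home
    # Assumed layout: <spark_home>/python and <spark_home>/python/lib/py4j-*.zip
    candidates = [os.path.join(spark_home, "python")]
    candidates += sorted(
        glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip"))
    )
    added = []
    for path in candidates:
        if path not in sys.path:
            sys.path.append(path)
            added.append(path)
    return added
```

After calling `add_pyspark_to_path("/public/spark-0.9.1")`, an `import pyspark` should resolve in the IDE, provided Spark is actually installed at that path.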