Vince Gonzalez vicenteg

@vicenteg
vicenteg / cluster.hosts.example
Created April 9, 2015 17:57
cluster inventory
server1 ansible_ssh_host=10.255.134.34
server2 ansible_ssh_host=10.255.134.35
server3 ansible_ssh_host=10.255.134.36
server4 ansible_ssh_host=10.255.134.37
server5 ansible_ssh_host=10.255.134.38
[cluster]
server[1:5]
vicenteg / my_task.py
Created April 9, 2015 20:40
luigi test
import datetime
import luigi
class TaskX(luigi.Task):
    x = luigi.IntParameter(default=777)

    def run(self):
        with self.output().open("w") as f:
            print >>f, self.x
vicenteg / gist:00d0b73c29898b0d86cd
Last active August 29, 2015 14:20
ES Marvel - add mapping for tweet timestamp
DELETE /tweets-2015-04-29
POST tweets-2015-04-28
PUT /tweets-2015-04-29
{
  "settings": {
    "analysis": {
      "analyzer": {
        "tweet_text_analyzer": {
          "type": "english"
vicenteg / luigi_test.py
Created May 20, 2015 10:38
small luigi example
import os
import luigi

class Foo(luigi.Task):
    def run(self):
        print "Running Foo"

    def requires(self):
vicenteg / README.md
Last active August 29, 2015 14:21
Log synth lots of data, ETL and sort it with Drill

It's a good idea to create a dedicated volume first, so you can see how much space the data consumes and how much compression helps:

maprcli volume create -name eoddata -path /user/vgonzalez/eoddata

Assuming you have installed log-synth to /opt, the following generates 10 million rows using 50 threads, with each thread writing its own file:

/opt/log-synth/synth -schema eoddata.json -count $((10 * 10**6)) -format json -output /mapr/se1/user/vgonzalez/eoddata/2015-05-18 -threads 50
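
The `-count` argument above relies on bash arithmetic expansion, where `**` is exponentiation. As a quick sanity check, the sketch below evaluates the same expression and estimates the rows per output file. The even split across threads is an assumption for illustration; log-synth may divide the work differently. This needs bash, since plain POSIX `sh` does not support `**`.

```shell
#!/bin/bash
# Same expression as passed to -count: 10 * 10^6.
rows=$((10 * 10**6))
threads=50

# Assumption: rows are divided evenly across threads (one file per thread).
rows_per_file=$((rows / threads))

echo "total rows: $rows"
echo "rows per file (if split evenly): $rows_per_file"
```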
vicenteg / README.md
Last active August 29, 2015 14:22
Can Drill query Sequencefile?

Yes, Drill can query SequenceFiles via the Hive metastore. Here's how.

Copy some sample data to a MapRFS/HDFS location

hadoop fs -put /opt/mapr/hive/hive-0.13/examples/files/kv1.seq /user/vgonzalez/tmp

Create an external table in Hive, referencing the sequencefile
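
A minimal sketch of what that Hive step might look like, not necessarily the statement the author used. The table name `kv_seq` and the `(key INT, value STRING)` columns are assumptions based on the usual layout of Hive's `kv1` sample files, and `LOCATION` assumes `/user/vgonzalez/tmp` is a directory holding only the copied file.

```shell
# Hypothetical table name and columns; adjust to match the actual data.
ddl=$(cat <<'EOF'
CREATE EXTERNAL TABLE kv_seq (key INT, value STRING)
STORED AS SEQUENCEFILE
LOCATION '/user/vgonzalez/tmp';
EOF
)

# Run the statement only if the hive CLI is available; otherwise just print it.
if command -v hive >/dev/null 2>&1; then
  hive -e "$ddl"
else
  echo "hive not found on PATH; DDL that would be run:"
  echo "$ddl"
fi
```

If Drill's Hive storage plugin is enabled under the name `hive` (an assumption about the cluster's configuration), the table should then be reachable from Drill as `hive.kv_seq`.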

<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
#!/bin/bash
# Copyright (c) 2009 & onwards. MapR Tech, Inc., All rights reserved
# Please set all environment variable you want to be used during MapR cluster
# runtime here.
# namely MAPR_HOME, JAVA_HOME, MAPR_SUBNETS
#set JAVA_HOME to override default search
#export JAVA_HOME=
export MAPR_SUBNETS=
#export MAPR_HOME=
#!/bin/bash
if maprcli node list -columns id; then
    NODEID=$(maprcli node list -columns id -filter hostname==`hostname -f` -noheader | cut -f 1 -d ' ')
    NODEVOLUMES=$(maprcli volume list -columns volumename | egrep "^mapr.`hostname -f`")
    for volume in $NODEVOLUMES; do
        maprcli volume remove -name $volume
    done
vicenteg / benchmark-commands.sh
Last active April 24, 2017 16:41 — forked from jkreps/benchmark-commands.txt
Kafka Benchmark Commands
export zookeepers=$(maprcli node listzookeepers -noheader)
export bootstrap_servers=$(maprcli node list -columns hostname -noheader -filter csvc==kafka | awk '{ print $1 }' | head -1)
# Producer
# Setup
bin/kafka-topics.sh --zookeeper $zookeepers --create --topic test-rep-one --partitions 6 --replication-factor 1
bin/kafka-topics.sh --zookeeper $zookeepers --create --topic test --partitions 6 --replication-factor 3
# Single thread, no replication