Vince Gonzalez vicenteg

It's a good idea to create a dedicated volume first, so you can see how much space the data consumes and how much compression helps:

maprcli volume create -name eoddata -path /user/vgonzalez/eoddata

Assuming you have installed log-synth to /opt, the following will create 10 million rows in 50 threads, with each thread producing a file:

/opt/log-synth/synth -schema eoddata.json -count $((10 * 10**6)) -format json -output /mapr/se1/user/vgonzalez/eoddata/2015-05-18 -threads 50
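As a rough sanity check on the numbers in that command, the sketch below evaluates the same arithmetic expansion and estimates how many rows land in each output file. It assumes the rows are split evenly across the threads, which is an assumption for illustration and not something stated above; the variable names are hypothetical. Note that `**` is a bash arithmetic operator, so this needs bash rather than a plain POSIX shell.

```shell
#!/bin/bash
# Same arithmetic expansion as the -count argument: 10 * 10^6 rows.
count=$((10 * 10**6))
threads=50

# Assumption: rows are divided evenly across threads, one file per thread.
rows_per_file=$((count / threads))

echo "total rows:    $count"
echo "output files:  $threads"
echo "rows per file: $rows_per_file"
```

Comparing the size of one generated file with the volume's reported usage should then give a feel for how much compression is saving.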

Yes, Drill can query SequenceFiles via the Hive metastore. Here's how.

Copy some sample data to a MapR-FS/HDFS location:

hadoop fs -put /opt/mapr/hive/hive-0.13/examples/files/kv1.seq /user/vgonzalez/tmp

	import datetime
	import luigi

	class TaskX(luigi.Task):
	    x = luigi.IntParameter(default=777)

	    def run(self):
	        # Write the parameter value to the task's output target.
	        with self.output().open("w") as f:
	            f.write("%s\n" % self.x)

	<?xml version='1.0' encoding='UTF-8'?>
	<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
	Licensed to the Apache Software Foundation (ASF) under one or more
	contributor license agreements. See the NOTICE file distributed with
	this work for additional information regarding copyright ownership.
	The ASF licenses this file to You under the Apache License, Version 2.0
	(the "License"); you may not use this file except in compliance with
	the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	#!/bin/bash
	# Copyright (c) 2009 & onwards. MapR Tech, Inc., All rights reserved
	# Please set all environment variable you want to be used during MapR cluster
	# runtime here.
	# namely MAPR_HOME, JAVA_HOME, MAPR_SUBNETS

	#set JAVA_HOME to override default search
	#export JAVA_HOME=
	export MAPR_SUBNETS=
	#export MAPR_HOME=

	#!/bin/bash

	if maprcli node list -columns id; then
	    NODEID=$(maprcli node list -columns id -filter hostname==`hostname -f` -noheader | cut -f 1 -d ' ')
	    NODEVOLUMES=$(maprcli volume list -columns volumename | egrep "^mapr.`hostname -f`")

	    for volume in $NODEVOLUMES; do
	        maprcli volume remove -name $volume
	    done

	export zookeepers=$(maprcli node listzookeepers -noheader)
	export bootstrap_servers=$(maprcli node list -columns hostname -noheader -filter csvc==kafka | awk '{ print $1 }' | head -1)

	# Producer

	# Setup
	bin/kafka-topics.sh --zookeeper $zookeepers --create --topic test-rep-one --partitions 6 --replication-factor 1
	bin/kafka-topics.sh --zookeeper $zookeepers --create --topic test --partitions 6 --replication-factor 3

	# Single thread, no replication