Hakan İlter hakanilter

@hakanilter
hakanilter / flume-kafka-source.properties
Last active May 17, 2018 16:17
Example Flume Configuration For Kafka Source
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1
# sources
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect = localhost:2181
tier1.sources.source1.topic = network-data
tier1.sources.source1.groupId = flume-kafka-test
tier1.sources.source1.channels = channel1
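The preview stops at the source definition, so channel1 and sink1 are referenced but never defined. A minimal completion (a memory channel and a logger sink; these settings are assumed, not taken from the gist) could look like:

```properties
# channels (assumed: not shown in the preview above)
tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000

# sinks (assumed: a logger sink, useful for testing the pipe)
tier1.sinks.sink1.type = logger
tier1.sinks.sink1.channel = channel1
```

Note that Flume sinks bind to a single channel via `channel` (singular), while sources use `channels`.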
@hakanilter
hakanilter / impala-partitions-report.sh
Last active July 20, 2018 10:32
Script for generating CSV partitions report for Impala
#!/bin/bash
# Script for generating CSV partitions report for Impala
IMPALA_DAEMON=localhost
databases=$(impala-shell --quiet -i "$IMPALA_DAEMON" -d default --delimited -q "SHOW DATABASES" | cut -f1 | grep -e dl -e ods_)
for database in $databases
do
    echo "$database"
    directory="partitions/$database"
    mkdir -p "$directory"
@hakanilter
hakanilter / AvroJsonToDf.scala
Created September 25, 2018 20:45
Load Avro files and extract the JSON string as a DataFrame
import org.apache.spark.sql.functions.udf
import spark.implicits._
// read avro
val input = "/Users/hakanilter/dev/workspace/mc/data/avroFiles/*"
val data = spark.read
  .format("com.databricks.spark.avro")
  .load(input)
@hakanilter
hakanilter / elasticsearch.sh
Last active September 26, 2018 23:00
Elasticsearch 6 (ES6) Setup Scripts
#! /bin/sh
### BEGIN INIT INFO
# Provides: elasticsearch
# Required-Start: $all
# Required-Stop: $all
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Short-Description: Starts elasticsearch
# Description: Starts elasticsearch using start-stop-daemon
### END INIT INFO
@hakanilter
hakanilter / CreateSparkDataFrameFromAzureBlobStorage.scala
Last active October 14, 2018 23:15
Create Spark DataFrame from Azure Blob Storage
/*
Add the following dependencies:
com.microsoft.azure:azure-storage:2.0.0
org.apache.hadoop:hadoop-azure:2.7.3
Exclude:
com.fasterxml.jackson.core:*:*
*/
spark.conf.set(
  "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
  "<your-storage-account-access-key>")
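With the account key set, files can be read through the `wasbs://` scheme that hadoop-azure provides; the path template (container, account, and path are placeholders) is:

```
wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<path>
```

Passing such a path to `spark.read` then works the same as any other filesystem URL.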
@hakanilter
hakanilter / athena.sql
Created November 8, 2018 13:04
Athena CREATE TABLE AS SELECT (CTAS) query with location
CREATE TABLE sampledb.test_empty_array_parquet
WITH (
format = 'PARQUET',
external_location = 's3://somewhere'
)
AS SELECT *
FROM sampledb.test_empty_array
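The same CTAS statement can also be submitted programmatically through the Athena API. A minimal sketch with boto3 (the result bucket path is a placeholder, and the actual call is left commented out so nothing hits AWS):

```python
# Sketch: submitting the CTAS statement above via the Athena API.
# The OutputLocation below is a placeholder, not a value from the gist.
query = """
CREATE TABLE sampledb.test_empty_array_parquet
WITH (format = 'PARQUET', external_location = 's3://somewhere')
AS SELECT * FROM sampledb.test_empty_array
"""

params = {
    "QueryString": query,
    "QueryExecutionContext": {"Database": "sampledb"},
    "ResultConfiguration": {"OutputLocation": "s3://somewhere/athena-results/"},
}

# import boto3
# athena = boto3.client("athena")
# response = athena.start_query_execution(**params)
```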
@hakanilter
hakanilter / awslogs-setup.sh
Last active November 22, 2018 10:44
Installing the AWS CloudWatch Logs Agent on Debian
sudo su
apt-get install -y libyaml-dev python-dev python3-dev python3-pip
pip3 install awscli-cwlogs
if [ ! -d /var/awslogs/bin ] ; then
mkdir -p /var/awslogs/bin
ln -s /usr/local/bin/aws /var/awslogs/bin/aws
fi
mkdir /opt/awslogs
cd /opt/awslogs
curl https://s3.amazonaws.com/aws-cloudwatch/downloads/latest/awslogs-agent-setup.py -O
@hakanilter
hakanilter / json_hive_definition.py
Last active November 28, 2018 16:41
Fastest way to get a Hive definition for a given JSON file
def json_hive_def(path):
    spark.read.json(path).createOrReplaceTempView("temp_view")
    spark.sql("CREATE TABLE temp_table AS SELECT * FROM temp_view LIMIT 0")
    script = spark.sql("SHOW CREATE TABLE temp_table").take(1)[0].createtab_stmt.replace('\n', '')
    spark.sql("DROP TABLE temp_table")
    return script
@hakanilter
hakanilter / ecs-run-and-wait.sh
Last active December 7, 2023 16:50
AWS ECS run task and wait for the result
# Requires JSON as the AWS CLI output format and the "jq" command-line tool
# If the task runs successfully, the script exits 0
run_result=$(aws ecs run-task \
--cluster ${CLUSTER} \
--task-definition ${TASK_DEFINITION} \
--launch-type EC2 \
--overrides "${OVERRIDES}")
echo "${run_result}"
container_arn=$(echo "${run_result}" | jq -r '.tasks[0].taskArn')
aws ecs wait tasks-stopped \
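The preview cuts off before the exit-code check that the header comment describes. As a sketch of that last step, here is the equivalent of the jq extraction in Python (the field names follow the shape of an ECS describe-tasks response; the sample payload and its values are hypothetical):

```python
import json

# Hypothetical describe-tasks response, trimmed to the fields the check needs
describe_output = json.loads("""
{
  "tasks": [
    {
      "taskArn": "arn:aws:ecs:eu-west-1:123456789012:task/example",
      "containers": [
        {"name": "app", "exitCode": 0}
      ]
    }
  ]
}
""")

# Same extraction jq would do with .tasks[0].containers[0].exitCode
exit_code = describe_output["tasks"][0]["containers"][0]["exitCode"]
print(exit_code)
```

In the shell script, a non-zero value here is what should make the script exit with a failure status.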
@hakanilter
hakanilter / mongodb-setup.sh
Created February 27, 2019 10:22
Amazon Linux Single Node Simple MongoDB Setup
# Update packages
sudo yum update -y
# Mount EBS volume
sudo mkfs -t xfs /dev/xvdb
sudo mkdir /data
sudo mount /dev/xvdb /data
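The mount above does not survive a reboot. A typical /etc/fstab entry for it (not part of the gist preview; `nofail` is a common safeguard so boot does not hang if the volume is detached, and the device name is assumed stable):

```
/dev/xvdb  /data  xfs  defaults,nofail  0  0
```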
# Install MongoDB
echo '