# Debian/Ubuntu build dependencies (the -dev package names below are apt packages, not yum)
apt-get install -y ant gcc g++ libkrb5-dev libmysqlclient-dev
apt-get install -y libssl-dev libsasl2-dev libsasl2-modules-gssapi-mit
apt-get install -y libsqlite3-dev libtidy-0.99-0 libxml2-dev libxslt-dev
apt-get install -y maven libldap2-dev python-dev python-simplejson python-setuptools
# CentOS/RHEL equivalents use the -devel package names with yum
yum install -y libxslt-devel libxml++-devel libxml2-devel libffi-devel
-- Finding rows where two tables disagree on a value for the same key
SELECT * FROM A a, B b WHERE a.id = b.id AND a.name <> b.name;
-- Returns one row for each CHECK, UNIQUE, PRIMARY KEY, and/or FOREIGN KEY constraint
SELECT *
FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS
WHERE CONSTRAINT_NAME = 'XYZ'
-- Returns one row for each FOREIGN KEY constraint
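-- (The query itself is missing from the original notes; a likely form, using the
-- standard INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS view:)
SELECT *
FROM INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS
WHERE CONSTRAINT_NAME = 'XYZ'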
# Run postgres instance
docker run --name postgres -p 5000:5432 debezium/postgres
# Run zookeeper instance
docker run -it --name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper
# Run kafka instance
docker run -it --name kafka -p 9092:9092 --link zookeeper:zookeeper debezium/kafka
# Run kafka connect |
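# (The connect command is missing from the original notes; a likely invocation,
# assuming the debezium/connect image and the container names used above.)
docker run -it --name connect -p 8083:8083 \
  -e GROUP_ID=1 \
  -e CONFIG_STORAGE_TOPIC=my_connect_configs \
  -e OFFSET_STORAGE_TOPIC=my_connect_offsets \
  --link zookeeper:zookeeper --link kafka:kafka --link postgres:postgres \
  debezium/connect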
Follow the uninstall guide (link). Also remove the existing folders and users.
# Remove conf and logs
rm -rf /etc/hadoop
rm -rf /etc/hbase
rm -rf /etc/hive
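# (Not in the original notes) Remove the service users as well; the exact user names
# are an assumption and depend on how the cluster was installed.
userdel -r hdfs
userdel -r hbase
userdel -r hive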
import org.apache.spark.sql.functions.udf

val csv = spark.read.format("csv").option("header", true).load("/Users/tilak/Downloads/Pam/SalesAnalysis/data/store_sales_unified_2017.csv")

// Build a composite key identifying a single bill: receipt + register + time + date
val uniqueKey: (String, String, String, String) => String = (x, y, z, v) => x + "_" + y + "_" + z + "_" + v
val someFn = udf(uniqueKey)
val newData = csv.withColumn("unique", someFn(csv.col("receipt_id"), csv.col("cash_register_id"), csv.col("sale_time"), csv.col("date")))

// Count how many times each article appears on each bill
val countArticles = newData.groupBy("unique", "article_id").count()

// Pair up articles that appear on the same bill (same key, different article)
val sameBill = countArticles.crossJoin(countArticles).filter(x => x.getString(0) == x.getString(3) && x.getString(1) != x.getString(4))

// Disambiguate the duplicated column names produced by the cross join
val newNames = sameBill.columns.toList.zipWithIndex.map(x => x._1 + "_" + x._2)
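// (Not in the original notes) A likely next step: apply the indexed names so the
// duplicated columns from the cross join can be referenced unambiguously.
val sameBillRenamed = sameBill.toDF(newNames: _*)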
import org.apache.spark.sql.functions.udf

// Raise the pivot limit so a pivot over every article_id does not fail
spark.conf.set("spark.sql.pivotMaxValues", Int.MaxValue.toString)

val csv = spark.read.format("csv").option("header", true).load("/Users/tilak/Downloads/Pam/SalesAnalysis/data/store_sales_unified_2017.csv")

// Same composite bill key as above
val uniqueKey: (String, String, String, String) => String = (x, y, z, v) => x + "_" + y + "_" + z + "_" + v
val someFn = udf(uniqueKey)
val newData = csv.withColumn("unique", someFn(csv.col("receipt_id"), csv.col("cash_register_id"), csv.col("sale_time"), csv.col("date")))
val countArticles = newData.groupBy("unique", "article_id").count()

// Collect the distinct article ids to use as explicit pivot values
val articles = countArticles.select("article_id").distinct()
val articleIds = articles.collect.map(x => x(0))
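// (Not in the original notes) A sketch of the likely intent behind raising
// spark.sql.pivotMaxValues and collecting the ids: one row per bill, one column per article.
val billArticleMatrix = newData.groupBy("unique").pivot("article_id", articleIds.toSeq).count()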
A source schema has to be declared before extracting the data from the source. The source.schema property is available for this purpose; it takes a JSON value defining the source schema. This schema is used by Converters to perform data type or data format conversions. The Java class representation of a source schema can be found in Schema.java.
In the Gobblin library, a Converter is an interface for classes that implement data transformations, e.g. data type conversions, schema projections, data manipulations, data filtering, etc. The interface is responsible for converting both the schema and the data records. Classes implementing this interface are composable and can be chained together to achieve more complex data transformations.
A converter basically needs four inputs:
- Input schema
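The snippet below (not part of the original notes) is a rough sketch of what a custom converter might look like, assuming Gobblin's Converter<SI, SO, DI, DO> contract with convertSchema and convertRecord methods; the package names and the SingleRecordIterable helper are assumptions and may differ across Gobblin versions.

import org.apache.gobblin.configuration.WorkUnitState
import org.apache.gobblin.converter.{Converter, SingleRecordIterable}

// Identity schema, upper-cases each string record; type parameters are
// <input schema, output schema, input record, output record>.
class UpperCaseConverter extends Converter[String, String, String, String] {

  // Schema conversion: here the schema is passed through unchanged.
  override def convertSchema(inputSchema: String, workUnit: WorkUnitState): String =
    inputSchema

  // Record conversion: a converter may emit zero, one, or many output records.
  override def convertRecord(outputSchema: String, inputRecord: String, workUnit: WorkUnitState): java.lang.Iterable[String] =
    new SingleRecordIterable(inputRecord.toUpperCase)
}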
I hereby claim:
- I am tilakpatidar on github.
- I am tilakpatidar (https://keybase.io/tilakpatidar) on keybase.
- I have a public key ASBrc8-ucimp_8n0hPOuAsj1mFBpAf84XYHuuGuTavTTewo
To claim this, I am signing this object:
check host appsrv1 with address 127.0.0.1
  start program = "/sbin/start myapp"
  stop program = "/sbin/stop myapp"
  alert [email protected] on {timeout, connection}
  if failed port 9009 protocol HTTP
    request /
    with timeout 3 seconds
    then restart
  if 10 restarts within 10 cycles then timeout
  if 10 restarts within 10 cycles then exec "/usr/bin/monit start aws-dns-healthcheck"
val x = { println("x"); 15 }
//x
//x: Int = 15
lazy val y = { println("y"); 13 }
//y: Int = <lazy>
x
//res2: Int = 15
y
//y
//res3: Int = 13