# Debian/Ubuntu build dependencies (the -dev package names below are apt packages, not yum)
apt-get install -y ant gcc g++ libkrb5-dev libmysqlclient-dev
apt-get install -y libssl-dev libsasl2-dev libsasl2-modules-gssapi-mit
apt-get install -y libsqlite3-dev libtidy-0.99-0 libxml2-dev libxslt-dev
apt-get install -y maven libldap2-dev python-dev python-simplejson python-setuptools
# CentOS/RHEL equivalents use the -devel package names with yum
yum install -y libxslt-devel libxml++-devel libxml2-devel libffi-devel
-- Finding rows where two tables disagree on a value for the same key
SELECT * FROM A a, B b WHERE a.id = b.id AND a.name <> b.name;
-- Returns one row for each CHECK, UNIQUE, PRIMARY KEY, and/or FOREIGN KEY constraint
SELECT *
FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS
WHERE CONSTRAINT_NAME = 'XYZ'
-- Returns one row for each FOREIGN KEY constraint
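-- (The query itself is missing from the original notes; a likely form, using the
-- standard INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS view:)
SELECT *
FROM INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS
WHERE CONSTRAINT_NAME = 'XYZ'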
# Run postgres instance
docker run --name postgres -p 5000:5432 debezium/postgres
# Run zookeeper instance
docker run -it --name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper
# Run kafka instance
docker run -it --name kafka -p 9092:9092 --link zookeeper:zookeeper debezium/kafka
# Run kafka connect |
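# (The connect command is missing from the original notes; a likely invocation,
# assuming the debezium/connect image and the container names used above.)
docker run -it --name connect -p 8083:8083 \
  -e GROUP_ID=1 \
  -e CONFIG_STORAGE_TOPIC=my_connect_configs \
  -e OFFSET_STORAGE_TOPIC=my_connect_offsets \
  --link zookeeper:zookeeper --link kafka:kafka --link postgres:postgres \
  debezium/connect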
Follow the uninstall guide (link). Also remove the existing folders and users.
# Remove conf and logs
rm -rf /etc/hadoop
rm -rf /etc/hbase
rm -rf /etc/hive
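# (Not in the original notes) Remove the service users as well; the exact user names
# are an assumption and depend on how the cluster was installed.
userdel -r hdfs
userdel -r hbase
userdel -r hive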
import org.apache.spark.sql.functions.udf

val csv = spark.read.format("csv").option("header", true).load("/Users/tilak/Downloads/Pam/SalesAnalysis/data/store_sales_unified_2017.csv")

// Build a composite key identifying a single bill: receipt + register + time + date
val uniqueKey: (String, String, String, String) => String = (x, y, z, v) => x + "_" + y + "_" + z + "_" + v
val someFn = udf(uniqueKey)
val newData = csv.withColumn("unique", someFn(csv.col("receipt_id"), csv.col("cash_register_id"), csv.col("sale_time"), csv.col("date")))

// Count how many times each article appears on each bill
val countArticles = newData.groupBy("unique", "article_id").count()

// Pair up articles that appear on the same bill (same key, different article)
val sameBill = countArticles.crossJoin(countArticles).filter(x => x.getString(0) == x.getString(3) && x.getString(1) != x.getString(4))

// Disambiguate the duplicated column names produced by the cross join
val newNames = sameBill.columns.toList.zipWithIndex.map(x => x._1 + "_" + x._2)
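// (Not in the original notes) A likely next step: apply the indexed names so the
// duplicated columns from the cross join can be referenced unambiguously.
val sameBillRenamed = sameBill.toDF(newNames: _*)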
import org.apache.spark.sql.functions.udf

// Raise the pivot limit so a pivot over every article_id does not fail
spark.conf.set("spark.sql.pivotMaxValues", Int.MaxValue.toString)

val csv = spark.read.format("csv").option("header", true).load("/Users/tilak/Downloads/Pam/SalesAnalysis/data/store_sales_unified_2017.csv")

// Same composite bill key as above
val uniqueKey: (String, String, String, String) => String = (x, y, z, v) => x + "_" + y + "_" + z + "_" + v
val someFn = udf(uniqueKey)
val newData = csv.withColumn("unique", someFn(csv.col("receipt_id"), csv.col("cash_register_id"), csv.col("sale_time"), csv.col("date")))
val countArticles = newData.groupBy("unique", "article_id").count()

// Collect the distinct article ids to use as explicit pivot values
val articles = countArticles.select("article_id").distinct()
val articleIds = articles.collect.map(x => x(0))
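// (Not in the original notes) A sketch of the likely intent behind raising
// spark.sql.pivotMaxValues and collecting the ids: one row per bill, one column per article.
val billArticleMatrix = newData.groupBy("unique").pivot("article_id", articleIds.toSeq).count()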
A source schema has to be declared before extracting the data from the source. The source.schema property is available for this purpose; it takes a JSON value defining the source schema. This schema is used by Converters to perform data type or data format conversions. The Java class representation of a source schema can be found in Schema.java.
In the Gobblin library, a Converter is an interface for classes that implement data transformations, e.g. data type conversions, schema projections, data manipulations, data filtering, etc. The interface is responsible for converting both the schema and the data records. Classes implementing this interface are composable and can be chained together to achieve more complex data transformations.
A converter basically needs four inputs:
- Input schema
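The snippet below (not part of the original notes) is a rough sketch of what a custom converter might look like, assuming Gobblin's Converter<SI, SO, DI, DO> contract with convertSchema and convertRecord methods; the package names and the SingleRecordIterable helper are assumptions and may differ across Gobblin versions.

import org.apache.gobblin.configuration.WorkUnitState
import org.apache.gobblin.converter.{Converter, SingleRecordIterable}

// Identity schema, upper-cases each string record; type parameters are
// <input schema, output schema, input record, output record>.
class UpperCaseConverter extends Converter[String, String, String, String] {

  // Schema conversion: here the schema is passed through unchanged.
  override def convertSchema(inputSchema: String, workUnit: WorkUnitState): String =
    inputSchema

  // Record conversion: a converter may emit zero, one, or many output records.
  override def convertRecord(outputSchema: String, inputRecord: String, workUnit: WorkUnitState): java.lang.Iterable[String] =
    new SingleRecordIterable(inputRecord.toUpperCase)
}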
I hereby claim:
- I am tilakpatidar on github.
- I am tilakpatidar (https://keybase.io/tilakpatidar) on keybase.
- I have a public key ASBrc8-ucimp_8n0hPOuAsj1mFBpAf84XYHuuGuTavTTewo
To claim this, I am signing this object:
check host appsrv1 with address 127.0.0.1
  start program = "/sbin/start myapp"
  stop program = "/sbin/stop myapp"
  alert [email protected] on {timeout, connection}
  if failed port 9009 protocol HTTP
    request /
    with timeout 3 seconds
    then restart
  if 10 restarts within 10 cycles then timeout
  if 10 restarts within 10 cycles then exec "/usr/bin/monit start aws-dns-healthcheck"
val x = { println("x"); 15 }
//x
//x: Int = 15
lazy val y = { println("y"); 13 }
//y: Int = <lazy>
x
//res2: Int = 15
y
//y
//res3: Int = 13