Harleen Mann mannharleen
// Apex: class for creating a record of any sObject type, using Schema global-describe metadata
public class createAnyObjectRecord {
public Map<String, Schema.SObjectType> gd_map_full;
//public List<Schema.SObjecttype> gd_lst_full; **delete
public Map<String, Schema.SObjectType> gd;
/*public List<Schema.SObjectField> field_names_lst{get;set;}
Schema.sObjectField field_name;
public schema.SObjectType object_name;*/
@mannharleen
mannharleen / hive-insert-partition.scala
Created August 30, 2017 17:05
Use Spark SQL in a Hive context to create a managed partitioned table; load data via a temp table and insert it into the managed table using the Hive substring function
//import necessary
import org.apache.spark.{SparkContext,SparkConf}
import org.apache.spark.sql.hive.HiveContext
//initializations
val conf = new SparkConf().setAppName("xx").setMaster("local[2]")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
//create and load temp table
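// (a minimal sketch of how the gist likely continues; table, column and file names below are assumptions)
val rawDF = hiveContext.read.json("orders.json") // any DataFrame source works here
rawDF.registerTempTable("orders_tmp")
// create the managed partitioned table
hiveContext.sql("create table orders_part (order_id Int, order_date String) partitioned by (order_year String) row format delimited fields terminated by ','")
// enable dynamic partitioning and insert, deriving the partition key with the Hive substring function
hiveContext.sql("set hive.exec.dynamic.partition=true")
hiveContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
hiveContext.sql("insert into table orders_part partition (order_year) select order_id, order_date, substring(order_date,1,4) from orders_tmp")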
@mannharleen
mannharleen / hive-retail_db.scala
Last active August 31, 2017 15:33
Create managed Hive tables for the retail_db dataset
//Data can be found at https://github.com/mannharleen/data/blob/master/retail_db.zip
hiveContext.sql("create database retail_db")
hiveContext.sql("use retail_db")
hiveContext.sql("create table categories (category_id Int,category_department_id Int, category_name String) row format delimited fields terminated by ',' ")
hiveContext.sql("create table departments (department_id Int, department_name String) row format delimited fields terminated by ',' ")
hiveContext.sql("create table products (product_id Int, product_category_id Int, product_name String, product_description String, product_price Double, product_image String) row format delimited fields terminated by ',' ")
hiveContext.sql("create table order_items (order_item_id Int, order_item_order_id Int, order_item_product_id Int, order_item_quantity Int, order_item_subtotal Double, order_item_product_price Double) row format delimited fields terminated by ',' ")
hiveContext.sql("create table orders (order_id Int, order_date String, order_customer_id Int, o
@mannharleen
mannharleen / chmod-using-winutils.cmd
Created September 1, 2017 11:05
Error: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-
/*
Spark and hive on windows environment
Error: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-
Pre-requisites: Have winutils.exe placed in c:\winutils\bin\
Resolve as follows:
*/
C:\user>c:\Winutils\bin\winutils.exe ls
FindFileOwnerAndPermission error (1789): The trust relationship between this workstation and the primary domain failed.
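:: the commonly documented resolution (a sketch; the exact path in the original gist may differ):
:: grant write permission on the Hive scratch directory with winutils, then re-check with ls
C:\user>c:\Winutils\bin\winutils.exe chmod -R 777 \tmp\hive
C:\user>c:\Winutils\bin\winutils.exe ls \tmp\hive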
@mannharleen
mannharleen / sparkSQL-JDBC.scala
Last active September 20, 2017 14:56
Spark SQL: connect to a JDBC source to extract data
import org.apache.spark.{SparkContext,SparkConf}
import org.apache.spark.sql.hive.HiveContext
//Add the following artifact to build.sbt:
//libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.43"
//initializations
val conf = new SparkConf().setAppName("xx").setMaster("local[2]")
val sc = new SparkContext(conf)
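// a minimal sketch of the JDBC extract itself; the URL, credentials and table name are assumptions
val hiveContext = new HiveContext(sc)
val jdbcDF = hiveContext.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/retail_db")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "orders")
  .option("user", "root")
  .option("password", "password")
  .load()
jdbcDF.show()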
@mannharleen
mannharleen / rdd-to-DF-NOreflection.scala
Last active September 20, 2017 15:07
When case classes cannot be defined in advance (e.g. the schema is only known at runtime, read from a JSON file), the reflection approach, i.e. rdd.map(x => case_class(x._1, x._2)).toDF(), cannot be used; the schema must be specified programmatically instead
//Programmatically Specifying the Schema
import org.apache.spark.{SparkContext,SparkConf}
import org.apache.spark.sql.hive.HiveContext
//initializations
val conf = new SparkConf().setAppName("xx").setMaster("local[2]")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
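// a minimal sketch of the programmatic-schema approach (the sample data and column names are assumptions)
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
val rdd = sc.parallelize(List(("1", "one"), ("2", "two")))
val rowRDD = rdd.map(x => Row(x._1, x._2))
val schema = StructType(List(StructField("id", StringType, true), StructField("name", StringType, true)))
val df = hiveContext.createDataFrame(rowRDD, schema)
df.show()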
import org.apache.spark.{SparkContext,SparkConf}
import org.apache.spark.sql.hive.HiveContext
//initializations
val conf = new SparkConf().setAppName("xx").setMaster("local[2]")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
//
hiveContext.udf.register("myudf",(a:Int, b:Int) => a+b)
//pyspark
//sqlContext.udf.register("myudf",lambda x,y: x+y)
hiveContext.sql("select myudf(1,2)").show
//same can be used with sqlContext
/*
Output:
+---+
|_c0|
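+---+
|  3|
+---+
*/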
@mannharleen
mannharleen / DF-cube.scala
Created September 2, 2017 09:26
Spark dataframe cube basics
val rdd1 = sc.parallelize(List((1,"one"),(2,"two")))
val df1 = rdd1.toDF("col1","col2")
//use reflection to convert the RDD to a DataFrame
//create a cube with dimensions col1 and col2, and the fact as the average of col1
df1.cube("col1","col2").agg( Map( "col1" -> "avg" )).show
/*
Outputs:
+----+----+---------+
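|col1|col2|avg(col1)|
+----+----+---------+
|   1| one|      1.0|
|   1|null|      1.0|
|null| one|      1.0|
|   2| two|      2.0|
|   2|null|      2.0|
|null| two|      2.0|
|null|null|      1.5|
+----+----+---------+
(row order and the exact aggregate column name can vary by Spark version)
*/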
@mannharleen
mannharleen / sparkStreaming-basics.scala
Created September 2, 2017 15:19
Add numbers in a DStream
/*
Add up the numbers arriving in each batch of the DStream
*/
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("sbtapp").setMaster("local[3]")
val ssc = new StreamingContext(conf,Seconds(5))
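// a minimal sketch of how the gist likely continues; the socket source, host and port are assumptions
val lines = ssc.socketTextStream("localhost", 9999)
// parse each line as an integer and sum the values within every 5-second batch
val batchSums = lines.map(_.toInt).reduce(_ + _)
batchSums.print()
ssc.start()
ssc.awaitTermination()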