Log in to the LLAP host node.
A) Test with spark-shell
step 1:
cd /tmp
wget https://raw.githubusercontent.com/dbompart/hive_warehouse_connector/master/hwc_info_collect.sh
chmod +x hwc_info_collect.sh
./hwc_info_collect.sh
Read the instructions in the output of the above script and, as the spark user, start spark-shell with the configurations it prints.
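For reference, the launch command the script prints looks roughly like the one below. The metastore host, connector JAR/zip versions, JDBC URL and ZooKeeper quorum are placeholders here and must be replaced with the values from your own script output; on kerberized clusters the script additionally prints spark.sql.hive.hiveserver2.jdbc.url.principal, which should be included as well:
spark-shell --master yarn --conf spark.datasource.hive.warehouse.load.staging.dir=/tmp --conf spark.datasource.hive.warehouse.metastoreUri=thrift://<metastore-host>:9083 --conf spark.hadoop.hive.llap.daemon.service.hosts=@llap0 --conf spark.jars=/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<version>.jar --conf spark.submit.pyFiles=/usr/hdp/current/hive_warehouse_connector/pyspark_hwc-<version>.zip --conf spark.security.credentials.hiveserver2.enabled=false --conf spark.sql.hive.hiveserver2.jdbc.url="<jdbc-url-from-script-output>" --conf spark.sql.hive.zookeeper.quorum="<zookeeper-quorum>"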
step 2 : Run the statements below (replace <fill table name> with your table name):
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show()
hive.executeQuery("select * from <fill table name>").show
step 3 : Exit spark-shell
B) Test with pyspark
step 1 : Replace spark-shell with pyspark in the command from section A and start pyspark.
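For example, with the same placeholder values as in section A (take the actual values from your script output):
pyspark --master yarn --conf spark.datasource.hive.warehouse.load.staging.dir=/tmp --conf spark.datasource.hive.warehouse.metastoreUri=thrift://<metastore-host>:9083 --conf spark.hadoop.hive.llap.daemon.service.hosts=@llap0 --conf spark.jars=/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<version>.jar --conf spark.submit.pyFiles=/usr/hdp/current/hive_warehouse_connector/pyspark_hwc-<version>.zip --conf spark.security.credentials.hiveserver2.enabled=false --conf spark.sql.hive.hiveserver2.jdbc.url="<jdbc-url-from-script-output>" --conf spark.sql.hive.zookeeper.quorum="<zookeeper-quorum>"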
step 2 : Run the statements below (replace <fill table name> with your table name):
from pyspark.sql import SparkSession
from pyspark_llap.sql.session import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
hive.executeQuery("select * from <fill table name>").show()
step 3 : exit pyspark
C) Test with spark-submit
step 1 : Create a python file ~/pyspark_HWCDemo.py with the following content:
import sys
from pyspark.sql import SparkSession
from pyspark_llap.sql.session import HiveWarehouseSession

def main(argv):
    # Initiate the Spark session
    spark = SparkSession.builder.appName("pyspark_HWCDemo.py").enableHiveSupport().getOrCreate()
    # Build the Hive Warehouse Connector session on top of it
    hive = HiveWarehouseSession.session(spark).build()
    # The SQL statement is passed as the first command-line argument
    statement = argv[0]
    if statement.lower().startswith("select"):
        # Queries that return a result set go through executeQuery()
        hive.executeQuery(statement).show()
    else:
        # Catalog statements such as "show tables" go through execute()
        hive.execute(statement).show()

if __name__ == '__main__':
    main(sys.argv[1:])
step 2 : To run in YARN client mode, replace spark-shell with spark-submit in the command above and make sure the following configurations are included:
--conf spark.sql.hive.hiveserver2.jdbc.url.principal=hive/[email protected] (replace with your Hive principal name)
--conf spark.security.credentials.hiveserver2.enabled=false
Select from a table, for example:
spark-submit --master yarn --deploy-mode client --conf spark.datasource.hive.warehouse.load.staging.dir=/tmp --conf spark.datasource.hive.warehouse.metastoreUri=thrift://c4441-node3.coelab.cloudera.com:9083 --conf spark.hadoop.hive.llap.daemon.service.hosts=@llap0 --conf spark.jars=/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152.jar --conf spark.submit.pyFiles=/usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.5.0-152.zip --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://c4441-node2.coelab.cloudera.com:2181,c4441-node3.coelab.cloudera.com:2181,c4441-node4.coelab.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive" --conf spark.sql.hive.zookeeper.quorum="c4441-node2.coelab.cloudera.com:2181,c4441-node3.coelab.cloudera.com:2181,c4441-node4.coelab.cloudera.com:2181" --conf spark.security.credentials.hiveserver2.enabled=false --conf spark.sql.hive.hiveserver2.jdbc.url.principal=hive/[email protected] ~/pyspark_HWCDemo.py "select * from hwc_db.crimes"
Show tables, for example:
spark-submit --master yarn --deploy-mode client --conf spark.datasource.hive.warehouse.load.staging.dir=/tmp --conf spark.datasource.hive.warehouse.metastoreUri=thrift://c4441-node3.coelab.cloudera.com:9083 --conf spark.hadoop.hive.llap.daemon.service.hosts=@llap0 --conf spark.jars=/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152.jar --conf spark.submit.pyFiles=/usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.5.0-152.zip --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://c4441-node2.coelab.cloudera.com:2181,c4441-node3.coelab.cloudera.com:2181,c4441-node4.coelab.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive" --conf spark.sql.hive.zookeeper.quorum="c4441-node2.coelab.cloudera.com:2181,c4441-node3.coelab.cloudera.com:2181,c4441-node4.coelab.cloudera.com:2181" --conf spark.security.credentials.hiveserver2.enabled=false --conf spark.sql.hive.hiveserver2.jdbc.url.principal=hive/[email protected] ~/pyspark_HWCDemo.py "show tables"
step 3 : To run in YARN cluster mode, change the following in the command above:
--deploy-mode cluster
--conf spark.security.credentials.hiveserver2.enabled=true
For example:
spark-submit --master yarn --deploy-mode cluster --conf spark.datasource.hive.warehouse.load.staging.dir=/tmp --conf spark.datasource.hive.warehouse.metastoreUri=thrift://c4441-node3.coelab.cloudera.com:9083 --conf spark.hadoop.hive.llap.daemon.service.hosts=@llap0 --conf spark.jars=/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152.jar --conf spark.submit.pyFiles=/usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.5.0-152.zip --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://c4441-node2.coelab.cloudera.com:2181,c4441-node3.coelab.cloudera.com:2181,c4441-node4.coelab.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive" --conf spark.sql.hive.zookeeper.quorum="c4441-node2.coelab.cloudera.com:2181,c4441-node3.coelab.cloudera.com:2181,c4441-node4.coelab.cloudera.com:2181" --conf spark.security.credentials.hiveserver2.enabled=true --conf spark.sql.hive.hiveserver2.jdbc.url.principal=hive/[email protected] ~/pyspark_HWCDemo.py "select * from hwc_db.crimes"
Courtesy: David Bompart
Note : If there is no direct internet connection from the LLAP node to access https://raw.githubusercontent.com/dbompart/hive_warehouse_connector/master/hwc_info_collect.sh, copy the following code into 'hwc_info_collect.sh' instead:
#!/bin/bash
set -e
#### For version: HDP 3.1.x
#### This script has to be run on the LLAP Host.
#### It can be ported to retrieve information using the Ambari REST APIs, allowing this to work from any host. Using bash for speed.
#### Pre-requisites:
# 1) LLAP is installed on this host.
# 2) Spark Client is installed on this host.
# File holder for required information:
hive_site_llap=/etc/hive_llap/conf/hive-site.xml
beeline_site_llap=/etc/hive_llap/conf/beeline-site.xml
if [ -r "$hive_site_llap" ] && [ -r "$beeline_site_llap" ]; then
hive_metastore_uris=$(grep -e "thrift.*9083" "$hive_site_llap" |awk -F"<|>" '{print $3}')
hive_llap_daemon_service_hosts=$(grep "hive.llap.daemon.service.hosts" -A1 "$hive_site_llap" |awk 'NR==2' | awk -F"[<|>]" '{print $3}')
hive_zookeeper_quorum=$(grep "hive.zookeeper.quorum" -A1 "$hive_site_llap" |awk 'NR==2' | awk -F"[<|>]" '{print $3}')
hwc_jar=$(find /usr/hdp/current/hive_warehouse_connector/ -name "*assembly*.jar")
hwc_pyfile=$(find /usr/hdp/current/hive_warehouse_connector/ -name "*hwc*.zip")
hive_jdbc_url=$(grep "beeline.hs2.jdbc.url.llap" -A1 "$beeline_site_llap" |awk 'NR==2' | awk -F"[<|>]" '{print $3}')
hive_jdbc_url_principal=$(grep "hive.server2.authentication.kerberos.principal" -A1 "$hive_site_llap" |awk 'NR==2' | awk -F"[<|>]" '{print $3}')
echo -e "To apply this configuration cluster wide, copy and paste the following list of properties in Ambari UI -> Spark2 -> Configs -> Advanced -> Custom spark2-defaults (Bulk Property Add mode)\n"
echo -e "spark.datasource.hive.warehouse.load.staging.dir=/tmp"
echo -e "spark.datasource.hive.warehouse.metastoreUri="$hive_metastore_uris
echo -e "spark.hadoop.hive.llap.daemon.service.hosts="$hive_llap_daemon_service_hosts
echo -e "spark.jars="$hwc_jar
echo -e "spark.submit.pyFiles="$hwc_pyfile
echo -e "spark.security.credentials.hiveserver2.enabled=false"
echo -e "spark.sql.hive.hiveserver2.jdbc.url="$hive_jdbc_url
echo -e "spark.sql.hive.zookeeper.quorum="$hive_zookeeper_quorum
#If Kerberized:
[ ! -z "$hive_jdbc_url_principal" ] && echo -e "spark.sql.hive.hiveserver2.jdbc.url.principal="$hive_jdbc_url_principal
echo -e "\n### Save and restart."
echo -e "\nNote: In a kerberized environment the property spark.security.credentials.hiveserver2.enabled has to be set to TRUE for deploy-mode cluster, i.e.:\n spark-submit --conf spark.security.credentials.hiveserver2.enabled=true"
echo -e "\nIf you'd like to test this per job instead of cluster wide, then use the following command as an example:\n
spark-shell --master yarn --conf spark.datasource.hive.warehouse.load.staging.dir=/tmp --conf spark.datasource.hive.warehouse.metastoreUri=$hive_metastore_uris --conf spark.hadoop.hive.llap.daemon.service.hosts=$hive_llap_daemon_service_hosts --conf spark.jars=$hwc_jar --conf spark.submit.pyFiles=$hwc_pyfile --conf spark.security.credentials.hiveserver2.enabled=false --conf spark.sql.hive.hiveserver2.jdbc.url=\"$hive_jdbc_url\" --conf spark.sql.hive.zookeeper.quorum=\"$hive_zookeeper_quorum\" \n"
echo -e "Once in the Scala REPL, run the following snippet example to test basic conectivity:\n"
echo -e "scala> import com.hortonworks.hwc.HiveWarehouseSession"
echo "scala> import com.hortonworks.hwc.HiveWarehouseSession._"
echo "scala> val hive = HiveWarehouseSession.session(spark).build()"
echo -e "scala> hive.showDatabases().show()\n"
else
echo -e $hive_site_llap" and/or "$beeline_site_llap" doesn't exist on this host, or the current user $(whoami) doesn't have access to the files\n"
echo "Try running this command as the root or hive user"
fi