gokulsan · May 26, 2019 18:39
diff --git a/Impala_Quick_Reference.txt b/Impala_Quick_Reference.txt
 Impala and Big Data Ecosystem

 When it comes to SQL-on-Hadoop, a handful of frameworks are available. Hive and Impala are the most widely used for building a data warehouse on Hadoop.

 Impala Table Partitioning

 Partitioning physically divides an Impala table's data by the distinct values of one or more columns, typically columns that queries filter on frequently. A query that filters on a partition column can skip most of the table's data, which reduces I/O and speeds up overall performance.
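
 The pruning idea can be sketched in a few lines of Python. This is a toy model, not Impala code: the rows, the column names (year, carrier, delay), and both helper functions are made up for illustration. Rows are grouped by the partition column, and a query for one year reads only that group.

```python
from collections import defaultdict

# Hypothetical rows: (year, carrier, delay_minutes). Illustrative data only.
ROWS = [
    (2017, "AA", 12),
    (2017, "DL", 3),
    (2018, "AA", 45),
    (2018, "UA", 0),
    (2019, "DL", 7),
    (2019, "UA", 30),
]


def partition_by_year(rows):
    """Group rows by year, mimicking one storage location per partition value."""
    partitions = defaultdict(list)
    for year, carrier, delay in rows:
        partitions[year].append((carrier, delay))
    return dict(partitions)


def total_delay(partitions, year):
    """Sum delays for one year, reading only the matching partition.

    Returns (total, rows_read) so the amount of skipped data is visible.
    """
    rows = partitions.get(year, [])
    return sum(delay for _, delay in rows), len(rows)


partitions = partition_by_year(ROWS)
total, rows_read = total_delay(partitions, 2018)
print(total, rows_read, len(ROWS))
```

 Here the query for 2018 touches 2 of the 6 rows. On a real table the saving depends on how evenly the data is spread across partition values.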

 Impala Schema Design

 Joins are an important aspect of SQL queries. Avoid correlated subqueries and inline views. Instead, materialize intermediate results in temporary tables and use inner joins wherever possible. Generate statistics at both the table and the column level (COMPUTE STATS) so the planner can choose a good join order.
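
 As a small sketch of the statistics step, the helper below prints the Impala statements that gather and then inspect table-level and column-level stats. The helper name and the analytics.flights table are hypothetical; COMPUTE STATS, SHOW TABLE STATS, and SHOW COLUMN STATS are standard Impala statements. It only builds strings and does not connect to a cluster.

```python
def stats_statements(table, database=None):
    """Return the Impala statements that compute and display stats for a table."""
    name = f"{database}.{table}" if database else table
    return [
        f"COMPUTE STATS {name};",      # gathers table and column stats in one pass
        f"SHOW TABLE STATS {name};",   # row counts, file counts, sizes
        f"SHOW COLUMN STATS {name};",  # distinct values, nulls, sizes per column
    ]


# Hypothetical table name, for illustration only.
for statement in stats_statements("flights", database="analytics"):
    print(statement)
```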

 Impala File Format Selection

 For large volumes of data, the Parquet file format typically performs best because it combines a columnar storage layout, large I/O request sizes, and efficient compression and encoding.
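
 A toy Python illustration of why a columnar layout helps. It is not Parquet's actual on-disk format, and the records and function names are invented for the example. Pivoting rows into columns lets an aggregate touch only the column it needs, and a column with repeated values collapses well under a simple run-length encoding.

```python
from itertools import groupby

# Hypothetical row-oriented records. Illustrative data only.
rows = [
    {"carrier": "AA", "origin": "JFK", "delay": 12},
    {"carrier": "AA", "origin": "JFK", "delay": 3},
    {"carrier": "AA", "origin": "LAX", "delay": 45},
    {"carrier": "DL", "origin": "LAX", "delay": 0},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    return {key: [record[key] for record in records] for key in records[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# Averaging one column reads only that column's values, not whole rows.
average_delay = sum(columns["delay"]) / len(columns["delay"])

# Repeated values in a column compress into a few (value, count) pairs.
encoded_carrier = run_length_encode(columns["carrier"])

print(average_delay)
print(encoded_carrier)
```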

 Impala Metadata Load Techniques

 Impala caches metadata for speed. CatalogD loads this metadata from persistent stores such as the Hive Metastore, the NameNode, and Sentry. The metadata is then compressed and sent to the Statestore, which broadcasts it to the dedicated coordinators. A system this complex is prone to bottlenecks, so it is important to monitor the key relationships among Impala's components.

 Impala Metadata Load Antipatterns

 Computing incremental stats on wide partitioned tables (tables with a large number of columns)
 Large number of partitions/files/blocks[2]
 Constant, frequent REFRESH of large tables
 Indiscriminate use of INVALIDATE METADATA commands
 High number of concurrent DDL operations[3]
 Catalog or Statestore service restarts
 High number of coordinator nodes (more than 10% of the nodes on a cluster of 150 or more nodes)
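
 The coordinator rule of thumb in the last item can be turned into a simple check. This is a sketch under the thresholds stated above (more than 10% of nodes, clusters of 150 nodes or more). The function name and parameters are hypothetical, and because the list gives no guidance for smaller clusters, the function does not flag them.

```python
def too_many_coordinators(total_nodes, coordinators,
                          min_cluster_size=150, max_percent=10):
    """Return True when the coordinator share exceeds the rule of thumb.

    The rule is only stated for clusters of at least `min_cluster_size`
    nodes, so smaller clusters are never flagged here.
    """
    if total_nodes < min_cluster_size:
        return False
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return coordinators * 100 > total_nodes * max_percent


print(too_many_coordinators(200, 25))  # 25 of 200 nodes is 12.5%
print(too_many_coordinators(200, 20))  # exactly 10% is not over the limit
print(too_many_coordinators(100, 50))  # below 150 nodes, rule not applied
```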

 Impala Case Study
 https://www.dezyre.com/hadoop-tutorial/impala-case-study-flight-data-analysis

 Competitor Landscape
 Presto -
 https://www.tutorialspoint.com/apache_presto/apache_presto_overview.htm
 https://en.wikipedia.org/wiki/Presto_(SQL_query_engine)
 Google Spanner - https://en.wikipedia.org/wiki/Spanner_(database)
 Google Dremel - https://en.wikipedia.org/wiki/Dremel_(software)
 Apache Drill - https://en.wikipedia.org/wiki/Apache_Drill
 Apache Hive - https://en.wikipedia.org/wiki/Apache_Hive


 Impala Assessment Metrics

 https://www.marketscreener.com/news/Cloudera-Assessment-of-Apache-Impala-Performance-using-Cloudera-Manager-Metrics-ndash-Part-1-of--27746017/

 Impala FAQ
 https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_faq.html

 Impala Scalability
 https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_scalability.html

 Impala Performance Benchmarking
 https://impala.apache.org/docs/build/html/topics/impala_perf_cookbook.html
 https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_perf_benchmarking.html
 https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_perf_testing.html

 Impala Presentations
 https://cwiki.apache.org/confluence/display/IMPALA/Impala+Presentations%2C+Papers+and+Blog+Posts

 Impala Resource Management
 https://cwiki.apache.org/confluence/display/IMPALA/Resource+Management+Best+Practices+in+Impala

 Impala Best Practices
 https://impala.apache.org/docs/build/html/topics/impala_perf_cookbook.html
 http://dwgeek.com/cloudera-impala-performance-tuning-best-practices.html/
 http://hadooptutorial.info/impala-best-practices/