Last active
May 26, 2019 18:39
Apache Impala as the Open Source SQL Engine
Impala and Big Data Ecosystem
When it comes to SQL-on-Hadoop, a handful of frameworks are available. Hive and Impala are the most widely used for building a data warehouse on Hadoop.
Impala Table Partitioning
Partitioning physically divides a table's data by the distinct values of one or more frequently queried columns. Queries that filter on those columns can skip a large percentage of the table's data, which reduces I/O and speeds up overall performance.
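The pruning effect can be sketched with a toy model in Python. This is only an illustration under assumed names (`partition_rows`, `scan`, and the sample flight rows are hypothetical), not how Impala stores or scans data: rows are grouped by the partition column, and a query that filters on that column touches only the matching group.

```python
from collections import defaultdict

def partition_rows(rows, key):
    """Group rows by the value of the partition column (one bucket per value)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)

def scan(partitions, wanted):
    """Read only the partitions whose key is in `wanted`.

    Returns the matching rows and the number of rows actually read.
    """
    matched = []
    rows_read = 0
    for value, rows in partitions.items():
        if value not in wanted:
            continue  # partition pruned: its rows are never read
        rows_read += len(rows)
        matched.extend(rows)
    return matched, rows_read

# Hypothetical sample data, partitioned by year.
rows = [
    {"year": 2017, "carrier": "AA", "delay": 12},
    {"year": 2017, "carrier": "DL", "delay": 3},
    {"year": 2018, "carrier": "AA", "delay": 7},
    {"year": 2019, "carrier": "UA", "delay": 25},
]
partitions = partition_rows(rows, "year")
matched, rows_read = scan(partitions, {2018})
```

Filtering on `year = 2018` reads one row instead of four, because the other two partitions are skipped without being opened.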
Impala Schema Design
Joins are an important aspect of SQL queries. Avoid correlated subqueries and inline views. Instead, create temporary tables and use inner joins wherever possible. Generate statistics at both the table and column level.
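The rewrite from a correlated subquery to a temporary table plus an inner join can be sketched as follows. SQLite (via Python's standard `sqlite3` module) stands in for Impala here purely so the example runs anywhere; the `flights` and `carrier_avg` tables and their data are hypothetical, and Impala's own syntax for staging tables may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flights (carrier TEXT, delay INTEGER);
    INSERT INTO flights VALUES
        ('AA', 10), ('AA', 30), ('DL', 5), ('DL', 15), ('UA', 40);
""")

# Correlated form: the inner query refers to the outer row's carrier.
correlated = conn.execute("""
    SELECT f.carrier, f.delay
    FROM flights f
    WHERE f.delay > (SELECT AVG(g.delay)
                     FROM flights g
                     WHERE g.carrier = f.carrier)
    ORDER BY f.carrier, f.delay
""").fetchall()

# Rewritten form: compute the per-carrier average once into a temporary
# table, then inner join against it.
conn.execute("""
    CREATE TEMP TABLE carrier_avg AS
    SELECT carrier, AVG(delay) AS avg_delay
    FROM flights
    GROUP BY carrier
""")
joined = conn.execute("""
    SELECT f.carrier, f.delay
    FROM flights f
    INNER JOIN carrier_avg a ON a.carrier = f.carrier
    WHERE f.delay > a.avg_delay
    ORDER BY f.carrier, f.delay
""").fetchall()

conn.close()
```

Both queries return the flights whose delay exceeds their carrier's average; the second form makes the aggregation an explicit, separately inspectable step.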
Impala File Format Selection
For large data volumes, the Parquet file format typically performs best because it combines a columnar storage layout, large I/O request sizes, and compression and encoding.
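To make "columnar layout" and "encoding" concrete, here is a toy sketch in Python. It does not read or write Parquet; the helper names and sample rows are hypothetical, and the run-length encoder is just one simple example of the kind of encoding a columnar format can apply to repetitive column values.

```python
def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def run_length_encode(values):
    """Collapse consecutive repeated values into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

# Hypothetical sample rows.
rows = [
    {"carrier": "AA", "year": 2018, "delay": 12},
    {"carrier": "AA", "year": 2018, "delay": 3},
    {"carrier": "AA", "year": 2018, "delay": 7},
    {"carrier": "DL", "year": 2018, "delay": 25},
]
columns = to_columns(rows)

# An aggregate over one column only needs that column's values.
total_delay = sum(columns["delay"])

# Values stored together by column often repeat, so they encode compactly.
carrier_runs = run_length_encode(columns["carrier"])
```

Summing `delay` touches a single list rather than every field of every row, and the four `carrier` values shrink to two runs.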
Impala Metadata Load Techniques
Impala caches metadata for speed. To fill the cache, CatalogD loads metadata from persistent stores such as the Hive Metastore, the NameNode, and Sentry. The metadata is then compressed and sent to the Statestore, which broadcasts it to the dedicated coordinators. A system this complex is prone to bottlenecks, so it is important to monitor the key relationships among Impala's components.
Impala Metadata Load Antipatterns
Computing incremental stats on wide partitioned tables (tables with a large number of columns)
Large number of partitions/files/blocks[2]
Frequent REFRESH of large tables
Indiscriminate use of INVALIDATE METADATA commands
High number of concurrent DDL operations[3]
Catalog or Statestore service restarts
High number of coordinator nodes (more than 10% of nodes on a cluster of 150 or more nodes)
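Two of the antipatterns above lend themselves to a small check. The Python sketch below is an assumption-laden illustration, not part of any Impala tooling: `too_many_coordinators` encodes the 10% / 150-node rule of thumb from the list, and `refresh_statement` builds a partition-scoped `REFRESH` statement so that only the changed partition is reloaded instead of the whole table. Both function names and the sample table are hypothetical.

```python
def too_many_coordinators(total_nodes, coordinators):
    """Flag the antipattern: more than 10% coordinators on a cluster of 150+ nodes.

    Integer arithmetic avoids floating-point rounding at the 10% boundary.
    """
    return total_nodes >= 150 and coordinators * 10 > total_nodes

def refresh_statement(table, partition=None):
    """Build the narrowest REFRESH statement for a change.

    `partition` is an optional dict of partition column -> value; when it is
    given, the statement is scoped to that single partition.
    """
    if partition:
        spec = ", ".join(f"{col}={value!r}" for col, value in partition.items())
        return f"REFRESH {table} PARTITION ({spec})"
    return f"REFRESH {table}"

# Hypothetical usage.
flagged = too_many_coordinators(total_nodes=200, coordinators=25)
stmt = refresh_statement("flights", {"year": 2019, "month": 5})
```

A wrapper like this is only a guard rail; whether a partition-level refresh is sufficient depends on what actually changed in the underlying data.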
Impala Case Study
https://www.dezyre.com/hadoop-tutorial/impala-case-study-flight-data-analysis
Competitor Landscape
Apache Presto -
https://www.tutorialspoint.com/apache_presto/apache_presto_overview.htm
https://en.wikipedia.org/wiki/Presto_(SQL_query_engine)
Google Spanner - https://en.wikipedia.org/wiki/Spanner_(database)
Google Dremel - https://en.wikipedia.org/wiki/Dremel_(software)
Apache Drill - https://en.wikipedia.org/wiki/Apache_Drill
Apache Hive - https://en.wikipedia.org/wiki/Apache_Hive
Impala Assessment Metrics
https://www.marketscreener.com/news/Cloudera-Assessment-of-Apache-Impala-Performance-using-Cloudera-Manager-Metrics-ndash-Part-1-of--27746017/
Impala FAQ
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_faq.html
Impala Scalability
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_scalability.html
Impala Performance Benchmarking
https://impala.apache.org/docs/build/html/topics/impala_perf_cookbook.html
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_perf_benchmarking.html
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_perf_testing.html
Impala Presentations
https://cwiki.apache.org/confluence/display/IMPALA/Impala+Presentations%2C+Papers+and+Blog+Posts
Impala Resource Management
https://cwiki.apache.org/confluence/display/IMPALA/Resource+Management+Best+Practices+in+Impala
Impala Best Practices
https://impala.apache.org/docs/build/html/topics/impala_perf_cookbook.html
http://dwgeek.com/cloudera-impala-performance-tuning-best-practices.html/
http://hadooptutorial.info/impala-best-practices/