Skip to content

Instantly share code, notes, and snippets.

@sany2k8
Forked from ian-whitestone/notes.md
Last active January 5, 2021 14:56
Show Gist options
  • Save sany2k8/f1d06bc30428ae32076050be2bef5ae2 to your computer and use it in GitHub Desktop.
Save sany2k8/f1d06bc30428ae32076050be2bef5ae2 to your computer and use it in GitHub Desktop.
Best practices for presto sql

Presto Specific

  • Don’t SELECT *, Specify explicit column names (columnar store)
  • Avoid large JOINs (filter each table first)
    • In PRESTO tables are joined in the order they are listed!!
    • Join small tables earlier in the plan and leave larger fact tables to the end
    • Avoid cross joins or 1 to many joins as these can degrade performance
  • Order by and group by take time
    • only use order by in subqueries if it is really necessary
  • When using GROUP BY, order the columns by the highest cardinality (that is, most number of unique values) to the lowest.
  • Use approx_distinct() instead of count(distinct) for very large datasets
  • Use approx_percentile(metric, 0.5) for median
  • Avoid UNIONs where possible
  • Use WITH statements vs. nested subqueries
  • Presto broadcasts the right side table in joins, declare larger tables first and filter right side tables to as small as possible
  • LIKE takes time, in particular when you add %s on both sides
    • Use REGEXP_LIKE() if multiple like statements
    • l_comment LIKE '%wake%' OR l_comment LIKE '%regular%' OR l_comment LIKE '%express%' --> regexp_like(l_comment, 'wake|regular|express')
  • SHOW COLUMNS FROM table_name;
  • SELECT * SYSTEM.JDBC.TABLES
    • Can also use COLUMNS, CATALOGS, SCHEMAS
    • Use WHERE COLUMN_NAME/TABLE_NAME Like ‘%client%’
  • SHOW TABLES FROM SYSTEM.JDBC

References

General SQL tips

  • start analysis on a small dataset
  • Use the query plan (EXPLAIN)
  • ALIAS tables in the case of multiple joins
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment