gbraccialli/spark_performance_hints.md

Last active November 10, 2021 01:30

Programming / Best Practices:
http://www.slideshare.net/JenAman/rearchitecting-spark-for-performance-understandability-63065166
http://www.slideshare.net/MaksudIbrahimov/spark-performance-tuning-maksud-ibrahimov
http://www.slideshare.net/SparkSummit/spark-summit-eu-talk-by-qifan-pu
http://www.slideshare.net/julesdamji/jump-start-with-apache-spark-20-on-databricks-70214386
https://robertovitillo.com/2015/06/30/spark-best-practices/
https://github.com/beeva/beeva-best-practices/blob/master/big_data/spark/README.md
http://spark.apache.org/docs/latest/tuning.html
https://www.gitbook.com/book/databricks/databricks-spark-knowledge-base/details
https://www.linkedin.com/pulse/9-tips-best-practices-apache-spark-kumar-chinnakali
http://www.lewisgavin.co.uk/Spark-Performance/
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
http://fdahms.com/2015/10/04/writing-efficient-spark-jobs/
http://data-informed.com/6-steps-to-get-top-performance-from-the-changes-in-spark-2-0/
https://martin.atlassian.net/wiki/pages/viewpage.action?pageId=67043332
https://bzhangusc.wordpress.com/2014/06/19/optimize-map-performamce-with-mappartitions/
https://medium.com/teads-engineering/spark-performance-tuning-from-the-trenches-7cbde521cf60

https://databricks.com/blog/2016/10/18/7-tips-to-debug-apache-spark-code-faster-with-databricks.html
https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
https://www.slideshare.net/SparkSummit/top-5-mistakes-when-writing-spark-applications-by-mark-grover-and-ted-malaska

https://community.hortonworks.com/content/kbentry/42803/spark-on-yarn-executor-resource-allocation-optimiz.html
https://community.hortonworks.com/content/idea/29810/spark-configuration-best-practices.html
https://community.hortonworks.com/articles/75022/optimizing-your-first-spark-program.html

http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
https://www.cloudera.com/documentation/enterprise/latest/topics/admin_spark_tuning.html
https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_running_spark_on_yarn.html
http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/
http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/

Shuffle:
http://blog.hydronitrogen.com/2016/05/07/apache-spark-shuffles-explained-in-depth/
http://www.slideshare.net/colorant/spark-shuffle-introduction
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
https://trongkhoanguyenblog.wordpress.com/2015/04/04/understand-the-shuffle-component-in-spark-core/
http://www.ijert.org/view-pdf/12333/shuffle-performance-in-apache-spark
http://stackoverflow.com/questions/35479876/why-does-spark-save-map-phase-output-to-local-disk
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-blockmanager.html
http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/

Joins:
http://blog.hydronitrogen.com/2016/05/13/shuffle-free-joins-in-spark-sql/
http://blog.hydronitrogen.com/2016/02/22/in-the-code-spark-sql-query-planning-and-execution/
https://deepsense.io/optimize-spark-with-distribute-by-and-cluster-by/
https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html
http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf

S3:
https://medium.com/@subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98#.v32m90ifk
https://hortonworks.github.io/hdp-aws/s3-spark/#improving-performance-for-spark-jobs
http://arnon.me/2015/08/spark-parquet-s3/
http://hortonworks.com/blog/history-apache-hadoops-support-amazon-s3/
https://docs.databricks.com/spark/latest/faq/append-slow-with-spark-2.0.0.html

Profiling / Benchmarks:
https://databricks.com/blog/2016/08/31/apache-spark-scale-a-60-tb-production-use-case.html
https://code.facebook.com/posts/1671373793181703/apache-spark-scale-a-60-tb-production-use-case/

https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
http://www.slideshare.net/SparkSummit/making-sense-of-spark-performancekay-ousterhout
http://kayousterhout.github.io/trace-analysis/
http://db-blog.web.cern.ch/blog/luca-canali/2016-09-spark-20-performance-improvements-investigated-flame-graphs
https://gist.github.com/kayousterhout/7008a8ebf2babeedc7ce6f8723fd1bf4
http://www.brendangregg.com/flamegraphs.html

OS:
https://community.hortonworks.com/articles/34621/apache-spark-performance-improvement-on-numa-capab.html
https://gist.github.com/leosouzadias/e37cd189794bb78de502ac25cb605576
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/main-cpu.html#s-cpu-numa-topology
https://community.hortonworks.com/articles/69541/capturing-hardware-statistics-for-spark-hadoop-nif.html
http://www.tecmint.com/glances-an-advanced-real-time-system-monitoring-tool-for-linux/

Graph:
http://blog.cloudera.com/blog/2016/10/how-to-do-scalable-graph-analytics-with-apache-spark/

Streaming:
https://www.mapr.com/blog/performance-tuning-apache-kafkaspark-streaming-system

PySpark:
https://www.slideshare.net/SparkSummit/getting-the-best-performance-with-pyspark
