danish-rehman/spark_debug.md

Last active July 1, 2016 21:28


Spark: Issues and solutions

A JSON string must have double quotes around keys and string values.
If the inferred schema comes out as corrupt (a _corrupt_record column), Spark could not parse the JSON.
You cannot have more than one SparkContext in a pyspark shell.
By specifying a schema, you can speed up your Spark job by cutting down the time Spark uses to infer the schema.
If you have a lot of keys that you don't care about, you can filter for only the keys you need.
Too many keys in your JSON data can trigger an OOM error on your Spark Driver when you infer the schema.
When declaring a schema, do not use IntegerType for very large integers; values outside the 32-bit signed range need LongType.
When loading a local file, use the file:// protocol.
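
A few of the points above (double-quoted JSON, keeping only the keys you need, integers too large for a 32-bit type) can be checked on a sample line before the data goes to Spark. The following is only a rough sketch using Python's standard `json` module, which is not Spark's parser and may be stricter than it. The helper name `check_json_line` and the sample keys are hypothetical, made up for illustration.

```python
import json

# Range of a 32-bit signed integer, which is what Spark's IntegerType holds.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def check_json_line(line, wanted_keys):
    """Hypothetical helper: return (record, problems) for one JSON line.

    record is the parsed object cut down to wanted_keys, or None if the
    line is not a parsable JSON object.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        # Single-quoted or unquoted keys end up here.
        return None, ["unparsable: " + exc.msg]
    if not isinstance(record, dict):
        return None, ["not a JSON object"]

    # Keep only the keys we care about.
    trimmed = {k: record[k] for k in wanted_keys if k in record}

    # Flag integers that would not fit in a 32-bit column (bool is excluded
    # because Python treats it as a subclass of int).
    problems = [
        f"{k} needs LongType"
        for k, v in trimmed.items()
        if isinstance(v, int)
        and not isinstance(v, bool)
        and not INT32_MIN <= v <= INT32_MAX
    ]
    return trimmed, problems


# Sample lines (made-up data).
good, issues = check_json_line(
    '{"id": 3000000000, "name": "a", "extra": 1}', ["id", "name"]
)
bad, bad_issues = check_json_line("{'id': 1}", ["id"])
```

Here `good` keeps only `id` and `name`, `issues` flags `id` as needing LongType, and the single-quoted line comes back as `None` with an "unparsable" message.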
