- JSON passed as a string must have double quotes around keys and string values.
- If the schema comes out corrupt (for example, with only a `_corrupt_record` column), Spark's JSON reader could not parse the input.
- You cannot have more than one SparkContext in a pyspark shell; the shell already creates one as `sc`.
- By specifying a schema, you can speed up your Spark job by cutting down the time Spark uses to infer the schema.
- If you have a lot of keys that you don't care about, you can filter for only the keys you need.
- Too many keys in your JSON data can trigger an OOM error on your Spark Driver when you infer the schema.
- When declaring a schema, do not use IntegerType for very large integers; it is a 32-bit signed type, so use LongType instead.
- Whenever you import a local file, use the file:// protocol.
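
A minimal sketch of how several of the notes above might look in practice. The field names `id` and `name`, the helper names, and the example path are hypothetical, not taken from any real job. The first two helpers use only the standard library, so the quoting and integer-range notes can be checked without Spark; `read_events` assumes `spark` is a SparkSession (or an SQLContext, which also exposes `.read`).

```python
import json

# 32-bit signed range covered by Spark's IntegerType; larger values need LongType.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def fits_integer_type(value):
    """Return True if value fits in a 32-bit signed integer."""
    return INT32_MIN <= value <= INT32_MAX


def keep_keys(line, wanted):
    """Parse one JSON line and keep only the wanted keys."""
    # json.loads raises ValueError when keys or strings use single quotes.
    record = json.loads(line)
    return {k: record[k] for k in wanted if k in record}


def read_events(spark, path):
    """Read JSON with an explicit schema so Spark skips schema inference."""
    # Imported here so the helpers above run without pyspark installed.
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    schema = StructType([
        StructField("id", LongType(), True),      # hypothetical field; may exceed 32 bits
        StructField("name", StringType(), True),  # hypothetical field
    ])
    # Keys that are not listed in the schema are left out of the DataFrame.
    return spark.read.schema(schema).json(path)
```

From a pyspark shell this could be called as `read_events(spark, "file:///tmp/events.json")`, where the path is only an illustration of the file:// form.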
Last active
July 1, 2016 21:28
Spark : Issues and solutions