@danish-rehman
Last active July 1, 2016 21:28
Spark: Issues and solutions
  1. JSON passed as a string should have double quotes around keys and string values, as standard JSON requires.
  2. If the inferred schema comes out as corrupt (for example, only a `_corrupt_record` column), Spark could not parse the JSON.
  3. You cannot have more than one SparkContext in a PySpark shell; the shell already creates one as `sc`.
  4. Specifying a schema up front speeds up a Spark job, because Spark no longer spends time inferring the schema from the data.
  5. If the data has many keys you don't care about, declare only the keys you need in the schema.
  6. Too many keys in the JSON data can trigger an OOM error on the Spark driver during schema inference.
  7. When declaring a schema, do not use IntegerType (32-bit) for fields that can hold very large integers; use LongType instead.
  8. When loading a local file, use the file:// scheme in the path.
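
A few of the points above can be sketched in Python. This is a minimal illustration, not the author's code: the helper names, the field names `id` and `name`, and the `spark` argument (assumed to be an existing SparkSession) are all hypothetical. The JSON check uses Python's `json` module, which may be stricter than Spark's own reader.

```python
import json
from pathlib import Path

# Value range of Spark's IntegerType (32-bit signed).
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def is_valid_json(line):
    """Return True if `line` parses as strict JSON (point 1)."""
    try:
        json.loads(line)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def fits_integer_type(value):
    """Return True if `value` fits IntegerType; if not, use LongType (point 7)."""
    return INT32_MIN <= value <= INT32_MAX


def local_file_uri(path):
    """Build a file:// URI from a local path (point 8)."""
    return Path(path).resolve().as_uri()


def read_events(spark, path):
    """Read JSON with an explicit schema listing only the needed keys (points 4 to 6).

    `spark` is assumed to be an existing SparkSession; the field names are made up.
    pyspark is imported here so the helpers above work without it installed.
    """
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    schema = StructType([
        StructField("id", LongType(), True),      # LongType: ids may exceed 32 bits
        StructField("name", StringType(), True),
    ])
    # With a schema supplied, Spark skips inference and ignores undeclared keys.
    return spark.read.schema(schema).json(local_file_uri(path))
```

For example, `is_valid_json("{'a': 1}")` is False because the key is single-quoted, and `fits_integer_type(2**31)` is False, which signals that the field should be declared as LongType.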