PySpark or Pydoop - JSON (read and write) locally or on HDFS
from pyspark.sql import SparkSession
import databricks.koalas as ks
import json
import pydoop.hdfs as hdfs
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
my_data = {"name": ["John", "Mary", "Kevin"],
           "area": ["London", "Munich", "Berlin"],
           "age": [33, 56, 44]}
############################################
############################################
# Working with json files on a local drive #
############################################
############################################
# Write dict to json file
with open("db.json", "w") as f:
json.dump(my_data, f, indent=4)
# Read json file
with open("db.json", "r") as f:
my_data = json.load(f)
###################################
###################################
# Working with json files on HDFS #
###################################
###################################
file_path = '/dapsen/path/to/my/file.json'
my_path = '/dapsen/path/to/my/'  # directory prefix used by the Spark/Koalas examples below
######################
# Write json to hdfs #
######################
# Using Pydoop - 'wt' means write as plain text
with hdfs.open(file_path, 'wt') as file:
    json.dump(my_data, file)
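# Alternative sketch (an assumption, not from the original gist): with binary
# mode 'wb' you would need to encode the JSON string yourself, since
# json.dumps returns a str:
# with hdfs.open(file_path, 'wb') as file:
#     file.write(json.dumps(my_data).encode('utf-8'))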
# Using PySpark/Koalas
ks.DataFrame(my_data).to_json(my_path + "export_test", num_files=1)
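# Note: Koalas writes the output as a directory of part files at the given
# path; num_files=1 coalesces the result into a single part file.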
#######################
# Read json from hdfs #
#######################
# Using Pydoop - read the json file back from hdfs to check it worked properly
with hdfs.open(file_path, 'r') as file:
    file_imported_from_hdfs = json.load(file)
# Using Koalas
koalas_df = ks.read_json(my_path + "export_test")
# Using PySpark read.json()
df = spark.read.json(my_path + "export_test")
# Using PySpark read.format.load()
df = spark.read.format('org.apache.spark.sql.json').load(my_path + "export_test")
# Using PySpark read.option().json()
df = spark.read.option("multiline", "true").json(my_path + "export_test")
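# Note: spark.read.json expects line-delimited JSON by default; the
# "multiline" option tells Spark to instead parse files that each contain a
# single (possibly pretty-printed) JSON document.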
# Then view the data...
koalas_df.head(5)
df.printSchema()
df.show()
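# Optionally pull a small sample back to the driver as pandas (a sketch;
# fine for small data, but note it collects rows locally):
# df.limit(5).toPandas()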
mh0w commented Oct 18, 2023

Regarding the need to explicitly specify wt rather than just specifying w:

I think it's a matter of matching the write mode (e.g., wt or wb) to the object you're writing (e.g., the output of the json module, which is a string rather than a bytes object).

Since json.dump produces a string, you need to write text, i.e. use wt rather than wb. Your approach of using wt is more sensible than the one I previously suggested.

My understanding is that Python's built-in open() defaults the write (w) mode to text (wt), but file-like APIs such as pydoop's hdfs.open() may default to binary (wb), so in this case we have to explicitly state that we want the text (wt) write mode.
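A minimal local sketch of the distinction (standard library only; db.json is just a throwaway file name):

import json

data = {"key": "value"}

# json.dumps returns a str, so a text-mode handle accepts it directly
with open("db.json", "wt") as f:
    f.write(json.dumps(data))

# a binary-mode handle needs bytes, so the string must be encoded first
with open("db.json", "wb") as f:
    f.write(json.dumps(data).encode("utf-8"))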

Some things I recall reading to come to these conclusions:

https://stackoverflow.com/questions/23051062/open-files-in-r...

https://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-i...

https://stackoverflow.com/questions/53402930/use-pyarrow-and...
