PySpark or Pydoop - JSON (read and write) locally or on HDFS
from pyspark.sql import SparkSession
import databricks.koalas as ks
import json
import pydoop.hdfs as hdfs

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

my_data = {"name": ["John", "Mary", "Kevin"],
           "area": ["London", "Munich", "Berlin"],
           "age": [33, 56, 44]}
############################################
############################################
# Working with json files on a local drive #
############################################
############################################

# Write dict to json file
with open("db.json", "w") as f:
    json.dump(my_data, f, indent=4)

# Read json file
with open("db.json", "r") as f:
    my_data = json.load(f)
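# Quick sanity check: a dict of lists of strings/ints should survive the
# JSON round trip unchanged
with open("db.json", "r") as f:
    assert json.load(f) == my_data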
###################################
###################################
# Working with json files on HDFS #
###################################
###################################

file_path = '/dapsen/path/to/my/file.json'
my_path = '/dapsen/path/to/my/'  # assumed output directory for the Koalas/Spark examples below; adjust to your own HDFS path
######################
# Write json to hdfs #
######################

# Using Pydoop - wt means write as plain text
with hdfs.open(file_path, 'wt') as file:
    json.dump(my_data, file)
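# Hedged alternative: Pydoop also provides one-shot helpers. Assuming your
# Pydoop version exposes hdfs.dump(), you can serialise first and write in
# a single call, e.g.:
# hdfs.dump(json.dumps(my_data), file_path, mode="wt")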
# Using PySpark/Koalas
ks.DataFrame(my_data).to_json(my_path + "export_test", num_files=1)
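# Note: unlike pandas, Koalas' to_json() writes a *directory* of part files
# in JSON Lines format (even with num_files=1), not a single file. To see
# what was written, you can list the output, e.g.:
# hdfs.ls(my_path + "export_test")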
#######################
# Read json from hdfs #
#######################

# Using Pydoop - read the json file back from hdfs to check it worked properly
with hdfs.open(file_path, 'r') as file:
    file_imported_from_hdfs = json.load(file)
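# Confirm the round trip preserved the data
assert file_imported_from_hdfs == my_data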
# Using Koalas
koalas_df = ks.read_json(my_path + "export_test")

# Using PySpark read.json()
df = spark.read.json(my_path + "export_test")

# Using PySpark read.format().load()
df = spark.read.format('org.apache.spark.sql.json').load(my_path + "export_test")

# Using PySpark read.option().json()
df = spark.read.option("multiline", "true").json(my_path + "export_test")
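# Note: by default spark.read.json() expects JSON Lines (one object per
# line), which is what Koalas' to_json() produces; the "multiline" option
# is only needed when a single JSON record spans multiple lines, as with
# pretty-printed files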
# Then view the data...
koalas_df.head(5)
df.printSchema()
df.show()
Regarding the need to explicitly specify wt rather than just w: I think it's a matter of matching the write mode (wt for text, wb for binary) to the object you're writing. The json module serialises to a str rather than a bytes object, so you need a text-mode handle, i.e. wt rather than wb. Your approach of using wt is more sensible than the approach I previously suggested.

My understanding is that with Python's built-in open(), the default write mode (w) is text (wt), but other file-like APIs such as Pydoop's hdfs.open() may default to binary (wb), so in this case we have to explicitly state that we want the text (wt) write mode.
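A minimal sketch of the str/bytes distinction, using the local built-in open() so it's self-contained:

    import json

    s = json.dumps({"a": 1})  # json serialises to str, not bytes

    with open("demo.json", "wt") as f:  # text mode: accepts str
        f.write(s)

    with open("demo.json", "wb") as f:  # binary mode: needs bytes, so encode first
        f.write(s.encode("utf-8"))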
Some things I recall reading to come to these conclusions:
https://stackoverflow.com/questions/23051062/open-files-in-r...
https://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-i...
https://stackoverflow.com/questions/53402930/use-pyarrow-and...