Created July 18, 2018 19:29
Estimate size of Spark DataFrame in bytes
from pyspark.serializers import AutoBatchedSerializer, PickleSerializer


# Function to convert Python objects to Java objects
def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.

    Each Python object is converted into a Java object via Pyrolite,
    whether or not the RDD is serialized in batches.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)


# Convert the DataFrame to an RDD of Java objects
java_obj = _to_java_object_rdd(df.rdd)

# Estimate the in-memory size in bytes
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)
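For context, here is a minimal end-to-end usage sketch. It assumes a local SparkSession and the _to_java_object_rdd helper defined above; the sample DataFrame, its column names, and the "size-estimate" app name are illustrative, not part of the original gist.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("size-estimate").getOrCreate()
sc = spark.sparkContext

# Illustrative sample DataFrame (1,000 rows, two columns)
df = spark.createDataFrame(
    [(i, "value_%d" % i) for i in range(1000)],
    ["id", "value"],
)

# Unpickle the rows into Java objects, then ask SizeEstimator for their in-memory size
java_obj = _to_java_object_rdd(df.rdd)
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)
print("Estimated DataFrame size: %d bytes" % size_bytes)

Note that SizeEstimator measures the deserialized objects on the JVM heap, which can differ substantially from the serialized or on-disk size of the same data.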
I am getting 43 MB with your code, but my storage stats show that this DataFrame is 82 MB. Any suggestions?
Hi Justin, thanks for this info. I would like to ask a question: when I use this function locally, I get a DataFrame size of 3 MB for a 150-row dataset, but when I run the same thing on Databricks I get 30 MB. Any thoughts?