Estimate size of Spark DataFrame in bytes
# Function to convert Python objects to Java objects
from pyspark.serializers import AutoBatchedSerializer, PickleSerializer

def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.

    Converts each Python object into a Java object via Pyrolite,
    whether or not the RDD is serialized in batches.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

# Convert the DataFrame to an RDD of Java objects
java_obj = _to_java_object_rdd(df.rdd)

# Estimate the in-memory size in bytes
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)
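For context, here is a minimal end-to-end usage sketch. The SparkSession setup, column names, and sample data are illustrative additions, not part of the original gist:

from pyspark.sql import SparkSession

# Local session purely for demonstration
spark = SparkSession.builder.master("local[*]").appName("size-estimate").getOrCreate()
sc = spark.sparkContext

# Small illustrative DataFrame
df = spark.createDataFrame([(i, str(i)) for i in range(1000)], ["id", "value"])

java_obj = _to_java_object_rdd(df.rdd)
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)
print(f"Estimated size: {size_bytes} bytes")

Note that SizeEstimator walks the object graph on the driver's JVM, so this approach is best suited to DataFrames small enough to reason about locally.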
I am getting 43 MB with your code, but my storage stats show this DataFrame at 82 MB. Any suggestions?
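A plausible explanation, not confirmed by the gist author: SizeEstimator measures the deserialized, in-memory footprint of the unpickled Java objects, while the Storage tab reports the size of the cached blocks under whatever storage level was used, so the two numbers routinely differ. As an alternative cross-check on Spark 2.4+, Catalyst exposes its own size estimate of the optimized logical plan; the call below relies on an internal, version-dependent API and may need adjusting for other Spark releases:

# Assumed internal API (Spark 2.4+): ask the Catalyst optimizer for its
# estimated size of the DataFrame's optimized logical plan, in bytes.
catalyst_bytes = df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()
print(f"Catalyst-estimated size: {catalyst_bytes} bytes")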