Created
December 6, 2018 22:45
[iterate over df rdd] How to iterate over the rows of a DataFrame in Scala #dataframe #scala #rdd #iterate
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

def process(df: DataFrame): DataFrame = {
  // Build an explicit Row encoder from the schema, so Catalyst knows the data types of the mapped rows up front
  val encoder = RowEncoder(df.schema)
  df.map { row =>
    // Row has no toArray method; copy the values out through toSeq
    val rowOut: Array[Any] = row.toSeq.toArray
    // ... modify rowOut here; the values must still match df.schema
    Row.fromSeq(rowOut.toSeq)
  }(encoder) // reported as much faster than plain RDD iteration, because it avoids the reflection overhead
}
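
A minimal usage sketch of the same pattern, assuming a local SparkSession and a Spark version in which `RowEncoder(schema)` is available, as in the snippet above. The object name `ProcessExample`, the helper `upperCaseStrings`, and the sample data are hypothetical and only illustrate one possible row modification (upper-casing string columns while keeping the schema unchanged).

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

object ProcessExample {

  // Hypothetical transformation: upper-case every String value, leave all other values untouched.
  // The output schema equals the input schema, so the encoder is built from df.schema.
  def upperCaseStrings(df: DataFrame): DataFrame = {
    val encoder = RowEncoder(df.schema)
    df.map { row =>
      val rowOut = row.toSeq.map {
        case s: String => s.toUpperCase
        case other     => other
      }
      Row.fromSeq(rowOut)
    }(encoder)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a single-threaded local session is enough for this illustration
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("process-example")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
    val out = upperCaseStrings(df).collect().toList.map(_.toSeq)

    // The string column is upper-cased, the integer column is unchanged
    assert(out == List(Seq(1, "A"), Seq(2, "B")))

    spark.stop()
  }
}
```

If a modification changes the number or types of the columns, the encoder would need to be built from the new schema rather than from `df.schema`.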