@YordanGeorgiev
Created December 6, 2018 22:45
[iterate over df rdd] how-to iterate over df rdd #dataframe #scala #rdd #iterate
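import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

def process(df: DataFrame): DataFrame = {
  // Explicit encoder: gives Catalyst the output schema up front, so no reflection is needed to infer it
  val encoder = RowEncoder(df.schema)
  df.map { row =>
    val rowIn = row.toSeq       // Row exposes toSeq (it has no toArray)
    val rowOut = rowIn.toArray  // mutable copy, so edits do not touch rowIn
    // ... modify rowOut here; the result must still match df.schema
    Row.fromSeq(rowOut.toSeq)
  }(encoder) // stays in the Dataset API and avoids the df.rdd round trip and its reflection overhead
}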
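
A minimal usage sketch of the same pattern, assuming a local SparkSession and a Spark version where `RowEncoder(df.schema)` is available, as in the snippet above. The object name `ProcessExample`, the function `upperCaseStrings`, and the sample columns `id` and `name` are hypothetical and only serve as illustration; the transformation (upper-casing every String cell) stands in for the "rowOut modifications" placeholder.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

object ProcessExample {
  // Hypothetical transformation: upper-case every String cell, keep other cells unchanged.
  def upperCaseStrings(df: DataFrame): DataFrame = {
    // The output schema equals the input schema, so the input schema is reused for the encoder.
    val encoder = RowEncoder(df.schema)
    df.map { row =>
      val rowOut = row.toSeq.map {
        case s: String => s.toUpperCase
        case other     => other
      }
      Row.fromSeq(rowOut)
    }(encoder)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("process-example")
      .getOrCreate()
    import spark.implicits._

    // Small in-memory DataFrame with hypothetical columns.
    val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
    upperCaseStrings(df).show()

    spark.stop()
  }
}
```

The encoder is passed explicitly because Spark provides no implicit `Encoder[Row]`; without it, `df.map` returning a `Row` does not compile. If the transformation changes the column types or count, the encoder would have to be built from the new schema instead of `df.schema`.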