Suppose you need to apply the same function to multiple columns in one DataFrame. One straightforward way is:
val newDF = oldDF.withColumn("colA", func("colA")).withColumn("colB", func("colB")).withColumn("colC", func("colC"))
If you want to save some typing, you can try one of these:
- Use select with varargs including *:
import spark.implicits._
df.select($"*" +: Seq("A", "B", "C").map( c => func(c) ): _*)
Here:
- Maps column names to func with Seq("A", ...).map(...)
- Prepends all pre-existing columns with $"*" +: ...
- Unpacks the combined sequence into varargs with ... : _*
and can be generalized as:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col
/**
* @param cols a sequence of columns to transform
* @param df an input DataFrame
* @param f a function to be applied on each col in cols
*/
// col("*") is equivalent to $"*" but does not need spark.implicits._ in scope
def withColumns(cols: Seq[String], df: DataFrame, f: String => Column): DataFrame =
  df.select(col("*") +: cols.map(c => f(c)): _*)
Note: If you want to change the result column name, you can use column.as(...) or column.alias(...), but in general this approach cannot replace the original column (unlike withColumn).
- With withColumn you can use foldLeft:
Seq("A","B","C").foldLeft(df)( (df, c) => df.withColumn( c, func(c) ) )
which can be generalized to:
/**
* @param cols a sequence of columns to transform
* @param df an input DataFrame
* @param f a function to be applied on each col in cols
* @param name a function mapping from input to output name.
*/
def withColumns(cols: Seq[String], df: DataFrame,
f: String => Column, name: String => String = identity) =
cols.foldLeft(df)((df, c) => df.withColumn(name(c), f(c)))
Note that here you can replace the original columns, because withColumn overwrites a column when the given name already exists.
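To make the difference between the two variants concrete, here is a minimal sketch. It assumes a local SparkSession; the sample data, the helper `plusOne` (a stand-in for func) and the `_plus1` suffix are made up for illustration and are not part of the answer above.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ApplyToColumnsSketch {
  // Hypothetical stand-in for `func`: adds 1 to the named column.
  def plusOne(c: String): Column = col(c) + 1

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("apply-to-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data with two numeric columns and one string column.
    val df: DataFrame = Seq((1, 10, "x"), (2, 20, "y")).toDF("A", "B", "C")

    // select variant: the originals are kept and the results are appended,
    // here under aliased names so they can be told apart.
    val appended = df.select(
      col("*") +: Seq("A", "B").map(c => plusOne(c).as(s"${c}_plus1")): _*
    )
    assert(appended.columns.toSeq == Seq("A", "B", "C", "A_plus1", "B_plus1"))

    // foldLeft variant: withColumn with an existing name replaces that column
    // in place, so the set of columns stays the same.
    val replaced = Seq("A", "B").foldLeft(df)((acc, c) => acc.withColumn(c, plusOne(c)))
    assert(replaced.columns.toSeq == Seq("A", "B", "C"))

    spark.stop()
  }
}
```

One thing to keep in mind: each withColumn call adds a projection to the query plan, so for a very large number of columns the single select may be the lighter option.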
One example of func:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, from_unixtime}
def datefmt(c: String): Column = from_unixtime(col(c) / 1000, "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
Another example:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DataType
// Idiomatic Scala: cast every column of sourceType to targetType
def castAllTypedColumnsTo(df: DataFrame, sourceType: DataType, targetType: DataType) = {
df.schema.filter(_.dataType == sourceType).foldLeft(df) {
case (acc, col) => acc.withColumn(col.name, df(col.name).cast(targetType))
}
}
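A possible way to try the casting helper is sketched below. It assumes a local SparkSession and made-up sample data (the column names `id`, `score` and `label` are hypothetical); the helper is repeated so the snippet stands alone.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DataType, DoubleType, IntegerType, StringType}

object CastColumnsSketch {
  // Same idea as above: fold over the fields whose type matches sourceType
  // and replace each one with a cast version of itself.
  def castAllTypedColumnsTo(df: DataFrame, sourceType: DataType, targetType: DataType): DataFrame =
    df.schema.filter(_.dataType == sourceType).foldLeft(df) {
      case (acc, field) => acc.withColumn(field.name, acc(field.name).cast(targetType))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cast-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: one Int, one Double and one String column.
    val df = Seq((1, 2.5, "x")).toDF("id", "score", "label")

    val casted = castAllTypedColumnsTo(df, IntegerType, StringType)

    // Only the integer column changes type; the others are left alone.
    assert(casted.schema("id").dataType == StringType)
    assert(casted.schema("score").dataType == DoubleType)
    assert(casted.schema("label").dataType == StringType)

    spark.stop()
  }
}
```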
Thanks a lot! That's just what I needed.