@CanoeFZH
Forked from zoltanctoth/pyspark-udf.py
Created April 11, 2018 09:29
Writing a UDF for withColumn in PySpark
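from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
# Label each row by age. A null age reaches the lambda as None, so return None
# (a null in the new column) instead of letting the comparison raise TypeError.
maturity_udf = udf(
    lambda age: None if age is None else ("adult" if age >= 18 else "child"),
    StringType(),
)
df = sqlContext.createDataFrame([{'name': 'Alice', 'age': 1}])
df.withColumn("maturity", maturity_udf(df.age)).show()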
@CanoeFZH
Author

def return_age_bracket(age):
    # A null age arrives as None; comparing None to an int raises TypeError on Python 3.
    if age is None:
        return 'N/A'
    if age <= 12:
        return 'Under 12'
    elif 13 <= age <= 19:
        return 'Between 13 and 19'
    elif 19 < age < 65:
        return 'Between 19 and 65'
    elif age >= 65:
        return 'Over 65'
    else:
        return 'N/A'

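from pyspark.sql.functions import udf

# With no returnType argument, udf() defaults to StringType.
maturity_udf = udf(return_age_bracket)
df = sqlContext.createDataFrame([{'name': 'Alice', 'age': 1}])
# withColumn returns a new DataFrame; show() prints it.
df.withColumn("maturity", maturity_udf(df.age)).show()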
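
The bracket logic in `return_age_bracket` can also be written as a lookup table, which may be easier to extend and can be checked in plain Python before it is wrapped in a UDF. This is only a sketch, assuming integer ages; `AGE_BOUNDS`, `AGE_LABELS`, and `age_bracket_lookup` are hypothetical names that do not appear in the original, and the `None` guard is an added assumption about how null ages should be labelled.

```python
from bisect import bisect_left

# Hypothetical names, not part of the original gist.
# Inclusive upper bound of every bracket except the last one.
AGE_BOUNDS = [12, 19, 64]
# Labels copied from return_age_bracket, in the same order as the brackets.
AGE_LABELS = ['Under 12', 'Between 13 and 19', 'Between 19 and 65', 'Over 65']


def age_bracket_lookup(age):
    """Return the bracket label for an integer age, or 'N/A' for a null age."""
    # Assumption: a null age (None) is labelled 'N/A' rather than raising.
    if age is None:
        return 'N/A'
    # bisect_left returns the index of the first bound that is >= age,
    # which is the index of the bracket the age falls into.
    return AGE_LABELS[bisect_left(AGE_BOUNDS, age)]
```

Because this is an ordinary Python function, it could presumably be passed to `udf(...)` in the same way as `return_age_bracket`.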
