Last active
July 15, 2023 13:23
Writing a UDF for withColumn in PySpark
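from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

# Reuse the active session (the pyspark shell already provides `spark`)
spark = SparkSession.builder.getOrCreate()
# UDF that labels an age as "adult" or "child"
maturity_udf = udf(lambda age: "adult" if age >= 18 else "child", StringType())
df = spark.createDataFrame([{'name': 'Alice', 'age': 1}])
# withColumn returns a new DataFrame, so keep the result before showing it
df = df.withColumn("maturity", maturity_udf(df.age))
df.show()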
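The lambda above raises an exception if `age` is null, because Python cannot compare `None` with `18`. A minimal sketch of a null-tolerant variant follows. The names `maturity` and `with_maturity` are hypothetical helpers introduced here for illustration, not part of the gist, and the Spark imports sit inside the wrapper only so the plain Python function can be checked on its own.

```python
from typing import Optional


def maturity(age: Optional[int]) -> Optional[str]:
    """Label an age; None (a SQL null) passes through unchanged."""
    if age is None:
        return None
    return "adult" if age >= 18 else "child"


def with_maturity(df, age_col: str = "age", out_col: str = "maturity"):
    """Return a new DataFrame with the label column appended."""
    # Imported here so maturity() can be exercised without a Spark install.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    maturity_udf = udf(maturity, StringType())
    # withColumn does not modify df in place; the caller must keep the result.
    return df.withColumn(out_col, maturity_udf(df[age_col]))
```

Usage would look like `df = with_maturity(df)` followed by `df.show()`.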
I have a question. If my data frame has date columns in the format 'Mmm dd,yyyy', can I use this UDF?
# Change date fields
review_date_udf = fn.udf(
lambda x: datetime.strptime(x, ' %b %d, %Y'), DateType()
)
reviews_df = reviews_df.withColumn("dates", review_date_udf(reviews_df['dates']))
But when I try to view the data frame, it throws this error: Caused by: java.net.SocketTimeoutException: Accept timed out. Any ideas on how to solve this?
Thanks !
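The cause of the timeout is not confirmed in this thread. One thing worth ruling out is the UDF raising inside the Python worker: whitespace in a `strptime` format must match at least one whitespace character in the input, so `' %b %d, %Y'` fails on values without a leading space, and a null value fails too. Below is a hedged sketch of a more tolerant parser. `parse_review_date` and `with_parsed_dates` are hypothetical names, the default column name `dates` is taken from the snippet above, and `%b` assumes English month abbreviations in the worker's locale.

```python
from datetime import date, datetime
from typing import Optional


def parse_review_date(text: Optional[str]) -> Optional[date]:
    """Parse 'Mmm dd,yyyy' or 'Mmm dd, yyyy'; return None if that fails."""
    if text is None:
        return None
    # Normalise spacing so "Jul 15,2023" and " Jul 15, 2023" look the same.
    cleaned = " ".join(text.replace(",", ", ").split())
    try:
        return datetime.strptime(cleaned, "%b %d, %Y").date()
    except ValueError:
        # Unparseable values become nulls instead of failing the whole job.
        return None


def with_parsed_dates(df, column: str = "dates"):
    """Return a new DataFrame whose `column` holds parsed dates."""
    # Imported here so the parser can be tested without a Spark install.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DateType

    review_date_udf = udf(parse_review_date, DateType())
    return df.withColumn(column, review_date_udf(df[column]))
```

Calling `parse_review_date` on a few sample strings in plain Python, before involving Spark, is a quick way to check whether the format is the problem.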
TypeError: a bytes-like object is required, not 'NoneType'
I am getting this error while trying 'mrandrewandrade' input.
How can I resolve this error?
Thanks.
It was nice to come across my teacher's code even after graduation. Thank you!
I encountered this problem too. Have you solved it? Thank you!
Thanks ! :)