from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

maturity_udf = udf(lambda age: "adult" if age >= 18 else "child", StringType())

df = spark.createDataFrame([{'name': 'Alice', 'age': 1}])
df = df.withColumn("maturity", maturity_udf(df.age))
df.show()
Thanks! :)
I have a question: when I have a data frame with date columns in the format 'Mmm dd, yyyy', can I use this UDF?
# Change date fields
review_date_udf = fn.udf(
lambda x: datetime.strptime(x, ' %b %d, %Y'), DateType()
)
reviews_df = reviews_df.withColumn("dates", review_date_udf(reviews_df['dates']))
But when I try to view the data frame, it throws an error: Caused by: java.net.SocketTimeoutException: Accept timed out. Any ideas on how to solve this issue?
Thanks !
I am getting this error while trying mrandrewandrade's input:
TypeError: a bytes-like object is required, not 'NoneType'
How can I resolve this error?
Thanks.
It was nice to come across my teacher's code even after graduation. Thank you!
I encountered this problem too. Have you solved it? Thank you!
This is awesome, but I wanted to give a couple more examples and some info.
Let's say your UDF is longer; then it might be more readable as a stand-alone def instead of a lambda.
With a small to medium dataset this may take many minutes to run. To debug, you can run df.explain() and inspect the query plan. The badness here might be the pythonUDF step, as it might not be optimized. Instead, you should look to use any of the built-in pyspark.sql.functions, as they are optimized to run faster. In this example, when(condition, result).otherwise(result) is a much better way of doing things; the plan then shows a native CASE WHEN expression instead of a pythonUDF.