@j-thepac
Last active March 19, 2023 06:46
from pyspark.sql.functions import lead, lag, max, rank, dense_rank
from pyspark.sql.window import Window
"""
Aggregate: min, max, avg, count, and sum.
Ranking: rank, dense_rank, percent_rank, row_number, and ntile.
Analytical: cume_dist, lag, and lead.
Custom boundary: rangeBetween and rowsBetween.
"""
1. Order only: lead(c), lag(c)
Note: the rows are ordered first, then lead/lag looks ahead/behind within that order.
df.withColumn("c", lead(df["id"], 1).over(Window.orderBy(df.id)))
` select lead(c, 1) over (order by c2) from T `
2. Partition only: max(c), min(c)
df.withColumn("c", max(df["id"]).over(Window.partitionBy(df.id)))
` select max(sal) over (partition by age) from T `
3. Both partition and order: first(), last(), rank(), dense_rank(), udf
df.withColumn("c", rank().over(Window.partitionBy(df.c).orderBy(df.id)))
` SELECT name, country, sal, rank() over (partition by country order by sal asc) FROM EMP; `
# Reusable window specs as variables
winSpec = Window.partitionBy("depName").orderBy(empsalary["salary"].desc())
empsalary.withColumn("rank", rank().over(winSpec))
winSpec = Window.partitionBy("depName").orderBy("salary")
empsalary.withColumn("lag", lag("salary", 2).over(winSpec))