@saswata-dutta
Created March 26, 2023 06:49
Spark Scala Approx Percentile over group
// Assumes a running SparkSession named `spark` (as in spark-shell, where
// these imports are already in scope).
import org.apache.spark.sql.functions.expr
import spark.implicits._

// Nine "typical" values plus one outlier per group.
val a_s = Seq.fill(9)("a" -> 1) :+ ("a" -> 10)
// a_s: Seq[(String, Int)] = List((a,1), (a,1), (a,1), (a,1), (a,1), (a,1), (a,1), (a,1), (a,1), (a,10))
val b_s = Seq.fill(9)("b" -> 2) :+ ("b" -> 10)
// b_s: Seq[(String, Int)] = List((b,2), (b,2), (b,2), (b,2), (b,2), (b,2), (b,2), (b,2), (b,2), (b,10))
val df = (a_s ++ b_s).toDF("kind", "value")
// df: org.apache.spark.sql.DataFrame = [kind: string, value: int]

// approx_percentile(col, percentage, accuracy): the 90th percentile per group,
// with accuracy 20 (higher accuracy => better precision, more memory).
df.groupBy("kind").agg(expr("approx_percentile(value, 0.90, 20)").as("x_percentile")).show()
// +----+------------+
// |kind|x_percentile|
// +----+------------+
// |   a|           1|
// |   b|           2|
// +----+------------+

// https://spark.apache.org/docs/latest/api/sql/index.html#approx_percentile
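As a sanity check on the result above, here is a plain-Scala sketch (no Spark required) of the nearest-rank percentile: the smallest value such that at least `p * N` of the values are less than or equal to it. The object and method names are made up for illustration. On the same data as group "a", it agrees with Spark's answer: the single outlier `10` is not enough to pull the 90th percentile away from `1`.

```scala
object NearestRankPercentile {
  // Nearest-rank percentile: sort, then take the element at rank ceil(p * N).
  def percentile(xs: Seq[Int], p: Double): Int = {
    require(xs.nonEmpty && p > 0.0 && p <= 1.0)
    val sorted = xs.sorted
    val rank   = math.ceil(p * sorted.length).toInt // 1-based rank
    sorted(rank - 1)
  }

  def main(args: Array[String]): Unit = {
    val a = Seq.fill(9)(1) :+ 10          // same values as group "a" above
    println(percentile(a, 0.90))          // 1: rank ceil(0.9 * 10) = 9 of 10
    println(percentile(a, 0.95))          // 10: rank 10 lands on the outlier
  }
}
```

This illustrates why the gist's output shows `1` and `2` rather than `10`: with ten rows per group, the 90th-percentile rank still falls on the repeated small value.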