Skip to content

Instantly share code, notes, and snippets.

@saswata-dutta
Created April 17, 2021 06:36
Show Gist options
  • Save saswata-dutta/9ff658510ceb791370feea731b8810b2 to your computer and use it in GitHub Desktop.
Save saswata-dutta/9ff658510ceb791370feea731b8810b2 to your computer and use it in GitHub Desktop.
averages across multiple dimensions
/*
Inner_Collapsingkey - Outer_GroupingKey = Implicit_ObservationKey
collapsingKey == primaryKey(default)
what is the average daily amount of items sold for each employee ?
Inner_Collapsingkey = (day, employee) => what we want to define as an observation for this analysis
Outer_GroupingKey = employee => key defining the required groups to report
Implicit_ObservationKey = day => defines count of rows for the denom in avg
*/
import spark.implicits._
val sales = Seq(
("Monday", "Mary", 5),
("Monday", "Bob", 4),
("Tuesday", "Bob", 8),
("Thursday", "Jane", 10),
("Thursday", "Jane", 6)
).toDF("day", "employee", "items_sold")
sales.createOrReplaceTempView("sales")
spark.sql("select * from sales").show()
/*
+--------+--------+---------+
| day|employee|ItemsSold|
+--------+--------+---------+
| Monday| Mary| 5|
| Monday| Bob| 4|
| Tuesday| Bob| 8|
|Thursday| Jane| 10|
|Thursday| Jane| 6|
+--------+--------+---------+
*/
spark.sql("""
select employee, avg(total_items_sold)
from (
select day, employee, sum(items_sold) as total_items_sold
from sales
group by day, employee
) group by employee
""").show()
/*
+--------+---------------------+
|employee|avg(total_items_sold)|
+--------+---------------------+
| Mary| 5.0|
| Bob| 6.0|
| Jane| 16.0|
+--------+---------------------+
*/
sales.groupBy("day", "employee").agg(sum("items_sold").as("total_items_sold")).show()
/*
+--------+--------+----------------+
| day|employee|total_items_sold|
+--------+--------+----------------+
| Tuesday| Bob| 8|
| Monday| Mary| 5|
| Monday| Bob| 4|
|Thursday| Jane| 16|
+--------+--------+----------------+
*/
sales.groupBy("day", "employee").agg(sum("items_sold").as("total_items_sold")).groupBy("employee").avg("total_items_sold").show()
/*
+--------+---------------------+
|employee|avg(total_items_sold)|
+--------+---------------------+
| Mary| 5.0|
| Bob| 6.0|
| Jane| 16.0|
+--------+---------------------+
*/
sales.groupBy("day", "employee").sum("items_sold").groupBy("employee").avg("sum(items_sold)").show()
/*
+--------+--------------------+
|employee|avg(sum(items_sold))|
+--------+--------------------+
| Mary| 5.0|
| Bob| 6.0|
| Jane| 16.0|
+--------+--------------------+
*/
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment