Create a Spark application, written in Scala or PySpark, that reads the provided signals dataset, processes the data, and stores the entire output as specified below.
For each entity_id in the signals dataset, find the item_id with the oldest month_id and the item_id with the newest month_id. In some cases these may be the same item. If two different items have the same month_id, take the item with the lower item_id. Finally, sum the signal counts for each entity and output the sum as total_signals. The correct output should contain one row per unique entity_id.
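The selection rule above can be checked against a small in-memory reference before writing the Spark job. The following is a minimal plain-Python sketch, not a Spark solution. It assumes that month_id sorts chronologically (for example an integer such as 201801) and that the per-row count column is named signal_count; the `Signal` type and the `summarize` helper are hypothetical names introduced only for illustration.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Signal(NamedTuple):
    entity_id: int
    item_id: int
    month_id: int      # assumption: larger value means a more recent month
    signal_count: int  # assumption: name of the per-row count column


def summarize(signals: Iterable[Signal]) -> Dict[int, Tuple[int, int, int]]:
    """Map entity_id -> (oldest item_id, newest item_id, total_signals)."""
    by_entity: Dict[int, List[Signal]] = {}
    for s in signals:
        by_entity.setdefault(s.entity_id, []).append(s)

    result: Dict[int, Tuple[int, int, int]] = {}
    for entity_id, rows in by_entity.items():
        # Oldest month first; ties on month_id go to the lower item_id.
        oldest = min(rows, key=lambda r: (r.month_id, r.item_id))
        # Newest month first (negated month_id); ties still go to the lower item_id.
        newest = min(rows, key=lambda r: (-r.month_id, r.item_id))
        total = sum(r.signal_count for r in rows)
        result[entity_id] = (oldest.item_id, newest.item_id, total)
    return result
```

A reference like this can serve as an oracle for unit tests: run the Spark job and the helper on the same handful of rows and compare the results.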
- Create a Scala SBT project or a PySpark project. If you know Scala, please use it, as Scala submissions are given higher preference.
- Use the Spark Scala/PySpark API with DataFrames/Datasets.
- Please do not use Spark SQL with a SQL string!