@belenaj
Last active August 16, 2020 08:38

Exercises

  1. Select all "Harry Potter" books
  2. Book with the most pages
  3. Top 5 authors with the most books written (assume the author is the first element of the array, "key" field; each row is a distinct book)
  4. Top 5 genres with the most books
  5. Average number of pages (needs cleaning)
  6. Per publish year, get the number of authors who published at least one book

Solutions

Pre-cleaning

// Normalise the raw rows: take the first author's key without the
// "/authors/" prefix, and pull a four-digit publish year out of the
// free-text publish_date.
val sqlStmt = """
    select
      title
      , regexp_replace(authors[0].key, '/authors/', '') as author
      , cast(regexp_extract(publish_date, '\\d{4}', 0) as int) as publish_year
      , number_of_pages
      , genres
    from clean_table
    where title is not null
"""
val transformedDF = spark.sql(sqlStmt)
transformedDF.createOrReplaceTempView("transformed_table")
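The year extraction above relies on pulling the first four-digit run out of a free-text `publish_date`. A pure-Python sketch of that same regex logic (the function name and sample dates are illustrative, not from the dataset):

```python
import re

def extract_publish_year(publish_date):
    """Return the first four-digit run in a free-text date string, or None."""
    if publish_date is None:
        return None
    m = re.search(r"\d{4}", publish_date)
    return int(m.group(0)) if m else None

print(extract_publish_year("June 2005"))   # 2005
print(extract_publish_year("2001-07-16"))  # 2001
print(extract_publish_year("n.d."))        # None
```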
  1. Select all "Harry Potter" books
select *
from transformed_table
where title like '%Harry Potter%'
limit 100
  2. Book with the most pages
select title, number_of_pages
from transformed_table 
where number_of_pages = (
    select max(number_of_pages)
    from transformed_table
)
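The subquery above keeps every book whose page count equals the global maximum, so ties are preserved (unlike `order by ... limit 1`). The same max-with-ties logic in plain Python, on made-up sample rows:

```python
# Invented sample data: (title, number_of_pages)
books = [
    ("Book A", 320),
    ("Book B", 870),
    ("Book C", 870),
]

# First pass: find the maximum page count (the subquery's job).
max_pages = max(pages for _, pages in books)

# Second pass: keep every row at that maximum, so ties survive.
longest = [(title, pages) for title, pages in books if pages == max_pages]
print(longest)
```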
  3. Top 5 authors with the most books written (assume the author is the first element of the array, "key" field; each row is a distinct book)
select 
     author, count(*) as num_written_books
from transformed_table
where author is not null
group by author
order by count(*) desc
limit 5

--https://openlibrary.org/authors/OL1224818A/California._Dept._of_Water_Resources.
--https://openlibrary.org/authors/OL4283462A/Jir%C5%8D_Akagawa
  4. Top 5 genres with the most books
transformedDF.select($"title",explode($"genres").alias("genre")).createOrReplaceTempView("exploded_genres")
select genre, count(*) as num_books
from exploded_genres
group by genre
order by count(*) desc
limit 5
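`explode` turns each element of the `genres` array into its own row before counting. A pure-Python equivalent of explode-then-count (the sample rows are invented):

```python
from collections import Counter

# Invented sample data: (title, genres array)
rows = [
    ("Book A", ["Fantasy", "Adventure"]),
    ("Book B", ["Fantasy"]),
    ("Book C", ["Mystery", "Fantasy"]),
]

# "explode": one (title, genre) pair per array element
exploded = [(title, genre) for title, genres in rows for genre in genres]

# group by genre / order by count desc / limit 5
counts = Counter(genre for _, genre in exploded)
print(counts.most_common(5))
```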
  5. Average number of pages (needs cleaning)
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.types.DoubleType

val df = transformedDF.select(transformedDF("number_of_pages").cast(DoubleType).as("number_of_pages"))

val imputer = new Imputer()
  .setInputCols(df.columns)
  .setOutputCols(df.columns.map(c => s"${c}_imputed"))
  .setStrategy("median")

val fixed = imputer.fit(df).transform(df)
fixed.show(20, false)

fixed.select(avg($"number_of_pages_imputed")).show()
fixed.select(avg($"number_of_pages")).show()
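`Imputer` with the `median` strategy replaces missing values with the column median computed from the non-missing values (Spark actually uses an approximate median; `statistics.median` stands in for it here). A minimal sketch of that idea in plain Python, with `None` playing the role of the missing entries:

```python
from statistics import median

# Invented page counts; None marks a missing value.
pages = [100.0, None, 250.0, 300.0, None]

# Fit: compute the median over observed values only.
observed = [p for p in pages if p is not None]
fill = median(observed)

# Transform: fill each missing slot with the median.
imputed = [fill if p is None else p for p in pages]
print(imputed)  # [100.0, 250.0, 250.0, 300.0, 250.0]
```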
  6. Per publish year, get the number of authors who published at least one book
select 
    publish_year, count(distinct author) as num_authors
from transformed_table
where author is not null and publish_year is not null  
group by publish_year
order by publish_year asc
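`count(distinct author)` per `publish_year` can be sketched with a dict of sets: the set deduplicates authors within a year, and its size is the distinct count. The rows below are invented:

```python
from collections import defaultdict

# Invented sample data: (publish_year, author key); note the duplicate.
rows = [
    (1997, "OL1A"), (1997, "OL1A"), (1997, "OL2A"),
    (1998, "OL2A"),
]

# group by publish_year, collecting a set of authors per year
authors_by_year = defaultdict(set)
for year, author in rows:
    authors_by_year[year].add(author)

# order by publish_year asc; set size = count(distinct author)
for year in sorted(authors_by_year):
    print(year, len(authors_by_year[year]))
```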