@fwbrasil
Last active November 16, 2017 19:32
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.count
// assumes a SparkSession named `spark` is in scope, with `import spark.implicits._`

def topHashtags(tweets: Dataset[Tweet], n: Int): Dataset[(String, BigInt)] =
  tweets
    .select($"text".as[String])   // select the text column (DataFrame)
    .flatMap(_.split("\\s+"))     // split it into words (Dataset)
    .filter(_.startsWith("#"))    // filter hashtag words (Dataset)
    .map(_.toLowerCase)           // normalize hashtags (Dataset)
    .groupBy($"value")            // group by each hashtag (DataFrame)
    .agg(count("*") as "count")   // aggregate the count (DataFrame)
    .orderBy($"count".desc)       // order by count, descending (DataFrame)
    .limit(n)                     // limit to top results (DataFrame)
    .as[(String, BigInt)]         // set the type again (Dataset)
cfeduke commented Nov 16, 2017

Line 9: =~ s/DataFeame/DataFrame/