@Mageswaran1989
Last active August 29, 2015 14:23
/**
* Get the stackexchange data from https://archive.org/details/stackexchange
* Data set used here : math.stackexchange.com
**/
//Open the file. The text file is an RDD (Resilient Distributed Dataset)
//of Strings, which are the lines of the file.
val postXML = sc.textFile("Posts.xml")
//Count the lines. Note: run this twice and compare how long each run takes ;)
val numberOfLines = postXML.count()
//flatMap() splits each line into words and flattens the results into a
//single RDD of strings. map() then turns each word into a (word, 1) tuple,
//and reduceByKey() adds up the 1s for each distinct word.
val wordsAndTheirCounts = postXML.flatMap(line => line.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _)
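//Take the 5 most frequent words. A bare top(5) would use the default tuple
//ordering, which compares the word first, so order by the count instead.
wordsAndTheirCounts.top(5)(Ordering.by[(String, Int), Int](_._2))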
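// A minimal sketch (not part of the original gist) of the same word-count
// pipeline on a plain Scala collection, handy for checking the logic in the
// REPL without Spark. The sample lines and the names sampleLines, localCounts
// and localTop are made up for illustration; groupBy plus a sum stands in for
// reduceByKey here.
val sampleLines = Seq("a b a", "b a c")
val localCounts = sampleLines.
  flatMap(line => line.split(" ")).
  map(word => (word, 1)).
  groupBy(_._1).
  map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// Sort by count, highest first, and keep the first two entries
val localTop = localCounts.toSeq.sortBy(-_._2).take(2)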
//Count the lines that mention "geometry"
val numberOfLineWithgeometry = postXML.filter(line => line.contains("geometry")).count()
//Build (postID, tag) pairs for every post that has at least 4 tags
val postIDTags = postXML.flatMap { line =>
  // Matches Id="..." ... Tags="..." in line
  val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  // Finds each tag, stored XML-escaped as &lt;TAG&gt;, in the Tags value from above
  val tagRegex = "&lt;([^&]+)&gt;".r
  // Yields 0 or 1 matches:
  idTagRegex.findFirstMatchIn(line) match {
    // No match: the line is not a post row with Id and Tags attributes
    case None => Nil
    // Match, and can extract ID and tags from m
    case Some(m) => {
      val postID = m.group(1).toInt
      val tagsString = m.group(2)
      // Pick out just TAG matching group
      val tags = tagRegex.findAllMatchIn(tagsString).map(_.group(1)).toList
      // Keep only questions with at least 4 tags, and map to (post, tag) tuples
      if (tags.size >= 4) tags.map((postID, _)) else Nil
    }
  }
  // Because of flatMap, the individual lists are concatenated
  // into one collection of tuples
}
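// A minimal sketch (not part of the original gist) that runs the two regular
// expressions above on one hand-written line, outside Spark. The sample row is
// made up for illustration and only mimics the attribute layout assumed for
// Posts.xml; sampleRow, sampleMatch and sampleTags are hypothetical names.
val sampleRow = "<row Id=\"42\" PostTypeId=\"1\" Tags=\"&lt;geometry&gt;&lt;algebra&gt;\" />"
val sampleMatch = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r.findFirstMatchIn(sampleRow)
val sampleTags = sampleMatch.toList.flatMap { m =>
  "&lt;([^&]+)&gt;".r.findAllMatchIn(m.group(2)).map(t => (m.group(1).toInt, t.group(1))).toList
}
// sampleTags pairs the post ID with each tag. A row like this one, with only
// two tags, would be dropped by the "at least 4 tags" filter in postIDTags.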