Last active
August 29, 2015 14:23
/**
 * Get the Stack Exchange data from https://archive.org/details/stackexchange
 * Data set used here: math.stackexchange.com
 */
// Open the file. The text file is an RDD (Resilient Distributed Dataset)
// of Strings, one String per line of the file.
// `sc` is assumed to be the SparkContext that spark-shell predefines.
val postXML = sc.textFile("Posts.xml")
// Count the lines. Note: run it twice and see the difference ;)
val numberOfLines = postXML.count()
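// Sketch (an addition, not part of the original gist): one way to make a repeated
// count() cheaper. An RDD is recomputed from the file on every action unless it is
// cached; cache() marks it so that later actions can reuse the in-memory partitions.
// Assumes the executors have enough memory to hold the file.
postXML.cache()
val numberOfLinesFirstPass = postXML.count()   // reads the file and fills the cache
val numberOfLinesSecondPass = postXML.count()  // can be served from the cached partitions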
// flatMap() splits each line into words and flattens the results into a
// single collection of strings. Each word is then mapped to a (word, 1)
// tuple, and reduceByKey() adds up the values for each word.
val wordsAndTheirCounts = postXML.flatMap(line => line.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _)
// Take the top 5 (word, count) tuples
wordsAndTheirCounts.top(5)
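// Sketch (an addition, not part of the original gist): top(5) above uses the default
// tuple ordering, which compares the word first, so it returns the five
// lexicographically largest words. If the five most frequent words are wanted
// instead, one option is to pass an explicit ordering on the count:
val fiveMostFrequentWords =
  wordsAndTheirCounts.top(5)(Ordering.by[(String, Int), Int](_._2))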
// Count the lines that contain the string "geometry"
val numberOfLineWithgeometry = postXML.filter(line => line.contains("geometry")).count()
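// Sketch (an addition, not part of the original gist): what the (postID, tag)
// extraction that follows is expected to pull out of a single line. sampleRow is a
// made-up line, not taken from the dump; it assumes that question rows carry Id and
// Tags attributes and that the tag list is entity-escaped (&lt;tag&gt;).
val sampleRow =
  """  <row Id="42" PostTypeId="1" Score="3" Tags="&lt;geometry&gt;&lt;algebra&gt;" />"""
val sampleIdTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
val sampleTagRegex = "&lt;([^&]+)&gt;".r
// findFirstMatchIn returns an Option; toList turns it into zero or one matches
val samplePairs = sampleIdTagRegex.findFirstMatchIn(sampleRow).toList.flatMap { m =>
  sampleTagRegex.findAllMatchIn(m.group(2)).map(t => (m.group(1).toInt, t.group(1))).toList
}
// samplePairs is List((42,geometry), (42,algebra))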
// Extract (postID, tag) pairs from the question rows
val postIDTags = postXML.flatMap { line =>
  // Matches Id="..." ... Tags="..." in line
  val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  // Finds each tag in the Tags value captured above. The value is
  // entity-escaped in the XML, so a tag <TAG> appears as &lt;TAG&gt;
  val tagRegex = "&lt;([^&]+)&gt;".r
  // Yields 0 or 1 matches:
  idTagRegex.findFirstMatchIn(line) match {
    // No match -- not a <row> line with tags
    case None => Nil
    // Match, and can extract ID and tags from m
    case Some(m) => {
      val postID = m.group(1).toInt
      val tagsString = m.group(2)
      // Pick out just TAG matching group
      val tags = tagRegex.findAllMatchIn(tagsString).map(_.group(1)).toList
      // Keep only questions with at least 4 tags, and map to (post, tag) tuples
      if (tags.size >= 4) tags.map((postID, _)) else Nil
    }
  }
  // Because of flatMap, individual lists will concatenate
  // into one collection of tuples
}