Created
December 23, 2011 16:54
Create an inverted index from a file
/**
 * From a file that contains
 * doc_id w1 w2 w3 ... lines, separated by tabs
 * return an inverted index Map of w -> Set(doc_id)
 *
 * @param filename well isn't it obvious
 * @return Map[String,Set[String]]
 */
import scala.collection.immutable.Map
def invertedIndex(filename:String) = {
  io.Source.fromFile(filename).getLines. // this is an iterator over lines
  map(_.split("\t")). // split at tabs
  filter(_.size > 0). // make sure there is at least one item
  map(x => x.drop(1).map(y => (y,x(0)) )). // get inverted pairs for all lines
  toList. // ? required but i'm not sure why...
  flatMap(x => x). // flatten to pairs -- you could filter on these
  groupBy(_._1). // group by the first key
  map(p => (p._1,p._2.map(_._2).toSet)) // map over groups values, turning 2nd value into sets
}
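For trying the pipeline without a file on disk, here is a minimal sketch of the same steps applied to any Iterator of lines. The name `invertedIndexFromLines` and the sample data are hypothetical, not part of the gist. As far as I can tell, the `toList` is needed because `groupBy` is not defined on `Iterator`, so the pairs have to be materialized into a strict collection first.

```scala
// Hypothetical variant of invertedIndex: takes the lines directly,
// so it can be exercised with in-memory data instead of a file.
def invertedIndexFromLines(lines: Iterator[String]): Map[String, Set[String]] =
  lines
    .map(_.split("\t"))                        // split each line at tabs
    .filter(_.nonEmpty)                        // skip lines with no fields
    .flatMap { fields =>                       // (word, doc_id) pairs per line
      fields.drop(1).toList.map(word => (word, fields(0)))
    }
    .toList                                    // materialize: Iterator has no groupBy
    .groupBy(_._1)                             // group the pairs by word
    .map { case (word, pairs) => (word, pairs.map(_._2).toSet) }

// Assumed sample input: two documents, tab-separated as described above.
val sampleIndex = invertedIndexFromLines(Iterator("d1\ta\tb", "d2\tb\tc"))
// sampleIndex("b") == Set("d1", "d2")
```

The final step uses `map` rather than `mapValues` so that the result is an ordinary `Map[String, Set[String]]`.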
Ah, but it is defined on Streams:
def invertedIndex(fn:String) = {
  io.Source.fromFile(fn).getLines. // this is an iterator over lines
  map(_.split("\t")). // split at tabs
  filter(_.size > 0). // make sure there is at least one item
  map(x => x.drop(1).map(y => (y,x(0)) )). // get inverted pairs for all lines
  toStream. // convert to Stream
  flatten. // flatten to pairs -- you could filter on these
  groupBy(_._1). // group by the first key
  map(p => (p._1,p._2.map(_._2).toSet)) // map over groups values, turning 2nd value into sets
}
Yes -- thanks.
I think the mapValues is incorrect, since I need to return String -> Set[String] pairs, so:
I still don't think I understand why flatten isn't defined on Iterators.
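On the flatten question, one workaround that should not depend on the Scala version is `flatMap`, which is defined on `Iterator` and does the map and the flatten in one lazy step. A small sketch, using made-up sample data and converting each inner array explicitly so that no implicit conversion is involved:

```scala
// Hypothetical sample data: an Iterator whose elements are themselves Arrays.
val nested: Iterator[Array[String]] = Iterator(Array("a", "b"), Array("c"))

// flatMap flattens lazily, one inner array at a time.
val flat: Iterator[String] = nested.flatMap(_.toList)

// Forcing the iterator yields the flattened elements in order.
val flattened: List[String] = flat.toList
```

This keeps the pipeline lazy up to the point where `groupBy` needs a strict collection.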