Created
December 23, 2011 16:54
Create an inverted index from a file
/**
 * From a file that contains
 * doc_id w1 w2 w3 ... lines, separated by tabs
 * return an inverted index Map of w -> Set(doc_id)
 *
 * @param filename well isn't it obvious
 * @return Map[String,Set[String]]
 */
import scala.collection.immutable.Map
def invertedIndex(filename:String) = {
  io.Source.fromFile(filename).getLines. // this is an iterator over lines
  map(_.split("\t")). // split at tabs
  filter(_.size > 0). // make sure there is at least one item
  map(x => x.drop(1).map(y => (y,x(0)) )). // get inverted pairs for all lines
  toList. // ? required but i'm not sure why...
  flatMap(x => x). // flatten to pairs -- you could filter on these
  groupBy(_._1). // group by the first key
  map(p => (p._1,p._2.map(_._2).toSet)) // map over groups values, turning 2nd value into sets
}
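To make the input format concrete, here is a minimal sketch that runs the same steps over in-memory lines instead of a file. The helper name `invertedIndexFromLines` and the sample documents are made up for illustration; they are not part of the gist.

```scala
// Hypothetical helper: the same pipeline as invertedIndex, but fed from memory.
def invertedIndexFromLines(lines: Iterator[String]): Map[String, Set[String]] =
  lines
    .map(_.split("\t"))                          // split at tabs
    .filter(_.nonEmpty)                          // drop lines with no fields
    .map(x => x.drop(1).map(y => (y, x(0))))     // (word, doc_id) pairs per line
    .toList                                      // materialize the iterator
    .flatten                                     // one flat list of pairs
    .groupBy(_._1)                               // group pairs by word
    .map { case (word, pairs) => (word, pairs.map(_._2).toSet) }

// Made-up sample input: a doc_id followed by its words, separated by tabs.
val sample = Iterator("d1\tcat\tdog", "d2\tdog\tfish")
val index = invertedIndexFromLines(sample)
```

With this sample, `index("dog")` holds both `d1` and `d2`, while `index("cat")` holds only `d1`.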
Yes -- thanks.
I think the mapValues is incorrect, since I need to return String -> Set[String] pairs, so:
def invertedIndex(filename:String) = {
  io.Source.fromFile(filename).getLines. // this is an iterator over lines
  map(_.split("\t")). // split at tabs
  filter(_.size > 0). // make sure there is at least one item
  map(x => x.drop(1).map(y => (y,x(0)) )). // get inverted pairs for all lines
  toList. // convert to list (sigh)
  flatten. // flatten to pairs -- you could filter on these
  groupBy(_._1). // group by the first key
  map(p => (p._1,p._2.map(_._2).toSet)) // map over groups values, turning 2nd value into sets
}
I still don't think I understand why flatten isn't defined on Iterators.
Ah, but it is defined on Streams:
def invertedIndex(fn:String) = {
  io.Source.fromFile(fn).getLines. // this is an iterator over lines
  map(_.split("\t")). // split at tabs
  filter(_.size > 0). // make sure there is at least one item
  map(x => x.drop(1).map(y => (y,x(0)) )). // get inverted pairs for all lines
  toStream. // convert to Stream
  flatten. // flatten to pairs -- you could filter on these
  groupBy(_._1). // group by the first key
  map(p => (p._1,p._2.map(_._2).toSet)) // map over groups values, turning 2nd value into sets
}
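Another way around the missing flatten, sketched here rather than taken from the thread: Iterator does define flatMap, so the map-then-flatten pair can be fused into one step before anything is materialized. The name `invertedIndexFlatMap` is hypothetical, and closing the source is an addition the versions above do not make. A strict collection is still needed before groupBy, hence the toList.

```scala
import scala.io.Source

// Hypothetical variant: fuse map + flatten into a single flatMap on the Iterator.
def invertedIndexFlatMap(filename: String): Map[String, Set[String]] = {
  val source = Source.fromFile(filename)
  try {
    source.getLines()
      .map(_.split("\t"))                              // split at tabs
      .filter(_.nonEmpty)                              // at least one item
      .flatMap(x => x.drop(1).map(y => (y, x(0))))     // inverted pairs, already flat
      .toList                                          // read everything before closing
      .groupBy(_._1)                                   // group by word
      .map { case (word, pairs) => (word, pairs.map(_._2).toSet) }
  } finally source.close()
}
```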
Well, the issue at hand is that you get an Iterator[String] from io.Source.fromFile(foo).getLines and groupBy isn't defined on Iterators, only on Iterables. As for the flatten, that is unavoidable, I think. You start with a list of strings and then break each of those strings into its own list of strings, so you end up with a nested list no matter what you do, and you'll need to flatten it somehow. The only reason the for-comprehension solution doesn't have one is that it has a flatMap hidden inside of it.
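The for-comprehension solution itself is not shown in this thread, so the following is only a guess at its shape, working on in-memory lines. `pairsFor` and `pairsDesugared` are hypothetical names, and the second function is an approximation of the compiler's rewrite, not its exact output. It is there to show where the hidden flatMap comes from: with two generators, the outer one turns into a flatMap call.

```scala
// Hypothetical for-comprehension version: two generators, no explicit flatten.
def pairsFor(lines: List[String]): List[(String, String)] =
  for {
    line <- lines
    fields = line.split("\t")
    if fields.nonEmpty
    word <- fields.drop(1).toList
  } yield (word, fields(0))

// Roughly what that amounts to: the outer generator becomes a flatMap,
// and the inner generator becomes a map.
def pairsDesugared(lines: List[String]): List[(String, String)] =
  lines
    .map(_.split("\t"))
    .filter(_.nonEmpty)
    .flatMap(fields => fields.drop(1).toList.map(word => (word, fields(0))))
```

Either list of pairs can then go through the same groupBy and toSet steps as in the versions above.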