Created
December 23, 2011 16:54
-
-
Save willf/1514745 to your computer and use it in GitHub Desktop.
Create an inverted index from a file
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/** | |
* From a file that contains | |
* doc_id w1 w2 w3 ... lines, separated by tabs | |
* return an inverted index Map of w -> Set(doc_id) | |
* | |
* @param filename well isn't it obvious | |
* @return Map[String,Set[String]] | |
*/ | |
import scala.collection.immutable.Map | |
def invertedIndex(filename:String) = { | |
io.Source.fromFile(filename).getLines. // this is an iterator over lines | |
map(_.split("\t")). // split at tabs | |
filter(_.size > 0). // make sure there is at least one item | |
map(x => x.drop(1).map(y => (y,x(0)) )). // get inverted pairs for all lines | |
toList. // ? required but i'm not sure why... | |
flatMap(x => x). // flatten to pairs -- you could filter on these | |
groupBy(_._1). // group by the first key | |
map(p => (p._1,p._2.map(_._2).toSet)) // map over groups values, turning 2nd value into sets | |
} |
Author
willf
commented
Dec 23, 2011
via email
Thanks so much -- I'm wondering if there is a way to remove the
'toList.flatten' section, since this is essentially a flatMap call?
And the toList serializes out the iterable, right?
Will
On Fri, Dec 23, 2011 at 12:36 PM, Turing Eret ***@***.*** wrote:
Looks pretty good. Only a few minor things:
1) flatMap(x => x) has it's own method called flatten
2) map(p => (p._1,p._2.map(_._2).toSet)) can be simplified a bit using mapValues and is made a bit more understandable.
So, it would look like this:
def invertedIndex(filename:String) = {
io.Source.fromFile(filename).getLines. // this is an iterator over lines
map(_.split(" ")). // split at tabs
filter(_.size > 0). // make sure there is at least one item
map(x => x.drop(1).map(y => (y,x(0)) )). // get inverted pairs for all lines
toList.
flatten. // flatten to pairs -- you could filter on these
groupBy(_._1). // group by the first key
mapValues(_.map(_._2).toSet) // map over groups values, turning 2nd value into sets
}
Alternatively, you could do it with a for comprehension:
def invertedIndex(filename:String) = {
(for {
line <- io.Source.fromFile(filename).getLines
(doc_id :: ws) = line.split("\t").toList if ws.size > 0
w <- ws
} yield {
w -> doc_id
}).toList.groupBy(_._1).mapValues(_.map(_._2).toSet)
}
I like that method a bit better, but the choice is yours!
---
Reply to this email directly or view it on GitHub:
https://gist.github.com/1514745
##
Will
Well, the issue at hand is that you get an Iterator[String] from io.Source.fromFile(foo).getLines and groupBy isn't defined on Iterators, only on Iterables. As far as the flatten, that is unavoidable, I think. You start with a list of strings and then break each of those lists in to their own list of strings, so you're going to have that nested list no matter what you do, so you'll need to flatten it somehow. The only reason the for-comprehension solution doesn't have one is because it has a flatMap hidden inside of it.
Yes -- thanks.
I think the mapValues is incorrect, since I need to return String -> Set[String] pairs, so:
def invertedIndex(filename:String) = {
io.Source.fromFile(filename).getLines. // this is an iterator over lines
map(_.split("\t")). // split at tabs
filter(_.size > 0). // make sure there is at least one item
map(x => x.drop(1).map(y => (y,x(0)) )). // get inverted pairs for all lines
toList. // convert to list (sigh)
flatten. // flatten to pairs -- you could filter on these
groupBy(_._1). // group by the first key
map(p => (p._1,p._2.map(_._2).toSet)) // map over groups values, turning 2nd value into sets
}
I still don't think I understand why flatten isn't defined on Iterators.
Ah, but it is defined on Streams:
def invertedIndex(fn:String) = {
io.Source.fromFile(fn).getLines. // this is an iterator over lines
map(_.split("\t")). // split at tabs
filter(_.size > 0). // make sure there is at least one item
map(x => x.drop(1).map(y => (y,x(0)) )) // get inverted pairs for all lines
toStream. // convert to Stream
flatten. // flatten to pairs -- you could filter on these
groupBy(_._1). // group by the first key
map(p => (p._1,p._2.map(_._2).toSet)) // map over groups values, turning 2nd value into sets
}
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment