@Cowa
Last active October 20, 2015 18:05
N-gram models in Scala - Step by step
// Tokenized text
val words = List("bonjour", "je", "suis", "ici", ".", "bonjour", "je", "suis", "là")
// 2-grams
val n = 2
val ngram = words.sliding(n).toList
// ngram: List[List[String]] = List(List(bonjour, je), List(je, suis), List(suis, ici), List(ici, .), List(., bonjour), List(bonjour, je), List(je, suis), List(suis, là))
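// Use map rather than mapValues: mapValues returns a lazy view (and is deprecated since Scala 2.13)
val ngramWithCount = ngram.groupBy(identity).map { case (gram, occurrences) => (gram, occurrences.size) }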
// ngramWithCount: Map[List[String],Int] = Map(List(bonjour, je) -> 2, List(je, suis) -> 2, List(suis, ici) -> 1, List(suis, là) -> 1, List(ici, .) -> 1, List(., bonjour) -> 1)
val ngramWithProbability = ngramWithCount.map { case (k, v) =>
  (k, v.toDouble / ngramWithCount.filterKeys(_.take(n - 1) == k.take(n - 1)).values.sum)
}
// ngramWithProbability: Map(
// List(bonjour, je) -> 1.0,
// List(je, suis) -> 1.0,
// List(suis, ici) -> 0.5,
// List(suis, là) -> 0.5,
// List(ici, .) -> 1.0,
// List(., bonjour) -> 1.0
// )
// IMPORTANT: ngramWithCount.filterKeys(_.take(n - 1) == k.take(n - 1)).values.sum
// is very costly: it rescans the whole map for every n-gram, so it takes forever on medium-sized or larger documents.
// To fix it, build an index of the sums once, keyed by the first n - 1 words:
val sumIndex = ngramWithCount.groupBy { case (k, _) => k.take(n - 1) }.map { case (prefix, group) => (prefix, group.values.sum) }
// And voilà!
val ngramWithProbabilityFaster = ngramWithCount.map { case (k, v) => (k, v.toDouble / sumIndex(k.take(n - 1))) }
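// One possible way to package the steps above into a single reusable function.
// This is only a sketch: `ngramProbabilities`, `counts` and `prefixTotals` are names
// introduced here, not part of the original snippet. It assumes the text is already
// tokenized and holds at least n tokens (with fewer, sliding yields one shorter window).
def ngramProbabilities(tokens: List[String], n: Int): Map[List[String], Double] = {
  require(n >= 1, "n must be at least 1")
  // Count each distinct n-gram
  val counts = tokens.sliding(n).toList.groupBy(identity).map { case (gram, occ) => (gram, occ.size) }
  // Total count per (n - 1)-word prefix, computed once instead of once per n-gram
  val prefixTotals = counts.groupBy { case (gram, _) => gram.take(n - 1) }
    .map { case (prefix, group) => (prefix, group.values.sum) }
  // Relative frequency of each n-gram among the n-grams sharing its prefix
  counts.map { case (gram, count) => (gram, count.toDouble / prefixTotals(gram.take(n - 1))) }
}
// Usage on a tiny sample whose 2-grams are (a, b), (b, a), (a, c):
val sampleProbabilities = ngramProbabilities(List("a", "b", "a", "c"), 2)
// "a" is followed once by "b" and once by "c", so each of those 2-grams gets 0.5
assert(sampleProbabilities(List("a", "b")) == 0.5)
assert(sampleProbabilities(List("b", "a")) == 1.0)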