@zouzias
Last active October 5, 2017 14:51
LuceneRDD: Example using Quora Question Pairs Dataset
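import org.apache.spark.sql.Row
import org.zouzias.spark.lucenerdd.LuceneRDD
import org.zouzias.spark.lucenerdd._
val df = spark.read.parquet("spark-lucenerdd/quora_duplicate_questions.parquet")
// Assumption: the original snippet never defines `rdd`; here it is a LuceneRDD built from the same DataFrame.
val rdd = LuceneRDD(df)
// Build a Lucene query from question1: strip non-alphanumerics, keep tokens longer than 3 characters, AND them.
val linker = { r: Row =>
  val tokens = r.getString(r.fieldIndex("question1")).split(" ")
    .map(_.replaceAll("[^a-zA-Z0-9]", "")).filter(_.length > 3).mkString(" AND ")
  if (tokens.nonEmpty) s"question1:(${tokens})" else "*:*" // match all documents when no token survives
}
val linkage = rdd.linkDataFrame(df, linker, 1) // keep only the top-1 match per row
linkage.take(10).map(x => (x._1.getString(3), x._2.headOption)).foreach(println) // first 10 rows with their best match, if any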
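// A minimal sketch (not part of the original gist) of the query strings the linker above generates.
// `toQuery` is a hypothetical helper that repeats the linker's token logic on a plain String,
// so it can be tried in any Scala REPL without Spark or LuceneRDD.
def toQuery(question: String): String = {
  val tokens = question.split(" ")
    .map(_.replaceAll("[^a-zA-Z0-9]", "")) // drop punctuation and other non-alphanumerics
    .filter(_.length > 3)                  // keep only tokens longer than 3 characters
    .mkString(" AND ")                     // require every remaining token to match
  if (tokens.nonEmpty) s"question1:(${tokens})" else "*:*"
}
println(toQuery("What is the best way to learn Scala?")) // question1:(What AND best AND learn AND Scala)
println(toQuery("Why is it so?"))                        // *:* (no token is longer than 3 characters)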