maziyarpanahi’s gists

maziyarpanahi / enwiki-global-warming-LDA-results.txt

Last active October 22, 2017 13:48

Results of Spark LDA run over English Wikipedia pages (different queries). Topics are sorted by per-topic coherence (Word2Vec).

	====================
	Stanford CoreNLP (Sentence splitter and POS Tagging - NN and NNS), StopWordsRemover, TF-IDF, word2vec and OnlineLDAOptimizer
	Query: Global Warming (5000 pages)

	==========Parameters==========
	val numTopics: Int = 50
	val maxIterations: Int = 100
	val vocabSize: Int = 10000
	val minDF: Int = 10
	val minTF: Int = 1
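
The description says topics are sorted by Word2Vec coherence but does not say how the score is computed. The following is a minimal sketch in plain Scala (no Spark dependency), assuming coherence is the mean pairwise cosine similarity between the word vectors of a topic's top terms. The object name `TopicCoherence`, its method names, and the in-memory `Map` of vectors are hypothetical and not taken from the gist.

```scala
// Hypothetical helper: ranks topics by an ASSUMED coherence definition
// (mean pairwise cosine similarity of the topic's word vectors).
object TopicCoherence {
  type Vec = Array[Double]

  // Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
  def cosine(a: Vec, b: Vec): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Mean similarity over all word pairs of a topic; words without a vector
  // are skipped, and a topic with fewer than two known words scores 0.0.
  def coherence(words: Seq[String], vectors: Map[String, Vec]): Double = {
    val known = words.flatMap(vectors.get).toIndexedSeq
    val sims = for {
      i <- known.indices
      j <- (i + 1) until known.length
    } yield cosine(known(i), known(j))
    if (sims.isEmpty) 0.0 else sims.sum / sims.length
  }

  // Pairs each topic with its score, most coherent topic first.
  def sortByCoherence(
      topics: Seq[Seq[String]],
      vectors: Map[String, Vec]
  ): Seq[(Seq[String], Double)] =
    topics.map(t => (t, coherence(t, vectors))).sortBy { case (_, score) => -score }
}
```

In a Spark job the vectors would presumably come from a trained Word2Vec model and the word lists from the top terms of each LDA topic; here both are ordinary collections so the ranking logic can be checked in isolation.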

maziyarpanahi / enwiki-gas-emissions-LDA-results.txt

Last active July 3, 2017 17:02

Results of Spark LDA run over English Wikipedia pages (different queries). Topics are sorted by per-topic coherence (Word2Vec).

	Stanford CoreNLP (Sentence splitter and POS Tagging - extract noun phrases), StopWordsRemover, TF-IDF, word2vec and OnlineLDAOptimizer
	Query: Global Warming (5000 pages)

	==========Parameters==========
	val numTopics: Int = 50
	val maxIterations: Int = 100
	val vocabSize: Int = 10000
	val minDF: Int = 1
	val minTF: Int = 1
	val maxItems: Int = 15

maziyarpanahi / pubmed-cancer-LDA-results.txt

Last active October 22, 2017 13:49

Results of LDA over the "Cancer" sub-corpus of the PubMed dataset

	Stanford CoreNLP (Sentence splitter and POS Tagging - extract noun phrases), StopWordsRemover, TF-IDF, word2vec and OnlineLDAOptimizer

	==========
	Query: "cancer"
	Sample: 500K abstracts
	Dataset: PubMed
	==========
	val numTopics: Int = 50
	val maxIterations: Int = 100
	val vocabSize: Int = 10000

maziyarpanahi / top-500-enwiki.txt

Created October 22, 2017 14:26

Top 500 phrases in English Wikipedia

	Phrases were extracted by Stanford CoreNLP/Spark 2.2 (6 minutes) from English Wikipedia (5+ million pages)

	+---------------------------+-----+
	|value                      |count|
	+---------------------------+-----+
	|square miles               |59821|
	|unique feature             |46463|
	|id form                    |46101|
	|administrative district    |45963|
	|first time                 |41423|

maziyarpanahi / tours.json

Created February 4, 2018 18:26

JSON array of demo Tours for MongoDB

	[
	{
	"tourBlurb" : "Big Sur is big country. The Big Sur Retreat takes you to the most majestic part of the Pacific Coast and show you the secret trails.",
	"tourName" : "Big Sur Retreat",
	"tourPackage" : "Backpack Cal",
	"tourBullets" : "\"Accommodations at the historic Big Sur River Inn, Privately guided hikes through any of the 5 surrounding national parks, Picnic lunches prepared by the River Inn kitchen, Complimentary country breakfast, Admission to the Henry Miller Library and the Point Reyes Lighthouse \"",
	"tourRegion" : "Central Coast",
	"tourDifficulty" : "Medium",
	"tourLength" : 3,
	"tourPrice" : 750,

maziyarpanahi / Spark-NLP-POS.scala

Last active May 22, 2018 12:37

	import com.johnsnowlabs.nlp.{DocumentAssembler, Finisher}
	import com.johnsnowlabs.nlp.annotators.{Normalizer, Stemmer, Tokenizer}
	import com.johnsnowlabs.nlp.annotator._
	import com.johnsnowlabs.nlp.base._
	import com.johnsnowlabs.util.Benchmark
	import org.apache.spark.ml.feature.NGram

	import org.apache.spark.ml.Pipeline
	import org.apache.spark.ml.feature.{StopWordsRemover, IDF, HashingTF, CountVectorizer, Word2Vec}

maziyarpanahi / gist:aee182aab3e320749fbc9a81031deab3

Created August 25, 2018 17:16

Wikipedia mapping error in ES 6.3.1

	{
	"error": {
	"root_cause": [
	{
	"type": "mapper_parsing_exception",
	"reason": "Root mapping definition has unsupported parameters: [namespace : {dynamic=false, properties={wiki={analyzer=keyword, type=text, index_options=docs}, name={analyzer=near_match_asciifolding, type=text, index_options=docs}}}] [archive : {dynamic=false, properties={wiki={analyzer=keyword, type=text, index_options=docs}, namespace={type=long}, title={search_analyzer=text_search, similarity=BM25, analyzer=text, position_increment_gap=10, type=text, fields={trigram={similarity=BM25, analyzer=trigram, type=text, index_options=docs}, prefix_asciifolding={search_analyzer=near_match_asciifolding, similarity=BM25, analyzer=prefix_asciifolding, type=text, index_options=docs}, plain={search_analyzer=plain_search, similarity=BM25, analyzer=plain, position_increment_gap=10, type=text}, prefix={search_analyzer=near_match, similarity=BM25, analyzer=prefix, type=text, index_options=docs}, keyword={s

maziyarpanahi / yarn-cluster-error.txt

Created February 5, 2019 10:10

	org.apache.spark.SparkException: Task not serializable
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:393)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2338)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)

maziyarpanahi / zeppelin-pyspark-yarn-client.txt

Created February 6, 2019 21:41

	INFO [2019-02-06 22:23:16,364] ({main} RemoteInterpreterServer.java[<init>]:148) - Starting remote interpreter server on port 0, intpEventServerAddress: IP_ADDRESS:36131
	INFO [2019-02-06 22:23:16,384] ({main} RemoteInterpreterServer.java[<init>]:175) - Launching ThriftServer at IP_ADDRESS:46727
	INFO [2019-02-06 22:23:16,549] ({pool-6-thread-1} RemoteInterpreterServer.java[createInterpreter]:333) - Instantiate interpreter org.apache.zeppelin.spark.SparkInterpreter
	INFO [2019-02-06 22:23:16,553] ({pool-6-thread-1} RemoteInterpreterServer.java[createInterpreter]:333) - Instantiate interpreter org.apache.zeppelin.spark.SparkSqlInterpreter
	INFO [2019-02-06 22:23:16,556] ({pool-6-thread-1} RemoteInterpreterServer.java[createInterpreter]:333) - Instantiate interpreter org.apache.zeppelin.spark.DepInterpreter
	INFO [2019-02-06 22:23:16,560] ({pool-6-thread-1} RemoteInterpreterServer.java[createInterpreter]:333) - Instantiate interpreter org.apache.zeppelin.spark.PySparkInterpreter
	INFO [2019-02-06 22:23:16,563] ({pool

maziyarpanahi / zeppelin-pyspark-yarn.txt

Created February 18, 2019 10:35

	DEBUG [2019-02-18 11:27:25,397] ({YARN application state monitor} ProtobufRpcEngine.java[invoke]:249) - Call: getApplicationReport took 2ms
	DEBUG [2019-02-18 11:27:25,878] ({FIFOScheduler-Worker-1} InterpreterOutputStream.java[processLine]:81) - Interpreter output:import org.apache.spark.sql.functions._
	INFO [2019-02-18 11:27:25,931] ({pool-6-thread-2} RemoteInterpreterServer.java[getStatus]:818) - job:null
	DEBUG [2019-02-18 11:27:25,931] ({pool-6-thread-2} Interpreter.java[getProperty]:204) - key: zeppelin.spark.concurrentSQL, value: false
	INFO [2019-02-18 11:27:25,931] ({pool-6-thread-2} RemoteInterpreterServer.java[getStatus]:818) - job:null
	INFO [2019-02-18 11:27:25,931] ({pool-6-thread-2} RemoteInterpreterServer.java[getStatus]:818) - job:null
	INFO [2019-02-18 11:27:25,931] ({pool-6-thread-2} RemoteInterpreterServer.java[getStatus]:818) - job:org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob@f7c36f41
	INFO [2019-02-18 11:27:25,931] ({pool-6-thread-2} RemoteInterpreterServer.

Maziyar Panahi maziyarpanahi