import json

import redis
import tldextract

def get_spider_info_for_url(url):
    h = redis.Redis()
    extracted = tldextract.extract(url)
    # note: newer tldextract versions call the public suffix .suffix rather than .tld
    wholedomain = ".".join([extracted.domain, extracted.suffix])
    wholesubdomain = ".".join([extracted.subdomain, extracted.domain, extracted.suffix])
    # look for the full subdomain first, then fall back to the bare domain
    info = h.hget('spider_info', wholesubdomain)
    if info is None:
        info = h.hget('spider_info', wholedomain)
    # the original snippet is truncated here; decoding the stored JSON is a guess
    return json.loads(info) if info else None
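The lookup order above (most specific key first, then fall back) can be sketched without a running Redis server by using a plain dict in place of the `spider_info` hash. The keys and values below are made up for illustration:

```python
# A dict standing in for the Redis hash 'spider_info'; entries are hypothetical.
spider_info = {
    "news.example.com": '{"spider": "news"}',
    "example.com": '{"spider": "generic"}',
}

def lookup(subdomain_key, domain_key):
    # try the most specific key first, then fall back to the bare domain
    info = spider_info.get(subdomain_key)
    if info is None:
        info = spider_info.get(domain_key)
    return info

print(lookup("news.example.com", "example.com"))  # matches the subdomain entry
print(lookup("blog.example.com", "example.com"))  # falls back to the domain entry
```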
dfdeshom / gist:6277295 (last active December 21, 2015 08:19)
SolrCloud: The good and the bad

We're coming up on a year now since SolrCloud (Solr 4.0) was released. The company I work for recently switched to Solr 4.3, and the overall impression has been good, although there have been some growing pains. What follows are my impressions of what I've liked and not liked so far about SolrCloud.

The Bad

You can still run Solr in "non-cloud" mode. This means that there are two code paths in the lucene-solr repo. It also means that support questions can get a little more complicated. There are some issues that come up because of this separation:

  • Configuration is somewhat in flux. The solr.xml file is scheduled for a major change in Solr 5 (http://wiki.apache.org/solr/Solr.xml%204.4%20and%20beyond) and might disappear completely. schema.xml and solrconfig.xml now live in ZooKeeper.

  • There seems to be some confusion over the cores API and the collections API. The collections API is a nice superset of the cores API, but some people think they can be used interchangeably.
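The difference is easy to see from the request URLs alone: the Collections API creates and distributes a whole collection across the cluster, while the CoreAdmin API only creates a single core on one node. The endpoint paths and parameters below are real Solr 4.x ones, but the host, collection, and core names are made up:

```python
# Solr 4.x exposes both admin endpoints, but they are not interchangeable.
# Host and names below are hypothetical.
base = "http://localhost:8983/solr"

# Collections API: creates a distributed collection (cluster-wide operation)
collection_create = (base + "/admin/collections"
                     "?action=CREATE&name=mycollection"
                     "&numShards=2&replicationFactor=2")

# CoreAdmin API: creates one core on the node you send the request to
core_create = (base + "/admin/cores"
               "?action=CREATE&name=mycollection_shard1_replica1"
               "&collection=mycollection&shard=shard1")

print(collection_create)
print(core_create)
```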

Parallelism in ES and Hadoop/Spark

1 shard corresponds to 1 Spark partition.

Reading from ES: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/arch.html#arch-reading . Beware of increasing the number of shards on ES for performance reasons:

A common concern (read optimization) for improving performance is to increase the number of shards and thus increase the number of tasks on the Hadoop side. Unless such gains are demonstrated through benchmarks, we recommend against such a measure since in most cases, an Elasticsearch shard can easily handle data streaming to a Hadoop or Spark task.

Writing to ES: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/arch.html#arch-writing . Write performance can be increased by having more partitions:

elasticsearch-hadoop detects the number of (primary) shards where the write will occur and distributes the writes between these. The more splits/partitions available, the more mappers/reducers can write data in parallel to Elasticsearch.
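A rough sketch of that scaling (not the elasticsearch-hadoop implementation itself): with S primary shards and P partitions, each writing task targets one shard, so a simple round-robin assignment shows how adding partitions stacks up extra parallel writers per shard:

```python
# Hedged sketch: round-robin assignment of P writing tasks (one per
# partition) across S primary shards. More partitions means more
# concurrent writers per shard, which is where the write speedup comes from.
def writers_per_shard(num_partitions, num_shards):
    counts = [0] * num_shards
    for p in range(num_partitions):
        counts[p % num_shards] += 1
    return counts

print(writers_per_shard(4, 2))  # [2, 2]: two parallel writers per shard
print(writers_per_shard(8, 2))  # [4, 4]: doubling partitions doubles writers
```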

#!/usr/bin/env python2.7
import json
import random

from mrjob.job import MRJob

class TestStdIn(MRJob):
    # the original gist is truncated after the class line; this mapper is a
    # minimal sketch so the job runs end to end
    def mapper(self, _, line):
        yield random.choice(["a", "b"]), json.loads(line)

if __name__ == '__main__':
    TestStdIn.run()