import json

import redis
import tldextract

def get_spider_info_for_url(url):
    h = redis.Redis()
    extracted = tldextract.extract(url)
    # note: newer tldextract versions call the public suffix .suffix rather than .tld
    wholedomain = ".".join([extracted.domain, extracted.suffix])
    wholesubdomain = ".".join([extracted.subdomain, extracted.domain, extracted.suffix])
    # look for the full subdomain first, then fall back to the bare domain
    info = h.hget('spider_info', wholesubdomain)
    if info is None:
        info = h.hget('spider_info', wholedomain)
    # the original snippet is truncated here; decoding the stored JSON is a guess
    return json.loads(info) if info else None
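The lookup order above (most specific key first, then fall back) can be sketched without a running Redis server by using a plain dict in place of the `spider_info` hash. The keys and values below are made up for illustration:

```python
# A dict standing in for the Redis hash 'spider_info'; entries are hypothetical.
spider_info = {
    "news.example.com": '{"spider": "news"}',
    "example.com": '{"spider": "generic"}',
}

def lookup(subdomain_key, domain_key):
    # try the most specific key first, then fall back to the bare domain
    info = spider_info.get(subdomain_key)
    if info is None:
        info = spider_info.get(domain_key)
    return info

print(lookup("news.example.com", "example.com"))  # matches the subdomain entry
print(lookup("blog.example.com", "example.com"))  # falls back to the domain entry
```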
dfdeshom / gist:6277295 (last active December 21, 2015 08:19)
SolrCloud: The good and the bad

We're coming up on a year now since SolrCloud (Solr 4.0) was released. The company I work for recently switched to Solr 4.3, and the overall impression has been good, although there have been some growing pains. What follows are my impressions of what I've liked and not liked so far about SolrCloud.

The Bad

You can still run Solr in "non-cloud" mode. This means that there are two code paths in the lucene-solr repo. It also means that support questions can get a little more complicated. There are some issues that come up because of this separation:

  • Configuration is somewhat in flux. The solr.xml file is scheduled for a major change in Solr 5 (http://wiki.apache.org/solr/Solr.xml%204.4%20and%20beyond) and might disappear completely. schema.xml and solrconfig.xml now live in ZooKeeper.

  • There seems to be some confusion over the cores API and the collections API. The collections API is a nice superset of the cores API, but some people think they can be used interchangeably.
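The difference is easy to see from the request URLs alone: the Collections API creates and distributes a whole collection across the cluster, while the CoreAdmin API only creates a single core on one node. The endpoint paths and parameters below are real Solr 4.x ones, but the host, collection, and core names are made up:

```python
# Solr 4.x exposes both admin endpoints, but they are not interchangeable.
# Host and names below are hypothetical.
base = "http://localhost:8983/solr"

# Collections API: creates a distributed collection (cluster-wide operation)
collection_create = (base + "/admin/collections"
                     "?action=CREATE&name=mycollection"
                     "&numShards=2&replicationFactor=2")

# CoreAdmin API: creates one core on the node you send the request to
core_create = (base + "/admin/cores"
               "?action=CREATE&name=mycollection_shard1_replica1"
               "&collection=mycollection&shard=shard1")

print(collection_create)
print(core_create)
```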

Parallelism in ES and Hadoop/Spark

1 shard corresponds to 1 Spark partition.

Reading from ES: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/arch.html#arch-reading . Beware of increasing the number of shards on ES for performance reasons:

A common concern (read optimization) for improving performance is to increase the number of shards and thus increase the number of tasks on the Hadoop side. Unless such gains are demonstrated through benchmarks, we recommend against such a measure since in most cases, an Elasticsearch shard can easily handle data streaming to a Hadoop or Spark task.

Writing to ES: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/arch.html#arch-writing . Write performance can be increased by having more partitions:

elasticsearch-hadoop detects the number of (primary) shards where the write will occur and distributes the writes between these. The more splits/partitions available, the more mappers/reducers can write data in parallel to Elasticsearch.
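A rough sketch of that scaling (not the elasticsearch-hadoop implementation itself): with S primary shards and P partitions, each writing task targets one shard, so a simple round-robin assignment shows how adding partitions stacks up extra parallel writers per shard:

```python
# Hedged sketch: round-robin assignment of P writing tasks (one per
# partition) across S primary shards. More partitions means more
# concurrent writers per shard, which is where the write speedup comes from.
def writers_per_shard(num_partitions, num_shards):
    counts = [0] * num_shards
    for p in range(num_partitions):
        counts[p % num_shards] += 1
    return counts

print(writers_per_shard(4, 2))  # [2, 2]: two parallel writers per shard
print(writers_per_shard(8, 2))  # [4, 4]: doubling partitions doubles writers
```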

#!/usr/bin/env python2.7
import json
import random

from mrjob.job import MRJob

class TestStdIn(MRJob):
    # the original gist is truncated after the class line; this mapper is a
    # minimal sketch so the job runs end to end
    def mapper(self, _, line):
        yield random.choice(["a", "b"]), json.loads(line)

if __name__ == '__main__':
    TestStdIn.run()