Andrew Montalenti amontalenti

Lucene Fundamentals

A useful set of Lucene fundamentals that are good for grok'ing Elasticsearch.

Jargon Glossary

  • document: a record; the unit of search; the thing returned as search results
  • field: a typed slot in a document for storing and indexing values
  • index: a collection of documents, typically with the same field mappings or schema
  • corpus: the entire set of documents in an index
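To make the glossary concrete, here is a minimal sketch in plain Python. It is not Lucene's API: `Index`, `add`, and `search` are hypothetical names chosen for illustration, and the tokenizer is just a lowercase whitespace split. A document is a dict of fields, and the index holds the corpus plus an inverted map from (field, term) to document ids.

```python
from collections import defaultdict


class Index:
    """Toy index: a corpus of documents plus an inverted index over their fields."""

    def __init__(self):
        self.corpus = []                  # every document in the index
        self.postings = defaultdict(set)  # (field, term) -> ids of matching documents

    def add(self, document):
        """Store a document (a dict of field -> text) and index each field's terms."""
        doc_id = len(self.corpus)
        self.corpus.append(document)
        for field, value in document.items():
            for term in str(value).lower().split():
                self.postings[(field, term)].add(doc_id)
        return doc_id

    def search(self, field, term):
        """Return the documents whose `field` contains `term`."""
        doc_ids = self.postings.get((field, term.lower()), ())
        return [self.corpus[i] for i in sorted(doc_ids)]


index = Index()
index.add({"title": "Lucene in Action", "author": "Hatcher"})
index.add({"title": "Elasticsearch Guide", "author": "Gormley"})
hits = index.search("title", "lucene")  # documents are the unit returned by search
```

Here `hits` contains only the first document, since the term is looked up per field rather than across the whole document.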
Policies are often the result of something that once went wrong. It’s
organizational scar tissue developed from a This Can Never Happen Again mandate.
And it's almost always ill-considered.
The problem with policies is that they compound and eventually add up to the
rigidity of bureaucracy that everyone says they despise. Policies are not free.
They demean the intellect of the executor (“I know this is stupid, but…”)
and dissolve the ability to deal with a situation in context (“I sympathize,
but…”).
>>> import re
>>> eml = re.compile(r"([^@|\s]+@[^@]+\.[^@|\s]+)")
>>> match = eml.search("some text that has a [email protected] email address")
>>> match.group(1)
"[email protected]"
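The session above can be wrapped in a small helper. This is a sketch, not the author's code: `first_email`, `LOOSE_EMAIL`, and `STRICT_EMAIL` are hypothetical names, and `user@example.com` is a made-up address. One thing worth noting is that the domain part `[^@]+` in the original pattern can span whitespace, so a later period in the text can stretch the match; the assumed stricter variant excludes whitespace there too.

```python
import re

# Pattern from the session above: the domain part `[^@]+` may span whitespace.
LOOSE_EMAIL = re.compile(r"([^@|\s]+@[^@]+\.[^@|\s]+)")
# Assumed tightening: forbid whitespace and "|" in the domain part as well.
STRICT_EMAIL = re.compile(r"([^@|\s]+@[^@|\s]+\.[^@|\s]+)")


def first_email(text, pattern=STRICT_EMAIL):
    """Return the first substring that looks like an email address, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


sample = "write to user@example.com and see v1.2"
loose = first_email(sample, LOOSE_EMAIL)   # over-matches up to the last period
strict = first_email(sample)               # stops at the end of the address
```

With this sample, `loose` is `'user@example.com and see v1.2'` while `strict` is `'user@example.com'`. Neither pattern validates addresses; both only locate something address-shaped.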
ERROR [IndexSummaryManager:1] 2014-10-28 17:04:51,369 CassandraDaemon.java (line 153) Exception in thread Thread[IndexSummaryManager:1,1,main]
java.lang.AssertionError: null
at org.apache.cassandra.io.util.Memory.size(Memory.java:307) ~[apache-cassandra-2.1.1.jar:2.1.1]
at org.apache.cassandra.io.sstable.IndexSummary.getOffHeapSize(IndexSummary.java:192) ~[apache-cassandra-2.1.1.jar:2.1.1]
at org.apache.cassandra.io.sstable.SSTableReader.getIndexSummaryOffHeapSize(SSTableReader.java:1069) ~[apache-cassandra-2.1.1.jar:2.1.1]
at org.apache.cassandra.io.sstable.IndexSummaryManager.redistributeSummaries(IndexSummaryManager.java:294) ~[apache-cassandra-2.1.1.jar:2.1.1]
at org.apache.cassandra.io.sstable.IndexSummaryManager.redistributeSummaries(IndexSummaryManager.java:238) ~[apache-cassandra-2.1.1.jar:2.1.1]
at org.apache.cassandra.io.sstable.IndexSummaryManager$1.runMayThrow(IndexSummaryManager.java:139) ~[apache-cassandra-2.1.1.jar:2.1.1]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.
// GET casterisk-1hour-2014.10/_search?search_type=count
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"apikey": "arstechnica.com"
ERROR:cassandra.connection:Error decoding response from Cassandra. opcode: 0008; message contents: '\x83\x00\x00\x0e\x08\x00\x00\x94\xfb\x00\x00\x00\x02\x00\x00\x00\x01\x00\x00\x00\x02\x00\x10casterisk_pixels\x00\x0bapikey_urls\x00\x06apikey\x00\r\x00\x04hour\x00\x0b\x00\x00\x05\x03\x00\x00\x00\x17architecturaldigest.com\x00\x00\x00\x08\x00\x05\x05\xde{\xb6\xa8\x00\x00\x00\x00\x16brighthubeducation.com\x00\x00\x00\x08\x00\x05\x05\xdfRJL\x00\x00\x00\x00\x0blatimes.com\x00\x00\x00\x08\x00\x05\x05\xde{\xb6\xa8\x00\x00\x00\x00\x0cbetabeat.com\x00\x00\x00\x08\x00\x05\x05\xdc\xce\x8f`\x00\x00\x00\x00\x16brighthubeducation.com\x00\x00\x00\x08\x00\x05\x05\xdc\xce\x8f`\x00\x00\x00\x00\x06al.com\x00\x00\x00\x08\x00\x05\x05\xde{\xb6\xa8\x00\x00\x00\x00\x16www.greentechmedia.com\x00\x00\x00\x08\x00\x05\x05\xe0\xffq\x94\x00\x00\x00\x00\x0fpixelmonkey.org\x00\x00\x00\x08\x00\x05\x05\xdd\xa5#\x04\x00\x00\x00\x00\x07inc.com\x00\x00\x00\x08\x00\x05\x05\xdc\xce\x8f`\x00\x00\x00\x00\x14technologyreview.com\x00\x00\x00\x08\x00\x
>>> today = pd.Timestamp("now").to_period("1D").to_timestamp()
>>> today
Timestamp('2014-10-16 00:00:00')
>>> today = today.tz_localize(pytz.timezone("US/Eastern"))
>>> today
Timestamp('2014-10-16 00:00:00-0400', tz='US/Eastern')
>>> today = today.astimezone(pytz.timezone("UTC")).astimezone(pytz.timezone("US/Eastern"))
>>> today
Timestamp('2014-10-16 00:00:00-0400', tz='US/Eastern')
>>> isinstance(today, dt.datetime)
True
CREATE TABLE site_urls (
site text,
hour timestamp,
url text,
PRIMARY KEY ((site, hour), url));
-- it's a trap!
CREATE INDEX ON site_urls (hour);
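Why the index on `hour` is a trap can be sketched with a toy model in plain Python. No Cassandra driver is involved, and the `partitions` dict and helper functions are hypothetical. Rows live in partitions keyed by `(site, hour)`. A query that supplies the full partition key touches one partition, while a query on `hour` alone has nothing to route by and, in this model, must visit every partition. That is roughly the cost of a secondary-index lookup without a partition key, spread across the cluster.

```python
from collections import defaultdict

# Toy stand-in for the site_urls table: partition key (site, hour) -> set of urls.
partitions = defaultdict(set)


def insert(site, hour, url):
    """Add a url to the partition identified by (site, hour)."""
    partitions[(site, hour)].add(url)


def urls_for_site_hour(site, hour):
    """Full partition key known: one lookup, one partition touched."""
    return partitions.get((site, hour), set()), 1


def urls_for_hour(hour):
    """Only `hour` known: every partition has to be examined."""
    found, touched = set(), 0
    for (_, part_hour), urls in partitions.items():
        touched += 1
        if part_hour == hour:
            found |= urls
    return found, touched


insert("a.com", "2014-10-28T17", "/x")
insert("b.com", "2014-10-28T17", "/y")
insert("a.com", "2014-10-28T18", "/z")

by_key = urls_for_site_hour("a.com", "2014-10-28T17")
by_hour = urls_for_hour("2014-10-28T17")
```

The second element of each result counts partitions touched: 1 for the keyed lookup and 3 (all of them) for the hour-only lookup, a gap that grows with the number of sites.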
-- XXX: all of this is a bad idea, but it was a nice idea at the time :)
CREATE TABLE IF NOT EXISTS apikey_changed_urls (
process_minute timestamp, -- current minute of processing
apikey text, -- apikey
url text, -- url where data changed
change_time timestamp, -- 5-min period where data changed
process_hour timestamp, -- current hour of processing
process_day timestamp, -- current day of processing
PRIMARY KEY (process_minute, apikey, url, change_time));
-- XXX: "never use Cassandra as a queue" is advice I should heed more often
CREATE TABLE IF NOT EXISTS apikey_changed_urls (
process_time timestamp, -- current minute of processing
apikey text, -- apikey
url text, -- url where data changed
change_time timestamp, -- 5-min period where data changed
PRIMARY KEY (process_time, apikey, url, change_time));
# as data comes in, we do inserts like this: