Identify all reasons why (e.g.) Elasticsearch cannot provide acceptable performance for standard requests under the qualifying load. The "qualifying load" for each performance bound is roughly double the worst-case value of that bound across all our current clients.
- Performance
- throughput (rec/s, MB/s) and latency (s) for 100 B, 1 kB, and 100 kB records (see the benchmark sketch at the end of these notes)
- under read-only, write-only, and mixed read/write workloads
- in a degraded state: a) loss of one or two servers, and while recovering from that; b) elevated packet latency and drop rate between "regions"
- High concurrency
- keepalive
- bad input flood
- restart of service; reboot of machine; stop/start of machine
- Utilization, Saturation, Errors
- commonly observed errors and their meaning
- exemplars and mountweazels (planted sentinel records)
- Five queries everyone should know
- their performance at baseline
- Field Cache usage vs number of records
- Write throughput for a) full-weight records; b) the CacheMap use case (lots of deletes on compaction)
- Version upgrade (see the rolling-restart sketch at the end of these notes)
- Recovery
- plugin for recovery strategy
- Shard assignment
- Separate read/write/transport boxes
- probably only one of those node types should be master-eligible
- Cross-geo replication?
- Machine sizes: m1.x vs m3.x; EBS-optimized vs not; for write nodes, c1.xl?
- Failover and backup
- CacheMap metrics, tuning
- In-stream database calls
- Can I "push"/"flush" DRPC calls?
- What happens when I fail a tuple? (see the bolt sketch at the end of these notes)
- fail-forever / fail-retriably
- "failure" streams
- Tracing
- "tracing" stream
- Wukong shim
- failure/error handling
- tuple vs record
- serialization
- Batch size tradeoffs
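
A minimal sketch of the throughput/latency harness the performance items above imply, for orientation only: it assumes an Elasticsearch node at `localhost:9200` and an index/type named `bench`/`doc` (all hypothetical), indexes single documents over HTTP, and times one `match_all` query as a stand-in for one of the "five queries". A real run would use the bulk API, concurrent clients, and the record sizes and degraded-state scenarios listed above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EsBench {
  static final HttpClient HTTP = HttpClient.newHttpClient();
  static final String BASE = "http://localhost:9200";  // assumed endpoint

  // Index `count` synthetic records whose payload is roughly `size` bytes; return rec/s.
  static double writeThroughput(int count, int size) throws Exception {
    String payload = "x".repeat(size);
    long t0 = System.nanoTime();
    for (int i = 0; i < count; i++) {
      String doc = "{\"id\":" + i + ",\"payload\":\"" + payload + "\"}";
      HttpRequest req = HttpRequest.newBuilder()
          .uri(URI.create(BASE + "/bench/doc/"))          // POST with auto-assigned id
          .header("Content-Type", "application/json")
          .POST(HttpRequest.BodyPublishers.ofString(doc))
          .build();
      HTTP.send(req, HttpResponse.BodyHandlers.discarding());
    }
    return count / ((System.nanoTime() - t0) / 1e9);
  }

  // Time a single representative query, in milliseconds.
  static double queryLatencyMs(String queryJson) throws Exception {
    HttpRequest req = HttpRequest.newBuilder()
        .uri(URI.create(BASE + "/bench/_search"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(queryJson))
        .build();
    long t0 = System.nanoTime();
    HTTP.send(req, HttpResponse.BodyHandlers.ofString());
    return (System.nanoTime() - t0) / 1e6;
  }

  public static void main(String[] args) throws Exception {
    System.out.printf("100B records: %.0f rec/s%n", writeThroughput(1000, 100));
    System.out.printf("1kB records:  %.0f rec/s%n", writeThroughput(1000, 1000));
    String matchAll = "{\"query\":{\"match_all\":{}},\"size\":10}";
    System.out.printf("match_all latency: %.1f ms%n", queryLatencyMs(matchAll));
  }
}
```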
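
In the same spirit, a hedged sketch of the shard-allocation dance around a node restart (version upgrade, recovery drill). It assumes an Elasticsearch version in which `cluster.routing.allocation.enable` exists (1.x and later; 0.90 used `disable_allocation` instead) and the same assumed `localhost:9200` endpoint; the actual node upgrade/restart happens out of band between the two settings calls.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RollingRestart {
  static final HttpClient HTTP = HttpClient.newHttpClient();
  static final String BASE = "http://localhost:9200";  // assumed endpoint

  static String put(String path, String body) throws Exception {
    HttpRequest req = HttpRequest.newBuilder()
        .uri(URI.create(BASE + path))
        .header("Content-Type", "application/json")
        .PUT(HttpRequest.BodyPublishers.ofString(body))
        .build();
    return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
  }

  static String get(String path) throws Exception {
    HttpRequest req = HttpRequest.newBuilder().uri(URI.create(BASE + path)).GET().build();
    return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
  }

  public static void main(String[] args) throws Exception {
    // 1. Keep shards from migrating off the node we are about to bounce.
    put("/_cluster/settings",
        "{\"transient\":{\"cluster.routing.allocation.enable\":\"none\"}}");

    // 2. Upgrade/restart the node out of band and wait for it to rejoin.

    // 3. Re-enable allocation and block until the cluster is green again.
    put("/_cluster/settings",
        "{\"transient\":{\"cluster.routing.allocation.enable\":\"all\"}}");
    System.out.println(get("/_cluster/health?wait_for_status=green&timeout=10m"));
  }
}
```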
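
Finally, a sketch of how the fail-forever vs. fail-retriably split and the "failure"/"tracing" streams could look in a plain (pre-Trident) Storm bolt, using the old `backtype.storm` package namespace. The stream names, the `record` field, and the `isRetriable` check are illustrative, not anything Storm itself provides; under Trident/DRPC the mechanics differ, since failing there replays the whole batch.

```java
import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class GuardedWriterBolt extends BaseRichBolt {
  private OutputCollector collector;

  @Override
  public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
  }

  @Override
  public void execute(Tuple tuple) {
    String record = tuple.getStringByField("record");  // field name is illustrative
    try {
      // ... in-stream database call / index write goes here ...
      collector.emit("tracing", tuple, new Values(record, System.currentTimeMillis()));
      collector.ack(tuple);
    } catch (Exception e) {
      if (isRetriable(e)) {
        // fail-retriably: the spout will replay this tuple
        collector.fail(tuple);
      } else {
        // fail-forever: park the record on the failure stream and ack so it is NOT replayed
        collector.emit("failures", tuple, new Values(record, e.toString()));
        collector.ack(tuple);
      }
    }
  }

  private boolean isRetriable(Exception e) {
    return true;  // placeholder: classify timeouts vs. malformed-input errors here
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declareStream("failures", new Fields("record", "error"));
    declarer.declareStream("tracing",  new Fields("record", "wall_clock_ms"));
  }
}
```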