https://github.com/nathanmarz/storm/
Seems like a far more flexible, reliable solution than Celery and its ilk. You define topologies that manage jobs, their assignment to workers, and aggregation of return values. Highly parallel, reliable, etc.
- Motivation behind Storm
- Intro to Storm
- Trident demo (abstraction layer for Storm)
Queues and workers
- jobs go into a queue, workers process them and put results into another queue, more workers push to the DB, etc.
- in their specific case, it was important to match workers to keys every time
- didn't scale (adding workers changes the hashing, so all workers must be re-deployed)
- Fault tolerance is poor
- Coding is tedious (low level of abstraction: message serialization, deciding where messages go; business logic takes a back seat)
- Guaranteed processing
- Horizontal scalability
- Fault tolerance
- Remove queuing brokers (no more intermediate message brokers)
- Higher level of abstraction than message passing (message passing is too low a level of abstraction)
- "Just works": a realtime system needs to be robust
- Storm has 'em all!
- 1M tuples per sec per node (100-byte tuples)
- 1.6M tuples per sec per node with 10-byte tuples
Use cases: stream processing, distributed RPC, continuous computation
Nimbus
- the master node
Zookeeper
- coordinates the cluster
Workers, each run under a Supervisor
Start a topology: storm jar <jar name>
Kill it: storm kill <topology name>
Streams: unbounded sequences of tuples
Spouts: produce streams
Bolts: consume streams and execute tasks (filters, aggregations, etc.)
A topology is a collection of streams, spouts, and bolts
When a tuple is emitted, where does it go? Stream grouping tells Storm how to partition the streams.
- Shuffle: random; picks a random task
- Fields: mod hashing on a subset of tuple fields (see the sketch after this list)
- All grouping: goes to all tasks
- Global: task with lowest id
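To make the fields-grouping bullet concrete, here is a minimal plain-Java sketch of the idea (not Storm's actual implementation): the target task is derived from a hash of the grouping field, so tuples with equal field values always land on the same task.

```java
import java.util.Objects;

public class FieldsGroupingSketch {
    // Hypothetical helper: pick a target task index from the grouping field.
    // Same "word" value -> same task, which is what a fields grouping guarantees.
    static int targetTask(String word, int numTasks) {
        return Math.floorMod(Objects.hash(word), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        for (String w : new String[]{"storm", "storm", "trident"}) {
            System.out.println(w + " -> task " + targetTask(w, numTasks));
        }
    }
}
```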
TopologyBuilder is used to construct topologies in Java
- you define spouts, which grab data from somewhere
- you define bolts, e.g. split sentences coming from the spout via a shuffle grouping, etc. (see the sketch below)
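A hedged sketch of that TopologyBuilder wiring for the streaming word count, roughly following the storm-starter pattern. The sentence spout and split bolt are assumed to exist elsewhere and are passed in; the WordCount bolt here is illustrative.

```java
import backtype.storm.generated.StormTopology;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.IRichSpout;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {

    // Counts words in memory. The fields grouping on "word" guarantees that a
    // given word always arrives at the same task, so a per-task map is enough.
    public static class WordCount extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<String, Integer>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getStringByField("word");
            Integer count = counts.get(word);
            count = (count == null) ? 1 : count + 1;
            counts.put(word, count);
            collector.emit(new Values(word, count));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    // The spout and the sentence-splitting bolt are assumed to be defined
    // elsewhere (storm-starter has equivalents) and are passed in here.
    public static StormTopology build(IRichSpout sentenceSpout, IRichBolt splitBolt) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", sentenceSpout, 2);
        builder.setBolt("split", splitBolt, 4)
               .shuffleGrouping("sentences");                // random task
        builder.setBolt("count", new WordCount(), 4)
               .fieldsGrouping("split", new Fields("word")); // same word -> same task
        return builder.createTopology();
    }
}
```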
splitsentence.py: class SplitSentenceBolt(storm.BasicBolt)... (bolts can also be written in non-JVM languages via the multilang protocol)
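For reference, the JVM side of such a multilang bolt is a thin ShellBolt wrapper that launches the Python process; this is roughly the pattern used in storm-starter (names assumed):

```java
import backtype.storm.task.ShellBolt;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;

import java.util.Map;

// JVM-side wrapper: Storm starts `python splitsentence.py` as a subprocess and
// talks to it over the multilang protocol; the wrapper only declares output fields.
public class SplitSentence extends ShellBolt implements IRichBolt {

    public SplitSentence() {
        super("python", "splitsentence.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
```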
Submit topologies to a Storm cluster to be run. You can also simulate this with local mode, which creates an in-process cluster.
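A small sketch of both submission paths, assuming a StormTopology built as in the earlier sketch: LocalCluster gives the in-process local mode, StormSubmitter hands the packaged jar to Nimbus.

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.StormSubmitter;
import backtype.storm.generated.StormTopology;

public class SubmitSketch {

    // Local mode: an in-process "cluster", handy for development and testing.
    static void runLocally(String name, StormTopology topology) throws Exception {
        Config conf = new Config();
        conf.setDebug(true);
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology(name, conf, topology);
        Thread.sleep(10000);           // let it run for a while
        cluster.killTopology(name);
        cluster.shutdown();
    }

    // Real cluster: the jar is packaged and started with `storm jar`,
    // and Nimbus distributes the work to the supervisors.
    static void runOnCluster(String name, StormTopology topology) throws Exception {
        StormSubmitter.submitTopology(name, new Config(), topology);
    }
}
```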
Streaming word count
- stream processing (find source)
Distributed RPC
- the client sends a function name and args to the DRPC server; the server streams out the function invocations; a topology processes that stream and sends results back to the DRPC server, which returns them to the client (sketch after this list)
- compute "reach" of a URL on the fly -- find followers of followers of followers and count -- spout emits requests to get tweeters, emits get followers, distincts on the set, aggregate count
Uhh.. too fast, read examples online
What happens to tuples? The "tuple tree": bolts emit new tuples based on their input, which creates a tree of tuples
- a spout tuple is not considered fully processed until the whole tree has been processed
- Storm tracks the tuple trees and lets the spout know when they are complete
- tracking a tuple tree takes a constant amount of space, and this removes the need for message brokers
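A sketch of how a bolt participates in the tuple tree: emitting with the input tuple as the anchor adds a child to the tree, and acking marks that node as done (field names here are illustrative).

```java
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.Map;

// Splits sentences; anchoring + ack is what lets Storm track the tuple tree
// and replay the spout tuple if anything downstream fails.
public class AnchoredSplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        for (String word : input.getString(0).split(" ")) {
            // Anchor the new tuple to the input: it becomes a child in the tuple tree.
            collector.emit(input, new Values(word));
        }
        collector.ack(input);  // this node of the tree is complete
        // collector.fail(input) would instead trigger a replay from the spout.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```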
How do you count with an at-least-once delivery guarantee? Transactional topologies
- Enables fault tolerant, exactly-once messaging semantics
- Process small batches of tuples at a time
- If part of a batch fails, the batch is retried
- bolts can be committers
- commits are ordered
- you don't move on until the batch is committed -- only commit if the transaction id is different (sketch below)
- Multiple batches can still be processed in parallel -- lower latency -- higher throughput
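The "only commit if the transaction id is different" rule is what turns retried batches into exactly-once counts: store the txid next to the value and skip updates that carry the same txid. A plain-Java sketch against a hypothetical in-memory store, not the actual transactional-topologies API:

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotentCommitSketch {
    // Value stored per key: the count plus the txid of the batch that last updated it.
    static class Stored {
        long count;
        long txid;
        Stored(long count, long txid) { this.count = count; this.txid = txid; }
    }

    private final Map<String, Stored> store = new HashMap<String, Stored>();  // stand-in for a real DB

    // Called when a batch commits. Commits are ordered by txid, so if the stored
    // txid already equals this batch's txid, the batch was applied before a retry
    // and must not be applied again.
    void commitBatch(long txid, String key, long batchDelta) {
        Stored current = store.get(key);
        if (current == null) {
            store.put(key, new Stored(batchDelta, txid));
        } else if (current.txid != txid) {
            store.put(key, new Stored(current.count + batchDelta, txid));
        }
        // else: same txid -> retried batch, already counted, skip.
    }

    public static void main(String[] args) {
        IdempotentCommitSketch s = new IdempotentCommitSketch();
        s.commitBatch(1, "storm", 3);
        s.commitBatch(1, "storm", 3);  // retry of the same batch: no double count
        s.commitBatch(2, "storm", 2);
        System.out.println(s.store.get("storm").count);  // 5
    }
}
```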
Trident: a high-level abstraction for Storm. State + stream processing + DRPC, with exactly-once consistent semantics
Demo: streaming word count, with DRPC queries to get the sum of the counts of words
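A hedged sketch of that Trident demo, closely following the word-count example in the Trident tutorial (FixedBatchSpout and MemoryMapState come from Storm's testing utilities; exact class names may differ by version):

```java
import backtype.storm.LocalDRPC;
import backtype.storm.generated.StormTopology;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentState;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.operation.builtin.FilterNull;
import storm.trident.operation.builtin.MapGet;
import storm.trident.operation.builtin.Sum;
import storm.trident.testing.FixedBatchSpout;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;

public class TridentWordCountSketch {

    // Splits a sentence into words, one emitted tuple per word.
    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            for (String word : tuple.getString(0).split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static StormTopology build(LocalDRPC drpc) {
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                new Values("the cow jumped over the moon"),
                new Values("four score and seven years ago"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();

        // Persistent word counts kept as Trident-managed state with
        // exactly-once update semantics.
        TridentState wordCounts =
                topology.newStream("sentences", spout)
                        .each(new Fields("sentence"), new Split(), new Fields("word"))
                        .groupBy(new Fields("word"))
                        .persistentAggregate(new MemoryMapState.Factory(),
                                             new Count(), new Fields("count"));

        // DRPC query: split the request args into words, look up each word's
        // count in the state, and sum them for the reply.
        topology.newDRPCStream("words", drpc)
                .each(new Fields("args"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
                .each(new Fields("count"), new FilterNull())
                .aggregate(new Fields("count"), new Sum(), new Fields("sum"));

        return topology.build();
    }
}
```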