https://github.com/nathanmarz/storm/
Seems like a far more flexible, reliable solution than Celery and its ilk. You define topologies that manage jobs, their assignment to workers, and aggregation of return values. Highly parallel, reliable, etc.
- Motivation behind Storm
- Intro to Storm
- Trident demo (abstraction layer for Storm)
Queues and workers
- jobs go into a queue, workers process them and put results into another queue, more workers push to the DB, etc.
- in their specific case, it was important to match workers to keys every time
- didn't scale (adding workers changes the hashing, so all workers must be re-deployed)
- Fault tolerance is poor
- Coding is tedious (low level of abstraction: message serialization, deciding where messages go; business logic takes a back seat)
- Guaranteed processing
- Horizontal scalability
- Fault tolerance
- Remove queuing brokers (no more intermediate message brokers)
- Higher level of abstraction than message passing (message passing is too low a level of abstraction)
- "Just works": a realtime system needs to be robust
- Storm has 'em all!
- 1M tuples per sec per node (100-byte tuples)
- 1.6M tuples per sec per node with 10-byte tuples
Use cases: stream processing, distributed RPC, continuous computation
Nimbus
- the master node
Zookeeper
- coordinates the cluster
Workers, each run under a Supervisor
Start a topology: storm jar <jar name>
Kill it: storm kill <topology name>
Streams: unbounded sequences of tuples
Spouts: produce streams
Bolts: consume streams and execute tasks (filters, aggregations, etc.)
A topology is a collection of streams, spouts, and bolts
When a tuple is emitted, where does it go? Stream grouping tells Storm how to partition the streams.
- Shuffle: random; picks a random task
- Fields: mod hashing on a subset of tuple fields (see the sketch after this list)
- All grouping: goes to all tasks
- Global: task with lowest id
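To make the fields-grouping bullet concrete, here is a minimal plain-Java sketch of the idea (not Storm's actual implementation): the target task is derived from a hash of the grouping field, so tuples with equal field values always land on the same task.

```java
import java.util.Objects;

public class FieldsGroupingSketch {
    // Hypothetical helper: pick a target task index from the grouping field.
    // Same "word" value -> same task, which is what a fields grouping guarantees.
    static int targetTask(String word, int numTasks) {
        return Math.floorMod(Objects.hash(word), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        for (String w : new String[]{"storm", "storm", "trident"}) {
            System.out.println(w + " -> task " + targetTask(w, numTasks));
        }
    }
}
```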
TopologyBuilder is used to construct topologies in Java
- you define spouts, which grab data from somewhere
- you define bolts, e.g. split sentences coming from the spout via a shuffle grouping, etc. (see the sketch below)
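A hedged sketch of that TopologyBuilder wiring for the streaming word count, roughly following the storm-starter pattern. The sentence spout and split bolt are assumed to exist elsewhere and are passed in; the WordCount bolt here is illustrative.

```java
import backtype.storm.generated.StormTopology;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.IRichSpout;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {

    // Counts words in memory. The fields grouping on "word" guarantees that a
    // given word always arrives at the same task, so a per-task map is enough.
    public static class WordCount extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<String, Integer>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getStringByField("word");
            Integer count = counts.get(word);
            count = (count == null) ? 1 : count + 1;
            counts.put(word, count);
            collector.emit(new Values(word, count));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    // The spout and the sentence-splitting bolt are assumed to be defined
    // elsewhere (storm-starter has equivalents) and are passed in here.
    public static StormTopology build(IRichSpout sentenceSpout, IRichBolt splitBolt) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", sentenceSpout, 2);
        builder.setBolt("split", splitBolt, 4)
               .shuffleGrouping("sentences");                // random task
        builder.setBolt("count", new WordCount(), 4)
               .fieldsGrouping("split", new Fields("word")); // same word -> same task
        return builder.createTopology();
    }
}
```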
splitsentence.py: class SplitSentenceBolt(storm.BasicBolt)... (bolts can also be written in non-JVM languages via the multilang protocol)
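For reference, the JVM side of such a multilang bolt is a thin ShellBolt wrapper that launches the Python process; this is roughly the pattern used in storm-starter (names assumed):

```java
import backtype.storm.task.ShellBolt;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;

import java.util.Map;

// JVM-side wrapper: Storm starts `python splitsentence.py` as a subprocess and
// talks to it over the multilang protocol; the wrapper only declares output fields.
public class SplitSentence extends ShellBolt implements IRichBolt {

    public SplitSentence() {
        super("python", "splitsentence.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
```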
Submit topologies to a Storm cluster to be run. You can also simulate this with local mode, which creates an in-process cluster.
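A small sketch of both submission paths, assuming a StormTopology built as in the earlier sketch: LocalCluster gives the in-process local mode, StormSubmitter hands the packaged jar to Nimbus.

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.StormSubmitter;
import backtype.storm.generated.StormTopology;

public class SubmitSketch {

    // Local mode: an in-process "cluster", handy for development and testing.
    static void runLocally(String name, StormTopology topology) throws Exception {
        Config conf = new Config();
        conf.setDebug(true);
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology(name, conf, topology);
        Thread.sleep(10000);           // let it run for a while
        cluster.killTopology(name);
        cluster.shutdown();
    }

    // Real cluster: the jar is packaged and started with `storm jar`,
    // and Nimbus distributes the work to the supervisors.
    static void runOnCluster(String name, StormTopology topology) throws Exception {
        StormSubmitter.submitTopology(name, new Config(), topology);
    }
}
```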
Streaming word count
- stream processing (find source)
Distributed RPC
- the client sends a function name and args to the DRPC server; the server streams out the function invocations; a topology processes that stream and sends results back to the DRPC server, which returns them to the client (sketch after this list)
- compute "reach" of a URL on the fly -- find followers of followers of followers and count -- spout emits requests to get tweeters, emits get followers, distincts on the set, aggregate count
Uhh.. too fast, read examples online
What happens to tuples? The "tuple tree": bolts emit new tuples based on their input, which creates a tree of tuples
- a spout tuple is not considered fully processed until the whole tree has been processed
- Storm tracks the tuple trees and lets the spout know when they are complete
- tracking a tuple tree takes a constant amount of space, and this removes the need for message brokers
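A sketch of how a bolt participates in the tuple tree: emitting with the input tuple as the anchor adds a child to the tree, and acking marks that node as done (field names here are illustrative).

```java
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.Map;

// Splits sentences; anchoring + ack is what lets Storm track the tuple tree
// and replay the spout tuple if anything downstream fails.
public class AnchoredSplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        for (String word : input.getString(0).split(" ")) {
            // Anchor the new tuple to the input: it becomes a child in the tuple tree.
            collector.emit(input, new Values(word));
        }
        collector.ack(input);  // this node of the tree is complete
        // collector.fail(input) would instead trigger a replay from the spout.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```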
How do you count with an at-least-once delivery guarantee? Transactional topologies
- Enables fault tolerant, exactly-once messaging semantics
- Process small batches of tuples at a time
- If part of a batch fails, the batch is retried
- bolts can be committers
- commits are ordered
- you don't move on until the batch is committed -- only commit if the transaction id is different (sketch below)
- Multiple batches can still be processed in parallel -- lower latency -- higher throughput
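The "only commit if the transaction id is different" rule is what turns retried batches into exactly-once counts: store the txid next to the value and skip updates that carry the same txid. A plain-Java sketch against a hypothetical in-memory store, not the actual transactional-topologies API:

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotentCommitSketch {
    // Value stored per key: the count plus the txid of the batch that last updated it.
    static class Stored {
        long count;
        long txid;
        Stored(long count, long txid) { this.count = count; this.txid = txid; }
    }

    private final Map<String, Stored> store = new HashMap<String, Stored>();  // stand-in for a real DB

    // Called when a batch commits. Commits are ordered by txid, so if the stored
    // txid already equals this batch's txid, the batch was applied before a retry
    // and must not be applied again.
    void commitBatch(long txid, String key, long batchDelta) {
        Stored current = store.get(key);
        if (current == null) {
            store.put(key, new Stored(batchDelta, txid));
        } else if (current.txid != txid) {
            store.put(key, new Stored(current.count + batchDelta, txid));
        }
        // else: same txid -> retried batch, already counted, skip.
    }

    public static void main(String[] args) {
        IdempotentCommitSketch s = new IdempotentCommitSketch();
        s.commitBatch(1, "storm", 3);
        s.commitBatch(1, "storm", 3);  // retry of the same batch: no double count
        s.commitBatch(2, "storm", 2);
        System.out.println(s.store.get("storm").count);  // 5
    }
}
```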
Trident: a high-level abstraction for Storm. State + stream processing + DRPC, with exactly-once consistent semantics
Demo: streaming word count, with DRPC queries to get the sum of the counts of words
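A hedged sketch of that Trident demo, closely following the word-count example in the Trident tutorial (FixedBatchSpout and MemoryMapState come from Storm's testing utilities; exact class names may differ by version):

```java
import backtype.storm.LocalDRPC;
import backtype.storm.generated.StormTopology;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentState;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.operation.builtin.FilterNull;
import storm.trident.operation.builtin.MapGet;
import storm.trident.operation.builtin.Sum;
import storm.trident.testing.FixedBatchSpout;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;

public class TridentWordCountSketch {

    // Splits a sentence into words, one emitted tuple per word.
    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            for (String word : tuple.getString(0).split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static StormTopology build(LocalDRPC drpc) {
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                new Values("the cow jumped over the moon"),
                new Values("four score and seven years ago"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();

        // Persistent word counts kept as Trident-managed state with
        // exactly-once update semantics.
        TridentState wordCounts =
                topology.newStream("sentences", spout)
                        .each(new Fields("sentence"), new Split(), new Fields("word"))
                        .groupBy(new Fields("word"))
                        .persistentAggregate(new MemoryMapState.Factory(),
                                             new Count(), new Fields("count"));

        // DRPC query: split the request args into words, look up each word's
        // count in the state, and sum them for the reply.
        topology.newDRPCStream("words", drpc)
                .each(new Fields("args"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
                .each(new Fields("count"), new FilterNull())
                .aggregate(new Fields("count"), new Sum(), new Fields("sum"));

        return topology.build();
    }
}
```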