Skip to content

Instantly share code, notes, and snippets.

@gthomas
Created July 20, 2012 17:44
Show Gist options
  • Save gthomas/3152172 to your computer and use it in GitHub Desktop.
Save gthomas/3152172 to your computer and use it in GitHub Desktop.
Storm

Storm: Distributed and Fault-toleration realtime computing

https://github.com/nathanmarz/storm/

Seems like far more flexible, reliable solution than Celery and the ilk Define topologies that manage jobs, their assignment to workers, aggregation of return values. Highly parallel, reliable etc..

  • Motivation behind storm
  • Intro to Storm
  • Trident demo (abstraction layer for Storm)

Before Storm

Queues and workers

  • jobs go into queue, workers work them, put results into another queues, more workers push to db etc...
  • in the specific case, was important to match workers to keys every time
  • didn't scale (all workers must be re-deployed dues to hashing)
  • Faul tolerance is pooer
  • Coding is tedious (low level of abstraction, message serialization, where messages go, biz logic takes backseat)

What we want Instead

  • Guaranteed processing
  • Horizontal scalability
  • Fault-toleration
  • Remove queuing brokers (no more intermediate message brokers)
  • Higher level abstraction than message passing message passing is a lower level of abstraction
  • "Just works" realtime system needs to be robust
  • Storm has 'em all!

Performance

  • 1M tuple per sec per node (100 byte tuples)
  • 1.6M 10 byte tuples

Use cases

Steam processing Distributed RPC Continuous computation

Storm Cluster

Nimbus

  • master node Zookeeper

  • Coordinate cluster Workers w/Supervisor

  • Starting topology storm jar jar name

  • Kill it storm kill

Streams

Unbounded sequence of

Spouts

Produce streams

Bolts

Execute tasks Filters, aggregations, etc

Topology is a collection streams, spouts and bolts

Stream grouping

When a tuple is emittec where does it go? Stream grouping tells Storm how to partition the streams.

  • Shuffle random, pick a random task
  • Fields: mod hashing on a subset of tuple fields
  • All grouping: goes to all tasks
  • Global: task with lowest id

Streaming word count

TopologyBuilder is used to construct topologies in Java can define spouts

  • Grab data from somewhere can define bolt
  • Split sentences from spout via shuffle grouping etc...

Splitsentence.py class SplitSentenceBold(storm.Basic)...

Submit topologies to a Storm cluster to be run You can simulate these with a local mode, which creates a cluster locally

Demo

Streaming word count

  • Stream procesing (find source) Distributed RPC
  • send function and args to server, stream function indications, topology processing stream and sends result back to drpc server, comes back to client
  • compute "reach" of a URL on the fly -- find followers of followers of followers and count -- spout emits requests to get tweeters, emits get followers, distincts on the set, aggregate count

Uhh.. too fast, read examples online

Guaranteeding message processing

What happens to tuples? "Tuple tree" bolts emit new tuples based on input creates a tree of tuples

  • a spout is not exhausted until the whole tree is processed
  • storm track the tuple trees, and lets the spout know when the tuple trees are complete
  • tuple trees need a constant amount of space to track and removes need for message brokers

How do you count with an at leas one delivery guarantee? Transactional topologies

  • Enables fault tolerant, exactly-once messaging semantics
  • Processing small batches of tuples at one time
  • If a part of batch fails, batch is retried
  • bolts can be commiters
  • commits are ordered
  • you don't move on until the batch is commited -- only commite if transaction is different
  • Multiple batches can still be processed in parallel -- lower latency -- higher throughput

Trident

High level abstraction for Storm State + stream processing + DRPC Exactly-once consistent semantics

Example

Streaming word count DRPC queries to get sum of counts of words

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment