- deploy with deb packages
- no daemon (rely on an external process manager)
- log to syslog
- take a config file as an argument (entrypoint sketched after this list)
- focus on one stream only
- consume the stream in real time (subject to rate limiting)
- aggregate in memory
- dump into a normalized s3 object
- store checkpoint on disk
- notify of the new batch by SNS (with ample details: stream, location, hash)
- push metrics to cloudwatch / statsd
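A minimal entrypoint sketch covering the config-file argument and syslog logging. Python 3 with argparse, configparser, and SysLogHandler is assumed; all names are illustrative, not final.

```python
import argparse
import configparser
import logging
import logging.handlers


def parse_args():
    # The config file is the single positional argument.
    parser = argparse.ArgumentParser(description="kinesis stream archiver")
    parser.add_argument("config", help="path to the configuration file")
    return parser.parse_args()


def setup_syslog_logging(tag="kinesis-archiver"):
    # No daemonization and no log files: log to syslog and let the external
    # process manager handle lifecycle and output.
    handler = logging.handlers.SysLogHandler(address="/dev/log")
    handler.setFormatter(logging.Formatter(tag + ": %(levelname)s %(message)s"))
    logging.getLogger().addHandler(handler)
    logging.getLogger().setLevel(logging.INFO)


def main():
    args = parse_args()
    config = configparser.ConfigParser()
    config.read(args.config)
    setup_syslog_logging()
    # ...hand off to the supervisor loop described below...


if __name__ == "__main__":
    main()
```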
boto connection pools for:
- kinesis
- s3
- cloudwatch
- sns
python-statsd
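Client setup sketch. The note says boto, so the boto3 calls below are an assumption, as are the region, namespace, statsd host/port, and metric names; the python-statsd Connection/Counter usage is also assumed rather than confirmed by the note.

```python
import boto3
import statsd  # python-statsd

# One boto3 client per service; each keeps its own connection pool and can be
# shared by the supervisor and the shard activities.
session = boto3.session.Session(region_name="us-east-1")  # illustrative region
kinesis = session.client("kinesis")
s3 = session.client("s3")
cloudwatch = session.client("cloudwatch")
sns = session.client("sns")

# python-statsd: set the connection defaults once, then create named metrics.
statsd.Connection.set_defaults(host="127.0.0.1", port=8125)  # illustrative
records_counter = statsd.Counter("archiver.records")


def push_record_count(count):
    # Push the same measurement to both statsd and CloudWatch.
    records_counter.increment(delta=count)
    cloudwatch.put_metric_data(
        Namespace="KinesisArchiver",  # illustrative namespace
        MetricData=[{"MetricName": "Records", "Value": count, "Unit": "Count"}])
```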
connect to kinesis
loop
get stream description
wait for the stream to exist or become active
start / stop shard activities according to the current shard list
on SIGTERM
- signal seppuku to shard activities
- join
- exit
on SIGHUP or periodically
- signal seppuku to shard activities
- join
- goto loop
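A sketch of that supervisor loop, assuming one thread per shard, a threading.Event as the seppuku signal, and boto3's describe_stream; the refresh interval and names are illustrative.

```python
import logging
import signal
import threading
import time


class ShardActivity(threading.Thread):
    """One worker per shard; 'seppuku' is its stop signal."""

    def __init__(self, shard_id):
        super().__init__(name="shard-" + shard_id)
        self.shard_id = shard_id
        self.seppuku = threading.Event()

    def run(self):
        # The per-shard loop is sketched further below.
        while not self.seppuku.is_set():
            time.sleep(1)


def supervise(kinesis, stream_name, refresh_interval=900):
    wake = threading.Event()      # set by SIGHUP, SIGTERM, or the refresh timeout
    shutdown = threading.Event()  # set by SIGTERM only

    def on_sigterm(*_):
        shutdown.set()
        wake.set()

    signal.signal(signal.SIGTERM, on_sigterm)
    signal.signal(signal.SIGHUP, lambda *_: wake.set())

    while not shutdown.is_set():
        # Wait for the stream to exist and become ACTIVE.
        while True:
            if shutdown.is_set():
                return
            try:
                description = kinesis.describe_stream(StreamName=stream_name)
                if description["StreamDescription"]["StreamStatus"] == "ACTIVE":
                    break
            except kinesis.exceptions.ResourceNotFoundException:
                logging.info("stream %s does not exist yet", stream_name)
            time.sleep(5)

        # One activity per shard (shard pagination omitted for brevity).
        shards = description["StreamDescription"]["Shards"]
        activities = [ShardActivity(shard["ShardId"]) for shard in shards]
        for activity in activities:
            activity.start()

        # Sleep until SIGHUP, SIGTERM, or the periodic refresh interval.
        wake.wait(timeout=refresh_interval)
        wake.clear()

        # Signal seppuku and join; re-loop on SIGHUP / refresh, exit on SIGTERM.
        for activity in activities:
            activity.seppuku.set()
        for activity in activities:
            activity.join()
```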
connect to kinesis
read checkpoint from disk
get shard iterator from checkpoint or TRIM_HORIZON
loop
get_records
accumulate in memory
if time_limit or record_limit reached, or seppuku signaled
- serialize batch
- upload batch to s3
- store batch to local disk
- store checkpoint
- send notification by SNS
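A sketch of one shard activity under illustrative assumptions: boto3 calls, gzipped JSON-lines as the normalized format, a sha256 digest in the notification, a plain sequence-number file as the checkpoint, UTF-8 record payloads, and made-up limits and paths.

```python
import gzip
import hashlib
import json
import os
import time


def run_shard(kinesis, s3, sns, stream_name, shard_id, seppuku,
              bucket, topic_arn, checkpoint_path, spool_dir,
              record_limit=10000, time_limit=300):
    # Resume from the on-disk checkpoint when there is one, else TRIM_HORIZON.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            last_sequence_number = f.read().strip()
        iterator = kinesis.get_shard_iterator(
            StreamName=stream_name, ShardId=shard_id,
            ShardIteratorType="AFTER_SEQUENCE_NUMBER",
            StartingSequenceNumber=last_sequence_number)["ShardIterator"]
    else:
        iterator = kinesis.get_shard_iterator(
            StreamName=stream_name, ShardId=shard_id,
            ShardIteratorType="TRIM_HORIZON")["ShardIterator"]

    batch, batch_started = [], time.time()
    while True:
        response = kinesis.get_records(ShardIterator=iterator, Limit=1000)
        batch.extend(response["Records"])
        iterator = response["NextShardIterator"]
        time.sleep(0.2)  # stay under the per-shard read rate limit

        expired = time.time() - batch_started >= time_limit
        if batch and (expired or len(batch) >= record_limit or seppuku.is_set()):
            # Serialize a normalized batch: one JSON document per line, gzipped.
            payload = gzip.compress(b"".join(
                json.dumps({"partition_key": r["PartitionKey"],
                            "sequence_number": r["SequenceNumber"],
                            "data": r["Data"].decode("utf-8")}).encode() + b"\n"
                for r in batch))
            digest = hashlib.sha256(payload).hexdigest()
            key = "%s/%s/%d.json.gz" % (stream_name, shard_id, int(time.time()))

            s3.put_object(Bucket=bucket, Key=key, Body=payload)
            with open(os.path.join(spool_dir, key.replace("/", "_")), "wb") as f:
                f.write(payload)  # local copy of the batch
            with open(checkpoint_path, "w") as f:
                f.write(batch[-1]["SequenceNumber"])  # checkpoint (atomic rename omitted)
            sns.publish(TopicArn=topic_arn, Subject="new batch", Message=json.dumps({
                "stream": stream_name, "shard": shard_id,
                "location": "s3://%s/%s" % (bucket, key),
                "sha256": digest, "records": len(batch)}))
            batch, batch_started = [], time.time()

        if seppuku.is_set():
            return
        if iterator is None:
            return  # shard was closed by a reshard; not handled further here
```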