Anivia is Walmart's mobile analytics platform. It collects user-interaction metrics from mobile devices -- iPhone, iPad, Android, and mWeb -- and also processes logging and other metrics from a number of mobile services. Anivia gives the business real-time insight and reporting into what is going on in the mobile business, and provides vital capabilities for developers and operations folks to monitor the health of their services.
Anivia is built on Node.js, Hapi, RabbitMQ, and a multitude of downstream systems including Splunk and Omniture. Anivia takes in 7,000 events per second on average (as of this writing), which after some fan-out and demuxing comes out to around 20,000 messages per second in flight. These rates are expected to soar leading up to and through Black Friday. The platform has grown in recent months to over 1,000 Node.js processes spanning multiple data centers, gaining features such as link resiliency in the process. Its core processing responsibilities are:
- Timestamp Correction for misconfigured client devices
- Demuxing to allow clients to send batched payloads
- Transformation/Mutation/Decoration of events to make them digestible and useful to downstream systems
- Forwarding to fan data out to any number of downstream systems that can use it
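The timestamp-correction and demuxing steps above can be sketched as plain functions. This is a hypothetical illustration only: the field names (`events`, `destinations`, `ts`) and the 5-minute skew threshold are assumptions, not Anivia's actual schema.

```javascript
// Hypothetical sketch of Anivia-style per-event processing.
// Field names and thresholds are illustrative assumptions.

// Correct a client timestamp that drifts too far from server time
// (e.g. a device with a badly configured clock).
function correctTimestamp(event, serverNow, maxSkewMs) {
  if (Math.abs(event.ts - serverNow) > maxSkewMs) {
    return Object.assign({}, event, { ts: serverNow, tsCorrected: true });
  }
  return event;
}

// Split a batched payload into one message per event per destination,
// decorating each copy with routing info for its downstream system.
function demux(batch, serverNow) {
  const messages = [];
  for (const raw of batch.events) {
    const event = correctTimestamp(raw, serverNow, 5 * 60 * 1000);
    for (const destination of batch.destinations) {
      messages.push({ destination, event });
    }
  }
  return messages;
}
```

Events bound for multiple downstream systems are duplicated here, which is what allows each copy to be queued and delivered independently.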
With a few exceptions, Anivia is data agnostic and does not perform data aggregation. It relies on downstream systems (such as Splunk, Omniture, and many more) to crunch the numbers. The major components are:
- Elmer - Hapi web server responsible for collecting events as HTTP requests
- RabbitMQ - The message bus and safe zone
- Prospector - Responsible for consuming events, transforming them, and sending them downstream
- Splunk - The system of record for all captured events
- Anivia primarily receives analytics messages via Elmer. Elmer performs absolutely no processing on the messages. Its sole purpose is to get the message into RabbitMQ as quickly as possible, where the message will remain safe until delivered to its final destination(s).
- Once in RabbitMQ, the message waits (typically less than a few milliseconds) to be picked up by Prospector. Prospector will inspect the message, and break it apart into multiple messages, each containing a single event destined for a single downstream system (many events are destined for multiple downstream systems, so these will be duplicated). These messages are re-queued into RabbitMQ, where they wait to be picked up again.
- Prospector will pick up the re-queued, demuxed messages and deliver them downstream, to their final destination.
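The first step above -- Elmer accepting a request and enqueueing it untouched -- can be sketched roughly as follows. The handler shape is illustrative, not Anivia's actual code; `publish` stands in for a RabbitMQ channel publish (such as amqplib's `channel.sendToQueue`) and is injected so the handler can be shown in isolation.

```javascript
// Hedged sketch of the Elmer pattern: accept the HTTP request and hand
// the raw payload straight to the message bus with no processing.
// `publish` is a stand-in for a real RabbitMQ publish call.
function makeCollectHandler(publish) {
  return function collect(request, reply) {
    // No inspection or transformation here: the only job is to get
    // the message into the queue as fast as possible.
    publish(Buffer.from(JSON.stringify(request.payload)));
    // 202 Accepted: the event is queued, not yet delivered downstream.
    reply({ ok: true }).code(202);
  };
}
```

Keeping the collector this thin is what lets RabbitMQ act as the "safe zone": once the message is queued, a crash or slowdown anywhere downstream cannot lose it.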
By demuxing and re-queueing events individually, we protect ourselves against backpressure from downstream systems.
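To make the backpressure point concrete, here is a small in-memory model of per-destination queues. In the real platform RabbitMQ plays this role (and a consumer's prefetch limit bounds how much it pulls at once); the class and method names below are illustrative assumptions, not Anivia's code.

```javascript
// Illustrative in-memory model of per-destination queues. In Anivia,
// RabbitMQ provides this isolation; names here are hypothetical.
class DestinationQueue {
  constructor(name) {
    this.name = name;
    this.pending = [];
  }
  enqueue(message) {
    this.pending.push(message);
  }
  // A consumer pulls at most `maxMessages` at a time (analogous to a
  // RabbitMQ prefetch limit), so a slow destination simply accumulates
  // backlog in its own queue instead of stalling the whole pipeline.
  drain(maxMessages) {
    return this.pending.splice(0, maxMessages);
  }
}

// Route demuxed messages into one queue per destination.
function routeMessages(messages, queues) {
  for (const msg of messages) {
    if (!queues.has(msg.destination)) {
      queues.set(msg.destination, new DestinationQueue(msg.destination));
    }
    queues.get(msg.destination).enqueue(msg);
  }
}
```

Because each demuxed message targets exactly one destination, a sluggish Splunk indexer only grows the Splunk queue; Omniture-bound copies of the same events keep flowing at full speed.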