- An hourly rotated log system (see the sketch after this list)
- Issues:
- EOF generation and propagation
- Manual intervention required to continue after errors
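A minimal sketch of that old approach, assuming a writer that appends events to an hourly-named file and drops an end-of-hour marker once the hour rolls over. The paths, file names and the `.eof` convention are made up for illustration, not Spotify's actual format:

```python
import time
from pathlib import Path

LOG_DIR = Path("event_logs")  # hypothetical location

def hourly_path(ts):
    """Return the log file for the hour containing timestamp ts."""
    return LOG_DIR / time.strftime("%Y-%m-%d-%H.log", time.gmtime(ts))

def append_event(line, ts=None):
    """Append one event line to the current hourly log file."""
    ts = time.time() if ts is None else ts
    LOG_DIR.mkdir(parents=True, exist_ok=True)
    previous = hourly_path(ts - 3600)
    # On rollover, mark the previous hour as complete. Downstream jobs wait for
    # this end-of-file marker before processing the hour; generating and
    # propagating it reliably across machines is what made the design brittle.
    if previous.exists() and not previous.with_suffix(".eof").exists():
        previous.with_suffix(".eof").touch()
    with hourly_path(ts).open("a") as f:
        f.write(line.rstrip("\n") + "\n")
```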
https://labs.spotify.com/2016/02/25/spotifys-event-delivery-the-road-to-the-cloud-part-i/
- Key requirement: to deliver complete data with a predictable latency and make it available to our developers via a well-defined interface.
- Event (structured data) as unit of streamed information.
- syslog as the events source
- Completeness vs latency trade-off
- Data flow:
- Hourly buckets
- Avro format in destination (Hadoop); see the Avro sketch after this list
- Check out Apache Crunch
- Issues:
- Highly coupled system
- Transmitting the data across DCs weakens the system
- ...
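A rough sketch of "structured events, serialised as Avro, bucketed by hour", using the fastavro library. The `SongPlayed` schema and file naming are made up purely for illustration, not Spotify's real schemas or layout:

```python
import time
from fastavro import parse_schema, writer

# Hypothetical schema for a single event type; every event type gets its own schema.
SONG_PLAYED = parse_schema({
    "type": "record",
    "name": "SongPlayed",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "track_id", "type": "string"},
        {"name": "timestamp", "type": "long"},  # epoch milliseconds
    ],
})

def write_hourly_bucket(events, hour_ts):
    """Write one hour's worth of events of one event type into its own Avro file."""
    bucket = time.strftime("songplayed-%Y-%m-%d-%H.avro", time.gmtime(hour_ts))
    with open(bucket, "wb") as out:
        writer(out, SONG_PLAYED, events)

write_hourly_bucket(
    [{"user_id": "u1", "track_id": "t1", "timestamp": 1456358400000}],
    hour_ts=1456358400,
)
```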
https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/ & https://labs.spotify.com/2016/03/10/spotifys-event-delivery-the-road-to-the-cloud-part-iii/ & Reliable export of Pub/Sub streams to Cloud Storage
- Introduce a queue for:
- Low latency
- Reliability
- Persistence of undelivered messages, even in the face of errors on other systems (such as Hadoop)
- Add structure to data early on
- One topic per event type
- From Kafka to Google Cloud Pub/Sub.
- To avoid Kafka's instabilities
- Load testing
- Google Cloud clients are auto-generated and not necessarily efficient. Luckily the exposed API is clear enough to write your own.
- Batch and compress data (sketch below).
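Spotify wrote their own JVM publisher, so the snippet below is only an illustration of the same ideas: one topic per event type, batched publishes, and a gzip-compressed payload, using the standard google-cloud-pubsub Python client. Project and topic names are hypothetical:

```python
import gzip
import json
from google.cloud import pubsub_v1

# Batching groups many small events into fewer publish calls (values are illustrative).
batch_settings = pubsub_v1.types.BatchSettings(
    max_messages=1000,
    max_bytes=1024 * 1024,
    max_latency=0.05,  # seconds
)
publisher = pubsub_v1.PublisherClient(batch_settings=batch_settings)

# One topic per event type keeps schemas and consumers simple.
topic_path = publisher.topic_path("my-gcp-project", "song-played")

def publish_batch(events):
    """Gzip a batch of JSON events and publish them as a single message."""
    payload = gzip.compress(json.dumps(events).encode("utf-8"))
    future = publisher.publish(topic_path, data=payload, codec="gzip")
    return future.result()  # blocks until Pub/Sub has accepted the message
```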
- From Hadoop + Hive to GCS + BigQuery
- Reduce operational burden (have others solve the hard problems for us)
- Dataflow:
- Offers both batch and streaming; Crunch is batch-only
- Windowing to partition streams by time (see the windowing sketch further down this list)
- GroupByKey works in memory
- Watermark feature to close a window
- Avro as the output format
- One topic per data type
- Only ACK when the data is in the final destination (final GCS bucket).
- Bucket the data
- Each consumer is auto-scaled based on CPU usage, to optimize resource utilisation.
- Data may arrive late. When should we close a bucket, then?
- Data, once saved, is immutable (no backfilling should occur for late events)
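A toy Apache Beam sketch (Beam is the open-sourced Dataflow model; Spotify used the Java Dataflow SDK) of the windowing described above: events carry their own timestamps, FixedWindows buckets them per hour, the watermark decides when a bucket closes, and allowed_lateness=0 keeps a closed bucket immutable. The event contents and the final write step are placeholders, not the real job:

```python
import apache_beam as beam
from apache_beam.transforms import trigger, window

# Toy input; in production this is a streaming read from one Pub/Sub topic per event type.
events = [
    {"type": "song_played", "ts": 1456358400, "track": "t1"},
    {"type": "song_played", "ts": 1456358500, "track": "t2"},
    {"type": "song_played", "ts": 1456362100, "track": "t3"},  # falls into the next hour
]

with beam.Pipeline() as p:  # DirectRunner here; swap options for DataflowRunner
    (p
     | beam.Create(events)
     | "Stamp" >> beam.Map(lambda e: window.TimestampedValue(e, e["ts"]))
     | "HourlyWindow" >> beam.WindowInto(
           window.FixedWindows(60 * 60),                         # hourly buckets
           trigger=trigger.AfterWatermark(),                     # watermark closes the bucket
           accumulation_mode=trigger.AccumulationMode.DISCARDING,
           allowed_lateness=0)                                   # late data never reopens a bucket
     | "KeyByType" >> beam.Map(lambda e: (e["type"], e["track"]))
     | beam.GroupByKey()                                         # one element per (event type, hour)
     | beam.Map(print)                                           # real job: write Avro to GCS, then ACK
    )
```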
- Reduce operational burden:
- Recover automatically from errors where possible; otherwise you can't scale (sketch below)
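One generic way to read "recover automatically", not Spotify's actual mechanism: wrap flaky calls in a retry with exponential backoff and jitter, so transient failures downstream never require a human:

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn, retrying with exponential backoff and jitter on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # only now does a human need to look at it
            time.sleep(base_delay * 2 ** (attempt - 1) * (0.5 + random.random()))

# e.g. with_retries(lambda: publish_batch(events)) around the Pub/Sub sketch above
```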