- Kinesis is a managed alternative to Apache Kafka
- Great for application logs, metrics, IoT, clickstreams
- Great for "real-time" big data
- Great for streaming processing frameworks (Spark, NiFi, etc.)
- Data is automatically replicated across 3 AZs
- Kinesis Streams: low latency streaming ingest at scale
- Kinesis Analytics: perform real-time analytics on streams using SQL
- Kinesis Firehose: load streams into S3, Redshift, ElasticSearch
- Streams are divided into ordered Shards / Partitions
- Data retention is 1 day by default, can go up to 7 days
- Ability to reprocess / replay data (see the consumer sketch below)
- Multiple applications can consume the same stream
- Real-time processing with scalable throughput
- Once data is inserted in Kinesis, it can't be deleted (immutability)
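
For a concrete sense of replay and of multiple independent consumers, here is a minimal boto3 read sketch (the stream name, region, and single-shard shortcut are placeholder assumptions, not from the notes). A `TRIM_HORIZON` iterator starts at the oldest retained record, which is what makes reprocessing possible:

```python
import boto3

# Placeholder stream name and region -- adjust for your environment
kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "my-stream"

# Pick one shard of the stream (a real consumer would iterate over all of them)
shard_id = kinesis.list_shards(StreamName=STREAM)["Shards"][0]["ShardId"]

# TRIM_HORIZON = start from the oldest record still in the retention window,
# which is how a new (or restarted) application replays the stream.
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # or LATEST / AT_SEQUENCE_NUMBER
)["ShardIterator"]

# Each GetRecords call returns a batch plus the iterator for the next call;
# several independent applications can read the same stream this way.
response = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in response["Records"]:
    print(record["SequenceNumber"], record["Data"])
```
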
- One stream is made of many different shards
- 1MB/s or 1000 messages/s at write PER SHARD
- 2MB/s at read PER SHARD
- Billing is per shard provisioned, can have as many shards as you want
- Batching available or per message calls
- The number of shards can evolve over time (reshard/merge)
- Records are ordered per shard
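
A rough provisioning sketch with boto3 (the stream name and shard counts are made up): create a stream with a fixed number of shards, then reshard later by changing the count:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "my-stream"  # placeholder name

# Each shard provides ~1 MB/s (or 1000 records/s) of write and 2 MB/s of
# read capacity; billing is per provisioned shard.
kinesis.create_stream(StreamName=STREAM, ShardCount=2)
kinesis.get_waiter("stream_exists").wait(StreamName=STREAM)

# Reshard over time: scale the shard count up (split) or down (merge).
kinesis.update_shard_count(
    StreamName=STREAM,
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)
```
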
- PutRecord API + Partition key that gets hashed
- The same key goes to the same partition (helps with ordering for a specific key)
- Messages sent get a "sequence number"
- Choose a partition key that is highly distributed (helps prevent "hot partition")
  - user_id if many users
  - Not country_id if 90% of the users are in one country
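
A minimal producer sketch with boto3 (stream name and payload are placeholders): PutRecord hashes the partition key to pick a shard and returns the shard id and sequence number of the stored record:

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# The partition key is hashed to choose a shard, so every record with the
# same user_id lands on the same shard and stays ordered for that user.
response = kinesis.put_record(
    StreamName="my-stream",  # placeholder stream name
    Data=json.dumps({"user_id": "u-123", "action": "click"}).encode(),
    PartitionKey="u-123",    # a highly distributed key such as user_id
)
print(response["ShardId"], response["SequenceNumber"])
```
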
- Use Batching with PutRecords to reduce costs and increase throughput
- ProvisionedThroughputExceeded if we go over the limits
- Can use CLI, AWS SDK, or producer libraries from various frameworks
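
And a sketch of the batched variant, PutRecords, again with placeholder data; one call can carry up to 500 records, and the response reports how many were rejected (e.g. throttled):

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Batch several records into a single PutRecords call (up to 500 per call)
# to reduce per-request overhead and cost.
records = [
    {
        "Data": json.dumps({"user_id": f"u-{i}", "action": "click"}).encode(),
        "PartitionKey": f"u-{i}",
    }
    for i in range(10)
]

response = kinesis.put_records(StreamName="my-stream", Records=records)

# Individual records can still fail inside a successful call; check the
# count and retry only the failed entries if needed.
print("Failed records:", response["FailedRecordCount"])
```
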
- ProvisionedThroughputExceeded exceptions: happen when sending more data than provisioned (exceeding MB/s or TPS for any shard)
- Make sure you don't have a hot shard (e.g. a bad partition key sending too much data into one partition)
- Solution:
  - Retries with backoff
  - Increase shards (scaling)
  - Ensure your partition key is a good one
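
A minimal retry-with-backoff sketch around PutRecord, assuming boto3; the helper name, delays, and attempt count are illustrative, not part of any library:

```python
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def put_with_backoff(stream, data, partition_key, max_attempts=5):
    """Illustrative helper: retry a throttled PutRecord with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return kinesis.put_record(
                StreamName=stream, Data=data, PartitionKey=partition_key
            )
        except kinesis.exceptions.ProvisionedThroughputExceededException:
            # Shard limit exceeded (1 MB/s or 1000 records/s): back off and retry.
            time.sleep((2 ** attempt) * 0.1)
    raise RuntimeError("Still throttled after retries; consider adding shards")

put_with_backoff("my-stream", b'{"action": "click"}', "u-123")
```

Backoff only buys time for a temporarily hot shard to drain; if throttling persists, adding shards or choosing a better-distributed partition key is the real fix.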