nikkaroraa/kinesis.md

Created September 25, 2020 08:56

Amazon Kinesis

AWS Kinesis

Overview

Kinesis is a managed alternative to Apache Kafka
Great for application logs, metrics, IoT, clickstreams
Great for "real-time" big data
Great for stream processing frameworks (Spark, NiFi, etc.)
Data is automatically replicated across 3 Availability Zones (AZs)
Kinesis Streams: low latency streaming ingest at scale
Kinesis Analytics: perform real-time analytics on streams using SQL
Kinesis Firehose: load streams into S3, Redshift, Elasticsearch

Kinesis Streams Overview

Streams are divided into ordered shards (partitions)
Data retention is 24 hours by default and can be extended up to 365 days
Ability to reprocess / replay data
Multiple applications can consume the same stream
Real-time processing with throughput that scales with the number of shards
Once data is inserted in Kinesis, it can't be deleted (immutability)

Kinesis Streams Shards

One stream is made of many different shards
Write: 1 MB/s or 1,000 messages/s PER SHARD
Read: 2 MB/s PER SHARD
Billing is per shard provisioned; you can have as many shards as your account quota allows
Records can be sent in batches or one call per message
The number of shards can evolve over time (reshard/merge)
Records are ordered per shard
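
As a rough illustration of the per-shard limits above, here is a minimal sizing sketch. The function name `shards_needed` and the example traffic figures are hypothetical; the limits are the ones quoted in these notes.

```python
import math

# Per-shard limits as quoted in the notes above.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def shards_needed(write_mb_s: float, write_records_s: float, read_mb_s: float) -> int:
    """Return the smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,  # a stream always has at least one shard
        math.ceil(write_mb_s / WRITE_MB_PER_SHARD),
        math.ceil(write_records_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_s / READ_MB_PER_SHARD),
    )


# Hypothetical workload: 3.5 MB/s in, 2,500 records/s in, 7 MB/s out.
print(shards_needed(write_mb_s=3.5, write_records_s=2500, read_mb_s=7.0))  # prints 4
```

Whichever limit is hit first drives the shard count, so a stream of many small records can need more shards than its MB/s figure alone suggests.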

AWS Kinesis API - Put records

PutRecord API + Partition key that gets hashed
The same key goes to the same partition (helps with ordering for a specific key)
Messages sent get a "sequence number"
Choose a partition key that is highly distributed (helps prevent "hot partition")
- user_id if many users
- Not country_id if 90% of the users are in one country
Use Batching with PutRecords to reduce costs and increase throughput
ProvisionedThroughputExceeded if we go over the limits
Can use CLI, AWS SDK or producer libraries from various frameworks
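
To see why the choice of partition key matters, the sketch below mimics the key-to-shard mapping locally. Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains it. The assumption here is that the shards split the hash space evenly, which is not guaranteed after resharding. The helper names and the traffic mix are hypothetical.

```python
import hashlib
from collections import Counter


def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index via its MD5 hash (a 128-bit integer)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    # Assumption: shards own equal, contiguous slices of the 128-bit hash space.
    range_size = 2**128 // num_shards
    return min(hash_value // range_size, num_shards - 1)


def load_per_shard(keys, num_shards):
    """Count how many records land on each shard for the given keys."""
    counts = Counter(shard_for_key(key, num_shards) for key in keys)
    return [counts.get(i, 0) for i in range(num_shards)]


# Hypothetical traffic: 10,000 events keyed two different ways.
user_keys = [f"user-{i}" for i in range(10_000)]
country_keys = ["IN"] * 9_000 + ["US"] * 600 + ["DE"] * 400

print("user_id keys:   ", load_per_shard(user_keys, 4))
print("country_id keys:", load_per_shard(country_keys, 4))
```

With many distinct user ids the records spread roughly evenly, while the country keys put at least 90% of the records on a single shard, which is the "hot partition" case.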

AWS Kinesis API - Exceptions

ProvisionedThroughputExceeded exceptions
- Happens when sending more data than a shard allows (exceeding MB/s or records/s for any shard)
- Make sure you don't have a hot shard (for example, a poorly chosen partition key that sends too much data to one shard)
Solutions:
- Retries with backoff
- Increase shards (scaling)
- Ensure your partition key is a good one
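
The "retries with backoff" solution can be sketched as follows. This is a generic exponential backoff with jitter, not code from any AWS SDK: `ThroughputExceeded` is a hypothetical stand-in for the service error, and `put` is any callable you supply that sends one record.

```python
import random
import time


class ThroughputExceeded(Exception):
    """Hypothetical stand-in for the ProvisionedThroughputExceeded error."""


def put_with_backoff(put, record, max_attempts=5, base_delay=0.1,
                     max_delay=5.0, sleep=time.sleep):
    """Call put(record), retrying throttled calls with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return put(record)
        except ThroughputExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))


# Usage with a fake producer that is throttled twice before succeeding.
calls = {"n": 0}


def flaky_put(record):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ThroughputExceeded()
    return f"stored:{record}"


print(put_with_backoff(flaky_put, "event-1", sleep=lambda seconds: None))  # prints stored:event-1
```

The random jitter keeps many throttled producers from retrying at the same instant. Backoff only helps with short bursts; a persistently hot shard still needs more shards or a better partition key.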