@nikkaroraa
Created September 25, 2020 08:56
Amazon Kinesis

AWS Kinesis

Overview

  • Kinesis is a managed alternative to Apache Kafka

  • Great for application logs, metrics, IoT, clickstreams

  • Great for "real-time" big data

  • Great for stream processing frameworks (Spark, NiFi, etc.)

  • Data is automatically replicated across 3 Availability Zones (AZs)

  • Kinesis Streams: low latency streaming ingest at scale

  • Kinesis Analytics: perform real-time analytics on streams using SQL

  • Kinesis Firehose: load streams into S3, Redshift, Elasticsearch

Kinesis Streams Overview

  • Streams are divided into ordered shards / partitions
  • Data retention is 1 day by default and can be extended (up to 7 days when these notes were written; AWS has since raised the maximum to 365 days)
  • Ability to reprocess / replay data
  • Multiple applications can consume the same stream
  • Real-time processing with throughput that scales with the number of shards
  • Once data is inserted in Kinesis, it can't be deleted (immutability)

Kinesis Streams Shards

  • One stream is made of many different shards
  • 1 MB/s or 1,000 messages/s on write PER SHARD
  • 2 MB/s on read PER SHARD
  • Billing is per shard provisioned; you can have as many shards as you want (subject to account quotas)
  • Records can be sent in batches or one call per message
  • The number of shards can evolve over time (resharding: split / merge)
  • Records are ordered per shard
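
The per-shard limits above can be turned into a rough sizing estimate. The following is a minimal sketch, not an official formula: the function name `shards_needed` and the workload numbers are made up for illustration, and it assumes all consumers share the 2 MB/s read limit of each shard.

```python
import math

# Per-shard limits as listed in the notes above.
WRITE_MB_PER_SEC = 1.0
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2.0


def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    by_write_bytes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC)
    by_write_count = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC)
    by_read_bytes = math.ceil(read_mb_per_sec / READ_MB_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_write_bytes, by_write_count, by_read_bytes)


# Hypothetical workload: 3.5 MB/s in, 2,500 records/s,
# and two consumers that each read everything (7 MB/s out in total).
print(shards_needed(3.5, 2500, 7.0))  # -> 4
```

Whichever limit is hit first decides the shard count; here both write volume and total read volume require 4 shards.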

AWS Kinesis API - Put records

  • PutRecord API + Partition key that gets hashed
  • The same key goes to the same partition (helps with ordering for a specific key)
  • Messages sent get a "sequence number"
  • Choose a partition key that is highly distributed (helps prevent "hot partition")
    • user_id if many users
    • Not country_id if 90% of the users are in one country
  • Use batching with PutRecords to reduce costs and increase throughput
  • ProvisionedThroughputExceededException is raised if we go over the limits
  • Can use CLI, AWS SDK or producer libraries from various frameworks
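
To see why the choice of partition key matters, here is a small simulation of the hashing step. Kinesis hashes the partition key with MD5 into a 128-bit value and each shard owns a range of that hash space. The sketch assumes the shards split the range evenly (which may not hold after resharding), and the helper names and traffic data are made up for illustration.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit hash key


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


def load_per_shard(partition_keys, shard_count):
    """Count how many records land on each shard."""
    counts = Counter(shard_for_key(k, shard_count) for k in partition_keys)
    return [counts.get(i, 0) for i in range(shard_count)]


# Made-up traffic: 10,000 records keyed by user_id vs. by a skewed country_id.
by_user = [f"user-{i}" for i in range(10_000)]
by_country = ["IN"] * 9_000 + ["US"] * 1_000

print(load_per_shard(by_user, 4))     # roughly even spread across the 4 shards
print(load_per_shard(by_country, 4))  # at least 9,000 records on a single shard
```

With only two distinct country keys, at most two shards ever receive data, so adding shards does not relieve the hot one.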

AWS Kinesis API - Exceptions

  • ProvisionedThroughputExceededException

    • Raised when a producer exceeds the write limit (MB/s or records/s) of any shard
    • Check that you don't have a hot shard (a poorly distributed partition key sends too much data to one shard)
  • Solutions:

    • Retries with backoff
    • Increase shards (scaling)
    • Ensure your partition key is a good one
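
The "retries with backoff" item can be sketched as follows. This is a generic illustration, not SDK code: `ThroughputExceeded` and `send_with_backoff` are hypothetical names, and `send` stands for whatever call puts the record. With boto3 you would catch the client's `ProvisionedThroughputExceededException` instead.

```python
import random
import time


class ThroughputExceeded(Exception):
    """Stand-in for the SDK's throughput error (hypothetical name)."""


def send_with_backoff(send, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Call send() until it succeeds, waiting longer after each throttled attempt."""
    for attempt in range(max_attempts):
        try:
            return send()
        except ThroughputExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # Exponential backoff with jitter: wait a random time up to a growing cap.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))


# Usage with a fake producer that is throttled twice before succeeding.
calls = {"n": 0}


def flaky_put():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ThroughputExceeded("shard limit hit")
    return "ok"


print(send_with_backoff(flaky_put, sleep=lambda seconds: None))  # -> ok
```

The random jitter keeps many throttled producers from retrying at the same instant. Backoff only helps with short bursts; a sustained overload still needs more shards or a better partition key.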