This document is based on the following course:
Apache Kafka Series - Learn Apache Kafka for Beginners v3 https://learning.oreilly.com/course/apache-kafka-series/9781789342604/ by Stéphane Maarek
- able to adapt protocols (TCP, HTTP, FTP, etc.) to what the subscriber wants
- able to adapt data formats (binary, CSV, JSON, Avro, Protobuf)
- able to adapt the data schema (adding/removing fields)
- the publisher doesn't need to know the subscribers (where to send data, etc.)
- created by LinkedIn, now open source
- distributed, resilient, fault-tolerant architecture
- horizontal scalability:
  - can scale to 100s of brokers
  - can scale to millions of messages per second
- low latency (less than 10 ms), i.e. real time
- messaging
- activity tracking
- gathering metrics
- gathering application logs
- stream processing
- decoupling of system dependencies
- integration with Spark, Flink, Hadoop etc
- micro-service pub/sub
Example) Netflix applies recommendations in real time while you're watching a TV show
Topic
- Particular stream of data
- like a table in a database
- you can have as many topics as you want
- a topic is identified by its name
- any kind of message format
- JSON, Avro, Protobuf, etc.
- the sequence of messages is called data stream
Topics are split into partitions

Partitions are created per topic; each message within a partition gets an incremental id called an offset.
Important notes about topics:
- once data is written to a partition, it cannot be changed (immutability)
- message retention period: 1 week by default (configurable)
- order is guaranteed only within a partition (not across partitions)
- data is assigned randomly to a partition unless a key is provided
- we can have as many partitions per topic as we want (see the AdminClient sketch below)
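Since the partition count is chosen at topic creation time, topics are usually created explicitly up front. Below is a minimal sketch using the Java AdminClient from kafka-clients; the broker address, topic name, and partition/replication counts are illustrative assumptions.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreateTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // "demo-topic" with 3 partitions, replication factor 1 (single-broker dev setup)
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 1);
            admin.createTopics(List.of(topic)).all().get(); // block until the topic exists
        }
    }
}
```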
Producers
- write data to topics
- producers know in advance which partition (and broker) they write to
- in case of Kafka broker failures, producers automatically recover
Message keys
- a producer can choose to send a key with the message (string, int, binary, etc.)
- key == null: data is sent round robin across the partitions
- key != null: all messages for that key always go to the same partition (via hashing); see the producer sketch below
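A minimal producer sketch showing both cases; the broker address, topic name, and key/value strings are illustrative assumptions.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // key == null: the producer spreads records over the partitions
            // (round robin; recent client versions batch with a sticky partitioner)
            producer.send(new ProducerRecord<>("demo-topic", "value without a key"));

            // key != null: "truck_42" always hashes to the same partition,
            // so all messages for this truck stay in order
            producer.send(new ProducerRecord<>("demo-topic", "truck_42", "gps update"));

            producer.flush(); // ensure buffered records are sent before exiting
        }
    }
}
```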

Kafka message anatomy:
- key (binary, can be null)
- value (binary, can be null)
- compression type (none, gzip, snappy, lz4, zstd)
- headers (optional key-value pairs)
- partition + offset
- timestamp (set by the system or by the user)
Hashing method (Kafka's default partitioner for keyed messages):
targetPartition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions
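A toy sketch of that formula, using the public murmur2/toPositive helpers shipped in kafka-clients (they live in an internal utility class, so treat this as illustrative only); the key string and partition count are assumptions.

```java
import org.apache.kafka.common.utils.Utils;

import java.nio.charset.StandardCharsets;

public class PartitionForKey {
    public static void main(String[] args) {
        byte[] keyBytes = "truck_42".getBytes(StandardCharsets.UTF_8);
        int numPartitions = 3;

        // same formula the default partitioner applies to keyed records
        int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        System.out.println("truck_42 -> partition " + partition);
    }
}
```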
Consumers
- consumers read data from a topic (pull model)
- consumers automatically know which broker to read from
- in case of broker failures, consumers know how to recover
- data is read in order, from low to high offset, within each partition
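A minimal consumer sketch of the pull model; the broker address, group id, and topic name are illustrative assumptions.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // read from the beginning if no committed offset

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic"));
            while (true) {
                // pull model: the consumer asks the broker for records
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // offsets increase within each partition, so order holds per partition
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```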
For consumers, we can apply one of three delivery semantics:
- at least once: duplicates are possible, so the app must process messages idempotently
- at most once: messages might be lost, but are never re-processed
- exactly once: for Kafka-to-Kafka workflows, via the transactional API
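At-least-once is the most common choice. A sketch assuming manual offset commits: auto-commit is disabled and offsets are committed only after processing, so a crash mid-batch causes a re-read (hence the need for idempotent processing). Names are illustrative.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class AtLeastOnceDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // we commit offsets ourselves

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // must be idempotent: duplicates are possible on restart
                }
                consumer.commitSync(); // commit only after the whole batch is processed
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}
```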
A Kafka cluster is composed of multiple brokers (≈ servers). Each broker is identified by its ID. Each broker contains certain topic partitions (not all of them). After connecting to any broker (called a bootstrap broker), you are connected to the entire cluster.
Example) Topic-A has 3 partitions and Topic-B has 2 partitions: the 5 partitions are distributed across the brokers, so no single broker holds all the data of a topic.
Producer acknowledgments (acks)
Producers can choose from three settings, trading performance against durability:
- acks=0 (no wait): data may be lost if the leader fails before receiving or replicating it
- acks=1 (wait for the leader's ack): data may be lost if the leader fails before replicas have copied it
- acks=all (wait for leader + replicas): no data loss, but higher latency
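A sketch of how these settings look in producer configuration; the broker address is an assumption, and enable.idempotence is a related safety setting not covered in the notes above.

```java
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class DurableProducerConfig {
    public static Properties props() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // "0" = no wait, "1" = leader ack only, "all" = leader + in-sync replicas
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // retries won't create duplicates
        return props;
    }
}
```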