Summarized notes from the presentation

Brick-by-Brick: Exploring the Elements of Apache Kafka®

Speaker: Danica Fine
YouTube Link: YouTube


Kafka Fundamentals

  • Apache Kafka is a distributed event streaming platform that enables real-time data applications. It supports:
    • Building systems that are reactive, accurate, loosely coupled, and resilient.
    • Various data-streaming patterns, such as:
      • Publish/Subscribe
      • Queuing
      • Broadcasting
      • Batch processing

Events in Kafka (the most important concept)

  • Events in Kafka are records of something that has happened, defined by:
    • A timestamp and a description of what occurred.
    • Examples: Adding an item to an online cart or tracking the location of a ship.
  • Immutability: Because an event describes something that already happened, it can't be changed after it's written.

Kafka Topics

  • Kafka topics are logs, not queues.
    • Events persist even after being read by consumers.
    • Events in topics are ordered, immutable, and assigned a monotonically increasing offset.
  • Durability: Topics are append-only logs that can store data indefinitely, with configurable cleanup policies based on time or size.
  • Partitions: Topics are divided into partitions, which are durable logs as well.
    • Partitions allow for scalability, but require careful configuration to balance performance.

Kafka Brokers and Replication

  • Brokers are the nodes in a Kafka cluster.
    • Partitions are distributed across brokers to ensure even load.
    • Replication ensures that partitions have multiple copies across different brokers for fault tolerance.

Writing Data to Kafka

  • Producer: When writing to Kafka, the required fields are:
    • Topic: Defines where the event is written.
    • Value: The event or message itself.
  • Kafka supports producers in multiple programming languages.
  • Serialization: Kafka only processes data in byte format, so data needs to be serialized (e.g., using Avro, JSON, or Protobuf).
  • Partitioning: Producers either assign a partition explicitly or rely on the default partitioner, which routes records with the same key to the same partition.

Reading Data from Kafka

  • Consumers can start reading from the:
    • Earliest event,
    • Most recent event,
    • Or from a specific offset or timestamp.
  • Offset tracking: Consumers commit their processed offset back to Kafka to ensure they can resume where they left off in case of failure.
  • Consumer groups: Multiple consumers can share a group to process a topic in parallel; each partition is assigned to exactly one consumer in the group.

Kafka Ecosystem

  • Schema Registry: Manages and maintains schemas across topics, supporting schema evolution and compatibility.

  • Kafka Connect:

    • A framework that connects Kafka with other data systems (sources and sinks).
    • Offers low-code/no-code options for easy integration.
    • Can be used to move data from Kafka to other systems or into Kafka.
  • Kafka Streams:

    • A Java/Scala library for stream processing with built-in support for stateful processing (a minimal example follows this list).
    • ksqlDB: A SQL-based interface for stream processing built on Kafka Streams, making stream processing more accessible.
  • Other Frameworks: Kafka can integrate with tools like Apache Spark and Apache Flink for stream processing that may offer advantages in certain use cases (e.g., different languages or features).


Kafka Applications

  • Financial Services: Used for fraud detection and real-time financial processing.
  • IoT and Manufacturing: Real-time event processing for tracking devices and systems.
  • Inventory Management: Kafka can manage real-time inventory data for supply chain optimization.

Speaker: https://linktr.ee/thedanicafine
