Speaker: Danica Fine
YouTube Link: Youtube
- Apache Kafka is a distributed event streaming platform that enables real-time data applications. It supports:
- Reactive, accurate, loosely coupled, and resilient systems.
- Various data streaming patterns such as:
- Publish/Subscribe
- Queuing
- Broadcasting
- Batch processing
- Events in Kafka are records of something that has happened, defined by:
- A timestamp and a description of what occurred.
- Examples: Adding an item to an online cart or tracking the location of a ship.
- Immutability: Because events record things that have already happened, they can't be changed after the fact.
- Kafka topics are logs, not queues.
- Events persist even after being read by consumers.
- Events in topics are ordered, immutable, and assigned a monotonically increasing offset.
- Durability: Topics are append-only logs that can store data indefinitely, with configurable cleanup policies based on time or size.
- Partitions: Topics are divided into partitions, which are durable logs as well.
- Partitions allow for scalability, but require careful configuration to balance performance.
- Brokers are the nodes in a Kafka cluster.
- Partitions are distributed across brokers to ensure even load.
- Replication ensures that partitions have multiple copies across different brokers for fault tolerance.
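
For illustration, a minimal sketch of creating such a topic with the Java AdminClient (the broker address, topic name, and counts are assumptions, not from the talk):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local cluster

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions spread load across brokers; replication factor 3
            // keeps a copy of each partition on three different brokers
            // (so it requires at least 3 brokers in the cluster).
            NewTopic topic = new NewTopic("cart-events", 6, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```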
- Producer: When writing to Kafka, the required fields are:
- Topic: Defines where the event is written.
- Value: The event or message itself.
- Kafka supports producers in multiple programming languages.
- Serialization: Kafka only processes data in byte format, so data needs to be serialized (e.g., using Avro, JSON, or Protobuf).
- Partitioning: Producers either assign a partition for the data or rely on a default partitioning strategy.
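
A hedged sketch of those required fields in the Java producer client (broker address, topic name, and payload are illustrative):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class CartEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        // Serializers turn keys/values into bytes, the only format Kafka accepts.
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Topic and value are required; the key is optional. With no explicit
            // partition, the default partitioner hashes the key, so all events
            // for cart-42 land on the same partition (preserving their order).
            ProducerRecord<String, String> record = new ProducerRecord<>(
                    "cart-events", "cart-42", "{\"action\":\"add\",\"item\":\"book\"}");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) exception.printStackTrace();
                else System.out.printf("wrote to partition %d at offset %d%n",
                        metadata.partition(), metadata.offset());
            });
        } // close() flushes any in-flight sends
    }
}
```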
- Consumers can start reading from the:
- Earliest event,
- Most recent event,
- Or from a specific offset or timestamp.
- Offset tracking: Consumers commit their processed offset back to Kafka to ensure they can resume where they left off in case of failure.
- Consumer groups: Multiple consumers can be grouped to process data in parallel.
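
Putting those pieces together, a minimal Java consumer sketch (group id, topic, and broker address are illustrative):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class CartEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed local broker
        props.put("group.id", "cart-processors");          // consumers sharing this id split the partitions
        props.put("auto.offset.reset", "earliest");        // with no committed offset, start from the earliest event
        props.put("enable.auto.commit", "false");          // commit offsets manually after processing
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("cart-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
                consumer.commitSync(); // record progress so a restart resumes here
            }
        }
    }
}
```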
- Schema Registry: Manages and maintains schemas across topics, supporting schema evolution and compatibility.
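
With Confluent's Schema Registry (an assumption; the talk doesn't name a specific implementation), a producer is typically pointed at the registry through its serializer config, roughly like this sketch:

```java
import java.util.Properties;

public class AvroProducerConfig {
    static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer registers the value schema with the
        // registry (or looks it up) and enforces compatibility on evolution.
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // assumed local registry
        return props;
    }
}
```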
- Kafka Connect:
- A framework that connects Kafka with other data systems (sources and sinks).
- Offers low-code/no-code options for easy integration.
- Can move data out of Kafka into other systems, or pull data from external systems into Kafka.
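
As an example of the low-code style, a connector is declared as JSON and submitted to the Connect REST API; this sketch uses the FileStreamSource connector that ships with Kafka (connector name, file path, and topic are made up):

```json
{
  "name": "local-file-source",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/tmp/app.log",
    "topic": "app-logs"
  }
}
```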
- Kafka Streams:
- A Java/Scala library for stream processing with built-in support for stateful processing.
- ksqlDB: A SQL-based interface for stream processing built on Kafka Streams, making stream processing more accessible.
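
A minimal Kafka Streams sketch in Java (topic names and the filter predicate are illustrative, not from the talk):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class OrderFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-filter");       // also the consumer group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");
        // Stand-in for real business logic: route high-priority orders
        // to their own topic as they arrive.
        orders.filter((key, value) -> value.contains("\"priority\":\"high\""))
              .to("high-priority-orders");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```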
- Other Frameworks: Kafka integrates with tools like Apache Spark and Apache Flink for stream processing; these may offer advantages in certain use cases (e.g., different language support or features).
- Financial Services: Used for fraud detection and real-time financial processing.
- IoT and Manufacturing: Real-time event processing for tracking devices and systems.
- Inventory Management: Kafka can manage real-time inventory data for supply chain optimization.
Speaker: https://linktr.ee/thedanicafine