# Flink, Spark, Storm, Kafka: Comparison Table of Features

Aspect | Flink | Spark | Storm | Kafka |
---|---|---|---|---|
Type | Hybrid (batch and stream) | Hybrid (batch and stream) | Stream-only | Stream-only |
Support for 3rd-party systems | Yes (many source and sink connectors: Kafka, HDFS, Cassandra, etc.) | Yes (Kafka, HDFS, Cassandra, etc.) | Yes (Kafka, HDFS, Cassandra, etc.) | Yes (via Kafka Connect and its connector ecosystem) |
Distributed | Full (cluster deployment, HA, fault tolerant) | Yes | Yes | Yes for the brokers; Kafka Streams instances are deployed and scaled externally |
Stateful | Yes (RocksDB state backend) | Yes (with checkpointing) | Yes (with Trident) | Yes (Kafka Streams with RocksDB; see the Kafka Streams sketch after the table) |
Table API | Yes | Yes (with Spark SQL) | No | Yes (KTable in Kafka Streams) |
Supports handling late arrivals | Yes (event time and watermarks) | Yes (watermarking in Structured Streaming; see the watermark sketch after the table) | No | Yes (window grace period in Kafka Streams) |
Learning curve | Moderate | Easy | Hard | Easy (Kafka Streams); Hard (Kafka Connect) |
Complex event processing | Yes (native FlinkCEP library) | Limited (can be built on Spark Structured Streaming; no dedicated CEP library) | No | No (developer needs to handle) |
Streaming window | Tumbling, sliding, session, count (see the windowing sketch after the table) | Time-based and count-based | Time-based and count-based | Tumbling, hopping/sliding, session |
Data processing | Batch/Stream (native) | Batch/Stream (micro-batch) | Stream-only | Stream-only |
Iterations | Supports iterative algorithms natively | Supports iterative algorithms via driver-side loops over cached RDDs | No | No |
SQL | Table API and SQL API | Spark SQL | No | Streaming SQL via KSQL/ksqlDB |
Optimization | Auto (optimizes the dataflow graph for the available resources) | Auto for DataFrames/SQL (Catalyst optimizer); manual tuning for RDD DAGs | No native support | No native support |
State Backend | Memory, file system, RocksDB or custom backends | Memory, file system, HDFS or custom backends | Memory, file system, HBase or custom backends | Memory, file system, RocksDB or custom backends |
Language | Java, Scala, Python and SQL APIs | Java, Scala, Python, R and SQL APIs (C# and F# via .NET for Apache Spark) | Java, Scala, Clojure and Python APIs | Java, Scala and SQL APIs |
License | Apache License 2.0 | Apache License 2.0 | Apache License 2.0 | Apache License 2.0 |
Backpressure | Auto (credit-based flow control adjusts the processing speed) | Auto (adjusting the batch size) | Manual (tuning the spout configuration parameters) | Largely avoided by pull-based consumers; producers throttle via configuration (buffer.memory, max.block.ms) |
Geo-distribution | Flink Stateful Functions API | No native support | No native support | Kafka MirrorMaker tool |
Latency | Streaming: very low latency (milliseconds) | Micro-batching: near real-time latency (seconds) | Tuple-by-tuple: very low latency (milliseconds) | Log-based: very low latency (milliseconds) |
Geo-fencing | No | No | No | No |
Data model | True streaming with bounded and unbounded data sets | Micro-batching with RDDs and DataFrames | Tuple-based streaming | Log-based streaming |
Processing engine | One unified engine for batch and stream processing | Separate engines for batch (Spark Core) and stream processing (Spark Streaming) | Stream processing only | Stream processing only |
Delivery guarantees | Exactly-once for both batch and stream processing (see the checkpointing sketch after the table) | Exactly-once for batch; end-to-end exactly-once for streams only with idempotent or transactional sinks | At-least-once or at-most-once depending on the configuration (exactly-once with Trident) | Exactly-once with transactions and idempotent producers; otherwise at-least-once |
Throughput | High throughput due to pipelined execution and in-memory caching | High throughput due to in-memory caching and parallel processing | Moderate throughput due to tuple-by-tuple processing and per-tuple acking overhead | High throughput due to sequential log I/O, batching, and compression |
State management | Rich support for stateful operations with various state backends and time semantics | Limited in the DStream API (updateStateByKey, mapWithState); richer in Structured Streaming (mapGroupsWithState/flatMapGroupsWithState) | No native support in core Storm; relies on external databases or the Trident API | No stateful processing in the broker itself; provided by the Kafka Streams API (local RocksDB stores backed by changelog topics) |
Machine learning support | Yes (Flink ML library) | Yes (Spark MLlib library) | No (use external libraries like SAMOA or StormCV) | No (use external libraries like TensorFlow or H2O) |
Graph processing support | Yes (Gelly library) | Yes (GraphX library) | No (no dedicated graph library) | No (use external graph databases like Neo4j or Titan) |
Architecture | True streaming engine that treats batch as a special case of streaming with bounded data. Uses a streaming dataflow model that allows for more optimization than Spark’s DAG model. | Batch engine that supports streaming as micro-batching (processing small batches of data at regular intervals). Uses a DAG model that divides the computation into stages and tasks. | Stream engine that processes each record individually as it arrives. Uses a topology model that consists of spouts (sources) and bolts (processors). | Stream engine that acts as both a message broker and a stream processor. Uses a log model that stores and processes records as an ordered sequence of events. |
Delivery guarantees (details) | Supports exactly-once processing semantics by using checkpoints and state snapshots. Also supports at-least-once and at-most-once semantics. | Supports at-least-once processing semantics by using checkpoints and write-ahead logs. Can achieve exactly-once semantics for some output sinks by using idempotent writes or transactions. | Supports at-least-once processing semantics by using acknowledgments and retries. Can achieve exactly-once semantics by using the Trident API, which provides transactions and state management. | Supports exactly-once processing semantics by using transactions and idempotent producers. Also supports at-least-once and at-most-once semantics. |
Fault Tolerance | Provides high availability and fast recovery from failures by using checkpoints and state snapshots stored in external storage systems. Supports local recovery for partial failures. | Provides fault tolerance by using checkpoints and write-ahead logs stored in external storage systems. Also uses lineage information to recompute lost data from resilient distributed datasets (RDDs). | Provides fault tolerance by using acknowledgments and retries to ensure reliable message delivery. Also uses ZooKeeper to store the state of the topology and the spouts’ offsets. | Provides fault tolerance by replicating the log partitions across multiple brokers and using ZooKeeper to store the cluster metadata. Also uses transactions and idempotent producers to ensure consistent output. |
Performance | Achieves high performance and low latency by using in-memory processing, pipelined execution, incremental checkpoints, network buffers, and operator chaining. Also supports batch and iterative processing modes for higher throughput. | Achieves high performance and low latency by using in-memory processing, lazy evaluation, RDD caching, and code generation. However, micro-batching introduces some latency overhead compared to true streaming engines. | Achieves high performance and low latency by using in-memory processing, parallel execution, local state management, and backpressure control. However, Storm does not support batch or iterative processing modes natively. | Achieves high performance and low latency by using sequential log I/O, zero-copy transfer, batch compression, and OS page caching. However, Kafka does not support complex stream processing operations natively (that is the role of Kafka Streams/ksqlDB). |
API | Provides a rich and expressive API in Java, Scala, Python, and SQL. Supports both the low-level DataStream API and the high-level Table/SQL API for stream processing. Also provides libraries for complex event processing (CEP), machine learning (ML), graph processing (Gelly), etc. | Provides a unified API in Java, Scala, Python, R, and SQL (including Structured Streaming). Supports both low-level (RDD) and high-level (DataFrame/Dataset) abstractions for batch and stream processing. Also provides libraries for machine learning (MLlib), graph processing (GraphX), etc. | Provides a core API in Java and Clojure for defining topologies using spouts and bolts. Also provides a higher-level API (Trident) for defining topologies using operations like joins, aggregations, grouping, etc. Supports integration with external libraries for machine learning, SQL, etc. | Provides a producer-consumer API with clients in Java, Scala, Python, C/C++, Go, etc. for reading and writing data to Kafka topics. Also provides a stream processing API (Kafka Streams) in Java and Scala for defining topologies using operations like map, filter, join, aggregate, etc. Supports streaming SQL via KSQL/ksqlDB and integration with external libraries for machine learning, etc. |
Community support | Active and growing community with frequent releases and contributions | Large and mature community with stable releases and contributions | Declining community with infrequent releases and contributions | Large and active community with frequent releases and contributions |
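A few rows above are easier to judge with code in hand; the sketches below are illustrative, not canonical. First, Flink's window types map directly onto the DataStream API. This minimal Java sketch sums values per key over tumbling 10-second processing-time windows; the source elements and window size are assumptions for illustration.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class TumblingWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical in-memory source of (key, value) pairs.
        DataStream<Tuple2<String, Integer>> events = env.fromElements(
                Tuple2.of("sensor-a", 1), Tuple2.of("sensor-a", 2), Tuple2.of("sensor-b", 3));

        // Sum the value field per key over tumbling 10-second processing-time windows.
        // Sliding and session windows swap in SlidingProcessingTimeWindows
        // or ProcessingTimeSessionWindows as the assigner.
        events.keyBy(t -> t.f0)
              .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
              .sum(1)
              .print();

        env.execute("tumbling-window-sketch");
    }
}
```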
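Flink's exactly-once guarantee and RocksDB state backend, referenced in the delivery-guarantees and state-backend rows, are both enabled through checkpoint configuration. A minimal sketch, assuming an arbitrary 30-second interval and a local checkpoint path:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot all operator state every 30 seconds with exactly-once semantics.
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);

        // Keep operator state in RocksDB (requires the flink-statebackend-rocksdb
        // dependency); checkpoints go to durable storage, here an assumed local
        // path purely for illustration.
        env.setStateBackend(new EmbeddedRocksDBStateBackend());
        env.getCheckpointConfig().setCheckpointStorage("file:///tmp/flink-checkpoints");
    }
}
```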
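Spark handles late arrivals in Structured Streaming with watermarks. The sketch below uses the built-in rate source (its timestamp column stands in for real event time) and tolerates records up to ten minutes late within five-minute windows:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

public class WatermarkSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("watermark-sketch")
                .master("local[*]")
                .getOrCreate();

        // The built-in "rate" source generates rows with an event-time "timestamp"
        // column; a real job would read from Kafka, files, etc.
        Dataset<Row> events = spark.readStream().format("rate")
                .option("rowsPerSecond", 10)
                .load();

        // Drop records more than 10 minutes late, counting per 5-minute event-time window.
        Dataset<Row> counts = events
                .withWatermark("timestamp", "10 minutes")
                .groupBy(window(col("timestamp"), "5 minutes"))
                .count();

        counts.writeStream()
                .outputMode("update")
                .format("console")
                .start()
                .awaitTermination();
    }
}
```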
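Finally, stateful processing on Kafka goes through the Kafka Streams API, which keeps state in local RocksDB stores and supports exactly-once semantics via transactions. A word-count sketch; the topic names and broker address are assumptions:

```java
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-sketch"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        // Exactly-once semantics via Kafka transactions and idempotent producers.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
               .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
               .groupBy((key, word) -> word, Grouped.with(Serdes.String(), Serdes.String()))
               // The count is kept in a local RocksDB store, backed by a changelog topic.
               .count()
               .toStream()
               .to("counts-topic", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```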