Created May 11, 2017 11:04
Kafka vs Flume:
Kafka is general-purpose messaging infrastructure with producer and consumer APIs, so it can feed any system or sink.
It is an enterprise messaging system for connecting arbitrary systems, not just Hadoop; Flume is designed primarily for Hadoop ingestion.
Kafka is asynchronous: producers and consumers work at their own pace, so, for example, no data is lost if a consumer is down for a while.
A consumer can come back up later and resume where it left off; it can even pull from an older offset if needed.
Overflow events simply accumulate in Kafka's persistent, disk-backed buffer.
Event spikes are easily absorbed; Kafka can handle on the order of 100K events per second.
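The producer/consumer decoupling above can be sketched with a minimal in-memory append-only log, a toy stand-in for a Kafka topic partition (class and method names here are illustrative, not the real client API):

```python
class Log:
    """Append-only log: a toy stand-in for a Kafka topic partition."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the newly written record

    def read_from(self, offset):
        """Return all records at or after `offset` (a consumer pull)."""
        return self.records[offset:]

# The producer keeps writing while the consumer is "down" ...
log = Log()
for i in range(5):
    log.append(f"event-{i}")

# ... and the consumer later resumes from its last committed offset,
# so nothing written during the outage is lost.
committed_offset = 2
missed = log.read_from(committed_offset)
print(missed)  # ['event-2', 'event-3', 'event-4']
```

The point is that the log, not the consumer, owns the data: consumption is just a read at an offset, so a late or restarted consumer replays whatever it missed.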
High durability / fault tolerance:
Kafka supports replication across brokers.
Flume does support a durable file channel, but data sitting in the channel (not yet delivered to the sink) is unavailable until the agent comes back up.
If the server hosting the file channel crashes, that data cannot be recovered, which pushes you toward expensive RAID/SAN storage.
Kafka supports both synchronous and asynchronous replication, so commodity hardware is good enough.
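In the real client this synchronous-vs-asynchronous trade-off is exposed through the producer's standard `acks` setting (a minimal producer-properties fragment; broker/topic settings such as replication factor are assumed to be configured separately):

```
# Fire-and-forget: do not wait for any broker acknowledgement (fastest, weakest durability)
acks=0

# Wait for the partition leader only (asynchronous replication to followers)
acks=1

# Wait until all in-sync replicas have the record (synchronous replication, strongest durability)
acks=all
```

With `acks=all`, a record is not considered written until every in-sync replica has it, which is what lets commodity disks stand in for RAID/SAN-class hardware.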
High scalability:
Consumers can be added on the fly with no downtime and no impact on performance.
In Flume, the pipeline topology has to change: the channel must be replicated for the new sink, which means some downtime.
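Kafka's on-the-fly scaling works because topic partitions are simply redistributed across the consumer group when a member joins. A toy round-robin assignment sketch (illustrative only; the broker's actual group-rebalance protocol is more involved):

```python
def assign(partitions, consumers):
    """Spread partitions round-robin across the current group members."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = list(range(6))

# Two consumers share six partitions ...
print(assign(partitions, ["c1", "c2"]))
# {'c1': [0, 2, 4], 'c2': [1, 3, 5]}

# ... adding a third just triggers a reassignment; no pipeline change,
# no channel replication, no downtime.
print(assign(partitions, ["c1", "c2", "c3"]))
# {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}
```

Contrast this with Flume, where a new sink requires editing the agent configuration and restarting it.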