idris75 · March 2, 2020 13:35
diff --git a/Kafka performance tuning b/Kafka performance tuning
 1.Producer:
 	1.request.required.acks=[0,1,all/-1] - 0: no acknowledgement, very fast but messages can be lost; 1: acknowledged after the leader commits; all/-1: acknowledged after the message is replicated
 	
 	2.use the async producer - sends do not block, and the acknowledgement is handled in a callback; set with the property producer.type=async
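 	3.Batching data - send multiple messages together.
 		batch.num.messages - number of messages to collect into one batch
 		queue.buffer.max.ms - maximum time to buffer messages before the batch is sent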
 	4.Compression for large messages - gzip and snappy are supported
 		very large files can be stored in a shared location, with only the file path published by the Kafka producer.
 		
 	5.timeouts/retry settings - the defaults may be longer than needed, so change them based on the use case - using the property request.timeout.ms
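
To illustrate the batching item above, here is a minimal sketch in plain Python (not a real Kafka client) of the flush-on-size-or-age rule that batch.num.messages and queue.buffer.max.ms control. The Batcher class, its parameter names and the values used are hypothetical. A real producer would flush from a background thread, whereas this toy only checks the age when a message is added.

```python
import time
from typing import Callable, List, Optional


class Batcher:
    """Toy batcher: flushes on size or age, whichever comes first."""

    def __init__(self, send: Callable[[List[bytes]], None],
                 max_messages: int = 200, max_wait_ms: int = 5000,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send = send
        self._max_messages = max_messages      # role of batch.num.messages
        self._max_wait_s = max_wait_ms / 1000  # role of queue.buffer.max.ms
        self._clock = clock
        self._buffer: List[bytes] = []
        self._oldest: Optional[float] = None   # arrival time of first buffered message

    def add(self, message: bytes) -> None:
        now = self._clock()
        if self._oldest is None:
            self._oldest = now
        self._buffer.append(message)
        full = len(self._buffer) >= self._max_messages
        stale = now - self._oldest >= self._max_wait_s
        if full or stale:
            self.flush()

    def flush(self) -> None:
        # Hand the whole buffer to the sender in one call, then start a new batch.
        if self._buffer:
            self._send(self._buffer)
            self._buffer = []
        self._oldest = None
```

Larger batches mean fewer requests and better throughput, at the cost of each message waiting longer before it is sent.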
 	
 2.Brokers:
 	1. choose a high partition count - a consumer group cannot use more consumers than there are partitions.
 	2. try 1 partition per physical disk, to avoid an IO bottleneck.
 	3. load balance partitions - use the tool kafka-reassign-partitions.sh:
 		--generate (produces the plan), then --execute, then --verify.
 	some parameters to tune:
 		num.io.threads - at least as many threads as there are disks.
 		log.flush.interval.messages / log.flush.interval.ms - a higher interval increases speed, but risks data loss if the server crashes.
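
For the load-balancing item above: the --generate step reads a JSON file naming the topics to rebalance (passed with --topics-to-move-json-file, alongside --broker-list). A minimal Python sketch that builds that file's content follows. The helper name and the topic names are hypothetical placeholders.

```python
import json
from typing import Iterable


def topics_to_move(topics: Iterable[str]) -> str:
    """Return the JSON document listing topics whose partitions should be rebalanced."""
    doc = {"topics": [{"topic": name} for name in topics], "version": 1}
    return json.dumps(doc, indent=2)


if __name__ == "__main__":
    # Placeholder topic names; save the output to a file and pass it to --generate.
    print(topics_to_move(["clicks", "orders"]))
```

The plan printed by --generate is itself JSON, which is then saved and passed to --execute and --verify.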
 		
 3.Consumers:
 	1.have as many consumers in a group as there are partitions.
 	2.keep up with the producers - scale the number of consumers so they read as fast as the producers write.
 	3.adding more consumers to a group improves performance (up to the partition count); adding another consumer group does not, since each group reads all the data independently.
 	4.checkpoint interval 
 		replica.high.watermark.checkpoint.interval.ms - a high value improves performance, as checkpointing is done less often, at a slight risk to data
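
Why consumers beyond the partition count do not help can be seen in a small sketch. This is a simplified round-robin assignment written for illustration, not Kafka's actual assignor. The function and consumer names are hypothetical.

```python
from typing import Dict, List


def assign_partitions(partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Spread partition ids over the consumers of one group, round-robin."""
    if not consumers:
        return {}
    assignment: Dict[str, List[int]] = {name: [] for name in consumers}
    for partition in range(partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    # 2 partitions, 3 consumers: one consumer ends up with nothing to read.
    print(assign_partitions(2, ["c1", "c2", "c3"]))
```

Each partition is read by exactly one consumer in a group, so any consumer beyond the partition count sits idle.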
 		
 4.pipeline performance:
 	The other systems in the pipeline, for example HDFS when data is written there, should perform as well as Kafka, otherwise they become the bottleneck.



 extra:	
 kafka vs other messaging systems:
 	1.Decoupling of producer & consumer - data is persisted when consumers go down, so consumers can also run periodically, like ETL jobs
 	2.Data is not stored per consumer like in a queue - data is saved once - any number of consumers can read independently with their own offset values.
 	3.Replication is built in by default - not reserved for specialized cases with complicated configuration.
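
The "saved once, read with own offsets" point can be sketched with a toy in-memory log. This is an illustration only. The SharedLog class and the group names are hypothetical and do not reflect how Kafka stores data or offsets.

```python
from typing import Dict, List


class SharedLog:
    """Toy log: records are stored once; each group tracks its own read offset."""

    def __init__(self) -> None:
        self._records: List[str] = []
        self._offsets: Dict[str, int] = {}  # group name -> next offset to read

    def append(self, record: str) -> None:
        self._records.append(record)

    def poll(self, group: str, max_records: int = 10) -> List[str]:
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)  # advance only this group's offset
        return batch


if __name__ == "__main__":
    log = SharedLog()
    for record in ["a", "b", "c"]:
        log.append(record)
    print(log.poll("etl", max_records=2))  # this group has read two records
    print(log.poll("alerts"))              # unaffected by the other group's progress
```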