Kafka, Flume, Spark sizing notes
Flume
=====
ingest : number of sources of origin for events / number of events (per source) <<<<* this is required
R.o.T. : one aggregator for every 4-16 agents
Using the above info, work from the outer tier in to the final inner tier - and we know the final destination of the events : HDFS?
R.o.T. : HDFS can sink ~1000 events/sec per sink on the backend - we assume a "fan-in" type configuration (many-to-one from the
outer tier to the inner tier)
Start with this rule of thumb:
# Cores = (# Sources + # Sinks) / 2
• If using the Memory Channel, RAM should be sufficient to hold the maximum channel capacity
• If using the File Channel, more disks are better for throughput
- In both cases, size the channel to cover the expected time to resolve any downstream failure ... 2-4 hrs?
Example :
Number of tiers for an aggregation flow from 100 web servers
• Using a ratio of 1:16 for the outer tier, no load balancing or failover requirements
- Number of tier 1 agents = 100/16 ~ 7
• Using a ratio of 1:4 for the inner tier, no load balancing or failover
- Number of tier 2 agents = 7/4 ~ 2
Total of 2 tiers, 9 agents
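
A minimal sketch of the arithmetic above, assuming simple 1:N fan-in per tier and rounding agent counts up; the function names are
illustrative, not a Flume API:

import math

def tier_sizes(n_sources, ratios=(16, 4)):
    # Collapse the source count through each tier's fan-in ratio,
    # rounding up so no aggregator exceeds its agent quota.
    tiers = []
    agents = n_sources
    for ratio in ratios:
        agents = math.ceil(agents / ratio)
        tiers.append(agents)
    return tiers

def cores_estimate(n_sources, n_sinks):
    # R.o.T. from above: # Cores = (# Sources + # Sinks) / 2
    return math.ceil((n_sources + n_sinks) / 2)

print(tier_sizes(100))  # -> [7, 2] : 2 tiers, 9 agents total
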
Kafka
=====
Amount of data to ingest : requests/sec, say, or the total amount of data in TBs? Once we have the total throughput we can calculate
the number of required partitions (for parallelism) based on the throughput per partition for both producer and consumer (usually
needs to be measured, but R.o.T. 10MB/s??) <<<<* this is required
Additional info ...
kafka heap - 5G max
64G/24vCPU configs are most common
6 vdisks in RAID 10 - xfs filesystem
RF2 - replication
Kafka uses a very large number of files and a large number of sockets to communicate with its clients. All of this requires a
relatively high number of available file descriptors. You should increase your file descriptor limit to at least 100,000.
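
A quick way to check the current limit from Python's standard library - a minimal sketch, with the 100,000 threshold taken from the
note above:

import resource

# Compare the process's open-file (nofile) soft limit against the R.o.T.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
if soft < 100_000:
    print(f"nofile soft limit is {soft}; raise it to at least 100,000 for Kafka")
else:
    print(f"nofile soft limit is {soft}; OK")
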
You measure the throughput that you can achieve on a single partition for production (call it p) and consumption (call it c). Let's
say your target throughput is t. Then you need to have at least max(t/p, t/c) partitions. The per-partition throughput that one can
achieve on the producer depends on configurations such as the batch size, compression codec, type of acknowledgement, replication
factor, etc.
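
The same rule as a minimal sketch - max(t/p, t/c), rounded up; the 10MB/s figures below are just the R.o.T. from these notes, not
measured values:

import math

def required_partitions(target, per_partition_produce, per_partition_consume):
    # Need enough partitions that neither producers nor consumers
    # become the bottleneck: at least max(t/p, t/c).
    return math.ceil(max(target / per_partition_produce,
                         target / per_partition_consume))

# e.g. 500 MB/s target, 10 MB/s per partition on both sides -> 50 partitions
print(required_partitions(500, 10, 10))
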
Spark
=====
ingest rate / amount of ingest - Spark can be co-located on the HDFS workers (reduces latency) or assigned via a node label (not
usually recommended?); then only the history server is needed on the master nodes <<<<* this is required
additional info...
4-8 disks per node, up to 75% of RAM to Spark (8GB to ~200s of GB; above 200GB the JVM becomes unstable and you need to run
additional JVMs per host - not an issue when using a VM), 8-16 cores per machine. Install and configure a standby master with ZK -
most likely done via Cloudera Manager.
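
A minimal sketch of that memory split, assuming the 75%-of-RAM budget and ~200GB-per-JVM ceiling above; the names are illustrative:

import math

def spark_jvm_plan(host_ram_gb, ram_fraction=0.75, max_heap_gb=200):
    # Budget 75% of host RAM for Spark, then split it across enough
    # JVMs that each heap stays under the ~200GB stability ceiling.
    budget = host_ram_gb * ram_fraction
    n_jvms = max(1, math.ceil(budget / max_heap_gb))
    return n_jvms, budget / n_jvms

# e.g. a 512GB host: 75% = 384GB -> 2 JVMs of 192GB each
print(spark_jvm_plan(512))
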
https://legacy.gitbook.com/book/umbertogriffo/apache-spark-best-practices-and-tuning/details