Kafka, Flume, Spark sizing notes
Flume
=====
ingest : number of sources of origin for events / number of events (per source) <<<<* this is required
R.o.T. : one aggregator for every 4-16 agents
Using the above info, work from the outer tier in to the final inner tier - and we know the final destination of the events : HDFS?
R.o.T. : HDFS can sink ~1000 events/sec per sink on the backend - we assume a "fan-in" type configuration (many-to-one from the
outer tier to the inner tier)
Start with this rule of thumb:
# Cores = (# Sources + # Sinks) / 2
• If using the Memory Channel, RAM should be sufficient to hold the maximum channel capacity
• If using the File Channel, more disks are better for throughput
- In both cases, size the channel to cover the expected time to resolve any downstream failure ... 2-4 hrs?
Example :
Number of tiers for an aggregation flow from 100 web servers
• Using a ratio of 1:16 for the outer tier, no load balancing or failover requirements
- Number of tier 1 agents = 100/16 ~ 7
• Using a ratio of 1:4 for the inner tier, no load balancing or failover
- Number of tier 2 agents = 7/4 ~ 2
Total of 2 tiers, 9 agents
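
A minimal sketch of the arithmetic above, assuming simple 1:N fan-in per tier and rounding agent counts up; the function names are
illustrative, not a Flume API:

import math

def tier_sizes(n_sources, ratios=(16, 4)):
    # Collapse the source count through each tier's fan-in ratio,
    # rounding up so no aggregator exceeds its agent quota.
    tiers = []
    agents = n_sources
    for ratio in ratios:
        agents = math.ceil(agents / ratio)
        tiers.append(agents)
    return tiers

def cores_estimate(n_sources, n_sinks):
    # R.o.T. from above: # Cores = (# Sources + # Sinks) / 2
    return math.ceil((n_sources + n_sinks) / 2)

print(tier_sizes(100))  # -> [7, 2] : 2 tiers, 9 agents total
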
Kafka
=====
Amount of data to ingest : requests/sec, say, or the total amount of data in TBs? Once we have the total throughput we can calculate
the number of required partitions (for parallelism) based on the throughput per partition for both producer and consumer (usually
needs to be measured, but R.o.T. 10MB/s??) <<<<* this is required
Additional info ...
kafka heap - 5G max
64G/24vCPU configs are most common
6 vdisks in RAID 10 - xfs filesystem
RF2 - replication
Kafka uses a very large number of files and a large number of sockets to communicate with its clients. All of this requires a
relatively high number of available file descriptors. You should increase your file descriptor limit to at least 100,000.
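
A quick way to check the current limit from Python's standard library - a minimal sketch, with the 100,000 threshold taken from the
note above:

import resource

# Compare the process's open-file (nofile) soft limit against the R.o.T.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
if soft < 100_000:
    print(f"nofile soft limit is {soft}; raise it to at least 100,000 for Kafka")
else:
    print(f"nofile soft limit is {soft}; OK")
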
You measure the throughput that you can achieve on a single partition for production (call it p) and consumption (call it c). Let's
say your target throughput is t. Then you need to have at least max(t/p, t/c) partitions. The per-partition throughput that one can
achieve on the producer depends on configurations such as the batch size, compression codec, type of acknowledgement, replication
factor, etc.
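
The same rule as a minimal sketch - max(t/p, t/c), rounded up; the 10MB/s figures below are just the R.o.T. from these notes, not
measured values:

import math

def required_partitions(target, per_partition_produce, per_partition_consume):
    # Need enough partitions that neither producers nor consumers
    # become the bottleneck: at least max(t/p, t/c).
    return math.ceil(max(target / per_partition_produce,
                         target / per_partition_consume))

# e.g. 500 MB/s target, 10 MB/s per partition on both sides -> 50 partitions
print(required_partitions(500, 10, 10))
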
Spark
=====
ingest rate / amount of ingest - Spark can be co-located on the HDFS workers (reduces latency) or assigned via a node label (not
usually recommended?); then only the history server is needed on the master nodes <<<<* this is required
additional info...
4-8 disks per node, up to 75% of RAM to Spark (8GB to ~200s of GB; above 200GB the JVM becomes unstable and you need to run
additional JVMs per host - not an issue when using a VM), 8-16 cores per machine. Install and configure a standby master with ZK -
most likely done via Cloudera Manager.
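
A minimal sketch of that memory split, assuming the 75%-of-RAM budget and ~200GB-per-JVM ceiling above; the names are illustrative:

import math

def spark_jvm_plan(host_ram_gb, ram_fraction=0.75, max_heap_gb=200):
    # Budget 75% of host RAM for Spark, then split it across enough
    # JVMs that each heap stays under the ~200GB stability ceiling.
    budget = host_ram_gb * ram_fraction
    n_jvms = max(1, math.ceil(budget / max_heap_gb))
    return n_jvms, budget / n_jvms

# e.g. a 512GB host: 75% = 384GB -> 2 JVMs of 192GB each
print(spark_jvm_plan(512))
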
https://legacy.gitbook.com/book/umbertogriffo/apache-spark-best-practices-and-tuning/details