Kafka benchmarks are typically run using a single producer and consumer against a single topic, with the producer and consumer driven at close to maximum write/read speed. In the real world, a Kafka cluster more often serves many lower-throughput producers and consumers. Ansible allows for a benchmarking method that sets up any number of topics along with many producers and consumers.
Ansible playbooks allow us to run a number of tasks against a distributed set of clients both synchronously and asynchronously.
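The plays below target two inventory groups: a single tools host where one-off commands run, and a kafka_connect group where the load is generated. A minimal inventory sketch might look like the following (the hostnames are assumptions; per the notes further down, the test environment had one tools host and 20 kafka_connect hosts):

    all:
      children:
        tools:
          hosts:
            tools-1:                    # hypothetical hostname; a single host, so setup commands run once
        kafka_connect:
          hosts:
            kafka-connect-[01:20]:      # hypothetical host range; these hosts run the producers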
Before we can run tests, we need topics to test against. This play sets up a number of topics with various partition configurations:
- name: Setup
  hosts: tools
  tags:
    - setup
  vars:
    bootstrap: 172.20.10.101:9092
    topicname_prefix: perftest
    replicas: 3
    retention: 300000 # 5 mins
    topics_partitions: [5, 10, 20, 30, 50]
    topic_count: "{{ range(1, 30 + 1) | list }}" # create 30 topics for each
  tasks:
    - name: list existing topics
      shell:
        cmd: >-
          kafka-topics --bootstrap-server {{ bootstrap }} --list
      register: topiclist
    - name: create topic if it doesn't exist
      shell:
        cmd: >-
          kafka-topics --bootstrap-server {{ bootstrap }}
          --topic {{ topicname_prefix }}-{{ item.0 }}-{{ item.1 }}
          --create --partitions {{ item.0 }}
          --replication-factor {{ replicas }}
          --config retention.ms={{ retention }}
          --config min.insync.replicas=2
      loop: "{{ topics_partitions|product(topic_count)|list }}"
      when: "'-'.join((topicname_prefix, item.0|string, item.1|string)) not in topiclist.stdout"
Some notes on the above play:
- This play runs on the tools host, a single host defined in the inventory, so these commands are only run once.
- The play is idempotent, meaning it can be run multiple times with the same result: it checks for the existence of each topic and won't attempt to recreate it.
- The loop uses Ansible's product filter to create a Cartesian product of the topic partitions and topic counts: [[5, 1], [5, 2], [5, 3], ..., [50, 29], [50, 30]]
- We use this loop to create 30 topics with 5 partitions, 30 topics with 10 partitions, and so on.
- The when conditional on the create task checks whether each topic name is already present in the topiclist output registered by the first task; the sketch after this list shows the names it builds.
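To sanity-check the topic names that the when conditional tests against the registered topiclist, a throwaway debug task (a sketch, not part of the original play) can print them without creating anything:

    - name: preview generated topic names
      debug:
        msg: "{{ '-'.join((topicname_prefix, item.0|string, item.1|string)) }}"
      loop: "{{ topics_partitions|product(topic_count)|list }}"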
Now that we have topics, let’s produce some data to them:
- name: Tests
  strategy: free
  hosts:
    - kafka_connect
  gather_facts: no
  vars:
    bootstrap: 172.20.10.101:9092
    topicname_prefix: perftest
    throughput: 5000
    record_size: 512
    num_records: 10000000
    num_producers: 10
    acks: all
    topics_partitions: [5, 10, 20, 30, 50]
    topic_count: "{{ range(1, 30 + 1) | list }}"
    topics: []
  tasks:
    - name: set list of topics
      set_fact:
        topics: "{{ topics + [topicname_prefix + '-' + item.0|string + '-' + item.1|string] }}"
      loop: "{{ topics_partitions|product(topic_count)|list }}"
    - name: producer test
      async: 5184000
      poll: 0
      register: producer_output
      command: >-
        kafka-producer-perf-test
        --topic {{ topics | random }}
        --producer-props bootstrap.servers={{ bootstrap }}
        acks={{ acks }}
        client.id={{ inventory_hostname }}-producer-{{ item }}
        linger.ms=10
        --record-size {{ record_size }}
        --throughput {{ throughput }}
        --num-records {{ num_records }}
      loop: "{{ range(1, num_producers + 1) | list }}"
Notes on this play:
- The execution strategy is free (set at the top of the play), which means Ansible won't make hosts wait for each other between tasks; each host runs through the play as fast as it can.
- We build the list of topics in the set_fact task using the same Cartesian product approach.
- The producer command uses the random filter to select a topic at random from that list.
- The producer test runs on every host in the specified group, in this case kafka_connect, and the loop on the producer test task creates a number of producers equal to num_producers on each host.
- Producer throughput is determined by the throughput and record_size variables: 512 bytes * 5,000 records per second is about 2.5MB/s per producer. There were 20 kafka_connect hosts and num_producers per host is set to 10, so this producer test will try to write ~500MB/s to Kafka.
- This play doesn't wait for the producers to complete (see the poll: 0 entry in the Ansible docs on asynchronous actions); it simply launches them and moves on, and we monitor the results via Control Center and via JMX with Prometheus and Grafana. The play could be modified to poll the asynchronous jobs and collect their output back to the tools host once they finish, as sketched below.
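For example, a follow-up task pair (a minimal sketch using Ansible's async_status module against the producer_output variable registered above) could poll each background job until it finishes and then print the perf-test output:

    - name: wait for producer tests to finish
      async_status:
        jid: "{{ item.ansible_job_id }}"
      register: job_result
      until: job_result.finished
      retries: 720   # with delay: 10 this polls for up to two hours
      delay: 10
      loop: "{{ producer_output.results }}"

    - name: show perf-test output
      debug:
        # stdout_lines is populated for command tasks once the job completes
        msg: "{{ job_result.results | map(attribute='stdout_lines') | list }}"

Note that retries and delay would need tuning to the expected test duration; with throughput capped at 5,000 records per second, 10,000,000 records take a little over 30 minutes per producer.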