Question:

Given a setup of 4 instances of a Spring Boot 2.x application with Kafka listeners:

  • 2 instances are running in Azure
  • 2 instances are running in a local DC.

The Kafka cluster is configured identically for all instances and also runs in the local DC. When messages arrive on a topic, they are only ever processed by one of the app instances in the local DC; only when the local instances are stopped do the Azure instances start processing messages.

Answer:

Your observation points to a clear imbalance in Kafka consumer group behavior, favoring the local data center (DC) instances over the Azure instances, even though all four run the same Spring Boot application and belong to the same consumer group. The key question is: why do only the local instances get partitions assigned and process messages?

Let’s break this down based on how Kafka works and what could cause this behavior:


🔍 What Kafka Guarantees

Kafka ensures that:

  • Each partition is consumed by only one consumer in a consumer group at a time.

  • Partition assignment happens during a rebalance: the group coordinator (a broker) tracks membership, and one consumer, elected as group leader, computes the assignment using the configured partition.assignment.strategy.

  • Kafka client heartbeats are used to determine liveness of consumers.
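
For context, this is roughly what such a listener looks like in Spring Kafka; the topic and group names below are placeholders, not taken from the setup above:

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class OrderListener {

    // All four instances declare the same group id, so together they form one
    // consumer group; each partition is consumed by exactly one of them at a time.
    @KafkaListener(topics = "orders", groupId = "order-processor")
    public void onMessage(String message) {
        // process the message
    }
}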


✅ Likely Root Cause: Consumer Latency or Connectivity Disparity

The Azure instances may be failing to maintain healthy heartbeats due to higher network latency between Azure and the local Kafka brokers, causing the group coordinator to repeatedly evict them from the group, so they end up without partitions after each rebalance.

Supporting Evidence:

  • Kafka treats a consumer that does not heartbeat within session.timeout.ms as dead and removes it from the group, and therefore from partition assignment.

  • Spring Kafka clients use max.poll.interval.ms, session.timeout.ms, and heartbeat.interval.ms to configure timing. If the latency from Azure → local DC is high, heartbeats may be delayed or missed.
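
For reference, these are the 2.x client defaults for those settings, with a comment on what each one governs (a plain-consumer sketch, not a tuning recommendation):

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;

public class ConsumerTimeoutDefaults {

    public static Properties timeoutDefaults() {
        Properties props = new Properties();
        // Heartbeats are sent by a background thread; if none reaches the group
        // coordinator within session.timeout.ms, the consumer is declared dead
        // and a rebalance is triggered.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "10000");
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000");
        // Bounds the time between poll() calls; exceeding it also removes the
        // consumer from the group, independently of heartbeats.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");
        return props;
    }
}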


✅ Additional Possibilities

  1. Firewall/NAT delays or packet loss between Azure and local DC.

  2. Kafka broker configuration, such as connections.max.idle.ms, may close idle connections prematurely, forcing the remote clients to reconnect repeatedly.

  3. Spring Boot/Kafka client version mismatches or bugs—but if you're on the same version everywhere, this is less likely.

  4. Kafka's partition assignment strategy (e.g., range vs. round-robin): the default range assignor hands each topic's partitions to the first consumers in member-ID order, so if a topic has fewer partitions than the group has members, the remaining consumers stay idle.

  5. Azure consumers get assigned partitions but crash or fail silently; check the logs carefully (see the logging sketch after this list).
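
To verify points 4 and 5, log every assignment and revocation on each instance. A minimal sketch using Kafka's ConsumerRebalanceListener, which in Spring Kafka can be registered via the container properties of the listener container factory:

import java.util.Collection;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingRebalanceListener implements ConsumerRebalanceListener {

    private static final Logger log = LoggerFactory.getLogger(LoggingRebalanceListener.class);

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // If this never logs a non-empty collection on the Azure instances,
        // they are joining the group but receiving no partitions.
        log.info("Partitions assigned: {}", partitions);
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        log.info("Partitions revoked: {}", partitions);
    }
}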


🔬 Diagnostics

1. Check consumer group membership

Use:

kafka-consumer-groups.sh --bootstrap-server <broker> --describe --group <group-id>

This will show:

  • Which instance (client.id/host) has which partitions

  • Whether the Azure instances are members of the group but hold 0 partitions
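
If running the CLI against the brokers is inconvenient, roughly the same information can be pulled programmatically with the Kafka AdminClient; a sketch with placeholder broker address and group id:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;
import org.apache.kafka.clients.admin.MemberDescription;

public class DescribeGroup {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            ConsumerGroupDescription group = admin
                    .describeConsumerGroups(Collections.singletonList("my-group")) // placeholder group id
                    .describedGroups()
                    .get("my-group")
                    .get();
            for (MemberDescription member : group.members()) {
                // Members with an empty assignment are in the group but idle.
                System.out.printf("%s (%s) -> %s%n",
                        member.clientId(), member.host(), member.assignment().topicPartitions());
            }
        }
    }
}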

2. Enable DEBUG logging

In application.yml:

logging:
  level:
    org.apache.kafka.clients.consumer.internals: DEBUG

Look for logs indicating rebalances, heartbeats, or partition assignments.

3. Latency / packet loss tests

Run ping, traceroute, and ideally something like iperf or tcpdump from Azure → Kafka brokers to measure actual round-trip time and packet behavior.


🛠 Recommendations

1. Tune Kafka client timeouts

Make the Azure clients more tolerant of delay by raising the session timeout (example values; note that session.timeout.ms and max.poll.interval.ms have no dedicated Spring Boot keys and must go under spring.kafka.consumer.properties):

spring.kafka.consumer.properties.session.timeout.ms=30000
spring.kafka.consumer.properties.heartbeat.interval.ms=10000
spring.kafka.consumer.properties.max.poll.interval.ms=300000
  • Keep heartbeat.interval.ms at no more than one third of session.timeout.ms.
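
If only the Azure instances should receive the more tolerant values, the same settings can also be applied in code behind a profile; a rough sketch in which the "azure" profile name and the values are assumptions, not part of the original setup:

import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.springframework.boot.autoconfigure.kafka.KafkaProperties;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Profile;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

@Configuration
@Profile("azure") // hypothetical profile marking the Azure-hosted instances
public class AzureKafkaTuning {

    @Bean
    public ConsumerFactory<Object, Object> consumerFactory(KafkaProperties kafkaProperties) {
        Map<String, Object> props = kafkaProperties.buildConsumerProperties();
        // Give the higher-latency Azure -> local-DC link more slack before the
        // coordinator evicts the consumer from the group.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30000);
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 10000);
        return new DefaultKafkaConsumerFactory<>(props);
    }
}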

2. Rebalance strategy

Switch to org.apache.kafka.clients.consumer.RoundRobinAssignor so partitions from all subscribed topics are spread evenly across the group. The default range assignor assigns each topic's partitions to the first consumers in member-ID order, which can leave the same members idle for every topic.

spring.kafka.consumer.properties.partition.assignment.strategy=org.apache.kafka.clients.consumer.RoundRobinAssignor

3. Broker-side tuning

Ensure:

  • connections.max.idle.ms is long enough

  • advertised.listeners are correctly set for cross-DC clients

  • The brokers aren't overloaded, which would delay heartbeat and rebalance processing

4. Kafka client metrics

Expose JMX or Prometheus metrics to monitor consumer lags and heartbeat timing.
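
With Micrometer on the classpath (and spring-kafka 2.5+), the consumer's own metrics, including heartbeat timing and records-lag-max, can be attached to the meter registry; depending on the Spring Boot version this may already be auto-configured, otherwise a customizer along these lines works:

import io.micrometer.core.instrument.MeterRegistry;

import org.springframework.boot.autoconfigure.kafka.DefaultKafkaConsumerFactoryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.core.MicrometerConsumerListener;

@Configuration
public class KafkaConsumerMetricsConfig {

    // Binds each consumer's Kafka client metrics (heartbeat-response-time-max,
    // last-heartbeat-seconds-ago, records-lag-max, ...) to the MeterRegistry.
    @Bean
    public DefaultKafkaConsumerFactoryCustomizer consumerMetricsCustomizer(MeterRegistry registry) {
        return factory -> bindMetrics(factory, registry);
    }

    // Capture helper so the wildcard factory type lines up with the listener's generics.
    private static <K, V> void bindMetrics(DefaultKafkaConsumerFactory<K, V> factory, MeterRegistry registry) {
        factory.addListener(new MicrometerConsumerListener<>(registry));
    }
}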


❗Caveat

If connectivity from Azure to the Kafka brokers is fundamentally unreliable (e.g., over a VPN or through NAT), no amount of client-side tuning will fix it. In that case, deploying Kafka in Azure (or using Confluent Cloud) might be the only clean solution.


If you share:

  • Your Spring Kafka client config

  • The output of kafka-consumer-groups.sh

  • The logs of an Azure instance when the local ones are still active

…we can validate or rule out the hypothesis more concretely.
