Given a setup of 4 instances of a SpringBoot 2.x application with Kafka listeners.
- 2 instances are running in Azure.
- 2 instances are running in a local DC.
All instances use the same Kafka configuration, and the Kafka cluster itself also runs in the local DC. When messages arrive on a topic, they are only ever processed by one of the app instances in the local DC. Only when the local app instances are stopped are the messages processed by one of the app instances in Azure.
Answer:
Your observation suggests a clear imbalance in Kafka consumer group behavior, favoring the local data center (DC) instances over the Azure instances, even though all four are instances of the same Spring Boot app and belong to the same Kafka consumer group. The key question is: why do only the local instances get partitions assigned and process messages?
Let’s break this down based on how Kafka works and what could cause this behavior:
Kafka ensures that:
- Each partition is consumed by at most one consumer in a consumer group at a time.
- Partition assignment is decided centrally during rebalances, coordinated by the group coordinator broker.
- Kafka client heartbeats are used to determine consumer liveness.

So all four instances compete for partitions within a single group, as in the listener sketch below.
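For context, this is roughly the shape of what all four instances presumably run; a minimal sketch assuming plain String payloads and placeholder topic/group names (`orders`, `my-consumer-group`), not your actual configuration. Because every instance declares the same group id, each partition goes to exactly one of them:

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class OrderListener {

    // All four instances declare the same group id, so they form one consumer group
    // and each topic partition is assigned to exactly one of them at a time.
    // "orders" and "my-consumer-group" are placeholders, not your real names.
    @KafkaListener(topics = "orders", groupId = "my-consumer-group")
    public void onMessage(String payload) {
        // process the message
        System.out.println("Received: " + payload);
    }
}
```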
The Azure instances may be failing to maintain healthy heartbeats due to higher network latency to the local Kafka brokers, causing the broker to avoid assigning partitions to them.
- Kafka assumes consumers that don't send heartbeats on time are dead and removes them from the partition assignment.
- Spring Kafka clients use `max.poll.interval.ms`, `session.timeout.ms`, and `heartbeat.interval.ms` to configure this timing. If the latency from Azure → local DC is high, heartbeats may be delayed or missed.
Other factors that could contribute:
- Firewall/NAT delays or packet loss between Azure and the local DC.
- Kafka broker configuration, such as `connections.max.idle.ms`, may close idle connections prematurely if clients are far away.
- Spring Boot/Kafka client version mismatches or bugs, although if you're on the same version everywhere this is less likely.
- Kafka's rebalance strategy (e.g., range vs. round-robin) interacts with consumer ordering and can introduce bias when some consumers appear unreliable.
- Azure consumers get assigned partitions but crash or fail silently; check the logs carefully.
To see how partitions are actually being assigned, run:

```
kafka-consumer-groups.sh --bootstrap-server <broker> --describe --group <group-id>
```

This will show:
- Which instance (client.id/host) holds which partitions
- Whether the Azure instances are group members but have 0 partitions assigned
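If it's easier to check this from code (say, a small diagnostic runner you can deploy next to the Azure instances), the Kafka `AdminClient` exposes the same membership and assignment data. A minimal sketch, with a placeholder bootstrap server and group id:

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;

public class GroupAssignmentCheck {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap server; use the same value your app instances use.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1.local:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConsumerGroupDescription group = admin
                    .describeConsumerGroups(List.of("my-consumer-group"))
                    .all().get()
                    .get("my-consumer-group");

            // Print each member's client id, host and assigned partitions.
            // Azure members showing an empty assignment would confirm the hypothesis.
            group.members().forEach(member -> System.out.printf("%s @ %s -> %s%n",
                    member.clientId(), member.host(),
                    member.assignment().topicPartitions()));
        }
    }
}
```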
To watch rebalances and heartbeat activity, enable debug logging for the consumer internals in application.yml:

```yaml
logging:
  level:
    org.apache.kafka.clients.consumer.internals: DEBUG
```

Look for log entries indicating rebalances, heartbeats, or partition assignments.
Run `ping`, `traceroute`, and ideally something like `iperf` or `tcpdump` from the Azure instances towards the Kafka brokers to measure actual round-trip time and packet behavior.
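If ICMP is blocked along the VPN/NAT path, a rough TCP connect timing check from the Azure JVMs to the broker port can still be telling. A standalone sketch with a placeholder broker host and port:

```java
import java.net.InetSocketAddress;
import java.net.Socket;

public class BrokerLatencyProbe {

    public static void main(String[] args) throws Exception {
        // Placeholder broker address; use the host/port from your bootstrap servers.
        String host = "broker1.local";
        int port = 9092;

        // Time a handful of TCP connects to get a rough feel for the network path.
        for (int i = 0; i < 5; i++) {
            long start = System.nanoTime();
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 5_000);
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("connect #" + (i + 1) + ": " + elapsedMs + " ms");
        }
    }
}
```

This only measures connection setup time, so treat it as a complement to `iperf`/`tcpdump`, not a replacement.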
Make the Azure clients more tolerant of delay by raising the timeouts above the client defaults (10 s session timeout, 3 s heartbeat), for example:

```properties
spring.kafka.consumer.properties.session.timeout.ms=30000
spring.kafka.consumer.properties.heartbeat.interval.ms=10000
spring.kafka.consumer.properties.max.poll.interval.ms=300000
```

- Use the `spring.kafka.consumer.properties.*` passthrough; not all of these keys exist as first-class `spring.kafka.consumer.*` properties.
- Make sure `heartbeat.interval.ms` is no more than 1/3 of `session.timeout.ms`.
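If you would rather keep these settings in code (for example, to apply the longer timeouts only to the Azure deployment via a Spring profile), the same values can be set on the consumer factory. A sketch, assuming String keys/values and placeholder broker and group names; note that declaring these beans replaces Spring Boot's auto-configured factory:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

@Configuration
public class ConsumerTimeoutConfig {

    @Bean
    public DefaultKafkaConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        // Placeholders: use your real bootstrap servers and group id.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1.local:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // How long the group coordinator waits for heartbeats before declaring the consumer dead.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30_000);
        // Heartbeat interval: keep it at no more than 1/3 of the session timeout.
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 10_000);
        // Upper bound on the time between poll() calls before a rebalance is triggered.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300_000);
        return new DefaultKafkaConsumerFactory<>(props);
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory());
        return factory;
    }
}
```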
Switch to `org.apache.kafka.clients.consumer.RoundRobinAssignor` to avoid "locality bias" if you are using the range assignor (the default):

```properties
spring.kafka.consumer.properties.partition.assignment.strategy=org.apache.kafka.clients.consumer.RoundRobinAssignor
```

On the broker side, ensure:
- `connections.max.idle.ms` is long enough
- `advertised.listeners` is correctly set for cross-DC clients
- The broker isn't overloaded, causing delays in heartbeat processing
Expose JMX or Prometheus metrics to monitor consumer lag and heartbeat timing; with Spring Boot, Micrometer can surface the Kafka client's own metrics, as sketched below.
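One way to get those numbers with spring-kafka (2.5 or later) is its `MicrometerConsumerListener`, which registers the Kafka consumer client's metrics (heartbeat rate, last-poll age, records-lag-max, and so on) with a Micrometer `MeterRegistry`. A sketch, assuming a `MeterRegistry` bean is available (e.g. via the actuator/Prometheus starter) and a `DefaultKafkaConsumerFactory<String, String>` bean like the one sketched earlier:

```java
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.core.MicrometerConsumerListener;

@Configuration
public class KafkaConsumerMetricsConfig {

    // Pushes the Kafka consumer client's own metrics (heartbeat-rate,
    // last-poll-seconds-ago, records-lag-max, ...) into Micrometer so they can be
    // exported via JMX or Prometheus.
    @Bean
    public MicrometerConsumerListener<String, String> micrometerConsumerListener(
            DefaultKafkaConsumerFactory<String, String> consumerFactory,
            MeterRegistry meterRegistry) {
        MicrometerConsumerListener<String, String> listener =
                new MicrometerConsumerListener<>(meterRegistry);
        consumerFactory.addListener(listener);
        return listener;
    }
}
```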
If Azure connectivity to the Kafka brokers is fundamentally unreliable (e.g., via VPN, or NAT), no amount of tuning will make this reliable. In that case, deploying Kafka in Azure (or using Confluent Cloud) might be the only clean solution.
If you share:
- Your Spring Kafka client config
- The output of `kafka-consumer-groups.sh`
- The logs of an Azure instance while the local ones are still active

…we can validate or rule out the hypothesis more concretely.