Env: Spark 2.2.0 using Kafka integration 0.10
./spark-shell --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.0
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161)
Client with a new consumer group subscribed to a new topic starts from 'earliest' offset.
- Create a new Topic.
- Publish records 10-20 to it
- Start Streaming job with consumer group 1
- Observe job consumes 10-20
New consumer group must start first time from earliest given 'earliest' offset reset
-------------
1524397040000 ms -> 10...20
-------------
-------------
1524397050000 ms -> empty
-------------
Break job
Existing consumer group using 'earliest' offset reset starts from where it left last time.
- Publish records 21-30
- Restart Job with consumer group 1; offset reset = earliest
- Observe running job consumes 21-30
-------------
1524397240000 ms -> 21...30
-------------
-------------
1524397250000 ms -> empty
-------------
Break job
Job with new consumer group and 'earliest' offset reset starts from the beginning
- Change consumer group to 'group2'; auto-offset-reset= 'earliest'; Restart job
- Observe running job consumes 10-30
-------------
1524397360000 ms -> 10...30
-------------
-------------
1524397370000 ms -> empty
-------------
(Sanity check) Running job keeps consuming data
- While job runs, publish records 31-40
- Observe running job consumes 31-40
-------------
1524397420000 ms -> empty
-------------
-------------
1524397430000 ms -> 31...40
-------------
-------------
1524397440000 ms -> empty
-------------
Break job
A new consumer group with auto-offset-reset=lastest starts at the end of the topic
- Change consumer group to 'group3'; change offset reset to 'latest'; Restart job
- Observe the job does not consume data the first streaming iteration
-------------
1524397560000 ms -> empty
-------------
-------------
1524397570000 ms -> empty
-------------
(Sanity check) Running job started with auto-offset-reset=latest consumes new data.
- While job with consumer group 'group3' and offset reset='latest' runs, publish records 41-50
- Observe running job consumes 41-50
-------------
1524397600000 ms -> empty
-------------
-------------
1524397610000 ms -> 41...50
-------------
-------------
1524397620000 ms -> empty
-------------