Kafka built on Kubernetes gets stuck while consuming

2 years ago

#315541

Çetin Kesepara

I installed Kafka (3.1.0) as a StatefulSet on Kubernetes and then created a topic. Data comes in from outside Kubernetes through HAProxy and is sent to this topic. We then consume this data with an application.

Everything looks normal. The topic works fine and there is no problem with the consumer.

But if I try to consume this topic with a different consumer group, Kafka starts to get clogged. This doesn't always happen. It only happens after I start a consumer and stop it a few times.

I will try to explain this with an example. (I replaced some specific details with ...)

There is a topic that is already being consumed. The name of its consumer group is "test1".

1- I now join this topic as a new consumer using the console consumer.

Server Logs:

[2022-03-21 06:11:27,538] INFO [GroupCoordinator 0]: Dynamic member with unknown member id joins group console-consumer-18856 in Empty state. Created a new member id consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641 and request the member to rejoin with this id. (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:11:27,637] INFO [GroupCoordinator 0]: Preparing to rebalance group console-consumer-18856 in state PreparingRebalance with old generation 0 (__consumer_offsets-39) (reason: Adding new member consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:11:30,639] INFO [GroupCoordinator 0]: Stabilized group console-consumer-18856 generation 1 (__consumer_offsets-39) with 1 members (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:11:30,748] INFO [GroupCoordinator 0]: Assignment received from leader consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641 for group console-consumer-18856 for generation 1. The group has 1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)

2- Let's look at our groups

bin/kafka-consumer-groups.sh --bootstrap-server ... --list
console-consumer-18856
test1

3- Now let's stop the consumer. (ctrl + c)

[2022-03-21 06:20:36,646] INFO [GroupCoordinator 0]: Preparing to rebalance group console-consumer-18856 in state PreparingRebalance with old generation 1 (__consumer_offsets-39) (reason: Removing member consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:20:36,646] INFO [GroupCoordinator 0]: Group console-consumer-18856 with generation 2 is now empty (__consumer_offsets-39) (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:20:36,647] INFO [GroupCoordinator 0]: Member MemberMetadata(memberId=consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641, groupInstanceId=None, clientId=consumer-console-consumer-18856-1, clientHost=/..., sessionTimeoutMs=10000, rebalanceTimeoutMs=300000, supportedProtocols=List(range)) has left group console-consumer-18856 through explicit `LeaveGroup` request (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:20:57,243] INFO [GroupMetadataManager brokerId=0] Group console-consumer-18856 transitioned to Dead in generation 2 (kafka.coordinator.group.GroupMetadataManager)

4- Let's look at our groups again. (The group is gone from the list only once the "transitioned to Dead in generation 2" line has been logged. Until then it still appears in the list.)

bin/kafka-consumer-groups.sh --bootstrap-server ... --list
test1
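The broker log lines in steps 1 to 4 can be reduced to a short per-group event sequence, which makes the healthy join/leave path easier to compare against a group that gets stuck. This is only a minimal sketch: the function names and event labels are hypothetical, the patterns are based solely on the log lines quoted in this post, and the sample lines are abbreviated excerpts of those logs (the last one is the "has failed" line quoted further down).

```python
import re

# Patterns are assumptions derived only from the GroupCoordinator /
# GroupMetadataManager lines quoted in this post. Event labels are made up.
EVENT_PATTERNS = [
    ("join_requested", re.compile(r"joins group (\S+) in \w+ state")),
    ("rebalancing", re.compile(r"Preparing to rebalance group (\S+) in state")),
    ("stabilized", re.compile(r"Stabilized group (\S+) generation")),
    ("assigned", re.compile(r"Assignment received from leader \S+ for group (\S+) for generation")),
    ("empty", re.compile(r"Group (\S+) with generation \d+ is now empty")),
    ("left_explicitly", re.compile(r"has left group (\S+) through explicit")),
    ("failed", re.compile(r"in group (\S+) has failed, removing it")),
    ("dead", re.compile(r"Group (\S+) transitioned to Dead")),
]


def group_events(log_lines, group):
    """Return the ordered list of event labels seen for one consumer group."""
    events = []
    for line in log_lines:
        for label, pattern in EVENT_PATTERNS:
            match = pattern.search(line)
            if match and match.group(1) == group:
                events.append(label)
                break  # one label per log line
    return events


def left_explicitly(events):
    """True if the broker logged an explicit LeaveGroup for the group."""
    return "left_explicitly" in events


# Abbreviated excerpts of the log lines in this post (timestamps and ids trimmed).
SAMPLE = [
    "Dynamic member with unknown member id joins group console-consumer-18856 in Empty state.",
    "Preparing to rebalance group console-consumer-18856 in state PreparingRebalance with old generation 0",
    "Stabilized group console-consumer-18856 generation 1 (__consumer_offsets-39) with 1 members",
    "Assignment received from leader ... for group console-consumer-18856 for generation 1.",
    "Preparing to rebalance group console-consumer-18856 in state PreparingRebalance with old generation 1",
    "Group console-consumer-18856 with generation 2 is now empty (__consumer_offsets-39)",
    "Member MemberMetadata(...) has left group console-consumer-18856 through explicit `LeaveGroup` request",
    "Group console-consumer-18856 transitioned to Dead in generation 2",
    "Member ... in group console-consumer-43677 has failed, removing it from the group",
]

for name in ("console-consumer-18856", "console-consumer-43677"):
    seq = group_events(SAMPLE, name)
    print(name, seq, "explicit leave:", left_explicitly(seq))
```

Running the same function over the full broker log for a stuck group should show whether the broker ever logged an explicit leave for it, or only a "failed" removal, or nothing at all.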

Everything is normal up to this point: the consumer joins the topic with its own group and leaves again. However, the situation changes when we repeat the process of joining with a consumer a few times.

1- Let's join the same topic again with a different group.

[2022-03-21 07:02:46,377] INFO [GroupCoordinator 0]: Dynamic member with unknown member id joins group console-consumer-43677 in Empty state. Created a new member id consumer-console-consumer-43677-1-8809ea17-b557-4571-8c27-8cdbe417e052 and request the member to rejoin with this id. (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 07:02:46,510] INFO [GroupCoordinator 0]: Preparing to rebalance group console-consumer-43677 in state PreparingRebalance with old generation 0 (__consumer_offsets-38) (reason: Adding new member consumer-console-consumer-43677-1-8809ea17-b557-4571-8c27-8cdbe417e052 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 07:02:49,511] INFO [GroupCoordinator 0]: Stabilized group console-consumer-43677 generation 1 (__consumer_offsets-38) with 1 members (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 07:02:49,650] INFO [GroupCoordinator 0]: Assignment received from leader consumer-console-consumer-43677-1-8809ea17-b557-4571-8c27-8cdbe417e052 for group console-consumer-43677 for generation 1. The group has 1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)

2- Let's look at our groups

bin/kafka-consumer-groups.sh --bootstrap-server ... --list
console-consumer-43677
test1

3- Now let's stop the consumer. (ctrl + c)

This is where the problem starts. No leave log appears after stopping the consumer. Sometimes a log line like the one below appears.

[2022-03-21 07:03:30,045] INFO [GroupCoordinator 0]: Member consumer-console-consumer-43677-1-8809ea17-b557-4571-8c27-8cdbe417e052 in group console-consumer-43677 has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)

And Kafka gets stuck. I can no longer consume at all. Operations such as create and delete no longer work. Only list and describe work.

1- If I try to delete the group, Kafka won't let me, because even though I stopped the consumer (ctrl+c), the leave never actually completed.

bin/kafka-consumer-groups.sh --bootstrap-server ... --delete --group console-consumer-43677
Error: Deletion of some consumer groups failed:
* Group 'console-consumer-43677' could not be deleted due to: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=deleteConsumerGroups, deadlineMs=1647846924163, tries=1, nextAllowedTryMs=1647846924266) timed out at 1647846924166 after 1 attempt(s)

2- If I try to create a new topic:

bin/kafka-topics.sh --create --topic test-topic ...

Error while executing topic command : Call(callName=createTopics, deadlineMs=1647849153162, tries=2, nextAllowedTryMs=1647849153263) timed out at 1647849153163 after 2 attempt(s)
[2022-03-21 10:52:33,170] ERROR org.apache.kafka.common.errors.TimeoutException: Call(callName=createTopics, deadlineMs=1647849153162, tries=2, nextAllowedTryMs=1647849153263) timed out at 1647849153163 after 2 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: createTopics
 (kafka.admin.TopicCommand$)

3- If I try to join the topic with a consumer again, it gives timeout errors, and the trace logs show "re-join group" errors. I don't think this has anything to do with settings such as the session timeout or heartbeat, because Kafka shouldn't lock up even if I can't consume.
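The `deadlineMs` and `timed out at` values in the two timeout errors above look like epoch milliseconds, so they can be converted to wall-clock time and lined up against the broker log timestamps. A minimal sketch under that assumption; the function names are hypothetical and the regular expression is based only on the error text quoted in this post.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the "Call(...) timed out at ..." text as quoted in this post.
CALL_RE = re.compile(
    r"Call\(callName=(?P<call>\w+), deadlineMs=(?P<deadline>\d+), "
    r"tries=(?P<tries>\d+), nextAllowedTryMs=\d+\) "
    r"timed out at (?P<timed_out>\d+)"
)

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_utc(ms):
    """Convert epoch milliseconds to an ISO-8601 UTC string (integer math only)."""
    return (EPOCH + timedelta(milliseconds=ms)).isoformat(timespec="milliseconds")


def parse_timeout(line):
    """Extract call name, try count and UTC times from a timeout error line."""
    match = CALL_RE.search(line)
    if match is None:
        return None
    return {
        "call": match.group("call"),
        "tries": int(match.group("tries")),
        "deadline_utc": ms_to_utc(int(match.group("deadline"))),
        "timed_out_utc": ms_to_utc(int(match.group("timed_out"))),
    }


# The two error messages from steps 1 and 2 above.
DELETE_ERR = (
    "Call(callName=deleteConsumerGroups, deadlineMs=1647846924163, tries=1, "
    "nextAllowedTryMs=1647846924266) timed out at 1647846924166 after 1 attempt(s)"
)
CREATE_ERR = (
    "Call(callName=createTopics, deadlineMs=1647849153162, tries=2, "
    "nextAllowedTryMs=1647849153263) timed out at 1647849153163 after 2 attempt(s)"
)

for err in (DELETE_ERR, CREATE_ERR):
    print(parse_timeout(err))
```

For the errors above this gives 2022-03-21 07:15:24 UTC for the group deletion and 07:52:33 UTC for the topic creation, both after the 07:03:30 "has failed" line in the broker log. The client-side createTopics log is stamped 10:52:33, which would be consistent with the client logging in a UTC+3 local time zone while the broker logs in UTC, though that is only an inference from the numbers.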

Redeploying is the only way to fix the situation. But why does the error occur? Is this a bug? A race condition? Is there a solution? Is it specific to Kubernetes? Is it related to Kafka 3.1.0?

kubernetes

apache-kafka

consumer

0 Answers
