Kafka offset commit with lots of partitions

My team recently had a problem with our shared Kafka cluster. The Logstash processes in our centralized logging system were becoming unstable and crashing. These Logstash instances send logs to Kafka as a buffer, then consume the messages back out and insert them into ES. My colleague suspected that one of our newly deployed applications could be the cause of the crashes.

This application uses a topic for publishing all state events in the system. The topic is configured with 500 partitions, but the system is not live yet and there are no messages in the topic at all. The application is based on Spring Boot and spring-kafka. We are not using auto offset commit (the default in the Kafka client library); instead we configured spring-kafka to commit every 100 messages or when 10 seconds have passed since the last commit. Spring-kafka should not commit any offsets if there are no messages to consume.

Just to prove that spring-kafka doesn't do anything strange, I tried running a small client with some debug logging enabled.

logging:
  level.org.springframework.kafka.listener: DEBUG
  level.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator : DEBUG

spring:
  kafka:
    bootstrap-servers: "some-host:9092"
    consumer:
      ...
      enable-auto-commit: false

    listener:
      ack-mode: COUNT_TIME
      ack-count: 100
      ack-time: 10000

I can see from the logs that no offsets are committed, which makes sense:

 
: Received: 0 records
 : Committing in AckMode.COUNT_TIME because time elapsed exceeds configured limit of 10000
 : Commit list: {}
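
To make the behaviour in that log easier to reason about, here is a minimal sketch of the COUNT_TIME idea in plain Java. It is not spring-kafka's actual implementation; the class name `CountTimeAckSketch` and its methods are made up for illustration, and the exact comparison used for the time limit is an assumption. It only shows why the timer can fire while the commit list stays empty.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative model of COUNT_TIME acking; NOT spring-kafka's real code.
final class CountTimeAckSketch {
    private final int ackCount;    // commit after this many records...
    private final long ackTimeMs;  // ...or once this much time has passed since the last commit
    private int recordsSinceCommit = 0;
    private long lastCommitMs;
    // Offsets consumed since the last commit, keyed by "topic-partition".
    private final Map<String, Long> pending = new LinkedHashMap<>();

    CountTimeAckSketch(int ackCount, long ackTimeMs, long nowMs) {
        this.ackCount = ackCount;
        this.ackTimeMs = ackTimeMs;
        this.lastCommitMs = nowMs;
    }

    // Record that a message was processed; the committed value is the next offset to read.
    void onRecord(String topicPartition, long offset) {
        pending.put(topicPartition, offset + 1);
        recordsSinceCommit++;
    }

    // True when either the count limit or the time limit has been reached.
    boolean commitDue(long nowMs) {
        return recordsSinceCommit >= ackCount || nowMs - lastCommitMs >= ackTimeMs;
    }

    // Returns the offsets to commit (possibly empty) and resets the counters.
    Map<String, Long> commit(long nowMs) {
        Map<String, Long> commitList = new LinkedHashMap<>(pending);
        pending.clear();
        recordsSinceCommit = 0;
        lastCommitMs = nowMs;
        return commitList;
    }

    public static void main(String[] args) {
        CountTimeAckSketch acks = new CountTimeAckSketch(100, 10_000, 0);
        // No records arrive; 10 seconds later the timer fires with nothing to commit.
        System.out.println(acks.commitDue(10_000) + " " + acks.commit(10_000));
    }
}
```

With no records consumed, the time limit still triggers a commit attempt, but only partitions that actually delivered records end up in the commit list, so nothing is sent to the broker.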

So the new application is unlikely to be the cause. My colleague was later able to identify the actual problem, which is not the subject of this post. What I am more interested in is what happens if we switch to the auto-commit mechanism.

Consumer offset auto-commit mechanism


It is very interesting that auto-commit in the Kafka consumer library keeps committing the same offsets every time the commit interval is reached, even though the offsets haven't changed at all. This is to prevent the committed offsets from being deleted once offsets.retention.minutes has elapsed.

// The same log line appears every 10 seconds (the commit interval)
o.a.k.c.c.internals.ConsumerCoordinator  : 
Sending asynchronous auto-commit of offsets 
{big-topic-179=OffsetAndMetadata{offset=0, metadata=''}, 
big-topic-146=OffsetAndMetadata{offset=0, metadata=''},  
big-topic-47=OffsetAndMetadata{offset=0, metadata=''} 
....., 
big-topic-87=OffsetAndMetadata{offset=0, metadata=''}, 
big-topic-54=OffsetAndMetadata{offset=0, metadata=''}, 
big-topic-21=OffsetAndMetadata{offset=0, metadata=''}}
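
For reference, switching a plain Kafka consumer to auto-commit comes down to two settings. The sketch below builds them with `java.util.Properties` only, so it runs without the Kafka client on the classpath. The host and group id are taken from the config and logs in this post; the class name `AutoCommitConfigSketch` is made up, and the 10-second interval is an assumption matching the interval seen in the log above (the client default is shorter).

```java
import java.util.Properties;

// Illustrative consumer settings for auto-commit; the class name is hypothetical.
final class AutoCommitConfigSketch {
    static Properties consumerProps() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "some-host:9092");
        props.setProperty("group.id", "con-lots-of-partitions");
        // Let the consumer commit its current positions on a timer.
        props.setProperty("enable.auto.commit", "true");
        // Assumed 10 s interval, matching the log excerpt above.
        props.setProperty("auto.commit.interval.ms", "10000");
        return props;
    }

    public static void main(String[] args) {
        consumerProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

A real consumer would also need key and value deserializers before these properties could be passed to a `KafkaConsumer`.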

But how much trouble does committing all offsets every time cause? I tried turning on debug logging for org.apache.kafka.clients.NetworkClient to get more information. The log below shows the OffsetCommitRequest just before it is sent over the network.

 org.apache.kafka.clients.NetworkClient : Sending OFFSET_COMMIT 
{ group_id=con-lots-of-partitions,generation_id=36,
member_id=consumer-2-b2e44186-3d2a-47f1-817e0e1b8bd74623,
retention_time=-1,
topics=[{topic=big-topic,partitions=[{partition=0,offset=0,metadata=},
{partition=1,offset=0,metadata=},{partition=2,offset=0,metadata=},
{partition=3,offset=0,metadata=},{partition=4,offset=0,metadata=},
...
{partition=196,offset=0,metadata=},{partition=197,offset=0,metadata=},
{partition=198,offset=0,metadata=},{partition=199,offset=0,metadata=}]}]

} with correlation id 67 to node 2147483642

The offsets of all partitions are packed into a single commit request. I have a strong feeling that this should not matter much in terms of performance. Auto-commit should work fine even with a large number of partitions.
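
A rough back-of-the-envelope estimate supports that feeling. The sketch below assumes each partition entry in the request is encoded as a 4-byte partition id, an 8-byte offset and a 2-byte length prefix followed by the metadata string, which is my reading of the request layout that carries a retention_time field. It ignores the request header, group id, member id, topic name and any transport overhead, and says nothing about broker-side cost. The class and method names are made up for illustration.

```java
// Rough size estimate for the partition entries of one offset commit request.
// The per-entry layout is an assumption, not taken from the logs above.
final class OffsetCommitSizeSketch {
    // INT32 partition + INT64 offset + INT16 length prefix + metadata bytes
    static long partitionEntryBytes(int metadataLength) {
        return 4 + 8 + 2 + metadataLength;
    }

    // Total bytes used by the partition entries alone.
    static long partitionsPayloadBytes(int partitions, int metadataLength) {
        return partitions * partitionEntryBytes(metadataLength);
    }

    public static void main(String[] args) {
        // 200 partitions as in the request log, 500 as in the production topic.
        System.out.println(partitionsPayloadBytes(200, 0)); // 2800
        System.out.println(partitionsPayloadBytes(500, 0)); // 7000
    }
}
```

Under those assumptions, even 500 partitions with empty metadata add up to only about 7 KB of partition entries per commit interval.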