Apache Kafka: Basics for Application Developers
"We need Kafka." "Why?" "Because... Netflix uses it?"
Kafka is overkill for many, but indispensable for some. Let's understand what it actually is.
It's a Log, Not a Queue
In a traditional queue (RabbitMQ):
Producer -> Queue -> Consumer
Once a consumer reads a message, it's gone.
In Kafka:
Producer -> Log (Topic) -> Services
The message stays in the log for a configurable retention period (7 days by default). Multiple consumers can read the same message, each at their own pace.
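To get a feel for the difference, here is a minimal sketch using the kafka-python library (an assumption; any client works) against a broker at localhost:9092. Reading does not delete anything: another consumer, or this one restarted under a different group, would see the same records again.

from kafka import KafkaProducer, KafkaConsumer

# Append a few records to the "UserClicks" topic (broker address is an assumption).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("UserClicks", value=f"click-{i}".encode())
producer.flush()

# Read them back. The records stay in the log after being read.
consumer = KafkaConsumer(
    "UserClicks",
    bootstrap_servers="localhost:9092",
    group_id="analytics",            # hypothetical group name
    auto_offset_reset="earliest",    # start from the oldest retained record
    consumer_timeout_ms=5000,        # stop iterating once the log is exhausted
)
for record in consumer:
    print(record.offset, record.value)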
Core Concepts
1. Topic
A named category that stores a stream of records, e.g., "UserClicks" or "Transactions". Producers append to it; consumers read from it.
2. Partition
Topics are split into partitions. This is what lets Kafka scale: partitions can live on different brokers and be consumed in parallel.
- Partition 1: Messages 1-100
- Partition 2: Messages 101-200
Ordering is guaranteed only within a partition, not across the whole topic. Messages with the same key always land on the same partition, which is how you preserve per-key ordering (see the sketch below).
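A short sketch, again assuming kafka-python and a local broker: sending every event for a given user with the same key routes them all to one partition (the default partitioner hashes the key), so that user's events stay in order.

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Same key => same partition (default hash partitioner),
# so all of user-42's events keep their relative order.
for event in ["login", "view:home", "click:buy"]:
    producer.send("UserClicks", key=b"user-42", value=event.encode())
producer.flush()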
3. Consumer Group
The magic of Kafka.
- Service A (Group 1): Reads "UserClicks".
- Service B (Group 2): Also reads "UserClicks".
Both services get ALL messages. They don't compete.
BUT, if you start two instances of Service A (Group 1):
- Instance 1 reads Partition 1
- Instance 2 reads Partition 2
They split the load automatically (a sketch follows below).
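A minimal sketch (assuming kafka-python, a local broker, and hypothetical group names): run this script twice with the same group_id and the broker splits the topic's partitions between the two processes; run it with a different group_id and that process gets its own full copy of the stream.

from kafka import KafkaConsumer

# Same group_id across processes => partitions are shared (load balancing).
# Different group_id => an independent cursor over ALL messages.
consumer = KafkaConsumer(
    "UserClicks",
    bootstrap_servers="localhost:9092",
    group_id="service-a",            # hypothetical group name
    auto_offset_reset="earliest",
)
for record in consumer:
    print(f"partition={record.partition} offset={record.offset} value={record.value}")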
When to Use Kafka
- Activity Tracking: Track every click, view, scroll. High volume.
- Event Sourcing: Store state changes as a sequence of events.
- Log Aggregation: Collect logs from 100 servers centrally.
- Decoupling Services: one service publishes events, many others consume them independently.
When NOT to Use Kafka
- You need a simple work queue (use Redis/Sidekiq instead).
- You need complex routing (RabbitMQ's exchanges are a better fit).
- You process fewer than ~100 messages/second (Kafka is overkill).
Playing with Kafka (Docker)
Save the following as docker-compose.yml and start it with docker compose up -d:
version: '2'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:latest
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
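Once the containers are up, a quick smoke test, again a sketch assuming the kafka-python client and the advertised localhost:9092 listener (the topic name here is arbitrary):

from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test-topic", b"hello kafka")   # topics auto-create under the default broker config
producer.flush()

consumer = KafkaConsumer(
    "test-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for record in consumer:
    print(record.value)   # b'hello kafka'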
It has a steep learning curve but runs the backbone of modern data architecture.