Apache Kafka: Basics for Application Developers
"We need Kafka." "Why?" "Because... Netflix uses it?"
Kafka is overkill for many, but indispensable for some. Let's understand what it actually is.
It's a Log, Not a Queue
In a traditional queue (RabbitMQ):
Producer -> Queue -> Consumer
Once a consumer reads a message, it's gone.
In Kafka:
Producer -> Log (Topic) -> Services
The message stays in the log for a configurable retention period (7 days by default). Multiple consumers can read the same message, each at their own pace.
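To get a feel for the difference, here is a minimal sketch using the kafka-python library (an assumption; any client works) against a broker at localhost:9092. Reading does not delete anything: another consumer, or this one restarted under a different group, would see the same records again.

from kafka import KafkaProducer, KafkaConsumer

# Append a few records to the "UserClicks" topic (broker address is an assumption).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("UserClicks", value=f"click-{i}".encode())
producer.flush()

# Read them back. The records stay in the log after being read.
consumer = KafkaConsumer(
    "UserClicks",
    bootstrap_servers="localhost:9092",
    group_id="analytics",            # hypothetical group name
    auto_offset_reset="earliest",    # start from the oldest retained record
    consumer_timeout_ms=5000,        # stop iterating once the log is exhausted
)
for record in consumer:
    print(record.offset, record.value)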
Core Concepts
1. Topic
A named category that stores a stream of records, e.g., "UserClicks" or "Transactions". Producers append to it; consumers read from it.
2. Partition
Topics are split into partitions. This is what lets Kafka scale: partitions can live on different brokers and be consumed in parallel.
- Partition 1: Messages 1-100
- Partition 2: Messages 101-200
Ordering is guaranteed only within a partition, not across the whole topic. Messages with the same key always land on the same partition, which is how you preserve per-key ordering (see the sketch below).
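A short sketch, again assuming kafka-python and a local broker: sending every event for a given user with the same key routes them all to one partition (the default partitioner hashes the key), so that user's events stay in order.

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Same key => same partition (default hash partitioner),
# so all of user-42's events keep their relative order.
for event in ["login", "view:home", "click:buy"]:
    producer.send("UserClicks", key=b"user-42", value=event.encode())
producer.flush()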
3. Consumer Group
The magic of Kafka.
- Service A (Group 1): Reads "UserClicks".
- Service B (Group 2): Also reads "UserClicks".
Both services get ALL messages. They don't compete.
BUT, if you start two instances of Service A (Group 1):
- Instance 1 reads Partition 1
- Instance 2 reads Partition 2
They split the load automatically (a sketch follows below).
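A minimal sketch (assuming kafka-python, a local broker, and hypothetical group names): run this script twice with the same group_id and the broker splits the topic's partitions between the two processes; run it with a different group_id and that process gets its own full copy of the stream.

from kafka import KafkaConsumer

# Same group_id across processes => partitions are shared (load balancing).
# Different group_id => an independent cursor over ALL messages.
consumer = KafkaConsumer(
    "UserClicks",
    bootstrap_servers="localhost:9092",
    group_id="service-a",            # hypothetical group name
    auto_offset_reset="earliest",
)
for record in consumer:
    print(f"partition={record.partition} offset={record.offset} value={record.value}")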
When to Use Kafka
- Activity Tracking: Track every click, view, scroll. High volume.
- Event Sourcing: Store state changes as a sequence of events.
- Log Aggregation: Collect logs from 100 servers centrally.
- Decoupling Services: one service publishes events, many others consume them independently.
When NOT to Use Kafka
- You need a simple work queue (use Redis/Sidekiq instead).
- You need complex routing (RabbitMQ's exchanges are a better fit).
- You process fewer than ~100 messages/second (Kafka is overkill).
Playing with Kafka (Docker)
Save the following as docker-compose.yml and start it with docker compose up -d:
version: '2'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:latest
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
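Once the containers are up, a quick smoke test, again a sketch assuming the kafka-python client and the advertised localhost:9092 listener (the topic name here is arbitrary):

from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test-topic", b"hello kafka")   # topics auto-create under the default broker config
producer.flush()

consumer = KafkaConsumer(
    "test-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for record in consumer:
    print(record.value)   # b'hello kafka'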
It has a steep learning curve but runs the backbone of modern data architecture.