[Avg. reading time: 27 minutes]
Kafka
Introduction
Apache Kafka is a distributed streaming platform designed for high-throughput, fault-tolerant, real-time data pipelines. It's a publish-subscribe messaging system that excels at handling real-time data streams.
- Built at LinkedIn, open-sourced in 2011
- Designed as a distributed commit log
- Handles millions of events per second
- Extremely scalable and fault-tolerant
- Stores data durably for replay
Key Features
- High throughput: Can handle millions of messages per second
- Fault-tolerant: Data is replicated across servers
- Scalable: Can easily scale horizontally across multiple servers
- Persistent storage: Keeps messages for as long as you need
Apache Kafka is a publish/subscribe messaging system designed to solve the problem of moving data between systems reliably. It is often described as a “distributed commit log” or, more recently, as a “distributed streaming platform.”
A filesystem or database commit log is designed to provide a durable record of all transactions so that they can be replayed to build the state of a system consistently.
Basic Terms
Topic
Think of it like a TV channel or radio station where messages are published. A category or feed name to which messages are published and stored.
Key characteristics
- Multi-subscriber (multiple consumers can read from same topic)
- Durable (messages are persisted based on retention policy)
- Ordered (within each partition)
- Like a database table, but an infinite append-only log
Messages
- The fundamental unit of data in Kafka
- Similar to a row in a database, but immutable (can’t be changed once written)
Structure of a message
- Value: The actual data payload (array of bytes)
- Key: Optional identifier (more on this below)
- Timestamp
- Optional metadata (headers)
Messages don’t have a specific format requirement - they’re just bytes.
Sample Message
| Field | Description |
|---|---|
| Key (bytes) | Optional. Used for partitioning |
| Value (bytes) | Actual payload (JSON/Avro/Proto…) |
| Timestamp | CreateTime or LogAppendTime |
| Headers (optional) | Arbitrary metadata (byte key/value pairs) |
| Topic | Logical stream name |
| Partition | Defines ordering and parallelism |
| Offset | Sequential ID within the partition |
When the Producer sends
```json
{
  "key": "user_123",
  "value": {
    "userId": "user_123",
    "action": "login",
    "device": "iPhone",
    "location": "New York"
  },
  "headers": {
    "traceId": "abc-123-xyz",
    "version": "1.0",
    "source": "mobile-app"
  }
}
```
the consumer receives
```json
{
  "topic": "user_activities",
  "partition": 2,
  "offset": 15,
  "timestamp": "2024-11-13T14:30:00.123Z",
  "key": "user_123",
  "value": {
    "userId": "user_123",
    "action": "login",
    "device": "iPhone",
    "location": "New York"
  },
  "headers": {
    "traceId": "abc-123-xyz",
    "version": "1.0",
    "source": "mobile-app"
  }
}
```
Partitions
- Topics are broken down into multiple partitions
- Messages are written in an append-only fashion
Important aspects
- Each partition is an ordered, immutable sequence of messages
- Messages get a sequential ID called an “offset” within their partition
- Time-ordering is guaranteed only within a single partition, not across the entire topic
- Provides redundancy and scalability
- Can be hosted on different servers
Keys
An optional identifier for a message that serves two main purposes:
Partition Determination:
- Messages with same key always go to same partition
- No key = round-robin distribution across partitions
- Uses formula: hash(key) % number_of_partitions
Data Organization:
- Groups related messages together
- Useful for message compaction
Real-world Example:
Topic: "user_posts"
Key: userId
Message: post content
Partitions: Multiple partitions for scalability
Result: All posts from the same user (same key) go to the same partition, maintaining order for that user's posts
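To make the mapping concrete, here is a minimal sketch of the hash(key) % number_of_partitions idea. It is illustrative only: Kafka's real default partitioner hashes the key bytes with murmur2, not Python's built-in hash(), and the partition count here is an assumption.
```python
def pick_partition(key: bytes, num_partitions: int) -> int:
    # Illustrative stand-in for Kafka's partitioner (which uses murmur2).
    # The invariant is what matters: same key -> same partition.
    return hash(key) % num_partitions

# All posts from user_123 map to the same partition,
# so ordering is preserved for that user's posts.
for post in ["post-1", "post-2", "post-3"]:
    print(post, "-> partition", pick_partition(b"user_123", num_partitions=6))
```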
Offset
A unique, sequential identifier for each message within a partition. It starts at 0 and increments by 1 for each message.
Important characteristics:
- Immutable (never changes)
- Specific to a partition
- Used by consumers to track their position
- Example: In a partition with 5 messages → offsets are 0, 1, 2, 3, 4
Offset tracking is a collaboration between Kafka and consumers:
- Kafka maintains offsets in a special internal topic called __consumer_offsets
This topic stores the latest committed offset for each partition per consumer group
Format in __consumer_offsets:
Key: (group.id, topic, partition)
Value: offset value
Two types of offsets for consumers:
- Current Position: The offset of the next message to be read
- Committed Offset: The last offset that has been saved to Kafka
Two types of Commits
- Auto Commit: the default; offsets are committed automatically at a configured interval in milliseconds (auto_commit_interval_ms).
- Manual Commit: performed explicitly by the consumer.
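A minimal sketch of both styles with kafka-python (broker address, topic, and group ids are assumptions; handle_message is a hypothetical processing function):
```python
from kafka import KafkaConsumer

# Auto commit: offsets are committed in the background every second
auto_consumer = KafkaConsumer(
    'events',
    bootstrap_servers=['localhost:9092'],
    group_id='auto-group',
    enable_auto_commit=True,
    auto_commit_interval_ms=1000
)

# Manual commit: the application decides when an offset is saved
manual_consumer = KafkaConsumer(
    'events',
    bootstrap_servers=['localhost:9092'],
    group_id='manual-group',
    enable_auto_commit=False
)
for msg in manual_consumer:
    handle_message(msg)       # hypothetical processing function
    manual_consumer.commit()  # commit only after successful processing
```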
Batches
A collection of messages, all for the same topic and partition.
Benefits:
- More efficient network usage
- Better compression
- Faster I/O operations
Trade-off: Latency vs Throughput (larger batches = more latency but better throughput)
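With kafka-python, this trade-off is tuned on the producer via batch_size and linger_ms; a sketch, assuming a local broker:
```python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    batch_size=32768,        # max bytes per partition batch (default 16384)
    linger_ms=50,            # wait up to 50 ms to fill a batch: more latency, better throughput
    compression_type='gzip'  # batches compress better than single messages
)
```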
Producers
Producers create new messages. In general, a message will be produced to a specific topic.
Key behaviors:
- Can send to specific partitions or let Kafka handle distribution
- Can specify acknowledgment requirements (acks)
Partition assignment happens through:
- Round-robin (when no key is provided)
- Hash of key (when the message has a key)
A short sketch of these options follows.
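A sketch with kafka-python (topic name and broker address assumed):
```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=['localhost:9092'])

# Keyed send: hash of the key selects the partition
producer.send('user_posts', key=b'user_123', value=b'hello')

# No key: the producer distributes messages across partitions
producer.send('user_posts', value=b'no key, distributed by the producer')

# Explicit partition: bypasses the partitioner entirely
producer.send('user_posts', value=b'pinned', partition=0)

producer.flush()
```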
Consumers and Consumer Groups
Consumers read messages from topics
Consumer Groups:
- Multiple consumers working together
- Each partition is read by ONLY ONE consumer in a group
- Automatic rebalancing if consumers join/leave the group
Source: O’Reilly Kafka book
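A sketch of group scaling: start this script more than once with the same group_id and Kafka divides the topic's partitions among the running instances (topic, broker address, and group id are assumptions):
```python
from kafka import KafkaConsumer

# Each running copy of this process receives a share of the partitions.
consumer = KafkaConsumer(
    'user_activities',
    bootstrap_servers=['localhost:9092'],
    group_id='activity-workers'  # same group_id -> partitions are split up
)
for msg in consumer:
    print(f"partition={msg.partition} offset={msg.offset} value={msg.value}")
```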
Brokers and Clusters
Broker:
A single Kafka server. Responsibilities:
- Receive messages from producers
- Assign offsets
- Commit messages to storage
- Serve consumers
Cluster:
- Multiple brokers working together
- One broker acts as the Controller
- Handles replication and broker failure
- Provides scalability and fault tolerance
- A partition may be assigned to multiple brokers, which results in replication.
Source: O’Reilly Kafka book
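As a sketch, partition count and replication factor are set when a topic is created; with kafka-python's admin client (topic name and broker address assumed, and replication_factor=3 requires at least three brokers):
```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers=['localhost:9092'])
admin.create_topics([
    NewTopic(
        name='user_activities',
        num_partitions=6,      # parallelism: up to 6 consumers in one group
        replication_factor=3   # each partition is stored on 3 brokers
    )
])
```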
Message Delivery Semantics
Message Delivery Semantics are primarily controlled through Producer and Consumer configurations, not at the broker level.
At Least Once Delivery:
- Messages are never lost but might be redelivered.
- This is the default delivery method.
Scenario
- Consumer reads a message
- Processes the message
- Crashes before committing the offset
- After restart, it reads the same message again
On the producer side, retries > 0 has the matching effect: a retried send may create a duplicate. Best for cases where duplicate processing is acceptable.
Producer
```python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    acks='all',               # Strong durability (optional but recommended)
    retries=5,                # Enables retry -> allows duplicates
    enable_idempotence=False  # Idempotence off: producer retries may duplicate
)
producer.send('events', b'sample message')
producer.flush()
```
Consumer
```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'events',
    bootstrap_servers=['localhost:9092'],
    group_id='my_group',
    enable_auto_commit=False,     # Manual commit required for at-least-once
    auto_offset_reset='earliest'
)
for msg in consumer:
    # Process the message
    print(msg.value)
    # Commit only AFTER processing, ensuring at-least-once
    consumer.commit()
```
At Most Once Delivery:
- Messages might be lost but never redelivered
- Commits offset as soon as message is received
- Use when some data loss is acceptable but duplicates are not (on the producer side: retries=0)
Producer
```python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    acks='all',               # Strong durability (optional but recommended)
    retries=0,                # No retry: a failed send is simply dropped
    enable_idempotence=False  # No retries, so no duplicates; failures mean loss
)
producer.send('events', b'sample message')
producer.flush()
```
Consumer
Key settings: enable_auto_commit=True with auto_commit_interval_ms > 0.
```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'events',
    bootstrap_servers=['localhost:9092'],
    group_id='my_group',
    enable_auto_commit=True,        # Offsets are committed in the background...
    auto_offset_reset='earliest',
    auto_commit_interval_ms=1000    # ...every second, possibly before processing finishes
)
for msg in consumer:
    print(msg.value)
```
Exactly Once Delivery
- Messages are processed exactly once
- Achieved through transactional APIs
- Higher overhead but the strongest guarantee; requires enable_idempotence=True
```python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    acks='all',               # Must wait for all in-sync replicas
    enable_idempotence=True,  # Core requirement for EOS
    retries=5,                # Required (Kafka enforces retries > 0 here)
    transactional_id='txn-1'  # Required for transactions
)

# Initialize transaction state with the broker
producer.init_transactions()

# Start a transaction
producer.begin_transaction()

# Send messages inside the transaction
producer.send('events', b'event one')
producer.send('events', b'event two')

# Commit the transaction atomically
producer.commit_transaction()
```
Summary
- At Most Once: Highest performance, lowest reliability
- At Least Once: Good performance, possible duplicates
- Exactly Once: Highest reliability, lower performance
Can the producer and consumer have different semantics, e.g., a producer with exactly-once and a consumer with at-least-once?
Yes, it's possible.
```python
# Producer with Exactly Once
exactly_once_producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    acks='all',
    enable_idempotence=True,
    transactional_id='prod-1'
)

# Consumer with At Least Once
at_least_once_consumer = KafkaConsumer(
    'your_topic',
    bootstrap_servers=['localhost:9092'],
    group_id='my_group',
    enable_auto_commit=False,  # Manual commit
    auto_offset_reset='earliest'
    # To skip messages from aborted transactions, also set
    # isolation_level='read_committed' (the default is 'read_uncommitted')
)
```
Transaction ID & Group ID
transactional_id
- A unique identifier for a producer instance
- Ensures only one active producer with that ID
- Required for exactly-once message delivery
- If a new producer starts with same transactional_id, old one is fenced off
group_id
- Identifies a group of consumers working together
- Multiple consumers can share same group_id
- Used for load balancing - each partition assigned to only one consumer in group
- Manages partition distribution among consumers
| Feature | transactional_id | group_id |
|---|---|---|
| Purpose | Exactly-once delivery | Consumer scaling |
| Uniqueness | Must be unique | Shared |
| Active instances | One at a time | Multiple allowed |
| State management | Transaction state | Offset management |
| Failure handling | Fencing mechanism | Rebalancing |
| Scope | Producer only | Consumer only |