Apache Kafka: Real-Time Event Streaming Platform

What is Apache Kafka?

Apache Kafka is a distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. Originally developed at LinkedIn, Kafka can handle trillions of events per day.

Think of Kafka as a distributed commit log: events are appended once and can be read by multiple consumers, each at its own pace.
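The commit-log idea can be illustrated with a toy in-memory model. This is only a sketch of the concept, not how Kafka is implemented; the `CommitLog` and `LogReader` names are hypothetical and exist only for this example.

```python
class CommitLog:
    """Append-only in-memory log; a toy stand-in for a single partition."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at offset; nothing is removed."""
        return self._records[offset:offset + max_records]


class LogReader:
    """Tracks its own read position, independent of any other reader."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self._log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = CommitLog()
for event in ["login", "view", "purchase"]:
    log.append(event)

fast, slow = LogReader(log), LogReader(log)
fast_batch = fast.poll()               # reads all three events
slow_batch = slow.poll(max_records=1)  # reads only the first event
```

Because reading never removes records, the slow reader can catch up later without affecting the fast one.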

Core Concepts

Topics: Named categories for organizing messages (loosely comparable to database tables)
Partitions: Ordered, append-only subdivisions of a topic that enable parallelism
Producers: Applications that publish messages to topics
Consumers: Applications that read messages from topics
Brokers: Kafka servers that store data and serve clients
Consumer Groups: Sets of consumers that share the work of reading a topic, with each partition read by one member of the group
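To make partitions and consumer groups more concrete, here is a rough sketch of two mappings: key to partition, and partition to consumer. It is an illustration under stated assumptions, not Kafka's actual code: CRC32 stands in for the murmur2 hash used by Kafka's default partitioner, the round-robin assignment is a simplified version of what a group coordinator does, and the function names are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. CRC32 is a stand-in for Kafka's murmur2 hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(partitions, consumers):
    """Spread partitions round-robin over consumers; each partition gets one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


NUM_PARTITIONS = 6

# The same key always maps to the same partition, which preserves per-key ordering.
first = partition_for("user123", NUM_PARTITIONS)
second = partition_for("user123", NUM_PARTITIONS)
same_partition = first == second

assignment = assign_partitions(
    list(range(NUM_PARTITIONS)), ["consumer-a", "consumer-b"]
)
```

With six partitions and two consumers, each consumer owns three partitions; adding a third consumer would reduce that to two each.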

Producing Messages

from kafka import KafkaProducer
import json

# Create producer
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    key_serializer=lambda k: k.encode('utf-8') if k else None
)

# Send message
producer.send(
    'user-events',
    key='user123',
    value={'event': 'login', 'timestamp': '2024-01-01T10:00:00'}
)

# Send with callback
def on_success(metadata):
    print(f"Sent to {metadata.topic} partition {metadata.partition}")

def on_error(error):
    print(f"Error: {error}")

future = producer.send('user-events', value={'event': 'purchase'})
future.add_callback(on_success)
future.add_errback(on_error)

# Flush and close
producer.flush()
producer.close()
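Kafka itself stores only bytes, so the serializers passed to the producer decide what goes on the wire. The sketch below runs the same serializer lambdas outside the producer to show their output; no broker is required. The `value_deserializer` is the matching inverse a consumer would use.

```python
import json

# Same serializers as in the producer configuration above
value_serializer = lambda v: json.dumps(v).encode('utf-8')
key_serializer = lambda k: k.encode('utf-8') if k else None

# Inverse of value_serializer, as a consumer would apply it
value_deserializer = lambda m: json.loads(m.decode('utf-8'))

event = {'event': 'login', 'timestamp': '2024-01-01T10:00:00'}

key_bytes = key_serializer('user123')   # bytes used for partition selection
no_key = key_serializer(None)           # messages without a key stay keyless
value_bytes = value_serializer(event)   # JSON text encoded as UTF-8 bytes
round_trip = value_deserializer(value_bytes)
```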

Consuming Messages

from kafka import KafkaConsumer
import json

# Create consumer
consumer = KafkaConsumer(
    'user-events',
    bootstrap_servers=['localhost:9092'],
    group_id='my-consumer-group',
    auto_offset_reset='earliest',  # or 'latest'
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

# Consume messages
for message in consumer:
    print(f"Topic: {message.topic}")
    print(f"Partition: {message.partition}")
    print(f"Offset: {message.offset}")
    print(f"Key: {message.key}")
    print(f"Value: {message.value}")

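# Alternative to the loop above: poll with a timeout.
# poll() returns a dict mapping TopicPartition to a list of records.
records = consumer.poll(timeout_ms=1000)
for partition, messages in records.items():
    for message in messages:
        print(f"{partition.topic}[{partition.partition}] {message.value}")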

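# Manual commit
consumer = KafkaConsumer(
    'user-events',
    bootstrap_servers=['localhost:9092'],
    group_id='my-group',
    enable_auto_commit=False
)

def process(message):
    # Placeholder for application logic
    print(message.value)

for message in consumer:
    process(message)
    consumer.commit()  # Commit after processing (at-least-once delivery)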
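Committing only after processing gives at-least-once delivery: if the consumer fails between processing a record and committing it, that record is delivered again on restart. The simulation below sketches this behaviour with a plain list standing in for a partition. The `consume` function and its `crash_before_commit_at` parameter are hypothetical and exist only for this illustration.

```python
def consume(log, committed, process, crash_before_commit_at=None):
    """Process records starting at the committed offset, committing after each one.

    Returns the committed offset. A simulated crash between processing and
    commit leaves the offset unchanged, so that record is redelivered later.
    """
    offset = committed
    while offset < len(log):
        process(log[offset])
        if offset == crash_before_commit_at:
            return committed  # crash: the record was processed but not committed
        offset += 1
        committed = offset    # commit the position of the next record to read
    return committed


log = ["login", "view", "purchase"]
seen = []

# First run crashes after processing offset 1 but before committing it.
committed_after_crash = consume(log, 0, seen.append, crash_before_commit_at=1)

# The restart resumes from the last committed offset and reprocesses "view".
committed_final = consume(log, committed_after_crash, seen.append)
```

After both runs, `seen` contains "view" twice, which is why processing logic paired with manual commits is usually written to be idempotent.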

Kafka Connect

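Kafka Connect is a framework for streaming data between Kafka and external systems. The configuration below defines a Debezium PostgreSQL source connector (Debezium 2.0 and later use topic.prefix in place of database.server.name):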

# connector-config.json
{
  "name": "postgres-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "localhost",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "password",
    "database.dbname": "mydb",
    "database.server.name": "myserver",
    "table.include.list": "public.users,public.orders"
  }
}

# Deploy connector via REST API
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d @connector-config.json
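The same REST call can be made from Python. The sketch below uses only the standard library and builds the request without sending it, so it runs without a Connect worker. The connector config is trimmed to a few keys for brevity, and `build_create_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Trimmed version of connector-config.json, for illustration only
connector = {
    "name": "postgres-source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "localhost",
        "table.include.list": "public.users,public.orders",
    },
}


def build_create_request(connect_url, connector):
    """Build (but do not send) the POST request that registers a connector."""
    return urllib.request.Request(
        url=f"{connect_url}/connectors",
        data=json.dumps(connector).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_request("http://localhost:8083", connector)

# To send it against a running Connect worker:
#   with urllib.request.urlopen(request) as response:
#       print(response.status, response.read().decode("utf-8"))
```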

Schema Registry

Manage schemas for your Kafka messages with Avro:

from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import MessageField, SerializationContext

schema_str = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string"}
  ]
}
"""

schema_registry = SchemaRegistryClient({'url': 'http://localhost:8081'})
avro_serializer = AvroSerializer(schema_registry, schema_str)

producer = Producer({'bootstrap.servers': 'localhost:9092'})

user = {"id": 1, "name": "John", "email": "john@example.com"}

# The serializer needs the topic and field to derive the schema subject
context = SerializationContext('users', MessageField.VALUE)
producer.produce(
    'users',
    value=avro_serializer(user, context)
)

# produce() is asynchronous; flush() blocks until delivery completes
producer.flush()
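Messages serialized through Schema Registry do not embed the full schema. In the Confluent wire format, each message starts with a magic byte (0) and a 4-byte big-endian schema ID, followed by the Avro-encoded payload. The sketch below shows that framing with the standard library; the schema ID 42 and the payload bytes are made-up placeholders, and `frame` and `unframe` are hypothetical helper names.

```python
import struct

MAGIC_BYTE = 0
HEADER_SIZE = 5  # 1 magic byte + 4-byte schema ID


def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an Avro payload with the magic byte and big-endian schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload


def unframe(message: bytes):
    """Split a framed message into (schema_id, avro_payload)."""
    magic, schema_id = struct.unpack(">bI", message[:HEADER_SIZE])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[HEADER_SIZE:]


payload = b"\x02\x08John"   # placeholder bytes, not a real encoded User record
framed = frame(42, payload)  # 42 is a made-up schema ID
schema_id, body = unframe(framed)
```

A consumer uses the extracted ID to fetch the writer's schema from the registry before decoding the payload.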

Use Cases

Event Sourcing: Store all changes as a sequence of events
Log Aggregation: Collect logs from multiple services
Stream Processing: Real-time analytics and transformations
Data Integration: Connect different systems together
Messaging: Decouple services in microservice architectures
Activity Tracking: User behavior and clickstream data
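As an illustration of the event sourcing use case, the sketch below derives current state by replaying an ordered list of events, the way a consumer might rebuild state by reading a topic from the beginning. The event shapes and the `apply_event` function are hypothetical.

```python
from functools import reduce


def apply_event(balance, event):
    """Fold a single event into the current account balance."""
    if event["type"] == "deposited":
        return balance + event["amount"]
    if event["type"] == "withdrawn":
        return balance - event["amount"]
    return balance  # ignore event types this view does not care about


events = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
    {"type": "deposited", "amount": 5},
]

# Replaying every event from the start reconstructs the current state.
balance = reduce(apply_event, events, 0)

# Replaying only a prefix reconstructs the state at an earlier point in time.
balance_after_two = reduce(apply_event, events[:2], 0)
```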

Best Practices

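Choose the partition key wisely: It determines ordering (guaranteed only within a partition) and how evenly load is spread
Set appropriate retention: Balance storage cost against replay needs
Use consumer groups: Scale consumption horizontally
Monitor lag: Track how far behind consumers are
Handle failures: Route messages that repeatedly fail processing to a dead letter queue
Use schemas: Schema Registry enforces compatibility rules, catching breaking format changes before they reach consumers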
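Two of the practices above, monitoring lag and dead letter queues, can be sketched in plain Python. This is a simplified model under stated assumptions: offsets are given as dictionaries keyed by partition number, a list stands in for the dead letter topic, and the `consumer_lag` and `handle` names are hypothetical.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


def handle(record, process, dead_letters, max_attempts=3):
    """Try process(record) up to max_attempts times; park the record on failure."""
    last_error = None
    for _ in range(max_attempts):
        try:
            process(record)
            return True
        except Exception as exc:
            last_error = exc
    # In a real deployment this would be a produce() to a dead letter topic.
    dead_letters.append({"record": record, "error": str(last_error)})
    return False


lag = consumer_lag(end_offsets={0: 120, 1: 80}, committed_offsets={0: 100, 1: 80})

dead_letters = []
results = [handle(record, int, dead_letters) for record in ["10", "oops", "7"]]
```

Here partition 0 is 20 messages behind, and the unparseable record ends up in `dead_letters` while the other two are processed normally, so one bad message does not block the partition.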
