What is Apache Kafka?

Apache Kafka is a distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. Originally developed at LinkedIn, Kafka can handle trillions of events per day.

Think of Kafka as a distributed commit log - events are written once and can be read by multiple consumers at different speeds.

Core Concepts

  • Topics: Categories for organizing messages (like database tables)
  • Partitions: Topics are split into partitions for parallelism
  • Producers: Applications that publish messages to topics
  • Consumers: Applications that read messages from topics
  • Brokers: Kafka servers that store data and serve clients
  • Consumer Groups: Multiple consumers sharing the work of reading a topic

Producing Messages

from kafka import KafkaProducer
import json

# Create producer
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    key_serializer=lambda k: k.encode('utf-8') if k else None
)

# Send message
producer.send(
    'user-events',
    key='user123',
    value={'event': 'login', 'timestamp': '2024-01-01T10:00:00'}
)

# Send with callback
def on_success(metadata):
    print(f"Sent to {metadata.topic} partition {metadata.partition}")

def on_error(error):
    print(f"Error: {error}")

future = producer.send('user-events', value={'event': 'purchase'})
future.add_callback(on_success)
future.add_errback(on_error)

# Flush and close
producer.flush()
producer.close()

Consuming Messages

from kafka import KafkaConsumer
import json

# Create consumer
consumer = KafkaConsumer(
    'user-events',
    bootstrap_servers=['localhost:9092'],
    group_id='my-consumer-group',
    auto_offset_reset='earliest',  # or 'latest'
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

# Consume messages
for message in consumer:
    print(f"Topic: {message.topic}")
    print(f"Partition: {message.partition}")
    print(f"Offset: {message.offset}")
    print(f"Key: {message.key}")
    print(f"Value: {message.value}")

# Consume with timeout
consumer.poll(timeout_ms=1000)

# Manual commit
consumer = KafkaConsumer(
    'user-events',
    enable_auto_commit=False,
    group_id='my-group'
)

for message in consumer:
    process(message)
    consumer.commit()  # Commit after processing

Kafka Connect

Kafka Connect is a framework for streaming data between Kafka and external systems:

# connector-config.json
{
  "name": "postgres-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "localhost",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "password",
    "database.dbname": "mydb",
    "database.server.name": "myserver",
    "table.include.list": "public.users,public.orders"
  }
}

# Deploy connector via REST API
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d @connector-config.json

Schema Registry

Manage schemas for your Kafka messages with Avro:

from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

schema_str = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string"}
  ]
}
"""

schema_registry = SchemaRegistryClient({'url': 'http://localhost:8081'})
avro_serializer = AvroSerializer(schema_registry, schema_str)

producer = Producer({'bootstrap.servers': 'localhost:9092'})

user = {"id": 1, "name": "John", "email": "john@example.com"}
producer.produce(
    'users',
    value=avro_serializer(user, None)
)

Use Cases

  • Event Sourcing: Store all changes as a sequence of events
  • Log Aggregation: Collect logs from multiple services
  • Stream Processing: Real-time analytics and transformations
  • Data Integration: Connect different systems together
  • Messaging: Decouple services in microarchitectures
  • Activity Tracking: User behavior and clickstream data

Best Practices

  • Choose partition key wisely: Affects ordering and parallelism
  • Set appropriate retention: Balance storage vs. replay needs
  • Use consumer groups: Scale consumption horizontally
  • Monitor lag: Track how far behind consumers are
  • Handle failures: Implement dead letter queues
  • Use schemas: Schema Registry prevents data issues

Master Apache Kafka with Expert Mentorship

Our Data Engineering program covers Kafka from basics to production deployment. Build real-time streaming pipelines with guidance from industry experts.

Explore Data Engineering Program

Related Articles