What is Apache Kafka?
Apache Kafka is a distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. Originally developed at LinkedIn, Kafka can handle trillions of events per day.
Think of Kafka as a distributed commit log - events are written once and can be read by multiple consumers at different speeds.
Core Concepts
- Topics: Categories for organizing messages (like database tables)
- Partitions: Topics are split into partitions for parallelism
- Producers: Applications that publish messages to topics
- Consumers: Applications that read messages from topics
- Brokers: Kafka servers that store data and serve clients
- Consumer Groups: Multiple consumers sharing the work of reading a topic
Producing Messages
from kafka import KafkaProducer
import json
# Create producer
producer = KafkaProducer(
bootstrap_servers=['localhost:9092'],
value_serializer=lambda v: json.dumps(v).encode('utf-8'),
key_serializer=lambda k: k.encode('utf-8') if k else None
)
# Send message
producer.send(
'user-events',
key='user123',
value={'event': 'login', 'timestamp': '2024-01-01T10:00:00'}
)
# Send with callback
def on_success(metadata):
print(f"Sent to {metadata.topic} partition {metadata.partition}")
def on_error(error):
print(f"Error: {error}")
future = producer.send('user-events', value={'event': 'purchase'})
future.add_callback(on_success)
future.add_errback(on_error)
# Flush and close
producer.flush()
producer.close()
Consuming Messages
from kafka import KafkaConsumer
import json
# Create consumer
consumer = KafkaConsumer(
'user-events',
bootstrap_servers=['localhost:9092'],
group_id='my-consumer-group',
auto_offset_reset='earliest', # or 'latest'
value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)
# Consume messages
for message in consumer:
print(f"Topic: {message.topic}")
print(f"Partition: {message.partition}")
print(f"Offset: {message.offset}")
print(f"Key: {message.key}")
print(f"Value: {message.value}")
# Consume with timeout
consumer.poll(timeout_ms=1000)
# Manual commit
consumer = KafkaConsumer(
'user-events',
enable_auto_commit=False,
group_id='my-group'
)
for message in consumer:
process(message)
consumer.commit() # Commit after processing
Kafka Connect
Kafka Connect is a framework for streaming data between Kafka and external systems:
# connector-config.json
{
"name": "postgres-source",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "localhost",
"database.port": "5432",
"database.user": "postgres",
"database.password": "password",
"database.dbname": "mydb",
"database.server.name": "myserver",
"table.include.list": "public.users,public.orders"
}
}
# Deploy connector via REST API
curl -X POST http://localhost:8083/connectors \
-H "Content-Type: application/json" \
-d @connector-config.json
Schema Registry
Manage schemas for your Kafka messages with Avro:
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
schema_str = """
{
"type": "record",
"name": "User",
"fields": [
{"name": "id", "type": "int"},
{"name": "name", "type": "string"},
{"name": "email", "type": "string"}
]
}
"""
schema_registry = SchemaRegistryClient({'url': 'http://localhost:8081'})
avro_serializer = AvroSerializer(schema_registry, schema_str)
producer = Producer({'bootstrap.servers': 'localhost:9092'})
user = {"id": 1, "name": "John", "email": "john@example.com"}
producer.produce(
'users',
value=avro_serializer(user, None)
)
Use Cases
- Event Sourcing: Store all changes as a sequence of events
- Log Aggregation: Collect logs from multiple services
- Stream Processing: Real-time analytics and transformations
- Data Integration: Connect different systems together
- Messaging: Decouple services in microarchitectures
- Activity Tracking: User behavior and clickstream data
Best Practices
- Choose partition key wisely: Affects ordering and parallelism
- Set appropriate retention: Balance storage vs. replay needs
- Use consumer groups: Scale consumption horizontally
- Monitor lag: Track how far behind consumers are
- Handle failures: Implement dead letter queues
- Use schemas: Schema Registry prevents data issues
Master Apache Kafka with Expert Mentorship
Our Data Engineering program covers Kafka from basics to production deployment. Build real-time streaming pipelines with guidance from industry experts.
Explore Data Engineering Program