What is Databricks?
Databricks is a unified analytics platform that brings data engineering, data science, and machine learning together in a single environment. Founded by the creators of Apache Spark, it provides a managed Spark runtime plus tooling for the rest of the data lifecycle.
Databricks pioneered the "lakehouse" architecture, combining the best of data lakes and data warehouses.
Databricks Architecture
┌─────────────────────────────────────────────────────────────────┐
│                      Databricks Lakehouse                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                       Unity Catalog                       │  │
│  │          (Governance, Security, Data Discovery)           │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                        Workspaces                         │  │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐       │  │
│  │  │Notebooks│  │  Jobs   │  │  Repos  │  │Workflows│       │  │
│  │  └─────────┘  └─────────┘  └─────────┘  └─────────┘       │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                       Compute Layer                       │  │
│  │  ┌─────────────────┐      ┌─────────────────┐             │  │
│  │  │   All-Purpose   │      │  Job Clusters   │             │  │
│  │  │    Clusters     │      │  (Auto-scale)   │             │  │
│  │  └─────────────────┘      └─────────────────┘             │  │
│  │  ┌─────────────────┐      ┌─────────────────┐             │  │
│  │  │  SQL Warehouse  │      │   Serverless    │             │  │
│  │  │   (Analytics)   │      │     Compute     │             │  │
│  │  └─────────────────┘      └─────────────────┘             │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                Delta Lake (Storage Layer)                 │  │
│  │                Cloud Storage (S3/ADLS/GCS)                │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
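The compute layer shown above can be provisioned through the UI or programmatically. Below is a minimal sketch of the latter using the Clusters REST API from Python; the workspace URL, token, and cluster settings are placeholder assumptions, not values from this article.
# Create a small all-purpose cluster via the Clusters API (placeholder host/token)
import requests

DATABRICKS_HOST = "https://<workspace-url>"   # assumption: your workspace URL
TOKEN = "<personal-access-token>"             # assumption: a PAT with cluster-create rights

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_name": "dev-all-purpose",
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
        "autotermination_minutes": 60,  # stop idle dev clusters automatically
    },
)
resp.raise_for_status()
print("cluster_id:", resp.json()["cluster_id"])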
Databricks Notebooks
# Databricks notebook basics
# Notebooks support Python, Scala, SQL, R
# Magic commands
%python # Switch to Python
%sql # Switch to SQL
%scala # Switch to Scala
%md # Markdown cell
%run # Run another notebook
%fs # File system commands
# Read data
df = spark.read.format("delta").load("/mnt/data/customers")
df.display() # Databricks-specific display
# SQL in notebook
%sql
SELECT * FROM customers
WHERE created_at > '2024-01-01'
LIMIT 100
# Create widgets (parameters)
dbutils.widgets.text("start_date", "2024-01-01")
dbutils.widgets.dropdown("environment", "dev", ["dev", "staging", "prod"])
# Use widget values
start_date = dbutils.widgets.get("start_date")
environment = dbutils.widgets.get("environment")
# File system operations
dbutils.fs.ls("/mnt/data/")
dbutils.fs.head("/mnt/data/sample.csv")
dbutils.fs.cp("/source/", "/destination/", recurse=True)
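Beyond %run, notebooks can also be chained programmatically with dbutils.notebook.run, which passes parameters that the child notebook reads as widgets. A small sketch reusing the widget values above; the child notebook path is a hypothetical example.
# Run a child notebook, passing the widget values defined above as parameters
# (the path "/Repos/etl/load_customers" is illustrative, not from this article)
result = dbutils.notebook.run(
    "/Repos/etl/load_customers",
    600,  # timeout in seconds
    {"start_date": start_date, "environment": environment},
)
print(result)  # whatever the child notebook returns via dbutils.notebook.exit()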
Delta Lake on Databricks
# Create Delta table
df.write.format("delta").saveAsTable("bronze.raw_events")
# Read Delta table
events = spark.table("bronze.raw_events")
# SQL operations on Delta
%sql
-- Create managed table
CREATE TABLE silver.customers (
  customer_id STRING,
  name STRING,
  email STRING,
  created_at TIMESTAMP,
  -- Delta partitions on columns, not expressions, so derive a date column
  created_date DATE GENERATED ALWAYS AS (CAST(created_at AS DATE))
) USING DELTA
PARTITIONED BY (created_date)
-- Insert data (the generated created_date column is computed automatically)
INSERT INTO silver.customers (customer_id, name, email, created_at)
SELECT customer_id, name, email, created_at FROM bronze.raw_customers
-- Merge (upsert)
MERGE INTO silver.customers AS target
USING staging.new_customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
-- Time travel
SELECT * FROM silver.customers VERSION AS OF 5
SELECT * FROM silver.customers TIMESTAMP AS OF '2024-12-20'
-- Optimize
OPTIMIZE silver.customers ZORDER BY (customer_id)
-- Vacuum old files
VACUUM silver.customers RETAIN 168 HOURS
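The same merge, history, and maintenance operations are available from Python through the DeltaTable API. A brief sketch mirroring the SQL above; the table names reuse the examples in this section.
from delta.tables import DeltaTable

# Upsert equivalent to the MERGE statement above
target = DeltaTable.forName(spark, "silver.customers")
source = spark.table("staging.new_customers")

(target.alias("t")
    .merge(source.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Inspect the table history that time travel relies on
target.history().select("version", "timestamp", "operation").display()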
Unity Catalog
# Unity Catalog - Unified governance for all data assets
%sql
-- Three-level namespace: catalog.schema.table
USE CATALOG main;
USE SCHEMA production;
-- Create catalog
CREATE CATALOG analytics;
CREATE SCHEMA analytics.sales;
-- Grant permissions
GRANT USE CATALOG ON CATALOG analytics TO `data-engineers`;
GRANT USE SCHEMA, SELECT ON SCHEMA analytics.sales TO `analysts`;
GRANT ALL PRIVILEGES ON TABLE analytics.sales.orders TO `admin`;
-- View permissions
SHOW GRANTS ON TABLE analytics.sales.orders;
-- Data lineage (automatic)
-- Unity Catalog tracks:
-- - Which tables read from which sources
-- - Which notebooks/jobs touch which tables
-- - Column-level lineage
-- Tag data for governance
ALTER TABLE analytics.sales.customers
SET TAGS ('pii' = 'true', 'retention' = '7years');
-- Search for tagged assets
SELECT * FROM system.information_schema.table_tags
WHERE tag_name = 'pii';
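Tags surfaced in information_schema can also be audited from Python, which is useful for automated governance checks. A minimal sketch using the same system table queried above; the column names assume the standard table_tags layout.
# List every table tagged as PII so retention or masking policies can be applied
pii_tables = spark.sql("""
    SELECT catalog_name, schema_name, table_name, tag_value
    FROM system.information_schema.table_tags
    WHERE tag_name = 'pii'
""")
pii_tables.display()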
Workflows & Jobs
# Create a job via API or UI
# Jobs can run notebooks, Python scripts, JARs, SQL
# Example: Multi-task job (DAG)
{
  "name": "daily_etl_pipeline",
  "tasks": [
    {
      "task_key": "extract_raw",
      "notebook_task": {
        "notebook_path": "/Repos/etl/extract_raw"
      },
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    },
    {
      "task_key": "transform_silver",
      "depends_on": [{"task_key": "extract_raw"}],
      "notebook_task": {
        "notebook_path": "/Repos/etl/transform_silver",
        "base_parameters": {
          "date": "{{job.start_time.iso_date}}"
        }
      }
    },
    {
      "task_key": "aggregate_gold",
      "depends_on": [{"task_key": "transform_silver"}],
      "notebook_task": {
        "notebook_path": "/Repos/etl/aggregate_gold"
      }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "UTC"
  }
}
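The JSON spec above can be submitted through the Jobs API 2.1 rather than the UI. A hedged sketch from Python; the workspace URL, the token, and the local file holding the spec are placeholder assumptions.
import json
import requests

DATABRICKS_HOST = "https://<workspace-url>"   # assumption: your workspace URL
TOKEN = "<personal-access-token>"             # assumption: token with Jobs API access

# Load the job spec shown above (saved locally as JSON for this sketch)
with open("daily_etl_pipeline.json") as f:
    job_spec = json.load(f)

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job_id:", resp.json()["job_id"])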
SQL Warehouse
-- Databricks SQL for BI and analytics workloads
-- Create SQL Warehouse via UI or API
-- Serverless or Pro options available
-- Connect BI tools (Tableau, Power BI, Looker)
-- via JDBC/ODBC using warehouse endpoint
-- Example queries optimized for SQL Warehouse
SELECT
  DATE_TRUNC('month', order_date) AS month,
  product_category,
  SUM(revenue) AS total_revenue,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM gold.sales_summary
WHERE order_date >= '2024-01-01'
GROUP BY 1, 2
ORDER BY month, total_revenue DESC;
-- Create dashboard queries
-- Save as SQL queries in SQL Editor
-- Build dashboards with visualizations
-- Query Federation (query external sources)
CREATE CONNECTION my_postgres
TYPE postgresql
OPTIONS (
  host 'postgres.example.com',
  port '5432',
  user secret('postgres-creds', 'username'),
  password secret('postgres-creds', 'password')
);
-- Expose the external database as a foreign catalog,
-- then query its tables like any other: external_postgres.public.customers
CREATE FOREIGN CATALOG external_postgres
USING CONNECTION my_postgres
OPTIONS (database 'mydb');
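Programmatic clients connect to a SQL Warehouse the same way BI tools do, over its HTTP endpoint. A minimal sketch using the databricks-sql-connector package; the hostname, HTTP path, and token are placeholders taken from a warehouse's Connection details tab.
# pip install databricks-sql-connector
from databricks import sql

with sql.connect(
    server_hostname="<workspace-host>",                # placeholder
    http_path="/sql/1.0/warehouses/<warehouse-id>",    # placeholder
    access_token="<personal-access-token>",            # placeholder
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT * FROM gold.sales_summary LIMIT 10")
        for row in cursor.fetchall():
            print(row)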
MLflow on Databricks
import mlflow
from sklearn.ensemble import RandomForestClassifier
# MLflow is built into Databricks
# Auto-logging enabled by default
# Train and log model
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)
    # Log metrics
    accuracy = model.score(X_test, y_test)
    mlflow.log_metric("accuracy", accuracy)
    # Log model
    mlflow.sklearn.log_model(model, "model")
# Register model to Unity Catalog
mlflow.register_model(
    "runs:/abc123/model",
    "main.ml_models.customer_churn"
)
# Deploy model as endpoint
# Via UI or API, create serving endpoint
# Models auto-scale based on traffic
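A registered Unity Catalog model can also be loaded back for batch scoring without a serving endpoint. A short sketch assuming version 1 of the model registered above and that the MLflow client is pointed at the Unity Catalog registry.
# Point MLflow at the Unity Catalog registry if it is not already the default
mlflow.set_registry_uri("databricks-uc")

# Load a specific registered version and score a batch of features
churn_model = mlflow.pyfunc.load_model("models:/main.ml_models.customer_churn/1")
predictions = churn_model.predict(X_test)  # X_test from the training example above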
Best Practices
- Use Unity Catalog: Centralized governance from day one
- Medallion architecture: Bronze/Silver/Gold for data organization
- Job clusters: Use job clusters for production, all-purpose for dev
- Repos: Version control notebooks with Git integration
- Secrets: Never hardcode credentials; use secret scopes (see the sketch after this list)
- Photon: Enable for faster SQL and Delta operations
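A minimal sketch of the secrets practice above, reading credentials from a secret scope instead of hardcoding them; the scope, key, and JDBC connection details are illustrative assumptions.
# Secret values are redacted in notebook output and never stored in the notebook
db_password = dbutils.secrets.get(scope="prod-warehouse", key="db-password")  # illustrative scope/key

jdbc_df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://postgres.example.com:5432/mydb")  # illustrative URL
    .option("dbtable", "public.customers")
    .option("user", "etl_user")        # illustrative user
    .option("password", db_password)
    .load())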
Master Databricks
Our Data Engineering program covers Databricks, lakehouse architecture, and production data platforms.
Explore Data Engineering Program