Ollama: Running LLMs Locally

What is Ollama?

Ollama is an open-source tool that makes it easy to run Large Language Models locally on your machine. It provides a simple interface to download, run, and manage various open-source models like Llama, Mistral, and more.

Why run LLMs locally?

Privacy: Your data never leaves your machine
Cost: No API fees - run unlimited queries
Offline: Works without an internet connection once a model is downloaded
Customization: Adjust model behavior with system prompts, parameters, and templates
Speed: No network latency for local inference

Installation

macOS

# Using Homebrew
brew install ollama

# Or download from ollama.ai

Linux

curl -fsSL https://ollama.ai/install.sh | sh

Windows

# Download installer from ollama.ai
# Or use WSL2 with Linux installation

Docker

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# With GPU support
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Getting Started

Download and Run a Model

# Start Ollama service
ollama serve

# In another terminal, pull and run a model
ollama run llama2

# Chat with the model
>>> What is machine learning?
Machine learning is a subset of artificial intelligence...
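
Before sending prompts from code, it can help to confirm that the server started above is reachable. The following is a minimal sketch using only the Python standard library. It assumes the default address http://localhost:11434 from the Docker commands and that the root URL answers with HTTP 200 while the server is running; the helper name ollama_is_up is hypothetical and not part of any Ollama package.

```python
import urllib.error
import urllib.request


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling ollama_is_up() before the first request lets a script fail early with a clear message instead of a connection traceback.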

Available Models

# Popular models
ollama pull llama2              # Meta's Llama 2 (7B default)
ollama pull llama2:13b          # Larger Llama 2
ollama pull mistral             # Mistral 7B
ollama pull mixtral             # Mistral's MoE model
ollama pull codellama           # Code-specialized Llama
ollama pull phi                 # Microsoft's small model
ollama pull neural-chat         # Intel's chat model
ollama pull starling-lm         # Berkeley's model

# List downloaded models
ollama list

# Remove a model
ollama rm llama2
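
Model references such as llama2:13b combine a model name with a tag, and a bare name like mistral resolves to the latest tag. As a rough illustration, the sketch below splits a reference into those two parts. The function name parse_model_ref is hypothetical, and the parsing rule is an assumption based on the name:tag form shown above, not Ollama's own parser.

```python
from typing import Tuple


def parse_model_ref(ref: str) -> Tuple[str, str]:
    """Split 'name:tag' into (name, tag), defaulting the tag to 'latest'."""
    # Only look for the tag after the last '/', so a namespace prefix is kept intact.
    head, sep, last = ref.rpartition("/")
    name, colon, tag = last.partition(":")
    return (head + sep + name, tag if colon and tag else "latest")


print(parse_model_ref("llama2:13b"))
print(parse_model_ref("mistral"))
```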

Using Ollama with Python

Basic Usage

import ollama

# Simple generation
response = ollama.generate(
    model='llama2',
    prompt='Explain quantum computing in simple terms'
)
print(response['response'])

# Chat interface
response = ollama.chat(
    model='llama2',
    messages=[
        {'role': 'user', 'content': 'Why is the sky blue?'}
    ]
)
print(response['message']['content'])
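
The chat call above sends a single message. For a multi-turn conversation, the usual pattern is to resend the whole message list on every call, since each request carries its own history. The sketch below shows one way to maintain that list; add_turn is a hypothetical helper, and the assistant reply is hard-coded here so the example runs without a server.

```python
from typing import Dict, List

Message = Dict[str, str]


def add_turn(history: List[Message], role: str, content: str) -> List[Message]:
    """Return a new history list with one more message appended."""
    if role not in {"system", "user", "assistant"}:
        raise ValueError(f"unexpected role: {role!r}")
    return history + [{"role": role, "content": content}]


history: List[Message] = []
history = add_turn(history, "user", "Why is the sky blue?")
# In a real session this text would come from response['message']['content'].
history = add_turn(history, "assistant", "Because of Rayleigh scattering.")
history = add_turn(history, "user", "Does the same effect explain red sunsets?")
```

The resulting history list can be passed as the messages argument of the next ollama.chat call.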

Streaming Responses

import ollama

# Stream the response
for chunk in ollama.chat(
    model='llama2',
    messages=[{'role': 'user', 'content': 'Write a poem about AI'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)
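
Printing chunks as they arrive is enough for a terminal, but an application often needs the complete text afterwards as well. The sketch below joins the streamed pieces into one string. It assumes each chunk has the chunk['message']['content'] shape used above; collect_stream is a hypothetical name, and the chunks are faked so the example runs offline.

```python
from typing import Dict, Iterable


def collect_stream(chunks: Iterable[Dict]) -> str:
    """Concatenate the content of every streamed chat chunk."""
    parts = []
    for chunk in chunks:
        parts.append(chunk["message"]["content"])
    return "".join(parts)


# Stand-in for the iterator returned by ollama.chat(..., stream=True).
fake_chunks = [
    {"message": {"content": "Roses "}},
    {"message": {"content": "are "}},
    {"message": {"content": "red"}},
]
full_text = collect_stream(fake_chunks)
print(full_text)
```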

Using the REST API

import requests

# Ollama's native REST API (the OpenAI-compatible endpoints live under /v1)
response = requests.post(
    'http://localhost:11434/api/generate',
    json={
        'model': 'llama2',
        'prompt': 'Hello, how are you?',
        'stream': False
    }
)
print(response.json()['response'])
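
The request above sets 'stream': False and gets one JSON object back. With streaming enabled, the endpoint instead returns one JSON object per line, each carrying a piece of text in its response field and a done flag on the last one. The sketch below joins such lines into the full text. The function name is hypothetical and the sample lines are hand-written to match that shape, so treat it as an illustration rather than a complete client.

```python
import json
from typing import Iterable


def join_generate_stream(lines: Iterable[bytes]) -> str:
    """Join the 'response' fields of newline-delimited JSON events."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip keep-alive blank lines
        event = json.loads(raw)
        parts.append(event.get("response", ""))
        if event.get("done"):
            break
    return "".join(parts)


sample_lines = [
    b'{"response": "Hel", "done": false}\n',
    b'{"response": "lo", "done": false}\n',
    b'{"response": "", "done": true}\n',
]
print(join_generate_stream(sample_lines))
```

With the requests library, the lines could come from requests.post(..., stream=True).iter_lines().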

Integration with LangChain

from langchain_community.llms import Ollama
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Basic LLM
llm = Ollama(model="llama2")
response = llm.invoke("What is the capital of France?")

# Chat model
chat = ChatOllama(model="llama2")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{input}")
])

chain = prompt | chat | StrOutputParser()
response = chain.invoke({"input": "Explain Docker"})

Ollama with RAG

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.chat_models import ChatOllama
from langchain.chains import RetrievalQA

# Use Ollama for embeddings too
embeddings = OllamaEmbeddings(model="llama2")

# Create vector store (docs is assumed to be a list of LangChain Document objects loaded earlier)
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings
)

# Create RAG chain
llm = ChatOllama(model="llama2")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever()
)

result = qa_chain.invoke("What does the document say about...")
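
To make the retrieval step in the chain above less opaque, here is a toy sketch of what a retriever does conceptually: rank stored texts by cosine similarity to the query embedding and return the top few. This is not how Chroma is implemented; the three-dimensional vectors are invented for illustration, and real embeddings from OllamaEmbeddings would have far more dimensions.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], docs: List[Tuple[str, Sequence[float]]], k: int = 2) -> List[str]:
    """Return the texts of the k documents most similar to the query vector."""
    ranked = sorted(docs, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up embeddings; a real pipeline would compute these with an embedding model.
toy_docs = [
    ("Ollama runs models locally.", [1.0, 0.0, 0.0]),
    ("Docker packages applications.", [0.0, 1.0, 0.0]),
    ("Local inference avoids API fees.", [0.9, 0.1, 0.0]),
]
best = top_k([1.0, 0.0, 0.0], toy_docs, k=2)
print(best)
```

The retrieved texts are what the chain places into the prompt before the LLM answers.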

Custom Models (Modelfiles)

Create custom models with specific behaviors:

# Create a Modelfile
FROM llama2

# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

# Set system prompt
SYSTEM """
You are a helpful coding assistant specialized in Python.
Always provide code examples when relevant.
Explain concepts clearly and concisely.
"""

# Add custom template
TEMPLATE """{{ if .System }}{{ .System }}{{ end }}
{{ if .Prompt }}User: {{ .Prompt }}{{ end }}
Assistant: """
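
When several assistants differ only in their system prompt or parameters, a Modelfile like the one above can be generated rather than written by hand. The sketch below renders the FROM, PARAMETER, and SYSTEM instructions from Python values. render_modelfile is a hypothetical helper, it omits TEMPLATE, and it assumes the system prompt does not itself contain triple quotes.

```python
from typing import Dict, Union


def render_modelfile(base: str, system: str, params: Dict[str, Union[int, float]]) -> str:
    """Build Modelfile text with FROM, PARAMETER, and SYSTEM instructions."""
    lines = [f"FROM {base}"]
    for name, value in params.items():
        lines.append(f"PARAMETER {name} {value}")
    lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama2",
    "You are a helpful coding assistant specialized in Python.",
    {"temperature": 0.7, "num_ctx": 4096},
)
print(text)
```

Writing the returned text to a file named Modelfile makes it usable with the ollama create command.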

# Build and run custom model
ollama create python-assistant -f Modelfile
ollama run python-assistant

>>> How do I read a CSV file?

Hardware Requirements

7B Models

8GB RAM minimum, 16GB recommended. Works on most modern laptops.

13B Models

16GB RAM minimum, 32GB recommended. Better quality, slower speed.

GPU Acceleration

An NVIDIA GPU with CUDA greatly improves speed. Apple Silicon (M1/M2/M3) is accelerated through Metal and also performs well.

Quantization

4-bit quantized models (Q4) use less memory with minimal quality loss.
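
A back-of-the-envelope calculation shows why quantization matters: the memory needed for the weights alone is roughly the parameter count times the bits per weight, divided by 8 to get bytes. The sketch below applies that arithmetic. It is a lower bound under simplifying assumptions: real quantized files store extra metadata, and inference also needs memory for the context cache and runtime, so actual usage is higher.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


for size in (7, 13):
    fp16 = weight_memory_gb(size, 16)
    q4 = weight_memory_gb(size, 4)
    print(f"{size}B model: about {fp16} GB at 16-bit, about {q4} GB at 4-bit")
```

By this estimate a 7B model drops from about 14 GB at 16-bit to about 3.5 GB at 4-bit, which is consistent with the 8GB RAM minimum given above.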

Model Comparison

# Model sizes and use cases

| Model          | Size   | Best For                    |
|----------------|--------|------------------------------|
| phi            | 2.7B   | Fast, simple tasks          |
| llama2:7b      | 7B     | General purpose, balanced   |
| mistral        | 7B     | High quality, efficient     |
| codellama      | 7B     | Code generation             |
| llama2:13b     | 13B    | Better reasoning            |
| mixtral        | 47B*   | State-of-the-art open source |

* Mixtral 8x7B activates roughly 13B parameters per token (mixture of experts)

OpenAI Compatibility

Use Ollama as a drop-in replacement for OpenAI:

from openai import OpenAI

# Point to local Ollama
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Not used but required
)

response = client.chat.completions.create(
    model="llama2",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(response.choices[0].message.content)
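
For environments where the openai package is not installed, the same endpoint can be reached with plain HTTP. The sketch below builds a request for /v1/chat/completions with the standard library and extracts the reply from a response shaped like the one the client above reads. Both helper names are hypothetical, and the sample response is hand-written so the example runs without a server.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(
    messages: List[Dict[str, str]],
    model: str = "llama2",
    base_url: str = "http://localhost:11434/v1",
) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the chat completions endpoint."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json", "Authorization": "Bearer ollama"},
        method="POST",
    )


def extract_reply(payload: Dict) -> str:
    """Return the assistant text from a chat completions response body."""
    return payload["choices"][0]["message"]["content"]


req = build_chat_request([{"role": "user", "content": "Hello!"}])
sample_response = {"choices": [{"message": {"role": "assistant", "content": "Hi there!"}}]}
print(req.full_url, extract_reply(sample_response))
```

Sending the request is then a matter of passing req to urllib.request.urlopen and decoding the JSON body.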

Best Practices

Start small: Test with 7B models before scaling up
Use quantization: Q4_K_M offers good quality/speed balance
GPU matters: Use GPU acceleration when available
Context length: Set num_ctx based on your needs
Custom Modelfiles: Create specialized models for your use case
Monitor resources: Watch memory usage during inference
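
The context-length advice can also be applied per request instead of being baked into a Modelfile. The sketch below builds a body for the /api/generate endpoint with an options object carrying num_ctx and temperature. The helper name generate_payload and its default values are assumptions chosen to match the Modelfile example earlier.

```python
from typing import Any, Dict


def generate_payload(
    model: str,
    prompt: str,
    num_ctx: int = 4096,
    temperature: float = 0.7,
) -> Dict[str, Any]:
    """Build a JSON-serializable request body with per-request options."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be positive")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "temperature": temperature},
    }


payload = generate_payload("llama2", "Summarize this paragraph.", num_ctx=2048)
print(payload["options"])
```

A smaller num_ctx lowers memory use, so it is worth starting low and raising it only when prompts get truncated.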
