What is Ollama?

Ollama is an open-source tool that makes it easy to run large language models (LLMs) locally on your machine. It provides a simple interface to download, run, and manage open-source models such as Llama and Mistral.

Why run LLMs locally?

  • Privacy: Your data never leaves your machine
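  • Cost: No per-request API fees; you pay only for your own hardware and power
  • Offline: Works without an internet connection once a model has been downloaded
  • Customization: Change model behavior with Modelfiles, or import models you fine-tuned elsewhere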
  • Speed: No network latency for local inference

Installation

macOS

# Using Homebrew
brew install ollama

# Or download from ollama.ai

Linux

curl -fsSL https://ollama.ai/install.sh | sh

Windows

# Download installer from ollama.ai
# Or use WSL2 with Linux installation

Docker

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

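# With GPU support (NVIDIA GPUs; requires the NVIDIA Container Toolkit on the host)
# Run this instead of the command above: both use the container name "ollama"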
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
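
Whichever install route you pick, the server listens on port 11434 (the port mapped in the Docker commands above). Here is a minimal sketch for checking from Python that it is reachable, using only the standard library. The helper name ollama_is_up is made up for this example, and it assumes the server answers a plain GET on its root URL with HTTP 200.

```python
import urllib.error
import urllib.request


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: treat all as "not running".
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_up())
```

If this prints False right after starting a container, give the server a second or two to come up and try again.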

Getting Started

Download and Run a Model

# Start the Ollama server (skip this if the desktop app or a system service already runs it)
ollama serve

# In another terminal, pull and run a model
ollama run llama2

# Chat with the model
>>> What is machine learning?
Machine learning is a subset of artificial intelligence...

Available Models

# Popular models
ollama pull llama2              # Meta's Llama 2 (7B default)
ollama pull llama2:13b          # Larger Llama 2
ollama pull mistral             # Mistral 7B
ollama pull mixtral             # Mistral's MoE model
ollama pull codellama           # Code-specialized Llama
ollama pull phi                 # Microsoft's small model
ollama pull neural-chat         # Intel's chat model
ollama pull starling-lm         # Berkeley's model

# List downloaded models
ollama list

# Remove a model
ollama rm llama2
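
The same inventory that ollama list prints is also available over HTTP, which is handy in scripts. The sketch below assumes the server's /api/tags endpoint returns a JSON object with a "models" array whose entries carry a "name" field. The helper names and the sample payload are illustrative, not captured output.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list[str]:
    """Extract sorted model names from a /api/tags-style payload."""
    return sorted(m["name"] for m in tags_payload.get("models", []))


def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server which models are downloaded."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))


# Hand-written sample shaped like a server reply (not real output).
sample = {"models": [{"name": "mistral:latest"}, {"name": "llama2:latest"}]}
print(model_names(sample))
```

With a server running, list_local_models() performs the actual request; the parsing step is kept separate so it can be checked without one.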

Using Ollama with Python

Basic Usage

import ollama

# Simple generation
response = ollama.generate(
    model='llama2',
    prompt='Explain quantum computing in simple terms'
)
print(response['response'])

# Chat interface
response = ollama.chat(
    model='llama2',
    messages=[
        {'role': 'user', 'content': 'Why is the sky blue?'}
    ]
)
print(response['message']['content'])

Streaming Responses

import ollama

# Stream the response
for chunk in ollama.chat(
    model='llama2',
    messages=[{'role': 'user', 'content': 'Write a poem about AI'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)
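
Streaming prints text as it arrives, but you often want the complete reply afterwards too (for logging or chat history). One way to do both is sketched below. collect_stream is a hypothetical helper, and the chunks are stand-ins with the same shape as the dictionaries indexed in the loop above.

```python
from typing import Iterable


def collect_stream(chunks: Iterable[dict]) -> str:
    """Print each streamed piece as it arrives and return the full reply."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)
        parts.append(piece)
    print()  # finish the line once the stream ends
    return "".join(parts)


# Stand-in chunks (made up for illustration, not model output).
fake_chunks = [
    {"message": {"content": "Roses "}},
    {"message": {"content": "are "}},
    {"message": {"content": "red"}},
]
full_text = collect_stream(fake_chunks)
```

In real use, pass the iterator returned by ollama.chat(..., stream=True) in place of fake_chunks.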

Using the REST API

import requests

# Ollama's native REST API (the OpenAI-compatible endpoints live under /v1)
response = requests.post(
    'http://localhost:11434/api/generate',
    json={
        'model': 'llama2',
        'prompt': 'Hello, how are you?',
        'stream': False
    }
)
print(response.json()['response'])
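
With 'stream': True, the same endpoint sends newline-delimited JSON instead of a single object. The sketch below assumes each line carries a "response" fragment and that the last line has "done" set to true. The function name and sample lines are illustrative, not captured server output.

```python
import json
from typing import Iterable


def join_generate_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the 'response' fragments of a newline-delimited JSON stream."""
    text = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines between events
        event = json.loads(raw)
        text.append(event.get("response", ""))
        if event.get("done"):
            break
    return "".join(text)


# Hand-written lines imitating the stream format.
sample_lines = [
    b'{"response": "Hello", "done": false}',
    b'{"response": " there", "done": false}',
    b'{"response": "", "done": true}',
]
print(join_generate_stream(sample_lines))
```

With requests, the lines would come from requests.post(..., stream=True).iter_lines().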

Integration with LangChain

from langchain_community.llms import Ollama
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Basic LLM
llm = Ollama(model="llama2")
response = llm.invoke("What is the capital of France?")

# Chat model
chat = ChatOllama(model="llama2")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{input}")
])

chain = prompt | chat | StrOutputParser()
response = chain.invoke({"input": "Explain Docker"})

Ollama with RAG

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.chat_models import ChatOllama
from langchain.chains import RetrievalQA

# Use Ollama for embeddings too (a dedicated embedding model such as
# nomic-embed-text is usually a better fit than a chat model)
embeddings = OllamaEmbeddings(model="llama2")

# Create vector store
# (docs is assumed to be a list of LangChain Document objects you loaded and split earlier)
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings
)

# Create RAG chain
llm = ChatOllama(model="llama2")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever()
)

result = qa_chain.invoke("What does the document say about...")
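
Under the hood, the retriever embeds the question and returns the stored chunks whose vectors are closest to it. Here is a toy sketch of that ranking step in plain Python, using cosine similarity as one common choice (the vector store may use a different metric). The three-dimensional vectors are invented for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k documents most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors standing in for document and query embeddings.
toy_docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.05, 0.0]
print(top_k(toy_query, toy_docs))
```

The chunks at the returned indices are what gets pasted into the prompt before the LLM answers.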

Custom Models (Modelfiles)

Create custom models with specific behaviors:

# Create a Modelfile
FROM llama2

# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

# Set system prompt
SYSTEM """
You are a helpful coding assistant specialized in Python.
Always provide code examples when relevant.
Explain concepts clearly and concisely.
"""

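# Add custom template (optional: this overrides the base model's built-in
# prompt format, so leave it out unless you need a different one)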
TEMPLATE """{{ if .System }}{{ .System }}{{ end }}
{{ if .Prompt }}User: {{ .Prompt }}{{ end }}
Assistant: """

# Build and run custom model (run these in a terminal, not in the Modelfile)
ollama create python-assistant -f Modelfile
ollama run python-assistant

>>> How do I read a CSV file?
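
If you maintain several assistants, you may prefer to generate Modelfiles from code rather than edit them by hand. The sketch below uses only the FROM, PARAMETER, and SYSTEM instructions shown above. render_modelfile is a hypothetical helper, not part of Ollama.

```python
def render_modelfile(base: str, system: str, params: dict) -> str:
    """Build Modelfile text from a base model, a system prompt, and parameters."""
    lines = [f"FROM {base}"]
    lines += [f"PARAMETER {name} {value}" for name, value in params.items()]
    lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    base="llama2",
    system="You are a helpful coding assistant specialized in Python.",
    params={"temperature": 0.7, "top_p": 0.9, "num_ctx": 4096},
)
print(modelfile_text)
```

Write modelfile_text to a file named Modelfile, then build it with ollama create as shown above.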

Hardware Requirements

7B Models

8GB RAM minimum, 16GB recommended. Works on most modern laptops.

13B Models

16GB RAM minimum, 32GB recommended. Better quality, slower speed.

GPU Acceleration

An NVIDIA GPU with CUDA greatly improves inference speed. Apple Silicon (M1/M2/M3) is also accelerated, through Metal.

Quantization

4-bit quantized models (Q4) need roughly a quarter of the memory of 16-bit weights, with only a modest loss in quality.
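
A back-of-the-envelope way to see why quantization matters: the weights alone take about (parameter count × bits per weight ÷ 8) bytes. The sketch below applies that arithmetic. It is a lower bound, since it ignores the context cache and runtime overhead, and real Q4 formats store slightly more than 4 bits per weight.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


for size in (7, 13):
    fp16 = weight_memory_gb(size, 16)
    q4 = weight_memory_gb(size, 4)
    print(f"{size}B: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at 4-bit")
```

By this estimate a 4-bit 7B model fits comfortably in 8GB of RAM, while the 16-bit version would not.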

Model Comparison

Model sizes and use cases

| Model      | Size | Best For                             |
|------------|------|--------------------------------------|
| phi        | 2.7B | Fast, simple tasks                   |
| llama2:7b  | 7B   | General purpose, balanced            |
| mistral    | 7B   | High quality, efficient              |
| codellama  | 7B   | Code generation                      |
| llama2:13b | 13B  | Better reasoning                     |
| mixtral    | 47B* | Highest quality of the listed models |

* Mixtral is a mixture-of-experts (MoE) model: only about 13B parameters are active per token, but all 47B must fit in memory.

OpenAI Compatibility

Ollama serves an OpenAI-compatible endpoint under /v1, so existing OpenAI client code can point at it with minimal changes:

from openai import OpenAI

# Point to local Ollama
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Not used but required
)

response = client.chat.completions.create(
    model="llama2",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(response.choices[0].message.content)
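
To see what the client does on the wire, here is a standard-library sketch that builds the equivalent HTTP request and reads the reply field the example above prints. The helper names are made up, the request is constructed but not sent, and the sample response is hand-written in the chat-completions shape.

```python
import json
import urllib.request


def build_chat_request(model: str, user_text: str,
                       base_url: str = "http://localhost:11434/v1") -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_json: dict) -> str:
    """Pull the assistant text out of a chat-completions style response."""
    return response_json["choices"][0]["message"]["content"]


req = build_chat_request("llama2", "Hello!")
print(req.get_method(), req.full_url)

# Hand-written response in the chat-completions shape (not real output).
sample_response = {"choices": [{"message": {"role": "assistant", "content": "Hi!"}}]}
print(extract_reply(sample_response))
```

Passing req to urllib.request.urlopen would send it to a running server; switching base_url is the only change needed to target a different compatible backend.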

Best Practices

  • Start small: Test with 7B models before scaling up
  • Use quantization: Q4_K_M offers good quality/speed balance
  • GPU matters: Use GPU acceleration when available
  • Context length: Set num_ctx based on your needs
  • Custom Modelfiles: Create specialized models for your use case
  • Monitor resources: Watch memory usage during inference
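
Context length does not have to be baked into a Modelfile. As a sketch, assuming the REST API accepts a per-request "options" object with the same parameter names used in Modelfiles (num_ctx, temperature), a request body can be built like this. generate_payload is a hypothetical helper.

```python
def generate_payload(model: str, prompt: str,
                     num_ctx: int = 4096, temperature: float = 0.7) -> dict:
    """Build a /api/generate request body with per-request options."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be positive")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "temperature": temperature},
    }


payload = generate_payload("llama2", "Summarize this paragraph...", num_ctx=2048)
print(payload["options"])
```

A smaller num_ctx lowers memory use, which ties in with the advice to watch resources during inference.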
