[Local AI with Ollama] Building a RAG System with Ollama and LangChain

Retrieval-Augmented Generation (RAG) enables your LLM-powered assistant to answer questions using up-to-date, domain-specific knowledge from your own files. This guide shows you how to build a complete, local RAG pipeline with Ollama (for the LLM and embeddings) and LangChain (for orchestration), step by step with a real PDF, and then add a simple UI with Streamlit.

What is RAG and Why Use It?

Language models are powerful, but limited to their training data. RAG (Retrieval-Augmented Generation) connects an LLM with a retriever that searches your own documents, making it possible to answer questions grounded in real, current, and private knowledge.

How RAG works:

  • Your documents are split into chunks and converted to embeddings (vectors).
  • Both the embeddings and the original text are stored in a vector database.
  • Your query is also embedded and used to retrieve similar chunks.
  • The LLM generates an answer using both your query and the retrieved content.

RAG reduces hallucinations and makes your assistant far more useful for practical, real-world scenarios.

Core Concepts

  • Document Loading & Chunking: Load sources (PDFs, text, URLs) and split into smaller pieces.
  • Embeddings: Each chunk is transformed into a vector using an embedding model; similar content produces similar vectors.
  • Vector Database: Both vectors and original text are stored in a database built for similarity search (e.g., ChromaDB).
  • Retrieval: When you query, your question is embedded and the database finds the most relevant chunks. These are passed to the LLM for a grounded answer.

Setup

- Folder structure:
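For example, a layout like the following works well (the file and folder names here are placeholders; use whatever fits your project):

```text
ollama-rag/
├── data/
│   └── sample.pdf          # the PDF you want to query
├── pdf_rag.py              # the RAG pipeline script (Step 6)
├── pdf_rag_streamlit.py    # the Streamlit UI (Step 8)
└── requirements.txt
```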

- Create a virtual environment
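For example (the environment name venv is just a convention):

```bash
python -m venv venv
```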

- Activate the environment

On macOS/Linux:
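```bash
source venv/bin/activate
```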

On Windows:
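```bash
venv\Scripts\activate
```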

- requirements.txt
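A possible set of packages for this tutorial (exact pins depend on your LangChain version; pypdf is only needed for the PDF loader used below):

```text
langchain
langchain-community
langchain-text-splitters
chromadb
pypdf
streamlit
```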

Install dependencies:
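```bash
pip install -r requirements.txt
```

You'll also need the Ollama models used later; the embedding model comes from Step 3, while the chat model name here is just an example:

```bash
ollama pull nomic-embed-text
ollama pull llama3.2
```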

RAG Pipeline Example

Step 1: Load and Ingest a PDF

We start by loading a PDF file. You can either upload one or use a local path:
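Here's a minimal sketch using LangChain's PyPDFLoader (any LangChain PDF loader, such as UnstructuredPDFLoader, works the same way); the data/sample.pdf path is just an example:

```python
from langchain_community.document_loaders import PyPDFLoader

PDF_PATH = "data/sample.pdf"  # example path; point this at your own PDF

loader = PyPDFLoader(PDF_PATH)
documents = loader.load()  # one Document per page, with page metadata

print(f"Loaded {len(documents)} pages")
```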

Step 2: Split the PDF into Chunks

To make the document searchable, we need to chunk it into smaller parts.
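A common choice is RecursiveCharacterTextSplitter; the chunk size and overlap below are reasonable starting points rather than fixed requirements:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,    # characters per chunk; tune for your documents
    chunk_overlap=300,  # overlap preserves context across chunk boundaries
)
chunks = splitter.split_documents(documents)

print(f"Split into {len(chunks)} chunks")
```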

Step 3: Embed Chunks into Vector Database

We'll use the nomic-embed-text embedding model via Ollama, and store everything in ChromaDB:
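A sketch assuming the embedding model has already been pulled (ollama pull nomic-embed-text); the collection name is arbitrary:

```python
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="local-rag",  # any name works
)
```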

Step 4: Set Up the Retriever with Multi-Query

To improve retrieval quality, we generate multiple variations of the input question using an LLM:
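A sketch using LangChain's MultiQueryRetriever; the chat model name (llama3.2) and the rewrite prompt wording are assumptions, so substitute whatever model you have pulled locally:

```python
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import PromptTemplate
from langchain.retrievers.multi_query import MultiQueryRetriever

llm = ChatOllama(model="llama3.2")  # any locally pulled chat model

# Prompt that asks the LLM for several rephrasings of the user's question,
# one per line, so the retriever can search with all of them.
query_prompt = PromptTemplate(
    input_variables=["question"],
    template=(
        "Generate five different versions of the following user question "
        "to retrieve relevant documents from a vector database. "
        "Provide the alternative questions separated by newlines.\n"
        "Original question: {question}"
    ),
)

retriever = MultiQueryRetriever.from_llm(
    retriever=vector_db.as_retriever(),
    llm=llm,
    prompt=query_prompt,
)
```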

Step 5: Ask Questions with the RAG Pipeline

Let's define the full chain that combines retrieval + prompt + LLM:
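A minimal chain in LangChain's expression language; the answer prompt wording is just an example:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template(
    "Answer the question based ONLY on the following context:\n"
    "{context}\n\nQuestion: {question}"
)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}  # retrieve docs, pass question through
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("What is this document about?"))
```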

Step 6: Full Source Code

Here's the complete script you can run end-to-end:
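A sketch of the whole pipeline in one file (saved here as pdf_rag.py; the paths and model names are the same assumptions as in the steps above):

```python
# pdf_rag.py - end-to-end local RAG over a PDF with Ollama + LangChain.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.chat_models import ChatOllama
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

PDF_PATH = "data/sample.pdf"      # example path
EMBED_MODEL = "nomic-embed-text"  # ollama pull nomic-embed-text
CHAT_MODEL = "llama3.2"           # any locally pulled chat model

# 1. Load the PDF.
documents = PyPDFLoader(PDF_PATH).load()

# 2. Split it into overlapping chunks.
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1200, chunk_overlap=300
).split_documents(documents)

# 3. Embed the chunks and store them in a local Chroma collection.
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model=EMBED_MODEL),
    collection_name="local-rag",
)

# 4. Multi-query retriever: the LLM rewrites the question several ways.
llm = ChatOllama(model=CHAT_MODEL)
query_prompt = PromptTemplate(
    input_variables=["question"],
    template=(
        "Generate five different versions of the following user question "
        "to retrieve relevant documents from a vector database. "
        "Provide the alternative questions separated by newlines.\n"
        "Original question: {question}"
    ),
)
retriever = MultiQueryRetriever.from_llm(
    retriever=vector_db.as_retriever(), llm=llm, prompt=query_prompt
)

# 5. RAG chain: retrieved context + question -> prompt -> LLM -> string.
answer_prompt = ChatPromptTemplate.from_template(
    "Answer the question based ONLY on the following context:\n"
    "{context}\n\nQuestion: {question}"
)
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | answer_prompt
    | llm
    | StrOutputParser()
)

if __name__ == "__main__":
    print(chain.invoke("What is this document about?"))
```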

Run with Streamlit UI

To make your RAG pipeline interactive, let's wrap it in a simple Streamlit app. Here's how to get started:

Step 7: Install Streamlit

Add streamlit to your requirements.txt if not already included:
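```text
streamlit
```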

Or install it directly:
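```bash
pip install streamlit
```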

Step 8: Create a Streamlit App

Save the following code as pdf_rag_streamlit.py:
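A minimal sketch of such an app; it reuses the same pipeline and the same assumed model names as before, and for brevity uses the plain vector retriever instead of the multi-query retriever:

```python
# pdf_rag_streamlit.py - a simple Streamlit front end for the local RAG pipeline.
import tempfile

import streamlit as st
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

st.title("Chat with your PDF (Ollama + LangChain)")

uploaded = st.file_uploader("Upload a PDF", type="pdf")


@st.cache_resource(show_spinner="Indexing PDF...")
def build_chain(pdf_bytes: bytes):
    # Write the upload to a temporary file so the PDF loader can read it.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(pdf_bytes)
        pdf_path = tmp.name

    docs = PyPDFLoader(pdf_path).load()
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=1200, chunk_overlap=300
    ).split_documents(docs)

    vector_db = Chroma.from_documents(
        documents=chunks,
        embedding=OllamaEmbeddings(model="nomic-embed-text"),
        collection_name="local-rag-streamlit",
    )

    llm = ChatOllama(model="llama3.2")  # any locally pulled chat model
    prompt = ChatPromptTemplate.from_template(
        "Answer the question based ONLY on the following context:\n"
        "{context}\n\nQuestion: {question}"
    )
    return (
        {"context": vector_db.as_retriever(), "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )


if uploaded is not None:
    chain = build_chain(uploaded.getvalue())
    question = st.text_input("Ask a question about the PDF")
    if question:
        with st.spinner("Thinking..."):
            st.write(chain.invoke(question))
```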

Step 9: Launch the App
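From your project directory (with the virtual environment activated), run:

```bash
streamlit run pdf_rag_streamlit.py
```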

Visit http://localhost:8501 and type a question into the input box; you'll get a clean UI in your browser with answers grounded in the content of your PDF.