
Code spacing error and problems with execution on MacOS are fixed in manual_rag_pipeline_universal.ipynb.
The OCR is low quality in the original Model T service manual. I found a cleaner copy of the Model T manual with higher quality embedded text. You should re-download Corpora.zip and use the files in NewModelT instead. The new files are a single long pdf and a single long text file. Use the .txt versions.
The Congressional Record corpus has multi-column text that is turned into a single column in the text file versions (in the txt/ folder). However, when the embedded searchable text layer is extracted, tables come out garbled, so RAG will not work for questions whose answers are in a table.
This is a case illustrating the importance of pre-processing files for optimal retrieval.
After experimenting with a variety of OCR programs, I finally found an approach that creates beautiful, clean text with properly formatted tables: ChatGPT! None of the costly commercial programs I tried on a trial basis did as good a job. I used this prompt:
"Please convert this pdf to plain text. Ignore any embedded text layer. Do not maintain multiple columns, but create a single column of text. When there are tables, which might cross multiple columns in the pdf, turn them into text with one line per row, being careful not to break them."
(Later) It turns out that ChatGPT can instead use the original embedded text and extract it more accurately. This is faster than having it perform OCR as well:
"Please convert this pdf with embedded text to plain text. Do not run OCR, use the embedded text. Do not maintain multiple columns, but create a single column of text. When there are tables, which might cross multiple columns in the pdf, turn them into text with one line per row, being careful not to break them."
You have the option of converting all 25 issues of the Congressional Record to text files in this manner. This will improve the performance of the RAG pipeline on the corpus. If you do so, please report to the class using the newly-opened Discussions tab on Canvas.
Here's another suggestion for a final project: create a program that combines an open-source vision-language model with open-source OCR tools such as Tesseract to do as good a job as ChatGPT currently does. Note that you cannot call ChatGPT itself through an API, so such a project would be quite useful.
The cleanest text is that for the EU AI Act. If you are having trouble getting meaningful results from the other corpora, you may swap it in for any of the exercises, changing the queries appropriately to be about the act.
The Learjet corpus's embedded text is of medium quality. The text on diagrams will not be useful.
I recommend using the text files I provided in the corpora instead of relying on the code in the pipeline that extracts embedded text from pdfs. The program I used to extract the embedded text (pdftotext) is more accurate than the PyMuPDF function.
Know when to use RAG in your agent system
Understand the purpose of each component in the RAG pipeline
Understand the effect of chunk size, query phrasing, and prompt templates
Know how to optimize retrieval using top-K variation
See how RAG supports cross-document synthesis
You should work in a self-selected team of two or three people on these exercises. If you have difficulty finding a team, please ask the instructor for assistance. You should do the first few exercises together. If you wish, you may split up the remaining exercises among your team members, but share and discuss the results with each other afterwards. Put copies of all of the results in every team member's GitHub, and list the names of the people in your team in the README.md file.
Download manual_rag_pipeline_universal.ipynb and get it running on Colab or your own computer. I recommend Colab unless you have a good GPU on your machine. The notebook will, however, run correctly on any machine with or without a GPU. Download Corpora.zip and unzip it. (See below for details about the corpora.) Feel free to modify the pipeline program to help automate running the exercises.
Compare LLM's answers with and without retrieval augmentation.
Setup: Use Qwen 2.5 1.5B (or another small open model) with the Model T Ford repair manual, and then with the Congressional Record corpus (separately).
Queries to try with the Model T Ford corpus:
"How do I adjust the carburetor on a Model T?"
"What is the correct spark plug gap for a Model T Ford?"
"How do I fix a slipping transmission band?"
"What oil should I use in a Model T engine?"
Queries to try with the Congressional Record corpus:
These issue numbers are for your use in finding the correct answers; don't give them to the LLM.
"What did Mr. Flood have to say about Mayor David Black in Congress on January 13, 2026?" (See CR Jan 13, 2026)
"What mistake did Elise Stefanovic make in Congress on January 23, 2026?" (See CR Jan 23, 2026)
"What is the purpose of the Main Street Parity Act?" (See CR Jan 20, 2026)
"Who in Congress has spoken for and against funding of pregnancy centers?" (See CR Jan 21, 2026)
For each query:
Ask the model directly (no RAG)
Ask using your RAG pipeline
Document:
Does the model hallucinate specific values without RAG?
Does RAG ground the answers in the actual manual?
Are there questions where the model's general knowledge is actually correct?
Optional subtask: Put both the Model T manual and the Congressional Record issues into the same RAG database. Does this have any effect on the quality of the answers?
Perhaps a larger model without RAG could be competitive with a small model with RAG.
Setup: Write a program (or adapt some of your code from last week) to run GPT 4o Mini with no tools on individual questions (not a conversation). Run it on the queries from Exercise 1.
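A minimal single-question runner for this setup might look like the sketch below. It assumes the `openai` Python package and an `OPENAI_API_KEY` environment variable; the system-message wording is illustrative, and the query list should be filled out with the rest of the Exercise 1 queries.

```python
import os

def build_messages(question):
    """Single-turn message list: no conversation history, no tools."""
    return [
        {"role": "system", "content": "Answer the question concisely."},
        {"role": "user", "content": question},
    ]

# A couple of the Exercise 1 queries; add the others the same way.
QUERIES = [
    "How do I adjust the carburetor on a Model T?",
    "What oil should I use in a Model T engine?",
]

def ask(question):
    """Send one question to GPT-4o-mini. Requires the openai package
    and an API key; imported lazily so the rest of the file runs without it."""
    from openai import OpenAI
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=build_messages(question),
    )
    return resp.choices[0].message.content

if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    for q in QUERIES:
        print(q, "->", ask(q))
```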
Document:
Does GPT 4o Mini do a better job than Qwen 2.5 1.5B in avoiding hallucinations?
Which questions does GPT 4o Mini answer correctly? Compare the cut-off date of GPT 4o Mini pre-training and the age of the Model T Ford and Congressional Record corpora.
Compare your local RAG pipeline against a frontier model (GPT-5.2, Claude 4.6, etc.).
Setup:
Local: Qwen 2.5 1.5B with RAG using the Model T manual
Cloud: a frontier model such as GPT-5.2 or Claude via its web interface (no file upload)
Queries to try:
All the ones from Exercise 1.
Document:
Where does the frontier model's general knowledge succeed?
When did the frontier model appear to be using live web search to help answer your questions?
Where does your RAG system provide more accurate, specific answers?
What does this tell you about when RAG adds value vs. when a powerful model suffices?
Vary the number of chunks retrieved and observe how it affects answer quality. You can use any of the corpora and your own queries.
Test with: k = 1, 3, 5, 10, 20
For each k value:
Run the same 3-5 queries
Note answer quality, completeness, and accuracy
Note response latency
Questions to explore:
At what point does adding more context stop helping?
When does too much context hurt (irrelevant information, confusion)?
How does k interact with chunk size?
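The k sweep can be automated with something like the sketch below. It uses a toy bag-of-words similarity in place of the notebook's real embedding model, so the retrieval interface and the sample chunks are purely illustrative; the point is the shape of the loop over k.

```python
import math
from collections import Counter

def vectorize(text):
    """Toy bag-of-words vector; the notebook uses a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k):
    """Return the top-k chunks ranked by similarity to the query."""
    qv = vectorize(query)
    return sorted(chunks, key=lambda c: cosine(qv, vectorize(c)),
                  reverse=True)[:k]

chunks = [
    "Adjust the carburetor needle valve one quarter turn at a time.",
    "The transmission bands are adjusted through the inspection plate.",
    "Spark plug gap should be set to one thirty-second of an inch.",
]

# Watch how the context handed to the model grows with k.
for k in (1, 2, 3):
    context = "\n".join(retrieve("how to adjust the carburetor", chunks, k))
    print(f"k={k}: context is {len(context)} characters")
```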
Test how well your system handles questions that cannot be answered from the corpus. You can use any of the corpora and your own queries.
Types of unanswerable questions:
Completely off-topic: "What is the capital of France?"
Related but not in corpus: "What's the horsepower of a 1925 Model T?" (if not in your manual)
False premises: "Why does the manual recommend synthetic oil?" (when it doesn't)
Document:
Does the model admit it doesn't know?
Does it hallucinate plausible-sounding but wrong answers?
Does retrieved context help or hurt? (Does irrelevant context encourage hallucination?)
Experiment: Modify your prompt template to add "If the context doesn't contain the answer, say 'I cannot answer this from the available documents.'" Does this help?
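One way to wire the refusal instruction into a prompt template is sketched below; the notebook's actual template may differ, and everything beyond the quoted refusal sentence is illustrative wording.

```python
REFUSAL = "I cannot answer this from the available documents."

def build_prompt(context, question, strict=True):
    """Assemble a RAG prompt. strict=True adds the grounding instruction
    from the experiment above; the rest of the template is illustrative."""
    instruction = (
        "Answer using only the context below. If the context doesn't "
        f"contain the answer, say '{REFUSAL}'\n\n"
    ) if strict else ""
    return f"{instruction}Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("(retrieved chunks here)", "What is the capital of France?"))
```

Running the same unanswerable queries with `strict=True` and `strict=False` isolates the effect of the instruction.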
Test how different phrasings of the same question affect retrieval. You can use any of the corpora and your own queries, or the ones below for the Model T or Learjet corpus.
Choose one underlying question and phrase it 5+ different ways:
Formal: "What is the recommended maintenance schedule for the engine?"
Casual: "How often should I service the engine?"
Keywords only: "engine maintenance intervals"
Question form: "When do I need to check the engine?"
Indirect: "Preventive maintenance requirements"
For each phrasing:
Record the top 5 retrieved chunks
Note similarity scores
Compare overlap between result sets
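Overlap between result sets can be quantified with Jaccard similarity over the retrieved chunk IDs; the IDs below are made up for illustration.

```python
def jaccard(a, b):
    """Overlap between two retrieved-chunk ID sets (1.0 = identical results)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical top-5 chunk IDs for two phrasings of the same question.
formal   = [12, 7, 3, 44, 9]
keywords = [12, 3, 21, 7, 50]

print(f"overlap: {jaccard(formal, keywords):.2f}")
```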
Questions to explore:
Which phrasings retrieve the best chunks?
Do keyword-style queries work better or worse than natural questions?
What does this tell you about potential query rewriting strategies?
Test how overlap between chunks affects retrieval of information that spans chunk boundaries. You can use any of the corpora and your own queries. Note: this exercise takes a long time to run. Only try it on Colab or a similar platform with T4 or better GPUs.
Setup: Re-chunk your corpus with different overlap values while keeping chunk size constant (e.g., 512 characters):
Overlap = 0 (no overlap)
Overlap = 64
Overlap = 128
Overlap = 256
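A character-based chunker with overlap can be sketched as follows; the notebook's own chunker may differ in details (e.g., trailing chunks here can be short).

```python
def chunk_text(text, size=512, overlap=64):
    """Split text into fixed-size character chunks; each chunk repeats the
    last `overlap` characters of the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # how far the window advances each time
    return [text[i:i + size] for i in range(0, len(text), step)]

# More overlap means more chunks (and a bigger index) for the same text.
text = "x" * 2000
for overlap in (0, 64, 128, 256):
    n = len(chunk_text(text, size=512, overlap=overlap))
    print(f"overlap={overlap}: {n} chunks")
```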
For each configuration:
Rebuild the index
Find a question whose answer spans what would be a chunk boundary
Test retrieval quality
Document:
Does higher overlap improve retrieval of complete information?
What's the cost? (Index size, redundant information in context)
Is there a point of diminishing returns?
Test how chunk size affects retrieval precision and answer quality. You can use any of the corpora and your own queries. Note: this exercise takes a long time to run. Only try it on Colab or a similar platform with T4 or better GPUs.
Setup: Chunk your corpus at different sizes:
Very small: 128 characters
Medium: 512 characters
Very large: 2048 characters
For each configuration:
Rebuild the index
Run the same set of 5 queries
Examine retrieved chunks and final answers
Questions to explore:
How does chunk size affect retrieval precision (relevant vs. irrelevant content)?
How does it affect answer completeness?
Is there a sweet spot for your corpus?
Does optimal size depend on the type of question?
Analyze the similarity scores returned by your retrieval system. You can use any of the corpora and your own queries.
For 10 different queries:
Retrieve top 10 chunks
Record all similarity scores
Examine the score distribution
Look for patterns:
When is there a clear "winner" (large gap between #1 and #2)?
When are scores tightly clustered (ambiguous)?
What score threshold would you use to filter out irrelevant results?
How does score distribution correlate with answer quality?
Experiment: Implement a score threshold (e.g., only include chunks with score > 0.5). How does this affect results?
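A sketch of the threshold experiment, which also reports the gap between the top two scores; the function name, the 0.5 starting threshold, and the sample scores are illustrative, and useful thresholds depend on your embedding model.

```python
def analyze_scores(scored_chunks, threshold=0.5):
    """Filter (chunk, score) pairs by score and report the gap between
    the #1 and #2 scores as a 'clear winner' signal."""
    kept = [(c, s) for c, s in scored_chunks if s > threshold]
    scores = sorted((s for _, s in scored_chunks), reverse=True)
    gap = scores[0] - scores[1] if len(scores) > 1 else None
    return kept, gap

# Hypothetical retrieval results for one query.
results = [("chunk A", 0.82), ("chunk B", 0.41), ("chunk C", 0.39)]
kept, gap = analyze_scores(results)
print(f"kept {len(kept)} chunk(s), top-1/top-2 gap = {gap:.2f}")
```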
Test how the system prompt affects generation quality. You can use any of the corpora and your own queries.
Create variations of your RAG prompt:
Minimal: Just context and question, no instructions
Strict grounding: "Answer ONLY based on the context. If the answer isn't there, say so."
Encouraging citation: "Quote the exact passages that support your answer."
Permissive: "Use the context to help answer, but you may also use your knowledge."
Structured output: "First list relevant facts from the context, then synthesize your answer."
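The five variations above can be collected as format strings so the same queries run against each; the wording beyond the quoted instructions is illustrative.

```python
# The five prompt variations as format strings.
TEMPLATES = {
    "minimal": "{context}\n\n{question}",
    "strict": ("Answer ONLY based on the context. If the answer isn't "
               "there, say so.\n\nContext:\n{context}\n\nQuestion: {question}"),
    "citation": ("Quote the exact passages that support your answer.\n\n"
                 "Context:\n{context}\n\nQuestion: {question}"),
    "permissive": ("Use the context to help answer, but you may also use "
                   "your knowledge.\n\nContext:\n{context}\n\n"
                   "Question: {question}"),
    "structured": ("First list relevant facts from the context, then "
                   "synthesize your answer.\n\nContext:\n{context}\n\n"
                   "Question: {question}"),
}

def render(name, context, question):
    return TEMPLATES[name].format(context=context, question=question)

print(render("strict", "(retrieved chunks)", "What oil should I use?"))
```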
For each prompt variation:
Run the same 5 queries
Evaluate: accuracy, groundedness, helpfulness, citation quality
Document:
Which prompt produces the most accurate answers?
Which produces the most useful answers?
Is there a trade-off between strict grounding and helpfulness?
Test questions that require combining information from multiple chunks. You can use any of the corpora and your own queries.
Example design queries that require synthesis:
"What are ALL the maintenance tasks I need to do monthly?"
"Compare the procedures for adjusting X vs. adjusting Y"
"What tools do I need for a complete tune-up?" (if tools are mentioned in separate sections)
"Summarize all safety warnings in the manual"
Experiment with top_k:
Try k=3, k=5, k=10
Does retrieving more chunks improve synthesis?
Document:
Can the model successfully combine information from multiple chunks?
Does it miss information that wasn't retrieved?
Does contradictory information in different chunks cause problems?
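For synthesis questions, labeling each retrieved chunk in the assembled context makes it easier to see which chunk supplied which fact (and which facts were never retrieved). A sketch, with made-up chunks:

```python
def build_context(chunks, k):
    """Join the top-k chunks, labeling each so you can trace
    every claim in the answer back to a chunk."""
    return "\n\n".join(f"[chunk {i + 1}] {c}" for i, c in enumerate(chunks[:k]))

# Hypothetical chunks that each hold part of the monthly-maintenance answer.
retrieved = [
    "Monthly: check the transmission band adjustment.",
    "Monthly: flush the cooling system.",
    "Monthly: inspect the spark plug gaps.",
]
print(build_context(retrieved, k=2))
```

Raising k to 3 pulls in the third task; with k=2 the model simply cannot mention it, which is one way to demonstrate "missed information that wasn't retrieved."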
manual_rag_pipeline_universal.ipynb - Python notebook for running a complete RAG pipeline. Runs locally or on Colab. Assumes the pdf files have embedded text, as do the pdfs in Corpora.zip. Feel free to break down some of the cells into multiple cells so you have finer-grained control over execution of the pipeline.
Corpora.zip - some test corpora for RAG. It contains the following corpora, each as a pdf with embedded text and the embedded text as a plain text file. I recommend using the text-only versions for your experiments because they were created using a better embedded-text extraction program than the Python routine that is built into the pipeline.
The Model T Ford Service Manual. There are two versions of the manual in Corpora.zip; use the one named NewModelT.
Issues of the Congressional Record from January 2026, long after the training cut-off date of any of the LLMs we are using.
The Learjet Maintenance manual.
The European Union AI Act - because it is 200 pages long, after chunking it can serve as a corpus itself.
pdf_ocr - A program that performs OCR on scanned pdf files and embeds text as well as creating a text file. You may find this useful for later projects.
Create a subdirectory in your GitHub portfolio named Topic5RAG and save your programs, each modified version named to indicate its task number and purpose. Create appropriately named text files saving the outputs from your terminal sessions running the programs. Create README.md with a table of contents of the directory.