Topic 5: RAG - Retrieval Augmented Generation


Updated 11 February 2026

Learning Goals

Tasks

You should work in a self-selected team of two or three people on these exercises. If you have difficulty finding a team, please ask the instructor for assistance. You should do the first few exercises together. If you wish, you may split the remaining exercises among your team members, but you should share and discuss the results with each other afterwards. Put copies of all of the results in every team member's GitHub portfolio, and list the names of the people in your team in the README.md file.

Exercise 0: Set-up

Download manual_rag_pipeline_universal.ipynb and get it running on Colab or your own computer. I recommend Colab unless you have a good GPU on your machine; the notebook will, however, run correctly on any machine, with or without a GPU. Download Corpora.zip and unzip it. (See below for details about the corpora.) Feel free to modify the pipeline program to help automate running the exercises.


Exercise 1: Open Model RAG vs. No RAG Comparison

Compare the LLM's answers with and without retrieval augmentation.

Setup: Use Qwen 2.5 1.5B (or another small open model), first with the Model T Ford repair manual and then, separately, with the Congressional Record corpus.

Queries to try with the Model T Ford corpus:

Queries to try with the Congressional Record corpus:

These issue numbers are for your use in finding the correct answers; don't give them to the LLM.

For each query:

  1. Ask the model directly (no RAG)

  2. Ask using your RAG pipeline

Document:

Optional subtask: Put both the Model T manual and the Congressional Record issues into the same RAG database. Does this have any effect on the quality of the answers?
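The per-query comparison above can be sketched as a small helper. The `retrieve` and `generate` callables are placeholders for whatever your pipeline notebook provides; adapt the names to the notebook's actual API.

```python
# Sketch of the Exercise 1 comparison loop. `retrieve` and `generate`
# are stand-ins for your pipeline's retriever and model call.

def build_prompt(question, context_chunks=None):
    """Build a plain prompt (no RAG) or a context-grounded prompt (RAG)."""
    if not context_chunks:
        return question
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

def compare(question, retrieve, generate, k=5):
    """Return (no_rag_answer, rag_answer) for one query."""
    no_rag_answer = generate(build_prompt(question))
    chunks = retrieve(question, k=k)
    rag_answer = generate(build_prompt(question, chunks))
    return no_rag_answer, rag_answer
```

Running both conditions through the same `build_prompt` keeps the only difference between them the presence of retrieved context.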


Exercise 2: Open Model + RAG vs. Large Model Comparison

Perhaps a larger model without RAG could be competitive with a small model with RAG.

Setup: Write a program (or adapt some of your code from last week) to run GPT-4o mini with no tools on individual questions (not a conversation). Run it on the queries from Exercise 1.
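A minimal single-question runner might look like the sketch below, assuming the official `openai` Python package (v1 or later) and an `OPENAI_API_KEY` in your environment; adjust to whatever client code you used last week.

```python
# Minimal one-shot (no conversation, no tools) runner for GPT-4o mini.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
import os

def build_messages(question):
    """One-shot message list: no conversation history, no tools."""
    return [
        {"role": "system", "content": "Answer the question directly."},
        {"role": "user", "content": question},
    ]

def ask_gpt4o_mini(question):
    from openai import OpenAI
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=build_messages(question),
    )
    return resp.choices[0].message.content

if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    print(ask_gpt4o_mini("How do you adjust the spark timing on a Model T?"))
```

Because each call builds a fresh message list, no context carries over between questions, matching the "individual questions" requirement.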

Document:


Exercise 3: Open Model + RAG vs. State-of-the-Art Chat Model

Compare your local RAG pipeline against a frontier model (GPT-5.2, Claude 4.6, etc.).

Setup:

Queries to try:

Document:


Exercise 4: Effect of Top-K Retrieval Count

Vary the number of chunks retrieved and observe how it affects answer quality. You can use any of the corpora and your own queries.

Test with: k = 1, 3, 5, 10, 20

For each k value:

Questions to explore:
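The k-sweep itself is just a slice over the ranked chunk list. The toy retriever below uses a bag-of-words cosine score so the sketch runs anywhere; your pipeline's dense-embedding retriever replaces `score`.

```python
# Toy retriever to illustrate the effect of k. A bag-of-words cosine
# score stands in for real dense embeddings.
from collections import Counter
import math

def score(query, chunk):
    q, c = Counter(query.lower().split()), Counter(chunk.lower().split())
    dot = sum(q[w] * c[w] for w in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in c.values())))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k):
    """Return the k chunks most similar to the query."""
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    return ranked[:k]

chunks = ["adjust the carburetor needle",
          "replace the rear axle",
          "spark plug gap setting"]
for k in (1, 2, 3):
    print(k, retrieve("how to set the spark plug gap", chunks, k))
```

As k grows, each prompt includes more, but less relevant, context; that trade-off is what this exercise measures.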


Exercise 5: Handling Unanswerable Questions

Test how well your system handles questions that cannot be answered from the corpus. You can use any of the corpora and your own queries.

Types of unanswerable questions:

Document:

Experiment: Modify your prompt template to add "If the context doesn't contain the answer, say 'I cannot answer this from the available documents.'" Does this help?
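One way to run the experiment cleanly is to make the refusal instruction a switch in your prompt template, so the only difference between conditions is that one sentence. The template text below is illustrative; adapt it to your pipeline's existing prompt.

```python
# Prompt template with the refusal instruction as a toggle.
REFUSAL = ("If the context doesn't contain the answer, say "
           "'I cannot answer this from the available documents.'")

def rag_prompt(question, context, allow_refusal=True):
    """Build the RAG prompt, optionally including the refusal instruction."""
    parts = ["Answer the question using the context below."]
    if allow_refusal:
        parts.append(REFUSAL)
    parts.append(f"Context:\n{context}")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)
```

Run each unanswerable question through both variants and compare how often the model hallucinates versus refuses.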


Exercise 6: Query Phrasing Sensitivity

Test how different phrasings of the same question affect retrieval. You can use any of the corpora with your own queries, or the queries below for the Model T or Learjet corpus.

Choose one underlying question and phrase it 5+ different ways:

For each phrasing:

Questions to explore:
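A simple way to quantify phrasing sensitivity is to retrieve for each phrasing and compare the retrieved chunk-ID sets pairwise with Jaccard overlap. The `retrieved` dictionary below is illustrative data; in practice it comes from your pipeline's retriever.

```python
# Pairwise Jaccard overlap between the chunk sets retrieved for
# different phrasings of the same underlying question.
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two collections of chunk IDs."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

retrieved = {  # illustrative chunk IDs, one list per phrasing
    "phrasing 1": [3, 7, 12, 15, 20],
    "phrasing 2": [3, 7, 12, 18, 22],
    "phrasing 3": [7, 9, 12, 15, 31],
}

for (p1, ids1), (p2, ids2) in combinations(retrieved.items(), 2):
    print(f"{p1} vs {p2}: {jaccard(ids1, ids2):.2f}")
```

Overlap near 1.0 means retrieval is robust to rewording; low overlap means answer quality will likely swing with phrasing.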


Exercise 7: Chunk Overlap Experiment

Test how overlap between chunks affects retrieval of information that spans chunk boundaries. You can use any of the corpora and your own queries. Note: this exercise takes a long time to run. Only try it on Colab or a similar platform with T4 or better GPUs.

Setup: Re-chunk your corpus with different overlap values while keeping chunk size constant (e.g., 512 characters):

For each configuration:

  1. Rebuild the index

  2. Find a question whose answer spans what would be a chunk boundary

  3. Test retrieval quality
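The re-chunking in step 1 can be sketched as a sliding window. Sizes are in characters, matching the 512-character example above; this is a minimal version of what your pipeline's chunker presumably does.

```python
def chunk_text(text, chunk_size=512, overlap=64):
    """Split text into fixed-size character chunks; consecutive chunks
    share `overlap` characters so boundary-spanning facts stay intact."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Demo: varying overlap at a fixed chunk size changes the chunk count.
sample = "abcdefghij" * 100  # 1000 characters
for ov in (0, 64, 128, 256):
    print(f"overlap={ov:3d} -> {len(chunk_text(sample, 512, ov))} chunks")
```

Larger overlap means more chunks (and a slower index rebuild), which is why the exercise warns about runtime.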

Document:


Exercise 8: Chunk Size Experiment

Test how chunk size affects retrieval precision and answer quality. You can use any of the corpora and your own queries. Note: this exercise takes a long time to run. Only try it on Colab or a similar platform with T4 or better GPUs.

Setup: Chunk your corpus at different sizes:

For each configuration:

  1. Rebuild the index

  2. Run the same set of 5 queries

  3. Examine retrieved chunks and final answers
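Before committing to a full index rebuild at each size, it can help to estimate how many chunks each configuration produces, since index-build time scales with chunk count. The corpus length below is an illustrative placeholder.

```python
# Rough chunk-count estimate per chunk size (characters, no overlap).
def chunk_count(text_len, chunk_size):
    """Number of chunks a text of text_len characters produces."""
    return -(-text_len // chunk_size)  # ceiling division

corpus_len = 1_200_000  # illustrative corpus size in characters
for size in (128, 256, 512, 1024, 2048):
    print(f"chunk_size={size:5d} -> {chunk_count(corpus_len, size):6d} chunks")
```

Small chunks give precise retrieval hits but fragment context; large chunks keep context together but dilute the similarity signal. The sweep above shows how steeply the index size changes between those extremes.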

Questions to explore:


Exercise 9: Retrieval Score Analysis

Analyze the similarity scores returned by your retrieval system. You can use any of the corpora and your own queries.

For 10 different queries:

  1. Retrieve top 10 chunks

  2. Record all similarity scores

  3. Examine the score distribution

Look for patterns:

Experiment: Implement a score threshold (e.g., only include chunks with score > 0.5). How does this affect results?
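The threshold experiment is a one-line filter over the retrieval results, assuming your retriever returns (chunk, score) pairs with higher scores meaning more similar:

```python
# Score-threshold filter for retrieval results.
def filter_by_score(results, threshold=0.5):
    """Keep only (chunk, score) pairs whose score exceeds the threshold."""
    return [(chunk, s) for chunk, s in results if s > threshold]

# Illustrative scores; real values come from your retriever.
results = [("chunk A", 0.82), ("chunk B", 0.55), ("chunk C", 0.31)]
print(filter_by_score(results))  # chunk C is dropped
```

Note that a fixed threshold interacts with the score distributions you record above: if scores cluster differently per query, a single global cutoff may drop relevant chunks for some queries and keep noise for others.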


Exercise 10: Prompt Template Variations

Test how the system prompt affects generation quality. You can use any of the corpora and your own queries.

Create variations of your RAG prompt:

  1. Minimal: Just context and question, no instructions

  2. Strict grounding: "Answer ONLY based on the context. If the answer isn't there, say so."

  3. Encouraging citation: "Quote the exact passages that support your answer."

  4. Permissive: "Use the context to help answer, but you may also use your knowledge."

  5. Structured output: "First list relevant facts from the context, then synthesize your answer."
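The five variations above can be kept in one template table so every variant sees identical inputs; only the instruction text differs. The exact wording here follows the list above, with minimal/shared scaffolding added.

```python
# The five prompt variations as fill-in templates.
CONTEXT_BLOCK = "Context:\n{context}\n\nQuestion: {question}"

TEMPLATES = {
    "minimal": "{context}\n\n{question}",
    "strict": ("Answer ONLY based on the context. If the answer isn't "
               "there, say so.\n\n" + CONTEXT_BLOCK),
    "citing": ("Quote the exact passages that support your answer.\n\n"
               + CONTEXT_BLOCK),
    "permissive": ("Use the context to help answer, but you may also use "
                   "your knowledge.\n\n" + CONTEXT_BLOCK),
    "structured": ("First list relevant facts from the context, then "
                   "synthesize your answer.\n\n" + CONTEXT_BLOCK),
}

def render(name, context, question):
    """Fill one template with the retrieved context and the query."""
    return TEMPLATES[name].format(context=context, question=question)
```

Holding the retrieved context fixed across all five variants isolates the prompt wording as the only experimental variable.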

For each prompt variation:

Document:


Exercise 11: Cross-Document Synthesis

Test questions that require combining information from multiple chunks. You can use any of the corpora and your own queries.

Example design queries that require synthesis:

Experiment with top_k:

Document:

Resources

Portfolio

Create a subdirectory in your GitHub portfolio named Topic5RAG and save your programs there, each modified version named to indicate its task number and purpose. Create appropriately named text files saving the outputs from your terminal sessions running the programs. Create a README.md with a table of contents for the directory.