Prof. Henry Kautz, henry.kautz@virginia.edu Spring 2026, Tuesdays & Thursdays 2:00pm - 3:15pm Mechanical Engineering Building Room 339
TA: Wenqian Ye, pvc7hs@virginia.edu
In-person office hours: Tuesdays & Thursdays 3:30-4:30pm, Rice 511. Sign up required: if no one is signed up by the start of office hours, I will not be in my office.
Zoom office hours: Wednesdays 12:30pm-2:00pm. Sign up required.
Questions about course enrollment, logistics, or absences: email both the instructor and TA.
Class is normally held in person. It will be held on Zoom only when an in-person meeting is impossible due to weather or other emergencies. In such cases, join by this link with passcode 152379. Students are required to attend the session live.
In this hands-on workshop, we will learn how to build AI agents: systems powered by large language models that autonomously interact with services, tools, and other agents.
Much of the programming work will be completed during class. Because this is a workshop, class attendance is mandatory. Attendance will be taken on paper at each class. Attendance can only be excused for illness or career events (including athletics); in such cases, notify both the instructor and the TA. Any student who misses more than two classes without a valid reason will receive 0 for class participation for the semester. Chronic absence will result in failing the class. Students are responsible for signing up for office hours for their mid-term portfolio review and for discussion of final project ideas on the dates noted in the class calendar.
You will need to create accounts on the following platforms. Please create your accounts before the first day of class.
An AI coder to help you complete the assignments:
Recommended: Claude Pro (including Claude Code) - $17 a month
Also good: GitHub Copilot Pro - free for students and faculty
Also good: Gemini Code Assist for Individuals - free (and also built into Google Colab)
Hugging Face - free or Pro for $9 a month
Google Colab - free, Pro for $10 a month, or free Pro for students for 12 months
GitHub - free
I have found Visual Studio Code with the Claude Code plugin to be a superb IDE.
For most of our tutorial-style programming we will use small open-source models and run them locally. The free tier of Google Colab includes T4 GPUs, so you can run models more efficiently than on a laptop. The maximum session runtime is 12 hours, with no more than 90 minutes of inactivity. Heavy usage can trigger "resource exhausted" messages and/or throttle your jobs. Colab Pro enables sessions of up to 24 hours and provides better GPU availability and more RAM.
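As a concrete starting point, here is a minimal sketch of running a small open-source chat model through the Hugging Face transformers pipeline on a Colab GPU; the model name is just one example of a small instruction-tuned model and can be swapped for any other.

```python
# Minimal sketch: run a small open-source chat model on a Colab T4 GPU.
# Assumes `pip install transformers torch accelerate` and a GPU runtime;
# the model name below is only an example of a small instruct model.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-0.5B-Instruct",   # example small model; swap as you like
    torch_dtype=torch.float16,            # fits comfortably in T4 memory
    device_map="auto",                    # use the GPU if one is available
)

messages = [{"role": "user", "content": "Explain what an AI agent is in one sentence."}]
result = generator(messages, max_new_tokens=64)
print(result[0]["generated_text"][-1]["content"])   # assistant's reply
```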
For later projects, you might choose to use API access to a state-of-the-art model from OpenAI, Anthropic, or Google running on their own servers. Although you can buy time for all of these models from a single aggregation service (such as AWS Bedrock, Google Vertex AI, or Microsoft Foundry), those services do not let you put a hard limit on your potential charges. This can be dangerous for your credit card if your code runs wild! I recommend instead signing up for the Claude API directly from Anthropic, because there you can set a credit card charge limit. The SOTA model Claude Opus 4.5 costs $5 per million tokens processed, while Claude Haiku 4.5 is nearly as good and costs only $0.45 per million tokens.
Students comfortable using the department's research GPU cluster may optionally use it, but be sure to submit jobs through SLURM and be careful not to tie up the few GPU servers for more than a few minutes.
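If you do go the API route, the call itself is short. Below is a minimal sketch using Anthropic's official Python SDK; the model id is illustrative, so check the current model list and set a spending limit in the Anthropic console before running anything.

```python
# Minimal sketch of a direct Anthropic API call (pip install anthropic).
# Assumes ANTHROPIC_API_KEY is set in your environment and that you have
# configured a spending limit in the Anthropic console. The model id
# below is illustrative; verify it against the current model list.
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-haiku-4-5",    # example model id; check the docs
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize the ReAct agent pattern."}],
)
print(response.content[0].text)
```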
In addition to programming, we will read about one paper a week and discuss it in class.
Overview of LLMs and AI agent benchmark datasets
Running your own small LLMs on your laptop and Google CoLab
Agent control flows (HF Pipelines, HF smolagents, LangChain/LangGraph, Toolformer, Model Context Protocol)
Few-shot learning AKA in-context learning
Chain of thought reasoning (CoT) and self-refinement (Self-Refine)
Search-augmented generation AKA live RAG AKA internet-augmented dialog generation
Vector-database Retrieval-augmented generation (RAG)
Action Models (ReAct)
Vision-language models (VLMs)
Multi-agent systems (Generative Agents, ToolOrchestra, Magentic-One)
Weight fine-tuning (QLoRA)
Context Management (AgentFold)
25% attendance and class participation
20% mid-term and final GitHub portfolio review
55% final project, with:
10% concept and motivation
30% implementation
20% presentation by a 10-minute video in mp4 format
Your work from each class and your final project should be stored in a GitHub repository. The final project, including a polished 10-minute video presentation, is due by 12 noon on Thursday, April 16. This is a hard deadline and extensions will not be granted.
Here are some ideas for final projects just to get you started thinking. Please talk to me via email or office hours about your choice of project before spring break. You may work alone or in a team of two students (not more). You will need to build a working system, give a presentation about it, and write a report that describes the problem it solves, the design of the system, and the results of running the agent.
Build an agent that submits the user's question to several different models and then selects the best answer to return. How does it determine the best answer? By asking models (the same or different ones) to critique each answer and vote. A minimal version of this control flow is sketched below.
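Here is one possible shape of that control flow, a sketch only: the ask() helper stands in for whatever API or local model you choose, and the model names are placeholders.

```python
# Sketch of a "propose, critique, vote" agent. The ask() helper is a
# placeholder for whatever client you use (Anthropic, OpenAI, a local
# Hugging Face pipeline, ...); the rest is the control flow.
from collections import Counter

def ask(model: str, prompt: str) -> str:
    """Hypothetical helper: send `prompt` to `model` and return its reply."""
    raise NotImplementedError("wire this up to your chosen API or local model")

PROPOSERS = ["model-a", "model-b", "model-c"]   # placeholder model names
JUDGES = ["model-a", "model-b", "model-c"]

def best_answer(question: str) -> str:
    # 1. Collect one candidate answer from each proposer model.
    candidates = {m: ask(m, question) for m in PROPOSERS}

    # 2. Ask each judge to pick the best candidate by its label.
    labels = list(candidates)
    ballot = question + "\n\nCandidate answers:\n" + "\n".join(
        f"[{label}] {candidates[label]}" for label in labels
    ) + "\n\nReply with only the label of the best answer."
    votes = Counter(ask(judge, ballot).strip().strip("[]") for judge in JUDGES)

    # 3. Return the candidate with the most votes (fall back to the first).
    winner, _ = votes.most_common(1)[0]
    return candidates.get(winner, next(iter(candidates.values())))
```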
Wouldn't it be great if UVA had an events calendar for all AI talks? It doesn't, and it can't easily create one, because different parts of the university use different calendar software. Write an agent that reads the many event calendars across the university and creates a calendar of all AI-related events in a standard format (a sketch of the output side appears below). Note that your agent will need to get around the ways the UVA website is guarded against bots. If you can build an agent that is reliable and requires minimal human effort to install, let's talk about launching a startup - the incompatible-calendar problem is ubiquitous across universities.
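For the output side, the standard format could simply be iCalendar text, which every calendar app can import. A rough sketch (with made-up event fields and keywords) might look like this:

```python
# Sketch of the output side of the calendar agent: once events have been
# scraped (by whatever means), keep the AI-related ones and emit them in
# standard iCalendar format. The event fields and keywords are made up.
from datetime import datetime

AI_KEYWORDS = ("ai", "machine learning", "llm", "neural", "agent")

def is_ai_event(title: str) -> bool:
    return any(k in title.lower() for k in AI_KEYWORDS)

def to_ics(events: list[dict]) -> str:
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//uva-ai-events//EN"]
    for ev in events:
        if not is_ai_event(ev["title"]):
            continue
        lines += [
            "BEGIN:VEVENT",
            f"SUMMARY:{ev['title']}",
            f"DTSTART:{ev['start'].strftime('%Y%m%dT%H%M%S')}",
            f"LOCATION:{ev.get('location', '')}",
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines)

print(to_ics([{"title": "LLM Agents Seminar", "start": datetime(2026, 3, 5, 16, 0)}]))
```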
LLMs have proven useful as theorem-proving assistants - see the paper on Lean Copilot and Gauss from MathAI.inc, as well as the bibliography and summary of recent breakthroughs listed below. Much more can be done; here are a few possibilities:
Extend the work in one of the breakthrough papers described in the bibliography.
An alternative to Lean is satisfiability modulo theories (SMT): fully automated theorem provers that are widely used in verification and testing, and that also serve as oracles for Lean. Examples include Z3 and cvc5. They work by translating program logic and properties into a set of logical constraints, then checking whether the negation of the program specification is satisfiable (a toy example of this pattern with Z3 appears below). Humans still have to write the program specifications and translate the code into formal logic. Create an agent that performs either or both of these tasks. It could, for example, determine what the specification should be from comments in the code and the names of functions and variables. Turning the code into logic is non-trivial because you typically need to abstract away unnecessary details to keep the logical representations from becoming too large for the SMT provers.
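To make the "check the negation of the specification" pattern concrete, here is a toy example using Z3's Python bindings (pip install z3-solver); the program fragment and specification are of course trivial.

```python
# Toy example of the SMT verification pattern with Z3: encode a program
# fragment and its specification, then ask whether the negation of the
# specification is satisfiable. "unsat" means the specification holds.
from z3 import Ints, Solver, Implies, Not, And, unsat

x, y = Ints("x y")

# Encode a tiny program fragment as a logical constraint:
# y = x if x >= 0 else -x  (an absolute-value computation).
body = And(Implies(x >= 0, y == x), Implies(x < 0, y == -x))

# Specification we want the fragment to satisfy: the result is non-negative.
spec = y >= 0

s = Solver()
s.add(body, Not(spec))   # search for a counterexample to the specification
if s.check() == unsat:
    print("spec verified: no counterexample exists")
else:
    print("counterexample:", s.model())
```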
Propositional resolution provably requires exponentially long refutation proofs to solve pigeonhole problems (you can't fit 12 pigeons into 11 pigeonholes) and similar problems that involve counting. Extended resolution allows a prover to introduce new defined propositional variables and can solve such problems with small proofs. However, there are no good provers for extended resolution. Write an agent that uses LLMs to augment a propositional CNF formula with defined variables so that a state-of-the-art SAT-based prover such as kissat can find small proofs (a generator for pigeonhole instances to experiment with is sketched below). Note: doing this in general for all proofs would mean that NP = co-NP, and you would be next in line for the Turing Award. But you should be able to make progress in having the agent recognize particular patterns in the CNF formula that indicate a pigeonhole problem.
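For experimentation, the pigeonhole instances themselves are easy to generate. The sketch below writes PHP(n+1, n) in DIMACS CNF format, which kissat and other SAT solvers read directly.

```python
# Generate the pigeonhole principle PHP(n+1, n) as a DIMACS CNF file:
# n+1 pigeons, n holes, variable v(p, h) means "pigeon p sits in hole h".
# The formula is unsatisfiable, and plain resolution proofs of that fact
# are exponentially long, which is what makes it a good test case here.
def pigeonhole_dimacs(n: int) -> str:
    pigeons, holes = n + 1, n
    v = lambda p, h: p * holes + h + 1          # 1-based DIMACS variable index
    clauses = []
    # Every pigeon is placed in at least one hole.
    for p in range(pigeons):
        clauses.append([v(p, h) for h in range(holes)])
    # No hole holds two pigeons.
    for h in range(holes):
        for p in range(pigeons):
            for q in range(p + 1, pigeons):
                clauses.append([-v(p, h), -v(q, h)])
    header = f"p cnf {pigeons * holes} {len(clauses)}"
    return "\n".join([header] + [" ".join(map(str, c)) + " 0" for c in clauses])

with open("php.cnf", "w") as f:
    f.write(pigeonhole_dimacs(8))               # then run: kissat php.cnf
```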
I have written a little language called Schema (not Scheme!) to make it easy to write formulas in finite-domain first-order logic that are compiled into propositional logic and then solved by satisfiability solvers. A SOTA LLM can be taught to translate English word problems into Schema by in-context training. A central challenge in using systems like Schema for commonsense reasoning is that problem statements usually do not contain all the information needed to solve the problem - "obvious" background knowledge is needed as well. Build a Schema-using agent that tries to find relevant background knowledge for a given problem and add it to the logical formalization.
Build an agent that helps you maintain good social connections with other people. How? It could remind you to make plans to see friends, suggest meet-ups you might enjoy, berate you if you spend too much time on your phone, etc. Use your imagination!
OlmoEarth from AI2 is a new platform for building agents that combine scientific LLMs and geophysical data. Use it to create a novel scientific application, for example in conservation.
Create an agent to help support work in any academic discipline with which you are familiar. The agent could focus on finding related literature in a highly accurate manner, suggesting research hypotheses for particular questions, designing experiments, communicating results, or some combination of such tasks.
Build a multi-agent simulation of an ecosystem (a bare-bones skeleton is sketched below). For inspiration, see the paper on generative agents listed below.
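A bare-bones skeleton of such a simulation might look like the following; the llm() helper is a placeholder for whatever model you connect, and everything else is just the control loop.

```python
# Skeleton of a generative-agents style ecosystem simulation: each creature
# keeps a memory of observations and asks an LLM (stubbed here) what to do
# next. The llm() helper is a placeholder for your chosen API or model.
def llm(prompt: str) -> str:
    """Hypothetical helper: return the model's chosen action as text."""
    raise NotImplementedError("connect to your chosen model")

class Creature:
    def __init__(self, name: str, species: str):
        self.name, self.species, self.memory = name, species, []

    def act(self, observation: str) -> str:
        self.memory.append(observation)
        prompt = (f"You are {self.name}, a {self.species} in a simulated ecosystem.\n"
                  f"Recent memories: {self.memory[-5:]}\n"
                  f"Current observation: {observation}\n"
                  "Choose one action (graze, hunt, flee, rest) and explain briefly.")
        return llm(prompt)

# One simulation step: every creature observes the shared world state and acts.
world = [Creature("Willow", "rabbit"), Creature("Fang", "fox")]
for step in range(10):
    for c in world:
        action = c.act(f"step {step}: a fox and a rabbit are near the stream")
        # update the shared world state based on `action` here
```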
Neuroscience shows that the hippocampus plays a role in consolidating memories during sleep through a process called "hippocampal replay". Read up on this theory, and also read the AgentFold paper below on the problem of maintaining long LLM contexts over time. Devise an agent system in which a hippocampus agent periodically incorporates parts of the context (memories) into the LLM agent's weights using QLoRA, so that those memories no longer need to be explicitly stored in the context. A sketch of the consolidation step appears below.
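A sketch of what the consolidation step could look like with the Hugging Face transformers and peft libraries is below; the base model name, LoRA hyperparameters, and training details are all assumptions you would need to revisit.

```python
# Sketch of the "consolidation" step: load the agent's base model in 4-bit
# and attach a LoRA adapter (the QLoRA recipe), then fine-tune on the
# memories selected by the hippocampus agent. The model name, LoRA settings,
# and training loop are assumptions; see the QLoRA paper and the PEFT docs.
# pip install transformers peft bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-0.5B-Instruct"             # example small base model
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb,
                                             device_map="auto")

# Attach a small LoRA adapter; only these weights are updated during
# consolidation, so repeated "sleep" cycles stay cheap.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# `memories` would be the context fragments the hippocampus agent chose to
# consolidate; fine-tune on them with your preferred trainer (e.g. TRL's
# SFTTrainer), then drop them from the live context.
memories = ["User prefers meetings after 2pm.", "Project repo is named agents-s26."]
```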
Create an agent that can play a board game other than chess, checkers, Othello, or Go. Suggestions: Monopoly (with trading), the character-based terminal games Rogue or NetHack, card games such as Pokémon or Magic: The Gathering, or strategy games such as Risk or Stratego. Note that you might want to give your agent access to an odds-calculator tool specific to the game - e.g., in Risk, to calculate the odds of winning a battle given the number of armies in the attacking and defending countries (an example of such a tool is sketched below). Can you go on to make your agent improve its play using reinforcement learning?
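As an example of such a tool, here is a simple Monte Carlo odds calculator for a single Risk battle under the standard dice rules; your agent could call it as a function.

```python
# Example of a game-specific odds tool the agent could call: Monte Carlo
# estimate of the attacker's chance of conquering a territory in Risk,
# given attacking and defending army counts (standard dice rules).
import random

def risk_win_probability(attackers: int, defenders: int, trials: int = 20000) -> float:
    wins = 0
    for _ in range(trials):
        a, d = attackers, defenders
        while a > 1 and d > 0:                       # attacker needs >1 army to attack
            a_dice = sorted((random.randint(1, 6) for _ in range(min(3, a - 1))), reverse=True)
            d_dice = sorted((random.randint(1, 6) for _ in range(min(2, d))), reverse=True)
            for atk, dfn in zip(a_dice, d_dice):     # compare the highest dice pairwise
                if atk > dfn:
                    d -= 1                           # defender loses an army
                else:
                    a -= 1                           # ties go to the defender
        wins += d == 0
    return wins / trials

print(risk_win_probability(10, 7))   # odds of 10 armies conquering a 7-army country
```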
The recent paper on model-first reasoning argues that an LLM can be strengthened simply by giving it a prompt to create a mathematical model of a problem by defining entities, state variables, actions, and constraints. Follow up on this by devising modeling-directive prompts for some specific domain with which you are familiar; one possible template is sketched below.
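One possible template, worded by me rather than taken from the paper, might look like this:

```python
# One possible "model-first" directive prompt, following the paper's idea of
# asking the model to lay out entities, state variables, actions, and
# constraints before answering. The wording is illustrative, not the paper's.
MODEL_FIRST_TEMPLATE = """Before solving the problem, build an explicit model of it:
1. Entities: list the objects involved.
2. State variables: list the quantities that can change, with units.
3. Actions: list the operations that change the state, with preconditions and effects.
4. Constraints: list the rules that must always hold.
Only after writing the model, solve the problem step by step and check the
solution against every constraint.

Problem ({domain}): {problem}"""

prompt = MODEL_FIRST_TEMPLATE.format(
    domain="course scheduling",
    problem="Assign 4 seminars to 3 rooms over 2 time slots with no conflicts.",
)
print(prompt)
```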
Unsloth, an open-source framework for LLM fine-tuning and reinforcement learning. https://unsloth.ai/
Tinker: a training API for researchers and developers. https://tinker-docs.thinkingmachines.ai/.
Ouyang, Long, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems (NeurIPS) 2022. https://arxiv.org/abs/2203.02155
The Model Context Protocol (MCP) Course, sponsored by Hugging Face and Anthropic. https://huggingface.co/learn/mcp-course/en/unit0/introduction
Domain-specific small language models, 2026, Guglielmo Iozzia, Manning Publications. Contact instructor for a pre-print. Textbook with code for many of the concepts in this course. Source code here: https://www.manning.com/books/domain-specific-small-language-models
Olmo 3: Charting a path through the model flow to lead open-source AI. AI2. https://allenai.org/blog/olmo3
Weaviate Claude Skills: A comprehensive set of Claude Skills for working with local Weaviate vector databases. These skills enable you to connect, manage, ingest data, and query Weaviate running in Docker directly through Claude.ai or Claude Desktop. https://github.com/saskinosie/weaviate-claude-skills
OlmoEarth Platform: Powerful open infrastructure for planetary insights. https://allenai.org/blog/olmoearth
OpenAI ChatGPT Atlas https://openai.com/index/introducing-chatgpt-atlas/
Agentic browser: an open-source, privacy-first alternative to ChatGPT Atlas, Perplexity Comet, and Dia. Forked from Chromium and 100% open source. https://github.com/browseros-ai/BrowserOS
IBM Granite 4.0 Nano-Models https://www.ibm.com/granite/docs/models/granite
Ai2 AI for Science Home Page: CodeScientist, DiscoveryWorld, DiscoveryBench, olmOCR, Ai2 ScholarQA, scientific datasets S2ORC & S2AG. https://allenai.org/ai-for-science
Gemini for Google Workspace Prompting Guide 101. https://workspace.google.com/learning/content/gemini-prompt-guide
GPT-5 Prompting Guide. https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_guide
Claude Prompt Engineering Overview. https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/overview
Effective context engineering for AI Agents. Anthropic Blog. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
"Building LLM applications for production", Chip Huyen's Blog, 2023. https://huyenchip.com/2023/04/11/llm-engineering.html
LEANN RAG Vector Database https://github.com/yichuan-w/LEANN
Fine-tune a pretrained model, Hugging Face Documentation. https://huggingface.co/docs/transformers/training
Large Language Model, Stanford Course, by Percy Liang. https://stanford-cs324.github.io/winter2022/
Agent Design Patterns: A Hands-on Guide to Building Intelligent Systems. Antonio Gulli. Preview of e-book. 424 pages. https://docs.google.com/document/d/1rsaK53T3Lg5KoGwvf8ukOUvbELRtH-V0LnOIFDxBryE/preview
Introducing Gauss, an agent for auto formalization, 2025, MathAI.inc. https://www.math.inc/gauss
Summary and Bibliography of Lean Mathematical Breakthroughs, Jan 2025-Jan 2026. Kevin Sullivan.
Masterman, Tula, Sandi Besen, Mason Sawtell, and Alex Chao. "The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey." arXiv preprint arXiv:2404.11584 (2024). https://arxiv.org/abs/2404.11584.
Training AI Co-Scientists Using Rubric Rewards. Shashwat Goel, Rishi Hazra, Dulhan Jayalath, Timon Willi, Parag Jain, William F. Shen, Ilias Leontiadis, Francesco Barbieri, Yoram Bachrach, Jonas Geiping, Chenxi Whitehouse. https://arxiv.org/abs/2512.23707
QLoRA: Efficient Finetuning of Quantized LLMs, Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, 2023. https://arxiv.org/abs/2305.14314
A generative model of memory construction and consolidation, Eleanor Spens & Neil Burgess, 2023. https://www.nature.com/articles/s41562-023-01799-z.pdf
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks. Microsoft Research AI Frontiers. 2024. https://arxiv.org/html/2411.04468v1
Generative Agents: Interactive Simulacra of Human Behavior. Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein, 2023 https://arxiv.org/abs/2304.03442
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Wei, J., et al. (2022). https://arxiv.org/abs/2201.11903
Internet-Augmented Dialogue Generation, Komeili et al., 2021, from Meta AI Research. This was one of the first papers to systematically explore augmenting conversational AI with real-time web search. https://arxiv.org/abs/2107.07566
Toolformer: Language Models Can Teach Themselves to Use Tools. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom, 2023. https://arxiv.org/abs/2302.04761
Language Models are Few-Shot Learners, Brown et al., 2020. https://arxiv.org/abs/2005.14165
A Survey on In-Context Learning, Dong, Q. et al. (2024). Formalizes ICL, relates it to meta-learning and prompting, and surveys techniques, analyses, and applications specifically for LLMs. https://arxiv.org/abs/2301.00234
Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning, 2024, Hao Zhao, Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion. https://arxiv.org/abs/2402.04833
Mathematical exploration and discovery at scale, Bogdan Georgiev, Javier Gómez-Serrano, Terence Tao, Adam Zsolt Wagner, 2025. https://arxiv.org/abs/2511.02864
ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration, 2025. https://arxiv.org/abs/2511.21689
From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence, 2025. https://arxiv.org/abs/2511.18538
Small Language Models are the Future of Agentic AI. Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, Pavlo Molchanov. https://arxiv.org/pdf/2506.02153
AgentFold: Long-Horizon Web Agents with Proactive Context Management. Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, Pengjun Xie, Fei Huang, Siheng Chen, Jingren Zhou, Yong Jiang. https://arxiv.org/abs/2510.24699
ReAct: Synergizing Reasoning and Acting in Language Models. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao. https://arxiv.org/abs/2210.03629
Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean. Peiyang Song, Kaiyu Yang, Anima Anandkumar. https://arxiv.org/abs/2404.12534
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv preprint arXiv:2305.10601. https://arxiv.org/abs/2305.10601
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-Refine: Iterative Refinement with Self-Feedback. arXiv preprint arXiv:2303.17651. https://arxiv.org/abs/2303.17651
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv preprint arXiv:2210.03629. https://arxiv.org/abs/2210.03629
Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. 2024. A survey on multimodal large language models. Natl. Sci. Rev. 11, 12 (November 2024), nwae403. DOI:https://doi.org/10.1093/nsr/nwae403
Hanjia Lyu, Jinfa Huang, Daoan Zhang, Yongsheng Yu, Xinyi Mou, Jinsheng Pan, Zhengyuan Yang, Zhongyu Wei, Jiebo Luo. GPT-4V(ision) as A Social Media Analysis Engine. https://arxiv.org/abs/2311.07547
Jiacheng Miao, Joe R. Davis, Jonathan K. Pritchard, and James Zou. 2025. Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents. arXiv preprint arXiv:2509.06917. https://arxiv.org/abs/2509.06917
Model-First Reasoning LLM Agents: Reducing Hallucinations through Explicit Problem Modeling, Annu Rana, Gaurav Kumar, 2025. https://arxiv.org/abs/2512.14474