Prof. Henry Kautz, henry.kautz@virginia.edu Spring 2026, Tuesdays & Thursdays 2:00pm - 3:15pm Mechanical Engineering Building Room 339
TA: Wenqian Ye, pvc7hs@virginia.edu
In-person office hours: Tuesdays & Thursdays 3:30-4:30pm, Rice 511. Sign up required: if no one is signed up by the start of office hours, I will not be in my office.
Zoom office hours: Wednesdays 12:30pm-2:00pm. Sign up required.
Questions about course enrollment, logistics, or absences: email both the instructor and TA.
Class is normally held in person. It will be held on Zoom only when an in-person meeting is impossible due to weather or other emergencies. In such cases, join by this link with passcode 152379. Students are required to attend the session live.
In this hands-on workshop, we will learn how to build AI agents: systems powered by large language models that autonomously interact with services, tools, and other agents.
Much of the programming work will be completed during class. Because this is a workshop, class attendance is mandatory. Attendance will be taken on paper at each class. Attendance can only be excused for illness or career events (including athletics); in such cases, notify both the instructor and the TA. Any student who misses more than two classes without a valid reason will receive 0 for class participation for the semester. Chronic absence will result in failing the class. Students are responsible for signing up for office hours for their mid-term portfolio review and for discussion of final project ideas on the dates noted in the class calendar.
You will need to create accounts on the following platforms. Please create your accounts before the first day of class.
An AI coder to help you complete the assignments:
Recommended: Claude Pro (including Claude Code) - $17 a month
Also good: GitHub Copilot Pro - free for students and faculty
Also good: Gemini Code Assist for Individuals - free (and also built into Google Colab)
Hugging Face - free or Pro for $9 a month
Google Colab - free, Pro for $10 a month, or free Pro for students for 12 months
GitHub - free
I have found Visual Studio Code with the Claude Code plugin to be a superb IDE.
For most of our tutorial-style programming we will use small open-source models and run them locally. The free tier of Google Colab includes T4 GPUs, so you can run models more efficiently than on a laptop. The maximum session runtime is 12 hours, with no more than 90 minutes of inactivity. Heavy usage can trigger "resource exhausted" messages and/or throttle your jobs. Colab Pro enables sessions of up to 24 hours and provides better GPU availability and more RAM.
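As a concrete starting point, here is a minimal sketch of running a small open-source chat model through the Hugging Face transformers pipeline on a Colab GPU; the model name is just one example of a small instruction-tuned model and can be swapped for any other.

```python
# Minimal sketch: run a small open-source chat model on a Colab T4 GPU.
# Assumes `pip install transformers torch accelerate` and a GPU runtime;
# the model name below is only an example of a small instruct model.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-0.5B-Instruct",   # example small model; swap as you like
    torch_dtype=torch.float16,            # fits comfortably in T4 memory
    device_map="auto",                    # use the GPU if one is available
)

messages = [{"role": "user", "content": "Explain what an AI agent is in one sentence."}]
result = generator(messages, max_new_tokens=64)
print(result[0]["generated_text"][-1]["content"])   # assistant's reply
```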
For later projects, you might choose to use API access to a state-of-the-art model from OpenAI, Anthropic, or Google running on their own servers. Although you can buy time for all of these models from a single aggregation service (such as AWS Bedrock, Google Vertex AI, or Microsoft Foundry), those services do not let you put a hard limit on your potential charges. This can be dangerous for your credit card if your code runs wild! I recommend instead signing up for the Claude API directly from Anthropic, because there you can set a credit card charge limit. The SOTA model Claude Opus 4.5 costs $5 per million tokens processed, while Claude Haiku 4.5 is nearly as good and costs only $0.45 per million tokens.
Students comfortable using the department's research GPU cluster may optionally use it, but be sure to submit jobs through SLURM and be careful not to tie up the few GPU servers for more than a few minutes.
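If you do go the API route, the call itself is short. Below is a minimal sketch using Anthropic's official Python SDK; the model id is illustrative, so check the current model list and set a spending limit in the Anthropic console before running anything.

```python
# Minimal sketch of a direct Anthropic API call (pip install anthropic).
# Assumes ANTHROPIC_API_KEY is set in your environment and that you have
# configured a spending limit in the Anthropic console. The model id
# below is illustrative; verify it against the current model list.
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-haiku-4-5",    # example model id; check the docs
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize the ReAct agent pattern."}],
)
print(response.content[0].text)
```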
In addition to programming, we will read about one paper a week and discuss it in class.
Overview of LLMs and AI agent benchmark datasets
Running your own small LLMs on your laptop and Google CoLab
Agent control flows (HF Pipelines, HF smolagents, LangChain/LangGraph, Toolformer, Model Context Protocol)
Few-shot learning AKA in-context learning
Chain of thought reasoning (CoT) and self-refinement (Self-Refine)
Search-augmented generation AKA live RAG AKA internet-augmented dialog generation
Vector-database Retrieval-augmented generation (RAG)
Action Models (ReAct)
Vision-language models (VLMs)
Multi-agent systems (Generative Agents, ToolOrchestra, Magentic-One)
Weight fine-tuning (QLoRA)
Context Management (AgentFold)
25% attendance and class participation
20% mid-term and final GitHub portfolio review
55% final project, with:
10% concept and motivation
30% implementation
20% presentation by a 10-minute video in mp4 format
Your work from each class and your final project should be stored in a GitHub repository. The final project, including a polished 10-minute video presentation, is due by 12 noon on Thursday, April 16. This is a hard deadline and extensions will not be granted.
Here are some ideas for final projects just to get you started thinking. Please talk to me via email or office hours about your choice of project before spring break. You may work alone or in a team of two students (not more). You will need to build a working system, give a presentation about it, and write a report that describes the problem it solves, the design of the system, and the results of running the agent.
Build an agent that submits the user's question to several different models and then selects the best answer to return. How does it determine the best answer? By asking models (the same or different ones) to critique each answer and vote. A minimal version of this control flow is sketched below.
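Here is one possible shape of that control flow, a sketch only: the ask() helper stands in for whatever API or local model you choose, and the model names are placeholders.

```python
# Sketch of a "propose, critique, vote" agent. The ask() helper is a
# placeholder for whatever client you use (Anthropic, OpenAI, a local
# Hugging Face pipeline, ...); the rest is the control flow.
from collections import Counter

def ask(model: str, prompt: str) -> str:
    """Hypothetical helper: send `prompt` to `model` and return its reply."""
    raise NotImplementedError("wire this up to your chosen API or local model")

PROPOSERS = ["model-a", "model-b", "model-c"]   # placeholder model names
JUDGES = ["model-a", "model-b", "model-c"]

def best_answer(question: str) -> str:
    # 1. Collect one candidate answer from each proposer model.
    candidates = {m: ask(m, question) for m in PROPOSERS}

    # 2. Ask each judge to pick the best candidate by its label.
    labels = list(candidates)
    ballot = question + "\n\nCandidate answers:\n" + "\n".join(
        f"[{label}] {candidates[label]}" for label in labels
    ) + "\n\nReply with only the label of the best answer."
    votes = Counter(ask(judge, ballot).strip().strip("[]") for judge in JUDGES)

    # 3. Return the candidate with the most votes (fall back to the first).
    winner, _ = votes.most_common(1)[0]
    return candidates.get(winner, next(iter(candidates.values())))
```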
Wouldn't it be great if UVA had an events calendar for all AI talks? It doesn't, and it can't easily create one, because different parts of the university use different calendar software. Write an agent that reads the many event calendars across the university and creates a calendar of all AI-related events in a standard format (a sketch of the output side appears below). Note that your agent will need to get around the ways the UVA website is guarded against bots. If you can build an agent that is reliable and requires minimal human effort to install, let's talk about launching a startup - the incompatible-calendar problem is ubiquitous across universities.
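For the output side, the standard format could simply be iCalendar text, which every calendar app can import. A rough sketch (with made-up event fields and keywords) might look like this:

```python
# Sketch of the output side of the calendar agent: once events have been
# scraped (by whatever means), keep the AI-related ones and emit them in
# standard iCalendar format. The event fields and keywords are made up.
from datetime import datetime

AI_KEYWORDS = ("ai", "machine learning", "llm", "neural", "agent")

def is_ai_event(title: str) -> bool:
    return any(k in title.lower() for k in AI_KEYWORDS)

def to_ics(events: list[dict]) -> str:
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//uva-ai-events//EN"]
    for ev in events:
        if not is_ai_event(ev["title"]):
            continue
        lines += [
            "BEGIN:VEVENT",
            f"SUMMARY:{ev['title']}",
            f"DTSTART:{ev['start'].strftime('%Y%m%dT%H%M%S')}",
            f"LOCATION:{ev.get('location', '')}",
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines)

print(to_ics([{"title": "LLM Agents Seminar", "start": datetime(2026, 3, 5, 16, 0)}]))
```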
LLMs have proven useful as theorem-proving assistants - see the paper on Lean Copilot and Gauss from MathAI.inc, as well as the bibliography and summary of recent breakthroughs listed below. Much more can be done; here are a few possibilities:
Extend the work in one of the breakthrough papers described in the bibliography.
An alternative to Lean is satisfiability modulo theories (SMT): fully automated theorem provers that are widely used in verification and testing, and that also serve as oracles for Lean. Examples include Z3 and cvc5. They work by translating program logic and properties into a set of logical constraints, then checking whether the negation of the program specification is satisfiable (a toy example of this pattern with Z3 appears below). Humans still have to write the program specifications and translate the code into formal logic. Create an agent that performs either or both of these tasks. It could, for example, determine what the specification should be from comments in the code and the names of functions and variables. Turning the code into logic is non-trivial because you typically need to abstract away unnecessary details to keep the logical representations from becoming too large for the SMT provers.
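To make the "check the negation of the specification" pattern concrete, here is a toy example using Z3's Python bindings (pip install z3-solver); the program fragment and specification are of course trivial.

```python
# Toy example of the SMT verification pattern with Z3: encode a program
# fragment and its specification, then ask whether the negation of the
# specification is satisfiable. "unsat" means the specification holds.
from z3 import Ints, Solver, Implies, Not, And, unsat

x, y = Ints("x y")

# Encode a tiny program fragment as a logical constraint:
# y = x if x >= 0 else -x  (an absolute-value computation).
body = And(Implies(x >= 0, y == x), Implies(x < 0, y == -x))

# Specification we want the fragment to satisfy: the result is non-negative.
spec = y >= 0

s = Solver()
s.add(body, Not(spec))   # search for a counterexample to the specification
if s.check() == unsat:
    print("spec verified: no counterexample exists")
else:
    print("counterexample:", s.model())
```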
Propositional resolution provably requires exponentially long refutation proofs to solve pigeonhole problems (you can't fit 12 pigeons into 11 pigeonholes) and similar problems that involve counting. Extended resolution allows a prover to introduce new defined propositional variables and can solve such problems with small proofs. However, there are no good provers for extended resolution. Write an agent that uses LLMs to augment a propositional CNF formula with defined variables so that a state-of-the-art SAT-based prover such as kissat can find small proofs (a generator for pigeonhole instances to experiment with is sketched below). Note: doing this in general for all proofs would mean that NP = co-NP, and you would be next in line for the Turing Award. But you should be able to make progress in having the agent recognize particular patterns in the CNF formula that indicate a pigeonhole problem.
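For experimentation, the pigeonhole instances themselves are easy to generate. The sketch below writes PHP(n+1, n) in DIMACS CNF format, which kissat and other SAT solvers read directly.

```python
# Generate the pigeonhole principle PHP(n+1, n) as a DIMACS CNF file:
# n+1 pigeons, n holes, variable v(p, h) means "pigeon p sits in hole h".
# The formula is unsatisfiable, and plain resolution proofs of that fact
# are exponentially long, which is what makes it a good test case here.
def pigeonhole_dimacs(n: int) -> str:
    pigeons, holes = n + 1, n
    v = lambda p, h: p * holes + h + 1          # 1-based DIMACS variable index
    clauses = []
    # Every pigeon is placed in at least one hole.
    for p in range(pigeons):
        clauses.append([v(p, h) for h in range(holes)])
    # No hole holds two pigeons.
    for h in range(holes):
        for p in range(pigeons):
            for q in range(p + 1, pigeons):
                clauses.append([-v(p, h), -v(q, h)])
    header = f"p cnf {pigeons * holes} {len(clauses)}"
    return "\n".join([header] + [" ".join(map(str, c)) + " 0" for c in clauses])

with open("php.cnf", "w") as f:
    f.write(pigeonhole_dimacs(8))               # then run: kissat php.cnf
```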
I have written a little language called Schema (not Scheme!) to make it easy to write formulas in finite-domain first-order logic that are compiled into propositional logic and then solved by satisfiability solvers. A SOTA LLM can be taught to translate English word problems into Schema by in-context training. A central challenge in using systems like Schema for commonsense reasoning is that problem statements usually do not contain all the information needed to solve the problem - "obvious" background knowledge is needed as well. Build a Schema-using agent that tries to find relevant background knowledge for a given problem and add it to the logical formalization.
Build an agent that helps you maintain good social connections with other people. How? It could remind you to make plans to see friends, suggest meet-ups you might enjoy, berate you if you spend too much time on your phone, etc. Use your imagination!
OlmoEarth from AI2 is a new platform for building agents that combine scientific LLMs and geophysical data. Use it to create a novel scientific application, for example in conservation.
Create an agent to help support work in any academic discipline with which you are familiar. The agent could focus on finding related literature in a highly accurate manner, suggesting research hypotheses for particular questions, designing experiments, communicating results, or some combination of such tasks.
Build a multi-agent simulation of an ecosystem (a bare-bones skeleton is sketched below). For inspiration, see the paper on generative agents listed below.
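A bare-bones skeleton of such a simulation might look like the following; the llm() helper is a placeholder for whatever model you connect, and everything else is just the control loop.

```python
# Skeleton of a generative-agents style ecosystem simulation: each creature
# keeps a memory of observations and asks an LLM (stubbed here) what to do
# next. The llm() helper is a placeholder for your chosen API or model.
def llm(prompt: str) -> str:
    """Hypothetical helper: return the model's chosen action as text."""
    raise NotImplementedError("connect to your chosen model")

class Creature:
    def __init__(self, name: str, species: str):
        self.name, self.species, self.memory = name, species, []

    def act(self, observation: str) -> str:
        self.memory.append(observation)
        prompt = (f"You are {self.name}, a {self.species} in a simulated ecosystem.\n"
                  f"Recent memories: {self.memory[-5:]}\n"
                  f"Current observation: {observation}\n"
                  "Choose one action (graze, hunt, flee, rest) and explain briefly.")
        return llm(prompt)

# One simulation step: every creature observes the shared world state and acts.
world = [Creature("Willow", "rabbit"), Creature("Fang", "fox")]
for step in range(10):
    for c in world:
        action = c.act(f"step {step}: a fox and a rabbit are near the stream")
        # update the shared world state based on `action` here
```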
Neuroscience shows that the hippocampus plays a role in consolidating memories during sleep through a process called "hippocampal replay". Read up on this theory, and also read the AgentFold paper below on the problem of maintaining long LLM contexts over time. Devise an agent system in which a hippocampus agent periodically incorporates parts of the context (memories) into the LLM agent's weights using QLoRA, so that those memories no longer need to be explicitly stored in the context. A sketch of the consolidation step appears below.
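A sketch of what the consolidation step could look like with the Hugging Face transformers and peft libraries is below; the base model name, LoRA hyperparameters, and training details are all assumptions you would need to revisit.

```python
# Sketch of the "consolidation" step: load the agent's base model in 4-bit
# and attach a LoRA adapter (the QLoRA recipe), then fine-tune on the
# memories selected by the hippocampus agent. The model name, LoRA settings,
# and training loop are assumptions; see the QLoRA paper and the PEFT docs.
# pip install transformers peft bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-0.5B-Instruct"             # example small base model
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb,
                                             device_map="auto")

# Attach a small LoRA adapter; only these weights are updated during
# consolidation, so repeated "sleep" cycles stay cheap.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# `memories` would be the context fragments the hippocampus agent chose to
# consolidate; fine-tune on them with your preferred trainer (e.g. TRL's
# SFTTrainer), then drop them from the live context.
memories = ["User prefers meetings after 2pm.", "Project repo is named agents-s26."]
```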
Create an agent that can play a board game other than chess, checkers, Othello, or Go. Suggestions: Monopoly (with trading), the character-based terminal games Rogue or NetHack, card games such as Pokémon or Magic: The Gathering, or strategy games such as Risk or Stratego. Note that you might want to give your agent access to an odds-calculator tool specific to the game - e.g., in Risk, to calculate the odds of winning a battle given the number of armies in the attacking and defending countries (an example of such a tool is sketched below). Can you go on to make your agent improve its play using reinforcement learning?
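As an example of such a tool, here is a simple Monte Carlo odds calculator for a single Risk battle under the standard dice rules; your agent could call it as a function.

```python
# Example of a game-specific odds tool the agent could call: Monte Carlo
# estimate of the attacker's chance of conquering a territory in Risk,
# given attacking and defending army counts (standard dice rules).
import random

def risk_win_probability(attackers: int, defenders: int, trials: int = 20000) -> float:
    wins = 0
    for _ in range(trials):
        a, d = attackers, defenders
        while a > 1 and d > 0:                       # attacker needs >1 army to attack
            a_dice = sorted((random.randint(1, 6) for _ in range(min(3, a - 1))), reverse=True)
            d_dice = sorted((random.randint(1, 6) for _ in range(min(2, d))), reverse=True)
            for atk, dfn in zip(a_dice, d_dice):     # compare the highest dice pairwise
                if atk > dfn:
                    d -= 1                           # defender loses an army
                else:
                    a -= 1                           # ties go to the defender
        wins += d == 0
    return wins / trials

print(risk_win_probability(10, 7))   # odds of 10 armies conquering a 7-army country
```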
The recent paper on model-first reasoning argues that an LLM can be strengthened simply by giving it a prompt to create a mathematical model of a problem by defining entities, state variables, actions, and constraints. Follow up on this by devising modeling-directive prompts for some specific domain with which you are familiar; one possible template is sketched below.
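One possible template, worded by me rather than taken from the paper, might look like this:

```python
# One possible "model-first" directive prompt, following the paper's idea of
# asking the model to lay out entities, state variables, actions, and
# constraints before answering. The wording is illustrative, not the paper's.
MODEL_FIRST_TEMPLATE = """Before solving the problem, build an explicit model of it:
1. Entities: list the objects involved.
2. State variables: list the quantities that can change, with units.
3. Actions: list the operations that change the state, with preconditions and effects.
4. Constraints: list the rules that must always hold.
Only after writing the model, solve the problem step by step and check the
solution against every constraint.

Problem ({domain}): {problem}"""

prompt = MODEL_FIRST_TEMPLATE.format(
    domain="course scheduling",
    problem="Assign 4 seminars to 3 rooms over 2 time slots with no conflicts.",
)
print(prompt)
```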
Unsloth, an open-source framework for LLM fine-tuning and reinforcement learning. https://unsloth.ai/
Tinker: a training API for researchers and developers. https://tinker-docs.thinkingmachines.ai/.
Ouyang, Long, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems (NeurIPS) 2022. https://arxiv.org/abs/2203.02155
The Model Context Protocol (MCP) Course, sponsored by Hugging Face and Anthropic. https://huggingface.co/learn/mcp-course/en/unit0/introduction
Domain-specific small language models, 2026, Guglielmo Iozzia, Manning Publications. Contact instructor for a pre-print. Textbook with code for many of the concepts in this course. Source code here: https://www.manning.com/books/domain-specific-small-language-models
Olmo 3: Charting a path through the model flow to lead open-source AI. AI2. https://allenai.org/blog/olmo3
Weaviate Claude Skills: A comprehensive set of Claude Skills for working with local Weaviate vector databases. These skills enable you to connect, manage, ingest data, and query Weaviate running in Docker directly through Claude.ai or Claude Desktop. https://github.com/saskinosie/weaviate-claude-skills
OlmoEarth Platform: Powerful open infrastructure for planetary insights. https://allenai.org/blog/olmoearth
OpenAI ChatGPT Atlas https://openai.com/index/introducing-chatgpt-atlas/
Agentic browser: an open-source, privacy-first alternative to ChatGPT Atlas, Perplexity Comet, and Dia. Forked from Chromium and 100% open source. https://github.com/browseros-ai/BrowserOS
IBM Granite 4.0 Nano-Models https://www.ibm.com/granite/docs/models/granite
Ai2 AI for Science Home Page: CodeScientist, DiscoveryWorld, DiscoveryBench, olmOCR, Ai2 ScholarQA, scientific datasets S2ORC & S2AG. https://allenai.org/ai-for-science
Gemini for Google Workspace Prompting Guide 101. https://workspace.google.com/learning/content/gemini-prompt-guide
GPT-5 Prompting Guide. https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_guide
Claude Prompt Engineering Overview. https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/overview
Effective context engineering for AI Agents. Anthropic Blog. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
"Building LLM applications for production", Chip Huyen's Blog, 2023. https://huyenchip.com/2023/04/11/llm-engineering.html
LEANN RAG Vector Database https://github.com/yichuan-w/LEANN
Fine-tune a pretrained model, Hugging Face Documentation. https://huggingface.co/docs/transformers/training
Large Language Model, Stanford Course, by Percy Liang. https://stanford-cs324.github.io/winter2022/
Agent Design Patterns: A Hands-on Guide to Building Intelligent Systems. Antonio Gulli. Preview of e-book. 424 pages. https://docs.google.com/document/d/1rsaK53T3Lg5KoGwvf8ukOUvbELRtH-V0LnOIFDxBryE/preview
Introducing Gauss, an agent for auto formalization, 2025, MathAI.inc. https://www.math.inc/gauss
Summary and Bibliography of Lean Mathematical Breakthroughs, Jan 2025-Jan 2026. Kevin Sullivan.
Masterman, Tula, Sandi Besen, Mason Sawtell, and Alex Chao. "The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey." arXiv preprint arXiv:2404.11584 (2024). https://arxiv.org/abs/2404.11584.
Training AI Co-Scientists Using Rubric Rewards. Shashwat Goel, Rishi Hazra, Dulhan Jayalath, Timon Willi, Parag Jain, William F. Shen, Ilias Leontiadis, Francesco Barbieri, Yoram Bachrach, Jonas Geiping, Chenxi Whitehouse. https://arxiv.org/abs/2512.23707
QLoRA: Efficient Finetuning of Quantized LLMs, Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, 2023. https://arxiv.org/abs/2305.14314
A generative model of memory construction and consolidation, Eleanor Spens & Neil Burgess, 2023. https://www.nature.com/articles/s41562-023-01799-z.pdf
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks. Microsoft Research AI Frontiers. 2024. https://arxiv.org/html/2411.04468v1
Generative Agents: Interactive Simulacra of Human Behavior. Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein, 2023 https://arxiv.org/abs/2304.03442
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Wei, J., et al. (2022). https://arxiv.org/abs/2201.11903
Internet-Augmented Dialogue Generation, Komeili et al., 2021, from Meta AI Research. This was one of the first papers to systematically explore augmenting conversational AI with real-time web search. https://arxiv.org/abs/2107.07566
Toolformer: Language Models Can Teach Themselves to Use Tools. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom, 2023. https://arxiv.org/abs/2302.04761
Language Models are Few-Shot Learners, Brown et al., 2020. https://arxiv.org/abs/2005.14165
A Survey on In-Context Learning, Dong, Q. et al. (2024). Formalizes ICL, relates it to meta-learning and prompting, and surveys techniques, analyses, and applications specifically for LLMs. https://arxiv.org/abs/2301.00234
Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning, 2024, Hao Zhao, Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion. https://arxiv.org/abs/2402.04833
Mathematical exploration and discovery at scale, Bogdan Georgiev, Javier Gómez-Serrano, Terence Tao, Adam Zsolt Wagner, 2025. https://arxiv.org/abs/2511.02864
ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration, 2025. https://arxiv.org/abs/2511.21689
From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence, 2025. https://arxiv.org/abs/2511.18538
Small Language Models are the Future of Agentic AI. Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, Pavlo Molchanov. https://arxiv.org/pdf/2506.02153
AgentFold: Long-Horizon Web Agents with Proactive Context Management. Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, Pengjun Xie, Fei Huang, Siheng Chen, Jingren Zhou, Yong Jiang. https://arxiv.org/abs/2510.24699
ReAct: Synergizing Reasoning and Acting in Language Models. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao. https://arxiv.org/abs/2210.03629
Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean. Peiyang Song, Kaiyu Yang, Anima Anandkumar. https://arxiv.org/abs/2404.12534
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv preprint arXiv:2305.10601. https://arxiv.org/abs/2305.10601
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-Refine: Iterative Refinement with Self-Feedback. arXiv preprint arXiv:2303.17651. https://arxiv.org/abs/2303.17651
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv preprint arXiv:2210.03629. https://arxiv.org/abs/2210.03629
Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. 2024. A survey on multimodal large language models. Natl. Sci. Rev. 11, 12 (November 2024), nwae403. DOI:https://doi.org/10.1093/nsr/nwae403
Hanjia Lyu, Jinfa Huang, Daoan Zhang, Yongsheng Yu, Xinyi Mou, Jinsheng Pan, Zhengyuan Yang, Zhongyu Wei, Jiebo Luo. GPT-4V(ision) as A Social Media Analysis Engine. https://arxiv.org/abs/2311.07547
Jiacheng Miao, Joe R. Davis, Jonathan K. Pritchard, and James Zou. 2025. Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents. arXiv preprint arXiv:2509.06917. https://arxiv.org/abs/2509.06917
Model-First Reasoning LLM Agents: Reducing Hallucinations through Explicit Problem Modeling, Annu Rana, Gaurav Kumar, 2025. https://arxiv.org/abs/2512.14474