Lesson Plan: Agentic AI for Science

Module: AI Agents
Duration: ~60 minutes
Format: Lecture (no exercise)


Learning Goals

By the end of this lecture, students should be able to:


Prerequisites

Students have covered:


1. Why Science Needs Agents, Not Just Models (8 min)

1.1 The Scientific Method Is an Agent Loop

Open by drawing the parallel that motivates the entire lecture. The scientific method — hypothesize, design experiment, execute, analyze results, revise hypothesis — is structurally identical to the ReAct loop students have already implemented:

| Scientific Method | Agent Loop |
| --- | --- |
| Formulate hypothesis | Plan |
| Design and run experiment | Act (tool calls) |
| Collect and analyze data | Observe |
| Revise hypothesis, repeat | Reflect and loop |

The key insight: a single LLM call can answer a factual question, but it cannot do science. Science requires iterative, multi-step workflows where the output of one step determines what to do next. This is exactly what agents are for.
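To make the parallel concrete, here is the loop as runnable Python: a toy sketch in which simple callables stand in for the LLM planner and its tools (all names are illustrative, not a real agent API):

```python
def scientific_agent_loop(propose, run_experiment, analyze, revise, max_iters=10):
    """The scientific method as a ReAct-style loop. The four callables stand in
    for what an LLM planner and its tools would do in a real agent."""
    hypothesis = propose()                          # Plan: formulate hypothesis
    for _ in range(max_iters):
        observation = run_experiment(hypothesis)    # Act: execute via tool call
        verdict = analyze(hypothesis, observation)  # Observe: analyze the data
        if verdict == "supported":                  # Reflect: stop or revise
            return hypothesis
        hypothesis = revise(hypothesis, observation)
    return hypothesis

# Toy campaign: find a temperature at which a sample melts (true value 1064 °C).
estimate = scientific_agent_loop(
    propose=lambda: 100,
    run_experiment=lambda t: "melted" if t >= 1064 else "solid",
    analyze=lambda t, obs: "supported" if obs == "melted" else "revise",
    revise=lambda t, obs: t + 200,
)
```

The point of the sketch is that nothing in the control flow is specific to science; only the callables change.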

1.2 From "LLM as Oracle" to "LLM as Scientist"

Distinguish three levels of AI involvement in science, each progressively more agentic:

  1. LLM as lookup tool — Ask a question, get an answer. (ChatGPT for literature search.) No iteration, no tool use.

  2. LLM as copilot — The scientist drives; the LLM assists with specific subtasks (write this code, summarize this paper, suggest a next experiment). Tool calling but no autonomy.

  3. LLM as autonomous agent — The agent plans a research strategy, executes experiments (computationally or via lab automation), analyzes results, revises its approach, and produces a report. Full agent loop with minimal human oversight.

Today's lecture is about the frontier of level 3 — and the hard questions about when and whether to trust it.

1.3 What Makes Scientific Agents Different from Coding Agents

Students have seen coding agents (Claude Code, SWE-bench runners) that write, test, and debug software. Scientific agents share the same architecture but face additional challenges:

  - Ground truth is expensive: verifying a claim may require a physical experiment, not just a passing test suite.

  - Errors fail silently: a hallucinated constant or citation does not fail to compile (§7.1).

  - Actions can be physical: robotic lab time and instrument access are costly, slow, and sometimes irreversible.

  - Human judgment remains load-bearing: key decisions, such as the unblinding call discussed in §6.6, stay with the scientist.

These challenges motivate the specific architectural patterns we'll see throughout the lecture.


2. The Scientific Agent Stack (7 min)

2.1 Mapping Course Concepts to Science

Walk through the agent architecture students already know and show how each component has a scientific counterpart:

Tool calling → Scientific instruments and APIs. Just as students called web APIs and executed code, scientific agents call domain-specific tools: molecular docking software, quantum chemistry simulators, genomic annotation pipelines, telescope scheduling APIs. The mechanism is identical — structured function calls with typed inputs and outputs.
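As a concrete illustration, a docking tool can be exposed to the model exactly the way students exposed web APIs: a name, a description, and a typed parameter schema. Everything below (the tool name, its parameters, the stubbed score) is hypothetical:

```python
import json

# Hypothetical scientific tool in the standard tool-calling format:
# a name, a description, and a typed JSON schema for its inputs.
# The tool body is stubbed; only the interface is the point.
DOCKING_TOOL_SCHEMA = {
    "name": "run_molecular_docking",
    "description": "Dock a ligand (SMILES) against a protein (PDB ID); return a binding score.",
    "input_schema": {
        "type": "object",
        "properties": {
            "ligand_smiles": {"type": "string", "description": "Ligand in SMILES notation"},
            "protein_pdb_id": {"type": "string", "description": "Target protein PDB ID"},
            "exhaustiveness": {"type": "integer", "default": 8},
        },
        "required": ["ligand_smiles", "protein_pdb_id"],
    },
}

def dispatch_tool_call(call: dict) -> str:
    """Route a model-emitted tool call to its implementation (stubbed here)."""
    if call["name"] == "run_molecular_docking":
        args = call["arguments"]
        # A real implementation would invoke docking software here.
        return json.dumps({"ligand": args["ligand_smiles"], "score_kcal_mol": -7.2})
    raise ValueError(f"unknown tool: {call['name']}")
```

The dispatch function is the same pattern students wrote for web APIs; only the schema's domain vocabulary has changed.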

MCP → Domain database servers. Students used the Asta MCP server to query Semantic Scholar. The same pattern extends to UniProt (protein sequences), the Materials Project (crystal structures), PubChem (chemical compounds), GenBank (genomic sequences), and the CERN Open Data Portal (particle physics datasets). Each is a specialized knowledge source that an agent can query through a standardized protocol.

RAG → Scientific literature retrieval. Instead of retrieving documentation chunks, scientific agents retrieve relevant papers, methods sections, experimental protocols, and prior results. The retrieval quality is critical — a missed relevant paper can mean repeating work or missing a known failure mode.

A2A → Multi-agent scientific collaboration. Complex scientific workflows naturally decompose into specialist roles: one agent for literature review, another for experimental design, a third for statistical analysis, a fourth for peer review. This is exactly the multi-agent coordination pattern students implemented with A2A.

2.2 The Emergent Pattern: Design–Execute–Analyze–Learn (DEAL)

Across every domain we'll survey today, the same four-phase loop appears:

  - Design: choose the next experiment based on current knowledge

  - Execute: run it (code, simulation, or physical lab automation)

  - Analyze: turn raw outputs into structured results

  - Learn: update beliefs and feed the next design step

In drug discovery this is called the DMTA cycle (Design–Make–Test–Analyze). In materials science it's the autonomous experimentation loop. In the CERN paper it's hypothesis–analysis–inference–revision. The terminology varies but the agent architecture is the same.


3. Ai2 Tools: The Knowledge Infrastructure for Scientific Agents (8 min)

3.1 Why Knowledge Infrastructure Matters

Before surveying domain applications, highlight a foundational requirement: scientific agents are only as good as the knowledge they can access. An agent that cannot efficiently search, retrieve, and synthesize scientific literature is flying blind. The Allen Institute for AI (Ai2) has built one of the most comprehensive open ecosystems of tools for exactly this purpose — and students have already used part of it.

3.2 Semantic Scholar: The Knowledge Graph

Semantic Scholar is a free, AI-powered search engine for academic literature, indexing over 200 million papers across all scientific disciplines. Unlike Google Scholar, it provides:

  - A structured citation graph (who cites whom, and with what intent)

  - AI-generated TLDR summaries of papers

  - Embedding-based semantic search rather than keyword matching alone

  - A free public API suitable for programmatic (agent) access

Students already queried this via the Asta MCP server in the remote tool use module. Emphasize that when a scientific agent "reads the literature," Semantic Scholar's API is often the underlying tool call.
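A minimal version of that tool call, using the Graph API's paper-search endpoint (the endpoint and fields follow Semantic Scholar's public API; the wrapper functions are our own sketch):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_search_url(query: str, limit: int = 5) -> str:
    """Construct a Semantic Scholar Graph API search request — the kind of
    call that sits underneath an agent's 'search the literature' tool."""
    params = {"query": query,
              "fields": "title,year,citationCount,abstract",
              "limit": limit}
    return f"{S2_SEARCH}?{urlencode(params)}"

def search_papers(query: str, limit: int = 5) -> list[dict]:
    """Fetch and parse results (performs a network call; not run here)."""
    with urlopen(build_search_url(query, limit)) as resp:
        return json.load(resp).get("data", [])
```

An agent's "read the literature" step is often nothing more exotic than this request plus a summarization pass over the returned abstracts.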

3.3 OpenScholar: RAG for Scientific Literature at Scale

OpenScholar (Ai2 + University of Washington, published in Nature 2026) is a retrieval-augmented language model purpose-built for scientific literature synthesis. It demonstrates what domain-specific RAG looks like at scale:

  - Retrieval over a datastore of tens of millions of open-access papers rather than generic web text

  - Answers grounded in verifiable citations, directly addressing the citation-hallucination problem discussed in §7.1

  - Openly released model weights and code (§3.6)

The pedagogical point: the RAG architecture students built from scratch in the RAG module is the same architecture powering state-of-the-art scientific literature tools. Scale and domain tuning matter, but the mechanism is the same.

3.4 ScholarQA and Asta: From Search to Reasoning

ScholarQA extends OpenScholar to multi-document scientific question answering — answering questions that require synthesizing insights across many papers, with structured outputs including comparison tables, subtopic breakdowns, and citation-backed evidence chains.

Asta is Ai2's scholarly research assistant combining literature understanding with data-driven discovery across 108 million+ abstracts and 12 million+ full-text papers. Students have already interacted with Asta's API through MCP tool calls. Frame Asta as an example of the kind of domain-specific tool server that scientific agents need in their toolkit.

3.5 AutoDiscovery: Agents That Choose Their Own Questions

All the systems above help agents answer scientific questions. AutoDiscovery (Agarwal & Majumder, UMass Amherst + Ai2; NeurIPS 2025, arXiv:2507.00310) tackles a deeper problem: can an agent decide which questions are worth asking in the first place?

AutoDiscovery uses Bayesian surprise as an intrinsic motivation signal. The agent holds a prior belief about a hypothesis (represented as a probability distribution), designs and runs an experiment (writing and executing Python code), then updates to a posterior belief given the results. Surprise is quantified as the magnitude of the epistemic shift — the more the evidence changes the agent's beliefs, the more interesting the discovery.
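A toy version of this computation (not AutoDiscovery's implementation) treats the belief as a discrete distribution over hypotheses, applies Bayes' rule, and measures surprise as the KL divergence from prior to posterior:

```python
import math

def bayes_update(prior, likelihood):
    """Posterior ∝ prior × likelihood, over a discrete hypothesis grid."""
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def kl_divergence(post, prior):
    """Bayesian surprise: KL(posterior ‖ prior), in nats."""
    return sum(q * math.log(q / p) for q, p in zip(post, prior) if q > 0)

# Hypothesis: "the effect size is small / medium / large", flat prior.
prior = [1/3, 1/3, 1/3]
# Data far more likely under "large" shifts belief a lot...
surprising = bayes_update(prior, likelihood=[0.05, 0.15, 0.80])
# ...while data equally likely everywhere shifts almost nothing.
boring = bayes_update(prior, likelihood=[0.33, 0.34, 0.33])
```

The agent would rank the first experiment as far more "interesting" than the second, because it moved the belief distribution much further.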

Architecture: A multi-agent system where collaborating LLMs propose experiment plans, write and execute analysis code, critique and fix mistakes, and analyze results — the same agent loop students know. To navigate the vast space of possible hypotheses efficiently, AutoDiscovery uses Monte Carlo tree search (MCTS) with progressive widening, balancing exploration of novel research directions against exploitation of promising threads.
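The progressive-widening rule itself is compact: a node may expand a new child only while its child count stays below C·N^α for visit count N, and child selection uses a standard UCT score. A sketch with illustrative constants (not AutoDiscovery's actual values):

```python
import math

def allow_new_child(num_children: int, num_visits: int,
                    c: float = 1.0, alpha: float = 0.5) -> bool:
    """Progressive widening: expand only while children < C * N**alpha.
    Keeps branching bounded in a huge (here: hypothesis) action space."""
    return num_children < c * max(num_visits, 1) ** alpha

def uct_score(child_value: float, child_visits: int,
              parent_visits: int, c_explore: float = 1.4) -> float:
    """Standard UCT: exploit mean value, explore rarely-visited children."""
    if child_visits == 0:
        return float("inf")
    return child_value / child_visits + c_explore * math.sqrt(
        math.log(parent_visits) / child_visits
    )
```

Exploration of novel research directions corresponds to the widening rule admitting a new hypothesis node; exploitation of promising threads corresponds to UCT revisiting high-value children.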

Results: Across 21 real-world datasets spanning behavioral science, economics, biology, and finance, AutoDiscovery finds 5–29% more surprising hypotheses than strong search baselines. Critically, in a human study with 500+ hypotheses, 67% of the discoveries the agent found surprising were also surprising to domain experts — suggesting that Bayesian surprise is a reasonable proxy for genuine scientific novelty.

AutoDiscovery is now integrated into Ai2's AstaLabs platform, making it accessible as part of the same ecosystem students have already used.

The pedagogical connection: this directly addresses the "scientific taste" challenge we'll revisit in §7. It is the first principled attempt to formalize what it means for an agent to find something scientifically interesting, using information-theoretic criteria rather than human-specified objectives.

3.6 The Open Science Commitment

Note that Ai2's tools are open-source and freely available — Semantic Scholar's API, OpenScholar's model weights and code, and the OLMo language model family that underlies several of these systems. This matters for scientific reproducibility: if a scientific agent's literature retrieval component is a black box, the agent's conclusions cannot be fully audited. Open infrastructure makes the entire agent pipeline inspectable.


4. Domain Survey: Biological Sciences (10 min)

4.1 Drug Discovery: The DMTA Cycle as Agent Loop

Drug discovery follows the Design–Make–Test–Analyze (DMTA) cycle — a natural agent loop. Traditionally, each iteration takes weeks to months. Agentic AI aims to compress this dramatically.

ChemCrow (Bran, Schwaller et al., EPFL; Nature Machine Intelligence 2024) is the canonical early example. It integrates an LLM (GPT-4) with 18 expert-designed chemistry tools via LangChain, spanning literature and web search, molecular property and safety checks, and reaction prediction and retrosynthesis planning.

ChemCrow autonomously planned and executed syntheses of an insect repellent and three organocatalysts, and guided the discovery of a novel chromophore. The architecture is exactly LangChain tool calling — the same pattern students implemented.

Industry adoption is accelerating. AstraZeneca's ChatInvent system evolved from a single-agent proof of concept into a multi-agent architecture for molecular design and synthesis planning. The Tippy framework automates the full DMTA cycle with specialized agents for each phase. Companies including Eli Lilly, Bristol Myers Squibb, Takeda, and AbbVie are building AI-driven pipelines that increasingly resemble the agent architectures students know.

4.2 Protein Engineering: Autonomous Design–Build–Test–Learn

Protein engineering is following a parallel trajectory. The key systems:

AlphaFold 3 (Google DeepMind, 2024) predicts structures of entire biomolecular complexes — proteins with DNA, RNA, small molecules, and ions. It serves as a powerful tool that agents can call, not an agent itself.

Boltz-2 (Wohlwend et al., MIT + Recursion, June 2025; open-source, MIT license) is a biomolecular foundation model that jointly predicts molecular structure and binding affinity. It approaches the accuracy of physics-based free energy perturbation (FEP) calculations but runs up to 1,000× faster — fast enough to serve as an inner-loop tool call for an agent iterating on designs. Released with full code, weights, and training pipeline.
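Why speed matters architecturally: a fast affinity predictor can sit inside the agent's design loop and score many candidates per reasoning step. Below is a schematic greedy design loop with toy stand-ins (integers for sequences, a distance function for the predictor) in place of a real Boltz-2 call:

```python
def design_loop(candidates, predict_affinity, propose_variants, rounds=3, keep=2):
    """Greedy design-test-learn sketch: score candidates with a fast
    predictor (stub for a Boltz-2-style model), keep the best, mutate, repeat."""
    pool = list(candidates)
    for _ in range(rounds):
        ranked = sorted(pool, key=predict_affinity)       # lower = tighter binding
        survivors = ranked[:keep]
        pool = survivors + [v for s in survivors for v in propose_variants(s)]
    return sorted(pool, key=predict_affinity)[0]

# Toy stand-ins: "designs" are integers, affinity is distance to an optimum.
best = design_loop(
    candidates=[0, 40, 90],
    predict_affinity=lambda x: abs(x - 57),   # pretend 57 is the ideal design
    propose_variants=lambda x: [x - 5, x + 5],
)
```

With a slow physics-based oracle, each `predict_affinity` call would take hours and this loop would be impractical; a 1,000× faster predictor makes it an inner loop.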

ProteinMCP (bioRxiv, March 2026) is an agentic framework for autonomous protein engineering that uses MCP — the same protocol students implemented — to unify 38 specialized bioinformatics tools (structure prediction, sequence design, docking, property evaluation) under a single LLM orchestrator. The agent interprets high-level scientific goals and autonomously composes multi-step workflows across these tools. A notable innovation: an automated pipeline that converts existing code repositories into MCP servers, allowing rapid integration of new tools as they appear. Applications span peptide design, antibody engineering, and drug discovery.

Autonomous enzyme engineering platforms (Nature Communications 2025) integrate machine learning with robotic biofoundry automation, eliminating human intervention from the design–build–test–learn cycle entirely. The agent selects mutations, the robot builds and tests them, and the agent analyzes results and plans the next round.

4.3 The Pattern Across Biological Sciences

Emphasize the common architecture: an LLM-based planning agent orchestrates domain-specific computational tools (docking, folding, property prediction) and, increasingly, physical laboratory automation (liquid handlers, plate readers, robotic synthesis). The agent loop is the same; only the tools change.


5. Domain Survey: Physical Sciences (10 min)

5.1 Materials Discovery: The A-Lab and Self-Driving Laboratories

The A-Lab (Lawrence Berkeley National Laboratory; Nature 2023) is an autonomous laboratory for solid-state synthesis of inorganic materials. The system uses computational predictions, literature data, machine learning, and active learning to plan experiments, which are then executed by robotic equipment. Over 17 days of continuous autonomous operation (355 experiments), the A-Lab successfully synthesized 41 of 58 target compounds (71% success rate), including a variety of oxides and phosphates identified from the Materials Project and Google DeepMind's GNoME predictions.

The A-Lab exemplifies the full agent loop with physical-world actions: the agent reasons about which compounds to attempt, plans synthesis conditions, controls robotic equipment to execute the synthesis, characterizes the products via X-ray diffraction, analyzes whether the target phase was obtained, and adjusts its strategy for the next attempt.

Radical AI (founded 2024 by Gerbrand Ceder, the A-Lab's principal investigator) is scaling this approach commercially, operating self-driving labs that can create and characterize over 25 alloys per day.

FORUM-AI (Foundation Models Orchestrating Reasoning Agents to Uncover Materials Advances and Insights), led by Berkeley Lab, is a multi-institutional project building the first full-stack agentic AI system for materials science — combining literature retrieval, large-scale simulation on supercomputers, and robotic experimentation under unified agent orchestration.

5.2 Chemistry: Retrosynthesis and Reaction Planning

Beyond drug discovery, agentic AI is transforming synthetic chemistry more broadly. Retrosynthesis — working backward from a target molecule to identify a viable synthesis route — is a classic planning problem that maps naturally to agent architectures. The agent proposes a route, evaluates feasibility (checking reagent availability, reaction conditions, safety), and iterates. ChemCrow (discussed in §4.1) demonstrates this; more specialized systems like IBM's RXN for Chemistry provide the tool-call backend.

5.3 Climate and Earth Science

Climate science involves orchestrating large-scale simulations (global circulation models, ocean models, atmospheric chemistry) that run for hours to days on supercomputers. The agentic opportunity is in automating the experimental design loop: an agent proposes a simulation configuration, submits the job, monitors execution, analyzes output, and designs the next simulation to test a hypothesis. This is early-stage but represents a natural application of the A2A multi-agent pattern — one agent manages compute resources, another analyzes results, a third consults the literature for context.

5.4 The Self-Driving Lab as Architecture Pattern

Synthesize the physical science examples into an architectural pattern. A "self-driving laboratory" is:

  1. Planning agent — reasons about what experiment to run next, informed by prior results and literature

  2. Execution layer — robotic equipment or simulation infrastructure controlled via tool calls

  3. Analysis agent — processes raw experimental data into structured results

  4. Knowledge base — scientific literature (via Semantic Scholar / OpenScholar) plus the lab's own experimental history (a growing dataset the agent queries via RAG)

This four-component pattern recurs across every physical science domain. It is the scientific instantiation of the agent stack students have been building all semester.
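The four components can be sketched as one orchestration loop (class and method names are our own, purely illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class SelfDrivingLab:
    """Schematic of the four-component self-driving-lab pattern.
    planner/executor/analyzer are callables standing in for an LLM planning
    agent, robotic (or simulated) hardware, and an analysis agent."""
    planner: callable        # 1. decides the next experiment from history
    executor: callable       # 2. runs it (robot / simulation tool call)
    analyzer: callable       # 3. turns raw output into a structured result
    history: list = field(default_factory=list)  # 4. the lab's own knowledge base

    def run_campaign(self, n_experiments: int):
        for _ in range(n_experiments):
            plan = self.planner(self.history)
            raw = self.executor(plan)
            result = self.analyzer(plan, raw)
            self.history.append(result)   # feeds the next planning step
        return self.history

# Toy campaign: ramp a synthesis temperature and log fake instrument readings.
lab = SelfDrivingLab(
    planner=lambda h: h[-1]["temp"] + 10 if h else 100,  # naive ramp policy
    executor=lambda temp: temp - 20,                      # fake instrument reading
    analyzer=lambda temp, raw: {"temp": temp, "yield": raw},
)
results = lab.run_campaign(3)
```

In a real deployment, the planner would also query the literature (component 4 includes Semantic Scholar / OpenScholar access), but the control flow is exactly this loop.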


6. Case Study: Autonomous High-Energy Physics (Moreno et al., 2026) (12 min)

6.1 Why This Paper

This section is the integrative core of the lecture. The paper "AI Agents Can Already Autonomously Perform Experimental High Energy Physics" (Moreno et al., MIT/CERN, March 2026, arXiv:2603.20179) is ideal because it demonstrates a complete agentic system that combines nearly every technique students have learned: autonomous code execution, literature-based RAG, multi-step reasoning, and multi-agent review — applied to real experimental data from three particle physics experiments.

6.2 The JFC Framework: Just Furnish Context

The framework's deliberately provocative name — JFC, "Just Furnish Context" — captures its core thesis: given access to the right context (data, literature, execution tools), a capable agent can autonomously plan and execute a credible physics analysis without experiment-specific scaffolding.

Components:

  - Claude Code as the base agent: the master loop, tool ecosystem, and sub-agents students studied in the coding-agents module

  - Autonomous code execution against the experiments' open data

  - Literature RAG over the relevant physics papers

  - Asta / Semantic Scholar queries via MCP

  - A multi-agent review protocol that critiques the analysis before results are accepted

6.3 What the Agent Did

The JFC framework was applied to open data from three particle physics experiments.

For each, the agent autonomously performed the complete analysis pipeline:

  1. Event selection — choosing which collision events to analyze based on physics criteria (analogous to data preprocessing and filtering)

  2. Background estimation — estimating contributions from non-signal processes (analogous to noise modeling)

  3. Uncertainty quantification — assessing systematic and statistical uncertainties

  4. Statistical inference — extracting physics parameters and evaluating significance

  5. Paper drafting — writing up the analysis as a scientific publication
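Step 4 can be illustrated with the textbook counting-experiment approximation: with n observed events, expected background b, and background uncertainty σ_b, an approximate excess significance is Z = (n − b) / √(b + σ_b²). This is a deliberately simplified teaching formula, not what the agent's full inference pipeline does; real analyses use profile-likelihood methods:

```python
import math

def approx_significance(n_obs: float, b_expected: float, sigma_b: float) -> float:
    """Approximate excess significance for a counting experiment:
    Z = (n - b) / sqrt(b + sigma_b**2). Simplified teaching formula."""
    return (n_obs - b_expected) / math.sqrt(b_expected + sigma_b ** 2)

# 120 events observed over an expected background of 100 ± 5 events:
z = approx_significance(120, 100, 5.0)
```

Even in this toy form, the calculation makes steps 2–4 concrete: the background estimate and its uncertainty feed directly into the significance of any apparent signal.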

6.4 Two Philosophies: HEPTAPOD vs. JFC

The JFC paper is best understood in contrast with HEPTAPOD (High Energy Physics Toolkit for Agentic Planning, Orchestration, and Deployment; Menzo et al., Fermilab, December 2025, arXiv:2512.15867), which takes a fundamentally different design approach to the same problem.

HEPTAPOD is built on the Orchestral AI engine (Roman & Roman, January 2026, arXiv:2601.02577; orchestral-ai.com), a lightweight Python framework for building LLM agents with automatic tool schema generation from Python type hints, MCP integration, workspace sandboxing, and human approval workflows. On top of Orchestral, HEPTAPOD provides HEP-specific tool interfaces, schema validation, and reproducibility infrastructure.

HEPTAPOD demonstrated a complete workflow for Beyond the Standard Model Monte Carlo signal validation, covering everything from symbolic model generation through event simulation to jet clustering and kinematic analysis.

The design tension: HEPTAPOD represents the structured orchestration philosophy — carefully engineered tool interfaces, schema validation, and reproducibility infrastructure. JFC represents the minimal scaffolding philosophy — provide context (data, literature, execution sandbox) and let the agent plan its own approach with no experiment-specific structure. The Moreno et al. paper explicitly argues that most proposed agentic workflows (like HEPTAPOD's) are "too narrowly scoped or scaffolded to specific analysis structures."

This is a genuine and productive disagreement in the field. The structured approach offers better reproducibility, auditability, and safety guarantees. The minimal approach offers more generality and may discover unconventional analysis strategies. In practice, production scientific agents will likely need elements of both.

6.5 Connecting to Course Concepts

Walk through each component and explicitly connect it to what students have built:

| JFC Component | Course Module | What Students Built |
| --- | --- | --- |
| Claude Code as base agent | Coding Agents, Claude Code deep dive | Understood the master loop, tool ecosystem, sub-agents |
| Literature RAG over physics papers | RAG Systems | Built retrieval pipelines with ChromaDB + embeddings |
| Asta/Semantic Scholar queries | Remote Tool Use (MCP) | Called Asta MCP server, parsed nested JSON responses |
| Multi-agent review protocol | Agent-to-Agent (A2A) | Implemented A2A communication, trivia tournament |
| Autonomous code execution | Tool Calling, ReAct Agents | Built ReAct loops, tool-calling agents from scratch |

The point to emphasize: this is not some exotic new architecture. It is the standard agent stack — ReAct + tools + RAG + multi-agent — applied to a domain where the tools are physics analysis libraries and the knowledge base is experimental literature.

6.6 The Unblinding Decision: Where Human Judgment Remains

A critical detail for the challenges discussion: in experimental particle physics, the analysis is designed while "blind" to the actual signal region, to prevent unconscious bias from affecting analysis choices. The decision to "unblind" — to look at the actual data in the signal region — is a significant moment requiring human judgment about whether the analysis methodology is sound.

The authors explicitly note that the agent performs all analysis steps up to and including the full statistical inference pipeline, but the decision to trust the result — the unblinding — remains a human responsibility. This is a concrete example of the "human in the loop" principle applied to autonomous scientific agents.

6.7 The Authors' Provocative Claim

The paper argues that the experimental HEP community is underestimating current agent capabilities, and that most proposed agentic workflows in physics are too narrowly scoped or too tightly scaffolded to specific analysis structures. The JFC framework deliberately avoids experiment-specific scaffolding — it provides context (data, literature, tools) and lets the agent plan its own approach.

Invite student discussion: Is this a strength (generality, robustness) or a risk (less control, harder to verify)?


7. Challenges, Risks, and Open Questions (5 min)

7.1 Hallucination in High-Stakes Domains

When a coding agent hallucinates a function name, the code fails to compile. When a scientific agent hallucinates a physical constant, a citation, or a statistical result, the error may propagate undetected. OpenScholar's finding that GPT-4o hallucinates citations 78–90% of the time is alarming in a scientific context where citation accuracy is foundational.

7.2 The Verification Problem

How do you verify an agent's scientific claims? For code, you have test suites. For science:

  - Ground truth may require running a real experiment: slow, costly, and sometimes impossible to repeat

  - Statistical validity depends on choices made throughout the analysis, not just the final number

  - The traditional check, peer review, is slow, human, and not built for machine-scale output

This is the deepest challenge: the whole point of autonomous agents is to go beyond human capacity, but verification still requires human experts.

7.3 Reproducibility and Auditability

Scientific results must be reproducible. An agent-driven analysis must log not just the final code and results but the entire reasoning trace — every literature query, every decision point, every alternative considered and rejected. This connects to the broader agent observability challenge.
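A minimal version of such a trace is append-only structured logging around every tool call — JSONL here, with a hypothetical wrapper interface:

```python
import json
import tempfile
import time

class TracedTools:
    """Wrap a tool registry so every call is appended to a JSONL audit log.
    A real system would also log prompts, retrieved documents, and the
    decision points in between; this shows only the tool-call layer."""
    def __init__(self, tools: dict, log_path: str):
        self.tools, self.log_path = tools, log_path

    def call(self, name: str, **kwargs):
        result = self.tools[name](**kwargs)
        entry = {"ts": time.time(), "tool": name,
                 "args": kwargs, "result": repr(result)}
        with open(self.log_path, "a") as f:
            f.write(json.dumps(entry) + "\n")   # append-only, one object per line
        return result

# Toy usage: a single (fake) instrument call leaves a replayable trace.
log_path = tempfile.mkstemp(suffix=".jsonl")[1]
tools = TracedTools({"measure_ph": lambda sample: 7.4}, log_path)
reading = tools.call("measure_ph", sample="S1")
```

Because the log is append-only and machine-readable, an auditor can replay exactly which tools the agent invoked, with which arguments, in which order.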

7.4 Cost and Access

Running agents costs money — API calls, compute for simulations, robotic lab time. The A-Lab's 17-day run was not free. Who gets access to autonomous scientific infrastructure? Does this concentrate scientific capability in well-funded labs?

7.5 Scientific Taste

Perhaps the deepest open question: can agents identify which questions are worth asking? Optimizing a known objective (better binding affinity, faster kernel) is well-suited to agents. Identifying a genuinely novel research direction — scientific taste — is harder.

AutoDiscovery (§3.5) represents the most principled attempt so far: using Bayesian surprise as an intrinsic motivation signal, with MCTS to explore the hypothesis space. The result that 67% of agent-surprising discoveries also surprised human experts is encouraging but also reveals the gap — a third of the time, the agent finds something "surprising" that experts consider unremarkable. The question of what makes a scientific question important (not just surprising) remains open. Surprise is necessary but not sufficient for taste.

Conclude with: the systems we've seen today are real and producing real results. The question is not whether agentic AI will transform science, but how the scientific community will adapt its institutions — peer review, reproducibility standards, credit assignment, training — to a world where agents are active participants in the research process.


Key Takeaways


Citations