Module: AI Agents
Duration: ~60 minutes
Format: Lecture (no exercise)
By the end of this lecture, students should be able to:
Articulate why scientific research is inherently agentic and why the agent loop (plan → act → observe → revise) maps naturally onto the scientific method
Describe the emerging "scientific agent stack" and connect it to architectures they have already built (ReAct, tool calling, MCP, RAG, A2A)
Identify concrete agentic AI systems across the biological sciences (drug discovery, protein engineering) and physical sciences (materials discovery, autonomous laboratories, chemistry)
Explain how the Ai2 tool ecosystem (Semantic Scholar, OpenScholar, ScholarQA, Asta) provides the knowledge infrastructure that scientific agents depend on
Analyze the JFC framework (Moreno et al., 2026) as an integrative case study that combines autonomous agent execution, literature RAG, and multi-agent review to perform experimental high-energy physics
Critically evaluate the challenges of deploying agents in high-stakes scientific domains: hallucination, verification, reproducibility, and the limits of autonomy
Students have covered:
The ReAct loop (plan, act, observe, repeat)
Tool calling: manual implementation → LangChain → LangGraph
RAG systems: embeddings, vector databases, retrieval pipelines
Remote tool use via MCP (including the Asta MCP server for Semantic Scholar)
Agent-to-agent communication via the A2A protocol
Coding agents and autonomous code execution
Open by drawing the parallel that motivates the entire lecture. The scientific method — hypothesize, design experiment, execute, analyze results, revise hypothesis — is structurally identical to the ReAct loop students have already implemented:
| Scientific Method | Agent Loop |
|---|---|
| Formulate hypothesis | Plan |
| Design and run experiment | Act (tool calls) |
| Collect and analyze data | Observe |
| Revise hypothesis, repeat | Reflect and loop |
The key insight: a single LLM call can answer a factual question, but it cannot do science. Science requires iterative, multi-step workflows where the output of one step determines what to do next. This is exactly what agents are for.
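For students, this parallel can be made concrete with a toy sketch. The "experiment" below is a simulated measurement with a hidden ground truth (a hypothetical melting-point lookup, not any real instrument); the point is the iterative structure — each observation determines the next hypothesis, which a single LLM call cannot do:

```python
# Toy instantiation of the plan -> act -> observe -> revise loop.
# The "lab" knows a ground truth the agent doesn't; the agent narrows
# its hypothesis iteratively, mirroring the scientific method.

HIDDEN_MELTING_POINT = 327  # ground truth, hidden from the agent

def run_experiment(guess):
    """Act: query the (simulated) instrument. Returns an observation, not the answer."""
    return "still solid" if guess < HIDDEN_MELTING_POINT else "melted"

def agent_loop(low=0, high=1000):
    history = []
    while high - low > 1:
        hypothesis = (low + high) // 2            # Plan: choose the next test point
        observation = run_experiment(hypothesis)  # Act + Observe
        history.append((hypothesis, observation))
        if observation == "still solid":          # Revise beliefs, then loop
            low = hypothesis
        else:
            high = hypothesis
    return high, history

estimate, trace = agent_loop()
print(estimate, len(trace))  # converges to the hidden value in ~10 iterations
```

No single step here is intelligent; the power is entirely in the loop — which is the lecture's framing of why science needs agents rather than one-shot answers.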
Distinguish three levels of AI involvement in science, each progressively more agentic:
LLM as lookup tool — Ask a question, get an answer. (ChatGPT for literature search.) No iteration, no tool use.
LLM as copilot — The scientist drives; the LLM assists with specific subtasks (write this code, summarize this paper, suggest a next experiment). Tool calling but no autonomy.
LLM as autonomous agent — The agent plans a research strategy, executes experiments (computationally or via lab automation), analyzes results, revises its approach, and produces a report. Full agent loop with minimal human oversight.
Today's lecture is about the frontier of level 3 — and the hard questions about when and whether to trust it.
Students have seen coding agents (Claude Code, SWE-bench runners) that write, test, and debug software. Scientific agents share the same architecture but face additional challenges:
Domain knowledge is vast and specialized — the agent needs access to scientific literature, databases, and ontologies (not just code documentation)
Tools are physical or computationally expensive — running a simulation may take hours; running a wet-lab experiment takes days and costs real money
Verification is harder — you can run a test suite on code, but how do you verify a novel scientific claim?
Stakes are higher — a wrong drug candidate or a flawed physics measurement has consequences beyond a failed build
These challenges motivate the specific architectural patterns we'll see throughout the lecture.
Walk through the agent architecture students already know and show how each component has a scientific counterpart:
Tool calling → Scientific instruments and APIs. Just as students called web APIs and executed code, scientific agents call domain-specific tools: molecular docking software, quantum chemistry simulators, genomic annotation pipelines, telescope scheduling APIs. The mechanism is identical — structured function calls with typed inputs and outputs.
MCP → Domain database servers. Students used the Asta MCP server to query Semantic Scholar. The same pattern extends to UniProt (protein sequences), the Materials Project (crystal structures), PubChem (chemical compounds), GenBank (genomic sequences), and the CERN Open Data Portal (particle physics datasets). Each is a specialized knowledge source that an agent can query through a standardized protocol.
RAG → Scientific literature retrieval. Instead of retrieving documentation chunks, scientific agents retrieve relevant papers, methods sections, experimental protocols, and prior results. The retrieval quality is critical — a missed relevant paper can mean repeating work or missing a known failure mode.
A2A → Multi-agent scientific collaboration. Complex scientific workflows naturally decompose into specialist roles: one agent for literature review, another for experimental design, a third for statistical analysis, a fourth for peer review. This is exactly the multi-agent coordination pattern students implemented with A2A.
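The first of these mappings — structured function calls with typed inputs and outputs — can be sketched in a few lines. The tool itself (`predict_solubility`) is a hypothetical stand-in for a real cheminformatics backend; the schema-plus-dispatch mechanics are the point:

```python
import json

# Sketch of a typed tool interface in the style of LLM function calling.
# The tool is a placeholder; a scientific agent would wrap docking software,
# a quantum chemistry simulator, or a telescope scheduling API the same way.

TOOL_SCHEMA = {
    "name": "predict_solubility",
    "description": "Predict aqueous solubility (logS) for a molecule.",
    "parameters": {
        "type": "object",
        "properties": {
            "smiles": {"type": "string", "description": "SMILES string"},
            "temperature_k": {"type": "number", "description": "Temperature in Kelvin"},
        },
        "required": ["smiles"],
    },
}

def predict_solubility(smiles: str, temperature_k: float = 298.15) -> dict:
    # Placeholder model: a real agent would call QSAR/docking software here.
    return {"smiles": smiles, "logS": -2.7, "temperature_k": temperature_k}

def dispatch(tool_call_json: str) -> dict:
    """Validate a model-emitted tool call against the schema, then execute."""
    call = json.loads(tool_call_json)
    required = TOOL_SCHEMA["parameters"]["required"]
    missing = [k for k in required if k not in call["arguments"]]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return predict_solubility(**call["arguments"])

result = dispatch('{"name": "predict_solubility", "arguments": {"smiles": "CCO"}}')
```

Swapping the backend changes the science; the calling convention — the part students built — stays identical.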
Across every domain we'll survey today, the same four-phase loop appears:
Design → Execute → Analyze → Learn → (repeat)
In drug discovery this is called the DMTA cycle (Design–Make–Test–Analyze). In materials science it's the autonomous experimentation loop. In the CERN paper it's hypothesis–analysis–inference–revision. The terminology varies but the agent architecture is the same.
Before surveying domain applications, highlight a foundational requirement: scientific agents are only as good as the knowledge they can access. An agent that cannot efficiently search, retrieve, and synthesize scientific literature is flying blind. The Allen Institute for AI (Ai2) has built the most comprehensive open ecosystem of tools for exactly this purpose — and students have already used part of it.
Semantic Scholar is a free, AI-powered search engine for academic literature, indexing over 200 million papers across all scientific disciplines. Unlike Google Scholar, it provides:
Structured metadata accessible via API — authors, citations, references, venues, fields of study, and publication dates as typed data, not just search results
AI-generated summaries (TLDRs) for rapid triage of paper relevance
Citation context — not just "paper A cites paper B" but how and why it cites it
Influence scores and citation graphs for identifying seminal vs. incremental work
Students already queried this via the Asta MCP server in the remote tool use module. Emphasize that when a scientific agent "reads the literature," Semantic Scholar's API is often the underlying tool call.
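To demystify that underlying tool call, here is a sketch of a query against Semantic Scholar's public Graph API. The endpoint shape follows the documented API; the response is mocked (with facts from the A-Lab paper discussed later) so the example runs offline:

```python
import json
from urllib.parse import urlencode

# Sketch of the tool call behind "read the literature": a Semantic Scholar
# Graph API search. The response below is a mocked illustration of the
# API's shape, not a live result.

BASE = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_search_url(query, fields=("title", "year", "tldr"), limit=5):
    params = {"query": query, "fields": ",".join(fields), "limit": limit}
    return BASE + "?" + urlencode(params)

url = build_search_url("autonomous laboratory materials synthesis")

# A real agent would fetch `url`; we parse a mocked response of the same shape.
mock_response = json.loads("""{
  "total": 1,
  "data": [
    {"paperId": "abc123",
     "title": "An autonomous laboratory for the accelerated synthesis of novel materials",
     "year": 2023,
     "tldr": {"text": "A robotic lab synthesized 41 of 58 predicted compounds."}}
  ]
}""")

for paper in mock_response["data"]:
    print(paper["year"], paper["title"], "-", paper["tldr"]["text"])
```

Note the typed metadata — `year`, `tldr`, `paperId` — arriving as structured fields an agent can branch on, rather than as search-result prose.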
OpenScholar (Ai2 + University of Washington, published in Nature 2026) is a retrieval-augmented language model purpose-built for scientific literature synthesis. It demonstrates what domain-specific RAG looks like at scale:
Datastore: 45 million open-access papers from Semantic Scholar
Architecture: Retrieval-augmented generation — query → retrieve relevant passages → synthesize a citation-backed response (the same RAG pipeline students built, but at massive scale with domain-specific tuning)
Key result: OpenScholar-8B (an 8-billion parameter open model) outperforms GPT-4o by 6.1% on correctness for multi-paper synthesis tasks. GPT-4o hallucinates citations 78–90% of the time; OpenScholar achieves citation accuracy on par with human experts.
Expert preference: Human domain experts preferred OpenScholar responses over expert-written responses 51% of the time — a striking benchmark for AI-generated scientific synthesis.
The pedagogical point: the RAG architecture students built from scratch in the RAG module is the same architecture powering state-of-the-art scientific literature tools. Scale and domain tuning matter, but the mechanism is the same.
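A stripped-down version of that retrieval step makes the point concrete. Bag-of-words vectors and cosine similarity stand in here for learned embeddings and a vector database; the abstracts are invented placeholders:

```python
import math
from collections import Counter

# Toy sketch of the retrieval step in a literature RAG pipeline.
# Bag-of-words counts replace learned embeddings; the ranking logic is the same.

ABSTRACTS = {
    "paper_a": "robotic synthesis of inorganic materials with active learning",
    "paper_b": "retrieval augmented generation for scientific literature synthesis",
    "paper_c": "binding affinity prediction for protein ligand complexes",
}

def embed(text):
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(ABSTRACTS, key=lambda p: cosine(q, embed(ABSTRACTS[p])), reverse=True)
    return ranked[:k]

top = retrieve("literature synthesis with retrieval")
```

OpenScholar replaces `embed` with a trained encoder and `ABSTRACTS` with 45 million papers, but the query → rank → retrieve mechanism is the one students implemented.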
ScholarQA extends OpenScholar to multi-document scientific question answering — answering questions that require synthesizing insights across many papers, with structured outputs including comparison tables, subtopic breakdowns, and citation-backed evidence chains.
Asta is Ai2's scholarly research assistant combining literature understanding with data-driven discovery across 108 million+ abstracts and 12 million+ full-text papers. Students have already interacted with Asta's API through MCP tool calls. Frame Asta as an example of the kind of domain-specific tool server that scientific agents need in their toolkit.
All the systems above help agents answer scientific questions. AutoDiscovery (Agarwal & Majumder, UMass Amherst + Ai2; NeurIPS 2025, arXiv:2507.00310) tackles a deeper problem: can an agent decide which questions are worth asking in the first place?
AutoDiscovery uses Bayesian surprise as an intrinsic motivation signal. The agent holds a prior belief about a hypothesis (represented as a probability distribution), designs and runs an experiment (writing and executing Python code), then updates to a posterior belief given the results. Surprise is quantified as the magnitude of the epistemic shift — the more the evidence changes the agent's beliefs, the more interesting the discovery.
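The surprise signal can be illustrated with a toy sketch: a hypothesis about a success probability, updated on a discrete grid, with surprise measured as the KL divergence between posterior and prior. (AutoDiscovery's actual implementation is richer — LLM-elicited beliefs, code-executing experiments — but the epistemic-shift idea is the same.)

```python
import math

# Toy sketch of Bayesian surprise. The hypothesis is a success probability p;
# an "experiment" observes k successes in n trials. Surprise is
# KL(posterior || prior): how far the evidence moved the agent's beliefs.

GRID = [i / 100 for i in range(1, 100)]  # candidate values of p in (0, 1)

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def posterior(prior, k, n):
    # Binomial likelihood up to a constant; the constant cancels on normalizing.
    like = [p ** k * (1 - p) ** (n - k) for p in GRID]
    return normalize([pr * l for pr, l in zip(prior, like)])

def kl(post, prior):
    return sum(q * math.log(q / p) for q, p in zip(post, prior) if q > 0)

prior = normalize([1.0] * len(GRID))     # uniform prior belief
post_expected = posterior(prior, 5, 10)  # unsurprising outcome: ~50/50
post_extreme = posterior(prior, 10, 10)  # surprising outcome: all successes

surprise_expected = kl(post_expected, prior)
surprise_extreme = kl(post_extreme, prior)
```

The all-successes outcome concentrates the posterior near p = 1 and yields a larger KL — so under this criterion, the agent would judge that experiment the more interesting one to pursue.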
Architecture: A multi-agent system where collaborating LLMs propose experiment plans, write and execute analysis code, critique and fix mistakes, and analyze results — the same agent loop students know. To navigate the vast space of possible hypotheses efficiently, AutoDiscovery uses Monte Carlo tree search (MCTS) with progressive widening, balancing exploration of novel research directions against exploitation of promising threads.
Results: Across 21 real-world datasets spanning behavioral science, economics, biology, and finance, AutoDiscovery finds 5–29% more surprising hypotheses than strong search baselines. Critically, in a human study with 500+ hypotheses, 67% of the discoveries the agent found surprising were also surprising to domain experts — suggesting that Bayesian surprise is a reasonable proxy for genuine scientific novelty.
AutoDiscovery is now integrated into Ai2's AstaLabs platform, making it accessible as part of the same ecosystem students have already used.
The pedagogical connection: this directly addresses the "scientific taste" challenge we'll revisit in §7. It is the first principled attempt to formalize what it means for an agent to find something scientifically interesting, using information-theoretic criteria rather than human-specified objectives.
Note that Ai2's tools are open-source and freely available — Semantic Scholar's API, OpenScholar's model weights and code, and the OLMo language model family that underlies several of these systems. This matters for scientific reproducibility: if a scientific agent's literature retrieval component is a black box, the agent's conclusions cannot be fully audited. Open infrastructure makes the entire agent pipeline inspectable.
Drug discovery follows the Design–Make–Test–Analyze (DMTA) cycle — a natural agent loop. Traditionally, each iteration takes weeks to months. Agentic AI aims to compress this dramatically.
ChemCrow (Bran, Schwaller et al., EPFL; Nature Machine Intelligence 2024) is the canonical early example. It integrates an LLM (GPT-4) with 18 expert-designed chemistry tools via LangChain:
Molecular property prediction (toxicity, solubility, drug-likeness)
Retrosynthesis planning via IBM's RXN4Chemistry API
Web and literature search for safety and prior art
Reaction execution planning
ChemCrow autonomously planned and executed syntheses of an insect repellent and three organocatalysts, and guided the discovery of a novel chromophore. The architecture is exactly LangChain tool calling — the same pattern students implemented.
Industry adoption is accelerating. AstraZeneca's ChatInvent system evolved from a single-agent proof of concept into a multi-agent architecture for molecular design and synthesis planning. The Tippy framework automates the full DMTA cycle with specialized agents for each phase. Companies including Eli Lilly, Bristol Myers Squibb, Takeda, and AbbVie are building AI-driven pipelines that increasingly resemble the agent architectures students know.
Protein engineering is following a parallel trajectory. The key systems:
AlphaFold 3 (Google DeepMind, 2024) predicts structures of entire biomolecular complexes — proteins with DNA, RNA, small molecules, and ions. It serves as a powerful tool that agents can call, not an agent itself.
Boltz-2 (Wohlwend et al., MIT + Recursion, June 2025; open-source, MIT license) is a biomolecular foundation model that jointly predicts molecular structure and binding affinity. It approaches the accuracy of physics-based free energy perturbation (FEP) calculations but runs up to 1,000× faster — fast enough to serve as an inner-loop tool call for an agent iterating on designs. Released with full code, weights, and training pipeline.
ProteinMCP (bioRxiv, March 2026) is an agentic framework for autonomous protein engineering that uses MCP — the same protocol students implemented — to unify 38 specialized bioinformatics tools (structure prediction, sequence design, docking, property evaluation) under a single LLM orchestrator. The agent interprets high-level scientific goals and autonomously composes multi-step workflows across these tools. A notable innovation: an automated pipeline that converts existing code repositories into MCP servers, allowing rapid integration of new tools as they appear. Applications span peptide design, antibody engineering, and drug discovery.
Autonomous enzyme engineering platforms (Nature Communications 2025) integrate machine learning with robotic biofoundry automation, eliminating human intervention from the design–build–test–learn cycle entirely. The agent selects mutations, the robot builds and tests them, and the agent analyzes results and plans the next round.
Emphasize the common architecture: an LLM-based planning agent orchestrates domain-specific computational tools (docking, folding, property prediction) and, increasingly, physical laboratory automation (liquid handlers, plate readers, robotic synthesis). The agent loop is the same; only the tools change.
The A-Lab (Lawrence Berkeley National Laboratory; Nature 2023) is an autonomous laboratory for solid-state synthesis of inorganic materials. The system uses computational predictions, literature data, machine learning, and active learning to plan experiments, which are then executed by robotic equipment. Over 17 days of continuous autonomous operation (355 experiments), the A-Lab successfully synthesized 41 of 58 target compounds (71% success rate), including a variety of oxides and phosphates identified from the Materials Project and Google DeepMind's GNoME predictions.
The A-Lab exemplifies the full agent loop with physical-world actions: the agent reasons about which compounds to attempt, plans synthesis conditions, controls robotic equipment to execute the synthesis, characterizes the products via X-ray diffraction, analyzes whether the target phase was obtained, and adjusts its strategy for the next attempt.
Radical AI (founded 2024 by Gerbrand Ceder, the A-Lab's principal investigator) is scaling this approach commercially, operating self-driving labs that can create and characterize over 25 alloys per day.
FORUM-AI (Foundation Models Orchestrating Reasoning Agents to Uncover Materials Advances and Insights), led by Berkeley Lab, is a multi-institutional project building the first full-stack agentic AI system for materials science — combining literature retrieval, large-scale simulation on supercomputers, and robotic experimentation under unified agent orchestration.
Beyond drug discovery, agentic AI is transforming synthetic chemistry more broadly. Retrosynthesis — working backward from a target molecule to identify a viable synthesis route — is a classic planning problem that maps naturally to agent architectures. The agent proposes a route, evaluates feasibility (checking reagent availability, reaction conditions, safety), and iterates. ChemCrow (discussed in §4.1) demonstrates this; more specialized systems like IBM's RXN for Chemistry provide the tool-call backend.

Climate science involves orchestrating large-scale simulations (global circulation models, ocean models, atmospheric chemistry) that run for hours to days on supercomputers. The agentic opportunity is in automating the experimental design loop: an agent proposes a simulation configuration, submits the job, monitors execution, analyzes output, and designs the next simulation to test a hypothesis. This is early-stage but represents a natural application of the A2A multi-agent pattern — one agent manages compute resources, another analyzes results, a third consults the literature for context.
Synthesize the physical science examples into an architectural pattern. A "self-driving laboratory" is:
Planning agent — reasons about what experiment to run next, informed by prior results and literature
Execution layer — robotic equipment or simulation infrastructure controlled via tool calls
Analysis agent — processes raw experimental data into structured results
Knowledge base — scientific literature (via Semantic Scholar / OpenScholar) plus the lab's own experimental history (a growing dataset the agent queries via RAG)
This four-component pattern recurs across every physical science domain. It is the scientific instantiation of the agent stack students have been building all semester.
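A skeleton of this four-component pattern can anchor the discussion. Every callable here is a hypothetical stand-in — a real system would wrap robots, simulators, and a literature index — but the division of responsibilities is the one just described:

```python
# Skeleton of the self-driving-lab pattern: planning agent, execution layer,
# analysis agent, knowledge base. All behavior is simulated for illustration.

class SelfDrivingLab:
    def __init__(self, knowledge_base):
        self.knowledge_base = knowledge_base  # literature + lab's own history
        self.history = []                     # growing experimental record

    def plan(self):
        """Planning agent: choose the next experiment given prior results."""
        tried = {h["conditions"] for h in self.history}
        for conditions in self.knowledge_base["candidate_conditions"]:
            if conditions not in tried:
                return conditions
        return None  # candidate list exhausted

    def execute(self, conditions):
        """Execution layer: robot / simulation controlled via tool calls."""
        return self.knowledge_base["simulated_outcomes"].get(conditions, 0.0)

    def analyze(self, conditions, raw_yield):
        """Analysis agent: turn raw output into a structured result."""
        record = {"conditions": conditions, "yield": raw_yield,
                  "success": raw_yield > 0.5}
        self.history.append(record)
        return record

    def run(self, budget=10):
        for _ in range(budget):
            conditions = self.plan()
            if conditions is None:
                break
            result = self.analyze(conditions, self.execute(conditions))
            if result["success"]:
                return result
        return None

kb = {
    "candidate_conditions": ["600C_air", "800C_argon", "900C_air"],
    "simulated_outcomes": {"600C_air": 0.1, "800C_argon": 0.8},
}
outcome = SelfDrivingLab(kb).run()
```

In the A-Lab, `plan` is informed by GNoME and Materials Project predictions, `execute` drives robotic furnaces, and `analyze` interprets X-ray diffraction — but the control flow is this loop.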
This section is the integrative core of the lecture. The paper "AI Agents Can Already Autonomously Perform Experimental High Energy Physics" (Moreno et al., MIT/CERN, March 2026, arXiv:2603.20179) is ideal because it demonstrates a complete agentic system that combines nearly every technique students have learned: autonomous code execution, literature-based RAG, multi-step reasoning, and multi-agent review — applied to real experimental data from three particle physics experiments.
The framework's deliberately provocative name — JFC, "Just Furnish Context" — captures its core thesis: given access to the right context (data, literature, execution tools), a capable agent can autonomously plan and execute a credible physics analysis without experiment-specific scaffolding.
Components:
Base agent: Claude Code, running as an autonomous coding agent with tool access. Students know this architecture from the Claude Code deep-dive module.
Literature RAG: A retrieval system over a corpus of prior experimental publications. When the agent encounters a decision point (what selection cuts to apply, what background estimation method to use), it queries the literature for how similar analyses have been done. This is the same RAG pattern students built, applied to physics papers instead of documentation.
Execution framework: The agent writes and runs analysis code — Python scripts for event selection, histogram fitting, statistical inference. The code execution sandbox is analogous to what students have seen in coding agent architectures.
Multi-agent review: After the analysis agent produces results, a separate reviewer agent inspects the methodology, checks for errors, and provides feedback. This is A2A-style multi-agent coordination applied to scientific peer review.
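The reviewer pattern deserves a concrete sketch. Here the two "agents" are plain functions standing in for separate LLM agents exchanging structured messages (the A2A pattern students implemented); the methodological issue — an estimator choice flagged by the reviewer — is an invented toy:

```python
# Sketch of the analysis-agent + reviewer-agent loop. In JFC these are
# separate LLM agents; here, stand-in functions show the control flow.

def analysis_agent(data, feedback=None):
    """Produce a result, incorporating reviewer feedback on later rounds."""
    n = len(data)
    mean = sum(data) / n
    if feedback == "use_unbiased_variance":
        var = sum((x - mean) ** 2 for x in data) / (n - 1)
        estimator = "unbiased"
    else:
        var = sum((x - mean) ** 2 for x in data) / n
        estimator = "biased"
    return {"mean": mean, "variance": var, "estimator": estimator, "n": n}

def reviewer_agent(result):
    """Inspect methodology; return (approved, feedback)."""
    if result["n"] < 30 and result["estimator"] == "biased":
        return False, "use_unbiased_variance"
    return True, None

def review_loop(data, max_rounds=3):
    feedback, result = None, None
    for _ in range(max_rounds):
        result = analysis_agent(data, feedback)
        approved, feedback = reviewer_agent(result)
        if approved:
            return result
    return result  # best effort after max_rounds

final = review_loop([1.0, 2.0, 3.0, 4.0])
```

The design point to draw out: review is a separate agent with its own pass over the work, not a self-check folded into the analysis prompt — the same separation human peer review relies on.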
The JFC framework was applied to open data from three particle physics experiments:
ALEPH (LEP collider at CERN) — electroweak measurements
DELPHI (LEP collider at CERN) — QCD measurements
CMS (LHC at CERN) — Higgs boson measurements
For each, the agent autonomously performed the complete analysis pipeline:
Event selection — choosing which collision events to analyze based on physics criteria (analogous to data preprocessing and filtering)
Background estimation — estimating contributions from non-signal processes (analogous to noise modeling)
Uncertainty quantification — assessing systematic and statistical uncertainties
Statistical inference — extracting physics parameters and evaluating significance
Paper drafting — writing up the analysis as a scientific publication
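The statistical-inference step can be grounded with a minimal counting-experiment example. The significance formula below is the standard asymptotic approximation for s expected signal events over b expected background; the numbers are illustrative only, not from any of the three analyses:

```python
import math

# Sketch of the statistical-inference step as a simple counting experiment:
# approximate discovery significance Z for signal s on top of background b
# (the standard asymptotic formula), vs. the naive excess-over-sqrt(b) estimate.

def discovery_significance(s, b):
    return math.sqrt(2 * ((s + b) * math.log(1 + s / b) - s))

def naive_excess(n_observed, b):
    """Naive alternative: observed excess in units of sqrt(b)."""
    return (n_observed - b) / math.sqrt(b)

# Toy numbers: 10 signal events over 100 background events.
s, b = 10, 100
z_asymptotic = discovery_significance(s, b)  # close to s/sqrt(b) when s << b
z_naive = naive_excess(b + s, b)
```

Choosing between such estimators, and propagating the systematic uncertainties into them, is precisely the kind of decision point where JFC's agent queries the literature RAG for how prior analyses proceeded.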
The JFC paper is best understood in contrast with HEPTAPOD (High Energy Physics Toolkit for Agentic Planning, Orchestration, and Deployment; Menzo et al., Fermilab, December 2025, arXiv:2512.15867), which takes a fundamentally different design approach to the same problem.
HEPTAPOD is built on the Orchestral AI engine (Roman & Roman, January 2026, arXiv:2601.02577; orchestral-ai.com), a lightweight Python framework for building LLM agents with automatic tool schema generation from Python type hints, MCP integration, workspace sandboxing, and human approval workflows. On top of Orchestral, HEPTAPOD provides:
Schema-validated tool registries — each HEP tool (particle data lookup, event generation, jet clustering, kinematic analysis) is registered with a typed schema, so the agent's tool calls are validated before execution
Run-card-driven configuration — simulation parameters are captured in versioned, machine-readable run cards that ensure reproducibility
Structured event format — a line-delimited JSON (evtjsonl) representation for intermediate event data, enabling transparent inspection of the pipeline at every stage
Explicit human checkpoints — the framework enforces human-in-the-loop approval at critical decision points
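The first of these — schema generation from type hints — is worth sketching, since it is the mechanism HEPTAPOD inherits from Orchestral. This simplified version is our own reconstruction, and `cluster_jets` is a hypothetical tool, but it shows how typed Python signatures become validated tool schemas:

```python
import inspect
from typing import get_type_hints

# Sketch of automatic tool-schema generation from Python type hints
# (a simplified reconstruction of the Orchestral-style mechanism).

PY_TO_JSON = {int: "integer", float: "number", str: "string", bool: "boolean"}

def tool_schema(fn):
    hints = get_type_hints(fn)
    sig = inspect.signature(fn)
    properties, required = {}, []
    for name, param in sig.parameters.items():
        properties[name] = {"type": PY_TO_JSON.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default value -> required argument
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip(),
        "parameters": {"type": "object", "properties": properties,
                       "required": required},
    }

def cluster_jets(event_file: str, radius: float = 0.4, min_pt: float = 20.0) -> list:
    """Cluster particles in an event file into jets."""
    ...

schema = tool_schema(cluster_jets)
```

Because the schema is derived from the signature, the agent's tool calls can be validated before execution — the reproducibility-first philosophy that distinguishes HEPTAPOD from JFC's minimal scaffolding.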
HEPTAPOD demonstrated a complete workflow for Beyond the Standard Model Monte Carlo signal validation, covering everything from symbolic model generation through event simulation to jet clustering and kinematic analysis.
The design tension: HEPTAPOD represents the structured orchestration philosophy — carefully engineered tool interfaces, schema validation, and reproducibility infrastructure. JFC represents the minimal scaffolding philosophy — provide context (data, literature, execution sandbox) and let the agent plan its own approach with no experiment-specific structure. The Moreno et al. paper explicitly argues that most proposed agentic workflows (like HEPTAPOD's) are "too narrowly scoped or scaffolded to specific analysis structures."
This is a genuine and productive disagreement in the field. The structured approach offers better reproducibility, auditability, and safety guarantees. The minimal approach offers more generality and may discover unconventional analysis strategies. In practice, production scientific agents will likely need elements of both.
Walk through each component and explicitly connect it to what students have built:
| JFC Component | Course Module | What Students Built |
|---|---|---|
| Claude Code as base agent | Coding Agents, Claude Code deep dive | Understood the master loop, tool ecosystem, sub-agents |
| Literature RAG over physics papers | RAG Systems | Built retrieval pipelines with ChromaDB + embeddings |
| Asta/Semantic Scholar queries | Remote Tool Use (MCP) | Called Asta MCP server, parsed nested JSON responses |
| Multi-agent review protocol | Agent-to-Agent (A2A) | Implemented A2A communication, trivia tournament |
| Autonomous code execution | Tool Calling, ReAct Agents | Built ReAct loops, tool-calling agents from scratch |
The point to emphasize: this is not some exotic new architecture. It is the standard agent stack — ReAct + tools + RAG + multi-agent — applied to a domain where the tools are physics analysis libraries and the knowledge base is experimental literature.
A critical detail for the challenges discussion: in experimental particle physics, the analysis is designed while "blind" to the actual signal region, to prevent unconscious bias from affecting analysis choices. The decision to "unblind" — to look at the actual data in the signal region — is a significant moment requiring human judgment about whether the analysis methodology is sound.
The authors explicitly note that the agent performs all analysis steps up to and including the full statistical inference pipeline, but the decision to trust the result — the unblinding — remains a human responsibility. This is a concrete example of the "human in the loop" principle applied to autonomous scientific agents.
The paper argues that the experimental HEP community is underestimating current agent capabilities, and that most proposed agentic workflows in physics are too narrowly scoped or too tightly scaffolded to specific analysis structures. The JFC framework deliberately avoids experiment-specific scaffolding — it provides context (data, literature, tools) and lets the agent plan its own approach.
Invite student discussion: Is this a strength (generality, robustness) or a risk (less control, harder to verify)?
When a coding agent hallucinates a function name, the code fails to compile. When a scientific agent hallucinates a physical constant, a citation, or a statistical result, the error may propagate undetected. OpenScholar's finding that GPT-4o hallucinates citations 78–90% of the time is alarming in a scientific context where citation accuracy is foundational.
How do you verify an agent's scientific claims? For code, you have test suites. For science:
Computational reproducibility — can the analysis code be re-run and produce the same results? (The JFC framework supports this by generating all analysis code.)
Methodological soundness — are the statistical methods appropriate? (The multi-agent review helps but is not infallible.)
Physical plausibility — do the results make sense given known physics? (Requires domain expertise the agent may not have.)
This is the deepest challenge: the whole point of autonomous agents is to go beyond human capacity, but verification still requires human experts.
Scientific results must be reproducible. An agent-driven analysis must log not just the final code and results but the entire reasoning trace — every literature query, every decision point, every alternative considered and rejected. This connects to the broader agent observability challenge.
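What such a reasoning trace might look like can be sketched briefly. The schema here is illustrative, not any standard, and the logged physics choices (a literature query, a background-estimation decision) are invented examples of the decision points described above:

```python
import hashlib
import json

# Sketch of an append-only decision log for an agent-driven analysis:
# every query, decision, and rejected alternative, content-hashed so the
# trace can be cited and audited. Schema and entries are illustrative.

class ReasoningTrace:
    def __init__(self):
        self.events = []

    def log(self, kind, **details):
        self.events.append({"step": len(self.events), "kind": kind, **details})

    def digest(self):
        """Content hash of the full trace, citable in the write-up."""
        payload = json.dumps(self.events, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]

trace = ReasoningTrace()
trace.log("literature_query", query="background estimation for dilepton channel")
trace.log("decision", choice="ABCD method", alternatives=["MC-only", "fit-based"])
trace.log("execution", script="select_events.py", exit_code=0)
```

Logging alternatives considered and rejected, not just actions taken, is what lets a later auditor reconstruct why the agent's analysis took the shape it did.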
Running agents costs money — API calls, compute for simulations, robotic lab time. The A-Lab's 17-day run was not free. Who gets access to autonomous scientific infrastructure? Does this concentrate scientific capability in well-funded labs?
Perhaps the deepest open question: can agents identify which questions are worth asking? Optimizing a known objective (better binding affinity, faster kernel) is well-suited to agents. Identifying a genuinely novel research direction — scientific taste — is harder.
AutoDiscovery (§3.5) represents the most principled attempt so far: using Bayesian surprise as an intrinsic motivation signal, with MCTS to explore the hypothesis space. The result that 67% of agent-surprising discoveries also surprised human experts is encouraging but also reveals the gap — a third of the time, the agent finds something "surprising" that experts consider unremarkable. The question of what makes a scientific question important (not just surprising) remains open. Surprise is necessary but not sufficient for taste.
Conclude with: the systems we've seen today are real and producing real results. The question is not whether agentic AI will transform science, but how the scientific community will adapt its institutions — peer review, reproducibility standards, credit assignment, training — to a world where agents are active participants in the research process.
Scientific research is inherently agentic: the scientific method maps directly onto the agent loop (plan → act → observe → revise) that students have built throughout this course
The scientific agent stack — tool calling for instruments and databases, MCP for domain APIs, RAG for literature, A2A for multi-agent collaboration — is the same architecture students know, applied to new domains
Ai2's open ecosystem (Semantic Scholar, OpenScholar, ScholarQA, Asta, AutoDiscovery) provides the knowledge infrastructure that scientific agents depend on, using the same RAG architecture students built from scratch
In biological sciences, agentic AI is compressing the drug discovery DMTA cycle and enabling autonomous protein engineering through frameworks like ChemCrow and ProteinMCP
In physical sciences, self-driving laboratories like the A-Lab demonstrate the full agent loop with physical-world actions: plan, synthesize, characterize, learn, repeat
The JFC framework (Moreno et al., 2026) shows all of these components working together in experimental particle physics — Claude Code + literature RAG + autonomous execution + multi-agent review — applied to real data from ALEPH, DELPHI, and CMS
Critical challenges remain: hallucination in high-stakes domains, the verification problem, reproducibility of agent-driven research, cost and access, and the open question of scientific taste — where AutoDiscovery's Bayesian surprise approach is a promising but incomplete first step
Moreno, E.A., et al. (2026). AI Agents Can Already Autonomously Perform Experimental High Energy Physics. MIT / CERN. arXiv:2603.20179.
Bran, A.M., et al. (2024). Augmenting large language models with chemistry tools. Nature Machine Intelligence, 6, 525–535. (ChemCrow)
Szymanski, N.J., et al. (2023). An autonomous laboratory for the accelerated synthesis of novel materials. Nature, 624, 86–91. (A-Lab)
Asai, A., et al. (2026). Synthesizing scientific literature with retrieval-augmented language models. Nature. (OpenScholar)
Jumper, J., et al. (2024). AlphaFold 3: Accurate structure prediction of biomolecular interactions. Google DeepMind.
Wohlwend, J., et al. (2025). Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction. MIT / Recursion. bioRxiv:2025.06.14.659707.
ProteinMCP (2026). An Agentic AI Framework for Autonomous Protein Engineering. bioRxiv:2026.03.11.711149.
Agarwal, D. & Majumder, B.P. (2025). AutoDiscovery: Open-ended Scientific Discovery via Bayesian Surprise. NeurIPS 2025. UMass Amherst / Ai2. arXiv:2507.00310.
Menzo, T., et al. (2025). HEPTAPOD: Orchestrating High Energy Physics Workflows Towards Autonomous Agency. Fermilab. arXiv:2512.15867.
Roman, A. & Roman, J. (2026). Orchestral AI: A Framework for Agent Orchestration. arXiv:2601.02577.