Lesson Plan: Evolutionary Code Generation — From AlphaEvolve to AVO to EvoSkill

Module: Coding Agents
Duration: ~20 minutes
Format: Lecture (no exercise)


Learning Goals

By the end of this segment, students should be able to:

  - Explain evolutionary search and how an LLM can serve as its variation operator
  - Describe AlphaEvolve's architecture and headline results
  - Explain how AVO replaces the single-shot LLM call with a full coding-agent loop
  - Contrast code evolution (AlphaEvolve, AVO) with skill evolution (EvoSkill)


Prerequisites

Students have covered:

  - Building a ReAct-style agent loop by hand (plan → act → observe → repeat)
  - The Agent Skills / SKILL.md format used throughout this course


1. Background: Evolutionary Search (3 min)

Evolutionary search is one of the oldest ideas in AI: keep a population of candidate solutions, generate new ones by mutating or crossing existing ones, evaluate each candidate, and retain the best. The loop is:

  1. Sample candidates from the population
  2. Vary them to produce new candidates
  3. Evaluate each new candidate with a scoring function
  4. Select which candidates survive into the population

The variation operator — Vary — is the mechanism that produces new candidates from existing ones. In classical evolutionary algorithms, this was hand-coded: bit-flip mutations, crossover of chromosomes, grammar-guided edits.

The natural question once LLMs arrived: what if the LLM is the variation operator?

FunSearch (Google DeepMind, Nature 2023) was the first major demonstration. An LLM generates Python function variants; an evaluator scores them; a database of past solutions guides the next prompt. It discovered new constructions for open problems in combinatorics, most notably the cap set problem.

AlphaEvolve (Google DeepMind, 2025) was the major step up from FunSearch. We spend most of today there.


2. AlphaEvolve (10 min)

2.1 What AlphaEvolve Is

AlphaEvolve is an evolutionary coding agent that uses LLMs as variation operators to discover and improve algorithms. The human defines the "what": evaluation criteria, an initial program, optional background knowledge. AlphaEvolve figures out the "how": autonomously improving the program through evolutionary iteration.

Figure: AlphaEvolve architecture — the distributed controller loop, prompt sampler, LLM ensemble, evaluator pool, and program database.

Key components:

  - Prompt sampler — assembles rich prompts from prior solutions, the problem description, and evaluation feedback
  - LLM ensemble — frontier models (Gemini 2.0) that propose code diffs
  - Evaluator pool — runs the user-supplied automated evaluation function on each candidate
  - Program database — stores scored candidates and drives selection
  - Distributed controller — orchestrates the asynchronous loop across all components

The core loop in pseudocode:
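A minimal sketch of that loop — function names and the sampling policy here are illustrative, not AlphaEvolve's actual API; the real system distributes evaluation and uses richer selection, all collapsed into one synchronous loop below:

```python
import random

def evolve(initial_program, evaluate, llm_propose_diff, apply_diff, steps=1000):
    # Program database: (score, program) pairs for every candidate seen
    database = [(evaluate(initial_program), initial_program)]
    for _ in range(steps):
        # Prompt sampler: pick a parent, biased toward high-scoring candidates
        parent = max(random.sample(database, min(3, len(database))))[1]
        # LLM as variation operator: propose a targeted diff, then apply it
        diff = llm_propose_diff(parent, database)
        child = apply_diff(parent, diff)
        # Evaluator: score the child and add it to the database
        database.append((evaluate(child), child))
    return max(database)[1]  # best-scoring program found
```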

2.2 What Makes AlphaEvolve Different from FunSearch

|                     | FunSearch            | AlphaEvolve                                                      |
|---------------------|----------------------|------------------------------------------------------------------|
| Evolves             | Single function      | Entire codebase                                                  |
| Code volume         | 10–20 lines          | Hundreds of lines                                                |
| Language            | Python only          | Any language                                                     |
| Evaluation time     | ≤ 20 min on 1 CPU    | Hours, on GPU/TPU accelerators                                   |
| LLM size            | Small, code-only     | Frontier LLMs (Gemini 2.0)                                       |
| Context             | Prior solutions only | Rich: prior solutions + problem description + evaluation feedback |
| Optimization target | Single metric        | Multiple simultaneous metrics                                    |

The key engineering choices: (1) diff-based generation rather than full rewrites enables targeted edits to large codebases; (2) asynchronous distributed evaluation means the loop never idles waiting for a single slow evaluator; (3) meta-prompt evolution lets the LLM improve the prompts themselves.
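To make (1) concrete, here is a toy applier for a single search/replace block. The marker strings are an assumption modeled on common LLM diff formats, not necessarily AlphaEvolve's exact syntax:

```python
def apply_diff(source: str, diff: str) -> str:
    """Apply one SEARCH/REPLACE block to source code."""
    # Split the diff into its SEARCH and REPLACE halves
    head, _, rest = diff.partition("=======\n")
    search = head.split("<<<<<<< SEARCH\n", 1)[1]
    replace = rest.split(">>>>>>> REPLACE", 1)[0]
    if search not in source:
        raise ValueError("SEARCH block not found in source")
    # Replace only the first occurrence, leaving the rest of the file intact
    return source.replace(search, replace, 1)
```

Because only the matched span changes, the LLM can make a targeted edit to a hundreds-of-lines codebase while emitting just a few lines of output.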

2.3 What AlphaEvolve Achieved

AlphaEvolve was applied to real, hard engineering and math problems at Google:

Matrix multiplication: Discovered rank-48 algorithm for 4×4 complex matrix multiplication — the first improvement over Strassen's 1969 algorithm in 56 years. Also improved the state of the art for 14 other matrix multiplication sizes.

Mathematical discoveries: Applied to 50+ open problems in analysis, geometry, and combinatorics. Matched the best known result in ~75% of cases; discovered new, provably better constructions in ~20%.

Google data center scheduling: Evolved a simple, interpretable heuristic function for Borg (Google's cluster manager) that recovers ~0.7% of fleet-wide compute from stranded resources. Deployed to production. Chosen over deep RL specifically because the code solution is interpretable.

Gemini kernel engineering: Discovered a tiling heuristic for a key matrix multiplication kernel that yields 23% kernel speedup and 1% reduction in Gemini training time. Reduced optimization time from months to days.

FlashAttention optimization: Applied to XLA-generated IR (compiler output) for a FlashAttention kernel — a notoriously difficult format not meant for human editing. Sped up the FlashAttention kernel by 32% and pre/postprocessing by 15%.

2.4 The Core Limitation of AlphaEvolve

Despite these results, the LLM in AlphaEvolve is still confined. Its role within each evolutionary step is a single forward pass: receive an assembled prompt, emit a diff.

The LLM sees the prompt once, generates a diff, and is done. It cannot:

  - compile or run the code it just proposed
  - observe compiler errors, test failures, or profiler output for its own diff
  - iterate on a failed attempt within the same step
  - consult documentation or invoke any external tool

For domains where further improvement requires deep, iterative engineering — like squeezing the last few percent out of a highly optimized GPU kernel — this single-shot constraint is the bottleneck.


3. AVO: Agentic Variation Operators (5 min)

3.1 The Key Shift

AVO (NVIDIA, March 2026) makes a conceptually simple but powerful change: replace the entire Vary function with a full coding agent loop.

Figure: EVO (classical evolutionary search with a single-turn LLM) vs. AVO (an autonomous agent loop with tools, memory, and access to previous solutions).

In classical evolutionary frameworks, including AlphaEvolve, the variation step is a single model call:

    x_{t+1} = LLM(prompt(P_t))

In AVO, it is an entire agent trajectory:

    x_{t+1} = Agent(P_t, K, f)

where P_t is the full solution lineage, K is a domain-specific knowledge base, and f is the scoring function. The agent is not given a single prompt and asked for a single diff — it runs an autonomous loop that may span many internal actions: reading documentation, writing code, compiling, running benchmarks, analyzing profiler output, diagnosing failures, trying again.

The agent within a single variation step can:

  - read documentation and the knowledge base K
  - write, compile, and run code
  - benchmark and profile its own candidates
  - diagnose failures and try again — many internal iterations before committing one new candidate

3.2 What AVO Achieved

AVO was applied to attention kernels on NVIDIA's Blackwell (B200) GPU — one of the most aggressively hand-optimized targets in AI, where both cuDNN and FlashAttention-4 represent months of expert engineering.

Over 7 days of continuous autonomous evolution (no human intervention), AVO iterated the kernel through dozens of versions, each building on the last.

The agent-discovered optimizations are not superficial. Three representative examples:

  1. Branchless accumulator rescaling (v19→v20, +8.1% non-causal): The agent identified that a conditional branch on every key-block iteration introduced warp synchronization overhead. It replaced it with a branchless speculative path (always compute, predicate-select 1.0 when unnecessary), which eliminated warp divergence and allowed replacing a blocking memory fence with a lighter non-blocking one.

  2. Correction/MMA pipeline overlap (v29→v30, +1.1%): The agent identified that the correction warp sat idle during the second PV GEMM in the dual-Q-stage pipeline. It restructured execution so correction of the first stage overlaps with the second GEMM.

  3. Register rebalancing (v32→v33, +2.1% non-causal): Profiling revealed the correction warp spilled to local memory under its 80-register budget while softmax warps had headroom. The agent redistributed 8 registers per group (184/88/56), reducing spill stalls.

Each optimization requires jointly reasoning about synchronization, pipeline scheduling, and register allocation — not tuning one parameter in isolation. After 7 days of MHA evolution, adapting the kernel to grouped-query attention (GQA) took 30 minutes of autonomous adaptation.
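The branchless-rescaling idea from example 1 above can be illustrated in plain Python — always compute, then select arithmetically, so there is no data-dependent branch (the real kernel does this with predicated CUDA instructions, not Python arithmetic):

```python
def rescale_branchless(acc: float, needs_rescale: int, scale: float) -> float:
    # Always compute the factor; arithmetically select 1.0 when no rescale
    # is needed (needs_rescale is 0 or 1), avoiding a divergent branch.
    factor = needs_rescale * scale + (1 - needs_rescale) * 1.0
    return acc * factor
```

Every thread does the same work regardless of the predicate, which is exactly what removes the warp divergence described above.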

3.3 Connecting Back to What Students Know

AVO is the ReAct loop students have already built, applied as the mutation operator inside an evolutionary outer loop. The agent's internal loop (plan → act → observe → repeat) is exactly what students implemented manually. AVO wraps that loop in an evolutionary framework that persists successful candidates as git commits and provides a supervisor agent to detect stagnation.
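Schematically — all names below are hypothetical stand-ins, and agent_step abstracts the plan → act → observe tool loop:

```python
def agentic_vary(parent, knowledge, score_fn, agent_step, max_actions=50):
    """AVO-style sketch: a whole agent loop serves as one variation step."""
    candidate, best = parent, parent
    for _ in range(max_actions):
        # Inside one variation step the agent may edit, compile, benchmark,
        # and diagnose failures; agent_step stands in for that tool loop.
        candidate, done = agent_step(candidate, knowledge)
        if score_fn(candidate) > score_fn(best):
            best = candidate  # keep the best candidate seen so far
        if done:
            break
    return best  # only this final result re-enters the population
```

The evolutionary outer loop never sees the intermediate actions — it only receives the final candidate, exactly as it would from a single LLM call.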


4. EvoSkill: Evolving Skills, Not Code (2 min)

AlphaEvolve and AVO both evolve code — the low-level artifact. EvoSkill (Alzubi et al., Virginia Tech / Sentient, March 2026) applies the same evolutionary philosophy one level of abstraction higher: it evolves agent skills.

Figure: EvoSkill loop — the base agent fails, a Proposer analyzes the failure and proposes a skill, a Skill Builder materializes it, and the new skill improves performance on the next iteration.

4.1 What Are Agent Skills?

Skills (in the Claude Code / Agent Skills spec sense) are filesystem directories containing a SKILL.md with procedural instructions and optional helper scripts. They augment a coding agent with reusable, domain-specific capabilities without touching the underlying model. Notably, this is exactly the SKILL.md format students have been using in this course.
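A minimal skill might look like this — the skill name and instructions are invented for illustration; only the format (YAML frontmatter plus markdown instructions) is the one students have used:

```markdown
---
name: search-persistence
description: Keep reformulating searches instead of giving up after an empty result.
---

# Search Persistence

When a search returns no useful results:

1. Reformulate the query with synonyms or broader terms.
2. Try up to three reformulations before reporting failure.
3. Record which reformulation worked, for future queries.
```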

4.2 The EvoSkill Loop

EvoSkill's loop has three agents:

  - Base agent — attempts the task with the current set of skills
  - Proposer — analyzes the base agent's failures and proposes a new skill to address them
  - Skill Builder — materializes the proposal as a SKILL.md plus any helper scripts

The Pareto frontier retains the best programs (sets of skills). A new skill is only kept if it improves held-out validation performance.
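That acceptance rule can be sketched as follows — function names are hypothetical, and validate stands for scoring a skill set on the held-out split:

```python
def accept_skill(skill_set, new_skill, validate):
    """Keep a proposed skill only if held-out validation improves."""
    baseline = validate(skill_set)
    candidate = skill_set + [new_skill]
    # Reject skills that merely overfit the training failures they were
    # proposed from: they must also help on unseen validation examples.
    return candidate if validate(candidate) > baseline else skill_set
```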

4.3 What EvoSkill Achieved

On OfficeQA (grounded reasoning over ~89,000 pages of U.S. Treasury documents): +7.3% exact-match accuracy (60.6% → 67.9%) using only ~12–24 training examples. On SealQA (search-augmented QA with noisy retrieval): +12.1% (26.6% → 38.7%).

Critically, a search-persistence-protocol skill evolved on SealQA transferred zero-shot to BrowseComp (+5.3%) — a completely different benchmark. Evolved skills capture general capabilities rather than task-specific heuristics, precisely because they are written in natural language with explicit trigger conditions rather than baked into model weights or task-specific prompts.

4.4 The Key Distinction from AlphaEvolve and AVO

| System      | What evolves                     | Evaluation signal             | Transfer                                       |
|-------------|----------------------------------|-------------------------------|------------------------------------------------|
| AlphaEvolve | Code (diffs to a codebase)       | Automated scorer              | Hard — code is task-specific                   |
| AVO         | Code (CUDA kernels)              | TFLOPS benchmark              | Moderate — requires re-adaptation              |
| EvoSkill    | Agent skills (SKILL.md + scripts) | Task accuracy on held-out set | Strong — skills are interpretable and portable |

The tradeoff: EvoSkill operates at higher abstraction, which gives transferability but limits how low-level the optimization can go. For GPU kernel throughput, you need AVO. For augmenting a general-purpose coding agent's domain reasoning, you need EvoSkill.


5. The Arc: A Pattern Worth Remembering (1 min)

All three systems share a common structure:

  1. Population of solutions — database of scored candidates

  2. Variation — something generates a new candidate from existing ones

  3. Evaluation — automated scorer provides feedback signal

  4. Selection — retain the better candidates

What has changed across the three papers is what the variation step is allowed to do:

  - AlphaEvolve: a single LLM call that emits a diff
  - AVO: a full agent loop with tools, compilation, benchmarks, and retries
  - EvoSkill: a multi-agent loop that writes and validates reusable skills

The unifying insight: the more autonomy and tool access you give the variation operator, the deeper the optimizations it can discover — at the cost of more compute per variation step.


Key Takeaways

  - All three systems share the same skeleton: population, variation, evaluation, selection.
  - AlphaEvolve showed that even a single-shot LLM variation operator, run at scale against automated evaluators, can beat decades-old algorithms and ship to production.
  - AVO deepens the variation step into a full agent loop, unlocking optimizations that require iterative engineering, such as hand-tuned GPU kernels.
  - EvoSkill applies the same loop one abstraction level up, evolving natural-language skills that transfer across tasks.
  - The recurring tradeoff: more autonomy per variation step buys deeper optimizations at the cost of more compute per step.


Citations