Lesson Plan: Evolutionary Code Generation — From AlphaEvolve to AVO to EvoSkill

Module: Coding Agents
Duration: ~20 minutes
Format: Lecture (no exercise)


Learning Goals

By the end of this segment, students should be able to:

  - Explain evolutionary search and how an LLM can serve as its variation operator
  - Describe AlphaEvolve's architecture and headline results
  - Explain how AVO replaces the single-shot LLM call with a full coding-agent loop
  - Contrast code evolution (AlphaEvolve, AVO) with skill evolution (EvoSkill)


Prerequisites

Students have covered:

  - Building a ReAct-style agent loop by hand (plan → act → observe → repeat)
  - The Agent Skills / SKILL.md format used throughout this course


1. Background: Evolutionary Search (3 min)

Evolutionary search is one of the oldest ideas in AI: keep a population of candidate solutions, generate new ones by mutating or crossing existing ones, evaluate each candidate, and retain the best. The loop is:

  1. Sample candidates from the population
  2. Vary them to produce new candidates
  3. Evaluate each new candidate with a scoring function
  4. Select which candidates survive into the population

The variation operator — Vary — is the mechanism that produces new candidates from existing ones. In classical evolutionary algorithms, this was hand-coded: bit-flip mutations, crossover of chromosomes, grammar-guided edits.

The natural question once LLMs arrived: what if the LLM is the variation operator?

FunSearch (Google DeepMind, Nature 2023) was the first major demonstration. An LLM generates Python function variants; an evaluator scores them; a database of past solutions guides the next prompt. It discovered new constructions for open problems in combinatorics, most notably the cap set problem.

AlphaEvolve (Google DeepMind, 2025) was the major step up from FunSearch. We spend most of today there.


2. AlphaEvolve (10 min)

2.1 What AlphaEvolve Is

AlphaEvolve is an evolutionary coding agent that uses LLMs as variation operators to discover and improve algorithms. The human defines the "what": evaluation criteria, an initial program, optional background knowledge. AlphaEvolve figures out the "how": autonomously improving the program through evolutionary iteration.

Figure: AlphaEvolve architecture — the distributed controller loop, prompt sampler, LLM ensemble, evaluator pool, and program database.

Key components:

  - Prompt sampler — assembles rich prompts from prior solutions, the problem description, and evaluation feedback
  - LLM ensemble — frontier models (Gemini 2.0) that propose code diffs
  - Evaluator pool — runs the user-supplied automated evaluation function on each candidate
  - Program database — stores scored candidates and drives selection
  - Distributed controller — orchestrates the asynchronous loop across all components

The core loop in pseudocode:
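A minimal sketch of that loop — function names and the sampling policy here are illustrative, not AlphaEvolve's actual API; the real system distributes evaluation and uses richer selection, all collapsed into one synchronous loop below:

```python
import random

def evolve(initial_program, evaluate, llm_propose_diff, apply_diff, steps=1000):
    # Program database: (score, program) pairs for every candidate seen
    database = [(evaluate(initial_program), initial_program)]
    for _ in range(steps):
        # Prompt sampler: pick a parent, biased toward high-scoring candidates
        parent = max(random.sample(database, min(3, len(database))))[1]
        # LLM as variation operator: propose a targeted diff, then apply it
        diff = llm_propose_diff(parent, database)
        child = apply_diff(parent, diff)
        # Evaluator: score the child and add it to the database
        database.append((evaluate(child), child))
    return max(database)[1]  # best-scoring program found
```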

2.2 What Makes AlphaEvolve Different from FunSearch

|                     | FunSearch            | AlphaEvolve                                                      |
|---------------------|----------------------|------------------------------------------------------------------|
| Evolves             | Single function      | Entire codebase                                                  |
| Code volume         | 10–20 lines          | Hundreds of lines                                                |
| Language            | Python only          | Any language                                                     |
| Evaluation time     | ≤ 20 min on 1 CPU    | Hours, on GPU/TPU accelerators                                   |
| LLM size            | Small, code-only     | Frontier LLMs (Gemini 2.0)                                       |
| Context             | Prior solutions only | Rich: prior solutions + problem description + evaluation feedback |
| Optimization target | Single metric        | Multiple simultaneous metrics                                    |

The key engineering choices: (1) diff-based generation rather than full rewrites enables targeted edits to large codebases; (2) asynchronous distributed evaluation means the loop never idles waiting for a single slow evaluator; (3) meta-prompt evolution lets the LLM improve the prompts themselves.
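To make (1) concrete, here is a toy applier for a single search/replace block. The marker strings are an assumption modeled on common LLM diff formats, not necessarily AlphaEvolve's exact syntax:

```python
def apply_diff(source: str, diff: str) -> str:
    """Apply one SEARCH/REPLACE block to source code."""
    # Split the diff into its SEARCH and REPLACE halves
    head, _, rest = diff.partition("=======\n")
    search = head.split("<<<<<<< SEARCH\n", 1)[1]
    replace = rest.split(">>>>>>> REPLACE", 1)[0]
    if search not in source:
        raise ValueError("SEARCH block not found in source")
    # Replace only the first occurrence, leaving the rest of the file intact
    return source.replace(search, replace, 1)
```

Because only the matched span changes, the LLM can make a targeted edit to a hundreds-of-lines codebase while emitting just a few lines of output.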

2.3 What AlphaEvolve Achieved

AlphaEvolve was applied to real, hard engineering and math problems at Google:

Matrix multiplication: Discovered rank-48 algorithm for 4×4 complex matrix multiplication — the first improvement over Strassen's 1969 algorithm in 56 years. Also improved the state of the art for 14 other matrix multiplication sizes.

Mathematical discoveries: Applied to 50+ open problems in analysis, geometry, and combinatorics. Matched the best known result in ~75% of cases; discovered new, provably better constructions in ~20%.

Google data center scheduling: Evolved a simple, interpretable heuristic function for Borg (Google's cluster manager) that recovers ~0.7% of fleet-wide compute from stranded resources. Deployed to production. Chosen over deep RL specifically because the code solution is interpretable.

Gemini kernel engineering: Discovered a tiling heuristic for a key matrix multiplication kernel that yields 23% kernel speedup and 1% reduction in Gemini training time. Reduced optimization time from months to days.

FlashAttention optimization: Applied to XLA-generated IR (compiler output) for a FlashAttention kernel — a notoriously difficult format not meant for human editing. Sped up the FlashAttention kernel by 32% and pre/postprocessing by 15%.

2.4 The Core Limitation of AlphaEvolve

Despite these results, the LLM in AlphaEvolve is still confined. Its role within each evolutionary step is a single forward pass: receive an assembled prompt, emit a diff.

The LLM sees the prompt once, generates a diff, and is done. It cannot:

  - compile or run the code it just proposed
  - observe compiler errors, test failures, or profiler output for its own diff
  - iterate on a failed attempt within the same step
  - consult documentation or invoke any external tool

For domains where further improvement requires deep, iterative engineering — like squeezing the last few percent out of a highly optimized GPU kernel — this single-shot constraint is the bottleneck.


3. AVO: Agentic Variation Operators (5 min)

3.1 The Key Shift

AVO (NVIDIA, March 2026) makes a conceptually simple but powerful change: replace the entire Vary function with a full coding agent loop.

Figure: EVO (classical evolutionary search with a single-turn LLM) vs. AVO (an autonomous agent loop with tools, memory, and access to previous solutions).

In classical evolutionary frameworks, including AlphaEvolve, the variation step is a single model call:

    x_{t+1} = LLM(prompt(P_t))

In AVO, it is an entire agent trajectory:

    x_{t+1} = Agent(P_t, K, f)

where P_t is the full solution lineage, K is a domain-specific knowledge base, and f is the scoring function. The agent is not given a single prompt and asked for a single diff — it runs an autonomous loop that may span many internal actions: reading documentation, writing code, compiling, running benchmarks, analyzing profiler output, diagnosing failures, trying again.

The agent within a single variation step can:

  - read documentation and the knowledge base K
  - write, compile, and run code
  - benchmark and profile its own candidates
  - diagnose failures and try again — many internal iterations before committing one new candidate

3.2 What AVO Achieved

AVO was applied to attention kernels on NVIDIA's Blackwell (B200) GPU — one of the most aggressively hand-optimized targets in AI, where both cuDNN and FlashAttention-4 represent months of expert engineering.

Over 7 days of continuous autonomous evolution (no human intervention), AVO iterated the kernel through dozens of versions, each building on the last.

The agent-discovered optimizations are not superficial. Three representative examples:

  1. Branchless accumulator rescaling (v19→v20, +8.1% non-causal): The agent identified that a conditional branch on every key-block iteration introduced warp synchronization overhead. It replaced it with a branchless speculative path (always compute, predicate-select 1.0 when unnecessary), which eliminated warp divergence and allowed replacing a blocking memory fence with a lighter non-blocking one.

  2. Correction/MMA pipeline overlap (v29→v30, +1.1%): The agent identified that the correction warp sat idle during the second PV GEMM in the dual-Q-stage pipeline. It restructured execution so correction of the first stage overlaps with the second GEMM.

  3. Register rebalancing (v32→v33, +2.1% non-causal): Profiling revealed the correction warp spilled to local memory under its 80-register budget while softmax warps had headroom. The agent redistributed 8 registers per group (184/88/56), reducing spill stalls.

Each optimization requires jointly reasoning about synchronization, pipeline scheduling, and register allocation — not tuning one parameter in isolation. After 7 days of MHA evolution, adapting the kernel to grouped-query attention (GQA) took 30 minutes of autonomous adaptation.
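The branchless-rescaling idea from example 1 above can be illustrated in plain Python — always compute, then select arithmetically, so there is no data-dependent branch (the real kernel does this with predicated CUDA instructions, not Python arithmetic):

```python
def rescale_branchless(acc: float, needs_rescale: int, scale: float) -> float:
    # Always compute the factor; arithmetically select 1.0 when no rescale
    # is needed (needs_rescale is 0 or 1), avoiding a divergent branch.
    factor = needs_rescale * scale + (1 - needs_rescale) * 1.0
    return acc * factor
```

Every thread does the same work regardless of the predicate, which is exactly what removes the warp divergence described above.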

3.3 Connecting Back to What Students Know

AVO is the ReAct loop students have already built, applied as the mutation operator inside an evolutionary outer loop. The agent's internal loop (plan → act → observe → repeat) is exactly what students implemented manually. AVO wraps that loop in an evolutionary framework that persists successful candidates as git commits and provides a supervisor agent to detect stagnation.
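Schematically — all names below are hypothetical stand-ins, and agent_step abstracts the plan → act → observe tool loop:

```python
def agentic_vary(parent, knowledge, score_fn, agent_step, max_actions=50):
    """AVO-style sketch: a whole agent loop serves as one variation step."""
    candidate, best = parent, parent
    for _ in range(max_actions):
        # Inside one variation step the agent may edit, compile, benchmark,
        # and diagnose failures; agent_step stands in for that tool loop.
        candidate, done = agent_step(candidate, knowledge)
        if score_fn(candidate) > score_fn(best):
            best = candidate  # keep the best candidate seen so far
        if done:
            break
    return best  # only this final result re-enters the population
```

The evolutionary outer loop never sees the intermediate actions — it only receives the final candidate, exactly as it would from a single LLM call.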


4. EvoSkill: Evolving Skills, Not Code (2 min)

AlphaEvolve and AVO both evolve code — the low-level artifact. EvoSkill (Alzubi et al., Virginia Tech / Sentient, March 2026) applies the same evolutionary philosophy one level of abstraction higher: it evolves agent skills.

Figure: EvoSkill loop — the base agent fails, a Proposer analyzes the failure and proposes a skill, a Skill Builder materializes it, and the new skill improves performance on the next iteration.

4.1 What Are Agent Skills?

Skills (in the Claude Code / Agent Skills spec sense) are filesystem directories containing a SKILL.md with procedural instructions and optional helper scripts. They augment a coding agent with reusable, domain-specific capabilities without touching the underlying model. Notably, this is exactly the SKILL.md format students have been using in this course.
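A minimal skill might look like this — the skill name and instructions are invented for illustration; only the format (YAML frontmatter plus markdown instructions) is the one students have used:

```markdown
---
name: search-persistence
description: Keep reformulating searches instead of giving up after an empty result.
---

# Search Persistence

When a search returns no useful results:

1. Reformulate the query with synonyms or broader terms.
2. Try up to three reformulations before reporting failure.
3. Record which reformulation worked, for future queries.
```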

4.2 The EvoSkill Loop

EvoSkill's loop has three agents:

  - Base agent — attempts the task with the current set of skills
  - Proposer — analyzes the base agent's failures and proposes a new skill to address them
  - Skill Builder — materializes the proposal as a SKILL.md plus any helper scripts

The Pareto frontier retains the best programs (sets of skills). A new skill is only kept if it improves held-out validation performance.
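That acceptance rule can be sketched as follows — function names are hypothetical, and validate stands for scoring a skill set on the held-out split:

```python
def accept_skill(skill_set, new_skill, validate):
    """Keep a proposed skill only if held-out validation improves."""
    baseline = validate(skill_set)
    candidate = skill_set + [new_skill]
    # Reject skills that merely overfit the training failures they were
    # proposed from: they must also help on unseen validation examples.
    return candidate if validate(candidate) > baseline else skill_set
```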

4.3 What EvoSkill Achieved

On OfficeQA (grounded reasoning over ~89,000 pages of U.S. Treasury documents): +7.3% exact-match accuracy (60.6% → 67.9%) using only ~12–24 training examples. On SealQA (search-augmented QA with noisy retrieval): +12.1% (26.6% → 38.7%).

Critically, a search-persistence-protocol skill evolved on SealQA transferred zero-shot to BrowseComp (+5.3%) — a completely different benchmark. Evolved skills capture general capabilities rather than task-specific heuristics, precisely because they are written in natural language with explicit trigger conditions rather than baked into model weights or task-specific prompts.

4.4 The Key Distinction from AlphaEvolve and AVO

| System      | What evolves                     | Evaluation signal             | Transfer                                       |
|-------------|----------------------------------|-------------------------------|------------------------------------------------|
| AlphaEvolve | Code (diffs to a codebase)       | Automated scorer              | Hard — code is task-specific                   |
| AVO         | Code (CUDA kernels)              | TFLOPS benchmark              | Moderate — requires re-adaptation              |
| EvoSkill    | Agent skills (SKILL.md + scripts) | Task accuracy on held-out set | Strong — skills are interpretable and portable |

The tradeoff: EvoSkill operates at higher abstraction, which gives transferability but limits how low-level the optimization can go. For GPU kernel throughput, you need AVO. For augmenting a general-purpose coding agent's domain reasoning, you need EvoSkill.


5. The Arc: A Pattern Worth Remembering (1 min)

All three systems share a common structure:

  1. Population of solutions — database of scored candidates

  2. Variation — something generates a new candidate from existing ones

  3. Evaluation — automated scorer provides feedback signal

  4. Selection — retain the better candidates

What has changed across the three papers is what the variation step is allowed to do:

  - AlphaEvolve: a single LLM call that emits a diff
  - AVO: a full agent loop with tools, compilation, benchmarks, and retries
  - EvoSkill: a multi-agent loop that writes and validates reusable skills

The unifying insight: the more autonomy and tool access you give the variation operator, the deeper the optimizations it can discover — at the cost of more compute per variation step.


Key Takeaways

  - All three systems share the same skeleton: population, variation, evaluation, selection.
  - AlphaEvolve showed that even a single-shot LLM variation operator, run at scale against automated evaluators, can beat decades-old algorithms and ship to production.
  - AVO deepens the variation step into a full agent loop, unlocking optimizations that require iterative engineering, such as hand-tuned GPU kernels.
  - EvoSkill applies the same loop one abstraction level up, evolving natural-language skills that transfer across tasks.
  - The recurring tradeoff: more autonomy per variation step buys deeper optimizations at the cost of more compute per step.


Citations