Module: Coding Agents
Segment: Historical Foundations
Duration: ~10 minutes
Format: Lecture (no exercise)
By the end of this segment, students should be able to:
Trace the arc from statistical language models to modern coding agents
Understand why the transformer architecture was a qualitative inflection point for code generation
Recognize the key empirical milestones that built community confidence that LLMs could reason about code
Key idea: Code is text, and text has statistical regularities — so models trained on code can exploit those regularities to predict what comes next.
Abram Hindle and colleagues at the University of Alberta published "On the Naturalness of Software" (ICSE 2012), asking a deceptively simple question: is source code more predictable than natural language?
They trained n-gram language models (the same family used in speech recognition and machine translation at the time) on large Java and C codebases.
Finding: code is more repetitive and predictable than English prose. Cross-entropy of code under an n-gram LM was lower than that of natural language corpora — meaning the model was surprised less often by what came next.
Implication: statistical models had genuine signal to exploit. This was the empirical license to start applying NLP machinery to code.
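To make the cross-entropy comparison concrete, here is a toy sketch. This is not Hindle et al.'s actual setup (they trained production n-gram toolkits on large Java and C corpora); it uses whitespace tokenization and add-one smoothing on two tiny example strings, purely to illustrate the quantity being measured:

```python
from collections import Counter, defaultdict
import math

def ngram_cross_entropy(tokens, n=3):
    """Per-token cross-entropy (bits) of a corpus under its own
    add-one-smoothed n-gram model -- a rough proxy for predictability."""
    context_counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        ctx, nxt = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
        context_counts[ctx][nxt] += 1
    vocab = len(set(tokens))
    total_log_prob, count = 0.0, 0
    for i in range(n - 1, len(tokens)):
        ctx, nxt = tuple(tokens[i - n + 1:i]), tokens[i]
        c = context_counts[ctx]
        # Add-one (Laplace) smoothing over the corpus vocabulary.
        p = (c[nxt] + 1) / (sum(c.values()) + vocab)
        total_log_prob += math.log2(p)
        count += 1
    return -total_log_prob / count

# Repetitive code-like tokens vs. freer prose-like tokens.
code = "for i in range ( 10 ) : print ( i ) for j in range ( 10 ) : print ( j )".split()
prose = "the quick brown fox jumps over the lazy dog while the slow red fox sleeps".split()
print(ngram_cross_entropy(code), ngram_cross_entropy(prose))
```

Even at this toy scale, the repetitive code string yields a lower per-token cross-entropy than the prose string, which is the same qualitative finding Hindle et al. reported at corpus scale.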
The immediate applied payoff was smarter IDE autocompletion: instead of completing only based on type signatures and API surface (as traditional static-analysis tools did), a statistical model could rank completion candidates by likelihood given the surrounding token context.
Eclipse plugin experiments (Bruch et al., 2009 and follow-on work) and later the Eclipse Code Recommenders project showed measurable improvements in completion acceptance rates over purely syntactic tools.
Key limitation of n-gram era: no understanding of long-range structure. An n-gram model with n=6 cannot relate a variable declared 40 lines ago to its use now. Completions were locally plausible but semantically shallow.
Key idea: The transformer's attention mechanism can, in principle, relate any token to any other token in its context window — collapsing the long-range limitation of n-grams.
OpenAI's GPT-2 (Radford et al., 2019), trained primarily on web text, could produce syntactically plausible Python in short bursts even though it was never intentionally trained on code. This was an early signal that transformers generalized to code as a byproduct of scale and architecture.
Microsoft Research released CodeBERT (EMNLP 2020): a bimodal pre-trained model over both natural language and code, built on the RoBERTa architecture.
Trained on the CodeSearchNet corpus: six programming languages and ~6M functions scraped from GitHub, of which ~2M are paired with docstrings.
Primary tasks: code search (find the function that matches a natural language query) and code documentation generation.
CodeBERT demonstrated that a single representation space could align the semantics of English docstrings with the semantics of the code they described — a prerequisite for "turn a comment into code."
OpenAI's Codex (arXiv:2107.03374, "Evaluating Large Language Models Trained on Code") was the first model to receive serious empirical evaluation on functional code synthesis — not just token prediction accuracy but whether the generated code actually ran and passed tests.
Architecture: GPT-3 fine-tuned on 159 GB of Python code drawn from 54 million public GitHub repositories (filtered down from 179 GB of unique files by quality heuristics).
Evaluation: HumanEval benchmark — 164 hand-written Python programming problems, each with a function header, docstring, and a set of unit tests. Metric: pass@k (does at least one of k samples pass all tests?).
Codex (12B parameters): pass@1 ≈ 28.8%, pass@100 ≈ 72.3%; the further fine-tuned Codex-S reached 77.5% pass@100.
GPT-3 (no code fine-tuning): pass@1 ≈ 0% — it could not solve even trivial problems reliably
This contrast was pedagogically clarifying: code-specific training mattered enormously, not just scale.
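The pass@k numbers above are computed with the unbiased estimator from the Codex paper: draw n samples per problem, count the c that pass, and estimate the probability that at least one of k randomly chosen samples passes.

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from Chen et al. (2021):
    1 - C(n-c, k) / C(n, k), where n samples were drawn and c passed."""
    if n - c < k:
        return 1.0  # fewer failures than k draws: some draw must pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

The naive estimate (fraction of problems where at least one of exactly k samples passes) is biased when you reuse a larger pool of n samples; this closed form corrects for that.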
Codex became the engine behind GitHub Copilot (technical preview June 2021, general availability June 2022), the first LLM-based coding assistant in widespread commercial use.
Why HumanEval's function-synthesis setup worked so well:
A function header plus docstring provides a natural language specification of what the function should do.
The correct output is objectively verifiable by running unit tests — unlike open-ended prose generation.
Functions are short enough (~10–50 lines) that the model's context window contains the full specification and room for the output.
This setup made the research question crisp: can a model translate intent into correct, executable logic?
Key idea: Once single-function synthesis was demonstrated, researchers and practitioners immediately asked — what happens when we scale the task?
DeepMind's AlphaCode (Li et al., 2022) was published in Science (February 2022), notable for the venue: acceptance in a flagship general-science journal signaled that this was now considered a mainstream scientific result.
Task: competitive programming problems (Codeforces), which require multi-function solutions, algorithmic reasoning, and handling complex I/O specifications — far harder than HumanEval.
Result: ~50th percentile performance among human Codeforces competitors. Not superhuman, but solidly in the human range for a non-trivial task.
Key insight: AlphaCode used a large candidate pool + filtering strategy — generate thousands of solutions, run them against sample test cases, cluster by output behavior, and submit the most representative ones. This sample-and-filter loop anticipated later agent architectures.
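The filter-and-cluster step can be sketched as follows. This is a simplification: the real AlphaCode pipeline executed generated source code against sample I/O in a sandbox, whereas here candidates are plain Python callables and the inputs are simple values.

```python
from collections import defaultdict

def select_submissions(candidates, sample_inputs, budget=10):
    """AlphaCode-style selection: run each candidate on the sample
    inputs, drop crashers, group survivors by output behavior, and
    submit one representative from each of the largest clusters."""
    clusters = defaultdict(list)
    for program in candidates:
        try:
            signature = tuple(program(x) for x in sample_inputs)
        except Exception:
            continue  # filtered out: crashes on the sample inputs
        clusters[signature].append(program)
    # Larger clusters = more independently generated programs agree,
    # which is weak evidence the shared behavior is correct.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [group[0] for group in ranked[:budget]]

candidates = [lambda x: x + x, lambda x: 2 * x, lambda x: x * x, lambda x: x / 0]
picked = select_submissions(candidates, [1, 2, 3], budget=2)
```

In this toy run, the crashing candidate is filtered out, the two behaviorally identical doubling programs form the largest cluster, and one representative per cluster is submitted.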
OpenAI's InstructGPT work (Ouyang et al., 2022) showed that RLHF (Reinforcement Learning from Human Feedback) dramatically improved model behavior on instruction-following tasks.
When applied to code: models became better at following natural language specifications without being explicitly prompted with "write a Python function that...". This shift from completion to instruction-following was a prerequisite for interactive coding assistants.
Single-shot generation has a hard ceiling: if the model makes a mistake, it cannot recover.
The natural extension is to give the model a feedback signal — most commonly, the output of a code interpreter or test runner.
Early work on self-repair (e.g., Olausson et al., "Is Self-Repair a Silver Bullet for Code Generation?", ICLR 2024) showed that when models received the error message from a failed execution and were asked to revise their code, success rates improved — but the gains were smaller than hoped for without additional scaffolding.
This observation directly motivated the agent loop framing: the model as an actor that observes, acts (writes/edits code), observes the result (test output, error message), and acts again. We will build this loop in today's exercise.
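As a preview of that loop, here is a minimal sketch. The `generate` stub stands in for an LLM call, and the toy harness and function names are illustrative, not any particular system's API:

```python
def repair_loop(generate, run_tests, task, max_rounds=3):
    """Minimal self-repair loop: generate code, run the tests, and if
    they fail, re-prompt with the failing output appended as context."""
    feedback = ""
    for _ in range(max_rounds):
        source = generate(task + feedback)
        ok, output = run_tests(source)
        if ok:
            return source
        feedback = f"\n\nPrevious attempt:\n{source}\n\nTest output:\n{output}"
    return None  # gave up after max_rounds

# Toy harness: exec the candidate and check one unit test.
def run_tests(source):
    ns = {}
    try:
        exec(source, ns)
        assert ns["double"](21) == 42
        return True, ""
    except Exception as e:
        return False, repr(e)

# Stub "model": first attempt is buggy; revises once it sees feedback.
def generate(prompt):
    if "Test output" in prompt:
        return "def double(x):\n    return 2 * x\n"
    return "def double(x):\n    return x ** 2\n"

fixed = repair_loop(generate, run_tests, "Write double(x).")
```

The structure mirrors the Olausson et al. finding: the error message alone gives the model something to react to, but a real agent adds scaffolding (planning, file access, richer observations) around this bare loop.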
Key idea: Modern coding agents are not just better code completers — they are autonomous problem-solving systems that can read files, run commands, browse documentation, and iterate.
| Benchmark | Year | Task | Difficulty proxy |
|---|---|---|---|
| HumanEval | 2021 | Single function from docstring | ~10–50 lines |
| MBPP | 2021 | Simple Python programming problems | ~5–15 lines |
| DS-1000 | 2022 | Data science tasks (NumPy, Pandas, etc.) | Domain-specific APIs |
| SWE-bench | 2023 | Fix real GitHub issues in real repos | Multi-file, multi-step |
| SWE-bench Verified | 2024 | Curated subset with human-verified solutions | Multi-file, multi-step |
The jump from HumanEval to SWE-bench is a jump from function synthesis to software engineering: the model must read existing code, understand the issue, locate the right file, make a targeted edit, and verify the fix without breaking other tests.
Early frontier model performance on SWE-bench was in the single-digit percentages (2023). By late 2024, leading agents (Claude, GPT-4o with scaffolding) were approaching and exceeding 50% on SWE-bench Verified.
A modern coding agent is a system that:
Receives a task specification in natural language
Uses tools: file read/write, terminal/shell, test runner, web search, documentation lookup
Iterates: plans, acts, observes output, revises
Manages context: decides what to read, what to keep in the window, what to ignore
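Stripped to a skeleton, the loop described above looks like this. It is a sketch: `llm` is a placeholder for a real model call, and the tool set and action format are illustrative, not any product's API:

```python
def agent_loop(llm, tools, task, max_steps=10):
    """ReAct-style skeleton: the model emits either a tool call
    ("tool_name", argument) or a final answer ("done", result)."""
    history = [("task", task)]
    for _ in range(max_steps):
        action, arg = llm(history)            # plan / act
        if action == "done":
            return arg
        observation = tools[action](arg)      # observe
        history.append((action, arg))
        history.append(("observation", observation))
    return None  # step budget exhausted

# Toy run: a scripted "model" that reads a file, then answers.
files = {"config.txt": "threshold=3"}
tools = {"read_file": lambda path: files[path]}

def scripted_llm(history):
    if len(history) == 1:
        return ("read_file", "config.txt")
    last_observation = history[-1][1]
    return ("done", last_observation.split("=")[1])

answer = agent_loop(scripted_llm, tools, "What is the threshold?")
```

Everything that makes real agents hard lives inside the two placeholders: what the model chooses to do at each step, and how the growing `history` is summarized to fit the context window.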
This is exactly the ReAct loop you studied in the reasoning module, applied to software engineering. The history we traced today is the story of how the underlying model capability grew to the point where this loop became useful rather than frustrating.
Code's statistical regularity (Hindle et al., 2012) gave early NLP methods a foothold — but n-grams lacked long-range reasoning.
Transformers (CodeBERT, 2020; Codex, 2021) introduced attention over full context windows and enabled function-level synthesis from natural language specifications.
Codex/HumanEval established the first rigorous benchmark: pass@k on executable unit tests — a paradigm shift from perplexity to functional correctness.
Scale + code-specific training data + RLHF moved models from completion to instruction-following.
The agent framing — model + tools + iteration loop — emerged naturally from the limitations of single-shot generation.
SWE-bench represents the current frontier: from "write a function" to "fix a bug in a real codebase."
Bruch, M., et al. (2009). Learning from examples to improve code completion systems. FSE 2009.
Hindle, A., et al. (2012). On the naturalness of software. ICSE 2012.
Radford, A., et al. (2019). Language models are unsupervised multitask learners. OpenAI blog.
Feng, Z., et al. (2020). CodeBERT: A pre-trained model for programming and natural languages. EMNLP 2020. arXiv:2002.08155.
Chen, M., et al. (2021). Evaluating large language models trained on code. arXiv:2107.03374.
Li, Y., et al. (2022). Competition-level code generation with AlphaCode. Science, 378(6624). arXiv:2203.07814.
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022. arXiv:2203.02155.
Jimenez, C.E., et al. (2023). SWE-bench: Can language models resolve real-world GitHub issues? arXiv:2310.06770.
Olausson, T.X., et al. (2024). Is self-repair a silver bullet for code generation? ICLR 2024. arXiv:2306.09896.