Fine-Tuning Exercise: Text-to-SQL with Tinker

A Hands-On Exercise for Graduate AI Agents Course


Field          Details
Duration       45–60 minutes (intro ~10 min, exercise ~35–50 min)
Prerequisites  Completed the Pig Latin fine-tuning exercise; Tinker account with API key
Base Model     meta-llama/Llama-3.2-1B
Dataset        b-mc2/sql-create-context (pre-downloaded JSON file provided)
Platform       Tinker API (all computation is remote; runs on any laptop)

1. Motivation: Why Fine-Tuning Beats RAG Here

In the Pig Latin exercise, we taught a model a simple string transformation rule. Now you will teach it a compositional skill: translating natural language questions into SQL queries, given a database schema.

This is a case where Retrieval-Augmented Generation (RAG) falls short. You could retrieve similar question-SQL pairs from a database and stuff them into the context window. But the model still needs to:

  1. Parse the intent of the natural language question

  2. Map the entities the question mentions to the tables and columns in the given schema

  3. Compose a syntactically valid SQL query that combines the right clauses

These are skills, not facts. You cannot look up the answer to "How many students scored above 90 in the math class?" in a reference document — the answer depends on the specific schema, and the model must generate the correct SQL by combining syntax knowledge with schema understanding. Fine-tuning bakes this compositional ability into the model's weights.

Key question to hold in mind: After fine-tuning on thousands of (question, schema, SQL) triples, what has the model actually learned? Is it memorizing patterns, or has it internalized SQL grammar?


2. The Dataset: b-mc2/sql-create-context

This dataset combines the WikiSQL and Spider benchmarks into 78,577 examples. Each example has three fields:

Field      Description                               Example
question   A natural language question               "How many heads of departments are older than 56?"
context    A SQL CREATE TABLE statement              CREATE TABLE head (age INTEGER, ...)
answer     The SQL query that answers the question   SELECT COUNT(*) FROM head WHERE age > 56

The dataset was specifically designed for text-to-SQL fine-tuning. The CREATE TABLE context provides table names, column names, and data types — enough for the model to ground its SQL generation without needing actual row data.

The instructor has pre-downloaded the dataset as a JSON file: sql_create_context_v4.json. It is a JSON array of objects, each with question, context, and answer keys. You can find it on the course web page. This is the code that downloaded it:
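A sketch of that download step, assuming the Hugging Face `datasets` library (the instructor's actual script may differ):

```python
def valid_record(rec):
    """A record must carry the three expected fields."""
    return all(key in rec for key in ("question", "context", "answer"))

if __name__ == "__main__":
    from datasets import load_dataset  # instructor-prep dependency

    # The dataset ships as a single "train" split of 78,577 examples.
    ds = load_dataset("b-mc2/sql-create-context", split="train")
    assert all(valid_record(rec) for rec in ds)

    # Write a single JSON array of objects (not JSON Lines).
    ds.to_json("sql_create_context_v4.json", orient="records", lines=False)
```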


3. Exercise Overview

You will:

  1. Sample before — ask the base (un-fine-tuned) Llama-3.2-1B to generate SQL from a natural language question. Observe its accuracy.

  2. Prepare training data — load the dataset, format each example as a prompt/completion pair, and tokenize.

  3. Train — run a supervised fine-tuning loop on Tinker for one full epoch (~307 batches at batch size 256), using all 78,377 examples that remain after holding out the 200 test examples.

  4. Sample after — ask the fine-tuned model the same questions. Observe the improvement.

  5. Discuss — reflect on what the model learned and where it still fails.


4. Step-by-Step Instructions

Step 0: Setup

Make sure your environment is ready:

Place the dataset file sql_create_context_v4.json in your working directory. (The instructor will tell you where to find it.)
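A typical setup sketch; the PyPI package name `tinker` and the `TINKER_API_KEY` variable name are assumptions, so verify both against the Tinker docs:

```shell
# Install the Tinker SDK (verify the package name in the Tinker docs)
pip install tinker

# Expose your API key to the client (placeholder value shown)
export TINKER_API_KEY="your-key-here"
```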

Step 1: Load and Explore the Data

Skim a few examples. Notice the range of SQL complexity — some are simple SELECTs, others involve JOINs, GROUP BY, and subqueries. Next, split the data into training and test sets. You will use 200 data points for testing and the rest for training.
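That load-and-split step might look like the following; the fixed seed is a reproducibility choice, not something the handout prescribes:

```python
import json
import random

def split_data(data, n_test=200, seed=0):
    """Shuffle deterministically, then hold out n_test examples for evaluation."""
    rng = random.Random(seed)
    data = list(data)
    rng.shuffle(data)
    return data[:n_test], data[n_test:]

if __name__ == "__main__":
    with open("sql_create_context_v4.json") as f:
        data = json.load(f)  # list of {"question", "context", "answer"} dicts
    print(data[0]["question"])  # skim a few examples
    test_set, train_set = split_data(data)
    print(len(train_set), len(test_set))  # expect 78377 and 200
```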

Step 2: Define the Prompt Format

We need a consistent template that presents the schema and question as a prompt, with the SQL query as the completion. The model will learn to predict everything after SQL:.

This format makes the task unambiguous: the model sees the schema, sees the question, and must produce the SQL.
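One workable template is sketched below. The exact wording ("Schema:", "Question:") is an assumption; what matters is using the same template consistently for training and evaluation:

```python
def format_prompt(context, question):
    """Render the schema and question; the model continues after 'SQL:'."""
    return f"Schema:\n{context}\n\nQuestion: {question}\n\nSQL:"

def format_full(context, question, answer):
    """Prompt plus gold completion, used when building training examples."""
    return format_prompt(context, question) + " " + answer
```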

Step 3: Evaluate the Base Model

Create the Tinker sampling client for the base model meta-llama/Llama-3.2-1B and evaluate it on the 200 held-out test examples.

How evaluation works: For each test example, (1) feed the schema and question to the model to get generated SQL, (2) compare it to the expected SQL. We use execution-based comparison: build an in-memory SQLite DB from the schema, run both queries, and check if they return the same result set. Here is the key code:
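A stripped-down sketch of that comparison is below. Unlike the full sql_matches.py, it seeds no row data, so on an empty database many distinct queries return identical (empty or zero) results; that is exactly the false-positive risk the seeded, multi-instance version reduces:

```python
import sqlite3

def sql_matches_simple(schema, generated_sql, expected_sql):
    """Execution-based comparison against a fresh in-memory SQLite database.

    True if both queries run and produce the same result set
    (order-insensitive). Invalid SQL counts as a mismatch.
    """
    conn = sqlite3.connect(":memory:")
    try:
        for stmt in schema.split(";"):
            if stmt.strip():
                conn.execute(stmt)
        got = conn.execute(generated_sql).fetchall()
        want = conn.execute(expected_sql).fetchall()
        return sorted(map(repr, got)) == sorted(map(repr, want))
    except sqlite3.Error:
        return False
    finally:
        conn.close()
```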

The full set of code for execution-based comparison is in the file sql_matches.py, which you should copy or import into your program. It parses the schema, extracts string/numeric literals from both queries to build realistic seed data, and runs on multiple DB instances to reduce false positives.

Step 4: Prepare Training Data

Format each training example using the template, tokenize, and set loss weights so the model learns to predict only the SQL portion (not the prompt). This function converts one example into a Tinker Datum:
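The tokenize-and-mask part can be sketched as follows. Wrapping the result into a Tinker Datum (a model input plus loss-function inputs carrying the weights and target tokens) should follow your SDK's docs; the helper below only builds the token and weight lists:

```python
def build_tokens_and_weights(tokenizer, prompt, completion):
    """Tokenize a prompt/completion pair for supervised fine-tuning.

    Weight 0 masks prompt tokens out of the loss; weight 1 keeps the
    completion (SQL) tokens in, so only the SQL is learned.
    """
    prompt_ids = tokenizer.encode(prompt)
    completion_ids = tokenizer.encode(completion)
    tokens = prompt_ids + completion_ids
    weights = [0] * len(prompt_ids) + [1] * len(completion_ids)
    return tokens, weights
```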

Now prepare all the training data and shuffle it.

Step 5: Train

Run the training loop. For each batch:

  1. Take a batch of 256 examples

  2. Call forward_backward with cross_entropy loss

  3. Call optim_step with Adam

  4. Print the loss every 100 steps

You should see the loss decrease over one full epoch (~307 batches). The whole run typically takes 10–20 minutes.
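The loop might be sketched as below. The Tinker call names (forward_backward, optim_step, types.AdamParams) and the 1e-4 learning rate are assumptions drawn from memory of the Tinker docs; check both against your SDK version:

```python
def batches(data, batch_size=256):
    """Yield consecutive batches; the last may be smaller than batch_size."""
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

def train_one_epoch(training_client, train_data, learning_rate=1e-4):
    """One pass over the data: forward/backward plus an Adam step per batch."""
    from tinker import types  # deferred import: requires the Tinker SDK

    for step, batch in enumerate(batches(train_data)):
        fwd_bwd = training_client.forward_backward(batch, loss_fn="cross_entropy")
        optim = training_client.optim_step(types.AdamParams(learning_rate=learning_rate))
        fwd_bwd.result()  # block until the remote pass finishes
        optim.result()
        if step % 100 == 0:
            print(f"step {step} complete")  # read the loss off the result; see Tinker docs
```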

Step 6: Evaluate the Fine-Tuned Model (After)

Save weights with save_weights_for_sampler, create a sampling client from the checkpoint path, and run evaluate_test_set on the same 200 test questions. The accuracy is the fraction of questions where the generated SQL returns the same result set as the expected SQL (execution-based comparison). Compare this to the base model accuracy from Step 3.
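In code, the checkpoint-to-sampler handoff might look like this; the method and field names follow the Tinker docs as I recall them and should be treated as assumptions:

```python
def make_sampling_client(service_client, training_client, name="final"):
    """Save sampler weights, then open a sampling client on that checkpoint."""
    saved = training_client.save_weights_for_sampler(name=name).result()
    return service_client.create_sampling_client(model_path=saved.path)
```

Pass this client to the same evaluate_test_set call used in Step 3 so the before/after numbers are directly comparable.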

Step 7: Test on Additional Novel Schema Questions

Use these questions that involve schemas not in the training set (e.g., employees, products, students — the training data uses Spider-style tables like head, department). These are out-of-distribution; expect lower accuracy here than on the 200 in-distribution test questions. Manually inspect the results.

Easy (single table, simple WHERE):

  1. Schema: CREATE TABLE employees (id INTEGER, name VARCHAR, salary REAL, department VARCHAR) Question: "What are the names of employees in the engineering department?"

  2. Schema: CREATE TABLE products (id INTEGER, name VARCHAR, price REAL, category VARCHAR) Question: "How many products cost more than 50 dollars?"

Medium (aggregation, ORDER BY):

  1. Schema: CREATE TABLE students (id INTEGER, name VARCHAR, score INTEGER, class VARCHAR) Question: "What is the highest score in the science class?"

  2. Schema: CREATE TABLE orders (id INTEGER, customer VARCHAR, amount REAL, date VARCHAR) Question: "List the top 3 customers by total order amount."

Hard (JOIN, GROUP BY):

  1. Schema: CREATE TABLE courses (id INTEGER, name VARCHAR, department VARCHAR); CREATE TABLE enrollments (student_id INTEGER, course_id INTEGER, grade VARCHAR) Question: "How many students are enrolled in each department?"


6. Discussion Questions

After completing the exercise:


7. Bonus Challenge

(Optional) Can you improve accuracy on the novel-schema questions (Step 7)? The model excels on in-distribution data (~87%) but struggles on schemas it never saw. Try adding examples with similar schemas to the training set, or experiment with learning rate and batch size.


8. Environment Reference

Configuration Notes

Dependencies

Dataset

File: sql_create_context_v4.json (21.8 MB, 78,577 examples)

Source: https://huggingface.co/datasets/b-mc2/sql-create-context

License: CC-BY-4.0

Download (instructor prep):

Tinker API Key

Optional: Checkpoint and Download

After training, you can save:

  1. Training state (save_state) — optimizer state and full checkpoint for resuming.

  2. Sampler weights (save_weights_for_sampler) — LoRA adapter weights for inference.

To download the fine-tuned model for local use:

The checkpoint path is returned when you call save_state or save_weights_for_sampler. The downloaded archive contains LoRA adapter weights. Merge them with the base model using PEFT/transformers:
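A merge sketch using the standard PEFT API; the adapter and output paths below are placeholders:

```python
def merge_lora(base_model_name, adapter_dir, out_dir):
    """Fold LoRA adapter weights into the base model and save the result."""
    from transformers import AutoModelForCausalLM  # deferred: heavy dependencies
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(base_model_name)
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()  # materializes base + LoRA deltas
    merged.save_pretrained(out_dir)

# Example (paths are placeholders):
# merge_lora("meta-llama/Llama-3.2-1B", "./adapter", "./llama-3.2-1b-sql")
```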

See Tinker download docs for details.

Prompt Template


End of Exercise — Text-to-SQL Fine-Tuning with Tinker