A Hands-On Exercise for Graduate AI Agents Course
| Field | Details |
|---|---|
| Duration | 45–60 minutes (intro ~10 min, exercise ~35–50 min) |
| Prerequisites | Completed the Pig Latin fine-tuning exercise; Tinker account with API key |
| Base Model | meta-llama/Llama-3.2-1B |
| Dataset | b-mc2/sql-create-context (pre-downloaded JSON file provided) |
| Platform | Tinker API (all computation is remote; runs on any laptop) |
In the Pig Latin exercise, we taught a model a simple string transformation rule. Now you will teach it a compositional skill: translating natural language questions into SQL queries, given a database schema.
This is a case where Retrieval-Augmented Generation (RAG) falls short. You could retrieve similar question-SQL pairs from a database and stuff them into the context window. But the model still needs to:
Parse the structure of the question (what is being asked, what filters apply, what aggregations are needed)
Map natural language concepts to column names it has never seen before
Compose valid SQL syntax — SELECT, WHERE, JOIN, GROUP BY, ORDER BY — in the right order with the right logic
Generalize to novel table schemas and question phrasings that don't match anything in the retrieval corpus
These are skills, not facts. You cannot look up the answer to "How many students scored above 90 in the math class?" in a reference document — the answer depends on the specific schema, and the model must generate the correct SQL by combining syntax knowledge with schema understanding. Fine-tuning bakes this compositional ability into the model's weights.
Key question to hold in mind: After fine-tuning on thousands of (question, schema, SQL) triples, what has the model actually learned? Is it memorizing patterns, or has it internalized SQL grammar?
b-mc2/sql-create-contextThis dataset combines the WikiSQL and Spider benchmarks into 78,577 examples. Each example has three fields:
| Field | Description | Example |
|---|---|---|
question | A natural language question | "How many heads of departments are older than 56?" |
context | A SQL CREATE TABLE statement | CREATE TABLE head (age INTEGER, ...) |
answer | The SQL query that answers the question | SELECT COUNT(*) FROM head WHERE age > 56 |
The dataset was specifically designed for text-to-SQL fine-tuning. The CREATE TABLE context provides table names, column names, and data types — enough for the model to ground its SQL generation without needing actual row data.
The instructor has pre-downloaded the dataset as a JSON file: sql_create_context_v4.json. It is a JSON array of objects, each with question, context, and answer keys. You can find it on the course web page. This is the code that downloaded it:
pip install datasetspython -c "from datasets import load_datasetimport jsonds = load_dataset('b-mc2/sql-create-context', split='train')with open('sql_create_context_v4.json', 'w') as f:json.dump([dict(row) for row in ds], f)print(f'Saved {len(ds)} examples')"
You will:
Sample before — ask the base (un-fine-tuned) Llama-3.2-1B to generate SQL from a natural language question. Observe its accuracy.
Prepare training data — load the dataset, format each example as a prompt/completion pair, and tokenize.
Train — run a supervised fine-tuning loop on Tinker for one full epoch (~307 batches with batch size 256) using all data except the 200 held-out test examples (~78,377 examples).
Sample after — ask the fine-tuned model the same questions. Observe the improvement.
Discuss — reflect on what the model learned and where it still fails.
Make sure your environment is ready:
xxxxxxxxxxpip install tinker transformers python-dotenvexport TINKER_API_KEY=your_key_herePlace the dataset file sql_create_context_v4.json in your working directory. (The instructor will tell you where to find it.)
ximport json, random
with open("sql_create_context_v4.json") as f: data = json.load(f)
print(f"Total examples: {len(data)}")print(f"\nSample example:")ex = data[0]print(f" Question: {ex['question']}")print(f" Context: {ex['context'][:120]}...")print(f" Answer: {ex['answer']}")Skim a few examples. Notice the range of SQL complexity — some are simple SELECTs, others involve JOINs, GROUP BY, and subqueries. Next, split the data into training and test sets. You will use 200 data points for testing and the rest for training.
xxxxxxxxxxNUM_TEST_EXAMPLES = 200 # Held-out for evaluation; all remaining data used for trainingrandom.shuffle(data)test_data = data[:NUM_TEST_EXAMPLES]train_data = data[NUM_TEST_EXAMPLES:]print(f"Training examples: {len(train_data)} (all except evaluation)")print(f"Test examples: {len(test_data)}")We need a consistent template that presents the schema and question as a prompt, with the SQL query as the completion. The model will learn to predict everything after SQL:.
xxxxxxxxxxTable schema:CREATE TABLE head (age INTEGER, name VARCHAR, ...)Question: How many heads of departments are older than 56?SQL: SELECT COUNT(*) FROM head WHERE age > 56
This format makes the task unambiguous: the model sees the schema, sees the question, and must produce the SQL.
Create the Tinker client using meta-llama/Llama-3.2-1B. Evaluate on the testing data.
xxxxxxxxxximport tinker
service_client = tinker.ServiceClient()base_model = "meta-llama/Llama-3.2-1B"training_client = service_client.create_lora_training_client(base_model=base_model)tokenizer = training_client.get_tokenizer()
print("\n--- Evaluating Base Model on 200 Test Questions ---")base_sampling_client = training_client.save_weights_and_get_sampling_client( name="base-model")base_accuracy = evaluate_test_set( base_sampling_client, tokenizer, test_data, "base")print(f"Base model accuracy: {base_accuracy:.2%} ({int(base_accuracy * 200)}/200)")How evaluation works: For each test example, (1) feed the schema and question to the model to get generated SQL, (2) compare it to the expected SQL. We use execution-based comparison: build an in-memory SQLite DB from the schema, run both queries, and check if they return the same result set. Here is the key code:
x
def sample_from_model(sampling_client, tokenizer, context: str, question: str) -> str: """Generate SQL from the model given schema and question.""" prompt = f"""Table schema:{context}Question: {question}SQL: """ prompt_tokens = tokenizer.encode(prompt, add_special_tokens=True) model_input = types.ModelInput.from_ints(tokens=prompt_tokens) params = types.SamplingParams(max_tokens=150, temperature=0.0, stop=["\n\n", "Question:"]) result = sampling_client.sample(prompt=model_input, sampling_params=params, num_samples=1).result() if result.sequences: return tokenizer.decode(result.sequences[0].tokens).strip() return ""
def eval_one(sampling_client, tokenizer, ex: dict) -> bool: """Evaluate one example: generate SQL, then check if it matches expected.""" sql = sample_from_model(sampling_client, tokenizer, ex["context"], ex["question"]) return sql_matches(sql, ex["answer"], schema=ex["context"])
def evaluate_test_set(sampling_client, tokenizer, test_data: list) -> float: """Compute accuracy = fraction of test examples where generated SQL matches expected.""" correct = sum(1 for ex in test_data if eval_one(sampling_client, tokenizer, ex)) return correct / len(test_data)The full set of code for execution-based comparison is in the file sql_matches.py, which you should copy or import into your program. It parses the schema, extracts string/numeric literals from both queries to build realistic seed data, and runs on multiple DB instances to reduce false positives.
Format each one using the template, tokenize, and set loss weights so the model only learns to predict the SQL portion (not the prompt). This function converts one example into a Tinker Datum:
xxxxxxxxxxdef format_prompt(example: dict) -> tuple[str, str]: """Format example as prompt and completion for text-to-SQL.""" prompt = f"""Table schema:{example['context']}Question: {example['question']}SQL: """ completion = example["answer"] return prompt, completion
def process_example(example: dict, tokenizer) -> types.Datum: """Convert a (question, context, answer) example into a Tinker Datum.""" prompt, completion = format_prompt(example)
prompt_tokens = tokenizer.encode(prompt, add_special_tokens=True) prompt_weights = [0.0] * len(prompt_tokens)
# Add space before completion, end with \n\n so the model learns to stop completion_str = f" {completion}\n\n" completion_tokens = tokenizer.encode(completion_str, add_special_tokens=False) completion_weights = [1.0] * len(completion_tokens)
tokens = prompt_tokens + completion_tokens weights = prompt_weights + completion_weights
# Next-token prediction: input is tokens[:-1], target is tokens[1:] input_tokens = tokens[:-1] target_tokens = tokens[1:] weights = weights[1:]
return types.Datum( model_input=types.ModelInput.from_ints(tokens=input_tokens), loss_fn_inputs={ "target_tokens": np.array(target_tokens, dtype=np.int64), "weights": np.array(weights, dtype=np.float32), }, )Now prepare all the training data and shuffle it.
xxxxxxxxxxprocessed_train = [ process_example(ex, tokenizer) for ex in train_data]random.shuffle(processed_train)Run the training loop. For each batch:
Take a batch of 256 examples
Call forward_backward with cross_entropy loss
Call optim_step with Adam
Print the loss every 100 steps
You should see the loss decrease over one full epoch (~307 batches). The whole run typically takes 10–20 minutes.
xxxxxxxxxxfrom tinker import types
NUM_EPOCHS = 1BATCH_SIZE = 256LEARNING_RATE = 5e-4 # Tinker-recommended for Llama-3.2-1B with LoRA
step = 0for epoch in range(NUM_EPOCHS): random.shuffle(processed_train) for batch_idx in range(0, len(processed_train), BATCH_SIZE): batch = processed_train[batch_idx : batch_idx + BATCH_SIZE] if len(batch) == 0: break
fwdbwd_future = training_client.forward_backward(batch, "cross_entropy") optim_future = training_client.optim_step( types.AdamParams(learning_rate=LEARNING_RATE) )
fwdbwd_result = fwdbwd_future.result() optim_result = optim_future.result()
# Compute loss (weighted cross-entropy over completion tokens only) to_arr = lambda x: x.to_numpy() if hasattr(x, "to_numpy") else np.array(x.tolist()) logprobs = np.concatenate([to_arr(o["logprobs"]) for o in fwdbwd_result.loss_fn_outputs]) weights = np.concatenate([to_arr(d.loss_fn_inputs["weights"]) for d in batch]) loss = float(-np.dot(logprobs, weights) / (weights.sum() + 1e-8))
step += 1 if step % 100 == 0 or batch_idx + BATCH_SIZE >= len(processed_train): print(f"Epoch {epoch + 1}/{NUM_EPOCHS}, update {step}, loss: {loss:.4f}")Save weights with save_weights_for_sampler, create a sampling client from the checkpoint path, and run evaluate_test_set on the same 200 test questions. The accuracy is the fraction of questions where the generated SQL returns the same result set as the expected SQL (execution-based comparison). Compare this to the base model accuracy from Step 3.
Use these questions that involve schemas not in the training set (e.g., employees, products, students — the training data uses Spider-style tables like head, department). These are out-of-distribution; expect lower accuracy here than on the 200 in-distribution test questions. Manually inspect the results.
Easy (single table, simple WHERE):
Schema: CREATE TABLE employees (id INTEGER, name VARCHAR, salary REAL, department VARCHAR)
Question: "What are the names of employees in the engineering department?"
Schema: CREATE TABLE products (id INTEGER, name VARCHAR, price REAL, category VARCHAR)
Question: "How many products cost more than 50 dollars?"
Medium (aggregation, ORDER BY):
Schema: CREATE TABLE students (id INTEGER, name VARCHAR, score INTEGER, class VARCHAR)
Question: "What is the highest score in the science class?"
Schema: CREATE TABLE orders (id INTEGER, customer VARCHAR, amount REAL, date VARCHAR)
Question: "List the top 3 customers by total order amount."
Hard (JOIN, GROUP BY):
Schema: CREATE TABLE courses (id INTEGER, name VARCHAR, department VARCHAR); CREATE TABLE enrollments (student_id INTEGER, course_id INTEGER, grade VARCHAR)
Question: "How many students are enrolled in each department?"
After completing the exercise:
Before vs. after: What specific improvements did you observe? Did the model learn SQL syntax, schema grounding, or both? What was the change in accuracy on the 200 held-out test questions? How well did it do on the additional manual test questions (Step 7)?
Hint: on the 200 in-distribution test questions, accuracy typically improves from ~37% (base) to ~87% (fine-tuned). The Step 7 questions use novel schemas and may show lower accuracy.
RAG comparison: Imagine you had a RAG system with 1,000 (question, SQL) pairs in a vector database. For which of the test questions above would RAG work well? For which would it struggle? Why?
Error analysis: When the fine-tuned model gets a query wrong, how does it fail? Wrong column names? Wrong SQL syntax? Wrong logic? Each failure mode tells you something different about what the model learned.
(Optional) Can you improve accuracy on the novel-schema questions (Step 7)? The model excels on in-distribution data (~87%) but struggles on schemas it never saw. Try adding examples with similar schemas to the training set, or experiment with learning rate and batch size.
Batch size (BATCH_SIZE): Number of examples per training step. Default: 256. Increase to speed up training; decrease if you hit memory limits. If you change it, scale learning rate by √(new/old) (e.g., 128 → LR × 0.7).
Evaluation workers (NUM_EVAL_WORKERS): Number of parallel workers for evaluation. Each worker sends one question to the model; with N workers, up to N questions run concurrently. More workers reduce evaluation time when you have idle GPU capacity.
Evaluation method: Accuracy is computed via execution-based comparison (SQLite). For each test example, the script builds a DB from the schema, runs both generated and expected SQL, and checks if result sets match. Run the full pipeline with bash run_finetuning.sh.
xxxxxxxxxxpip install tinker transformers python-dotenvFile: sql_create_context_v4.json (21.8 MB, 78,577 examples)
Source: https://huggingface.co/datasets/b-mc2/sql-create-context
License: CC-BY-4.0
Download (instructor prep):
xxxxxxxxxxpip install datasetspython -c "from datasets import load_datasetimport jsonds = load_dataset('b-mc2/sql-create-context', split='train')with open('sql_create_context_v4.json', 'w') as f: json.dump([dict(row) for row in ds], f)print(f'Saved {len(ds)} examples')"xxxxxxxxxxexport TINKER_API_KEY=your_key_hereAfter training, you can save:
Training state (save_state) — optimizer state and full checkpoint for resuming.
Sampler weights (save_weights_for_sampler) — LoRA adapter weights for inference.
To download the fine-tuned model for local use:
xxxxxxxxxxtinker checkpoint download <checkpoint_path>The checkpoint path is returned when you call save. The downloaded archive contains LoRA weights. Merge with the base model using PEFT/transformers:
xxxxxxxxxxfrom peft import PeftModelfrom transformers import AutoModelForCausalLM
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")model = PeftModel.from_pretrained(base, "./path/to/downloaded/adapter")See Tinker download docs for details.
xxxxxxxxxxTable schema:{context}Question: {question}SQL: {answer}
End of Exercise — Text-to-SQL Fine-Tuning with Tinker