This guide shows how to fine‑tune Llama 3 with Torchtune in a predictable, budget‑friendly way. We’ll prep a tiny instruction dataset, run a short LoRA training job on a single GPU, evaluate quality, and keep costs under control.
Why Torchtune
Torchtune provides small, hackable training recipes (LoRA, QLoRA, full fine‑tune) with sane defaults and first‑class configs. It’s ideal when you want to understand what’s happening and keep the loop tight, rather than clicking a one‑off “Train” button and praying.
Prerequisites
- Access to the Meta Llama 3 Instruct checkpoint on Hugging Face.
- A GPU with at least 24 GB VRAM for single‑device LoRA (QLoRA can run below 10 GB).
- A Python environment with PyTorch and Torchtune installed.
- Optional: a cloud notebook (Runpod, Colab Pro, Lambda Cloud) with a 24 GB GPU.
```bash
pip install torchtune  # or follow the official install instructions for your CUDA/ROCm stack
```
Step 1 — Build a tiny instruction dataset
Start small: 200–2,000 examples are enough to see whether the model’s behavior is moving in the right direction. Use a simple JSONL file with instruction, input, and output fields.
{ "instruction": "Rewrite the function to be tail‑recursive.", "input": "function fact(n){ if(n<=1) return 1; return n*fact(n-1); }", "output": "function fact(n, acc=1){ if(n<=1) return acc; return fact(n-1, n*acc); }" } { "instruction": "Summarize the ticket in one sentence.", "input": "User reports 500 on checkout after coupon applied.", "output": "Checkout throws 500 when coupons are applied; repro with SAVE10; likely coupon service timeout." }
Keep examples tight, consistent, and scoped to the behavior you actually want. If you’re mixing tasks, split them into separate runs first.
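Before you train on it, it’s worth validating the file once so a malformed row doesn’t surface mid‑run. A minimal sketch, assuming the data.jsonl layout above (the required field names come from the example; adjust if yours differ):

```python
import json

REQUIRED = ("instruction", "input", "output")

# Check every line of data.jsonl: valid JSON, all three fields present as strings.
with open("data.jsonl", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)  # raises if the line is malformed JSON
        missing = [k for k in REQUIRED if not isinstance(record.get(k), str)]
        if missing:
            raise ValueError(f"line {lineno}: missing or non-string fields: {missing}")

print("data.jsonl looks well-formed")
```

Running this before every job is cheap insurance; a single bad row otherwise tends to surface as a cryptic tokenizer or collation error.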
Step 2 — Download the base checkpoint
You’ll need access on Hugging Face and a token. Replace <ACCESS TOKEN> and <checkpoint_dir>.
```bash
tune download meta-llama/Meta-Llama-3-8B-Instruct \
  --output-dir <checkpoint_dir> \
  --hf-token <ACCESS TOKEN>
```
This pulls the model weights and tokenizer. Confirm the directory contains a tokenizer (e.g., tokenizer.model or equivalent) and model shards; a quick sanity check is sketched below.
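A minimal sketch of that check (exact file names vary by checkpoint format, so the glob patterns are illustrative):

```python
from pathlib import Path

ckpt = Path("<checkpoint_dir>")  # the directory you passed to tune download

# Look for a tokenizer file and at least one weights file; patterns are illustrative.
has_tokenizer = any(ckpt.glob("*tokenizer*"))
has_weights = any(
    p for pattern in ("*.safetensors", "*.pth", "*.bin") for p in ckpt.glob(pattern)
)

print(f"tokenizer found: {has_tokenizer}, weights found: {has_weights}")
```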
Step 3 — Single‑GPU LoRA fine‑tune
The simplest starting point uses Torchtune’s single‑device LoRA recipe. We’ll point the config to your checkpoint directory and dataset.
```bash
# See available recipes/configs
tune ls

# Basic LoRA single-device run
tune run lora_finetune_single_device \
  --config llama3/8B_lora_single_device \
  checkpointer.checkpoint_dir=<checkpoint_dir> \
  tokenizer.path=<checkpoint_dir>/tokenizer.model \
  checkpointer.output_dir=<checkpoint_dir> \
  dataset.path=/path/to/data.jsonl
```
Notes:
- The default config trains in bfloat16 and fits on a 24 GB GPU (peak ~18.5 GB in Torchtune’s examples).
- If your dataset is small, reduce epochs and increase gradient accumulation to stabilize updates.
- Use a validation split or held‑out set, even if tiny, to catch overfitting; a quick way to carve one out is sketched just below.
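If you don’t already have a held‑out file, a minimal sketch for splitting the Step 1 dataset (the 90/10 ratio, seed, and file names are arbitrary choices):

```python
import json
import random

random.seed(0)  # fixed seed so the split is reproducible across runs

with open("data.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f if line.strip()]

random.shuffle(rows)
cut = max(1, len(rows) // 10)  # hold out ~10%, at least one example

for path, subset in (("train.jsonl", rows[cut:]), ("heldout.jsonl", rows[:cut])):
    with open(path, "w", encoding="utf-8") as out:
        for row in subset:
            out.write(json.dumps(row, ensure_ascii=False) + "\n")

print(f"{len(rows) - cut} train / {cut} held-out examples")
```

Point dataset.path at train.jsonl and keep heldout.jsonl for Step 4.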
QLoRA option (less memory)
If you’re memory‑constrained, try QLoRA:
```bash
tune run lora_finetune_single_device \
  --config llama3/8B_qlora_single_device \
  checkpointer.checkpoint_dir=<checkpoint_dir> \
  tokenizer.path=<checkpoint_dir>/tokenizer.model \
  checkpointer.output_dir=<checkpoint_dir> \
  dataset.path=/path/to/data.jsonl
```
Expect peak allocated memory under ~10 GB depending on batch settings.
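If you want to verify the peak on your own hardware, one option is to poll the device from a second process while the job runs. A minimal sketch using the NVIDIA management library (pip install nvidia-ml-py; note this reports whole‑device usage, so close other GPU programs first):

```python
import time

import pynvml  # from the nvidia-ml-py package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

peak = 0
try:
    # Run this alongside the training job; stop it with Ctrl+C when training ends.
    while True:
        used = pynvml.nvmlDeviceGetMemoryInfo(handle).used  # bytes currently in use
        peak = max(peak, used)
        time.sleep(1)
except KeyboardInterrupt:
    print(f"peak device memory observed: {peak / 2**30:.1f} GiB")
finally:
    pynvml.nvmlShutdown()
```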
Multi‑GPU (optional)
If you have multiple devices, the distributed recipe can increase batch size and speed:
```bash
tune run --nproc_per_node 2 lora_finetune_distributed \
  --config llama3/8B_lora \
  checkpointer.checkpoint_dir=<checkpoint_dir> \
  tokenizer.path=<checkpoint_dir>/tokenizer.model \
  checkpointer.output_dir=<checkpoint_dir> \
  dataset.path=/path/to/data.jsonl
```
Step 4 — Evaluate quality
Don’t eyeball outputs—evaluate them. Two pragmatic options:
- EleutherAI Eval Harness: good for standardized tasks and quick regressions.
- Task‑specific checks: write small programmatic tests that assert the behaviors you care about.
Example: minimal script to compare base vs. fine‑tuned on your held‑out prompts. Log accuracy, pass‑rates, and typical failure modes.
```python
# Pseudo-code outline: your_inference_lib, load_jsonl, and passes are placeholders
# for your own inference wrapper, JSONL reader, and scoring predicate.
from your_inference_lib import generate

prompts = load_jsonl("heldout.jsonl")
base_ok, tuned_ok = 0, 0
for p in prompts:
    ref = p["output"]
    # Build the prompt from the instruction/input fields, not the raw record.
    prompt = f'{p["instruction"]}\n\n{p["input"]}'
    base = generate(model="llama3-instruct", prompt=prompt)
    tuned = generate(model="llama3-instruct-ft", prompt=prompt)
    base_ok += passes(ref, base)   # passes() returns 0 or 1; see below
    tuned_ok += passes(ref, tuned)

print({"base_rate": base_ok / len(prompts), "tuned_rate": tuned_ok / len(prompts)})
```
Your `passes` predicate can be exact (regex, JSON schema) or heuristic (an LLM judge with strict scoring). Keep it consistent across runs.
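For instance, two simple predicates, one exact and one pattern‑based (purely illustrative; pick whatever strictness your task actually needs):

```python
import json
import re

def passes_json(ref: str, out: str) -> int:
    """Exact structural match, for tasks whose outputs must be valid JSON."""
    try:
        return int(json.loads(out) == json.loads(ref))
    except (ValueError, TypeError):
        return 0  # output was not valid JSON

def passes_regex(pattern: str, out: str) -> int:
    """Looser check: the output must match a task-specific pattern."""
    return int(re.search(pattern, out) is not None)
```

Whichever predicate you choose, version it alongside the run so scores stay comparable.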
Cost control and stability
- Limit training duration: start with 1 epoch and a small batch; scale only if evals improve.
- Freeze your config: copy it with `tune cp` and track it in version control.
- Log everything: loss curves, learning rate, and evaluation results per checkpoint (a minimal run log is sketched after this list).
- Keep the loop tight: one change per run (LR, rank, batch, dataset size), not five.
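“Log everything” can be as lightweight as appending one JSON line per run. A minimal sketch (field names and values are just suggestions):

```python
import json
import time

def log_run(path: str, **fields) -> None:
    """Append one JSON record per run or checkpoint to a shared log file."""
    fields["timestamp"] = time.strftime("%Y-%m-%dT%H:%M:%S")
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(fields) + "\n")

# One record per experiment, one change per run.
log_run("runs.jsonl", config="8B_lora_single_device.yaml",
        lr=3e-4, lora_rank=8, epochs=1, final_loss=0.87, tuned_rate=0.62)
```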
Troubleshooting
- OOM on startup: reduce batch size or try QLoRA; verify you’re using bfloat16.
- Diverging loss: lower LR, increase warmup, ensure your dataset isn’t noisy or contradictory.
- Catastrophic forgetting: mix in a small slice of general instructions to keep balance.
- Bad formatting: make sure your prompts are consistently structured; add templates if needed (see the sketch below).
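A template can be one small function that every example passes through, so training and evaluation prompts can’t drift apart. A minimal sketch (the section markers are illustrative; match whatever chat template your inference stack expects):

```python
def format_prompt(instruction: str, inp: str = "") -> str:
    """Single source of truth for prompt structure across training and eval."""
    if inp:
        return f"### Instruction:\n{instruction}\n\n### Input:\n{inp}\n\n### Response:\n"
    return f"### Instruction:\n{instruction}\n\n### Response:\n"
```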
Minimal notebook outline
- Install dependencies and check GPU.
- Download checkpoint with `tune download`.
- Mount or upload `data.jsonl`.
- Run single‑device LoRA with a copied config.
- Save merged weights and separate LoRA adapter if desired.
- Run a quick eval script comparing base vs. tuned.
- Export a short report with metrics and sample generations.
Wrap‑up
Start tiny, measure, and iterate. Torchtune’s recipes make it easy to keep experiments disciplined: copy a config, run a short job, evaluate, and decide whether to keep going. When the tuned model’s pass‑rate and error profile beat the base on your real tasks, you’re ready to productize.