Fine-tuning open-source models on proprietary data is replacing API dependency for serious products. Here's when to make the jump — and why you may have already done the hard part.
What the industry is doing
resolution rate, Intercom's Fin Apex fine-tuned over a frontier API at 1/5 the cost
cost reduction reported by Pinterest CEO using fine-tuned open-source vs frontier APIs
per LoRA adapter on Modal A10G — what it costs you per training run with the stack on this site
Three "yes" signals you've outgrown prompt engineering — and three "wait" signals that mean training is premature.
You've iterated prompts, system messages, few-shot exemplars, RAG, agent loops — the eval line has flattened. The next gain is in weights, not strings.
A 3B or 7B fine-tuned model on your own GPU returns in 100–300 ms what a frontier API delivers in 2–4 s, and it scales horizontally without per-token billing.
Per-call API cost × volume crosses your margin. A fine-tuned adapter on owned compute flips the curve from "linear with usage" to "fixed monthly + tiny marginal."
If your product loop doesn't yet work end-to-end with a frontier API, training won't fix that. Get the loop working first; then optimize.
Fine-tuning needs traces. If your product hasn't shipped or you haven't been logging interactions, build an eval suite first, then come back.
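The cost signal above describes API spend as linear with usage and owned compute as a fixed monthly bill plus a small marginal cost. Here is a minimal sketch of where those two curves cross. The function names and every dollar figure in the usage note are hypothetical placeholders, not pricing from this page or from any vendor; plug in your own numbers.

```python
import math
from typing import Optional


def monthly_api_cost(calls: int, cost_per_call: float) -> float:
    """API spend: grows linearly with call volume."""
    return calls * cost_per_call


def monthly_owned_cost(calls: int, fixed_monthly: float, marginal_per_call: float) -> float:
    """Owned compute: a fixed monthly bill plus a small per-call cost."""
    return fixed_monthly + calls * marginal_per_call


def break_even_calls(
    cost_per_call: float, fixed_monthly: float, marginal_per_call: float
) -> Optional[int]:
    """Smallest monthly call volume at which owned compute is no more
    expensive than the API. Returns None if the API is never more expensive,
    i.e. the per-call saving is zero or negative."""
    saving_per_call = cost_per_call - marginal_per_call
    if saving_per_call <= 0:
        return None
    return math.ceil(fixed_monthly / saving_per_call)
```

For example, with an assumed $0.01 per API call, an assumed $600 fixed monthly GPU bill, and an assumed $0.001 marginal cost per self-hosted call, `break_even_calls(0.01, 600.0, 0.001)` gives the monthly volume above which the fixed bill is the cheaper option. Below that volume, the "wait" signals still apply.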
Most teams are intimidated by training because they think it requires infrastructure they don't have. The reverse is closer to true: you probably already have the hard parts.
A loop that drives the model through a real task and captures every step. In this stack: chakra drives PDCA missions and writes traces to data/traces/pdca/.
A way to score "did this run go well?" against a golden set. In this stack: chakra scores + foundry's experiment graph track every adapter against a fixed eval corpus.
People who can read traces and reason about model behavior. In this stack: the cohort SOP trains 250 of them per day, each producing their own adapter.
Traces from real product use, curated. In this stack: foundry data + the Data Review tab let you keep/drop/flag records before training.
Modal handles the GPU, vLLM handles serving, foundry's Modal backend handles the upload/spawn/poll loop. foundry train fire web --watch is the entire interface.
A LoRA training run on Modal A10G costs ~$0.30–$3 and takes 30–90 minutes. Pay-per-minute, no cluster, no reservation.
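The eval piece in the list above, scoring "did this run go well?" against a golden set, can be as simple as a pass rate over fixed cases. Here is a minimal sketch of that idea. It is not how chakra or foundry compute scores; `GoldenCase`, `pass_rate`, and the normalized exact-match rule are hypothetical stand-ins for whatever scoring your product already uses.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GoldenCase:
    """One fixed eval case: an identifier and the output we expect."""
    case_id: str
    expected: str


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting does not fail a case."""
    return " ".join(text.lower().split())


def pass_rate(golden: List[GoldenCase], outputs: Dict[str, str]) -> float:
    """Fraction of golden cases whose output matches the expected text.
    A case with no recorded output counts as a failure."""
    if not golden:
        raise ValueError("golden set is empty")
    passed = 0
    for case in golden:
        produced = outputs.get(case.case_id)
        if produced is not None and _normalize(produced) == _normalize(case.expected):
            passed += 1
    return passed / len(golden)
```

Run it once on the base model's outputs and once on the adapter's, against the same golden set. The point of keeping the corpus fixed is that the two numbers are directly comparable.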
In building your product, you may have already done the hard part.
Each industry-standard requirement maps cleanly onto a tool in this stack. Nothing custom; just the right composition.
The whole loop, from "I have a target" to "I have a fine-tuned adapter scoring on golden traces."
$ l33tpwn login # Cognito session
$ l33tpwn student provision me@cohort.l33tpwn.com # 2-4 min
$ l33tpwn student vnc-url me@cohort.l33tpwn.com # ← VNC link from CLI
$ l33tpwn walkthrough view albania # learn the kill chain
$ chakra run albania --invocations 30 # produces the trace
$ foundry data review v17-cohort # human-curate
$ foundry train fire web --watch # ~$1 on Modal A10G
$ chakra scores # eval — did it learn?
Detailed timing in the 10-phase course SOP.
The model is the transistor. The fine-tuned, domain-specific system is the product.
A pen-test mission engine, a security-domain corpus, a fine-tuned LoRA on top of Qwen2.5-7B that posts logins and reports findings — that's your product. The base model is a component, not the offering.
The 10-phase Course Delivery SOP scales to 250 students per cohort, $370 typical spend, $893 hard ceiling. Each student fires their own training run, ships their own adapter.
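To put the cohort figures on a per-student footing, here is a small arithmetic sketch. It assumes the $370 and $893 figures are totals for one full cohort of 250, which the text above does not state explicitly; if they are measured differently, the division changes accordingly. The names are illustrative only.

```python
STUDENTS_PER_COHORT = 250      # from the SOP figure above
TYPICAL_SPEND_USD = 370.0      # assumed to be a whole-cohort total
HARD_CEILING_USD = 893.0       # assumed to be a whole-cohort total


def per_student(total_usd: float, students: int) -> float:
    """Split a cohort-level dollar figure evenly across students."""
    if students <= 0:
        raise ValueError("students must be positive")
    return total_usd / students


typical_per_student = per_student(TYPICAL_SPEND_USD, STUDENTS_PER_COHORT)
ceiling_per_student = per_student(HARD_CEILING_USD, STUDENTS_PER_COHORT)

print(f"typical: ${typical_per_student:.2f} per student")
print(f"ceiling: ${ceiling_per_student:.2f} per student")
```

Under that assumption the typical spend works out to about $1.48 per student and the ceiling to about $3.57 per student, the same order of magnitude as the per-run training cost quoted earlier.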