
Fine-Tuning vs Prompting: Which Is Right?

Fine-tuning updates a model's parameters by training on a dataset tailored to your task, while prompting shapes the model's output through carefully written instructions and context, leaving the weights unchanged. As a rough guide, fine-tuning improves accuracy by 15–40% on specialized tasks but costs thousands of dollars and takes days to weeks. Prompting can be working within hours for a few dollars in API calls, but it hits a ceiling when the base model lacks the right knowledge or reasoning patterns. This article introduces the core trade-offs and helps you understand when each approach wins.

What Is Fine-Tuning and How Does It Differ from Prompting?

Fine-tuning is the process of updating a pre-trained language model's weights by training it on a dataset of examples specific to your domain or task. A language model is trained on billions of web documents; fine-tuning retrains it on, say, 500–10,000 examples from your domain so the model learns task-specific language, patterns, and edge cases. Prompting, by contrast, keeps the model frozen and instead crafts instructions, examples, and context (collectively called a "prompt") to guide the model toward the desired output without modifying its internals.

A useful analogy: fine-tuning is like teaching someone a new skill through repeated practice, while prompting is like handing that person a detailed instruction manual before each task. One works through practice; the other through clear direction.

Key Differences at a Glance

Aspect               | Fine-Tuning                     | Prompting
Model Changes        | Weights updated                 | Model frozen
Time to Deploy       | Days to weeks                   | Hours (often minutes)
Cost                 | $1,000–$10,000+ per task        | $10–$100 per 1M tokens
Accuracy Gain        | 15–40% typical                  | 5–15% typical
Data Needed          | 500–10,000+ examples            | 0–20 examples (few-shot)
Maintenance          | Retraining required for updates | Modify prompt only
Latency              | Same as base model              | Same as base model
Knowledge Constraint | Learns from your data           | Limited to pre-training data

When Fine-Tuning Wins

Fine-tuning is necessary when your task involves domain-specific language, rare edge cases, or a style the base model has never seen. For example, a legal document classifier trained on public web text performs poorly on dense contract language; fine-tuning on 2,000 labeled contracts raises F1-score from 68% to 89%. Similarly, if your task requires the model to respond in a specific tone (e.g., clinical yet empathetic) or structure (e.g., always JSON), fine-tuning enforces this consistency better than prompting alone.

Fine-tuning also wins when volume is high. Training is a one-time cost, whereas per-call API charges on an expensive base model accumulate with every one of, say, 1 million monthly requests. A fine-tuned model still incurs inference costs, but it can often be a smaller model served on cheaper infrastructure with a much shorter prompt, and at scale that can cost as much as 10× less per token.
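The cost argument comes down to a break-even calculation. The sketch below is a minimal illustration, not a pricing model: the function name and every dollar figure are hypothetical, so substitute your own training quote and measured per-call costs.

```python
def breakeven_calls(training_cost: float,
                    base_cost_per_call: float,
                    tuned_cost_per_call: float) -> float:
    """Number of calls after which the fine-tuned model is cheaper overall."""
    saving_per_call = base_cost_per_call - tuned_cost_per_call
    if saving_per_call <= 0:
        # No per-call saving: fine-tuning never pays for itself on cost alone
        return float("inf")
    return training_cost / saving_per_call


# Hypothetical figures, for illustration only
calls = breakeven_calls(training_cost=5_000.0,
                        base_cost_per_call=0.010,
                        tuned_cost_per_call=0.001)
print(f"Break-even after about {calls:,.0f} calls")
```

If your expected monthly volume is well below the break-even point, the one-time training cost is unlikely to be recovered quickly.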

When Prompting Wins

Prompting wins when your task is general enough that the base model already has the core knowledge. A question-answering task on public domain topics, a writing style guide, a coding helper for standard algorithms — all benefit more from well-written prompts than from fine-tuning. Prompting also wins when you need to iterate quickly: you can change your prompt in seconds; retraining a fine-tuned model takes days.

Prompting is also usually the only practical option when you have fewer than 50 training examples or when your examples are highly diverse (e.g., you need the model to handle questions on 500 different topics). Fine-tuning needs enough data to learn the underlying pattern; with too little, it overfits, memorizing the training examples instead of generalizing.

Combining Both: The Hybrid Approach

Many teams use both: they fine-tune a model on their core domain knowledge, then layer sophisticated prompts and retrieval-augmented generation (RAG) on top. For example, a customer support system might fine-tune a model on 3,000 internal support conversations, then add a prompt that includes the customer's account history and recent tickets retrieved from a database. This hybrid approach captures the best of both worlds: the model understands support language from fine-tuning, and the prompt provides real-time, customer-specific context.
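The prompt-side half of the hybrid approach might look like the sketch below. It assumes the account history and tickets have already been retrieved (newest first) by some other component; the `Ticket` type, the `build_support_prompt` helper, and the prompt layout are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Ticket:
    ticket_id: str
    summary: str


def build_support_prompt(question: str,
                         account_history: str,
                         tickets: List[Ticket],
                         max_tickets: int = 3) -> str:
    """Assemble customer-specific context ahead of the customer's message."""
    # Keep only the most recent tickets so the prompt stays short
    recent = tickets[:max_tickets]
    ticket_lines = "\n".join(
        f"- [{t.ticket_id}] {t.summary}" for t in recent
    ) or "- (none)"
    return (
        "Account history:\n"
        f"{account_history}\n\n"
        "Recent tickets:\n"
        f"{ticket_lines}\n\n"
        "Customer message:\n"
        f"{question}"
    )


prompt = build_support_prompt(
    "Where is my refund?",
    "Plan: Pro, customer since 2022",
    [Ticket("T-101", "Refund requested for duplicate charge")],
)
print(prompt)
```

The resulting string would be sent as the user message to the fine-tuned model, which supplies the support-specific language while the prompt supplies the per-customer facts.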

Your First Decision: Three Questions

  1. Do you have 500+ labeled examples? If yes, fine-tuning is worth exploring. If no, start with prompting.
  2. Does the task require domain-specific language or style the base model rarely sees? If yes, lean toward fine-tuning. If no (e.g., general summarization, translation), strong prompting often suffices.
  3. Is inference cost or latency your bottleneck? If you call the model millions of times monthly or need <100 ms response, a fine-tuned model (typically a smaller one, with a shorter prompt) may be cheaper and faster at scale.
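One possible reading of the three questions as code is sketched below. The function name and the returned strings are hypothetical, and the rule order (data availability first) is an assumption about how the questions combine; treat it as a starting point for discussion, not a decision procedure.

```python
def recommend_approach(num_labeled_examples: int,
                       needs_domain_style: bool,
                       cost_or_latency_bound: bool) -> str:
    """Encode the three questions as a simple first-pass filter."""
    # Question 1: without roughly 500 labeled examples, start with prompting
    if num_labeled_examples < 500:
        return "start with prompting"
    # Questions 2 and 3: either one tilts the decision toward fine-tuning
    if needs_domain_style or cost_or_latency_bound:
        return "explore fine-tuning"
    # Enough data, but a general task with no cost pressure
    return "start with prompting"


print(recommend_approach(2_000, needs_domain_style=True,
                         cost_or_latency_bound=False))
```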

Code Example: Prompting for a Customer Service Task

Below is a prompt-first approach to classify customer intent without fine-tuning:

import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable
client = anthropic.Anthropic()

def classify_customer_intent(message: str) -> str:
    """Classify customer intent using few-shot prompting."""
    prompt = """You are a customer service intent classifier. Classify the customer's message into one of these categories: billing, technical_support, refund_request, product_inquiry, complaint, other.

Examples:
- "My invoice is wrong" -> billing
- "The app crashes on login" -> technical_support
- "I want my money back" -> refund_request
- "What models do you support?" -> product_inquiry
- "Your service is terrible" -> complaint

Classify this message:
{message}

Return only the category name, nothing else."""

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=20,  # a category name needs only a few tokens
        messages=[
            {
                "role": "user",
                "content": prompt.format(message=message)
            }
        ]
    )
    # The reply is a list of content blocks; the first one holds the text
    return response.content[0].text.strip()

# Test
intent = classify_customer_intent("My payment method was declined")
print(f"Intent: {intent}")  # Expected: billing
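A prompted classifier can return text that is close to, but not exactly, one of the category names (different casing, a trailing period, surrounding quotes). The sketch below shows one way to guard against that before the label reaches downstream code. `ALLOWED_INTENTS` and `normalize_intent` are hypothetical helpers, and falling back to `other` is an assumption; you may prefer to log or retry instead.

```python
ALLOWED_INTENTS = {
    "billing", "technical_support", "refund_request",
    "product_inquiry", "complaint", "other",
}


def normalize_intent(raw: str) -> str:
    """Map raw model output onto a known category, defaulting to 'other'."""
    # Trim whitespace, then stray quotes/periods, then unify case and spacing
    label = raw.strip().strip("\"'.").lower().replace(" ", "_")
    return label if label in ALLOWED_INTENTS else "other"


print(normalize_intent(" Technical Support\n"))
```

Wrapping the return value of `classify_customer_intent` in `normalize_intent` keeps unexpected model output from leaking into routing logic.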

Code Example: Fine-Tuning Setup (Conceptual)

If prompting doesn't meet your accuracy threshold, here's what fine-tuning setup looks like:

import json

# Prepare training data: one chat-style record per labeled example
training_data = [
    {
        "messages": [
            {"role": "user", "content": "My invoice shows a duplicate charge"},
            {"role": "assistant", "content": "billing"}
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "The app won't load on my phone"},
            {"role": "assistant", "content": "technical_support"}
        ]
    },
    # ... 500+ more examples
]

# Write training data to a JSONL file (one JSON object per line)
with open("training_data.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")

# Example hyperparameters; sensible values depend on the model and provider
hyperparameters = {
    "learning_rate": 0.0001,
    "batch_size": 8,
    "num_epochs": 3
}

# Initiate the fine-tuning job (pseudocode). The anthropic Python SDK has no
# `client.messages.model_create` method; the call below is a placeholder for
# whatever job-creation endpoint your fine-tuning provider documents.
#
# job = provider_client.create_fine_tuning_job(   # hypothetical name
#     model="claude-3-5-sonnet-20241022",
#     training_file="training_data.jsonl",
#     hyperparameters=hyperparameters,
# )
# print(f"Fine-tuning job started: {job.job_id}")
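Before submitting a training file, it is usually worth checking that every line matches the record format shown above. The sketch below is one way to do that; `validate_record` is a hypothetical helper, and the rules it enforces (user/assistant roles, non-empty content, assistant message last) are assumptions drawn from the example records, not any provider's official schema.

```python
import json
from typing import List


def validate_record(line: str) -> List[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["line is not valid JSON"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or len(messages) < 2:
        return ["'messages' must be a list with at least two entries"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ("user", "assistant"):
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty content")

    # The final message carries the label the model should learn to produce
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message must come from the assistant")
    return problems


sample = json.dumps({"messages": [
    {"role": "user", "content": "My invoice shows a duplicate charge"},
    {"role": "assistant", "content": "billing"},
]})
print(validate_record(sample))
```

Running this over each line of `training_data.jsonl` and stopping on the first non-empty result catches malformed records before any training cost is incurred.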

Key Takeaways

  • Fine-tuning retrains model weights for 15–40% accuracy gains on domain-specific tasks; prompting uses instruction alone with 5–15% gains and hours-to-minutes deployment.
  • Fine-tuning requires 500+ labeled examples, costs $1,000–$10,000+, and takes days to weeks; prompting requires minimal data and is cheap and fast.
  • Use fine-tuning when you have the data, domain-specific language, or high inference volume; use prompting for general tasks, quick iteration, and low data regimes.
  • A hybrid approach combining fine-tuning with RAG and sophisticated prompts often outperforms either alone.
  • Your first decision filter: count your labeled examples and assess task specificity against the base model's pre-training.

Frequently Asked Questions

Can I fine-tune an already fine-tuned model?

Yes, many teams "stack" fine-tuning: start with a general-purpose fine-tuned model, then fine-tune it further on specialized examples. This is called progressive fine-tuning and can boost accuracy faster because the model already has partial domain knowledge.

Does fine-tuning make my model private?

Fine-tuning on your data does not guarantee privacy — the model still contains learned patterns from your training set, and API providers may log inputs/outputs. For truly private models, deploy fine-tuned models on your infrastructure.

How much data do I actually need?

The minimum viable dataset is around 50 examples for experimentation, but 500–1,000 is typical for production accuracy gains. With 10,000+ examples, expect diminishing returns beyond 15–25% accuracy improvement.

Can I use fine-tuning and RAG together?

Absolutely, and it's recommended. Fine-tune on task-specific language and reasoning patterns; use RAG to inject real-time domain knowledge. This is the current best practice for specialized domains.

How do I know if prompting is "good enough"?

Test a strong prompt against your accuracy target. If you hit your threshold without fine-tuning, ship the prompt version — less operational burden. If you fall short by 5–10%, fine-tuning is justified. If you fall short by 20%+, the task may be too hard for the base model, even with fine-tuning.
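The rule of thumb above can be sketched as a small evaluation helper. Everything here is illustrative: `accuracy` and `next_step` are hypothetical names, the gaps are read as percentage points expressed as fractions (0.05 means 5 points), and the ranges the rule of thumb does not cover are reported as such instead of guessed at.

```python
from typing import Callable, List, Tuple


def accuracy(classify: Callable[[str], str],
             labeled: List[Tuple[str, str]]) -> float:
    """Fraction of (text, expected_label) pairs the classifier gets right."""
    if not labeled:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for text, expected in labeled if classify(text) == expected)
    return correct / len(labeled)


def next_step(measured: float, target: float) -> str:
    """Apply the rule of thumb to the gap between target and measured accuracy."""
    gap = target - measured
    if gap <= 0:
        return "ship the prompt"
    if 0.05 <= gap <= 0.10:
        return "fine-tuning is justified"
    if gap >= 0.20:
        return "task may be too hard for the base model"
    return "not covered by the rule of thumb"


# Stand-in classifier so the example runs without an API call
def always_billing(text: str) -> str:
    return "billing"


held_out = [("My invoice is wrong", "billing"),
            ("The app crashes on login", "technical_support")]
score = accuracy(always_billing, held_out)
print(score, next_step(score, target=0.90))
```

In practice you would pass your real prompted classifier in place of `always_billing` and use a held-out set large enough for the accuracy estimate to be meaningful.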

Further Reading