Data Best Practices
Get the most out of fine-tuning with high-quality training data.
Quality over quantity
A fine-tuned model mirrors your data. If your data is high-quality, focused, and consistent, the model will be too. If it's noisy, contradictory, or off-topic, the model will reflect that.
More data isn't always better. 500 excellent examples will outperform 5,000 mediocre ones.
Guidelines
Include representative examples
Your data should cover the kinds of tasks you want the model to perform. If you want a support agent, include actual support conversations — not marketing copy.
Be consistent in tone and style
If your data mixes formal academic writing with casual Slack messages, the model won't have a clear signal for what style to use. Either:
- Filter your data to one consistent style, or
- State in your use-case description which style the model should prioritize
Remove sensitive information
Before uploading, strip out:
- Passwords, API keys, and secrets
- Personal identification numbers (SSN, credit cards)
- Private customer data you don't have consent to use
Commissioned does not automatically redact sensitive information. You are responsible for ensuring your data is safe to use for training.
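A simple pre-upload pass can catch the most common leaks. This is a minimal sketch, not an exhaustive scrubber: the key, SSN, and card patterns below are illustrative assumptions you should adapt to the formats that actually appear in your data.

```python
import re

# Illustrative patterns only -- tune these to your own data.
PATTERNS = {
    "api_key": re.compile(r"\b(?:sk|pk|key)[-_][A-Za-z0-9]{16,}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> str:
    """Replace likely secrets and PII with labeled placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text
```

Spot-check the output by hand before uploading; regex-based redaction misses anything that doesn't match a known shape.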
Aim for variety within your domain
While consistency in style is important, variety in content helps the model generalize:
| ❌ Too narrow | ✅ Good variety |
|---|---|
| 100 examples all about "password reset" | Examples covering password reset, billing, setup, features, and troubleshooting |
| Only formal emails | Mix of emails, docs, and internal notes (all in the same professional tone) |
| One long document repeated | Multiple documents covering different aspects of your domain |
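A quick check for the "one long document repeated" failure mode is deduplication before upload. A minimal sketch using exact hashing (normalized for case and whitespace; catching near-duplicates would need fuzzier techniques like shingling):

```python
import hashlib

def dedupe(examples: list[str]) -> list[str]:
    """Drop exact duplicates, ignoring case and whitespace differences."""
    seen = set()
    unique = []
    for text in examples:
        # Normalize whitespace and case so trivially reformatted copies collide.
        key = hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```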
Use your description effectively
The use-case description isn't just metadata — it actively guides how your data is structured. Be specific:
❌ "Make a bot"
✅ "Create a customer support assistant for a B2B SaaS product. It should answer questions about features, billing, and integrations using a professional but friendly tone. The training data includes help center articles and resolved support tickets."
How much data do you need?
There's no hard minimum, but here are rough guidelines:
| Data volume | Expected quality |
|---|---|
| < 50 examples | Model may not pick up patterns reliably |
| 50–500 examples | Good starting point for style and domain adaptation |
| 500–5,000 examples | Strong fine-tune with consistent behavior |
| 5,000+ examples | Excellent for specialized, production-grade models |
"Examples" here means distinct pieces of content — individual conversations, documents, code files, etc. Not word count.
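If you want a quick sanity check on your dataset size, the table above translates directly into a lookup. The tier labels are just the table's wording, and `volume_tier` is a hypothetical helper, not part of any Commissioned API:

```python
# Rough thresholds taken from the volume table above.
TIERS = [
    (5000, "excellent for specialized, production-grade models"),
    (500, "strong fine-tune with consistent behavior"),
    (50, "good starting point for style and domain adaptation"),
    (0, "may not pick up patterns reliably"),
]

def volume_tier(n_examples: int) -> str:
    """Map a count of distinct examples to the rough quality tier above."""
    for floor, label in TIERS:
        if n_examples >= floor:
            return label
    return TIERS[-1][1]
```

Remember to count distinct examples, not words, before applying the thresholds.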
Iterating
Fine-tuning is iterative. Your first model probably won't be perfect. The workflow is:
- Upload your initial data and train
- Chat with the model — note where it falls short
- Add more data that covers the gaps
- Train a new model
- Compare and repeat
Each iteration costs nothing on the free tier (up to 3 models). On Pro, you have 15 slots to experiment with.