Data Best Practices

Get the most out of fine-tuning with high-quality training data.

Quality over quantity

A fine-tuned model mirrors your data. If your data is high-quality, focused, and consistent, the model will be too. If it's noisy, contradictory, or off-topic, the model will reflect that.

More data isn't always better. 500 excellent examples will outperform 5,000 mediocre ones.

Guidelines

Include representative examples

Your data should cover the kinds of tasks you want the model to perform. If you want a support agent, include actual support conversations — not marketing copy.

Be consistent in tone and style

If your data mixes formal academic writing with casual Slack messages, the model won't have a clear signal for what style to use. Either:

  • Filter your data to one consistent style (see the sketch below), or
  • State in your use-case description which style the model should prioritize
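A quick filtering pass is often enough. The sketch below assumes your exported examples live in a JSONL file where each record has a "source" label; both the file name and the field are assumptions about how your own data happens to be organized, not anything Commissioned requires.

```python
import json

# Hypothetical layout: one JSON object per line, each with a "source"
# label such as "email", "slack", or "docs". Adjust to match your export.
KEEP_SOURCES = {"email"}  # keep a single, consistent style

with open("examples.jsonl") as src, open("filtered.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        if record.get("source") in KEEP_SOURCES:
            dst.write(json.dumps(record) + "\n")
```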

Remove sensitive information

Before uploading, strip out:

  • Passwords, API keys, and secrets
  • Personal identification numbers (SSN, credit cards)
  • Private customer data you don't have consent to use

Commissioned does not automatically redact sensitive information. You are responsible for ensuring your data is safe to use for training.
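If you want to automate part of this, a simple scrubbing pass can catch the obvious cases before upload. The patterns below are illustrative, not exhaustive; treat this as a first filter, not a substitute for reviewing your data.

```python
import re

# Illustrative patterns only; a regex pass is a first line of defense,
# not a guarantee that all sensitive data has been removed.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "api_key": re.compile(r"\b(?:sk|pk|key)[-_][A-Za-z0-9]{16,}\b"),
}

def redact(text: str) -> str:
    """Replace anything matching a known pattern with a placeholder."""
    for name, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED {name.upper()}]", text)
    return text

print(redact("Card 4111 1111 1111 1111, key sk-abcdef1234567890XYZ"))
```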

Aim for variety within your domain

While consistency in style is important, variety in content helps the model generalize:

| ❌ Too narrow | ✅ Good variety |
| --- | --- |
| 100 examples all about "password reset" | Examples covering password reset, billing, setup, features, and troubleshooting |
| Only formal emails | Mix of emails, docs, and internal notes (all in the same professional tone) |
| One long document repeated | Multiple documents covering different aspects of your domain |
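One way to sanity-check variety is to count how many examples fall under each topic before training. This sketch assumes each record in a hypothetical filtered.jsonl file carries a "topic" label; if yours don't, even rough keyword bucketing gives a similar signal.

```python
import json
from collections import Counter

# Hypothetical layout: each record carries a "topic" label.
topics = Counter()
with open("filtered.jsonl") as f:
    for line in f:
        topics[json.loads(line).get("topic", "unknown")] += 1

total = sum(topics.values()) or 1
for topic, count in topics.most_common():
    share = count / total
    flag = "  <- consider broadening coverage" if share > 0.5 else ""
    print(f"{topic}: {count} examples ({share:.0%}){flag}")
```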

Use your description effectively

The use-case description isn't just metadata — it actively guides how your data is structured. Be specific:

"Make a bot"

"Create a customer support assistant for a B2B SaaS product. It should answer questions about features, billing, and integrations using a professional but friendly tone. The training data includes help center articles and resolved support tickets."

How much data do you need?

There's no hard minimum, but here are rough guidelines:

| Data volume | Expected quality |
| --- | --- |
| < 50 examples | Model may not pick up patterns reliably |
| 50–500 examples | Good starting point for style and domain adaptation |
| 500–5,000 examples | Strong fine-tune with consistent behavior |
| 5,000+ examples | Excellent for specialized, production-grade models |

"Examples" here means distinct pieces of content — individual conversations, documents, code files, etc. Not word count.

Iterating

Fine-tuning is iterative. Your first model probably won't be perfect. The workflow is:

  1. Upload your initial data and train
  2. Chat with the model — note where it falls short
  3. Add more data that covers the gaps
  4. Train a new model
  5. Compare and repeat

Each iteration costs nothing on the free tier (up to 3 models). On Pro, you have 15 slots to experiment with.
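A lightweight way to keep iterations honest is a local log of what each training run added and which gaps it targeted. This is purely your own bookkeeping; the file name and fields below are hypothetical, and nothing in the sketch talks to Commissioned itself.

```python
import json
from datetime import date

# Hypothetical local log entry recording one iteration of the workflow above.
entry = {
    "date": str(date.today()),
    "model": "support-bot-v2",
    "examples_added": 120,
    "gaps_targeted": ["billing edge cases", "integration setup"],
    "notes": "v1 answered billing questions too vaguely",
}

with open("training_runs.jsonl", "a") as log:
    log.write(json.dumps(entry) + "\n")
```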
