Data Best Practices
Get the most out of fine-tuning with high-quality training data.
Quality over quantity
A fine-tuned model mirrors your data. If your data is high-quality, focused, and consistent, the model will be too. If it's noisy, contradictory, or off-topic, the model will reflect that.
More data isn't always better. 500 excellent examples will outperform 5,000 mediocre ones.
Guidelines
Include representative examples
Your data should cover the kinds of tasks you want the model to perform. If you want a support agent, include actual support conversations — not marketing copy.
Be consistent in tone and style
If your data mixes formal academic writing with casual Slack messages, the model won't have a clear signal for what style to use. Either:
- Filter your data to one consistent style, or
- State in your use-case description which style the model should prioritize
Remove sensitive information
Before uploading, strip out:
- Passwords, API keys, and secrets
- Personal identification numbers (SSN, credit cards)
- Private customer data you don't have consent to use
Commissioned does not automatically redact sensitive information. You are responsible for ensuring your data is safe to use for training.
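A simple pre-upload pass can catch the most common leaks. This is a minimal sketch, not an exhaustive scrubber: the key, SSN, and card patterns below are illustrative assumptions you should adapt to the formats that actually appear in your data.

```python
import re

# Illustrative patterns only -- tune these to your own data.
PATTERNS = {
    "api_key": re.compile(r"\b(?:sk|pk|key)[-_][A-Za-z0-9]{16,}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> str:
    """Replace likely secrets and PII with labeled placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text
```

Spot-check the output by hand before uploading; regex-based redaction misses anything that doesn't match a known shape.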
Aim for variety within your domain
While consistency in style is important, variety in content helps the model generalize:
| ❌ Too narrow | ✅ Good variety |
|---|---|
| 100 examples all about "password reset" | Examples covering password reset, billing, setup, features, and troubleshooting |
| Only formal emails | Mix of emails, docs, and internal notes (all in the same professional tone) |
| One long document repeated | Multiple documents covering different aspects of your domain |
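A quick check for the "one long document repeated" failure mode is deduplication before upload. A minimal sketch using exact hashing (normalized for case and whitespace; catching near-duplicates would need fuzzier techniques like shingling):

```python
import hashlib

def dedupe(examples: list[str]) -> list[str]:
    """Drop exact duplicates, ignoring case and whitespace differences."""
    seen = set()
    unique = []
    for text in examples:
        # Normalize whitespace and case so trivially reformatted copies collide.
        key = hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```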
Use your description effectively
The use-case description isn't just metadata — it actively guides how your data is structured. Be specific:
❌ "Make a bot"
✅ "Create a customer support assistant for a B2B SaaS product. It should answer questions about features, billing, and integrations using a professional but friendly tone. The training data includes help center articles and resolved support tickets."
How much data do you need?
There's no hard minimum, but here are rough guidelines:
| Data volume | Expected quality |
|---|---|
| < 50 examples | Model may not pick up patterns reliably |
| 50–500 examples | Good starting point for style and domain adaptation |
| 500–5,000 examples | Strong fine-tune with consistent behavior |
| 5,000+ examples | Excellent for specialized, production-grade models |
"Examples" here means distinct pieces of content — individual conversations, documents, code files, etc. Not word count.
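If you want a quick sanity check on your dataset size, the table above translates directly into a lookup. The tier labels are just the table's wording, and `volume_tier` is a hypothetical helper, not part of any Commissioned API:

```python
# Rough thresholds taken from the volume table above.
TIERS = [
    (5000, "excellent for specialized, production-grade models"),
    (500, "strong fine-tune with consistent behavior"),
    (50, "good starting point for style and domain adaptation"),
    (0, "may not pick up patterns reliably"),
]

def volume_tier(n_examples: int) -> str:
    """Map a count of distinct examples to the rough quality tier above."""
    for floor, label in TIERS:
        if n_examples >= floor:
            return label
    return TIERS[-1][1]
```

Remember to count distinct examples, not words, before applying the thresholds.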
Iterating
Fine-tuning is iterative. Your first model probably won't be perfect. The workflow is:
- Upload your initial data and train
- Chat with the model — note where it falls short
- Add more data that covers the gaps
- Train a new model
- Compare and repeat
Each iteration costs nothing on the free tier (up to 3 models). On Pro, you have 15 slots to experiment with.