Data Overview
How Commissioned handles your data — from upload through training.
Your data is the most important input to a fine-tune. The quality, relevance, and volume of your data directly determine how well your model performs.
What Commissioned does with your data
When you upload files, Commissioned runs an automated pipeline:
- Parsing — extracts text from PDFs, parses JSON/JSONL structures, reads plain text and Markdown
- Cleaning — removes boilerplate, headers/footers, encoding artifacts, and noise
- Deduplication — identifies and removes repeated content that would bias training
- Formatting — converts cleaned data into the provider-specific training format (OpenAI, Gemini, or Qwen)
- Validation — checks that the result meets the provider's requirements before submitting
You don't need to do any of this yourself. Upload raw files and Commissioned handles the rest.
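For intuition, here is a minimal sketch of that kind of pipeline in Python. The function and file names are illustrative only, not Commissioned's actual internals, and it assumes your raw data has already been parsed into prompt/response pairs.

```python
import hashlib
import json

def prepare_examples(pairs):
    """Toy clean -> dedupe -> format -> validate pass over (prompt, response) pairs."""
    seen = set()
    examples = []

    for prompt, response in pairs:
        # Cleaning: collapse whitespace noise; a real pipeline also strips
        # headers/footers, boilerplate, and encoding artifacts.
        prompt, response = " ".join(prompt.split()), " ".join(response.split())
        if not prompt or not response:
            continue

        # Deduplication: drop repeated content that would bias training.
        digest = hashlib.sha256(f"{prompt}\n{response}".encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)

        # Formatting: wrap the pair in a chat-style training record.
        examples.append({"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]})

    # Validation: fail early rather than submit an empty dataset.
    if not examples:
        raise ValueError("no usable training examples after cleaning")
    return examples


# Each line of the resulting .jsonl file is one JSON-encoded example.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in prepare_examples([("What is Commissioned?", "A fine-tuning service.")]):
        f.write(json.dumps(ex) + "\n")
```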
Your use-case description matters here. It tells Commissioned how to structure your data — whether to treat it as conversation pairs, reference material, stylistic examples, etc.
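As a rough illustration, the same sentence of raw text could be shaped differently depending on the use case you describe. The record shapes below are hypothetical, simplified examples rather than Commissioned's exact output format.

```python
raw = "Refunds are issued within 14 days of a return being received."

# Use case described as a customer-support assistant: the text becomes
# one side of a conversation pair for the model to imitate.
conversation_pair = {"messages": [
    {"role": "user", "content": "How long do refunds take?"},
    {"role": "assistant", "content": raw},
]}

# Use case described as reference material: the text is kept as a passage
# for the model to absorb, not a turn to reproduce verbatim.
reference_chunk = {"source": "refund-policy.md", "text": raw}
```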
Supported formats
| Format | Extension | Max size | Best for |
|---|---|---|---|
| JSONL | .jsonl | 5 GB | Pre-structured conversation data |
| JSON | .json | 5 GB | Structured data, nested content |
| PDF | .pdf | 5 GB | Documents, papers, reports |
| Plain text | .txt | 5 GB | Any unstructured text |
| Markdown | .md | 5 GB | Documentation, articles, notes |
You can upload multiple files in a single job. Mix and match formats as needed.
See File Formats for detailed guidance on each format.
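If you want to sanity-check files locally before uploading, a check like the one below mirrors the table above. The limits come from the table; the helper itself is not part of Commissioned, and the byte count assumes a decimal 5 GB.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".jsonl", ".json", ".pdf", ".txt", ".md"}
MAX_BYTES = 5 * 10**9  # 5 GB per file, per the table above (assumed decimal)

def check_upload(path: str) -> None:
    p = Path(path)
    if p.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"{p.name}: unsupported format '{p.suffix}'")
    if p.stat().st_size > MAX_BYTES:
        raise ValueError(f"{p.name}: larger than the 5 GB per-file limit")

check_upload("support-transcripts.jsonl")  # hypothetical filename
```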
Data privacy
- Your data is used only to train your model — it is never shared with other users or used to train other models
- Files are encrypted in transit (HTTPS) and at rest
- You can delete your models and associated data at any time
- See the Terms of Service and Privacy Policy for full details