Self-Host with LoRA

Download your LoRA adapter and run your fine-tuned model on your own infrastructure.

This guide walks through fine-tuning a Qwen 3 8B model, downloading the LoRA adapter, and deploying it on your own infrastructure.

Why self-host?

  • No rate limits — serve as many requests as your hardware allows
  • Data residency — all inference happens on your infrastructure
  • Offline / air-gapped — works without internet after downloading the adapter
  • Cost at scale — amortize GPU costs across high request volumes
  • Customization — tune inference parameters, batching, and caching

Step by step

Fine-tune on Qwen 3 8B

Create a fine-tune on Commissioned using Qwen 3 8B as the base model. Training takes ~5 minutes.

Upload your data and describe your use case as usual — the only difference is the model selection.

Download the adapter

Once the model shows Succeeded:

  1. Go to your dashboard
  2. Click Download adapter on the Qwen model card
  3. Save the .zip file

Extract it — you'll find the LoRA weight files inside.
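
If you'd rather script the extraction, here is a minimal Python sketch. The archive name adapter.zip and the output directory are assumptions; the exact files inside depend on how the adapter was exported, but PEFT-style adapters typically include adapter_config.json and adapter_model.safetensors.

import zipfile

# Unpack the downloaded adapter archive (the filename is an assumption)
with zipfile.ZipFile("adapter.zip") as zf:
    zf.extractall("qwen3-8b-adapter")
    print(zf.namelist())  # inspect what the archive contains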

Set up your serving infrastructure

You need:

  • A machine with a GPU (NVIDIA recommended, 16 GB+ VRAM for an 8B model)
  • The base Qwen 3 8B model weights (downloaded from Hugging Face; see the snippet after this list)
  • Your LoRA adapter files
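
You can fetch the base weights ahead of time with the huggingface_hub client (vLLM will also download them automatically on first run, so this step is optional). A minimal sketch:

from huggingface_hub import snapshot_download

# Download the Qwen 3 8B base weights into the local Hugging Face cache;
# pass local_dir="..." if you want them in a specific directory instead
snapshot_download("Qwen/Qwen3-8B")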

vLLM is the best option for production serving — high throughput, supports multiple concurrent LoRA adapters, and exposes an OpenAI-compatible API.

pip install vllm

vllm serve Qwen/Qwen3-8B \
  --enable-lora \
  --lora-modules my-model=/path/to/adapter \
  --port 8000

Your model is now accessible at http://localhost:8000/v1/chat/completions with model: "my-model".
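
To confirm the adapter is registered, query the server's OpenAI-compatible model list. A quick sketch using the requests library (host and port match the serve command above):

import requests

# The model list should include both the base model and the "my-model" adapter
models = requests.get("http://localhost:8000/v1/models").json()
print([m["id"] for m in models["data"]])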

Ollama is the easiest way to run models locally — one command to set up.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Create a Modelfile
cat > Modelfile << 'EOF'
# Qwen 3 8B base model from the Ollama library
FROM qwen3:8b
# Extracted LoRA adapter (safetensors directory or GGUF file)
ADAPTER /path/to/adapter
EOF

# Build and run
ollama create my-model -f Modelfile
ollama run my-model
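
Ollama also exposes an OpenAI-compatible endpoint (port 11434 by default), so the client code in the "Point your application at it" step works against it as well. A minimal sketch:

from openai import OpenAI

# Ollama's OpenAI-compatible API; the key is ignored but must be non-empty
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)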

llama.cpp is lightweight and works on CPU (slower) or GPU.

# Build llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j

# Serve with LoRA
./build/bin/llama-server \
  -m /path/to/qwen3-8b.gguf \
  --lora /path/to/adapter.gguf \
  --port 8080

Note: you may need to convert the adapter to GGUF format first (llama.cpp includes a convert_lora_to_gguf.py script for this).

Point your application at it

If you're using vLLM or another tool that exposes an OpenAI-compatible API, update your code's base URL:

from openai import OpenAI

# Before: Commissioned hosted
# client = OpenAI(
#     base_url="https://app.commissioned.tech/v1",
#     api_key="your-api-key",
# )

# After: self-hosted
client = OpenAI(
    base_url="http://your-server:8000/v1",
    api_key="not-needed",  # or set up your own auth
)

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
)

The rest of your code stays the same — same SDK, same interface.
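
If you want basic authentication on a vLLM deployment, start the server with its --api-key flag and pass the same token from the client. A sketch (the token value is a placeholder):

from openai import OpenAI

# Matches a server started with: vllm serve ... --api-key <your-token>
client = OpenAI(
    base_url="http://your-server:8000/v1",
    api_key="<your-token>",  # placeholder; use your own secret
)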

Hardware recommendations

Use case                  | GPU               | VRAM        | Notes
Development / testing     | Any NVIDIA GPU    | 16 GB+      | RTX 4090, A4000, etc.
Production (single user)  | Any NVIDIA GPU    | 16 GB+      | Same as above
Production (multi-user)   | A100, H100, L40S  | 40–80 GB    | Higher throughput, more concurrent requests
CPU-only                  | None              | 32 GB+ RAM  | Possible with llama.cpp, much slower

You can also use cloud GPUs from providers like AWS (p4d instances), GCP (A100 VMs), Lambda Labs, RunPod, or Vast.ai. vLLM works well in containers — deploy on any Kubernetes cluster with GPU nodes.
