Self-Host with LoRA
Download your LoRA adapter and run your fine-tuned model on your own infrastructure.
This guide walks through fine-tuning a Qwen 3 8B model, downloading the LoRA adapter, and deploying it on your own infrastructure.
Why self-host?
- No rate limits — serve as many requests as your hardware allows
- Data residency — all inference happens on your infrastructure
- Offline / air-gapped — works without internet after downloading the adapter
- Cost at scale — amortize GPU costs across high request volumes
- Customization — tune inference parameters, batching, and caching
Step by step
Fine-tune on Qwen 3 8B
Create a fine-tune on Commissioned using Qwen 3 8B as the base model. Training takes ~5 minutes.
Upload your data and describe your use case as usual — the only difference is the model selection.
Download the adapter
Once the model shows Succeeded:
- Go to your dashboard
- Click Download adapter on the Qwen model card
- Save the .zip file
Extract it — you'll find the LoRA weight files inside.
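A quick sketch of extracting the archive; the exact file names inside depend on the export, but PEFT-style LoRA adapters typically include an adapter_config.json and an adapter_model.safetensors:

```bash
# Extract the adapter archive and inspect its contents
# (adapter.zip is whatever name you saved the download under)
unzip adapter.zip -d ./adapter
ls ./adapter
```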
Set up your serving infrastructure
You need:
- A machine with a GPU (NVIDIA recommended, 16 GB+ VRAM for 8B model)
- The base Qwen 3 8B model weights (downloaded from Hugging Face; a download sketch follows this list)
- Your LoRA adapter files
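If you don't already have the base weights locally, they can be pulled with the Hugging Face CLI; a brief sketch, assuming you install the huggingface_hub CLI and pick your own target directory:

```bash
# Download the base Qwen 3 8B weights from Hugging Face
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-8B --local-dir ./Qwen3-8B
```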
vLLM is the best option for production serving: high throughput, support for multiple concurrent LoRA adapters, and an OpenAI-compatible API.
```bash
# Install vLLM
pip install vllm

# Serve the base model with your LoRA adapter attached
vllm serve Qwen/Qwen3-8B \
  --enable-lora \
  --lora-modules my-model=/path/to/adapter \
  --port 8000
```

Your model is now accessible at http://localhost:8000/v1/chat/completions with model: "my-model".
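To confirm the server is working, you can hit the endpoint directly; a quick curl sketch (host and port match the serve command above):

```bash
# Send a test request to the OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-model",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```

Because --lora-modules accepts multiple name=path pairs, one vLLM instance can serve several adapters over the same base model, selected per request via the model field.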
Ollama is the easiest way to run models locally — one command to set up.
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Create a Modelfile that layers your LoRA adapter on the base model
cat > Modelfile << 'EOF'
# Base model must match the one the adapter was fine-tuned on
FROM qwen3:8b
ADAPTER /path/to/adapter
EOF

# Build and run
ollama create my-model -f Modelfile
ollama run my-model
```
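Ollama also serves an OpenAI-compatible API on port 11434, so you can smoke-test the adapter with a plain HTTP request; a brief sketch (the model name matches the ollama create step above):

```bash
# Query Ollama's OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-model",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```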
llama.cpp is lightweight and works on CPU (slower) or GPU.
```bash
# Build llama.cpp (recent versions build with CMake)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j

# Serve with LoRA
./build/bin/llama-server \
  -m /path/to/qwen3-8b.gguf \
  --lora /path/to/adapter.gguf \
  --port 8080
```

Note: you may need to convert the adapter to GGUF format first.
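The conversion scripts ship with llama.cpp itself. A hedged sketch, assuming a safetensors-format adapter; script names and flags can vary between llama.cpp versions:

```bash
# From the llama.cpp checkout: install the conversion dependencies
pip install -r requirements.txt

# Convert the base model to GGUF (skip if you already have qwen3-8b.gguf)
python convert_hf_to_gguf.py /path/to/Qwen3-8B --outfile qwen3-8b.gguf

# Convert the LoRA adapter to GGUF, pointing at the same base model
python convert_lora_to_gguf.py /path/to/adapter \
  --base /path/to/Qwen3-8B \
  --outfile adapter.gguf
```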
Point your application at it
If you're using vLLM or another tool that exposes an OpenAI-compatible API, update your code's base URL:
```python
from openai import OpenAI

# Before: Commissioned hosted
# client = OpenAI(
#     base_url="https://app.commissioned.tech/v1",
#     api_key="your-api-key",
# )

# After: self-hosted
client = OpenAI(
    base_url="http://your-server:8000/v1",
    api_key="not-needed",  # or set up your own auth
)

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
```

The rest of your code stays the same — same SDK, same interface.
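The api_key comment above hints at adding auth; vLLM can enforce a bearer token with its --api-key option. A minimal sketch, where change-me is a placeholder token of your choosing:

```bash
# Require a bearer token on the self-hosted vLLM server
vllm serve Qwen/Qwen3-8B \
  --enable-lora \
  --lora-modules my-model=/path/to/adapter \
  --api-key change-me \
  --port 8000
```

Clients then pass the same value as api_key; requests without it are rejected.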
Hardware recommendations
| Use case | GPU | VRAM | Notes |
|---|---|---|---|
| Development / testing | Any NVIDIA GPU | 16 GB+ | RTX 4090, A4000, etc. |
| Production (single user) | Any NVIDIA GPU | 16 GB+ | Same as above |
| Production (multi-user) | A100, H100, L40S | 40–80 GB | Higher throughput, more concurrent requests |
| CPU-only | None | 32 GB+ RAM | Possible with llama.cpp, much slower |
You can also use cloud GPUs from providers like AWS (p4d instances), GCP (A100 VMs), Lambda Labs, RunPod, or Vast.ai. vLLM works well in containers — deploy on any Kubernetes cluster with GPU nodes.
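As a containerized starting point, here is a hedged sketch using the vllm/vllm-openai image; the image tag, mount paths, and GPU options are assumptions to adapt to your environment:

```bash
# Run the vLLM OpenAI-compatible server in Docker with the LoRA adapter mounted
docker run --gpus all --ipc=host -p 8000:8000 \
  -v /path/to/adapter:/adapters/my-model \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-8B \
  --enable-lora \
  --lora-modules my-model=/adapters/my-model
```

The same arguments translate directly to a Kubernetes container spec with a GPU resource request.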