Self-Host with LoRA

Download your LoRA adapter and run your fine-tuned model on your own infrastructure.

This guide walks through fine-tuning a Qwen 3 8B model, downloading the LoRA adapter, and deploying it on your own infrastructure.

Why self-host?

  • No rate limits — serve as many requests as your hardware allows
  • Data residency — all inference happens on your infrastructure
  • Offline / air-gapped — works without internet after downloading the adapter
  • Cost at scale — amortize GPU costs across high request volumes
  • Customization — tune inference parameters, batching, and caching

Step by step

Fine-tune on Qwen 3 8B

Create a fine-tune on Commissioned using Qwen 3 8B as the base model. Training takes ~5 minutes.

Upload your data and describe your use case as usual — the only difference is the model selection.

Download the adapter

Once the model shows Succeeded:

  1. Go to your dashboard
  2. Click Download adapter on the Qwen model card
  3. Save the .zip file

Extract it — you'll find the LoRA weight files inside.
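
If you'd rather script the extraction, here is a minimal Python sketch. The archive name adapter.zip and the output directory are assumptions; the exact files inside depend on how the adapter was exported, but PEFT-style adapters typically include adapter_config.json and adapter_model.safetensors.

import zipfile

# Unpack the downloaded adapter archive (the filename is an assumption)
with zipfile.ZipFile("adapter.zip") as zf:
    zf.extractall("qwen3-8b-adapter")
    print(zf.namelist())  # inspect what the archive contains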

Set up your serving infrastructure

You need:

  • A machine with a GPU (NVIDIA recommended, 16 GB+ VRAM for an 8B model)
  • The base Qwen 3 8B model weights (downloaded from Hugging Face; see the snippet after this list)
  • Your LoRA adapter files
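
You can fetch the base weights ahead of time with the huggingface_hub client (vLLM will also download them automatically on first run, so this step is optional). A minimal sketch:

from huggingface_hub import snapshot_download

# Download the Qwen 3 8B base weights into the local Hugging Face cache;
# pass local_dir="..." if you want them in a specific directory instead
snapshot_download("Qwen/Qwen3-8B")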

vLLM is the best option for production serving — high throughput, supports multiple concurrent LoRA adapters, and exposes an OpenAI-compatible API.

pip install vllm

vllm serve Qwen/Qwen3-8B \
  --enable-lora \
  --lora-modules my-model=/path/to/adapter \
  --port 8000

Your model is now accessible at http://localhost:8000/v1/chat/completions with model: "my-model".
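
To confirm the adapter is registered, query the server's OpenAI-compatible model list. A quick sketch using the requests library (host and port match the serve command above):

import requests

# The model list should include both the base model and the "my-model" adapter
models = requests.get("http://localhost:8000/v1/models").json()
print([m["id"] for m in models["data"]])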

Ollama is the easiest way to run models locally — one command to set up.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Create a Modelfile
cat > Modelfile << 'EOF'
# Qwen 3 8B base model from the Ollama library
FROM qwen3:8b
# Extracted LoRA adapter (safetensors directory or GGUF file)
ADAPTER /path/to/adapter
EOF

# Build and run
ollama create my-model -f Modelfile
ollama run my-model
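
Ollama also exposes an OpenAI-compatible endpoint (port 11434 by default), so the client code in the "Point your application at it" step works against it as well. A minimal sketch:

from openai import OpenAI

# Ollama's OpenAI-compatible API; the key is ignored but must be non-empty
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)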

llama.cpp is lightweight and works on CPU (slower) or GPU.

# Build llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j

# Serve with LoRA
./build/bin/llama-server \
  -m /path/to/qwen3-8b.gguf \
  --lora /path/to/adapter.gguf \
  --port 8080

Note: you may need to convert the adapter to GGUF format first (llama.cpp includes a convert_lora_to_gguf.py script for this).

Point your application at it

If you're using vLLM or another tool that exposes an OpenAI-compatible API, update your code's base URL:

from openai import OpenAI

# Before: Commissioned hosted
# client = OpenAI(
#     base_url="https://app.commissioned.tech/v1",
#     api_key="your-api-key",
# )

# After: self-hosted
client = OpenAI(
    base_url="http://your-server:8000/v1",
    api_key="not-needed",  # or set up your own auth
)

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
)

The rest of your code stays the same — same SDK, same interface.
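
If you want basic authentication on a vLLM deployment, start the server with its --api-key flag and pass the same token from the client. A sketch (the token value is a placeholder):

from openai import OpenAI

# Matches a server started with: vllm serve ... --api-key <your-token>
client = OpenAI(
    base_url="http://your-server:8000/v1",
    api_key="<your-token>",  # placeholder; use your own secret
)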

Hardware recommendations

Use case                  | GPU               | VRAM        | Notes
Development / testing     | Any NVIDIA GPU    | 16 GB+      | RTX 4090, A4000, etc.
Production (single user)  | Any NVIDIA GPU    | 16 GB+      | Same as above
Production (multi-user)   | A100, H100, L40S  | 40–80 GB    | Higher throughput, more concurrent requests
CPU-only                  | None              | 32 GB+ RAM  | Possible with llama.cpp, much slower

You can also use cloud GPUs from providers like AWS (p4d instances), GCP (A100 VMs), Lambda Labs, RunPod, or Vast.ai. vLLM works well in containers — deploy on any Kubernetes cluster with GPU nodes.
