A case study showing how we built a fine-tuned Retrieval-Augmented Generation system to help DevOps engineers debug Nginx issues using StackOverflow data, Qwen2.5-1.5B-Instruct, QLoRA fine-tuning, vLLM serving on Modal, and a Docker Desktop extension interface.
DevOps engineers spend a significant amount of time digging through logs, documentation, and StackOverflow threads to debug infrastructure issues.
The question we asked:
Can we build a system that takes a DevOps engineer’s log error, understands it, and surfaces the most relevant solution — instantly?
More broadly: how do we provide a template for building a RAG system that can be fine-tuned on domain-specific data, served efficiently, and integrated directly into engineers' workflows?
Devocle is a Retrieval-Augmented Generation (RAG) system purpose-built for DevOps engineers.
Core goals:
Where it lives:
Rather than solving every DevOps problem at once, we scoped tightly.
| Constraint | Choice |
|---|---|
| Domain | Nginx — one of the most common web servers / reverse proxies |
| Data source | StackOverflow Q&A threads tagged nginx |
| Goal | Debug Nginx log errors and config issues |
StackOverflow was ideal: real engineers asking real questions, with accepted answers and vote scores as a built-in quality signal.
StackOverflow nginx threads are a goldmine for this use case:
score and accepted fields give a natural quality signal we preserve as retrieval metadata:

{
"question_id": 24319662,
"question_title": "From inside a Docker container, how do I connect to the localhost?",
"question": "I have an Nginx instance running inside a Docker container...",
"answer_id": 24326540,
"answer": "If you are using Docker-for-mac or Docker-for-Windows 18.03+...",
"accepted": true,
"score": 4902,
"question_score": 3561,
"tags": ["docker", "nginx", "docker-container", "docker-network"],
"url": "https://stackoverflow.com/questions/24319662"
}
Each record has a question body, an answer body, acceptance status, answer vote score, question score, tags, and a URL — all preserved as metadata during indexing.
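As a sketch of how these quality signals can be used, here is a pure-Python loader that filters the JSONL records on the `accepted` flag and answer `score` (the thresholds are illustrative assumptions, not values from the project):

```python
import json

def load_high_signal_pairs(path, min_score=5, accepted_only=True):
    """Load StackOverflow Q&A records from a JSONL file, keeping only
    high-quality answers. Uses the `accepted` flag and answer `score`
    as filters, mirroring the metadata preserved during indexing."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if accepted_only and not rec.get("accepted"):
                continue
            if rec.get("score", 0) < min_score:
                continue
            pairs.append(rec)
    return pairs
```

The same metadata later rides along with each vector in ChromaDB, so citations can link back to the original thread.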
The StackOverflow JSONL is re-formatted into an Alpaca-style instruction format at training time:
### Instruction:
Why is nginx returning 502 Bad Gateway when proxying to my Node.js app?
### Response:
A 502 error means nginx received an invalid response from the upstream server.
Check that your Node.js app is actually running on the port nginx is proxying to...
Why this format?
LLMs learn behaviour from the shape of their training data. A raw “question\nanswer” pair teaches the model to complete text. An instruction-response pair teaches it to follow a prompt — to behave like an assistant, not just a text predictor.
Since our base model (Qwen2.5-1.5B-Instruct) was already pre-trained with this format, using it during fine-tuning keeps the model’s existing instruction-following behaviour intact while injecting domain knowledge.
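The reformatting step might look like this (a sketch: field names follow the JSONL record above, and the exact template string is an assumption based on the standard Alpaca convention):

```python
ALPACA_TEMPLATE = """### Instruction:
{instruction}

### Response:
{response}"""

def to_alpaca(record):
    """Convert one StackOverflow Q&A record into an Alpaca-style
    instruction-response training example."""
    instruction = record["question_title"]
    # Append the question body for extra context when present.
    if record.get("question"):
        instruction = f"{instruction}\n\n{record['question']}"
    return ALPACA_TEMPLATE.format(instruction=instruction,
                                  response=record["answer"])
```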
We blend StackOverflow data with 2,000 samples from the Alpaca general instruction dataset (tatsu-lab/alpaca).
| Data source | Role |
|---|---|
| StackOverflow nginx JSONL | Domain-specific knowledge |
| Alpaca (2 000 samples) | Retain general instruction-following ability |
Why not train on domain data alone?
Fine-tuning on a narrow corpus causes catastrophic forgetting — the model overwrites its general instruction-following ability with domain-specific patterns. After a few hundred steps on nginx-only data, it may refuse to answer anything outside that domain, or start producing malformed responses.
Mixing in general Alpaca examples acts as a regulariser, anchoring the model’s conversational backbone while still shifting it toward DevOps expertise.
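The blend itself is a concatenate-and-shuffle; a minimal sketch (the 2,000-sample cap follows the post, the seed is an illustrative assumption):

```python
import random

def blend_datasets(domain_examples, general_examples,
                   n_general=2000, seed=42):
    """Blend domain data with a capped slice of general Alpaca examples,
    then shuffle so every batch sees both distributions."""
    rng = random.Random(seed)
    sampled = rng.sample(list(general_examples),
                         min(n_general, len(general_examples)))
    mixed = list(domain_examples) + sampled
    rng.shuffle(mixed)
    return mixed
```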
We chose Qwen/Qwen2.5-1.5B-Instruct as our base model.
Why Qwen2.5-1.5B?
Base model : Qwen/Qwen2.5-1.5B-Instruct
Context window : 32 768 tokens (model) — current deployment constrains prompts to 2 048 tokens via vLLM to keep latency predictable
Quantisation : 4-bit NF4 (QLoRA)
Adapter : steveoni/qwen25-1.5b-qlora-adapter
Full fine-tuning updates every weight in the model. For a 1.5B-parameter model in FP16 that’s ~3 GB of weights, plus optimizer states and gradients — easily 10–15 GB of GPU memory. That rules out free Colab.
LoRA (Low-Rank Adaptation) solves this: freeze the original weights and inject small trainable rank-decomposition matrices alongside them.
Full fine-tune: W' = W + ΔW (ΔW has the same shape as W — expensive)
LoRA: W' = W + A × B (A and B are low-rank — cheap)
If W is (d × d), A is (d × r) and B is (r × d) where r << d.
Trainable parameters drop from $d^2$ to $2 \cdot d \cdot r$ — orders of magnitude fewer.
Think of it as writing notes in the margins of a textbook instead of rewriting the whole thing.
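To make the savings concrete, here is the arithmetic for a single square projection matrix, assuming Qwen2.5-1.5B's hidden size of 1536 and the rank-8 adapter used later in this post:

```python
d, r = 1536, 8                 # hidden size (Qwen2.5-1.5B) and LoRA rank

full_params = d * d            # full fine-tune updates the whole d x d matrix
lora_params = 2 * d * r        # LoRA trains A (d x r) plus B (r x d)

print(full_params, lora_params, full_params // lora_params)
# → 2359296 24576 96
```

That is a 96x reduction per matrix, before quantization even enters the picture.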
QLoRA (Quantized LoRA) goes further: it compresses the frozen base model itself to 4-bit NF4, so even the weights we’re not training take far less VRAM.
┌─────────────────────────────────┐
│ Base Model (4-bit NF4) │ ← stored at ~950 MB instead of ~3 GB
│ (frozen — never updated) │
└─────────────────────────────────┘
+
┌─────────────────────────────────┐
│ LoRA Adapters (FP16) │ ← the only thing we actually train
│ (a few MB) │
└─────────────────────────────────┘
At compute time, weights are dequantized to FP16 on the fly. The math stays accurate; only the storage format is compressed.
Result: fine-tuning a 1.5B model on a free Colab T4 (15 GB VRAM) becomes feasible.
BNB_4BIT_QUANT_TYPE = "nf4" # NormalFloat4
BNB_4BIT_USE_DOUBLE_QUANT = True # quantize the quantization constants too
BNB_COMPUTE_DTYPE = "float16" # dequantize to FP16 for compute
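Wired into `transformers`, these constants become a `BitsAndBytesConfig` along these lines (a sketch under the settings above; the repo's exact model-loading code is not shown here):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4
    bnb_4bit_use_double_quant=True,        # quantize the quant constants too
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to FP16 for compute
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```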
| Setting | Why |
|---|---|
| `nf4` over `fp4` | NF4 is information-theoretically optimal for normally-distributed weights (neural-net weights are approximately normal) |
| Double quantization | Quantizes the quantization constants themselves, saving ~0.4 bits/parameter at no accuracy cost |
| `float16` compute | Compatible with all CUDA GPUs, including the T4 (sm_75), which does not support bfloat16 |
LORA_R = 8 # rank
LORA_ALPHA = 16 # scaling: effective contribution = alpha / r = 2×
LORA_DROPOUT = 0.05
LORA_TARGET_MODULES = [
"q_proj", "k_proj", "v_proj", "o_proj", # attention projections
"gate_proj", "up_proj", "down_proj", # MLP projections
]
Why rank 8? Rank controls the adapter’s capacity. Too low (r=4) and it can’t absorb new domain knowledge. Too high (r=32+) and it trains slowly with diminishing returns on a 1.5B model. Rank 8 is the pragmatic sweet spot.
Why target both attention and MLP? Domain knowledge (Nginx config syntax, error semantics) lives in both the attention layers and the feed-forward MLP layers. Adapting only attention is common but leaves capability on the table for instruction-following tasks.
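In `peft` terms, the adapter settings above translate to roughly the following `LoraConfig` (a sketch of the standard PEFT setup, not the repo's exact code):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                       # rank of the A/B decomposition
    lora_alpha=16,             # scaling: alpha / r = 2x contribution
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    task_type="CAUSAL_LM",
)

# get_peft_model(base_model, lora_config) then freezes the base weights
# and leaves only the adapter matrices trainable.
```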
BATCH_SIZE = 2
GRAD_ACC = 8 # effective batch = 16
EPOCHS = 1
LEARNING_RATE = 2e-4
MAX_LENGTH = 512
EARLY_STOPPING_PATIENCE = 2
VAL_SPLIT = 0.05 # 5% held out, capped at 500 examples
Why gradient accumulation? A batch size of 2 fits in VRAM, but produces noisy gradients. Accumulating over 8 steps gives an effective batch of 16 — stable training without OOM.
Why only 1 epoch? With curated instruction data, 1 epoch is often enough. More epochs on a small, narrow dataset risk overfitting — the model memorises answers rather than generalising.
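Mapped onto `transformers`, these hyperparameters become a `TrainingArguments` roughly like this (a sketch, not the repo's exact code; the output path is hypothetical):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen25-nginx-qlora",   # hypothetical output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,     # effective batch = 2 x 8 = 16
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,                         # FP16 compute for T4 compatibility
    logging_steps=10,
)
```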
Result: the adapter trains in ~30 minutes on a Colab T4 and ~10 minutes on an A10G.
After fine-tuning we need efficient inference. We use vLLM — a high-throughput LLM serving engine.
Why vLLM over plain HuggingFace generate()?
cmd = [
"python", "-m", "vllm.entrypoints.openai.api_server",
"--model", CACHE_DIR, # base Qwen2.5-1.5B
"--enable-lora",
"--lora-modules", f"finetunedqa={ADAPTER_DIR}", # adapter loaded by name
"--max-lora-rank", "8",
"--dtype", "float16",
"--max-model-len", "2048",
]
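Because the server speaks the OpenAI chat-completion protocol, selecting the fine-tuned adapter is just a matter of naming it as the model. A minimal sketch of the request a client would POST to `/v1/chat/completions` (the payload shape follows the OpenAI API; defaults are illustrative):

```python
def build_chat_request(question, adapter_name="finetunedqa",
                       max_tokens=512, temperature=0.2):
    """Build an OpenAI-style chat-completion payload for the vLLM server.
    The adapter registered via --lora-modules is selected purely by its
    name in the `model` field."""
    return {
        "model": adapter_name,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
```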
We deploy the vLLM server to Modal — a serverless GPU cloud.
modal deploy modal_deploy.py
→ https://steveoni--qwen25-vllm-lora-vllmserver-serve.modal.run
Why Modal?
| Property | Value |
|---|---|
| GPU | A10G (24 GB VRAM) |
| Min containers | 0 — scales to zero when idle |
| Timeout | 600 s |
| Weights | Cached in Modal Volumes (no re-download on cold start) |
Serverless means we pay only when the endpoint is actively serving requests. For a demo / low-traffic system this is dramatically cheaper than an always-on VM.
User query
│
▼
FastAPI (RAG server)
│
├──► ChromaDB ──► top-k relevant passages
│
▼
Prompt assembly (passages + question)
│
▼
vLLM on Modal (Qwen2.5-1.5B + LoRA adapter "finetunedqa")
│
▼
Generated answer + citations
The RAG server and LLM server are fully decoupled — the RAG system calls the Modal endpoint over HTTP using the OpenAI client, making the LLM backend swappable with zero code changes.
Fine-tuning teaches the model how to answer in a domain. It does not guarantee factual accuracy on specific questions — parametric memory degrades, and the model can still hallucinate.
RAG separates knowledge from behaviour:
Without RAG: User → LLM → Answer (from training memory — may hallucinate)
With RAG: User → Retriever → Relevant passages
↓
User + Passages → LLM → Grounded answer
Every answer is traceable to a source document. When the knowledge base changes (e.g. a new Nginx version), we re-index — no retraining required.
┌──────────────────────────────────┐
│ INDEXING (offline) │
│ │
stackoverflow │ JSONL → parse Q&A pairs │
_nginx.jsonl ───► │ → split into chunks │
│ → embed (all-MiniLM-L6-v2) │
│ → store in ChromaDB │
└──────────────────────────────────┘
┌──────────────────────────────────┐
│ QUERYING (online) │
│ │
User question ──► │ embed query │
│ → cosine similarity → ChromaDB │
│ → top-3 chunks + metadata │
│ → assemble prompt │
│ → Qwen2.5 on Modal │
│ → answer + citations │
└──────────────────────────────────┘
Each Q&A record is split into a question passage and an answer passage, then chunked with overlap:
chunk_size = 800 # characters per chunk
chunk_overlap = 150 # overlap between consecutive chunks
Why overlap?
Chunk 1: [...context A...][ ← overlap → ]
Chunk 2: [ ← overlap → ][...context B...]
A key sentence that falls at a chunk boundary won’t be lost — both adjacent chunks carry enough context for the retriever to score them correctly.
Why 800 characters?
Our vLLM deployment constrains Qwen2.5-1.5B to a 2,048-token context window (the model natively supports 32,768). With 3 retrieved chunks (~2,400 characters) plus the system prompt and question, we stay comfortably within the limit while maximising the information density per retrieval.
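The overlap behaviour can be sketched with a plain character-based chunker (a simplified stand-in for the RecursiveCharacterTextSplitter the pipeline actually uses, which also prefers splitting at paragraph and sentence boundaries):

```python
def chunk_text(text, chunk_size=800, chunk_overlap=150):
    """Split text into fixed-size chunks whose tails overlap, so a
    sentence at a chunk boundary appears intact in at least one chunk."""
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must exceed chunk_overlap")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```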
huggingface_embed_model = "all-MiniLM-L6-v2"
Why all-MiniLM-L6-v2?
Why ChromaDB?
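The online scoring step from the diagram above (embed the query, rank chunks by cosine similarity, keep the top-k) boils down to a few lines. A pure-Python illustration with toy vectors standing in for real MiniLM embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

In production, ChromaDB performs this ranking internally over the stored MiniLM vectors and returns the chunk metadata alongside each hit.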
You are a helpful assistant specialised in Nginx, web servers, and DevOps.
Use ONLY the CONTEXT PASSAGES below to answer the USER QUESTION.
If the context does not contain enough information, say so clearly.
CONTEXT PASSAGES:
[ANSWER] Nginx 502 Bad Gateway when proxying to Node.js (score=4902)
url: https://stackoverflow.com/questions/24319662
<passage text>
USER QUESTION:
Why is nginx returning 502?
Answer:
The `Use ONLY the CONTEXT PASSAGES` constraint is deliberate. Without it, the fine-tuned model will blend retrieved context with its parametric memory — producing confident but untraceable answers. Hard grounding keeps every claim auditable.
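Assembling the grounded prompt from retrieved chunks might look like this (a sketch; the metadata field names follow the dataset schema shown earlier):

```python
SYSTEM = (
    "You are a helpful assistant specialised in Nginx, web servers, and DevOps.\n"
    "Use ONLY the CONTEXT PASSAGES below to answer the USER QUESTION.\n"
    "If the context does not contain enough information, say so clearly."
)

def build_prompt(question, passages):
    """Render retrieved passages (text plus metadata) into the grounded
    prompt format shown above."""
    blocks = []
    for p in passages:
        header = (f"[{p['passage_type'].upper()}] "
                  f"{p['question_title']} (score={p['score']})")
        blocks.append(f"{header}\nurl: {p['url']}\n{p['text']}")
    context = "\n\n".join(blocks)
    return (f"{SYSTEM}\n\nCONTEXT PASSAGES:\n{context}\n\n"
            f"USER QUESTION:\n{question}\n\nAnswer:")
```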
The /query/json endpoint returns a typed Pydantic schema:
{
"answer": "A 502 means nginx can't reach the upstream. Check the upstream is running...",
"citations": [
{
"question_id": "24319662",
"question_title": "From inside a Docker container, how do I connect to localhost?",
"url": "https://stackoverflow.com/questions/24319662",
"passage_type": "answer",
"quote": "Use --network=host in your docker run command..."
}
],
"confidence": 0.91,
"recommended_next_actions": [
"Check upstream server status",
"Review nginx error_log for upstream connect() errors"
]
}
`confidence` is a top-level score on the whole answer. Each citation carries a `quote` — an exact excerpt that lets the engineer verify the source in one click.
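The schema can be mirrored with plain dataclasses (the actual server uses Pydantic; this stdlib sketch just shows the shape):

```python
from dataclasses import dataclass, field

@dataclass
class Citation:
    question_id: str
    question_title: str
    url: str
    passage_type: str          # "question" or "answer"
    quote: str                 # exact excerpt for one-click verification

@dataclass
class QueryResponse:
    answer: str
    citations: list            # list of Citation
    confidence: float          # score on the whole answer, 0-1
    recommended_next_actions: list = field(default_factory=list)
```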
DevOps engineers already live in Docker Desktop. Rather than asking them to switch to a browser tab or CLI tool, we brought Devocle directly into their workflow.
The extension registers a dashboard tab — a single-page HTML/CSS/JS UI, zero additional installs required:
{
"schema": "0.3.0",
"version": "1.0.0",
"title": "Devocle",
"vendor": "Brassin",
"description": "Devocle — ask any Nginx / DevOps question. Answers sourced from StackOverflow and generated by Qwen2.5 fine-tuned on Modal.",
"ui": {
"dashboard-tab": {
"title": "Devocle",
"root": "/ui",
"src": "index.html"
}
}
}
┌──────────────────────────────────────────────────────────────┐
│ Docker Desktop │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Devocle Extension Tab │ │
│ │ "Why is nginx returning 502 on /api?" │ │
│ └────────────────────┬───────────────────────────────────┘ │
└───────────────────────┼──────────────────────────────────────┘
│ HTTP POST /query/json
▼
┌───────────────────────┐
│ FastAPI RAG Server │
│ (devocle-rag) │
└─────────┬─────────────┘
│
┌───────────┴───────────┐
▼ ▼
┌───────────────┐ ┌──────────────────────┐
│ ChromaDB │ │ Qwen2.5-1.5B + LoRA │
│ (SO nginx │ │ via vLLM on Modal │
│ vectors) │ │ (OpenAI-compat API) │
└───────────────┘ └──────────────────────┘
| Component | Technology |
|---|---|
| Dataset | StackOverflow nginx JSONL (+ 2 000 Alpaca samples) |
| Base model | Qwen2.5-1.5B-Instruct (Apache 2.0) |
| Fine-tuning | QLoRA — 4-bit NF4, rank 8, bitsandbytes |
| Adapter | PEFT LoRA → HuggingFace Hub (steveoni/qwen25-1.5b-qlora-adapter) |
| Serving | vLLM with LoRA hot-swap, OpenAI-compatible API |
| Cloud deploy | Modal (A10G, serverless, scales to zero) |
| Embeddings | all-MiniLM-L6-v2 (local, no API key) |
| Vector DB | ChromaDB (local dev → ChromaDB Cloud) |
| Chunking | RecursiveCharacterTextSplitter — 800 chars / 150 overlap |
| API | FastAPI — /query (text) + /query/json (structured + citations) |
| Interface | Docker Desktop Extension |
The full source code is available here: Devocle