In a recent deep dive, creator Nichonauta compared the performance of original, distilled, and fine-tuned large language models (LLMs) run locally, with a focus on Qwen 3.5, revealing significant trade-offs between speed and accuracy.
Key Findings
- Original vs. Distilled/Fine-Tuned: While distilled models run faster on consumer hardware, they often lose capability and produce more errors, especially in code generation and execution tasks.
- Fine-Tuning Limitations: Contrary to popular belief, fine-tuning does not add new knowledge to a model; it can only adjust behavior or style. Attempting to inject facts through fine-tuning leads to errors and hallucinations.
- Hardware Matters: On powerful GPUs like the RTX 5090, running larger original models (e.g., Qwen 3.5) provides far better results than their trimmed counterparts.
"Fine-tuning is about shaping behavior, not adding facts. If you want a model to know something new, use retrieval-augmented generation (RAG) or retrain from scratch."
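The RAG approach recommended above can be sketched minimally: instead of baking facts into the model's weights, relevant passages are retrieved at query time and prepended to the prompt. The toy corpus, bag-of-words retrieval, and prompt template below are illustrative assumptions, not anything from the video; production systems use embedding models and a vector store instead.

```python
# Minimal sketch of the retrieval step in RAG: fetch relevant text at
# query time and prepend it to the prompt, rather than fine-tuning
# facts into the model. Toy bag-of-words similarity for illustration.
from collections import Counter
import math

def bow(text):
    """Bag-of-words vector as a Counter of lowercase tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=1):
    """Return the k corpus passages most similar to the query."""
    q = bow(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, bow(doc)), reverse=True)
    return ranked[:k]

def build_prompt(query, corpus):
    """Prepend retrieved context so the model answers from it."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical corpus standing in for a real document store.
corpus = [
    "The warehouse inventory system was migrated to PostgreSQL in 2024.",
    "Distilled models trade accuracy for speed on consumer hardware.",
]
prompt = build_prompt("When was the inventory system migrated?", corpus)
print(prompt)
```

The final prompt would then be sent to the local model unchanged; the model "knows" the new fact only for the duration of that request, which is exactly the behavior fine-tuning cannot reliably provide.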
Recommendations
Users with high-end hardware should stick with the original, unaltered model. Those on limited resources should be prepared for a drop in quality when using distilled versions.
Nichonauta also advises against buying a dedicated local AI server solely for running fine-tuned models, as the performance gain rarely justifies the cost.