In a recent hands-on comparison, YouTuber Nichonauta tested multiple distilled versions of the Qwen 3.5 9B language model against the original, using a practical programming challenge: building a Tetris game. The results were stark—every distilled model underperformed, producing buggy or incomplete code, while the original model delivered a working solution.
"The distilled models were a disaster," Nichonauta noted, observing that the efficiency gains claimed for techniques such as speculative decoding and fine-tuning came at a severe cost in output quality.
The video, published May 1, 2026, systematically evaluated several versions, including the original Qwen 3.5 9B, multiple distilled variants (e.g., Omnicoder), and fine-tuned iterations. Nichonauta found that benchmarks often misrepresent real-world performance: the distilled models scored well on synthetic tests but failed on an actual development task.
Key observations:
- The original Qwen 3.5 9B produced a fully functional Tetris game with correct logic and rendering.
- The distilled models produced code that crashed, implemented incorrect game mechanics, or omitted critical features.
- Results underscore that compression techniques may sacrifice reliability, especially for code generation.
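For context on what "distillation" trades away: the standard knowledge-distillation objective trains a smaller student model to match the full model's output distribution rather than only the correct token, so any probability mass the student fails to capture is quality lost relative to the original. The sketch below shows this objective in a minimal, generic form; the actual recipes used for the Qwen variants in the video are not described in the article, so this is an illustrative assumption, not their method.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.

    This is the classic distillation loss (Hinton et al.-style), shown
    here as a generic sketch -- NOT the specific training setup of the
    distilled Qwen 3.5 9B variants discussed above.
    """
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    return float(np.sum(p * (np.log(p) - np.log(q))))

# The loss is zero when the student exactly reproduces the teacher's
# distribution and grows as the two diverge.
perfect = distill_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
diverged = distill_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

Because the student only approximates this distribution with fewer parameters, rare but critical behaviors (such as getting every detail of game logic right) are exactly the kind of capability that can be lost, which is consistent with the failures Nichonauta observed.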
The creator advises developers to test models on their specific use cases rather than relying solely on published benchmarks. For programming tasks, the original model remains the safest choice until distillation methods improve.