A new video from the YouTube channel Nichonauta puts two local language models head-to-head, comparing their performance on a consumer GPU. The analysis focuses on token generation speed, memory consumption, and code quality for 2-billion-parameter (2B) and 4-billion-parameter (4B) models.
The test setup explores the relationship between model parameter count, memory bandwidth, and generation speed. The 2B model is first evaluated on code generation tasks, followed by an analysis of memory usage and context window size. The video then contrasts Mixture of Experts (MoE) architectures with dense models, before moving on to the 4B model comparison.
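The parameter/bandwidth/speed relationship comes down to a common rule of thumb: single-stream decoding is usually memory-bound, so each generated token requires streaming roughly all active weights from VRAM. The sketch below illustrates that reasoning; the bandwidth figure and model configurations are illustrative assumptions, not measurements from the video.

```python
# Back-of-envelope decode-speed estimate for memory-bound generation:
# upper-bound tokens/s = memory bandwidth / bytes of weights read per token.
# All numbers below are assumed for illustration, not taken from the video.

def est_tokens_per_sec(active_params_b: float, bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    """Rough throughput ceiling: bandwidth divided by bytes read per token."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

BANDWIDTH_GB_S = 448.0  # e.g. an RTX 4060 Ti-class consumer GPU (assumed)

# A dense model reads all of its parameters per token; an MoE model of the
# same total size only reads its active experts, hence the speed advantage.
for label, active_b in [("2B dense", 2.0), ("4B dense", 4.0),
                        ("4B MoE, ~1B active (hypothetical)", 1.0)]:
    tps = est_tokens_per_sec(active_b, 2.0, BANDWIDTH_GB_S)  # fp16 weights
    print(f"{label}: ~{tps:.0f} tok/s ceiling")
```

Under these assumptions the 2B dense model tops out near twice the 4B model's speed, and an MoE with fewer active parameters can beat both, which matches the dense-versus-MoE framing the video uses.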
A key topic is the impact of the KV cache on GPU memory: because the cache grows linearly with context length and batch size, it can sharply limit the maximum context and batch a given card can hold. The video also addresses common questions about running models across multiple GPUs or on CPU.
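To make that linear growth concrete, here is a minimal sketch of the standard KV-cache size formula. The architecture numbers are hypothetical, chosen to resemble a small 2B-class model with grouped-query attention; they are not taken from the video.

```python
# KV-cache sizing: the cache holds one key and one value vector per layer
# per token, so memory grows linearly with context length and batch size.
# The config below is a hypothetical 2B-class model, assumed for illustration.

def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int,
                bytes_per_elem: int = 2) -> float:
    """Bytes = 2 (K and V) * layers * kv_heads * head_dim * tokens, as GB."""
    total = (2 * num_layers * num_kv_heads * head_dim
             * seq_len * batch_size * bytes_per_elem)
    return total / 1e9

# Assumed config: 24 layers, 8 KV heads (GQA), head_dim 128, fp16 cache.
for ctx in (4_096, 32_768, 131_072):
    print(f"context {ctx:>7}: ~{kv_cache_gb(24, 8, 128, ctx, 1):.2f} GB")
```

With these assumptions the cache climbs from roughly 0.4 GB at a 4K context to nearly 13 GB at 128K, which is why long contexts can exhaust a consumer GPU even when the weights themselves fit comfortably.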
The channel encourages viewers to subscribe and join the discussion about local LLM deployment.