Sakana AI has unveiled KAME, a novel tandem speech-to-speech architecture that dynamically injects knowledge from large language models (LLMs) into speech processing in real time. The system addresses a key challenge in voice-enabled AI: achieving low-latency, natural conversation while leveraging the vast contextual understanding of LLMs.
Traditional cascaded systems (speech-to-text → LLM → text-to-speech) introduce delays and lose prosodic cues. KAME instead uses a lightweight “adapter” module that sits between a streaming speech encoder and a speech decoder. The adapter can query an LLM at arbitrary intervals, retrieving relevant knowledge (such as factual information, sentiment, or style) without waiting for the full sentence to complete.
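To make the control flow concrete, the snippet below is a minimal, hypothetical sketch of an adapter loop of this kind: a streaming encoder produces partial representations, the adapter queries an LLM at a fixed interval rather than at sentence boundaries, and the decoder is conditioned on the latest hint. None of these class or function names come from the KAME paper; the encoder, LLM, and decoder are stubbed out so the loop runs on its own.

```python
# Hypothetical adapter loop (not from the KAME paper): all components
# are stubs so the knowledge-injection control flow can run standalone.

from dataclasses import dataclass, field

@dataclass
class AdapterState:
    """Running context the adapter keeps between speech frames."""
    frames_seen: int = 0
    llm_hint: str = ""                 # latest knowledge retrieved from the LLM
    partial_transcript: list = field(default_factory=list)

def stub_encoder(audio_chunk: bytes) -> str:
    """Stand-in for a streaming speech encoder emitting partial tokens."""
    return f"<token:{len(audio_chunk)}>"

def stub_llm(context: str) -> str:
    """Stand-in for an LLM call returning a short knowledge/style hint."""
    return f"hint-for[{context[-24:]}]"

def stub_decoder(token: str, hint: str) -> bytes:
    """Stand-in for a speech decoder conditioned on the current LLM hint."""
    return f"audio({token}|{hint})".encode()

def adapter_step(state: AdapterState, audio_chunk: bytes,
                 llm_every_n_frames: int = 8) -> bytes:
    """Process one audio chunk; query the LLM at a fixed interval instead
    of waiting for a full sentence. The interval is an illustrative choice,
    not a value taken from the paper."""
    token = stub_encoder(audio_chunk)
    state.partial_transcript.append(token)
    state.frames_seen += 1

    if state.frames_seen % llm_every_n_frames == 0:
        # In a real system this call would be asynchronous; it is
        # synchronous here only to keep the sketch readable.
        state.llm_hint = stub_llm(" ".join(state.partial_transcript))

    return stub_decoder(token, state.llm_hint)

if __name__ == "__main__":
    state = AdapterState()
    audio_out = b""
    for i in range(20):                          # simulate a short audio stream
        audio_out += adapter_step(state, b"\x00" * (160 + i))
    print("frames:", state.frames_seen, "latest hint:", state.llm_hint)
```

In a production system the LLM call would run concurrently with encoding and decoding, so synthesis never blocks on the knowledge lookup; the stub above only illustrates where that lookup slots into the stream.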
In benchmarks, KAME achieved sub-300 ms latency while outperforming baseline cascaded models on tasks such as open-domain Q&A and expressive speech generation. The architecture is designed to be model-agnostic, meaning it can work with various speech encoder/decoder pairs and different LLMs.
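The model-agnostic claim implies a narrow interface between the adapter and the surrounding components. The sketch below shows, purely as an assumption about what such an interface could look like, three Python protocols that any encoder, LLM, or decoder implementation might satisfy; the names and signatures are illustrative and not taken from the paper or any released code.

```python
# Hypothetical component interfaces illustrating "model-agnostic":
# the pipeline depends only on these protocols, so conforming
# encoders, decoders, or LLMs can be swapped in. Illustrative only.

from typing import Iterable, Protocol

class StreamingEncoder(Protocol):
    def encode(self, audio_chunk: bytes) -> str:
        """Turn a chunk of raw audio into a partial representation."""
        ...

class KnowledgeLLM(Protocol):
    def complete(self, context: str) -> str:
        """Return a knowledge or style hint for the partial context."""
        ...

class SpeechDecoder(Protocol):
    def synthesize(self, token: str, hint: str) -> bytes:
        """Produce audio for one token, conditioned on the latest hint."""
        ...

def run_pipeline(encoder: StreamingEncoder, llm: KnowledgeLLM,
                 decoder: SpeechDecoder, chunks: Iterable[bytes]) -> bytes:
    """Wire any conforming components together; the adapter's own policy
    for when to query the LLM is reduced here to a fixed interval."""
    hint, audio_out = "", b""
    for i, chunk in enumerate(chunks):
        token = encoder.encode(chunk)
        if i % 8 == 0:                 # illustrative query interval
            hint = llm.complete(token)
        audio_out += decoder.synthesize(token, hint)
    return audio_out
```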
“KAME enables truly fluid conversations where the AI can refine its response based on the latest LLM inference, all while the user continues speaking,” said the Sakana AI research team. The work is part of a broader effort to make AI assistants feel more “human-scale” in interaction.
The code and pretrained models are not yet publicly available, but Sakana AI plans to release them in the coming months. The paper detailing KAME is now on arXiv.