The OpenMOSS team, in collaboration with MOSI.AI and the Shanghai Innovation Institute, has introduced MOSS-Audio, an open-source foundation model designed to unify speech, sound, music, and time-aware audio reasoning within a single system.
MOSS-Audio goes beyond simple transcription, offering capabilities such as speaker identification, emotion analysis, background sound detection, music understanding, and the ability to answer time-grounded questions like "What did the speaker say at the 2-minute mark?" This eliminates the need to stitch together multiple specialized models.
Key Capabilities
- Speech & Content Understanding: Accurate transcription with word- and sentence-level timestamp alignment.
- Speaker, Emotion & Event Analysis: Identifies speakers, analyzes emotions from tone and context, and detects acoustic events.
- Scene & Sound Cue Extraction: Interprets background sounds to infer scene context.
- Music Understanding: Analyzes style, emotion progression, and instrumentation.
- Audio QA & Summarization: Handles questions and summaries across various audio types.
- Complex Reasoning: Multi-hop reasoning powered by chain-of-thought training and reinforcement learning.
The model comes in four variants: MOSS-Audio-4B-Instruct, MOSS-Audio-4B-Thinking, MOSS-Audio-8B-Instruct, and MOSS-Audio-8B-Thinking. Instruct variants are optimized for direct instruction following, while Thinking variants excel at chain-of-thought reasoning. The 4B and 8B models use Qwen3-4B and Qwen3-8B LLM backbones, respectively, with total sizes around 4.6B and 8.6B parameters.
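To give a concrete sense of how one of these variants might be used, the sketch below loads the 8B Thinking model and asks the time-grounded question from the introduction. The Hugging Face repo id, the chat-style message format, and the processor interface are assumptions for illustration only; the official repository documents the actual API.

```python
# Hypothetical usage sketch -- the repo id, message format, and processor
# interface below are assumptions, not the confirmed MOSS-Audio API.
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "OpenMOSS/MOSS-Audio-8B-Thinking"  # assumed repo id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# A time-grounded question over a local recording (path is illustrative).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "meeting_recording.wav"},
            {"type": "text", "text": "What did the speaker say at the 2-minute mark?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
answer = processor.decode(
    output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(answer)
```

With a Thinking variant, the generated text would typically include an explicit chain-of-thought trace before the final answer, whereas an Instruct variant would respond more directly.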
Architecture
MOSS-Audio follows a modular design with three components: an audio encoder that extracts acoustic features, a modality adapter that projects those features into the LLM's embedding space, and a large language model that reasons over the combined audio and text context. This modular design lets a single text-generation interface cover the full range of audio understanding tasks.
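The data flow through the three components can be sketched schematically. The dimensions, module choices, and simple MLP adapter below are illustrative assumptions, not the actual MOSS-Audio implementation; the sketch only shows how acoustic features are projected into the LLM's embedding space and consumed alongside text tokens.

```python
# Schematic sketch of the encoder -> adapter -> LLM flow.
# All shapes and module choices are illustrative, not MOSS-Audio's actual ones.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Stand-in for the pretrained audio encoder: mel frames -> acoustic features."""
    def __init__(self, n_mels=128, d_audio=1024):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, d_audio, kernel_size=3, stride=2, padding=1)
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_audio, nhead=8, batch_first=True),
            num_layers=4,
        )

    def forward(self, mels):                 # (B, n_mels, T)
        x = self.conv(mels).transpose(1, 2)  # (B, T', d_audio)
        return self.blocks(x)

class ModalityAdapter(nn.Module):
    """Projects acoustic features into the LLM's token-embedding space."""
    def __init__(self, d_audio=1024, d_model=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_audio, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, audio_feats):          # (B, T', d_audio)
        return self.proj(audio_feats)        # (B, T', d_model)

encoder, adapter = AudioEncoder(), ModalityAdapter()
llm_embed = nn.Embedding(32000, 4096)        # placeholder for the LLM backbone's embedding table

mels = torch.randn(1, 128, 3000)             # a clip's mel frames (illustrative length)
text_ids = torch.randint(0, 32000, (1, 16))  # tokenized question, e.g. a time-grounded query

audio_tokens = adapter(encoder(mels))        # (1, T', 4096)
text_tokens = llm_embed(text_ids)            # (1, 16, 4096)

# The LLM backbone then attends over one interleaved sequence of audio and text embeddings.
llm_inputs = torch.cat([audio_tokens, text_tokens], dim=1)
print(llm_inputs.shape)
```

Keeping the encoder, adapter, and LLM separate in this way means the audio front end and the language backbone can be pretrained independently and then aligned through the adapter.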
Open-source code and weights are available on GitHub for community use and further development.