This tutorial walks through integrating state-of-the-art speech recognition into Unity games using the Hugging Face Unity API. The feature enables voice commands, NPC conversations, accessibility improvements, and any other functionality requiring speech-to-text conversion.
Check out the live demo on itch.io to try it yourself.
Prerequisites
Basic Unity knowledge is assumed. You'll also need the Hugging Face Unity API installed. See the earlier setup guide for instructions.
Steps
1. Set up the Scene
Create a simple scene with a Canvas containing three UI elements:
- Start Button: Initiates recording.
- Stop Button: Stops recording.
- Text (TextMeshPro): Displays the speech recognition result.
2. Set up the Script
Create a SpeechRecognitionTest script and attach it to an empty GameObject. Define UI references:
```csharp
[SerializeField] private Button startButton;
[SerializeField] private Button stopButton;
[SerializeField] private TextMeshProUGUI text;
```
Assign them in the Inspector, then add button listeners in Start():
```csharp
private void Start() {
    startButton.onClick.AddListener(StartRecording);
    stopButton.onClick.AddListener(StopRecording);
}
```
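To keep the two buttons from being pressed out of order, their `interactable` flags can also be toggled. This variant is illustrative and not required by the tutorial:

```csharp
private void Start() {
    startButton.onClick.AddListener(StartRecording);
    stopButton.onClick.AddListener(StopRecording);
    stopButton.interactable = false; // nothing to stop yet
}

private void StartRecording() {
    startButton.interactable = false;
    stopButton.interactable = true;
    // ... begin microphone capture (see step 3)
}
```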
3. Record Microphone Input
Add member variables and use Microphone.Start() to record up to 10 seconds of audio at 44100 Hz:
```csharp
private AudioClip clip;
private byte[] bytes;
private bool recording;

private void StartRecording() {
    clip = Microphone.Start(null, false, 10, 44100);
    recording = true;
}
```
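`Microphone.Start(null, ...)` records from the default device and returns `null` when no microphone is available, so a defensive variant (illustrative, not part of the original script) can guard against that case:

```csharp
private void StartRecording() {
    // Microphone.devices lists the attached capture devices; bail out if none.
    if (Microphone.devices.Length == 0) {
        text.text = "No microphone detected.";
        return;
    }
    text.text = "Recording...";
    clip = Microphone.Start(null, false, 10, 44100);
    recording = true;
}
```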
Auto-stop when recording reaches max length:
```csharp
private void Update() {
    if (recording && Microphone.GetPosition(null) >= clip.samples) {
        StopRecording();
    }
}
```
4. Encode Audio as WAV
In StopRecording(), truncate the clip and encode it in WAV format:
```csharp
private void StopRecording() {
    var position = Microphone.GetPosition(null);
    Microphone.End(null);
    var samples = new float[position * clip.channels];
    clip.GetData(samples, 0);
    bytes = EncodeAsWAV(samples, clip.frequency, clip.channels);
    recording = false;
}
```
```csharp
private byte[] EncodeAsWAV(float[] samples, int frequency, int channels) {
    // 44-byte WAV header followed by 2 bytes per 16-bit sample
    using (var memoryStream = new MemoryStream(44 + samples.Length * 2)) {
        using (var writer = new BinaryWriter(memoryStream)) {
            // RIFF header
            writer.Write("RIFF".ToCharArray());
            writer.Write(36 + samples.Length * 2); // file size minus the first 8 bytes
            writer.Write("WAVE".ToCharArray());
            // fmt subchunk
            writer.Write("fmt ".ToCharArray());
            writer.Write(16);                       // subchunk size
            writer.Write((ushort)1);                // audio format: 1 = PCM
            writer.Write((ushort)channels);
            writer.Write(frequency);                // sample rate
            writer.Write(frequency * channels * 2); // byte rate
            writer.Write((ushort)(channels * 2));   // block align
            writer.Write((ushort)16);               // bits per sample
            // data subchunk
            writer.Write("data".ToCharArray());
            writer.Write(samples.Length * 2);
            foreach (var sample in samples) {
                // convert floats in [-1, 1] to signed 16-bit PCM
                writer.Write((short)(sample * short.MaxValue));
            }
        }
        return memoryStream.ToArray();
    }
}
```
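As a quick sanity check on the encoder (illustrative, not part of the tutorial script), one second of silent mono audio should encode to the 44-byte header plus two bytes per sample:

```csharp
var silence = new float[44100];               // 1 s of silent mono audio
byte[] wav = EncodeAsWAV(silence, 44100, 1);
Debug.Assert(wav.Length == 44 + 44100 * 2);   // header + 16-bit samples
Debug.Assert(wav[0] == 'R' && wav[1] == 'I'); // "RIFF" magic bytes
```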
5. Implement Speech Recognition
The final step uses the Hugging Face Unity API to send the WAV bytes to a speech recognition model and display the transcription in the Text field.
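A sketch of that request, assuming the API exposes an `HuggingFaceAPI.AutomaticSpeechRecognition(bytes, onSuccess, onError)` method as described in the setup guide; the color changes and button toggling here are illustrative:

```csharp
private void SendRecording() {
    text.color = Color.yellow;
    text.text = "Sending...";
    // Send the WAV bytes for transcription; both callbacks run on completion.
    HuggingFaceAPI.AutomaticSpeechRecognition(bytes, response => {
        text.color = Color.white;
        text.text = response;              // show the transcription
        startButton.interactable = true;
    }, error => {
        text.color = Color.red;
        text.text = error;                 // show the error message
        startButton.interactable = true;
    });
}
```

Calling `SendRecording()` at the end of `StopRecording()` sends each clip for transcription as soon as recording stops.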
The full script integrates all components for a working speech recognition system in Unity.