How-to Guide

How to build a voice-enabled AI assistant.

Create a voice-driven AI assistant using speech-to-text, LLM reasoning, and text-to-speech, with wake word detection, interruption handling, and low-latency streaming.

Decision

Start narrow

Use the article to decide the smallest useful workflow worth testing before expanding the system.

Risk to watch

Hype drift

Avoid turning a practical adoption step into a broad transformation promise nobody can verify.

Proof to collect

Business signal

Write down the owner, data boundary, review point, and measurable outcome before the first build.

TL;DR

Build a voice-enabled AI assistant that listens via speech-to-text (STT), reasons with an LLM, and responds via text-to-speech (TTS), with wake word detection, interruption handling, and sub-2-second response latency. This guide covers the full audio pipeline with working code.

Key takeaways

  • STT: Whisper (local) or Deepgram API for streaming transcription
  • LLM: GPT-5.5 Instant for fast responses (pricing unconfirmed; see the note in Step 6), or a local model
  • TTS: ElevenLabs for quality; Piper for local/free
  • Latency: Target < 2s from end of speech to first audio; streaming critical
  • Wake word: Porcupine (local) or custom keyword spotting

Analysis

Talk to a computer and wait, and the gap between your last word and its first one tells you everything. Half a second feels like a conversation. Three seconds feels like a help desk. The whole craft of building a voice assistant comes down to closing that gap without making the thing sound robotic when it finally speaks.

For Australian teams, that is the practical part worth caring about. A voice front end on top of an LLM can answer phones, run a kiosk, take an order, or sit on a warehouse floor where nobody has a free hand for a keyboard. None of it works if the reply lands too late or the assistant talks over the person mid-sentence.

This guide walks through the full chain: catching a wake word locally, turning speech into text, getting a fast answer out of a model, and speaking it back. The code below runs in Python and uses tools you can install today. A couple of the numbers attached to one model are not what they appear, and I'll flag those as we hit them rather than hand-wave past them.

Prerequisites

  • Python 3.10+
  • Microphone access
  • Speakers/headphones
  • API keys for STT/LLM/TTS services (or local models)
  • pip install openai openai-whisper pvporcupine pyaudio elevenlabs numpy
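
Before running anything, it can help to confirm the API keys are actually present. The sketch below is one way to do that, not part of the original guide. `OPENAI_API_KEY` is the variable the OpenAI client reads by default; `ELEVENLABS_API_KEY` and `PICOVOICE_ACCESS_KEY` are naming assumptions made for this example, and `missing_keys` is a hypothetical helper.

```python
import os

# OPENAI_API_KEY is read by the OpenAI client by default.
# The other two names are assumptions made for this sketch.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ELEVENLABS_API_KEY", "PICOVOICE_ACCESS_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not str(env.get(name, "")).strip()]


# Example with an explicit mapping instead of the real environment
example_env = {"OPENAI_API_KEY": "sk-example"}
print("Missing:", missing_keys(example_env))
```

If you only use local models (Whisper, Piper), trim `REQUIRED_KEYS` to the services you actually call.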

Step-by-Step Framework

Step 1: Wake Word Detection

The assistant needs to sit quietly until you call it. Running a full LLM around the clock to catch one phrase would be wasteful and slow, so wake word detection runs on its own small model. Porcupine does this on-device: no audio leaves the machine until the keyword fires, which keeps both latency and privacy in your favour. The Python SDK lives in the Picovoice/porcupine repo and needs a free Picovoice access key. The hey-computer.ppn file in the code is a custom keyword file you supply yourself.

# voice/wake_word.py
import pvporcupine
import pyaudio
import struct

class WakeWordDetector:
    def __init__(self, keyword_path: str = "hey-computer.ppn"):
        self.porcupine = pvporcupine.create(
            access_key="YOUR_PICOVOICE_KEY",
            keyword_paths=[keyword_path]
        )

        self.pa = pyaudio.PyAudio()
        self.stream = self.pa.open(
            rate=self.porcupine.sample_rate,
            channels=1,
            format=pyaudio.paInt16,
            input=True,
            frames_per_buffer=self.porcupine.frame_length
        )

    def listen(self):
        """Block until wake word detected."""
        print("Listening for wake word...")
        while True:
            pcm = self.stream.read(self.porcupine.frame_length)
            pcm = struct.unpack_from("h" * self.porcupine.frame_length, pcm)
            result = self.porcupine.process(pcm)

            if result >= 0:
                print("Wake word detected!")
                return True

    def cleanup(self):
        self.stream.stop_stream()
        self.stream.close()
        self.pa.terminate()
        self.porcupine.delete()
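
The one non-obvious line in the listener is the `struct` call, which turns the raw bytes from the microphone into the sequence of 16-bit integers the detector expects. A minimal sketch of that conversion in isolation follows. `unpack_frame` is a hypothetical helper name, and the four-sample frame is made up for illustration (Porcupine reports its own `frame_length`).

```python
import struct


def unpack_frame(pcm_bytes: bytes, frame_length: int) -> tuple:
    """Convert raw 16-bit PCM bytes (native byte order) into a tuple of ints."""
    expected = frame_length * 2  # 2 bytes per int16 sample
    if len(pcm_bytes) != expected:
        # A short read would otherwise surface as a less helpful struct.error
        raise ValueError(f"expected {expected} bytes, got {len(pcm_bytes)}")
    return struct.unpack(f"{frame_length}h", pcm_bytes)


# Round trip with a made-up 4-sample frame
samples = (0, 1200, -1200, 32767)
raw = struct.pack("4h", *samples)
print(unpack_frame(raw, 4))
```

Checking the byte count first means a truncated audio read fails with a clear message instead of deep inside the detection loop.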

Step 2: Streaming Speech-to-Text

Once the assistant is awake, it has to write down what you said. OpenAI's Whisper runs locally and ships in several sizes; base and tiny are fast enough that the transcription is rarely the bottleneck. If you'd rather offload it and get word-by-word results as the person talks, Deepgram's streaming API is the usual swap.

The other half of this step is knowing when the person has finished. Rather than recording for a fixed number of seconds, the code below watches the volume and stops after roughly a second of silence. That single trick does more for the felt responsiveness than almost anything else, because the assistant stops listening the moment you stop talking instead of waiting out a timer.

# voice/stt.py
import whisper
import numpy as np
import queue
import threading

class StreamingSTT:
    def __init__(self, model_size: str = "base"):
        self.model = whisper.load_model(model_size)
        self.audio_queue = queue.Queue()
        self.is_recording = False
        self.silence_threshold = 500  # Adjust for environment
        self.silence_frames = 0
        self.max_silence_frames = 16  # ~1 second: 16 buffers x 1024 samples at 16 kHz

    def start_recording(self):
        """Start recording audio for transcription."""
        import pyaudio
        import struct

        self.is_recording = True
        self.audio_buffer = []
        self.silence_frames = 0

        pa = pyaudio.PyAudio()
        stream = pa.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=16000,
            input=True,
            frames_per_buffer=1024
        )

        print("Recording... (speak now)")

        while self.is_recording:
            data = stream.read(1024, exception_on_overflow=False)
            audio_data = np.frombuffer(data, dtype=np.int16)
            self.audio_buffer.append(audio_data)

            # Detect silence
            volume = np.abs(audio_data).mean()
            if volume < self.silence_threshold:
                self.silence_frames += 1
                if self.silence_frames >= self.max_silence_frames:
                    self.is_recording = False
            else:
                self.silence_frames = 0

        stream.stop_stream()
        stream.close()
        pa.terminate()

        # Transcribe
        audio = np.concatenate(self.audio_buffer).astype(np.float32) / 32768.0
        result = self.model.transcribe(audio, fp16=False)
        return result["text"].strip()

    def transcribe_file(self, path: str) -> str:
        result = self.model.transcribe(path, fp16=False)
        return result["text"].strip()
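
The silence rule is easier to tune when it is separated from the audio I/O. The sketch below works on a plain list of per-buffer volumes, so it can be tested without a microphone. It differs from the class above in one respect, which is a design choice of this sketch rather than the guide's method: silence only counts after some speech has been heard, so a pause before the person starts talking does not end the recording. The function names are hypothetical.

```python
import math


def frames_for_silence(seconds: float, sample_rate: int = 16000,
                       frames_per_buffer: int = 1024) -> int:
    """Number of consecutive quiet buffers that add up to `seconds`."""
    return math.ceil(seconds * sample_rate / frames_per_buffer)


def find_endpoint(volumes, threshold: float, max_silence_frames: int):
    """Return the index of the buffer at which recording should stop, or None."""
    heard_speech = False
    silent_run = 0
    for i, volume in enumerate(volumes):
        if volume >= threshold:
            heard_speech = True
            silent_run = 0  # any loud buffer resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= max_silence_frames:
                return i
    return None


# Each 1024-sample buffer at 16 kHz lasts 64 ms
print(frames_for_silence(1.0))
print(find_endpoint([100, 100, 900, 900, 100, 100, 100], threshold=500, max_silence_frames=3))
```

The threshold of 500 matches the default in the class above, but it is only a starting point. Measure the mean volume of your own room noise and set the threshold a comfortable margin above it.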

Step 3: Fast LLM Response

This is where the model does its thinking, and where speed buys you the most. The code targets what OpenAI markets as GPT-5.5 Instant, the variant tuned for quick replies.

Two caveats here, because they matter for anyone copying this into production. First, the model="gpt-5.5-instant" string is written as if it were a standalone API model id, but OpenAI staff have said there is no separate gpt-5.5-instant model on the API; the Instant behaviour is reportedly reached through the `chat-latest` alias, so the literal id may not resolve as written. Second, ignore the headline pricing you may have seen floating around for this tier; see the cost note at the end of Step 6 before you budget on it.

Capping responses at 250 tokens is deliberate. Voice replies that run long stop sounding like answers and start sounding like a lecture, and every extra token is more time before the person can speak again.

# voice/llm.py
from openai import OpenAI

class FastLLM:
    def __init__(self):
        self.client = OpenAI()

    def respond(self, user_message: str, conversation_history: list = None) -> str:
        # Copy the history so the caller's list is not mutated (avoids duplicate user turns)
        messages = list(conversation_history or [])
        messages.append({"role": "user", "content": user_message})

        # Use GPT-5.5 Instant for speed
        response = self.client.chat.completions.create(
            model="gpt-5.5-instant",
            messages=messages,
            max_tokens=250,  # Keep responses concise for voice
            temperature=0.7
        )

        return response.choices[0].message.content

    def respond_streaming(self, user_message: str):
        """Stream response for lower perceived latency."""
        response = self.client.chat.completions.create(
            model="gpt-5.5-instant",
            messages=[{"role": "user", "content": user_message}],
            max_tokens=250,
            stream=True
        )

        for chunk in response:
            if chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content
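
Streaming only lowers perceived latency if the speech side can start before the model finishes. One common approach, sketched here under simple assumptions, is to buffer streamed fragments and hand each complete sentence to TTS as soon as it closes. `sentences_from_stream` is a hypothetical helper; it accepts any iterable of text fragments, such as the generator returned by `respond_streaming`.

```python
import re

# Split after ., ! or ? when followed by whitespace
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks):
    """Yield complete sentences from an iterable of text fragments."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever is left when the stream ends
    if buffer.strip():
        yield buffer.strip()


fragments = ["Hel", "lo there. How ", "are you? I am", " fine"]
for sentence in sentences_from_stream(fragments):
    print(sentence)
```

The split rule is deliberately naive: abbreviations such as "e.g." followed by a space are treated as sentence ends. For voice that usually costs a slightly odd pause rather than a wrong answer, but test it against your own prompts.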

Step 4: Text-to-Speech

Now the assistant talks back. ElevenLabs gives you the most natural voice and supports streaming, so playback can start before the full reply is generated. The eleven_turbo_v2_5 model is the low-latency option, and 21m00Tcm4TlvDq8ikWAM is the default Rachel voice. If you want everything local and free, Piper runs neural voices like en_US-lessac-medium straight from the CLI. On a Mac, the built-in `say` command is fine for a quick prototype before you wire up anything fancier.

# voice/tts.py
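from elevenlabs import stream
from elevenlabs.client import ElevenLabs
import subprocess

class VoiceOutput:
    def __init__(self, voice_id: str = "21m00Tcm4TlvDq8ikWAM"):
        self.voice_id = voice_id
        self._client = None  # created on first use so local-only setups need no key

    def speak_elevenlabs(self, text: str):
        """High-quality TTS via ElevenLabs."""
        # Assumes elevenlabs SDK 1.x or later; older releases exposed a
        # top-level generate() function instead of a client object.
        if self._client is None:
            self._client = ElevenLabs()  # reads ELEVENLABS_API_KEY from the environment
        audio_stream = self._client.text_to_speech.stream(
            text=text,
            voice_id=self.voice_id,
            model_id="eleven_turbo_v2_5",
        )
        # stream() plays chunks as they arrive (needs mpv installed)
        stream(audio_stream)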

    def speak_piper(self, text: str):
        """Local/free TTS via Piper."""
        # Piper reads the text to synthesise from stdin
        subprocess.run(
            [
                "piper",
                "--model", "en_US-lessac-medium.onnx",
                "--output_file", "/tmp/tts_output.wav",
            ],
            input=text,
            text=True,
            check=True,
        )

        # aplay is Linux (ALSA) only; swap in another player elsewhere
        subprocess.run(["aplay", "/tmp/tts_output.wav"], check=True)

    def speak_macos(self, text: str):
        """Use macOS built-in say command."""
        subprocess.run(["say", text])
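
With three ways to speak, something has to choose between them. A minimal sketch of that choice follows, assuming the preference order from the text (ElevenLabs for quality, Piper for local use, `say` as a Mac fallback). `pick_tts_backend` is a hypothetical helper, and the `which` parameter exists only so the logic can be exercised without the tools installed.

```python
import platform
import shutil
from typing import Callable, Optional


def pick_tts_backend(has_elevenlabs_key: bool,
                     system: Optional[str] = None,
                     which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return 'elevenlabs', 'piper', 'macos' or 'none' based on what is available."""
    system = platform.system() if system is None else system
    if has_elevenlabs_key:
        return "elevenlabs"
    if which("piper"):
        return "piper"
    if system == "Darwin" and which("say"):
        return "macos"
    return "none"


# Simulate a Mac with no API key and no Piper install
fake_which = lambda name: "/usr/bin/say" if name == "say" else None
print(pick_tts_backend(False, system="Darwin", which=fake_which))
```

Returning "none" instead of raising lets the caller decide whether to fall back to printing the reply or to stop.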

Step 5: Main Assistant Loop

This is where the four pieces become one program. The loop waits for the wake word, acknowledges with a quick "Yes?", records until you go quiet, sends the text to the model, speaks the reply, and keeps the last ten exchanges as context so the conversation holds together. It also catches its own errors and apologises out loud rather than dying silently, which on a voice device is the difference between a hiccup and a dead box the user has no way to debug.

# voice/assistant.py
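import time
import signal
import sys

# Sibling modules from the earlier steps (run this file from inside voice/)
from wake_word import WakeWordDetector
from stt import StreamingSTT
from llm import FastLLM
from tts import VoiceOutput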

class VoiceAssistant:
    def __init__(self):
        self.wake_detector = WakeWordDetector()
        self.stt = StreamingSTT()
        self.llm = FastLLM()
        self.tts = VoiceOutput()
        self.conversation = []
        self.running = True

        signal.signal(signal.SIGINT, self.shutdown)

    def run(self):
        print("Voice Assistant started. Say the wake word to begin.")

        while self.running:
            try:
                # 1. Wait for wake word
                self.wake_detector.listen()
                self.tts.speak_macos("Yes?")

                # 2. Record speech
                user_input = self.stt.start_recording()
                print(f"You said: {user_input}")

                if not user_input:
                    self.tts.speak_macos("I didn't catch that.")
                    continue

                # 3. Generate response
                start_time = time.time()
                response = self.llm.respond(user_input, self.conversation)
                latency = time.time() - start_time

                print(f"Response ({latency:.1f}s): {response}")

                # 4. Speak response
                self.tts.speak_macos(response)

                # 5. Update conversation
                self.conversation.append({"role": "user", "content": user_input})
                self.conversation.append({"role": "assistant", "content": response})

                # Keep last 10 exchanges
                if len(self.conversation) > 20:
                    self.conversation = self.conversation[-20:]

            except Exception as e:
                print(f"Error: {e}")
                self.tts.speak_macos("Sorry, I encountered an error.")

    def shutdown(self, signum, frame):
        print("\nShutting down...")
        self.running = False
        self.wake_detector.cleanup()
        sys.exit(0)

if __name__ == "__main__":
    assistant = VoiceAssistant()
    assistant.run()
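
The history cap in the loop is simple enough to pull out and test on its own. The sketch below does the same trimming as a pure function; `trim_history` is a hypothetical name, and it assumes the history alternates strictly between one user message and one assistant message.

```python
def trim_history(conversation: list, max_exchanges: int = 10) -> list:
    """Return a new list holding at most the last `max_exchanges` user/assistant pairs."""
    if max_exchanges <= 0:
        return []
    max_messages = max_exchanges * 2  # one user + one assistant message per exchange
    return list(conversation[-max_messages:])


# Build 12 exchanges, then keep the last 10
history = []
for i in range(12):
    history.append({"role": "user", "content": f"q{i}"})
    history.append({"role": "assistant", "content": f"a{i}"})

trimmed = trim_history(history)
print(len(trimmed), trimmed[0]["content"])
```

Because it returns a new list, the caller's history is left untouched, which makes it easy to compare the trimmed and untrimmed versions while debugging.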

Step 6: Latency Optimisation

The figures below are targets to design against, not benchmarks. Real numbers depend on your hardware, your network, and how each component is configured. Treat them as a budget that tells you where the time goes, then measure your own setup.

Target latency breakdown:
- Wake word detection: < 200ms
- Speech recording (silence detection): 1-3s (user-dependent)
- STT (Whisper base): 300ms
- LLM (GPT-5.5 Instant, 50 tokens): 500ms
- TTS (ElevenLabs Turbo): 200ms first chunk
- Audio playback: streaming

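Total end-to-end: 2.2 - 4.2s from wake word to first audio, of which about 1.0s (STT + LLM + TTS first chunk) falls after the person stops speaking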
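
Summing the stage targets by hand is error-prone, so here is the same arithmetic as a short sketch. The millisecond values are copied from the breakdown above; the split into "response" (after the person stops speaking) and "end-to-end" (including wake word and recording) is a framing added for this example.

```python
# Stage targets in milliseconds, copied from the breakdown above
PIPELINE_MS = {"stt": 300, "llm_50_tokens": 500, "tts_first_chunk": 200}
WAKE_WORD_MS = 200
RECORDING_MS = (1000, 3000)  # user-dependent range


def response_latency_ms(stages=PIPELINE_MS) -> int:
    """Time from end of recording to first audio out."""
    return sum(stages.values())


def end_to_end_ms() -> tuple:
    """(best case, worst case) from wake word to first audio out."""
    fixed = WAKE_WORD_MS + response_latency_ms()
    return (fixed + RECORDING_MS[0], fixed + RECORDING_MS[1])


print("Response latency:", response_latency_ms(), "ms")
print("End-to-end range:", end_to_end_ms(), "ms")
```

The response figure is the one the person feels, so it is the one to hold against the sub-2-second goal. The recording time is mostly the person talking.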

Optimisations:

  • Use Whisper "tiny" or "base" for speed
  • Pick the fastest model tier that still answers well; this guide assumes GPT-5.5 Instant
  • ElevenLabs Turbo v2.5 for streaming TTS
  • Pre-warm all models on startup
  • Use streaming TTS to start playback before full response
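
To check a real setup against these targets, each stage needs its own timing. The sketch below is one lightweight way to collect that, using only the standard library. `StageTimer` is a hypothetical helper, and the `time.sleep` call stands in for a real STT, LLM, or TTS call.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collect wall-clock durations for named pipeline stages."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised
            self.durations[name] = time.perf_counter() - start

    def report(self) -> str:
        return ", ".join(f"{name}: {secs * 1000:.0f}ms"
                         for name, secs in self.durations.items())


timer = StageTimer()
with timer.stage("stt"):
    time.sleep(0.01)  # placeholder for a real transcription call
print(timer.report())
```

Running the first request through the timer twice, once cold and once warm, also shows how much pre-warming the models actually saves on your hardware.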

One more note on the model, because it affects your running costs. Some write-ups have circulated a $0.50/$1.50 per-million-token figure for GPT-5.5 Instant; that does not match any published OpenAI rate I can find. The documented GPT-5.5 pricing is closer to $5 per million input tokens and $30 per million output tokens (roughly half that on batch), and OpenAI staff have said there is no separate Instant pricing tier. Budget against the published numbers, not the cheaper ones.
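
To see what the difference means per conversation turn, the arithmetic can be sketched as below. The rates are the ones quoted in the paragraph above. The token counts are hypothetical: 500 input tokens for prompt plus history, and 250 output tokens to match the response cap from Step 3.

```python
# USD per million tokens, as quoted in the paragraph above
PUBLISHED = {"input": 5.00, "output": 30.00}
CIRCULATED = {"input": 0.50, "output": 1.50}


def cost_per_turn(input_tokens: int, output_tokens: int, rates: dict) -> float:
    """Cost in USD of one request at the given per-million-token rates."""
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


# Hypothetical turn: 500 tokens in, 250 tokens out
published = cost_per_turn(500, 250, PUBLISHED)
circulated = cost_per_turn(500, 250, CIRCULATED)
print(f"Published rates:  ${published:.6f} per turn")
print(f"Circulated rates: ${circulated:.6f} per turn")
print(f"Ratio: {published / circulated:.0f}x")
```

At these assumed token counts the published rates work out to about one cent per turn, sixteen times the circulated figure, which is the gap that matters once a kiosk or phone line is handling thousands of turns a day.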

Do/Don't

Do | Don't
Use GPT-5.5 Instant for voice (speed matters) | Use slow models for real-time voice
Keep LLM responses under 250 tokens | Generate long responses for voice
Implement silence detection for STT | Use fixed recording duration
Stream TTS for lower perceived latency | Wait for full audio before playing
Test in your actual acoustic environment | Develop in a quiet office, deploy to a noisy space

Conclusion

The pipeline has four parts (wake word, speech-to-text, a fast model, and speech back out), and the whole thing lives or dies on latency. Pick fast components, stream wherever you can, and pre-warm the models so the first request isn't the slow one. Porcupine keeps the wake word local, Whisper or Deepgram handle the listening, and ElevenLabs Turbo gives you a voice that doesn't sound like a phone tree.

Two things to keep your eye on before you ship: confirm how you actually call the model on the API rather than trusting the literal id in the sample, and check the real pricing against OpenAI's published rates. Get those right, and you have an assistant that answers fast enough to feel like a conversation rather than a query.

What to do next

  1. Pick the smallest useful workflow that proves the pattern.
  2. Write down the owner, data boundary, review point, and success measure.
  3. Review the result after the first real run and decide whether to scale, change, or stop.

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
