Building Voice-Enabled Coding Agents

OpenHuman's on-device STT/TTS makes voice-driven development practical. We cover the architecture, real workflows, current limitations, and how to add voice to Hermes and Claude Code.

Decision

Start narrow

Use the article to decide the smallest useful workflow worth testing before expanding the system.

Risk to watch

Hype drift

Avoid turning a practical adoption step into a broad transformation promise nobody can verify.

Proof to collect

Business signal

Write down the owner, data boundary, review point, and measurable outcome before the first build.

Briefing

Talk to your computer and watch it write code. That used to be a sci-fi gag. It is now a feature you can switch on today, and a small open-source project is one of the clearer examples of where it actually helps.

OpenHuman, built by TinyHumans AI, is a desktop agent with a little animated mascot that listens when you speak and talks back. As of its v0.54.0 release, both the speech recognition and the speech output can run on your own machine, so you can have a back-and-forth with your agent without sending audio anywhere. For a business team weighing whether voice belongs in their developers' day, the honest answer is: sometimes, and it depends heavily on the task.

The short version for non-technical readers is this. Voice is great for the quick, low-stakes stuff: asking what a file does, jotting a reminder, kicking off a test run. It is poor at the fiddly, precise work where every character matters. Knowing which is which is the whole game, and that is what the rest of this piece walks through.

The Architecture

A voice-enabled coding agent is a five-stage pipeline, with three processing steps between voice in and voice out:

Voice Input -> Speech-to-Text -> Agent Processing -> Text-to-Speech -> Voice Output
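
The flow above can be sketched as three swappable functions sitting between audio in and audio out. This is a minimal illustration, not OpenHuman's code: the VoicePipeline class and the fake_* stages are hypothetical names invented here, and the stand-in stages just pass bytes and strings around so the sketch runs without a microphone or any models.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Wires the three processing stages together; each stage is a plain callable."""
    transcribe: Callable[[bytes], str]   # speech-to-text
    run_agent: Callable[[str], str]      # agent processing
    synthesize: Callable[[str], bytes]   # text-to-speech

    def handle(self, audio_in: bytes) -> bytes:
        text = self.transcribe(audio_in)
        reply = self.run_agent(text)
        return self.synthesize(reply)


# Stand-in stages (assumptions for illustration only).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(command: str) -> str:
    return f"Done: {command}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_stt, fake_agent, fake_tts)
print(pipeline.handle(b"run the tests"))  # b'Done: run the tests'
```

Keeping each stage behind a plain callable means a local model can be swapped for a cloud one without touching the other two stages.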

Speech-to-Text (STT)

OpenHuman uses Whisper for speech recognition and runs it locally on the Tauri v2 runtime (voice docs: https://tinyhumans.gitbook.io/openhuman/features/native-tools/voice). The project documents Whisper-based STT and a Tauri build; it does not describe a separate Whisper-derived model tuned specifically for Tauri. Running on-device means:

  • No audio leaves your machine
  • It works offline
  • Latency is reportedly around 200-500ms for a 10-second utterance, though that figure is not published by OpenHuman and looks like an estimate
  • It is said to support English, Mandarin, Spanish, and Japanese, though the docs do not list supported languages

The agent is also described as recognising technical vocabulary (function names, library names, coding terms) more reliably than generic Whisper. That claim is unconfirmed; the docs mention punctuation and dictation cleanup, not a coding-specific fine-tune.
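
Since the latency figure is unpublished, it is worth measuring on your own hardware. The sketch below shows one way to do that, and what the reported range would mean as a real-time factor if taken at face value. Both helper names are hypothetical, and the transcribe argument stands for whatever STT function you are testing.

```python
import time


def real_time_factor(latency_ms: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return (latency_ms / 1000.0) / audio_seconds


def measure_latency_ms(transcribe, audio) -> float:
    """Time a single call to a transcription function, in milliseconds."""
    start = time.perf_counter()
    transcribe(audio)
    return (time.perf_counter() - start) * 1000.0


# The reported 200-500 ms for a 10-second utterance, expressed as a ratio.
for latency_ms in (200, 500):
    print(latency_ms, "ms ->", round(real_time_factor(latency_ms, 10.0), 3))
```

A real-time factor in that range would mean transcription takes a few percent of the time spent speaking, which is why the figure deserves a check before you rely on it.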

Agent Processing

The transcribed text goes into the agent's normal pipeline. OpenHuman treats a spoken command the same as a typed one. "Create a new function called calculate total that takes an array of prices" runs the same way whether you say it or type it.
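
One detail hides in that example: "calculate total" has to become a legal identifier somewhere along the way. In practice the agent's language model does this. The deterministic sketch below only illustrates what that normalisation amounts to; the function names are hypothetical and nothing here is taken from OpenHuman.

```python
import re


def _words(phrase: str) -> list:
    """Lowercase the phrase and keep only runs of letters and digits."""
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    if not words:
        raise ValueError("no usable words in phrase")
    return words


def spoken_to_snake_case(phrase: str) -> str:
    name = "_".join(_words(phrase))
    # Identifiers cannot start with a digit, so prefix an underscore if needed.
    return "_" + name if name[0].isdigit() else name


def spoken_to_camel_case(phrase: str) -> str:
    words = _words(phrase)
    name = words[0] + "".join(w.capitalize() for w in words[1:])
    return "_" + name if name[0].isdigit() else name


print(spoken_to_snake_case("calculate total"))  # calculate_total
print(spoken_to_camel_case("calculate total"))  # calculateTotal
```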

Text-to-Speech (TTS)

Responses are read back using a lightweight TTS model. OpenHuman ships Piper for local voice and ElevenLabs for cloud voice, and you can pick the voice you want. Some developers reportedly bump the speech rate to 1.5x for routine confirmations and drop back to 1.0x for longer explanations, though an adjustable rate is not something OpenHuman documents.
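
If your TTS engine does expose a rate control, the habit described above reduces to a small rule. This is a sketch under assumptions: choose_speech_rate is a hypothetical helper, the 12-word threshold is arbitrary, and how the returned rate is applied depends entirely on the engine you use.

```python
def choose_speech_rate(reply: str, short_reply_words: int = 12) -> float:
    """Pick a playback rate: faster for short confirmations, normal for explanations."""
    word_count = len(reply.split())
    return 1.5 if word_count <= short_reply_words else 1.0


confirmation = "Added MalformedTokenError handling."
explanation = (
    "The file has 47 lines. It validates JWT tokens, rejects expired "
    "ones, and raises a specific error type for each failure mode it detects."
)

print(choose_speech_rate(confirmation))  # 1.5
print(choose_speech_rate(explanation))   # 1.0
```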

Practical Voice Workflows

Workflow 1: Hands-Free Code Review

Review code while you eat lunch or walk around:

You: "Read the auth middleware file"
Agent: "Reading auth middleware. The file has 47 lines. It validates JWT tokens..."
You: "What exceptions does it handle?"
Agent: "It handles TokenExpiredError, InvalidTokenError, and MissingTokenError."
You: "Add handling for MalformedTokenError"
Agent: "Added MalformedTokenError handling. Should I also add a test for it?"
You: "Yes, add a test"

Workflow 2: Rapid Note Capture

Grab an idea without breaking your flow:

You: "Note: the database migration needs a rollback script"
Agent: "Noted. I will add it to the migration task in your Memory Tree."

Those notes land in OpenHuman's Memory Tree, a hierarchical store of Markdown files backed by a local SQLite database.
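
To make the "Markdown files plus SQLite" idea concrete, here is a minimal sketch of that storage pattern. It is not OpenHuman's schema or file layout: the save_note function, the notes table and the topic-as-path convention are all assumptions made for illustration, and the database is in-memory so the example cleans up after itself.

```python
import sqlite3
import tempfile
from pathlib import Path


def save_note(root: Path, db: sqlite3.Connection, topic: str, text: str) -> Path:
    """Append a note to <root>/<topic>.md and record it in a SQLite index."""
    note_path = root / f"{topic}.md"
    note_path.parent.mkdir(parents=True, exist_ok=True)  # nested topics become folders
    with note_path.open("a", encoding="utf-8") as handle:
        handle.write(f"- {text}\n")
    db.execute("CREATE TABLE IF NOT EXISTS notes (topic TEXT, path TEXT, body TEXT)")
    db.execute("INSERT INTO notes VALUES (?, ?, ?)", (topic, str(note_path), text))
    db.commit()
    return note_path


with tempfile.TemporaryDirectory() as tmp:
    db = sqlite3.connect(":memory:")
    path = save_note(Path(tmp), db, "tasks/migration", "needs a rollback script")
    saved_text = path.read_text(encoding="utf-8")
    rows = db.execute("SELECT topic, body FROM notes").fetchall()
    db.close()

print(saved_text, rows)
```

The Markdown file stays readable and editable by hand, while the database row makes the note easy to look up by topic.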

Workflow 3: Meeting Participation

OpenHuman can join a Google Meet call as a real participant: it hears everyone, takes notes, can speak back, and pipes its animated face in as the camera feed.

You: "Join the standup and take notes"
Agent: "Joining the standup. I will transcribe and extract action items."

After the meeting:

You: "What were my action items?"
Agent: "Three action items: fix the login bug, review Sarah's PR, and update the API documentation."

Limitations (June 2026)

Voice is not yet a primary way to write code. The sticking points:

  • Precision: Code is exact; speech is loose. "Function called calculate total that takes an array of numbers" is clear. "The thing that does the sum with the list" is not.
  • Context: Voice has none of the visual context of an IDE. You cannot point at a line while you talk.
  • Environment: Open offices and background noise drag accuracy down fast.
  • Privacy: Saying a coding task out loud tells everyone near you what you are working on.
  • Complexity: Multi-step reasoning is harder to track by ear than by eye.

When Voice Works Best

Voice coding earns its keep for:

  • Quick queries: "What does this function do?"
  • Note capture: "Remind me to fix the auth bug"
  • Simple commands: "Run the tests"
  • Documentation: dictating comments and docstrings
  • Accessibility: developers with repetitive strain injury or visual impairments

It falls down on:

  • Complex refactoring: too many files, too many constraints
  • Precise syntax: "Angle bracket question mark extends T greater than" is worse than typing <? extends T>
  • Visual review: reading diffs and comparing screenshots

Implementation for Other Agents

Hermes and Claude Code do not ship a native voice interface, but you can bolt one on. (Hermes Agent from Nous Research lists multi-channel access over Telegram, Slack, Discord and the terminal, with no documented voice mode; Claude Code is a CLI with no built-in voice either.)

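# Voice bridge for Hermes (sketch)
# Assumes SpeechRecognition, PyAudio and openai-whisper are installed.
import speech_recognition as sr

def voice_command(send_to_agent):
    """Capture one utterance, transcribe it locally, and forward the text."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)  # calibrate to room noise
        audio = recognizer.listen(source)            # blocks until the speaker pauses
    text = recognizer.recognize_whisper(audio, model="base").strip()
    if not text:
        return None  # nothing usable was transcribed
    # send_to_agent is a placeholder callable, not a documented Hermes API.
    return send_to_agent(text)

That pattern uses the Uberi/SpeechRecognition Python library, whose Recognizer, Microphone, adjust_for_ambient_noise(), listen() and recognize_whisper() calls all exist as shown. recognize_whisper() runs Whisper locally, so it needs the openai-whisper package and its dependencies, and Microphone needs PyAudio. The send_to_agent argument is a stand-in for however you pass text to Hermes; no particular Hermes Python API is assumed here.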

For Claude Code, macOS has built-in system dictation that types into any text field, including a terminal. Some people pair it with third-party STT tools (one reportedly named WhisperDesktop, which I could not verify as a current product) to feed the terminal.
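
If you would rather script it than rely on dictation, Claude Code can also be run non-interactively with its print flag (claude -p), which takes a prompt, prints the reply and exits. The sketch below forwards already-transcribed text that way. The helper names are hypothetical, the timeout is arbitrary, and you should check claude --help on your installed version before relying on the flag.

```python
import shutil
import subprocess


def build_claude_command(prompt: str) -> list[str]:
    """Return the argument list for a one-shot, non-interactive Claude Code run."""
    return ["claude", "-p", prompt]


def ask_claude(prompt: str, timeout_s: int = 300) -> str:
    """Send transcribed text to Claude Code and return what it prints."""
    if shutil.which("claude") is None:
        raise RuntimeError("claude CLI not found on PATH")
    result = subprocess.run(
        build_claude_command(prompt),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # raise if the CLI exits with an error
    )
    return result.stdout.strip()


print(build_claude_command("What does this function do?"))
```

Passing the prompt as a list element rather than building a shell string matters here: transcribed speech is untrusted input, and this keeps quotes or semicolons in it from being interpreted by a shell.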

Voice is the easiest way to interact with a coding agent casually. It will not take over from typing for precise work. For most of the quick stuff around that work, it is on track to become the default.

What to do next

  1. Pick the smallest useful workflow that proves the pattern.
  2. Write down the owner, data boundary, review point, and success measure.
  3. Review the result after the first real run and decide whether to scale, change, or stop.

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
