Building an AI agent that can write code and modify your file system is exciting, but it’s also a massive security risk if left unchecked. For this project, the goal was to build a voice-controlled local AI agent from scratch that could transcribe audio, understand user intent, and execute file operations—all while being strictly sandboxed and requiring human approval before doing any real damage.
Here is the link to the project.
Here is a breakdown of the architecture, the models I chose, and the engineering challenges I ran into along the way.
System Architecture
The project is written in Python and fully containerized with Docker. At its core, a central AgentOrchestrator manages a linear pipeline:
- Audio Ingestion: A Streamlit frontend captures audio via microphone or file upload.
- Speech-to-Text (STT): The audio is transcribed into text.
- Intent & Tool Parsing: The text, along with recent chat history, is sent to an LLM. The LLM determines if it should just chat, or if it needs to invoke specific Python functions (tools).
- Human-in-the-Loop (HITL): If tools are called, execution pauses. The pending actions are saved to a local TinyDB database, and the UI prompts the user to approve or reject the action.
- Safe Execution: Once approved, the tools are executed through a SafeExecutor class.
Security: The SafeExecutor
One of the core requirements was ensuring the LLM couldn't overwrite system files. I implemented a SafeExecutor wrapper for all file operations (create_file_tool, write_code_tool). It uses os.path.abspath and os.path.commonpath to resolve requested file paths and verify they strictly reside within a dedicated output/ directory. If the LLM attempts a path traversal attack (e.g., ../../etc/passwd), the executor intercepts and blocks it.
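A minimal sketch of that containment check, assuming the agent's files live under a top-level output/ directory (the function name is mine, not the project's):

```python
import os

OUTPUT_DIR = os.path.abspath("output")

def resolve_safe_path(requested: str) -> str:
    """Resolve a requested path and refuse anything that escapes output/."""
    candidate = os.path.abspath(os.path.join(OUTPUT_DIR, requested))
    # commonpath equals OUTPUT_DIR only if candidate sits inside it;
    # "../../etc/passwd" normalizes to a path outside and is rejected.
    if os.path.commonpath([OUTPUT_DIR, candidate]) != OUTPUT_DIR:
        raise PermissionError(f"Blocked path traversal attempt: {requested}")
    return candidate
```

Note that comparing normalized absolute paths, rather than string-prefix matching, is what makes this robust: a naive `startswith("output")` check would wave through a sibling directory like output-evil/.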
Models Chosen
I designed the provider layer using Abstract Base Classes (BaseSTT, BaseLLM) so I could easily hot-swap models from the UI.
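Roughly, the provider contract looks like this. The method names and signatures here are my assumptions for illustration, not the project's actual API; the point is that the orchestrator only ever talks to the base classes, so swapping local for cloud is a one-line change:

```python
from abc import ABC, abstractmethod

class BaseSTT(ABC):
    """Every STT provider exposes the same transcribe contract."""
    @abstractmethod
    def transcribe(self, audio_path: str) -> str: ...

class BaseLLM(ABC):
    """Every LLM provider returns (reply_text, tool_calls)."""
    @abstractmethod
    def chat(self, messages: list[dict], tools: list) -> tuple[str, list]: ...

# Hypothetical concrete provider, just to show the swap works:
class EchoSTT(BaseSTT):
    def transcribe(self, audio_path: str) -> str:
        return f"transcript of {audio_path}"
```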
Speech-to-Text (STT)
- Local: Hugging Face Transformers pipeline running openai/whisper-tiny.en. Chosen for privacy and the ability to run completely offline on CPU/MPS.
- Cloud Fallback: Groq's API running whisper-large-v3-turbo. Chosen for blazing-fast transcription when internet access is available.
Large Language Models (LLM)
- Local: Ollama running llama3.2. The v0.4 Ollama Python client natively supports function calling, making it trivial to pass a list of Python tools and get structured JSON tool calls back without manual prompt engineering.
- Cloud Fallback: Groq API running llama-3.3-70b-versatile. Chosen for its high reasoning capabilities, specifically for handling complex, multi-step compound commands (e.g., "Summarize this text and save it to two different files").
Engineering Challenges
Building the "happy path" is easy, but integrating LLMs, audio processing, and Docker brought up several interesting edge cases.
1. The Audio Extension Trap
Initially, when users uploaded .mp3 or .m4a files via the Streamlit UI, the transcription would return absolute garbage—classic Whisper hallucinations.
The issue: Streamlit passes uploaded files as raw bytes. I was saving these bytes into a temporary .wav file before passing them to the STT models. Because the file extension didn't match the actual underlying audio codec, the models couldn't extract the audio features properly and effectively tried to decode silence.
The fix: I updated the UI to extract the true file extension from the uploaded file and dynamically assign it to the temporary file, allowing the STT engines to decode the formats perfectly.
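A sketch of that fix, assuming Streamlit's UploadedFile interface (it exposes the original filename as .name and the raw bytes via .getvalue()); the helper name is mine:

```python
import os
import tempfile

def save_upload_with_real_extension(uploaded_file) -> str:
    """Write Streamlit upload bytes to a temp file that keeps the
    original extension, so the STT decoder picks the right codec."""
    ext = os.path.splitext(uploaded_file.name)[1] or ".wav"
    with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
        tmp.write(uploaded_file.getvalue())
        return tmp.name
```

With the extension preserved, the downstream decoder (ffmpeg under the hood) selects the correct demuxer instead of misreading compressed MP3/M4A frames as PCM.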
2. Tool-Calling Hallucinations
Even modern models like Llama 3.3 struggle with API constraints. Instead of using the native API tool-calling schema (which the backend expects), the model would occasionally output raw XML tags (e.g., <function=create_file>) or dump concatenated JSON objects directly into its text response like this:
{"type": "function", "name": "create_file_tool"}{"type": "function", "name": "write_code_tool"}.
Because this was just raw text, the orchestrator thought it was a general chat response and skipped the tool execution queue.
The fix: I updated the system prompt to explicitly forbid XML tags, and wrote a fallback JSON parser in the orchestrator. If the LLM returns an empty tool array but its text output starts with {, the parser intercepts it, splits any concatenated }{ blocks, and manually queues them up for Human-in-the-Loop approval.
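A simplified version of such a fallback parser. It re-splits the reply on the `}{` seam between objects, which works for flat tool-call JSON like the example above but would misfire if "}{" ever appeared inside a string value; the real parser presumably needs more care:

```python
import json

def parse_concatenated_tool_calls(text: str) -> list[dict]:
    """Salvage tool calls an LLM dumped as raw text, e.g. '{...}{...}'.
    Returns [] when the reply is ordinary chat, not JSON."""
    text = text.strip()
    if not text.startswith("{"):
        return []
    calls = []
    # Break the concatenated objects apart at each '}{' boundary.
    for chunk in text.replace("}{", "}\n{").splitlines():
        try:
            calls.append(json.loads(chunk))
        except json.JSONDecodeError:
            continue  # skip fragments that still aren't valid JSON
    return calls
```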
3. Python 3.14 Alphas vs. Legacy Audio Libraries
For package management, I used uv. I initially set the pyproject.toml to requires-python = ">=3.12". uv saw this and aggressively downloaded the latest Python 3.14 alpha for the Docker container.
This immediately broke the app because audiorecorder relies on pydub, which relies on audioop, a C extension module that was removed from the standard library in Python 3.13. The app crashed trying to import the pyaudioop backport that pydub falls back to.
The fix: I strictly pinned the project to ==3.12.*, ensuring a stable environment where native audio processing libraries still exist.
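The relevant pyproject.toml fragment looks like this (a minimal sketch, not the full project config):

```toml
[project]
# 3.13+ removed the audioop module that pydub needs, so pin to 3.12.x
requires-python = "==3.12.*"
```

With the exact-minor pin, uv resolves and installs a 3.12 interpreter for the container instead of reaching for the newest pre-release.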
Conclusion
By combining strict sandboxing, an intuitive Human-in-the-Loop UI, and robust fallback parsers to handle LLM quirks, the result is a local, voice-controlled developer assistant that is genuinely useful—and more importantly, safe to run on a local machine.