How to Build a Local Jarvis with OpenVox and PocketTTS
Create a private desktop assistant that thinks through a local Ollama model and speaks through PocketTTS using the OpenVox Local API.
OpenVox Editorial Team
Practical guides for private, local AI voice workflows.
A useful “Jarvis” is not one giant AI model. It is a small local system made from separate parts: an interface that receives a request, a language model that decides what to say, optional tools that perform approved actions, and a text-to-speech engine that speaks the final response.
In this guide, the reasoning layer runs through Ollama, while the voice layer uses PocketTTS inside OpenVox. Both services stay on your own computer. The result is a practical Jarvis-style desktop assistant for macOS or Windows that can answer questions aloud without sending every prompt and spoken response to a cloud voice provider.
Start with a local assistant that can listen, think, and speak reliably. Add tools only after the basic loop is stable, and never give an experimental agent unrestricted command execution.
What we are building
```text
You type or dictate a request
  -> local Python controller
  -> Ollama local LLM
  -> text response
  -> OpenVox Local API
  -> PocketTTS
  -> spoken WAV output
```

The first version uses keyboard input because it is easier to debug. Once the full response loop works, you can add local speech recognition with whisper.cpp, a push-to-talk button, or another offline transcription tool. Keeping input and output separate also makes it obvious whether a failure comes from transcription, reasoning, or speech generation.
Why PocketTTS fits a local assistant
PocketTTS is a 100-million-parameter text-to-speech model supporting English, Spanish, French, German, Portuguese, and Italian, with efficient CPU inference, streaming, long inputs, and voice cloning. That profile is useful for an assistant because short spoken replies should begin quickly and should not require a dedicated gaming GPU.
OpenVox handles the model download, local model management, generation history, and API layer. Your assistant only needs to send text to 127.0.0.1 and play the returned audio. The Local API is available on macOS and Windows; it is not available on iPadOS.
What you need
- A macOS or Windows computer supported by OpenVox.
- OpenVox with PocketTTS downloaded and ready.
- The OpenVox Local API enabled.
- Ollama installed with a local instruction model.
- Python 3.10 or later.
- The Python requests package.
An 8 GB system can run smaller Ollama models and PocketTTS, but avoid loading several large models at once. Computers with 16 GB or more memory give you more freedom to use a stronger language model while keeping the speech layer responsive.
Step 1: Install and test Ollama
Install Ollama from its official website, then choose a model that fits your memory budget. A small instruction model is better for the first test than the largest model your machine can barely load.
```sh
ollama pull llama3.2:3b
ollama run llama3.2:3b
```

Ask the model a simple question. If it responds in the terminal, the local reasoning layer works. Ollama exposes its local API on http://127.0.0.1:11434 by default.
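The same check can be made from Python before any assistant code exists. The sketch below assumes Ollama's GET /api/tags endpoint, which lists the models pulled onto this machine; the helper names are illustrative and are not part of the tutorial's script.

```python
import json
import urllib.request

OLLAMA_BASE = "http://127.0.0.1:11434"  # Ollama's default local address


def model_names(tags_payload):
    """Extract model names from the JSON returned by GET /api/tags."""
    return [entry.get("name", "") for entry in tags_payload.get("models", [])]


def has_model(tags_payload, wanted):
    """Return True if `wanted` (for example "llama3.2:3b") is installed."""
    return wanted in model_names(tags_payload)


def fetch_tags(base_url=OLLAMA_BASE, timeout=5):
    """Ask the local Ollama service which models it has pulled.

    Raises OSError (connection refused, timeout) if Ollama is not running.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as reply:
        return json.load(reply)
```

Calling has_model(fetch_tags(), "llama3.2:3b") should return True once the pull has finished. An OSError from fetch_tags usually means the Ollama service is not running.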
Step 2: Enable PocketTTS in OpenVox
- Open OpenVox on macOS or Windows.
- Download PocketTTS from the model manager.
- Generate a short test sentence in the OpenVox interface.
- Enable the Local API.
- Keep the API on loopback unless another trusted device genuinely needs access.
The OpenVox API base URL is:
```text
http://127.0.0.1:8000/v1
```

Do not guess the PocketTTS model identifier. Query the API and use the exact ID returned by your installed OpenVox version:

```sh
curl http://127.0.0.1:8000/v1/models
```

Copy the PocketTTS model ID from the response. You will place it in the script configuration below. Before the first generation, warm the model using that same ID:

```sh
curl -X POST http://127.0.0.1:8000/v1/models/YOUR_POCKETTTS_MODEL_ID/load
```

Step 3: Create the local assistant
Create a folder for the project, install the only external Python dependency, and add a file named jarvis.py.
```sh
mkdir local-jarvis
cd local-jarvis
python -m venv .venv
# macOS
source .venv/bin/activate
# Windows PowerShell
.venv\Scripts\Activate.ps1
pip install requests
```

Use this controller as the starting point:
```python
import os
import platform
import subprocess
from pathlib import Path

import requests

# Local service endpoints and settings; override models via environment variables.
OLLAMA_URL = "http://127.0.0.1:11434/api/chat"
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3.2:3b")
OPENVOX_API = "http://127.0.0.1:8000/v1"
OPENVOX_MODEL = os.getenv("OPENVOX_MODEL")
OPENVOX_LANGUAGE = os.getenv("OPENVOX_LANGUAGE", "en")
OUTPUT_FILE = Path("jarvis-reply.wav").resolve()

SYSTEM_PROMPT = """
You are a concise local desktop assistant.
Answer clearly and conversationally.
Keep most spoken replies under 120 words unless the user asks for detail.
Do not claim that you performed an action unless the controller confirms it.
"""


def ask_ollama(messages):
    """Send the conversation to Ollama and return the reply text."""
    response = requests.post(
        OLLAMA_URL,
        json={
            "model": OLLAMA_MODEL,
            "messages": messages,
            "stream": False,
        },
        timeout=180,
    )
    response.raise_for_status()
    return response.json()["message"]["content"].strip()


def speak_with_openvox(text):
    """Generate speech through the OpenVox Local API and play it."""
    if not OPENVOX_MODEL:
        raise RuntimeError(
            "Set OPENVOX_MODEL to the PocketTTS model id returned by GET /v1/models."
        )
    response = requests.post(
        f"{OPENVOX_API}/audio/speech",
        json={
            "model": OPENVOX_MODEL,
            "input": text,
            "language": OPENVOX_LANGUAGE,
            "response_format": "wav",
        },
        timeout=180,
    )
    response.raise_for_status()
    OUTPUT_FILE.write_bytes(response.content)
    play_audio(OUTPUT_FILE)


def play_audio(path):
    """Play a WAV file with the platform's built-in player."""
    system = platform.system()
    if system == "Darwin":
        subprocess.run(["afplay", str(path)], check=True)
        return
    if system == "Windows":
        import winsound  # only available on Windows

        winsound.PlaySound(str(path), winsound.SND_FILENAME)
        return
    # Fallback for Linux systems with ALSA installed.
    subprocess.run(["aplay", str(path)], check=True)


def main():
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    print("Local Jarvis is ready. Type 'quit' to stop.")
    while True:
        user_text = input("\nYou: ").strip()
        if not user_text:
            continue
        if user_text.lower() in {"quit", "exit"}:
            break
        messages.append({"role": "user", "content": user_text})
        try:
            reply = ask_ollama(messages)
            print(f"\nJarvis: {reply}")
            messages.append({"role": "assistant", "content": reply})
            speak_with_openvox(reply)
        except requests.RequestException as error:
            print(f"Local service error: {error}")
        except RuntimeError as error:
            print(error)


if __name__ == "__main__":
    main()
```

Step 4: Configure and run it
Set the model ID you copied from OpenVox. You can also change the Ollama model without editing the script.
```sh
# macOS
export OPENVOX_MODEL="YOUR_POCKETTTS_MODEL_ID"
export OPENVOX_LANGUAGE="en"
export OLLAMA_MODEL="llama3.2:3b"
python jarvis.py
# Windows PowerShell
$env:OPENVOX_MODEL="YOUR_POCKETTTS_MODEL_ID"
$env:OPENVOX_LANGUAGE="en"
$env:OLLAMA_MODEL="llama3.2:3b"
python .\jarvis.py
```

The example uses en for English. PocketTTS also supports Spanish, French, German, Portuguese, and Italian. Query GET /v1/models/{model}/languages and use the exact language code returned by your OpenVox version.
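The model ID lookup can also be scripted instead of copied by hand. The sketch below rests on several assumptions that this guide does not confirm: that GET /v1/models returns either a bare list or an OpenAI-style object with a data array, that each entry is a string or carries an id field, and that the PocketTTS ID contains the word "pocket". Inspect the real response from your OpenVox version before relying on it.

```python
import json
import urllib.request

OPENVOX_API = "http://127.0.0.1:8000/v1"


def extract_model_ids(payload):
    """Collect model IDs from a /models response.

    ASSUMPTION: the payload is a list of entries or {"data": [...]},
    and each entry is either a string or a dict with an "id" key.
    """
    entries = payload.get("data", []) if isinstance(payload, dict) else payload
    ids = []
    for entry in entries:
        if isinstance(entry, str):
            ids.append(entry)
        elif isinstance(entry, dict) and "id" in entry:
            ids.append(str(entry["id"]))
    return ids


def find_model_id(payload, hint="pocket"):
    """Return the single ID containing `hint`; raise if none or several match."""
    matches = [mid for mid in extract_model_ids(payload) if hint in mid.lower()]
    if len(matches) != 1:
        raise LookupError(f"Expected one match for {hint!r}, found {matches}")
    return matches[0]


def fetch_models(api_base=OPENVOX_API, timeout=10):
    """Call GET /v1/models on the local OpenVox API."""
    with urllib.request.urlopen(f"{api_base}/models", timeout=timeout) as reply:
        return json.load(reply)
```

Raising on zero or multiple matches is deliberate: it keeps the helper from silently guessing an identifier, which is the mistake Step 2 warns against.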
Try prompts that produce short, easy-to-check answers:
- “Summarize the difference between RAM and storage.”
- “Give me a three-step plan for today.”
- “Explain this error message in plain English.”
- “Write a short spoken reminder to take a break in twenty minutes.”
How the code works
| Component | Responsibility |
|---|---|
| Python loop | Collects input, maintains conversation history, and coordinates local services. |
| Ollama | Generates the assistant's text response locally. |
| OpenVox API | Accepts the response text and manages local speech generation. |
| PocketTTS | Produces the voice output in the selected language efficiently on local hardware. |
| System audio player | Plays the returned WAV file through the computer's speakers. |
Add local microphone input
Once typed requests work, add speech-to-text as a separate stage. whisper.cpp is a common local option because it can run offline on macOS, Windows, and Linux. A clean push-to-talk workflow is easier and more private than an always-listening microphone:
- Press a key or button to begin recording.
- Record a short WAV file.
- Transcribe it locally with whisper.cpp.
- Send the transcript to the same ask_ollama function.
- Speak the answer through OpenVox and PocketTTS.
```text
Microphone
  -> local WAV recording
  -> whisper.cpp transcript
  -> Ollama response
  -> OpenVox PocketTTS
  -> speakers
```

Start with a small Whisper model on low-end systems. Larger transcription models can improve accuracy but also increase latency and memory pressure. Keep the recording trigger visible so the user always knows when the microphone is active.
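The transcription stage can be sketched as a thin wrapper around the whisper.cpp command-line tool. This assumes you have built whisper.cpp yourself. The binary path, the model path, and the helper names are placeholders for your own setup, and older builds name the binary main rather than whisper-cli. Recording the WAV file is left out because it depends on the audio library you choose; whisper.cpp expects 16 kHz WAV input.

```python
import subprocess
from pathlib import Path

# ASSUMPTIONS: adjust both paths to match your own whisper.cpp build.
WHISPER_BIN = Path("whisper.cpp/build/bin/whisper-cli")
WHISPER_MODEL = Path("whisper.cpp/models/ggml-base.en.bin")


def build_whisper_command(wav_path, out_prefix):
    """Build the argument list for one offline transcription run."""
    return [
        str(WHISPER_BIN),
        "-m", str(WHISPER_MODEL),  # model file
        "-f", str(wav_path),       # input recording
        "-otxt",                   # write a plain-text transcript
        "-of", str(out_prefix),    # output path without the .txt extension
        "-nt",                     # omit timestamps
    ]


def transcribe(wav_path):
    """Run whisper.cpp on a WAV file and return the transcript text."""
    wav_path = Path(wav_path)
    out_prefix = wav_path.with_suffix("")
    subprocess.run(build_whisper_command(wav_path, out_prefix), check=True)
    return Path(f"{out_prefix}.txt").read_text(encoding="utf-8").strip()
```

In the main loop, user_text = transcribe("request.wav") would take the place of the input() call, and the rest of the loop stays unchanged. Passing the command as a list, without a shell, keeps the file name from being interpreted as shell syntax.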
Give your assistant a consistent PocketTTS voice
PocketTTS supports voice cloning from an audio prompt. In OpenVox, create or select the PocketTTS voice you want, then query the model's available voices through the Local API:
```sh
curl "http://127.0.0.1:8000/v1/models/YOUR_POCKETTTS_MODEL_ID/voices?language=en"
```

Replace en with the selected supported language code. If the API returns a voice ID, add it to the speech request as the voice value. Use only your own voice or a voice for which you have explicit permission. A fictional assistant persona does not make cloning a real person without consent acceptable.
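One way to carry the voice ID into the speech request is to build the request body in a small helper. This is a sketch: the OPENVOX_VOICE environment variable and the function name are hypothetical, and the voice key follows the description above, so confirm it against your OpenVox version.

```python
import os

# Hypothetical setting: the voice ID returned by the voices endpoint, if any.
OPENVOX_VOICE = os.getenv("OPENVOX_VOICE")


def build_speech_payload(model, text, language="en", voice=None):
    """Build the JSON body for POST /audio/speech.

    The first four keys mirror the tutorial script; "voice" is added
    only when a voice ID was actually chosen.
    """
    payload = {
        "model": model,
        "input": text,
        "language": language,
        "response_format": "wav",
    }
    if voice:  # leave the key out entirely when no voice is configured
        payload["voice"] = voice
    return payload
```

Inside speak_with_openvox, the json= argument could then become build_speech_payload(OPENVOX_MODEL, text, OPENVOX_LANGUAGE, OPENVOX_VOICE). Leaving the key out when no voice is set keeps the request identical to the original script.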
Add safe local tools
A conversational assistant becomes more useful when it can perform small actions. Do not begin by passing model output directly into PowerShell, Terminal, or subprocess. Instead, expose a short allowlist of functions with fixed arguments and ask for confirmation before anything changes.
```python
SAFE_TOOLS = {
    "get_time": get_time,
    "open_notes": open_notes,
    "read_today_tasks": read_today_tasks,
}

# Never do this:
# subprocess.run(model_generated_command, shell=True)
```

- Start with read-only tools such as time, weather from a chosen source, or local task summaries.
- Require explicit confirmation before opening apps, editing files, or sending messages.
- Keep tool arguments structured and validate every path or identifier.
- Log requested and completed actions separately from ordinary conversation.
- Let the assistant say “I cannot do that” rather than improvising a dangerous command.
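The rules above can be sketched as a small dispatcher. This is an illustration under assumptions, not a finished design: the tool bodies are placeholders, and the allowlist here stores a confirmation flag next to each function, which is a variation on the plain name-to-function mapping shown earlier.

```python
from datetime import datetime


def get_time():
    """Read-only tool: report the current local time."""
    return datetime.now().strftime("It is %H:%M.")


def open_notes():
    """Placeholder for an action that changes something on the desktop."""
    return "Notes opened."


# name -> (function, needs_confirmation); only these names can ever run.
SAFE_TOOLS = {
    "get_time": (get_time, False),
    "open_notes": (open_notes, True),
}


def run_tool(name, confirm=lambda prompt: False):
    """Run an allowlisted tool and return (completed, message).

    `confirm` receives a prompt and returns True to approve. The default
    refuses, so nothing that changes state runs by accident.
    """
    if name not in SAFE_TOOLS:
        return False, "I cannot do that."
    tool, needs_confirmation = SAFE_TOOLS[name]
    if needs_confirmation and not confirm(f"Allow '{name}'? [y/N] "):
        return False, f"Cancelled {name}."
    return True, tool()
```

For interactive use, pass confirm=lambda prompt: input(prompt).strip().lower() == "y". The completed flag makes it straightforward to log requested and completed actions separately, and to tell the model only what the controller actually did.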
Improve response speed
- Use a smaller Ollama model for routine assistant conversations.
- Keep Ollama and PocketTTS loaded instead of restarting them for every request.
- Tell the LLM to keep spoken answers concise.
- Summarize older conversation turns instead of sending unlimited chat history.
- Use OpenVox streaming when your client is ready to decode incremental audio.
- Avoid running transcription, LLM inference, and several TTS jobs concurrently on a low-memory system.
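Two of these points can be sketched in a few lines. The helper below bounds the chat history by simple truncation, which is a cheaper stand-in for the summarization suggested above, and the request body sets Ollama's keep_alive option so the model stays loaded between requests. The function names and default values are illustrative assumptions.

```python
def trim_history(messages, max_messages=8):
    """Keep system messages plus the most recent non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    recent = dialogue[-max_messages:] if max_messages > 0 else []
    # Avoid starting the kept window with an orphaned assistant reply.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return system + recent


def chat_payload(model, messages, max_messages=8, keep_alive="10m"):
    """Build an Ollama /api/chat body with bounded history."""
    return {
        "model": model,
        "messages": trim_history(messages, max_messages),
        "stream": False,
        "keep_alive": keep_alive,  # how long Ollama keeps the model in memory
    }
```

In ask_ollama, the json= argument could become chat_payload(OLLAMA_MODEL, messages). Truncation forgets older turns entirely, so replace it with a real summary step if the assistant needs long-term context.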
Troubleshooting
| Problem | Likely cause | Fix |
|---|---|---|
| Ollama connection refused | Ollama is not running or its local service is unavailable. | Start Ollama and test the selected model in its CLI first. |
| OpenVox returns 404 | The model ID or endpoint is incorrect. | Call GET /v1/models and copy the exact PocketTTS ID. |
| OpenVox returns 429 | Another model load or generation job is active. | Wait briefly and retry instead of starting parallel requests. |
| Audio file is empty or invalid | The API returned an error body instead of WAV data. | Print the response status and content type before writing the file. |
| The assistant feels slow | The Ollama model is too large or memory is being swapped. | Use a smaller local LLM and close other memory-heavy applications. |
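The 429 and invalid-audio rows can be handled in code before anything is written to disk. The sketch below is one possible approach with hypothetical helper names: it inspects the status code, checks the RIFF/WAVE header that every WAV file starts with, and produces a short series of backoff delays for retries.

```python
def looks_like_wav(data):
    """Check the RIFF/WAVE magic bytes at the start of a WAV file."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"


def check_speech_response(status, content_type, body):
    """Return None if the response is usable audio, else a short reason."""
    if status == 429:
        return "busy: wait and retry"
    if status != 200:
        return f"HTTP {status}: {body[:200]!r}"
    if not looks_like_wav(body):
        return f"not WAV data (content type {content_type!r})"
    return None


def retry_delays(attempts=4, first=0.5, factor=2.0):
    """Yield waiting times in seconds for successive retries."""
    delay = first
    for _ in range(attempts):
        yield delay
        delay *= factor
```

In speak_with_openvox, call check_speech_response(response.status_code, response.headers.get("Content-Type"), response.content) before write_bytes. When the result starts with "busy", sleep for the next value from retry_delays() and send the request again rather than starting a parallel one.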
Privacy and naming note
This is an independent Jarvis-style assistant tutorial, not an official Marvel or Iron Man product. Give your personal assistant any name you prefer. More importantly, verify the privacy behavior of every optional service you add. The core Ollama and OpenVox path can remain local, but web search, weather APIs, cloud calendars, and messaging integrations may transmit data to their own providers.
Final architecture
The useful version of a local Jarvis is intentionally modular. Ollama handles language, OpenVox handles the localhost speech interface, and PocketTTS provides fast voice output in six supported languages without requiring a dedicated GPU. Optional speech recognition and carefully allowlisted tools can be added later without replacing the working core.
Build the smallest loop first: ask, answer, speak. Once that is dependable, add push-to-talk, memory, reminders, and safe tools one at a time. That approach produces a much more capable assistant than starting with unrestricted automation and hoping the model behaves.