Local AI Assistants • June 15, 2026 • 12 min read

How to Build a Local Jarvis with OpenVox and PocketTTS

Create a private desktop assistant that thinks through a local Ollama model and speaks through PocketTTS using the OpenVox Local API.

OpenVox Editorial Team

Practical guides for private, local AI voice workflows.

A useful “Jarvis” is not one giant AI model. It is a small local system made from separate parts: an interface that receives a request, a language model that decides what to say, optional tools that perform approved actions, and a text-to-speech engine that speaks the final response.

In this guide, the reasoning layer runs through Ollama, while the voice layer uses PocketTTS inside OpenVox. Both services stay on your own computer. The result is a practical Jarvis-style desktop assistant for macOS or Windows that can answer questions aloud without sending every prompt and spoken response to a cloud voice provider.

Start with a local assistant that can listen, think, and speak reliably. Add tools only after the basic loop is stable, and never give an experimental agent unrestricted command execution.

What we are building

You type or dictate a request
  -> local Python controller
  -> Ollama local LLM
  -> text response
  -> OpenVox Local API
  -> PocketTTS
  -> spoken WAV output

The first version uses keyboard input because it is easier to debug. Once the full response loop works, you can add local speech recognition with whisper.cpp, a push-to-talk button, or another offline transcription tool. Keeping input and output separate also makes it obvious whether a failure comes from transcription, reasoning, or speech generation.

Why PocketTTS fits a local assistant

PocketTTS is a 100-million-parameter text-to-speech model supporting English, Spanish, French, German, Portuguese, and Italian, with efficient CPU inference, streaming, long inputs, and voice cloning. That profile is useful for an assistant because short spoken replies should begin quickly and should not require a dedicated gaming GPU.

OpenVox handles the model download, local model management, generation history, and API layer. Your assistant only needs to send text to 127.0.0.1 and play the returned audio. The Local API is available on macOS and Windows; it is not available on iPadOS.

What you need

A macOS or Windows computer supported by OpenVox.
OpenVox with PocketTTS downloaded and ready.
The OpenVox Local API enabled.
Ollama installed with a local instruction model.
Python 3.10 or later.
The Python requests package.

An 8 GB system can run smaller Ollama models and PocketTTS, but avoid loading several large models at once. A computer with 16 GB or more of memory gives you more freedom to use a stronger language model while keeping the speech layer responsive.

Step 1: Install and test Ollama

Install Ollama from its official website, then choose a model that fits your memory budget. A small instruction model is better for the first test than the largest model your machine can barely load.

ollama pull llama3.2:3b
ollama run llama3.2:3b

Ask the model a simple question. If it responds in the terminal, the local reasoning layer works. Ollama exposes its local API on http://127.0.0.1:11434 by default.
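
The same check can be made from Python, which is closer to how the controller will talk to Ollama later. The sketch below is a minimal example that assumes the default address above and uses Ollama's /api/tags endpoint to list installed models. The function names are illustrative and not part of any library.

```python
import json
import urllib.request

# Default local Ollama address; change it if you moved the service.
OLLAMA_TAGS_URL = "http://127.0.0.1:11434/api/tags"


def installed_model_names(payload):
    """Return the model names from a parsed /api/tags response body."""
    return [entry["name"] for entry in payload.get("models", [])]


def fetch_installed_models(url=OLLAMA_TAGS_URL, timeout=5):
    """Ask the local Ollama service which models are installed.

    Raises OSError (for example URLError) when the service is not reachable.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return installed_model_names(json.load(response))
```

Calling fetch_installed_models() while Ollama is running should return a list that includes the model you pulled. A connection error at this point usually means the Ollama service has not been started.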

Step 2: Enable PocketTTS in OpenVox

Open OpenVox on macOS or Windows.
Download PocketTTS from the model manager.
Generate a short test sentence in the OpenVox interface.
Enable the Local API.
Keep the API on loopback unless another trusted device genuinely needs access.

The OpenVox API base URL is:

http://127.0.0.1:8000/v1
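
If you want the controller to enforce the loopback advice rather than rely on memory, a small guard can refuse any base URL that does not point at the local machine. This is a sketch using only the Python standard library. The helper name is_loopback_url is hypothetical, and treating every non-literal hostname other than localhost as remote is a deliberately strict assumption.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_url(url):
    """Return True when the URL host is localhost or a loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Other hostnames would need a DNS lookup, so treat them as non-local.
        return False
```

A controller could call this once at startup and stop with a clear message if the configured API address is not local.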

Do not guess the PocketTTS model identifier. Query the API and use the exact ID returned by your installed OpenVox version:

curl http://127.0.0.1:8000/v1/models

Copy the PocketTTS model ID from the response. You will place it in the script configuration below. Before the first generation, warm the model using that same ID:

curl -X POST http://127.0.0.1:8000/v1/models/YOUR_POCKETTTS_MODEL_ID/load
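
If you would rather pick the ID out of the response in Python, the sketch below filters the returned entries by a search term. The response shape is an assumption: it handles either a plain list or an object with a "data" list, with each entry being a string or carrying an "id" field. Check the real output of your OpenVox version before relying on it. The search term "pocket" is also only a guess at how the ID is spelled, so confirm the match by eye.

```python
def find_model_ids(payload, needle="pocket"):
    """Return model IDs that contain `needle`, ignoring case.

    Assumption: `payload` is the parsed JSON body of GET /v1/models and is
    either a list of entries or a dict with a "data" list.
    """
    entries = payload.get("data", []) if isinstance(payload, dict) else payload
    matches = []
    for entry in entries:
        model_id = entry.get("id") if isinstance(entry, dict) else entry
        if isinstance(model_id, str) and needle.lower() in model_id.lower():
            matches.append(model_id)
    return matches
```

The function only narrows down what the API returned. It never constructs an ID, so an empty result means you should read the raw response instead of guessing.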

Step 3: Create the local assistant

Create a folder for the project, install the only external Python dependency, and add a file named jarvis.py.

mkdir local-jarvis
cd local-jarvis
# If "python" is not found on macOS, use "python3" for this command.
python -m venv .venv

# macOS
source .venv/bin/activate

# Windows PowerShell
.venv\Scripts\Activate.ps1

pip install requests

Use this controller as the starting point:

import os
import platform
import subprocess
from pathlib import Path

import requests

OLLAMA_URL = "http://127.0.0.1:11434/api/chat"
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3.2:3b")

OPENVOX_API = "http://127.0.0.1:8000/v1"
OPENVOX_MODEL = os.getenv("OPENVOX_MODEL")
OPENVOX_LANGUAGE = os.getenv("OPENVOX_LANGUAGE", "en")
OUTPUT_FILE = Path("jarvis-reply.wav").resolve()

SYSTEM_PROMPT = """
You are a concise local desktop assistant.
Answer clearly and conversationally.
Keep most spoken replies under 120 words unless the user asks for detail.
Do not claim that you performed an action unless the controller confirms it.
"""


def ask_ollama(messages):
    """Send the full chat history to Ollama and return the reply text."""
    response = requests.post(
        OLLAMA_URL,
        json={
            "model": OLLAMA_MODEL,
            "messages": messages,
            "stream": False,
        },
        timeout=180,
    )
    response.raise_for_status()
    return response.json()["message"]["content"].strip()


def speak_with_openvox(text):
    """Turn text into a WAV file through the OpenVox Local API and play it."""
    if not OPENVOX_MODEL:
        raise RuntimeError(
            "Set OPENVOX_MODEL to the PocketTTS model id returned by GET /v1/models."
        )

    response = requests.post(
        f"{OPENVOX_API}/audio/speech",
        json={
            "model": OPENVOX_MODEL,
            "input": text,
            "language": OPENVOX_LANGUAGE,
            "response_format": "wav",
        },
        timeout=180,
    )
    response.raise_for_status()
    OUTPUT_FILE.write_bytes(response.content)
    play_audio(OUTPUT_FILE)


def play_audio(path):
    """Play a WAV file with the platform's built-in player and wait until it ends."""
    system = platform.system()

    if system == "Darwin":
        subprocess.run(["afplay", str(path)], check=True)
        return

    if system == "Windows":
        import winsound

        winsound.PlaySound(str(path), winsound.SND_FILENAME)
        return

    # Fallback for Linux systems that provide ALSA's aplay.
    subprocess.run(["aplay", str(path)], check=True)


def main():
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    print("Local Jarvis is ready. Type 'quit' to stop.")

    while True:
        user_text = input("\nYou: ").strip()
        if not user_text:
            continue
        if user_text.lower() in {"quit", "exit"}:
            break

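        messages.append({"role": "user", "content": user_text})

        try:
            reply = ask_ollama(messages)
        except requests.RequestException as error:
            # Drop the unanswered turn so the history stays in user/assistant pairs.
            messages.pop()
            print(f"Ollama error: {error}")
            continue

        print(f"\nJarvis: {reply}")
        messages.append({"role": "assistant", "content": reply})

        try:
            speak_with_openvox(reply)
        except requests.RequestException as error:
            print(f"OpenVox error: {error}")
        except (OSError, subprocess.CalledProcessError) as error:
            # The audio player is missing or exited with an error.
            print(f"Audio playback error: {error}")
        except RuntimeError as error:
            print(error)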


if __name__ == "__main__":
    main()

Step 4: Configure and run it

Set the model ID you copied from OpenVox. You can also change the Ollama model without editing the script.

# macOS
export OPENVOX_MODEL="YOUR_POCKETTTS_MODEL_ID"
export OPENVOX_LANGUAGE="en"
export OLLAMA_MODEL="llama3.2:3b"
python jarvis.py

# Windows PowerShell
$env:OPENVOX_MODEL="YOUR_POCKETTTS_MODEL_ID"
$env:OPENVOX_LANGUAGE="en"
$env:OLLAMA_MODEL="llama3.2:3b"
python .\jarvis.py

The example uses en for English. PocketTTS also supports Spanish, French, German, Portuguese, and Italian. Query GET /v1/models/{model}/languages, replacing {model} with your PocketTTS model ID, and use the exact language code returned by your OpenVox version.

Try prompts that produce short, easy-to-check answers:

“Summarize the difference between RAM and storage.”
“Give me a three-step plan for today.”
“Explain this error message in plain English.”
“Write a short spoken reminder to take a break in twenty minutes.”

How the code works

Component	Responsibility
Python loop	Collects input, maintains conversation history, and coordinates local services.
Ollama	Generates the assistant's text response locally.
OpenVox API	Accepts the response text and manages local speech generation.
PocketTTS	Produces the voice output in the selected language efficiently on local hardware.
System audio player	Plays the returned WAV file through the computer's speakers.

Add local microphone input

Once typed requests work, add speech-to-text as a separate stage. whisper.cpp is a common local option because it can run offline on macOS, Windows, and Linux. A clean push-to-talk workflow is easier and more private than an always-listening microphone:

Press a key or button to begin recording.
Record a short WAV file.
Transcribe it locally with whisper.cpp.
Send the transcript to the same ask_ollama function.
Speak the answer through OpenVox and PocketTTS.

Microphone
  -> local WAV recording
  -> whisper.cpp transcript
  -> Ollama response
  -> OpenVox PocketTTS
  -> speakers

Start with a small Whisper model on low-end systems. Larger transcription models can improve accuracy but also increase latency and memory pressure. Keep the recording trigger visible so the user always knows when the microphone is active.
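
The transcription stage can be sketched as a thin wrapper around the whisper.cpp command-line tool. The binary and model paths below are placeholders for wherever you built whisper.cpp and stored a model, and the function names are illustrative. The flags used are -m for the model file, -f for the input WAV, -nt to omit timestamps, and -np to suppress everything except the transcript. whisper.cpp expects 16 kHz WAV input, so record or convert accordingly.

```python
import subprocess
from pathlib import Path

# Placeholder paths: point these at your own whisper.cpp build and model file.
WHISPER_BINARY = Path("whisper.cpp/build/bin/whisper-cli")
WHISPER_MODEL = Path("whisper.cpp/models/ggml-base.en.bin")


def build_whisper_command(wav_path, binary=WHISPER_BINARY, model=WHISPER_MODEL):
    """Build the argument list for one offline transcription run."""
    return [
        str(binary),
        "-m", str(model),     # model file to load
        "-f", str(wav_path),  # recorded WAV file to transcribe
        "-nt",                # do not print timestamps
        "-np",                # print only the transcription result
    ]


def transcribe(wav_path):
    """Run whisper.cpp on a WAV file and return the transcript as one line."""
    result = subprocess.run(
        build_whisper_command(wav_path),
        capture_output=True,
        text=True,
        check=True,
    )
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join(result.stdout.split())
```

Passing the command as a list, without shell=True, keeps the file name from being interpreted by a shell. The returned transcript can be appended to the message history exactly like typed input.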

Give your assistant a consistent PocketTTS voice

PocketTTS supports voice cloning from an audio prompt. In OpenVox, create or select the PocketTTS voice you want, then query the model's available voices through the Local API:

curl "http://127.0.0.1:8000/v1/models/YOUR_POCKETTTS_MODEL_ID/voices?language=en"

Replace en with the selected supported language code. If the API returns a voice ID, add it to the speech request as the voice value. Use only your own voice or a voice for which you have explicit permission. A fictional assistant persona does not make cloning a real person without consent acceptable.
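
One way to keep the voice optional is to build the request body in a small helper and add the voice field only when an ID is set. This is a sketch: the helper name build_speech_payload is hypothetical, and the other fields simply mirror the speech request used in the controller script.

```python
def build_speech_payload(text, model, language="en", voice=None):
    """Build the JSON body for a speech request, adding a voice only if given."""
    payload = {
        "model": model,
        "input": text,
        "language": language,
        "response_format": "wav",
    }
    if voice:
        # Use the exact voice ID returned by the voices endpoint.
        payload["voice"] = voice
    return payload
```

The controller could read the voice ID from an environment variable and pass the result as the json argument of its existing requests.post call.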

Add safe local tools

A conversational assistant becomes more useful when it can perform small actions. Do not begin by passing model output directly into PowerShell, Terminal, or subprocess. Instead, expose a short allowlist of functions with fixed arguments and ask for confirmation before anything changes.

SAFE_TOOLS = {
    "get_time": get_time,
    "open_notes": open_notes,
    "read_today_tasks": read_today_tasks,
}

# Never do this:
# subprocess.run(model_generated_command, shell=True)

Start with read-only tools such as time, weather from a chosen source, or local task summaries.
Require explicit confirmation before opening apps, editing files, or sending messages.
Keep tool arguments structured and validate every path or identifier.
Log requested and completed actions separately from ordinary conversation.
Let the assistant say “I cannot do that” rather than improvising a dangerous command.
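
The rules above can be sketched as a small dispatcher. Everything here is illustrative: get_time and open_notes are stand-in tools, the TOOLS table and run_tool function are hypothetical names, and the confirm callback defaults to refusing so that nothing runs without an explicit yes.

```python
from datetime import datetime


def get_time():
    """Read-only tool: return the current local time as HH:MM."""
    return datetime.now().strftime("%H:%M")


def open_notes():
    """Placeholder for an action that changes something on the desktop."""
    return "Notes opened."


# Allowlist: tool name -> (function, needs_confirmation)
TOOLS = {
    "get_time": (get_time, False),
    "open_notes": (open_notes, True),
}


def run_tool(name, confirm=lambda prompt: False):
    """Run an allowlisted tool by name, asking for confirmation when required."""
    if name not in TOOLS:
        # Unknown requests are refused instead of being turned into commands.
        return "I cannot do that."
    function, needs_confirmation = TOOLS[name]
    if needs_confirmation and not confirm(f"Allow '{name}'?"):
        return "Cancelled."
    return function()
```

In an interactive session, confirm could be a function that prints the prompt and returns True only when the user types yes. The model supplies nothing but a tool name, and any name outside the table is rejected.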

Improve response speed

Use a smaller Ollama model for routine assistant conversations.
Keep Ollama and PocketTTS loaded instead of restarting them for every request.
Tell the LLM to keep spoken answers concise.
Summarize older conversation turns instead of sending unlimited chat history.
Use OpenVox streaming when your client is ready to decode incremental audio.
Avoid running transcription, LLM inference, and several TTS jobs concurrently on a low-memory system.
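
Summarizing old turns needs an extra model call, so a simpler stand-in is to cap the history at a fixed number of recent turns. The sketch below drops older messages rather than summarizing them. It assumes the message format used in the controller script, and the name trim_history and the default of six turns are arbitrary choices.

```python
def trim_history(messages, max_turns=6):
    """Keep system messages plus roughly the last `max_turns` exchanges."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    keep = max_turns * 2  # one user and one assistant message per turn
    recent = rest[-keep:] if keep > 0 else []

    # Make sure the kept window starts with a user message.
    while recent and recent[0]["role"] != "user":
        recent.pop(0)
    return system + recent
```

Calling this on the history just before each Ollama request keeps the prompt size bounded, at the cost of the assistant forgetting older details.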

Troubleshooting

Problem	Likely cause	Fix
Ollama connection refused	Ollama is not running or its local service is unavailable.	Start Ollama and test the selected model in its CLI first.
OpenVox returns 404	The model ID or endpoint is incorrect.	Call `GET /v1/models` and copy the exact PocketTTS ID.
OpenVox returns 429	Another model load or generation job is active.	Wait briefly and retry instead of starting parallel requests.
Audio file is empty or invalid	The API returned an error body instead of WAV data.	Print the response status and content type before writing the file.
The assistant feels slow	The Ollama model is too large or memory is being swapped.	Use a smaller local LLM and close other memory-heavy applications.
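
Two of the fixes above lend themselves to small helpers: checking that a response really contains WAV data before writing it, and retrying after a 429 instead of sending parallel requests. The sketch below is one possible shape. All function names are hypothetical, the retry count and delay are arbitrary, and the only format fact relied on is that a WAV file starts with a RIFF header whose type field is WAVE.

```python
import time


def looks_like_wav(data):
    """Check for the RIFF/WAVE signature at the start of a WAV file."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"


def check_speech_response(status_code, content_type, body):
    """Return the body if it is WAV audio, otherwise raise RuntimeError."""
    if status_code != 200:
        preview = body[:200].decode("utf-8", errors="replace")
        raise RuntimeError(f"OpenVox returned HTTP {status_code}: {preview}")
    if not looks_like_wav(body):
        kind = content_type or "an unknown content type"
        raise RuntimeError(f"Expected WAV audio but received {kind}.")
    return body


def call_with_retry(send, attempts=3, delay=1.0, sleep=time.sleep):
    """Call send() again while it returns status 429, up to `attempts` calls."""
    response = send()
    for attempt in range(1, attempts):
        if response.status_code != 429:
            break
        sleep(delay * attempt)  # wait a little longer before each retry
        response = send()
    return response
```

With the requests library, send could be a lambda that wraps the existing speech request, and the response's status_code, Content-Type header, and content can be passed to check_speech_response before the file is written.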

Privacy and naming note

This is an independent Jarvis-style assistant tutorial, not an official Marvel or Iron Man product. Give your personal assistant any name you prefer. More importantly, verify the privacy behavior of every optional service you add. The core Ollama and OpenVox path can remain local, but web search, weather APIs, cloud calendars, and messaging integrations may transmit data to their own providers.

Final architecture

The useful version of a local Jarvis is intentionally modular. Ollama handles language, OpenVox handles the localhost speech interface, and PocketTTS provides fast voice output in six supported languages without requiring a dedicated GPU. Optional speech recognition and carefully allowlisted tools can be added later without replacing the working core.

Build the smallest loop first: ask, answer, speak. Once that is dependable, add push-to-talk, memory, reminders, and safe tools one at a time. That approach produces a much more capable assistant than starting with unrestricted automation and hoping the model behaves.

