How to Run Local TTS Without a GPU on Low-End Systems
You do not need a gaming PC or dedicated GPU to generate speech locally. PocketTTS and Kokoro make useful CPU-only text-to-speech possible on modest hardware.
OpenVox Editorial Team
Practical guides for private, local AI voice workflows.
Local text-to-speech is often presented as a GPU workload. That is true for some large speech models, but it is not true for every useful TTS system. If your computer has an older CPU, integrated graphics, limited memory, or no dedicated GPU at all, you can still generate natural speech locally by choosing a compact model and keeping the workflow lean.
Two models stand out for this job: PocketTTS and Kokoro-82M. Both are small compared with heavyweight voice models, but they solve slightly different problems. PocketTTS is explicitly designed for efficient CPU execution, streaming, long inputs, six-language generation, and voice cloning. Kokoro is an 82-million-parameter open-weight model that offers multiple languages and preset voices with a permissive Apache 2.0 license.
The best model for a low-end system is not simply the smallest one. It is the model that produces acceptable speech without exhausting your memory, overheating the machine, or making you wait longer than the audio takes to play.
Can local TTS really run without a GPU?
Yes. Speech generation still requires computation, but compact models can run entirely on a CPU. A GPU may improve some workloads, yet it is not a requirement for these two models. PocketTTS is built around CPU efficiency and uses only two CPU cores in Kyutai's published reference measurements. Its model card also reports roughly 200 milliseconds to the first streamed audio chunk on the tested hardware.
Kokoro is similarly compact at 82 million parameters. Its official model card describes it as faster and more cost-efficient than larger models, with quality that remains competitive despite the lightweight architecture. Actual speed for either model depends on your processor, available memory, operating system, text length, and software backend, so treat published benchmarks as examples rather than guarantees for your machine.
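One way to turn that advice into a number is the real-time factor: synthesis time divided by the duration of the audio produced. The sketch below is a minimal illustration, not part of either project. The `synthesize` callable and `fake_synthesize` are hypothetical stand-ins; in practice you would wrap whichever model you are testing so that it returns a sequence of samples.

```python
import time
from typing import Callable, Sequence


def real_time_factor(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int,
) -> float:
    """Return synthesis time divided by audio duration.

    A value below 1.0 means speech is generated faster than it plays back.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesize() returned no audio")
    return elapsed / audio_seconds


def fake_synthesize(text: str) -> list[float]:
    # Hypothetical stand-in for a real model: one second of silence at 24 kHz.
    return [0.0] * 24000


if __name__ == "__main__":
    rtf = real_time_factor(fake_synthesize, "Hello from my CPU.", 24000)
    print(f"real-time factor: {rtf:.4f}")
```

Run the measurement with text of the length you actually plan to use, and repeat it a few times, because the first call often includes one-off loading costs.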
PocketTTS vs Kokoro for low-end computers
| Feature | PocketTTS | Kokoro-82M |
|---|---|---|
| Model size | 100M parameters | 82M parameters |
| GPU required | No; designed for CPU use | No; compact enough for CPU workflows |
| Languages | 6 languages: English, Spanish, French, German, Portuguese, and Italian | Multiple supported language pipelines |
| Voice cloning | Yes, from an audio prompt | Primarily preset voices in the official pipeline |
| Streaming | Core design goal | Generator yields speech in segments |
| Best fit | Voice cloning, agents, responsive playback in six supported languages | General narration, multiple voices, multilingual work |
What hardware should you expect to need?
There is no universal minimum because CPU generations vary enormously. A recent low-power laptop can outperform an older desktop processor while consuming far less energy. As a practical starting point, aim for a 64-bit operating system, at least 8 GB of RAM, a few gigabytes of free storage, and a recent Python installation if you plan to run the models directly (PocketTTS, for example, supports Python 3.10 through 3.14).
- Use a 64-bit version of Windows, macOS, or Linux.
- Close memory-heavy browsers, games, and creative apps before generating long audio.
- Keep the computer plugged in and disable aggressive battery-saving modes.
- Generate a short paragraph first instead of beginning with an entire book.
- Store models on an SSD when possible; a slow hard drive increases loading time.
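Some of these checks can be scripted before you install anything. The following is a minimal sketch using only the Python standard library. The 3 GB free-space threshold is an assumed reading of "a few gigabytes", and RAM is not checked because the standard library has no portable call for it.

```python
import platform
import shutil
import sys


def preflight(model_dir: str = ".", min_free_gb: float = 3.0) -> dict:
    """Collect basic facts about the machine before installing a model."""
    free_gb = shutil.disk_usage(model_dir).free / 1024**3
    return {
        "is_64bit": sys.maxsize > 2**32,  # True when Python itself is a 64-bit build
        "python": platform.python_version(),
        "machine": platform.machine(),
        "free_gb": round(free_gb, 1),
        "enough_disk": free_gb >= min_free_gb,  # threshold is an assumption
    }


if __name__ == "__main__":
    for key, value in preflight().items():
        print(f"{key}: {value}")
```

Point `model_dir` at the drive where you intend to store models, since free space is reported per volume.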
Option 1: Run PocketTTS on CPU
PocketTTS supports Python 3.10 through 3.14 and requires PyTorch 2.5 or later. Kyutai explicitly states that the GPU build of PyTorch is not required. The quickest command-line test uses the packaged CLI:

    pip install pocket-tts
    pocket-tts generate --text "This speech is being generated locally on my CPU."

The command writes a WAV file locally. You can select one of the provided voices or pass a clean WAV recording as the voice prompt. Because PocketTTS reproduces characteristics of the reference audio, background noise, clipping, room echo, and music in the sample can also affect the generated result.
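To confirm what a generation run produced, a short script can read the file header. This is a minimal sketch using Python's standard `wave` module. The file name `output.wav` in the usage note is a placeholder rather than a documented PocketTTS default, and the code assumes a PCM-encoded file, since the `wave` module rejects floating-point WAV data.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read header fields from a PCM WAV file and compute its duration."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "bits": wav.getsampwidth() * 8,  # sample width is reported in bytes
            "seconds": frames / rate,
        }
```

Calling `describe_wav("output.wav")` returns the channel count, sample rate, bit depth, and duration, which is also a quick way to check a voice prompt recording before you use it.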
PocketTTS is especially useful when you want playback to begin before a long passage is completely synthesized. Its streaming architecture can return an initial audio chunk quickly, while its long-input support reduces the amount of manual text splitting required for articles, assistants, and continuous reading.
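The general pattern behind streaming is to consume audio chunks as they arrive instead of waiting for the whole passage. The sketch below shows that pattern with the standard library only. `fake_stream` is a hypothetical stand-in for a streaming generator and is not the PocketTTS API; the code also assumes 16-bit mono PCM chunks.

```python
import time
import wave
from typing import Iterable, Optional


def save_stream(chunks: Iterable[bytes], path: str, sample_rate: int) -> Optional[float]:
    """Write 16-bit mono PCM chunks to a WAV file as they arrive.

    Returns the seconds elapsed before the first chunk arrived,
    or None if the stream produced nothing.
    """
    start = time.perf_counter()
    first_chunk_delay = None
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        for chunk in chunks:
            if first_chunk_delay is None:
                first_chunk_delay = time.perf_counter() - start
            wav.writeframes(chunk)
    return first_chunk_delay


def fake_stream(n_chunks: int = 3, samples_per_chunk: int = 2400):
    # Hypothetical stand-in for a streaming TTS generator: yields silent chunks.
    for _ in range(n_chunks):
        yield bytes(2 * samples_per_chunk)
```

The returned delay is a rough way to observe time to first audio on your own machine. A player could replace the file writer in the loop without changing the structure.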
Important PocketTTS limitations
- It supports English, Spanish, French, German, Portuguese, and Italian.
- The Hugging Face model requires accepting its access conditions.
- Voice cloning must only use your voice or a voice you have explicit lawful permission to use.
- Adding explicit silence through the text input is not currently supported by the reference project.
- Kyutai reports no GPU speedup in its tested batch-one workflow, so CPU use is not merely a fallback.
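If you need pauses, one possible workaround for the silence limitation is to generate passages separately and join the audio with zero-valued samples afterward. This is an assumption about a workable approach, not something the reference project documents. The sketch assumes every segment is raw 16-bit mono PCM at the same sample rate.

```python
def silence(seconds: float, sample_rate: int) -> bytes:
    """Return 16-bit mono PCM silence of the requested length."""
    return bytes(2 * int(seconds * sample_rate))  # two zero bytes per sample


def join_with_pauses(segments: list[bytes], pause_seconds: float, sample_rate: int) -> bytes:
    """Concatenate PCM segments with a fixed pause between neighbours."""
    return silence(pause_seconds, sample_rate).join(segments)
```

The joined bytes can then be written to a single WAV file with the sample rate the model produced.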
Option 2: Run Kokoro-82M locally
Kokoro has an even smaller parameter count and a straightforward Python package. The official setup also uses espeak-ng for fallback phonemization and some non-English languages. Install the Python packages first:

    pip install "kokoro>=0.9.4" soundfile

A minimal Python example looks like this:

    from kokoro import KPipeline
    import soundfile as sf

    # lang_code "a" selects the American English pipeline.
    pipeline = KPipeline(lang_code="a")
    generator = pipeline(
        "Kokoro is generating this speech locally without a dedicated GPU.",
        voice="af_heart",
    )

    # Each item is (graphemes, phonemes, audio); write every segment as 24 kHz WAV.
    for index, (_, _, audio) in enumerate(generator):
        sf.write(f"speech-{index}.wav", audio, 24000)

The language code must match the chosen voice. Kokoro's documented pipelines include American and British English, Spanish, French, Hindi, Italian, Japanese, Brazilian Portuguese, and Mandarin Chinese. Some languages require additional Misaki packages, so check the official setup instructions before assuming every pipeline is installed by default.
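A small helper can reduce mismatches between voice and language code. The sketch below assumes that Kokoro voice names begin with their language code letter, as in the pairing of "af_heart" with language code "a" above. Both that naming convention and the code table are assumptions to verify against the official voice list.

```python
# Assumed mapping of Kokoro language codes to the documented pipelines.
KOKORO_LANG_CODES = {
    "a": "American English",
    "b": "British English",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "p": "Brazilian Portuguese",
    "z": "Mandarin Chinese",
}


def lang_code_for_voice(voice: str) -> str:
    """Infer the pipeline language code from a voice name such as "af_heart"."""
    code = voice[:1].lower()
    if code not in KOKORO_LANG_CODES:
        raise ValueError(f"Cannot infer a language code from voice {voice!r}")
    return code
```

With this in place, the pipeline could be built from the voice name alone, so the two settings cannot drift apart in a script.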
Kokoro is a strong choice for narration, accessibility tools, article reading, batch generation, and projects that need a permissive deployment license. Its official weights are Apache 2.0 licensed, but you should still check the terms attached to any separate voice assets you use.
How to improve performance on a slow CPU
1. Keep the model loaded
Model startup can take longer than generating a short sentence. If you are processing multiple passages, keep one process running and reuse the loaded model rather than launching Python for every line.
2. Process sensible chunks
Smaller chunks make failures easier to recover from and keep memory use predictable. Split at paragraph or sentence boundaries instead of cutting text at an arbitrary character count. PocketTTS can handle very long inputs, but chunking is still useful when you need separate files or resumable batch jobs.
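A boundary-aware splitter is short to write. The sketch below is one simple approach, with an assumed default limit of 400 characters. It never merges text across paragraphs, and a single sentence longer than the limit is kept whole rather than cut mid-sentence.

```python
import re


def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    """Split text into chunks of up to max_chars, breaking only at
    sentence or paragraph boundaries."""
    chunks: list[str] = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        # Split after sentence-ending punctuation followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        current = ""
        for sentence in sentences:
            if not sentence:
                continue
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

Each returned chunk can be synthesized and saved on its own, so an interrupted batch can resume from the last finished chunk.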
3. Avoid unnecessary sample-rate conversion
Generate and save at the model's native sample rate when possible. Repeated resampling and format conversion add CPU work without improving the original synthesis.
4. Limit parallel jobs
Running several generations simultaneously can make a low-end system slower overall and may push it into swap. Start with one job. Add concurrency only after watching memory and CPU usage during a full generation.
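One way to make the concurrency limit explicit is a worker pool whose size you raise deliberately. This is a minimal sketch using the standard library. The `synthesize` callable is a hypothetical stand-in for your model call, and the default of one worker reflects the advice above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_jobs(
    synthesize: Callable[[str], str],
    passages: list[str],
    max_workers: int = 1,
) -> list[str]:
    """Run synthesis jobs with at most max_workers running at the same time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with passages.
        return list(pool.map(synthesize, passages))
```

Keeping the limit as a parameter means you can test two workers later without restructuring the script, and drop back to one if memory use climbs.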
5. Prefer WAV during generation
WAV is simple to write and avoids real-time compression overhead. Convert completed files to MP3, M4A, or another compressed format afterward if storage size matters.
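Conversion can then run as a separate step once generation has finished. The sketch below assumes ffmpeg is installed and available on your PATH, which is not something either model requires. It builds a command for variable-bitrate MP3 encoding with the LAME encoder.

```python
import subprocess
from pathlib import Path


def mp3_command(wav_path: str, quality: int = 2) -> list[str]:
    """Build an ffmpeg command that encodes a WAV file to an MP3 beside it."""
    mp3_path = Path(wav_path).with_suffix(".mp3")
    return [
        "ffmpeg", "-y",            # overwrite an existing output file
        "-i", wav_path,
        "-codec:a", "libmp3lame",
        "-q:a", str(quality),      # LAME VBR quality: lower means better
        str(mp3_path),
    ]


def convert_to_mp3(wav_path: str) -> None:
    # Assumes ffmpeg is on PATH; raises CalledProcessError if encoding fails.
    subprocess.run(mp3_command(wav_path), check=True)
```

Keep the original WAV files until you have checked the converted audio, because the MP3 step is lossy.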
6. Use clean text
Remove duplicated whitespace, navigation text, page numbers, broken PDF headers, and unusual symbols before synthesis. Cleaner input reduces pronunciation surprises and prevents wasting CPU time on content you did not want spoken.
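Part of this cleanup can be automated. The following is a minimal sketch that handles only two of the problems listed: lines consisting solely of a page number, and repeated whitespace. The page-number pattern is a heuristic assumption and will also drop a legitimate line that contains nothing but a number.

```python
import re


def clean_for_tts(text: str) -> str:
    """Drop page-number lines and collapse repeated whitespace."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip lines that are only a page number, e.g. "12" or "Page 12".
        if re.fullmatch(r"(?i)(page\s+)?\d+", stripped):
            continue
        kept.append(re.sub(r"[ \t]+", " ", stripped))
    # Collapse runs of blank lines into a single paragraph break.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()
```

Navigation text and broken PDF headers vary too much between sources for a generic rule, so review a sample of the cleaned output before starting a long batch.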
Which model should you choose?
- Choose PocketTTS when your content is in English, Spanish, French, German, Portuguese, or Italian and you want voice cloning or responsive streaming.
- Choose Kokoro when you want a very compact general-purpose model, preset voices, multiple language pipelines, or Apache-licensed weights.
- Test both on your own hardware when possible. Parameter count alone does not predict the speed of the complete text-processing and audio-generation pipeline.
The easier route: use OpenVox
Direct Python setup is useful for developers, but it introduces package versions, model downloads, command-line tools, phonemizers, audio dependencies, and scripts that need ongoing maintenance. OpenVox packages local voice workflows into a desktop interface and supports both PocketTTS and Kokoro alongside larger models for cases where your hardware and quality requirements allow them.
With OpenVox, you can download the model once, generate speech locally, manage voices and history, and export audio without sending your text to a cloud TTS provider. PocketTTS gives low-end and CPU-only systems a practical six-language option with cloning, while Kokoro remains a fast everyday choice for broader voice and language workflows.
Download OpenVox
Run PocketTTS and Kokoro locally without managing Python.
OpenVox gives you a simple desktop interface for private local TTS on Mac, iPad, and Windows. Download models once, generate offline, and switch between PocketTTS, Kokoro, and other local voice models from one app.