RapidSpeech.cpp 🎙️

RapidSpeech.cpp is a high-performance, edge-native speech intelligence framework built on top of ggml. It aims to provide pure C++, zero-dependency, and on-device inference for large-scale ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) models.

🌟 Key Differentiators

While the open-source ecosystem already offers powerful cloud-side frameworks such as vLLM-omni, as well as mature on-device solutions like sherpa-onnx, RapidSpeech.cpp introduces a new generation of design choices focused on edge deployment.

1. vs. vLLM: Edge-first, not cloud-throughput-first

vLLM
- Designed for data centers and cloud environments
- Strongly coupled with Python and CUDA
- Maximizes GPU throughput via techniques such as PageAttention
RapidSpeech.cpp
- Designed specifically for edge and on-device inference
- Optimized for low latency, low memory footprint, and lightweight deployment
- Runs on embedded devices, mobile platforms, laptops, and even NPU-only systems
- No Python runtime required

2. vs. sherpa-onnx: Deeper control over the inference stack

Aspect	sherpa-onnx (ONNX Runtime)	RapidSpeech.cpp (ggml)
Memory Management	Managed internally by ORT, relatively opaque	Zero runtime allocation — memory is fully planned during graph construction to avoid edge-side OOM
Quantization	Primarily INT8, limited support for ultra-low bit-width	Full K-Quants family (Q4_K / Q5_K / Q6_K), significantly reducing bandwidth and memory usage while preserving accuracy
GPU Performance	Relies on execution providers with operator mapping overhead	Native backends (`ggml-cuda`, `ggml-metal`) with speech-specific optimizations, outperforming generic `onnxruntime-gpu`
Deployment	Requires shared libraries and external config files	Single binary deployment — model weights and configs are fully encapsulated in GGUF

📦 Model Support

Automatic Speech Recognition (ASR)

SenseVoice-small
FunASR-nano
Qwen3-ASR
FireRedASR2

Text-to-Speech (TTS)

OpenVoice2 (MeloTTS + voice cloning)
OmniVoice (single-stage non-autoregressive diffusion TTS, multilingual + voice cloning)
CosyVoice3
Qwen3-TTS

🏗️ Architecture Overview

RapidSpeech.cpp is not just an inference wrapper — it is a full-featured speech application framework:

Core Engine A ggml-based computation backend supporting mixed-precision inference from INT4 to FP32.
Architecture Layer A plugin-style model construction and loading system, with support for FunASR-nano, SenseVoice, and planned support for CosyVoice, Qwen3-TTS, and more.
Business Logic Layer Built-in ring buffers, VAD (voice activity detection), text frontend processing (e.g., phonemization), and multi-session management.

🚀 Core Features

Extreme Quantization: Native support for 4-bit, 5-bit, and 6-bit quantization schemes to match diverse hardware constraints.
Zero Dependencies: Implemented entirely in C/C++, producing a single lightweight binary.
GPU / NPU Acceleration: Customized CUDA and Metal backends optimized for speech models.
Unified Model Format: Both ASR and TTS models use an extended GGUF format.
Python Bindings: Python API via pybind11, installable with pip install.

🛠️ Quick Start

Download Models

Models are available on:

🤗 Hugging Face: https://huggingface.co/RapidAI/RapidSpeech
ModelScope: https://www.modelscope.cn/models/RapidAI/RapidSpeech

Build from Source

git clone https://github.com/RapidAI/RapidSpeech.cpp
cd RapidSpeech.cpp
git submodule sync && git submodule update --init --recursive
cmake -B build
cmake --build build --config Release

Build artifacts are located in the build/ directory:

rs-asr-offline — Offline ASR command-line tool
rs-asr-online — Online (streaming) ASR command-line tool
rs-tts-offline — Offline TTS command-line tool
rs-quantize — Model quantization tool

C++ CLI Usage

Offline Recognition (rs-asr-offline)

Basic — single file without VAD:

./build/rs-asr-offline \
  -m /path/to/funasr-nano-fp16.gguf \
  -w /path/to/audio.wav \
  -t 4 \
  --gpu true

With VAD segmentation (recommended for long audio):

./build/rs-asr-offline \
  -m /path/to/funasr-nano-fp16.gguf \
  -v /path/to/silero_vad_v6.gguf \
  -w /path/to/audio.wav \
  -t 4 \
  --vad-threshold 0.5 \
  --silence-ms 600

When a VAD model is provided, the tool automatically segments the audio by speech activity and produces timestamped results per segment.

Parameters:

Flag	Description	Default
`-m, --model`	Path to GGUF model file (required)	—
`-w, --wav`	Path to WAV audio file (16 kHz, required)	—
`-v, --vad`	Path to VAD GGUF model — Silero or FireRed, auto-detected from `general.architecture` (optional, enables VAD segmentation)	—
`-t, --threads`	Number of CPU threads	4
`--gpu`	Enable GPU acceleration (`true`/`false`)	true
`--vad-threshold`	VAD speech probability threshold (0–1, lower = more sensitive)	0.5
`--silence-ms`	Silence duration to split segments (ms)	600
`--max-segment-s`	Max segment length for ASR input (seconds)	30.0

Online / Streaming Recognition (rs-asr-online)

WAV file (simulate streaming):

./build/rs-asr-online \
  -m /path/to/funasr-nano-fp16.gguf \
  -v /path/to/silero_vad_v6.gguf \
  -w /path/to/audio.wav \
  -t 4 \
  --vad-threshold 0.5 \
  --silence-ms 600

Microphone (live mode):

./build/rs-asr-online \
  -m /path/to/funasr-nano-fp16.gguf \
  -v /path/to/silero_vad_v6.gguf \
  --mic \
  -t 4

Two-pass mode (CTC fast pass + LLM rescoring, FunASR-Nano only):

./build/rs-asr-online \
  -m /path/to/funasr-nano-fp16.gguf \
  -v /path/to/silero_vad_v6.gguf \
  -w /path/to/audio.wav \
  --two-pass

Parameters:

Flag	Description	Default
`-m, --model`	Path to ASR GGUF model file (required)	—
`-v, --vad`	Path to Silero VAD model file (required)	—
`-w, --wav`	Path to WAV audio file (16 kHz)	—
`--mic`	Use microphone input (live mode)	off
`--mic-device`	Audio device index for mic input	auto
`--mic-chunk-ms`	Mic read chunk size (ms)	32
`-t, --threads`	Number of CPU threads	4
`--gpu`	Enable GPU acceleration (`true`/`false`)	true
`--vad-threshold`	VAD speech detection threshold (0–1, lower = more sensitive)	0.5
`--silence-ms`	Silence timeout for segment splitting (ms)	600
`--two-pass`	Enable 2-pass mode: CTC decode + LLM rescore	off
`--ctc-precheck`	CTC pre-check before LLM to skip silence (reduces hallucination, slightly increases RTF)	off

Text-to-Speech (rs-tts-offline)

MeloTTS / OpenVoice2

OpenVoice2 builds on MeloTTS as the base acoustic model (VITS-style: text encoder + duration predictor + stochastic flow decoder + HiFi-GAN vocoder). MeloTTS ships one checkpoint per language; the --lang flag must match the language of the GGUF you converted.

English (MeloTTS-English):

./build/rs-tts-offline \
  -m /path/to/openvoice2-base-en.gguf \
  -t "Hello, welcome to RapidSpeech!" \
  --lang English \
  -o output.wav \
  --threads 4

Chinese (MeloTTS-Chinese):

./build/rs-tts-offline \
  -m /path/to/openvoice2-base-zh.gguf \
  -t "你好，欢迎使用 RapidSpeech 语音合成。" \
  --lang Chinese \
  -o output.wav

Japanese (MeloTTS-Japanese):

./build/rs-tts-offline \
  -m /path/to/openvoice2-base-jp.gguf \
  -t "こんにちは、RapidSpeech へようこそ。" \
  --lang Japanese \
  -o output.wav

Accepted --lang values: English/EN/en, Chinese/ZH/zh, Japanese/JA/ja. The language string is case-insensitive but must match the model's language — feeding Chinese text to an English model will produce garbled audio.

Voice cloning (OpenVoice2 = MeloTTS base + Tone Color Converter):

OpenVoice2 separates speaker timbre from prosody. Pass a reference WAV with --ref to apply the speaker's voice to the synthesized speech. Requires the converter GGUF in the same directory as the base GGUF (the loader auto-discovers it).

./build/rs-tts-offline \
  -m /path/to/openvoice2-base-en.gguf \
  -t "Hello, this is cloned voice." \
  --lang English \
  --ref /path/to/reference.wav \
  -o output.wav

OmniVoice (diffusion TTS, multilingual + voice cloning)

./build/rs-tts-offline \
  -m /path/to/omnivoice-f16.gguf \
  -t "Hello, welcome to RapidSpeech!" \
  --instruct "male, young adult, moderate pitch" \
  --lang English \
  --n-steps 32 \
  -o output.wav

Voice cloning (OmniVoice):

./build/rs-tts-offline \
  -m /path/to/omnivoice-f16.gguf \
  -t "Hello, this is cloned voice." \
  --ref /path/to/reference.wav \
  --ref-text "transcript of the reference audio" \
  -o output.wav

Parameters:

Flag	Description	Default
`-m, --model`	Path to TTS GGUF model file (required)	—
`-t, --text`	Text to synthesize (required)	—
`-o, --output`	Output WAV file path	output.wav
`--lang`	Target language. MeloTTS: `English`/`Chinese`/`Japanese` (must match GGUF). OmniVoice: `English`/`zh`/...	English
`--ref`	Reference audio WAV for voice cloning (OpenVoice2 / OmniVoice)	—
`--ref-text`	Transcript of the reference audio (OmniVoice only)	—
`--bert`	ZH BERT GGUF (1024-dim, OpenVoice2 Chinese only, optional)	—
`--mbert`	Multilingual BERT GGUF (768-dim, optional)	—
`--instruct`	Voice description, e.g. `male`, `female`, `young adult` (OmniVoice)	male
`--seed`	Random seed (OmniVoice)	42
`--n-steps`	Diffusion steps 1-128, fewer = faster but lower quality (OmniVoice)	32
`--threads`	Number of CPU threads	4
`--gpu`	Enable GPU acceleration (`true`/`false`)	true

Model Quantization (rs-quantize)

./build/rs-quantize /path/to/funasr-nano-fp16.gguf /path/to/output-q4_k.gguf q4_k

Supported quantization types: q4_0, q4_k, q5_0, q5_k, q8_0, f16, f32

⚠️ Note: Q2_K quantization causes unacceptable accuracy loss for FunASR Nano, producing garbled output. Not recommended.

Python Usage

Installation

# Install from PyPI (CPU version)
pip install rapidspeech

# CUDA version
pip install rapidspeech-cuda

# macOS Metal version
pip install rapidspeech-metal

Build Python Package from Source

pip install .
# Or specify backend
RS_BACKEND=cuda pip install .

Python API

import rapidspeech
import numpy as np

# Initialize ASR context
ctx = rapidspeech.asr_offline(
    model_path="funasr-nano-fp16.gguf",
    n_threads=4,
    use_gpu=True
)

# Read WAV audio (16 kHz, float32, mono)
pcm = ...  # np.ndarray, shape=[N], dtype=float32

# Push audio and recognize
ctx.push_audio(pcm)
ctx.process()

# Get recognition result
text = ctx.get_text()
print(f"Result: {text}")

See python-api-examples/asr/asr-offline.py for a complete example.

TTS Python API:

import rapidspeech
import numpy as np

# Initialize TTS synthesizer
tts = rapidspeech.tts_synthesizer(
    model_path="openvoice2-base.gguf",
    n_threads=4,
    use_gpu=True
)

# Synthesize text to audio (returns full PCM as numpy array)
pcm = tts.synthesize("Hello, welcome to RapidSpeech!")

# Streaming synthesis (returns list of numpy array chunks)
chunks = tts.synthesize_streaming("Hello, welcome to RapidSpeech!")
for chunk in chunks:
    print(f"Chunk: {len(chunk)} samples")

# Optional: set reference audio for voice cloning
# reference_pcm = ...  # load reference audio
# tts.set_reference(reference_pcm, sample_rate=16000)

🧪 Examples & Bindings

End-to-end examples for every language binding live in their own folders, each with a dedicated README that walks through installation, CLI flags, and the underlying API surface.

Folder	What it covers	README
🐍 Python	`pip install rapidspeech` → offline / online ASR (with neural VAD, 2-pass LLM rescoring), offline / streaming TTS, voice cloning	`python-api-examples/README.md`
🌐 Browser (WebAssembly)	Three-tab demo: offline ASR, mic-driven online ASR, offline TTS. Runs locally with WebGPU + pthreads	`wasm-examples/README.md`
🟩 Node.js	CLI built on the same WASM module: file → ASR (with optional VAD + 2-pass), text → TTS (with voice cloning)	`node-api-example/README.md`
💻 C++ CLI	`rs-asr-offline` / `rs-asr-online` / `rs-tts-offline` / `rs-quantize`	this README (sections above)
☁️ Colab notebook	Build the CLI on a free T4, run ASR/TTS, use the Python API end-to-end	`colab/README.md`
🤗 HuggingFace Space	Deploy the browser demo as a Docker-SDK Space (COOP/COEP-ready)	`huggingface-space/HOWTO.md`

Quick taste of each:

# Python — VAD-segmented 2-pass transcription
python python-api-examples/asr/asr-offline.py \
    --model funasr-nano.gguf --audio long.wav \
    --vad silero-vad.gguf --two-pass

# Browser — three tabs in one page
cd wasm-examples && python3 serve.py 8000  # then open http://localhost:8000

# Node.js — same WASM module, file-based ASR/TTS
node node-api-example/index.js asr -m funasr-nano.gguf -w audio.wav --two-pass
node node-api-example/index.js tts -m omnivoice.gguf -t "Hello world" -o out.wav

📊 Performance Benchmarks

Test environment: Apple M1 Pro, funasr-nano-fp16.gguf, 15s audio

Configuration	RTF	Wall Time	Notes
CPU -t 4	0.465	12.4s	CPU-only inference
GPU -t 4	0.170	5.2s	Metal acceleration
GPU -t 4 Q4_K	0.756	—	Quantized model: GPU dequant overhead
CPU -t 4 Q4_K	0.530	—	Quantized model CPU inference, 596 MB (3.3× compression)

RTF (Real-Time Factor) = Processing time / Audio duration. Lower is faster. RTF < 1 means faster than real-time.

🔧 Model Format Conversion

ASR Model (HF → GGUF)

A conversion tool from HuggingFace models to GGUF format is provided:

python scripts/convert_hf_to_gguf.py \
  --model /path/to/hf-model-dir \
  --outfile /path/to/output.gguf \
  --outtype f16

Silero VAD Model (safetensors → GGUF)

To convert the Silero VAD model for use with rs-asr-online or offline VAD segmentation:

python scripts/convert_silero_to_gguf.py \
  --model /path/to/silero_vad_16k.safetensors \
  --output /path/to/silero_vad_v6.gguf

The converted VAD model is also available for direct download from HuggingFace and ModelScope.

TTS Model (OpenVoice2 / MeloTTS → GGUF)

Convert MeloTTS (OpenVoice2 base) and the optional Tone Color Converter to GGUF. MeloTTS releases one HuggingFace repo per language; choose the matching --base-model and --language tag.

# English
python scripts/convert_openvoice2.py \
  --base-model myshell-ai/MeloTTS-English \
  --output-dir ./models \
  --language EN

# Chinese
python scripts/convert_openvoice2.py \
  --base-model myshell-ai/MeloTTS-Chinese \
  --output-dir ./models \
  --language ZH

# Japanese
python scripts/convert_openvoice2.py \
  --base-model myshell-ai/MeloTTS-Japanese \
  --output-dir ./models \
  --language JA

# With Tone Color Converter (enables voice cloning via --ref)
python scripts/convert_openvoice2.py \
  --base-model myshell-ai/MeloTTS-English \
  --converter-model myshell-ai/OpenVoiceV2 \
  --output-dir ./models \
  --language EN

Outputs:

openvoice2-base-<lang>.gguf — Text encoder + duration predictor + flow decoder + HiFi-GAN vocoder
openvoice2-converter.gguf — Tone color converter (only when --converter-model is supplied; needed for --ref voice cloning)

TTS Model (OmniVoice → GGUF)

Merge OmniVoice PyTorch model (LLM + audio tokenizer) into a single GGUF:

python scripts/convert_omnivoice_to_gguf.py \
  --model /path/to/omnivoice-model \
  --tokenizer /path/to/omnivoice-audio-tokenizer \
  --output /path/to/omnivoice-merged.gguf \
  --outtype f16

🤝 Contributing

If you are interested in the following areas, we welcome your PRs or participation in discussions:

Adapting more models to the framework.
Refining and optimizing the project architecture.
Improving inference performance.

Name		Name	Last commit message	Last commit date
Latest commit History 135 Commits
.github/workflows		.github/workflows
assets		assets
cmake		cmake
examples		examples
ggml @ 57ea0bc		ggml @ 57ea0bc
include		include
node-api-example		node-api-example
python-api-examples		python-api-examples
rapidspeech		rapidspeech
scripts		scripts
test		test
wasm-examples		wasm-examples
.gitignore		.gitignore
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
README-CN.md		README-CN.md
README.md		README.md
pyproject.toml		pyproject.toml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RapidSpeech.cpp 🎙️

🌟 Key Differentiators

1. vs. vLLM: Edge-first, not cloud-throughput-first

2. vs. sherpa-onnx: Deeper control over the inference stack

📦 Model Support

🏗️ Architecture Overview

🚀 Core Features

🛠️ Quick Start

Download Models

Build from Source

C++ CLI Usage

Offline Recognition (rs-asr-offline)

Online / Streaming Recognition (rs-asr-online)

Text-to-Speech (rs-tts-offline)

MeloTTS / OpenVoice2

OmniVoice (diffusion TTS, multilingual + voice cloning)

Model Quantization (rs-quantize)

Python Usage

Installation

Build Python Package from Source

Python API

🧪 Examples & Bindings

📊 Performance Benchmarks

🔧 Model Format Conversion

ASR Model (HF → GGUF)

Silero VAD Model (safetensors → GGUF)

TTS Model (OpenVoice2 / MeloTTS → GGUF)

TTS Model (OmniVoice → GGUF)

🤝 Contributing

Acknowledgements

About

Uh oh!

Releases 2

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

RapidSpeech.cpp 🎙️

🌟 Key Differentiators

1. vs. vLLM: Edge-first, not cloud-throughput-first

2. vs. sherpa-onnx: Deeper control over the inference stack

📦 Model Support

🏗️ Architecture Overview

🚀 Core Features

🛠️ Quick Start

Download Models

Build from Source

C++ CLI Usage

Offline Recognition (rs-asr-offline)

Online / Streaming Recognition (rs-asr-online)

Text-to-Speech (rs-tts-offline)

MeloTTS / OpenVoice2

OmniVoice (diffusion TTS, multilingual + voice cloning)

Model Quantization (rs-quantize)

Python Usage

Installation

Build Python Package from Source

Python API

🧪 Examples & Bindings

📊 Performance Benchmarks

🔧 Model Format Conversion

ASR Model (HF → GGUF)

Silero VAD Model (safetensors → GGUF)

TTS Model (OpenVoice2 / MeloTTS → GGUF)

TTS Model (OmniVoice → GGUF)

🤝 Contributing

Acknowledgements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 2

Contributors

Uh oh!

Languages