RapidSpeech.cpp is a high-performance, edge-native speech intelligence framework built on top of ggml. It aims to provide pure C++, zero-dependency, and on-device inference for large-scale ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) models.
While the open-source ecosystem already offers powerful cloud-side frameworks such as vLLM-omni, as well as mature on-device solutions like sherpa-onnx, RapidSpeech.cpp introduces a new generation of design choices focused on edge deployment.
- vLLM
  - Designed for data centers and cloud environments
  - Strongly coupled with Python and CUDA
  - Maximizes GPU throughput via techniques such as PagedAttention
- RapidSpeech.cpp
  - Designed specifically for edge and on-device inference
  - Optimized for low latency, low memory footprint, and lightweight deployment
  - Runs on embedded devices, mobile platforms, laptops, and even NPU-only systems
  - No Python runtime required
| Aspect | sherpa-onnx (ONNX Runtime) | RapidSpeech.cpp (ggml) |
|---|---|---|
| Memory Management | Managed internally by ORT, relatively opaque | Zero runtime allocation — memory is fully planned during graph construction to avoid edge-side OOM |
| Quantization | Primarily INT8, limited support for ultra-low bit-width | Full K-Quants family (Q4_K / Q5_K / Q6_K), significantly reducing bandwidth and memory usage while preserving accuracy |
| GPU Performance | Relies on execution providers with operator mapping overhead | Native backends (ggml-cuda, ggml-metal) with speech-specific optimizations, outperforming generic onnxruntime-gpu |
| Deployment | Requires shared libraries and external config files | Single binary deployment — model weights and configs are fully encapsulated in GGUF |
Automatic Speech Recognition (ASR)
- SenseVoice-small
- FunASR-nano
- Qwen3-ASR
- FireRedASR2
Text-to-Speech (TTS)
- OpenVoice2 (MeloTTS + voice cloning)
- OmniVoice (single-stage non-autoregressive diffusion TTS, multilingual + voice cloning)
- CosyVoice3
- Qwen3-TTS
RapidSpeech.cpp is not just an inference wrapper — it is a full-featured speech application framework:
- Core Engine: A ggml-based computation backend supporting mixed-precision inference from INT4 to FP32.
- Architecture Layer: A plugin-style model construction and loading system, with support for FunASR-nano and SenseVoice, and planned support for CosyVoice, Qwen3-TTS, and more.
- Business Logic Layer: Built-in ring buffers, VAD (voice activity detection), text frontend processing (e.g., phonemization), and multi-session management.
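To illustrate the ring buffers mentioned in the business logic layer, here is a minimal Python sketch of a fixed-capacity audio FIFO that overwrites the oldest samples when full. The class name `RingBuffer` and its methods are hypothetical and are not part of the RapidSpeech.cpp API; the framework's own buffers are implemented in C++ and may behave differently.

```python
class RingBuffer:
    """Fixed-capacity FIFO for audio samples (hypothetical illustration).

    When the buffer is full, pushing a new sample overwrites the oldest one.
    """

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._buf = [0.0] * capacity
        self._capacity = capacity
        self._start = 0  # index of the oldest sample
        self._size = 0   # number of valid samples

    def __len__(self) -> int:
        return self._size

    def push(self, samples) -> None:
        """Append samples, dropping the oldest data if capacity is exceeded."""
        for s in samples:
            end = (self._start + self._size) % self._capacity
            self._buf[end] = s
            if self._size < self._capacity:
                self._size += 1
            else:
                # Buffer was full: the slot just written held the oldest sample.
                self._start = (self._start + 1) % self._capacity

    def pop(self, n: int) -> list:
        """Remove and return up to n of the oldest samples."""
        n = min(n, self._size)
        out = [self._buf[(self._start + i) % self._capacity] for i in range(n)]
        self._start = (self._start + n) % self._capacity
        self._size -= n
        return out
```

A streaming front end could push microphone chunks into such a buffer and pop fixed-size frames for VAD or feature extraction, which keeps memory use bounded regardless of how long the session runs.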
- Extreme Quantization: Native support for 4-bit, 5-bit, and 6-bit quantization schemes to match diverse hardware constraints.
- Zero Dependencies: Implemented entirely in C/C++, producing a single lightweight binary.
- GPU / NPU Acceleration: Customized CUDA and Metal backends optimized for speech models.
- Unified Model Format: Both ASR and TTS models use an extended GGUF format.
- Python Bindings: Python API via pybind11, installable with pip install.
Models are available on:
- 🤗 Hugging Face: https://huggingface.co/RapidAI/RapidSpeech
- ModelScope: https://www.modelscope.cn/models/RapidAI/RapidSpeech
git clone https://github.com/RapidAI/RapidSpeech.cpp
cd RapidSpeech.cpp
git submodule sync && git submodule update --init --recursive
cmake -B build
cmake --build build --config Release

Build artifacts are located in the build/ directory:
- rs-asr-offline: Offline ASR command-line tool
- rs-asr-online: Online (streaming) ASR command-line tool
- rs-tts-offline: Offline TTS command-line tool
- rs-quantize: Model quantization tool
Basic — single file without VAD:
./build/rs-asr-offline \
-m /path/to/funasr-nano-fp16.gguf \
-w /path/to/audio.wav \
-t 4 \
--gpu true

With VAD segmentation (recommended for long audio):
./build/rs-asr-offline \
-m /path/to/funasr-nano-fp16.gguf \
-v /path/to/silero_vad_v6.gguf \
-w /path/to/audio.wav \
-t 4 \
--vad-threshold 0.5 \
--silence-ms 600

When a VAD model is provided, the tool automatically segments the audio by speech activity and produces timestamped results per segment.
Parameters:
| Flag | Description | Default |
|---|---|---|
| -m, --model | Path to GGUF model file (required) | - |
| -w, --wav | Path to WAV audio file (16 kHz, required) | - |
| -v, --vad | Path to VAD GGUF model: Silero or FireRed, auto-detected from general.architecture (optional, enables VAD segmentation) | - |
| -t, --threads | Number of CPU threads | 4 |
| --gpu | Enable GPU acceleration (true/false) | true |
| --vad-threshold | VAD speech probability threshold (0–1, lower = more sensitive) | 0.5 |
| --silence-ms | Silence duration to split segments (ms) | 600 |
| --max-segment-s | Max segment length for ASR input (seconds) | 30.0 |
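The interaction between --vad-threshold, --silence-ms, and --max-segment-s can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the tool's actual algorithm: it assumes the VAD emits one speech probability per fixed-length frame, and the function name `segment_speech` and the `frame_ms` parameter are hypothetical.

```python
def segment_speech(probs, frame_ms=32, threshold=0.5,
                   silence_ms=600, max_segment_s=30.0):
    """Return (start_s, end_s) speech segments from per-frame VAD probabilities.

    Hypothetical illustration: a segment opens on the first frame at or above
    the threshold, closes after `silence_ms` of consecutive non-speech, and is
    force-split once it reaches `max_segment_s`.
    """
    segments = []
    start = None         # frame index where the current segment began
    last_speech = None   # index of the most recent speech frame
    max_frames = int(max_segment_s * 1000 / frame_ms)
    silence_frames = silence_ms / frame_ms

    def close(end_frame):
        segments.append((start * frame_ms / 1000, end_frame * frame_ms / 1000))

    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            last_speech = i
        if start is None:
            continue  # still in leading silence
        if i - last_speech >= silence_frames:
            close(last_speech + 1)  # end right after the last speech frame
            start = None
        elif i + 1 - start >= max_frames:
            close(i + 1)            # force-split an overly long segment
            start = None
    if start is not None:
        close(last_speech + 1)
    return segments
```

Under this sketch, lowering the threshold admits quieter frames as speech, and a shorter silence timeout yields more, shorter segments.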
WAV file (simulate streaming):
./build/rs-asr-online \
-m /path/to/funasr-nano-fp16.gguf \
-v /path/to/silero_vad_v6.gguf \
-w /path/to/audio.wav \
-t 4 \
--vad-threshold 0.5 \
--silence-ms 600

Microphone (live mode):
./build/rs-asr-online \
-m /path/to/funasr-nano-fp16.gguf \
-v /path/to/silero_vad_v6.gguf \
--mic \
-t 4

Two-pass mode (CTC fast pass + LLM rescoring, FunASR-Nano only):
./build/rs-asr-online \
-m /path/to/funasr-nano-fp16.gguf \
-v /path/to/silero_vad_v6.gguf \
-w /path/to/audio.wav \
--two-pass

Parameters:
| Flag | Description | Default |
|---|---|---|
| -m, --model | Path to ASR GGUF model file (required) | - |
| -v, --vad | Path to Silero VAD model file (required) | - |
| -w, --wav | Path to WAV audio file (16 kHz) | - |
| --mic | Use microphone input (live mode) | off |
| --mic-device | Audio device index for mic input | auto |
| --mic-chunk-ms | Mic read chunk size (ms) | 32 |
| -t, --threads | Number of CPU threads | 4 |
| --gpu | Enable GPU acceleration (true/false) | true |
| --vad-threshold | VAD speech detection threshold (0–1, lower = more sensitive) | 0.5 |
| --silence-ms | Silence timeout for segment splitting (ms) | 600 |
| --two-pass | Enable 2-pass mode: CTC decode + LLM rescore | off |
| --ctc-precheck | CTC pre-check before LLM to skip silence (reduces hallucination, slightly increases RTF) | off |
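As a rough illustration of what simulated streaming over a WAV file involves, the sketch below splits a sample buffer into fixed-duration chunks. At 16 kHz, the default 32 ms chunk corresponds to 512 samples. The helper name `iter_chunks` is hypothetical; how rs-asr-online actually feeds audio internally is not documented here.

```python
def iter_chunks(samples, sample_rate=16000, chunk_ms=32):
    """Yield consecutive chunks of `chunk_ms` milliseconds (hypothetical helper).

    The final chunk may be shorter than the others.
    """
    chunk_len = sample_rate * chunk_ms // 1000
    if chunk_len <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for offset in range(0, len(samples), chunk_len):
        yield samples[offset:offset + chunk_len]


# 1200 samples at 16 kHz -> two full 512-sample chunks plus a 176-sample tail.
for chunk in iter_chunks([0.0] * 1200):
    print(len(chunk))
```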
OpenVoice2 builds on MeloTTS as the base acoustic model (VITS-style: text encoder + duration predictor + stochastic flow decoder + HiFi-GAN vocoder). MeloTTS ships one checkpoint per language; the --lang flag must match the language of the GGUF you converted.
English (MeloTTS-English):
./build/rs-tts-offline \
-m /path/to/openvoice2-base-en.gguf \
-t "Hello, welcome to RapidSpeech!" \
--lang English \
-o output.wav \
--threads 4

Chinese (MeloTTS-Chinese):
./build/rs-tts-offline \
-m /path/to/openvoice2-base-zh.gguf \
-t "你好,欢迎使用 RapidSpeech 语音合成。" \
--lang Chinese \
-o output.wav

Japanese (MeloTTS-Japanese):
./build/rs-tts-offline \
-m /path/to/openvoice2-base-jp.gguf \
-t "こんにちは、RapidSpeech へようこそ。" \
--lang Japanese \
-o output.wav

Accepted --lang values: English/EN/en, Chinese/ZH/zh, Japanese/JA/ja. The language string is case-insensitive but must match the model's language: feeding Chinese text to an English model will produce garbled audio.
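The case-insensitive alias matching described above could look like the following when scripting around the CLI. This is a sketch only: `normalize_lang` and the alias table are hypothetical, and the canonical tags EN/ZH/JA are chosen here simply because they match the --language tags used by the conversion script.

```python
# Hypothetical alias table built from the accepted --lang values listed above.
_LANG_ALIASES = {
    "english": "EN", "en": "EN",
    "chinese": "ZH", "zh": "ZH",
    "japanese": "JA", "ja": "JA",
}


def normalize_lang(value: str) -> str:
    """Map a user-supplied language string to a canonical tag, ignoring case."""
    key = value.strip().lower()
    if key not in _LANG_ALIASES:
        raise ValueError(f"unsupported language: {value!r}")
    return _LANG_ALIASES[key]
```

A wrapper script could call this before launching rs-tts-offline and compare the result with the language of the GGUF it is about to load, catching a mismatch before any audio is produced.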
Voice cloning (OpenVoice2 = MeloTTS base + Tone Color Converter):
OpenVoice2 separates speaker timbre from prosody. Pass a reference WAV with --ref to apply the speaker's voice to the synthesized speech. This requires the converter GGUF to be in the same directory as the base GGUF (the loader auto-discovers it).
./build/rs-tts-offline \
-m /path/to/openvoice2-base-en.gguf \
-t "Hello, this is cloned voice." \
--lang English \
--ref /path/to/reference.wav \
-o output.wav

OmniVoice (voice description via --instruct):
./build/rs-tts-offline \
-m /path/to/omnivoice-f16.gguf \
-t "Hello, welcome to RapidSpeech!" \
--instruct "male, young adult, moderate pitch" \
--lang English \
--n-steps 32 \
-o output.wav

Voice cloning (OmniVoice):
./build/rs-tts-offline \
-m /path/to/omnivoice-f16.gguf \
-t "Hello, this is cloned voice." \
--ref /path/to/reference.wav \
--ref-text "transcript of the reference audio" \
-o output.wav

Parameters:
| Flag | Description | Default |
|---|---|---|
| -m, --model | Path to TTS GGUF model file (required) | - |
| -t, --text | Text to synthesize (required) | - |
| -o, --output | Output WAV file path | output.wav |
| --lang | Target language. MeloTTS: English/Chinese/Japanese (must match GGUF). OmniVoice: English/zh/... | English |
| --ref | Reference audio WAV for voice cloning (OpenVoice2 / OmniVoice) | - |
| --ref-text | Transcript of the reference audio (OmniVoice only) | - |
| --bert | ZH BERT GGUF (1024-dim, OpenVoice2 Chinese only, optional) | - |
| --mbert | Multilingual BERT GGUF (768-dim, optional) | - |
| --instruct | Voice description, e.g. male, female, young adult (OmniVoice) | male |
| --seed | Random seed (OmniVoice) | 42 |
| --n-steps | Diffusion steps 1-128, fewer = faster but lower quality (OmniVoice) | 32 |
| --threads | Number of CPU threads | 4 |
| --gpu | Enable GPU acceleration (true/false) | true |
./build/rs-quantize /path/to/funasr-nano-fp16.gguf /path/to/output-q4_k.gguf q4_k

Supported quantization types: q4_0, q4_k, q5_0, q5_k, q8_0, f16, f32
⚠️ Note: Q2_K quantization causes unacceptable accuracy loss for FunASR Nano, producing garbled output. Not recommended.
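For a rough sense of how much each quantization type shrinks a model, the sketch below estimates size from bits per weight. The bits-per-weight figures follow ggml's block layouts (for example, a Q4_K block stores 256 weights in 144 bytes, or 4.5 bits each) and should be treated as approximations; the helper names are hypothetical and are not part of rs-quantize.

```python
# Approximate storage cost per weight, in bits, including per-block scales.
BITS_PER_WEIGHT = {
    "f32": 32.0,
    "f16": 16.0,
    "q8_0": 8.5,
    "q5_k": 5.5,
    "q5_0": 5.5,
    "q4_k": 4.5,
    "q4_0": 4.5,
}


def estimate_size_mb(n_params: int, qtype: str) -> float:
    """Estimate tensor storage in MB if every weight used `qtype`."""
    return n_params * BITS_PER_WEIGHT[qtype] / 8 / 1e6


def compression_ratio(src: str, dst: str) -> float:
    """Upper bound on the size reduction when converting src -> dst."""
    return BITS_PER_WEIGHT[src] / BITS_PER_WEIGHT[dst]


print(f"f16 -> q4_k: up to {compression_ratio('f16', 'q4_k'):.2f}x smaller")
```

Real files usually compress somewhat less than this upper bound, since quantizers commonly keep some tensors at higher precision and the file also carries metadata.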
# Install from PyPI (CPU version)
pip install rapidspeech
# CUDA version
pip install rapidspeech-cuda
# macOS Metal version
pip install rapidspeech-metal

Install from source:
pip install .
# Or specify backend
RS_BACKEND=cuda pip install .

ASR Python API:
import rapidspeech
import numpy as np
# Initialize ASR context
ctx = rapidspeech.asr_offline(
    model_path="funasr-nano-fp16.gguf",
    n_threads=4,
    use_gpu=True,
)
# Read WAV audio (16 kHz, float32, mono)
pcm = ... # np.ndarray, shape=[N], dtype=float32
# Push audio and recognize
ctx.push_audio(pcm)
ctx.process()
# Get recognition result
text = ctx.get_text()
print(f"Result: {text}")

See python-api-examples/asr/asr-offline.py for a complete example.
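The ASR snippet leaves `pcm = ...` unspecified. One way to fill it in, using only the standard-library `wave` module and NumPy, is sketched below. The helper `load_wav_mono_f32` is hypothetical, and the scaling to the range [-1, 1) is an assumption about what `push_audio` expects; check the bundled example script for the authoritative loading code.

```python
import wave

import numpy as np


def load_wav_mono_f32(path, expected_rate=16000):
    """Load a 16-bit PCM WAV file as a mono float32 array (hypothetical helper).

    Samples are scaled to [-1, 1); multi-channel audio is averaged to mono.
    """
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        if wf.getframerate() != expected_rate:
            raise ValueError(
                f"expected {expected_rate} Hz, got {wf.getframerate()} Hz"
            )
        n_channels = wf.getnchannels()
        raw = wf.readframes(wf.getnframes())

    # Interpret the bytes as little-endian int16 and scale to float32.
    pcm = np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0
    if n_channels > 1:
        # Frames are interleaved: reshape to (frames, channels) and average.
        pcm = pcm.reshape(-1, n_channels).mean(axis=1).astype(np.float32)
    return pcm
```

With this helper, the placeholder line could become `pcm = load_wav_mono_f32("audio.wav")`.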
TTS Python API:
import rapidspeech
import numpy as np
# Initialize TTS synthesizer
tts = rapidspeech.tts_synthesizer(
    model_path="openvoice2-base.gguf",
    n_threads=4,
    use_gpu=True,
)
# Synthesize text to audio (returns full PCM as numpy array)
pcm = tts.synthesize("Hello, welcome to RapidSpeech!")
# Streaming synthesis (returns list of numpy array chunks)
chunks = tts.synthesize_streaming("Hello, welcome to RapidSpeech!")
for chunk in chunks:
    print(f"Chunk: {len(chunk)} samples")
# Optional: set reference audio for voice cloning
# reference_pcm = ... # load reference audio
# tts.set_reference(reference_pcm, sample_rate=16000)

End-to-end examples for every language binding live in their own folders, each with a dedicated README that walks through installation, CLI flags, and the underlying API surface.
| Folder | What it covers | README |
|---|---|---|
| 🐍 Python | pip install rapidspeech → offline / online ASR (with neural VAD, 2-pass LLM rescoring), offline / streaming TTS, voice cloning | python-api-examples/README.md |
| 🌐 Browser (WebAssembly) | Three-tab demo: offline ASR, mic-driven online ASR, offline TTS. Runs locally with WebGPU + pthreads | wasm-examples/README.md |
| 🟩 Node.js | CLI built on the same WASM module: file → ASR (with optional VAD + 2-pass), text → TTS (with voice cloning) | node-api-example/README.md |
| 💻 C++ CLI | rs-asr-offline / rs-asr-online / rs-tts-offline / rs-quantize | this README (sections above) |
| ☁️ Colab notebook | Build the CLI on a free T4, run ASR/TTS, use the Python API end-to-end | colab/README.md |
| 🤗 HuggingFace Space | Deploy the browser demo as a Docker-SDK Space (COOP/COEP-ready) | huggingface-space/HOWTO.md |
Quick taste of each:
# Python — VAD-segmented 2-pass transcription
python python-api-examples/asr/asr-offline.py \
--model funasr-nano.gguf --audio long.wav \
--vad silero-vad.gguf --two-pass
# Browser — three tabs in one page
cd wasm-examples && python3 serve.py 8000 # then open http://localhost:8000
# Node.js — same WASM module, file-based ASR/TTS
node node-api-example/index.js asr -m funasr-nano.gguf -w audio.wav --two-pass
node node-api-example/index.js tts -m omnivoice.gguf -t "Hello world" -o out.wav

Test environment: Apple M1 Pro, funasr-nano-fp16.gguf, 15s audio
| Configuration | RTF | Wall Time | Notes |
|---|---|---|---|
| CPU -t 4 | 0.465 | 12.4s | CPU-only inference |
| GPU -t 4 | 0.170 | 5.2s | Metal acceleration |
| GPU -t 4 Q4_K | 0.756 | — | Quantized model: GPU dequant overhead |
| CPU -t 4 Q4_K | 0.530 | — | Quantized model CPU inference, 596 MB (3.3× compression) |
RTF (Real-Time Factor) = Processing time / Audio duration. Lower is faster. RTF < 1 means faster than real-time.
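The RTF definition translates directly into code. The sketch below uses hypothetical function names and shows both directions: computing RTF from timings, and recovering processing time from a reported RTF.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration (lower is faster)."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def processing_time(rtf: float, audio_s: float) -> float:
    """Invert the definition: seconds of compute implied by a given RTF."""
    return rtf * audio_s


# An RTF of 0.170 on a 15 s clip implies roughly 2.55 s of processing.
print(f"{processing_time(0.170, 15.0):.2f} s")
```

The wall times in the table are larger than the processing times implied by the RTF column, so they presumably include work outside the timed region, such as model loading. That reading is an assumption; the benchmark notes do not say.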
A conversion tool from HuggingFace models to GGUF format is provided:
python scripts/convert_hf_to_gguf.py \
--model /path/to/hf-model-dir \
--outfile /path/to/output.gguf \
--outtype f16

To convert the Silero VAD model for use with rs-asr-online or offline VAD segmentation:
python scripts/convert_silero_to_gguf.py \
--model /path/to/silero_vad_16k.safetensors \
--output /path/to/silero_vad_v6.gguf

The converted VAD model is also available for direct download from HuggingFace and ModelScope.
Convert MeloTTS (OpenVoice2 base) and the optional Tone Color Converter to GGUF. MeloTTS releases one HuggingFace repo per language; choose the matching --base-model and --language tag.
# English
python scripts/convert_openvoice2.py \
--base-model myshell-ai/MeloTTS-English \
--output-dir ./models \
--language EN
# Chinese
python scripts/convert_openvoice2.py \
--base-model myshell-ai/MeloTTS-Chinese \
--output-dir ./models \
--language ZH
# Japanese
python scripts/convert_openvoice2.py \
--base-model myshell-ai/MeloTTS-Japanese \
--output-dir ./models \
--language JA
# With Tone Color Converter (enables voice cloning via --ref)
python scripts/convert_openvoice2.py \
--base-model myshell-ai/MeloTTS-English \
--converter-model myshell-ai/OpenVoiceV2 \
--output-dir ./models \
--language EN

Outputs:
- openvoice2-base-<lang>.gguf: Text encoder + duration predictor + flow decoder + HiFi-GAN vocoder
- openvoice2-converter.gguf: Tone color converter (only when --converter-model is supplied; needed for --ref voice cloning)
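When converting several languages in one go, the commands above can be generated rather than typed. The following sketch only builds and prints the command lines; it does not run the script. The function `build_convert_command` is hypothetical, while the script path, flags, and repository names are taken from the examples above.

```python
import shlex

# Language tag -> MeloTTS repository, as used in the conversion examples.
MELOTTS_REPOS = {
    "EN": "myshell-ai/MeloTTS-English",
    "ZH": "myshell-ai/MeloTTS-Chinese",
    "JA": "myshell-ai/MeloTTS-Japanese",
}


def build_convert_command(language, output_dir="./models", converter_model=None):
    """Return the argv list for one convert_openvoice2.py invocation."""
    argv = [
        "python", "scripts/convert_openvoice2.py",
        "--base-model", MELOTTS_REPOS[language],
        "--output-dir", output_dir,
        "--language", language,
    ]
    if converter_model is not None:
        # Only needed when voice cloning via --ref is wanted.
        argv += ["--converter-model", converter_model]
    return argv


for lang in MELOTTS_REPOS:
    print(shlex.join(build_convert_command(lang)))
```

Each returned list can be passed to `subprocess.run(argv, check=True)` from the repository root to perform the actual conversion.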
Merge the OmniVoice PyTorch model (LLM + audio tokenizer) into a single GGUF:
python scripts/convert_omnivoice_to_gguf.py \
--model /path/to/omnivoice-model \
--tokenizer /path/to/omnivoice-audio-tokenizer \
--output /path/to/omnivoice-merged.gguf \
--outtype f16

If you are interested in the following areas, we welcome your PRs or participation in discussions:
- Adapting more models to the framework.
- Refining and optimizing the project architecture.
- Improving inference performance.
