VoxCPM2: The Tokenizer-Free TTS Architecture That Generates 48kHz Speech in 30 Languages From a Single End-to-End Trained Model

Discrete speech tokens buy you stability. They cost you the fine-grained acoustic information that tokenization discarded before training even started. VoxCPM2 (OpenBMB, arXiv:2606.06928, June 2026) eliminates the tokenizer by introducing a differentiable semi-discrete bottleneck that forces a hierarchical representation to emerge internally during end-to-end training. The result is a 2B parameter model covering 30 languages and 9 Chinese dialects, outputting 48kHz audio, with an average WER of 1.68% on its internal 30-language evaluation set, all without any external discrete speech tokenizer in the loop.

SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | June 19, 2026

The fundamental problem in speech synthesis is a representation dilemma. Discrete audio tokens, the approach used by most modern TTS systems, give you a stable training target for autoregressive generation. But getting there requires running audio through a pretrained codec or tokenizer, and that quantization step is lossy and irreversible: fine-grained acoustic details like subtle breath control, micro-pitch variation, and expressive formant transitions get quantized away before the generation model ever sees the training data. The generation model then learns to produce tokens that, when decoded, produce plausible audio, but not necessarily audio that preserves what made the original recording expressive.

The alternative, modeling audio in continuous latent space without quantization, solves the information loss but creates a different problem: the model has to simultaneously learn high-level prosodic planning (where does this sentence pause? how should this question intonation rise?) and low-level acoustic detail (what is the precise formant trajectory on this vowel?) from one undifferentiated loss signal. Error accumulation and task entanglement make continuous-signal autoregressive generation unstable at scale.

VoxCPM (arXiv:2509.24650, September 2025) and its successor VoxCPM2 (github.com/OpenBMB/VoxCPM, June 2026) resolve this by making the hierarchical structure explicit and internal: a differentiable quantization bottleneck (Finite Scalar Quantization, or FSQ) that separates semantic-prosodic content from fine-grained acoustic details without breaking the end-to-end training gradient. Everything runs in continuous latent space. No external tokenizer. The hierarchy emerges from the architecture, not from preprocessing.

Scope: the four-stage LocEnc/TSLM/RALM/LocDiT pipeline and why each stage exists, the FSQ differentiable bottleneck design, the asymmetric AudioVAE V2 that enables implicit super-resolution, the unified sequence organization in VoxCPM2, and the vLLM-Omni deployment interface. Not covered: detailed benchmarks beyond WER, or the full LoRA fine-tuning pipeline.

What It Actually Does

VoxCPM2 (arXiv:2606.06928, OpenBMB, June 5 2026, Apache 2.0) is a 2B parameter speech generation foundation model. From the end user's perspective it handles: zero-shot TTS from a text prompt and a short audio reference, voice design from natural language descriptions ("a calm, authoritative male voice with a slight British accent"), style-controllable voice cloning, and continuation cloning (extend a recording as if the same speaker continued speaking). All four modes run from the same model weights.

Model evolution:

Version	Params	Data	Languages	Audio
VoxCPM (Sept 2025)	0.5B	1.8M hours bilingual	2 (ZH+EN)	25 Hz VAE
VoxCPM1.5 (Dec 2025)	0.5B	extended	2	25 Hz VAE
VoxCPM2 (Apr/Jun 2026)	2B	2M+ hours	30 langs + 9 dialects	48kHz native

Deployment:

# OpenAI-compatible TTS server via vLLM-Omni
vllm serve openbmb/VoxCPM2 --omni --port 8000

# Call from any OpenAI-compatible client
curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"openbmb/VoxCPM2","input":"Hello from VoxCPM2","voice":"default"}' \
  --output out.wav

Requirements: Python 3.10-3.12, PyTorch 2.5.0+, CUDA 12.0+
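
The same request can be built from Python. The sketch below uses only the standard library and assumes the server started above is reachable at localhost:8000 and accepts the JSON body shown in the curl call. The helper names `build_speech_request` and `synthesize` are illustrative, not part of vLLM or VoxCPM.

```python
import json
import urllib.request


def build_speech_request(
    text: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build (but do not send) the POST request used in the curl example."""
    payload = {"model": "openbmb/VoxCPM2", "input": text, "voice": "default"}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, output_path: str) -> None:
    """Send the request and write the returned audio bytes to disk."""
    request = build_speech_request(text)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses,
    # so an error body is never written to the output file.
    with urllib.request.urlopen(request, timeout=120) as response:
        audio = response.read()
    with open(output_path, "wb") as f:
        f.write(audio)
```

Calling `synthesize("Hello from VoxCPM2", "out.wav")` against a running server should produce the same file as the curl command. Keeping request construction separate from sending makes the payload easy to inspect or unit test without a GPU.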

The Architecture, Unpacked

Focus on the FSQ bottleneck between TSLM and RALM. This is not a discrete tokenizer and it is not a continuous passthrough. It is a differentiable "compression gate" that naturally induces specialization: TSLM learns to plan, RALM learns to refine details, without either module being explicitly told which role it should play. The hierarchy emerges from the architectural constraint.

The Code, Annotated

Snippet One: Inference and Voice Cloning via the API

# VoxCPM2 inference: zero-shot TTS and voice cloning
# Source: OpenBMB/VoxCPM README (Apache 2.0)
# Two usage paths: vLLM-Omni server and direct Python API

# ─── PATH 1: vLLM-Omni server (production) ────────────────────────────────────
# Start the server:
# vllm serve openbmb/VoxCPM2 --omni --port 8000
#
# The --omni flag enables multimodal serving that handles audio I/O
# ← vLLM provides batched concurrent requests, streaming chunk delivery,
#   and multi-GPU deployment with zero additional serving infrastructure
# ← OpenAI-compatible endpoint means drop-in replacement for any client
#   already using the OpenAI TTS API

import requests

def zero_shot_tts(
    text: str,
    reference_audio_path: str,
    output_path: str,
    server_url: str = "http://localhost:8000",
) -> None:
    """
    Zero-shot TTS: clone the voice from a short reference audio file.
    VoxCPM2 infers: speaker identity, speaking style, emotion, prosodic patterns.
    
    ← The reference audio goes through AudioVAE V2 encoding (16 kHz)
      to produce latent patches. These patches are passed to TSLM alongside
      the text tokens. The TSLM learns to condition its semantic-prosodic
      plan on the voice characteristics embedded in those latent patches.
      This is how "voice cloning" works in VoxCPM2: not by explicit
      speaker embedding extraction, but by direct conditioning on
      continuous reference latents.
    """
    with open(reference_audio_path, "rb") as f:
        audio_bytes = f.read()

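    response = requests.post(
        f"{server_url}/v1/audio/speech",
        json={
            "model": "openbmb/VoxCPM2",
            "input": text,
            "voice": "default",
            # ← Assumed field name and hex encoding: check the server's
            #   request schema (a multipart form upload may be expected)
            "reference_audio": audio_bytes.hex(),
        },
        timeout=120,
    )
    # ← Fail loudly on HTTP errors instead of writing an error body to disk
    response.raise_for_status()
    with open(output_path, "wb") as f:
        f.write(response.content)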


# ─── PATH 2: Python API (direct) ──────────────────────────────────────────────
# ← Use when you need finer control over generation parameters
#   or when vLLM serving overhead is too high for single-request use cases

from voxcpm import VoxCPM2

model = VoxCPM2.from_pretrained("openbmb/VoxCPM2")
model.cuda()

# VOICE DESIGN: describe the voice in natural language
# ← This mode uses natural language to specify voice characteristics
#   WITHOUT providing a reference audio. VoxCPM2 interprets the
#   description via its text-semantic understanding and generates
#   a voice matching those characteristics.
#   Internally: the TSLM conditions on the voice-description tokens
#   alongside the text-to-speak tokens.
audio = model.generate(
    text="Good afternoon, and welcome to this week's briefing.",
    voice_description="A calm, measured female voice with a slight British accent, "
                       "speaking at a moderate pace with warm but professional tone.",
    # ← no reference_audio needed in voice-design mode
)
model.save_audio(audio, "designed_voice_output.wav")
# Output: 48kHz WAV, streaming delivery as patches are generated

# STYLE-CONTROLLABLE CLONING
# ← Different from zero-shot cloning: you can override style attributes
#   while preserving the speaker's identity from the reference audio
audio = model.generate(
    text="The situation has become quite urgent and we must act immediately.",
    reference_audio="speaker_neutral.wav",
    style_control={
        "emotion": "urgent",     # ← inject emotion not in reference
        "pace": "faster",        # ← increase tempo while preserving voice
        "emphasis": "high",      # ← boost prosodic emphasis
    }
)
model.save_audio(audio, "style_controlled_output.wav")

# CONTINUATION CLONING
# ← Most challenging mode: continue a recording seamlessly
#   VoxCPM2 retains the last few decoded patches as AudioVAE V2
#   decoder context, ensuring smooth waveform continuity at the junction
audio = model.generate(
    text="And so we arrive at the conclusion of our story.",
    reference_audio="existing_recording_to_continue.wav",
    mode="continuation",   # last patches used as decoder context
)
model.save_audio(audio, "continuation_output.wav")

The voice-design mode (generating from a natural language description without reference audio) reveals what the TSLM actually learned during training: it is not just a prosody planner conditioned on audio latents. It can condition on textual voice descriptions as a substitute for audio conditioning. This is the practical benefit of training the entire system end-to-end on continuous latents: the TSLM's text-understanding capabilities transfer directly to voice-design control.

Snippet Two: Understanding the FSQ Bottleneck and Streaming Design

# VoxCPM's FSQ bottleneck: why differentiable semi-discrete beats hard tokenization
# Reconstructed from arXiv:2509.24650 Section 3 and arXiv:2606.06928 Section 2
# This is the core architectural decision that distinguishes VoxCPM from codec-based TTS

import torch
import torch.nn as nn

class FiniteScalarQuantization(nn.Module):
    """
    Finite Scalar Quantization (FSQ) bottleneck.
    Mentzer et al. 2024: differentiable alternative to Vector Quantization.
    
    VoxCPM uses: 256 dimensions, 9 scalar levels per dimension.
    
    ← WHY FSQ instead of standard VQ codebooks?
      VQ stores an explicit learned codebook, needs a nearest-neighbour
      search over it, and relies on auxiliary commitment losses.
      Large VQ codebooks are also prone to collapse (many entries unused).
      
    ← FSQ quantizes each dimension INDEPENDENTLY using a fixed set of scalar
      levels (e.g., {-1.0, -0.75, -0.5, ..., 0.75, 1.0} for 9 levels).
      The implicit codebook has 9^256 entries, but nothing is stored or
      searched: each dimension is simply rounded to its nearest level.
      
    ← CRITICALLY: FSQ is differentiable via straight-through estimator.
      Gradients flow through the quantization step.
      Both TSLM (planning above FSQ) and RALM (refining below FSQ) are
      optimized simultaneously under a single end-to-end diffusion objective.
    """

    def __init__(self, n_dims: int = 256, n_levels: int = 9):
        super().__init__()
        self.n_dims = n_dims
        self.n_levels = n_levels
        # Evenly spaced levels in [-1, 1]
        levels = torch.linspace(-1.0, 1.0, n_levels)
        self.register_buffer("levels", levels)

    def forward(self, z: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        """
        z: continuous latent from TSLM output [..., n_dims]
        Returns: z_q (quantized "skeleton"), z_r (continuous residual)
        """
        # Squash into (-1, 1) with tanh, then quantize each dimension independently
        z_clamped = torch.tanh(z)  # map to (-1, 1)

        # Nearest level for each scalar dimension
        # Shape: [..., n_dims, n_levels]
        distances = (z_clamped.unsqueeze(-1) - self.levels).abs()
        indices = distances.argmin(dim=-1)
        z_q = self.levels[indices]

        # ← THIS is the trick: straight-through estimator
        # Forward pass: z_q is the discrete skeleton (stable, categorical-like)
        # Backward pass: gradient passes THROUGH z_q as if it were z_clamped
        # This is what makes FSQ differentiable without a VQ codebook
        z_q = z_clamped + (z_q - z_clamped).detach()

        # Continuous residual: what the skeleton MISSED
        # ← RALM receives z_q together with this residual and restores the
        # fine-grained acoustic detail that the discrete-like skeleton
        # cannot represent
        # Computed in the bounded tanh space so both terms share one scale;
        # gradients reach the TSLM through z_clamped
        z_r = z_clamped - z_q.detach()

        return z_q, z_r


class PatchStreamingDecoder:
    """
    VoxCPM's patch-local streaming design.
    Shows how LocDiT's intra-patch attention enables streaming.
    """

    def __init__(self, voxcpm2_model, audiovae):
        self.model = voxcpm2_model
        self.audiovae = audiovae
        self.decoder_context = []  # stateful AudioVAE context

    def stream_generate(
        self,
        text: str,
        reference_patches: list,
        chunk_size_ms: int = 200,  # ← LocDiT operates on ~200ms patches
    ):
        """
        Stream generation: each patch decoded immediately.
        
        ← LocDiT has full attention WITHIN a patch but no cross-patch attention
          at the LocDiT level. This is what makes streaming possible:
          to decode patch N, you only need the TSLM/RALM representations
          for that patch, not the full waveform context of all prior patches.
          
        ← The AudioVAE V2 decoder IS stateful: it retains the last few
          decoded patches as context for waveform continuity.
          This is the ONLY place where cross-patch history is maintained.
        """
        # Run TSLM (autoregressive, causal) for all patches
        tslm_outputs = self.model.tslm.generate(
            text_tokens=self.model.tokenize(text),
            reference_patches=reference_patches,
        )  # ← TSLM is causal: each patch's skeleton planned left-to-right

        for patch_idx, tslm_out in enumerate(tslm_outputs):
            # FSQ bottleneck
            z_q, z_r = self.model.fsq(tslm_out)

            # RALM refines acoustic details for this patch
            z_combined = self.model.ralm(z_q, z_r)

            # LocDiT: diffusion within this patch only
            z_latent = self.model.locdit(
                z_combined,
                timesteps=self.model.sway_timesteps(),
                # ← Sway sampling: more steps at high-noise = better quality
                # at same total NFE (number of function evaluations)
            )

            # AudioVAE V2 decode: immediately convert to 48kHz audio
            audio_chunk = self.audiovae.decode(
                z_latent,
                context=self.decoder_context,
                # ← Stateful context ensures smooth waveform at patch boundaries
            )
            self.decoder_context = [z_latent]  # update context

            # Yield chunk immediately: true streaming
            # ← User hears audio starting before generation is complete
            yield audio_chunk

The PatchStreamingDecoder.stream_generate() illustrates the two-level streaming design: LocDiT is patch-local (no cross-patch attention, enabling immediate decode), while AudioVAE V2 maintains a stateful context window at the waveform level for smooth transitions. These are two independent mechanisms working at different layers of the stack.
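
The snippet above calls `sway_timesteps()` without defining it. A minimal sketch of such a schedule follows, using the sway sampling formula from F5-TTS (Chen et al., 2024) and assuming the flow-matching convention in which t = 0 is pure noise and t = 1 is data. The coefficient `s = -1.0` and the 10-step count are illustrative values; whether VoxCPM2 uses the same ones is not stated here.

```python
import math


def sway_timesteps(num_steps: int = 10, s: float = -1.0) -> list[float]:
    """Return num_steps + 1 timesteps in [0, 1] warped by sway sampling.

    Uniform points u are mapped to t = u + s * (cos(pi/2 * u) - 1 + u).
    With s < 0 the points bunch up near t = 0 (the high-noise end under
    the convention assumed here), so more solver steps land there.
    """
    timesteps = []
    for i in range(num_steps + 1):
        u = i / num_steps
        timesteps.append(u + s * (math.cos(math.pi / 2 * u) - 1 + u))
    return timesteps


uniform = [i / 10 for i in range(11)]
swayed = sway_timesteps(10, s=-1.0)

# Count how many of the 10 solver steps start in the noisier half (t < 0.5).
print("uniform:", sum(t < 0.5 for t in uniform[:-1]))  # 5
print("swayed: ", sum(t < 0.5 for t in swayed[:-1]))   # 7
```

The total number of function evaluations is unchanged; only their placement along the trajectory moves toward the noisy end.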

It In Action: End-to-End Worked Example

Task: Generate a dramatic audiobook narration passage with context-aware prosody

Input:

text = """
The letter arrived on a Tuesday morning, just as the last embers of winter 
were dying in the hearth. Clara read it twice, then a third time, each pass 
making the words more impossible, not less. She folded it carefully, 
placed it in the drawer where she kept the ones she could not bring herself 
to answer, and went to make tea.
"""

reference_audio = "professional_narrator.wav"   # 10-15 seconds of reference

Step 1: Reference audio encoding

AudioVAE V2 encodes reference_audio at 16 kHz → continuous latent patches
Frame rate: 25 Hz → ~25 patches per second of reference audio
10-second reference → ~250 latent patches
Each patch: continuous vector capturing voice texture, formant patterns,
            speaking rate, breath timing of the reference speaker
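
The frame arithmetic in Step 1 can be checked in a few lines. This sketch assumes only the 25 Hz latent rate stated above; the function names are illustrative.

```python
def latent_frames(duration_s: float, frame_rate_hz: float = 25.0) -> int:
    """Number of latent frames produced for a clip of the given length."""
    return round(duration_s * frame_rate_hz)


def frame_duration_ms(frame_rate_hz: float = 25.0) -> float:
    """Milliseconds of audio summarized by one latent frame."""
    return 1000.0 / frame_rate_hz


for seconds in (10, 15):
    print(f"{seconds} s reference -> {latent_frames(seconds)} latent frames")

print(f"one frame covers {frame_duration_ms():.0f} ms of audio")
```

A 10 to 15 second reference therefore adds 250 to 375 conditioning positions ahead of the text tokens.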

Step 2: TSLM generates semantic-prosodic plan

TSLM input: text tokens (the passage above) + 250 reference latent patches
TSLM (24-layer MiniCPM-4-0.5B-initialized transformer) autoregressively plans:

Patch 1-5  (sentence 1): Slow, measured pace. Flat affect (narrating setting).
Patch 6-15 (sentence 2): Slight acceleration. Rising internal tension.
            "twice, then a third time" → mild rhythmic emphasis.
Patch 16-28 (sentence 3): Pause before "each pass". Then slight prosodic drop
            on "more impossible, not less" (understated gravity).
Patch 29-40 (sentence 4): Returning to flat narration. Three rhythmic beats
            on "folded...placed...went." 

FSQ output (256 dims, 9 levels): semantic skeleton for each patch.
This skeleton is discrete-like: stable, plannable, but NOT reconstructable
to waveform without the continuous residual.
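
To get a feel for how much the skeleton can carry, the sketch below works out the capacity of a 256-dimension, 9-level FSQ vector and shows the per-dimension rounding. It assumes each patch's skeleton is a single such vector, as described above; `quantize_scalar` is an illustrative helper, not code from the VoxCPM repository.

```python
import math

N_DIMS = 256
N_LEVELS = 9

# Bits needed to pick one of 9 levels, summed over 256 independent dimensions.
bits_per_dim = math.log2(N_LEVELS)
bits_per_vector = N_DIMS * bits_per_dim

# The implicit codebook has 9**256 entries; Python integers keep this exact.
implicit_codes = N_LEVELS ** N_DIMS


def quantize_scalar(x: float, n_levels: int = N_LEVELS) -> float:
    """Snap a value to the nearest of n_levels evenly spaced levels in [-1, 1]."""
    step = 2.0 / (n_levels - 1)
    x = max(-1.0, min(1.0, x))
    return round((x + 1.0) / step) * step - 1.0


print(f"{bits_per_dim:.2f} bits per dimension")
print(f"{bits_per_vector:.1f} bits per skeleton vector")
print(f"implicit codebook size: a {len(str(implicit_codes))}-digit number")
print(quantize_scalar(0.3), quantize_scalar(0.9), quantize_scalar(-2.0))
```

Roughly 811 bits per vector is far more than a typical single-codebook codec token carries, yet nothing has to be stored or searched: quantization is just rounding each coordinate.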

Step 3: RALM refines acoustic details

RALM (6 layers) receives FSQ skeleton for each patch:
  Adds: precise formant trajectory for the narrator's specific voice
        breathiness characteristic to this speaker
        micro-timing (sentence-initial lengthening, pre-boundary lengthening)
        vowel reduction patterns in unstressed syllables
  
Output: combined representation (skeleton + acoustic details) per patch

Step 4: LocDiT generates latent + AudioVAE V2 decodes

LocDiT (4-layer diffusion) runs on each patch independently:
  Sway sampling: 20 NFE total, more steps allocated to high-noise timesteps
  CFG-Zero*: early-step artifact suppression
  
AudioVAE V2 decodes at 48 kHz:
  Each decoded latent frame: ~40ms of audio (1 / 25 Hz)
  Stateful context: last decoded patch maintained for smooth waveform continuity
  
Streaming: first audio chunk available after first LocDiT + AudioVAE call
           (no need to wait for full passage generation to start playing)
           
Approximate timing on a single A100:
  TSLM autoregressive pass (full passage): ~0.8s
  Per-patch RALM + LocDiT + decode: ~15ms
  Total generation (40-patch passage): ~1.4s for a 10-second output
  Streaming first chunk: ~0.85s (full TSLM pass plus the first patch's RALM + LocDiT + decode)
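
The timing figures above combine as follows. This is a back-of-envelope sketch using only the approximate numbers quoted in this section, under the assumption that the whole TSLM pass finishes before per-patch decoding starts; the function name is illustrative.

```python
def generation_timing(
    n_patches: int,
    tslm_pass_s: float = 0.8,
    per_patch_s: float = 0.015,
    audio_duration_s: float = 10.0,
) -> dict[str, float]:
    """Latency estimates when the full TSLM pass runs before decoding."""
    total_s = tslm_pass_s + n_patches * per_patch_s
    first_chunk_s = tslm_pass_s + per_patch_s
    return {
        "total_s": total_s,
        "first_chunk_s": first_chunk_s,
        "real_time_factor": total_s / audio_duration_s,
    }


timing = generation_timing(n_patches=40)
for name, value in timing.items():
    print(f"{name}: {value:.3f}")
```

With these inputs the total comes to about 1.4 s for 10 s of audio, a real-time factor of about 0.14. The itemized first-chunk latency lands just above 0.8 s, so the TSLM pass dominates time to first audio.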

Output characteristics:

Output: 48kHz WAV, ~10 seconds
Context-aware prosody: the model correctly applied:
  - understated gravity on "more impossible, not less" (without explicit instruction)
  - rhythmic triplet on the final sentence ("folded...placed...went")
  - appropriate pause lengths around comma boundaries
  These emerged from the TSLM's text comprehension, not from explicit style tags
  
The phrase "context-aware expressiveness" in the paper title refers to exactly this:
  the model reads narrative intent from the text and applies appropriate prosody
  without the user needing to mark up the text with emotion labels or pause codes

Why This Design Works, and What It Trades Away

The FSQ bottleneck is the correct solution to the representation dilemma for one specific reason: it makes the hierarchy emerge from the architecture rather than from preprocessing. In codec-based TTS, you get a hierarchy because you chose to build one: a separate tokenizer creates discrete tokens, then a language model learns to predict them. The semantic and acoustic levels are separated by the engineering decision of which tokenizer to use, and that tokenizer's inductive biases propagate into everything downstream. In VoxCPM, the hierarchy is learned: the TSLM discovers what it can most efficiently express through the FSQ bottleneck (semantic-prosodic content, which is stable and categorical by nature) and leaves the rest for RALM (acoustic detail, which benefits from continuous representation). The division is not prescribed. It emerges.

The asymmetric AudioVAE V2, encoding at 16 kHz and decoding at 48 kHz, is a clever data engineering decision. A 16 kHz encoder processes one third as many input samples per second as a 48 kHz encoder, which means the latent sequence is shorter, the TSLM runs faster, and the model is more computationally tractable at 2B parameters and 2M+ hours of training data. The 48 kHz decoder recovers high-frequency content through the implicit super-resolution property of a well-trained deep neural network. You get 48 kHz quality at 16 kHz encoding cost.
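
The asymmetry is easiest to see in samples per latent frame. The sketch below assumes the 25 Hz latent rate quoted elsewhere in this article applies to both the encoder and the decoder; the constant names are illustrative.

```python
ENCODE_RATE_HZ = 16_000
DECODE_RATE_HZ = 48_000
LATENT_RATE_HZ = 25

# Waveform samples that one latent frame consumes (encode) or emits (decode).
samples_in_per_frame = ENCODE_RATE_HZ // LATENT_RATE_HZ
samples_out_per_frame = DECODE_RATE_HZ // LATENT_RATE_HZ
upsampling_factor = DECODE_RATE_HZ // ENCODE_RATE_HZ

# Nyquist limit: the highest frequency each sample rate can represent.
encode_nyquist_hz = ENCODE_RATE_HZ // 2
decode_nyquist_hz = DECODE_RATE_HZ // 2

print(f"encoder reads  {samples_in_per_frame} samples per frame")
print(f"decoder writes {samples_out_per_frame} samples per frame")
print(f"upsampling factor: {upsampling_factor}x")
print(f"band the decoder must synthesize: "
      f"{encode_nyquist_hz}-{decode_nyquist_hz} Hz")
```

Nothing above 8 kHz exists in the encoder input, so everything between 8 kHz and 24 kHz in the output is predicted by the decoder rather than copied from the reference.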

The unified sequence organization in VoxCPM2 is the design decision that enables four distinct generation modes from one set of parameters. All four modes (zero-shot TTS, voice design, style cloning, continuation) are expressed as different arrangements of the same building blocks: text tokens, reference audio latents, voice description tokens, and continuation context. Because the training distribution covers all these arrangements, a single model learns all four capabilities jointly rather than through separate task-specific heads.
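
One way to picture "different arrangements of the same building blocks" is a function that returns an ordered list of typed segments per mode. This is a hypothetical sketch: the segment names, their ordering, and the use of a description segment for style cloning are assumptions made for illustration, not the layout used by VoxCPM2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    kind: str     # which building block this is
    payload: str  # placeholder; real inputs would be token ids or latents


def build_sequence(
    mode: str, text: str, description: str = "", reference: str = ""
) -> list[Segment]:
    """Arrange the shared building blocks for one of the four modes."""
    target = Segment("text", text)
    if mode == "zero_shot":
        return [Segment("reference_latents", reference), target]
    if mode == "voice_design":
        return [Segment("voice_description", description), target]
    if mode == "style_cloning":
        return [
            Segment("voice_description", description),
            Segment("reference_latents", reference),
            target,
        ]
    if mode == "continuation":
        return [Segment("continuation_context", reference), target]
    raise ValueError(f"unknown mode: {mode}")


for mode in ("zero_shot", "voice_design", "style_cloning", "continuation"):
    seq = build_sequence(mode, "Hello.", description="calm", reference="ref.wav")
    print(mode, "->", [segment.kind for segment in seq])
```

The point of the sketch is that no mode needs its own head or weights: the model only ever sees a sequence, and the mode is implied by which segments are present.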

What VoxCPM2 trades away:

The TSLM's autoregressive pass is the latency bottleneck for long texts. The causal structure requires planning the full passage before streaming can begin (or more precisely, before the first patch's TSLM output is available). For short utterances this is negligible. For multi-paragraph synthesis at production scale, the TSLM's per-token compute cost compounds. The paper acknowledges chunk-based streaming as the mitigation, but chunk-level TSLM planning for very long texts still serializes the generation of each chunk.
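
On the client side, chunking usually means splitting long text at sentence boundaries before synthesis. How the paper defines its chunks is not described here, so the sketch below is a generic splitter with an illustrative `max_chars` budget, not the method used by VoxCPM2.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


passage = (
    "The letter arrived on a Tuesday morning. Clara read it twice, then a "
    "third time. She folded it carefully and went to make tea."
)
for chunk in split_into_chunks(passage, max_chars=80):
    print(repr(chunk))
```

Each chunk can then be synthesized in turn, so playback of the first chunk can start while later chunks are still being generated. Chunks are still processed one after another, which is the serialization cost noted above.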

The model does not currently support explicit SSML-style control (pause markers, phonetic transcription overrides, explicit emphasis annotations). Context-aware prosody is the design: the model infers delivery from the text. For applications that require precise, reproducible prosody control at the phoneme or word level, relying on inference from text alone is a tradeoff. The style-control mode offers coarser style attributes but not phoneme-level precision.

Technical Moats

End-to-end diffusion objective with differentiable FSQ. The claim that VoxCPM is trained "under a simple diffusion objective" end-to-end, without separate tokenizer pretraining, without multi-stage training pipelines, without auxiliary losses for the TSLM and RALM separately, is the most technically significant claim in the paper. The difficulty is getting the FSQ bottleneck's straight-through estimator to actually produce the desired specialization rather than collapsing or mode-dropping. The architecture design around FSQ (256 dimensions, 9 levels, specifically the dimensionality that avoids the codebook explosion while capturing enough acoustic variation) is the engineering insight that is non-obvious and non-trivial to reproduce.

AudioVAE V2's implicit super-resolution. The asymmetric codec is itself a research contribution: a streaming-compatible audio VAE that encodes efficiently at 16 kHz while producing 48 kHz output at decode time. The causal CNN architecture that makes it streaming-compatible while achieving this quality level is a non-trivial design challenge. Naive upsampling after a 16 kHz decoder produces audio with missing high-frequency content. The AudioVAE V2 recovers that content by learning, during training, to predict plausible high-frequency acoustic detail from low-frequency continuous latents.

The unified sequence organization. Expressing all four generation modes (zero-shot TTS, voice design, style cloning, continuation) as different arrangements of the same input building blocks, and jointly training under a single objective, achieves zero-shot generalization across modes. This is not trivially reproducible: you need training data that covers all four modes in the right proportions, and you need the model capacity (2B parameters) to learn all four conditioning patterns without each conflicting with the others.

Insights

Insight One: VoxCPM's FSQ bottleneck approach is a direct rebuttal to the consensus that you need a separately pretrained discrete tokenizer to get stable, high-quality TTS. The consensus position, represented by models like SoundStorm, VALL-E, and VoiceCraft, is that language model stability requires discrete tokens and that discrete tokens require a pretrained codec. VoxCPM's end-to-end result suggests this was a technical limitation of earlier architectures, not a fundamental requirement of the problem. The FSQ bottleneck provides enough discreteness for stability without losing the continuous acoustic richness that makes expressive speech generation possible. Whether 1.68% WER in 30 languages demonstrates that continuous-latent hierarchical modeling fully closes the gap with the best tokenizer-dependent systems is a question the paper partially answers on public benchmarks, but the result is competitive enough to invalidate the claim that tokenizers are strictly necessary.

Insight Two: The context-aware expressiveness feature, the model reading narrative intent from text and applying appropriate prosody without explicit markup, is the capability that is most under-discussed relative to the architecture. The VoxCPM papers spend most of their technical content on the FSQ bottleneck and the training methodology. But "context-aware expressiveness" is the practical capability that separates a TTS system good enough for short-form content from one usable for long-form audiobooks, podcasts, and narration. This capability derives entirely from initializing the TSLM from MiniCPM-4-0.5B, a language model that already understands narrative structure, emotional register, and discourse continuity. The TSLM was not taught by prosody-labeled training data that "more impossible, not less" should have a specific prosodic contour. It inferred it from its language model prior. This is the key architectural contribution that the title buries under "context-aware TTS."

Takeaway

VoxCPM1.5 hit #1 GitHub Trending in December 2025, before VoxCPM2 was released. The GitHub trending milestone happened on the open-sourcing of fine-tuning code (SFT and LoRA), not on the model release itself. The open-source TTS community responded more strongly to the availability of fine-tuning infrastructure than to the model performance numbers. This pattern, seen also with LLaMA's community adoption being driven largely by fine-tuning accessibility, suggests that for speech generation as for text generation, the most meaningful open-source contribution is not the base model weights but the fine-tuning toolchain that lets practitioners adapt the model to their specific speaker, language, or application. VoxCPM2's Apache 2.0 release of weights, fine-tuning code, and inference tools is the distribution strategy informed by that observation.

TL;DR For Engineers

VoxCPM2 (arXiv:2606.06928, github.com/OpenBMB/VoxCPM, Apache 2.0, June 2026) is a 2B parameter tokenizer-free TTS model: 30 languages + 9 Chinese dialects, 48kHz output, 1.68% WER on internal 30-language eval, vLLM-Omni OpenAI-compatible serving.
Four-stage pipeline: LocEnc (patch encoder) → TSLM + FSQ bottleneck (24-layer semantic-prosodic planner with differentiable semi-discrete quantization, 256 dims × 9 levels) → RALM (6-layer acoustic detail refiner) → LocDiT (4-layer patch diffusion decoder). All in AudioVAE V2 continuous latent space.
FSQ is the architectural heart: Finite Scalar Quantization (Mentzer et al. 2024) is differentiable via straight-through estimator, forces TSLM/RALM specialization without separate codec pretraining, avoids codebook explosion. End-to-end trained under a single diffusion objective.
AudioVAE V2: asymmetric, encodes at 16 kHz for efficiency, decodes at 48 kHz via implicit super-resolution. Causal CNN, streaming-compatible, stateful context at patch boundaries.
Four generation modes from one model: zero-shot TTS, voice design (natural language description), style-controllable cloning, continuation cloning. All = different input arrangements of the same building blocks.

The Tokenizer Was Always the Problem

VoxCPM2's contribution is demonstrating at 2B scale and 30 languages that the external discrete speech tokenizer, the element every major TTS system of the past three years assumed was necessary, is an architectural convenience that trades acoustic expressiveness for training stability. The FSQ differentiable bottleneck provides the stability that discrete tokens were solving for, while keeping the continuous acoustic richness that discrete tokens discard. The fact that this is now reproducible by the community under Apache 2.0 license, with fine-tuning code and inference tools included, is what gives the technical contribution practical traction.

References

VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning, arXiv:2509.24650, Zhou et al., OpenBMB, September 2025
VoxCPM2 Technical Report, arXiv:2606.06928, Zhou et al., OpenBMB, June 2026
OpenBMB/VoxCPM GitHub Repository, Apache 2.0
Finite Scalar Quantization: VQ-VAE Made Simple, Mentzer et al., ICLR 2024 — the FSQ method VoxCPM's bottleneck is built on
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching, Chen et al., arXiv:2410.06885, 2024 (introduces Sway Sampling, used at inference to improve quality per NFE)

Summary

VoxCPM2 (OpenBMB, arXiv:2606.06928, June 2026, Apache 2.0) is a 2B parameter tokenizer-free TTS foundation model for 30 languages and 9 Chinese dialects, achieving 1.68% average WER on its internal 30-language evaluation set and native 48kHz output. Its core architectural innovation is a differentiable FSQ (Finite Scalar Quantization, 256 dims × 9 levels) bottleneck between a 24-layer TSLM (semantic-prosodic planner, initialized from MiniCPM-4-0.5B) and a 6-layer RALM (acoustic detail refiner), trained end-to-end under a single diffusion objective without any external discrete speech tokenizer. AudioVAE V2 asymmetrically encodes at 16 kHz and decodes at 48 kHz, enabling efficient encoding with high-fidelity output and patch-level streaming. A unified sequence organization expresses zero-shot TTS, voice design, style-controllable cloning, and continuation cloning as different input arrangements, enabling joint training under one objective. The full stack is served via vLLM-Omni's OpenAI-compatible interface with batched streaming and multi-GPU support.
