The tokenizer is not a preprocessing step. It is the architecture. Every downstream model is a language model that predicts discrete audio tokens, which the tokenizer converts back to waveforms. Get the tokenizer right and the entire family scales with it.

SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | June 8, 2026

Most TTS architectures stack a text encoder, an acoustic model (mel spectrogram predictor or duration model), and a neural vocoder. Each component has its own architecture, its own training data format, its own failure modes. When quality degrades, you debug three separate systems and their interfaces.

MOSS-TTS (arXiv:2603.18090, OpenMOSS team, MOSI.AI, released February 10 2026, Apache 2.0) collapses this stack. Audio is tokenized into discrete tokens by MOSS-Audio-Tokenizer. Text is tokenized by a standard LLM tokenizer. A language model (Qwen3-8B backbone) predicts audio tokens from text tokens. The CAT decoder converts audio tokens back to waveforms. The architecture is: text tokenizer → LLM → audio tokenizer decoder. Three components, two token types, one model class.

The implications of this design are specific: any capability that language models have (long context, instruction following, zero-shot generalization, multilingual transfer) transfers directly to speech generation. You do not need to redesign the acoustic model for 60-minute generation. You extend the LLM context window. You do not need to engineer explicit prosody controls. You provide natural language instructions about how to speak.

OpenMOSS/MOSS-TTS is the coordinating repo for five production models (MOSS-TTS, MOSS-TTSD, MOSS-VoiceGenerator, MOSS-SoundEffect, MOSS-TTS-Realtime) and two efficiency variants (MOSS-TTS-Nano at 0.1B parameters, MOSS-TTS 2.0 announced). All share the same audio tokenizer.

Scope: MOSS-Audio-Tokenizer (CAT architecture, 1.6B, 3M hours), MOSS-TTS (LLM-based speech generation), MOSS-TTSD (spoken dialogue, up to 60 minutes, 5 speakers), and MOSS-VoiceGenerator (voice design from text prompts). Not covered: MOSS-SoundEffect or MOSS-TTS-Nano in depth.

What It Actually Does

The MOSS-TTS family is five distinct generation capabilities unified by one shared audio representation layer:

| Model               | Task                          | Speakers | Context      | Backbone    |
|---------------------|-------------------------------|----------|--------------|-------------|
| MOSS-TTS            | Text-to-speech, voice cloning | 1        | Long-form    | Qwen3-8B    |
| MOSS-TTSD           | Spoken dialogue synthesis     | 1-5      | Up to 60 min | Qwen3-8B    |
| MOSS-VoiceGenerator | Voice design from text        | 1        | Medium       | Qwen3-class |
| MOSS-SoundEffect    | Sound effect generation       | N/A      | Medium       | Qwen3-class |
| MOSS-TTS-Realtime   | Low-latency streaming         | 1        | Real-time    | Smaller     |

All use MOSS-Audio-Tokenizer as the discrete audio interface. All are Apache 2.0.

Quick start:

# HuggingFace: download tokenizer + model
pip install huggingface_hub
hf download OpenMOSS-Team/MOSS-TTSD-v1.0 --local-dir ./MOSS-TTSD-v1.0
hf download OpenMOSS-Team/MOSS-Audio-Tokenizer --local-dir ./MOSS-Audio-Tokenizer

# Fuse TTSD + tokenizer into single servable model
python scripts/fuse_moss_tts_delay_with_codec.py \
  --model-path ./MOSS-TTSD-v1.0 \
  --codec-model-path ./MOSS-Audio-Tokenizer \
  --save-path ./fused-model

# Serve via SGLang
sglang serve --model-path ./fused-model --delay-pattern \
  --trust-remote-code --port 30000 --host 0.0.0.0

The Architecture, Unpacked

The key architectural lever is how many RVQ layers each model uses. MOSS-TTS uses all 32 layers (4kbps, maximum fidelity). MOSS-TTSD uses only 16 layers (2kbps). This is what enables 60-minute dialogue generation: fewer codes per second of audio means the Qwen3-8B backbone can cover longer sequences without truncation. The CAT decoder's robustness to variable K (from quantizer dropout training) is what makes this work correctly.
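The bitrate figures follow from three numbers quoted in this article: 12.5 frames per second, 1024 codebook entries per layer (10 bits per code), and the number of RVQ layers kept. A minimal sketch of that arithmetic is below; the function name rvq_bitrate_kbps is illustrative and not part of any MOSS API.

```python
import math

FRAME_RATE_HZ = 12.5   # frames per second, as quoted in the article
CODEBOOK_SIZE = 1024   # entries per RVQ layer, as quoted in the article


def rvq_bitrate_kbps(num_layers: int) -> float:
    """Bitrate in kbps when keeping the first `num_layers` RVQ layers."""
    bits_per_code = math.log2(CODEBOOK_SIZE)  # 10 bits for 1024 entries
    return num_layers * FRAME_RATE_HZ * bits_per_code / 1000.0


for k in (1, 16, 32):
    print(f"{k:2d} layers -> {rvq_bitrate_kbps(k)} kbps")
```

Each layer contributes the same 125 bits per second, so halving K halves the bitrate.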

The Code, Annotated

Snippet One: MOSS-Audio-Tokenizer Encoding and the Variable-Bitrate Mechanism

# MOSS-Audio-Tokenizer: encoding audio and selecting bitrate
# Source: OpenMOSS-Team/MOSS-Audio-Tokenizer (Apache 2.0)
# The variable-bitrate mechanism is the key to the whole family's flexibility

import torch
from transformers import AutoModel

# Load the 1.6B tokenizer (encoder + decoder combined)
# ← trust_remote_code required: custom CAT architecture not in HF natively
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoModel.from_pretrained(
    "OpenMOSS-Team/MOSS-Audio-Tokenizer",
    trust_remote_code=True,
).to(device).eval()

# ── ENCODING: audio → discrete RVQ tokens ─────────────────────────────────────
# Input: raw 24kHz audio waveform tensor, on the same device as the model
# (random noise is used here only as a stand-in for real audio)
audio_waveform = torch.randn(1, 24000 * 5, device=device)  # 5 seconds, [batch, samples]

with torch.no_grad():
    # encode() returns RVQ codes from all 32 layers
    # Shape: [batch, n_layers, n_frames]  where n_frames = n_samples // 1920 (12.5 frames/s)
    all_codes = tokenizer.encode(audio_waveform)  # [1, 32, 62] for 5s audio
    # ← 5 seconds × 12.5 fps = 62.5 frames → 62 frames (floor)
    # ← 32 layers of RVQ, each frame has 32 codes

# ── VARIABLE BITRATE: select how many RVQ layers to use ──────────────────────
# The key insight: quantizer dropout trains the model to work at any K ≤ 32
# ← THIS is the trick: you can select K at inference without retraining
# The decoder was trained to reconstruct from variable K via dropout (p=1.0)

# High fidelity (MOSS-TTS production): use all 32 layers
codes_full = all_codes[:, :32, :]  # [1, 32, 62] → 4kbps

# Long-context efficiency (MOSS-TTSD): use first 16 layers only
codes_half = all_codes[:, :16, :]  # [1, 16, 62] → 2kbps
# ← At 2kbps + 12.5fps: 16 codes × 12.5fps = 200 codes/second
# For 60-min audio: 200 × 3600 = 720,000 codes, spread over 3600 × 12.5 = 45,000 frames

# Ultra-low bitrate (mobile/streaming): use first 1 layer
codes_nano = all_codes[:, :1, :]   # [1, 1, 62] → 0.125kbps (heavily compressed)

# ── DECODING: RVQ tokens → waveform ───────────────────────────────────────────
# The CAT decoder is robust to variable K because of how quantizer dropout trains it:
# During training: some samples use all 32 layers, some use only 1-31 layers
# During inference: decoder sees only the specified K layers and reconstructs audio

with torch.no_grad():
    # Reconstruct from full quality codes
    waveform_full = tokenizer.decode(codes_full)    # 24kHz output
    # Reconstruct from 2kbps codes (what TTSD does)
    waveform_half = tokenizer.decode(codes_half)    # slightly lower quality
    # ← For dialogue, 2kbps is sufficient: intelligibility maintained
    #   For music or high-fidelity speech: use full 32 layers

# ── SEMANTIC RICHNESS: tokens contain language information ────────────────────
# The 0.5B auxiliary LLM trained jointly provides semantic alignment
# ← These are not acoustic codes only: they have ASR-decodable semantic content
# Evidence: MOSS-Audio-Tokenizer achieves competitive ASR WITHOUT auxiliary encoder
# You can do speech recognition directly from RVQ tokens (no need for separate encoder)

# Direct ASR from tokens (concept):
# asr_result = speech_recognizer.decode_from_tokens(codes_full)
# ← This works because CAT was trained with audio-to-text tasks jointly

The codes_half = all_codes[:, :16, :] selection is the core of MOSS-TTSD's efficiency design. No separate tokenizer is trained for 16-layer inference; the shared CAT tokenizer, trained with quantizer dropout, handles variable K. The arithmetic shows what 2kbps buys: a 60-minute dialogue is 720,000 codes rather than the 1,440,000 it would be at 4kbps, which is what keeps 60-minute single-pass generation tractable.
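Why slicing off the later layers still yields a usable signal is easier to see on a toy residual quantizer. The sketch below is not the CAT tokenizer: it uses small random NumPy codebooks (sizes and names are assumptions chosen for illustration) to show that RVQ codes are ordered coarse to fine, so any prefix of K layers decodes to a valid, coarser reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, LAYERS, ENTRIES = 8, 4, 16  # toy sizes, far below 32 layers x 1024 entries

# Random codebooks whose scale shrinks per layer. Entry 0 is the zero vector,
# which guarantees that adding a layer can never increase the residual.
codebooks = []
for layer in range(LAYERS):
    cb = rng.normal(scale=0.5 ** layer, size=(ENTRIES, DIM))
    cb[0] = 0.0
    codebooks.append(cb)


def rvq_encode(x: np.ndarray) -> list[int]:
    """Greedy residual quantization: each layer codes what earlier layers missed."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes


def rvq_decode(codes: list[int]) -> np.ndarray:
    """Sum the selected entries of however many leading layers are supplied."""
    return sum(codebooks[i][c] for i, c in enumerate(codes))


x = rng.normal(size=DIM)
codes = rvq_encode(x)
errors = [float(np.linalg.norm(x - rvq_decode(codes[:k]))) for k in range(1, LAYERS + 1)]
print(errors)  # non-increasing: more layers give a finer reconstruction
```

A real codec learns its codebooks and decodes through a neural decoder rather than a plain sum, but the prefix property is the same one that the [:, :16, :] slice relies on.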

Snippet Two: MOSS-TTSD Multi-Speaker Dialogue Generation

# MOSS-TTSD: spoken dialogue generation
# Source: OpenMOSS/MOSS-TTSD (Apache 2.0)
# Architecture: Qwen3-8B + MusicGen-style Temporal+Depth Transformer

import base64

import requests

# TTSD is served via SGLang after model fusion
# NOTE: the request and response fields used below are illustrative;
# check the MOSS-TTSD serving docs for the actual schema.
TTSD_URL = "http://localhost:30000/v1/completions"


def load_audio_as_base64(path: str) -> str:
    """Read an audio file and return its raw bytes as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")


# ── MULTI-SPEAKER DIALOGUE GENERATION ─────────────────────────────────────────
# The input format encodes speaker turns explicitly
# Speaker identities come from zero-shot voice cloning (3-5s audio reference)

def generate_dialogue(
    script: list[dict],   # list of {"speaker": "A", "text": "Hello there"}
    speaker_audio: dict,  # {"A": base64_audio_ref, "B": base64_audio_ref}
    max_duration_s: int = 600,  # up to 3600s for TTSD v1.0
) -> bytes:
    """
    Generate a full spoken dialogue from a script and speaker references.

    ← TTSD models the entire dialogue as ONE continuous autoregressive generation
      NOT: generate each utterance separately then concatenate (that loses prosody continuity)
      BUT: model the full dialogue as a single token sequence
      This is why natural turn-taking and overlap patterns emerge:
      the model learns dialogue timing, not just individual utterance quality
    """
    prompt = build_dialogue_prompt(script, speaker_audio)

    response = requests.post(TTSD_URL, json={
        "model": "moss-ttsd-v1.0",
        "prompt": prompt,
        "max_tokens": max_duration_s * 200,  # ← 200 codes/second at 2kbps, 12.5fps, 16 layers
        # ← THIS is the efficiency calculation:
        #   16 RVQ layers × 12.5 fps = 200 audio codes per second
        #   For 10 min: 200 × 600 = 120,000 audio codes
        #   Total (with text tokens): ~125,000 tokens → fits Qwen3-8B extended context
        "temperature": 0.8,
        "delay_pattern": True,  # ← delay pattern for multi-stream AR generation
        # The delay pattern staggers the RVQ layers: layer k is generated
        # k steps behind layer 0, so each step emits one code per layer.
        # ← This keeps generation from taking K × T sequential steps
        #   Instead: about T + K steps (K = num RVQ layers, T = num time frames)
    })
    response.raise_for_status()
    # Assumption: the server returns base64-encoded audio under "audio_bytes".
    # JSON cannot carry raw bytes, so decode before returning.
    return base64.b64decode(response.json()["audio_bytes"])


def build_dialogue_prompt(script, speaker_audio):
    """
    Construct the interleaved text + speaker token prompt.

    Format:
    [SPEAKER_A_REF: base64_audio]
    [SPEAKER_B_REF: base64_audio]
    [A]: Good morning, Dr. Chen.
    [B]: Good morning. I've been reviewing your test results.
    [A]: And?
    [B]: The numbers look much better than last month.
    """
    # Speaker reference tokens come first (zero-shot voice cloning)
    # ← These inject the acoustic identity without fine-tuning
    # The model learns to maintain each speaker's timbre throughout the dialogue
    ref_tokens = ""
    for speaker_id, audio_ref in speaker_audio.items():
        ref_tokens += f"[SPEAKER_{speaker_id}_REF: {audio_ref}]\n"

    # Script as interleaved speaker tags and text
    dialogue_text = "\n".join([
        f"[{turn['speaker']}]: {turn['text']}"
        for turn in script
    ])

    return ref_tokens + dialogue_text


# ── EXAMPLE: 10-minute podcast-style dialogue ──────────────────────────────────
script = [
    {"speaker": "HOST", "text": "Welcome back to the show. Today we're discussing the future of AI audio."},
    {"speaker": "GUEST", "text": "Thanks for having me. It's a fascinating space right now."},
    {"speaker": "HOST", "text": "Let's start with the tokenizer revolution you mentioned last week."},
    # ... up to 60 minutes of dialogue
]

speaker_refs = {
    "HOST": load_audio_as_base64("host_reference_5sec.wav"),
    "GUEST": load_audio_as_base64("guest_reference_5sec.wav"),
}

# ← Single API call generates the entire 10-minute dialogue coherently
audio_bytes = generate_dialogue(script, speaker_refs, max_duration_s=600)

The max_tokens = max_duration_s * 200 calculation reveals TTSD's efficiency design. At 16 RVQ layers × 12.5 fps = 200 codes/second, a 60-minute dialogue requires 720,000 audio codes. Under the delay pattern those codes are produced 16 per step, so the backbone runs roughly 45,000 temporal steps (3,600 s × 12.5 fps) for the full hour. TTSD v1.0 caps generation at 3600 seconds. The 16-layer choice was not arbitrary: it halves the number of codes predicted per frame relative to 32 layers while keeping dialogue intelligible, which is what makes 60-minute single-pass generation tractable.
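The delay pattern mentioned in the request comments can be sketched independently of any model. The code below is a generic MusicGen-style layout in NumPy, not code from the MOSS-TTSD repository; the names apply_delay, undo_delay, and PAD are assumptions for illustration. Layer k is shifted k steps to the right, so a [K, T] grid of codes becomes K parallel streams of length T + K - 1.

```python
import numpy as np

PAD = -1  # placeholder for positions that hold no code yet


def apply_delay(codes: np.ndarray) -> np.ndarray:
    """Shift layer k right by k steps: [K, T] -> [K, T + K - 1]."""
    num_layers, num_frames = codes.shape
    delayed = np.full((num_layers, num_frames + num_layers - 1), PAD, dtype=codes.dtype)
    for k in range(num_layers):
        delayed[k, k:k + num_frames] = codes[k]
    return delayed


def undo_delay(delayed: np.ndarray) -> np.ndarray:
    """Inverse of apply_delay: realign every layer to the same frame index."""
    num_layers, num_steps = delayed.shape
    num_frames = num_steps - num_layers + 1
    return np.stack([delayed[k, k:k + num_frames] for k in range(num_layers)])


codes = np.arange(3 * 5).reshape(3, 5)  # K=3 layers, T=5 frames
delayed = apply_delay(codes)
print(delayed)
```

Each column of the delayed grid is one generation step, which is why the step count grows as T + K - 1 rather than K × T. The cost is a short warm-up: the first frame is complete only after K steps.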

It In Action: End-to-End Worked Example

Input: Generate a 5-minute two-speaker podcast dialogue

Step 1: Tokenizer setup

MOSS-Audio-Tokenizer spec:
  Input: 24kHz waveform (any duration)
  Output: RVQ codes at 12.5 Hz
  Compression: 24,000 samples/s → 12.5 frames/s (1920x)
  Codebook: 1024 entries per layer, 32 layers
  Bitrates: 0.125kbps (1 layer) to 4kbps (32 layers)

Reference audio (5 seconds each for 2 speakers):
  Speaker A (Host): 120,000 samples → 62 frames × 16 codes = 992 RVQ tokens
  Speaker B (Guest): 120,000 samples → 62 frames × 16 codes = 992 RVQ tokens
  Combined reference tokens: ~2,000 tokens prepended to dialogue

Step 2: TTSD generation

Input script:
  [HOST]: "Welcome to AI Audio Weekly. Today we're exploring..."
  [GUEST]: "Thanks for having me. The tokenization approach is..."
  ... (5 minutes of script text ≈ 750 words ≈ 1,000 text tokens)

TTSD token budget calculation:
  Text tokens: ~1,000
  Reference tokens: ~2,000
  Audio generation: 5 min × 200 tokens/s = 60,000 audio tokens
  Total: ~63,000 tokens  ← well within Qwen3-8B extended context

Generation:
  Mode: delay_pattern (MusicGen-style multi-stream AR)
  RVQ layers: 16 (2kbps)
  Frame rate: 12.5 fps
  Steps: 3,750 temporal steps (5 min × 12.5 fps = 3,750 frames)
  Per step: 16 RVQ codes predicted (Depth Transformer)
  Total predictions: 60,000 tokens
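The figures in Steps 1 and 2 can be reproduced with a few lines of arithmetic. This is a sketch using only the constants quoted in this article (12.5 frames per second, 24kHz audio); AudioBudget and audio_budget are hypothetical helper names, not part of the MOSS codebase.

```python
from dataclasses import dataclass

FRAME_RATE_HZ = 12.5
SAMPLE_RATE_HZ = 24_000


@dataclass
class AudioBudget:
    frames: int   # temporal steps at 12.5 Hz
    codes: int    # frames x RVQ layers
    samples: int  # waveform samples at 24 kHz


def audio_budget(duration_s: float, num_layers: int) -> AudioBudget:
    """Frame, code, and sample counts for a clip of the given duration."""
    frames = int(duration_s * FRAME_RATE_HZ)  # floor, matching the 62-frame figure
    return AudioBudget(frames, frames * num_layers, int(duration_s * SAMPLE_RATE_HZ))


reference = audio_budget(5, num_layers=16)        # one 5-second speaker reference
generated = audio_budget(5 * 60, num_layers=16)   # the 5-minute dialogue
print(reference)
print(generated)
```

Calling audio_budget(3600, 16) gives the 60-minute case discussed elsewhere in the article.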

Step 3: CAT decoder

Input: 60,000 RVQ tokens (16 layers × 3,750 frames)
Process: CAT decoder (pure Causal Transformer)
Output: 24kHz waveform, 5 minutes (7,200,000 samples)
Decoder is streaming-compatible: outputs frames causally as tokens arrive

Quality (at 2kbps TTSD output):
  WER (intelligibility): competitive with 4kbps MOSS-TTS
  MOS (naturalness): best/second-best in subjective comparisons
  Speaker consistency: zero-shot cloning maintained throughout 5-min dialogue
  Turn transitions: natural, without splice artifacts, because the dialogue is modeled as a single sequence

Step 4: Benchmark context

MOSS-TTSD v1.0 subjective evaluation results:
  Best or second-best across all subjective metrics vs open-source models
  Outperformed Doubao and Gemini 2.5-pro in subjective evaluations

MOSS-Audio-Tokenizer reconstruction (LibriSpeech test-clean, SIM/STOI/PESQ):
  SOTA among open-source audio tokenizers across speech + audio + music
  Variable bitrate: meaningful quality at 0.125kbps, near-lossless at 4kbps

Why This Design Works, and What It Trades Away

The CNN-free CAT architecture is the correct design choice for a universal audio tokenizer. CNN-based tokenizers (EnCodec, DAC) build fixed receptive fields into the architecture, which biases the model toward local audio patterns. That bias suits speech but is suboptimal for music (which has longer-range structure) and sound effects (which have arbitrary temporal patterns). A pure Transformer with hierarchical patchify operations adapts its attention to the content, not to the architecture's inductive bias. Training on 3 million hours of diverse audio (speech, sound, music) with a single architecture exploits this flexibility.

The quantizer dropout design (p=1.0 during training) is the mechanism that makes one trained model serve five downstream systems at different bitrates. Most codec architectures require separate models or post-hoc quantization to achieve multiple bitrates. MOSS-Audio-Tokenizer is trained to decode correctly at any K ≤ 32 layers from a single training run. This is not a minor implementation detail: it is what allows MOSS-TTSD to select 16 layers for long-context efficiency while MOSS-TTS selects 32 layers for maximum fidelity, with both using identical decoder weights.
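A minimal sketch of how quantizer dropout can be implemented at training time, assuming the common codec recipe in which each training example keeps only its first K layers and K is drawn uniformly from 1 to 32. The article gives p=1.0 but does not spell out the sampling distribution, so both the uniform draw and the reading of p as "probability that an example gets a truncated K" are assumptions; the function names are illustrative.

```python
import numpy as np

NUM_LAYERS = 32


def sample_active_layers(batch_size: int, p: float, rng: np.random.Generator) -> np.ndarray:
    """Per-example K: with probability p draw K uniformly from 1..NUM_LAYERS, else keep all."""
    k = np.full(batch_size, NUM_LAYERS)
    drop = rng.random(batch_size) < p
    k[drop] = rng.integers(1, NUM_LAYERS + 1, size=int(drop.sum()))
    return k


def layer_mask(k: np.ndarray) -> np.ndarray:
    """Boolean mask [batch, NUM_LAYERS]; True where a layer's contribution is kept."""
    return np.arange(NUM_LAYERS)[None, :] < k[:, None]


rng = np.random.default_rng(0)
k = sample_active_layers(batch_size=8, p=1.0, rng=rng)
mask = layer_mask(k)
print(k)
print(mask.sum(axis=1))  # equals k: only the first K layers survive per example
```

In a training loop the mask would zero out the quantized contributions of the dropped layers before the decoder sees them, so the decoder learns to reconstruct from any prefix length.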

The fully discrete generation approach (text tokens → LLM → audio tokens → CAT decoder) inherits LLM capabilities without redesigning the acoustic system. Zero-shot voice cloning works because the LLM learns to condition on reference audio tokens the same way it learns to condition on any in-context exemplar. Long-form stability works because the LLM's context window determines the maximum generation length. Multilingual support (20 languages) works because Qwen3-8B is multilingual. None of these required separate engineering efforts in the audio domain.

What MOSS-TTS trades away:

Computational cost. The Qwen3-8B backbone is not cheap. Inference requires a capable GPU for real-time generation. MOSS-TTS-Nano addresses this with a 0.1B model that runs on CPU, but the quality gap between Nano and the full model is real.

Inference latency. Fully autoregressive token generation produces the highest quality but has latency that scales with output duration. MOSS-TTS-Local-Transformer specifically addresses this by adding a frame-local AR module that reduces time-to-first-audio. For realtime streaming use cases, MOSS-TTS-Realtime is the appropriate model, not the standard MOSS-TTS.

Training complexity. The auxiliary 0.5B LLM for semantic training is an unusual design choice. Training two models (CAT + auxiliary LLM) jointly requires careful gradient coordination and adds compute to an already large training run. The payoff is semantic-rich tokens, but teams training their own tokenizers should account for this overhead.

Technical Moats

The 3-million-hour CAT training run. Data scale is the moat for universal audio tokenizers. A tokenizer trained on speech only will have degraded quality on music; one trained on music will struggle with speech. The diversity and scale of the training data determine how well the tokenizer transfers to each downstream model. Replicating a 3-million-hour diverse audio training run requires infrastructure that most academic and small-team researchers do not have access to.

The joint encoder-quantizer-decoder optimization. Most existing tokenizers train the encoder separately (often from a pretrained semantic model) and the decoder separately. CAT jointly optimizes all three with a unified loss that includes reconstruction, semantic alignment (via the auxiliary LLM), and variable-bitrate robustness (via quantizer dropout). The joint training produces tokens that are simultaneously acoustically faithful and semantically rich. Achieving both properties from separate training objectives is significantly harder.

The progressive sequence dropout. This training technique, which randomly drops later RVQ layer sequences during LLM training, enables TTSD to use 16 of 32 layers at inference without quality collapse. The technique couples the LLM training to the tokenizer's variable-bitrate capability. Replicating this requires understanding both systems and implementing the coupling correctly.

Insights

Insight One: MOSS-Audio-Tokenizer achieves competitive ASR performance without an auxiliary encoder. This is not a marketing claim: it is empirical evidence that the RVQ tokens produced by CAT are semantically rich enough to perform speech recognition directly. Standard TTS pipelines use separate text encoders for meaning and acoustic encoders for speech. MOSS-Audio-Tokenizer produces tokens that encode both. If this property holds at scale, it suggests a path toward truly unified audio-text models that do not need separate acoustic and language understanding pipelines.

Insight Two: MOSS-TTSD's ability to generate 60-minute coherent multi-speaker dialogue in a single autoregressive generation pass is not a feature of the dialogue model. It is a feature of the tokenizer's bitrate selection. The decision to use 16 RVQ layers (2kbps, 200 codes/second) instead of 32 layers is what makes 3600-second generation tractable. The tokenizer's variable-bitrate capability directly enables the downstream model's most distinctive capability. Teams building on this family who want longer or shorter context should understand that they are tuning the tokenizer's K parameter, not the LLM's architecture.
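The trade-off between K and duration can be made concrete with the article's own accounting of K × 12.5 codes per second. The helper below is a sketch under that assumption; max_duration_s and the code budget are illustrative names and values, not parameters of any MOSS model.

```python
FRAME_RATE_HZ = 12.5


def max_duration_s(code_budget: int, num_layers: int) -> float:
    """Seconds of audio that fit in a fixed budget of audio codes at K layers."""
    return code_budget / (num_layers * FRAME_RATE_HZ)


budget = 720_000  # the 60-minute, 16-layer figure from this article
for k in (8, 16, 32):
    print(f"K={k:2d}: {max_duration_s(budget, k) / 60:.0f} min")
```

For a fixed code budget, doubling K halves the duration and halving K doubles it, at the cost of reconstruction fidelity.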

Takeaway

MOSS-TTS-Nano runs on CPU with 0.1B parameters. The same audio tokenizer (MOSS-Audio-Tokenizer) serves both the 0.1B Nano model and the Qwen3-8B full model. This means a model that runs on a laptop CPU and a model requiring a high-end GPU share identical discrete audio representations. The quality difference is determined by the LLM's capacity to predict those representations, not by the representations themselves. A team that starts with MOSS-TTS-Nano for prototyping and scales to the full Qwen3-8B model does not retrain the audio tokenizer or change the audio output format. The upgrade path is replacing one LLM with another, which is the same upgrade path as any other LLM application.

TL;DR For Engineers

  • MOSS-TTS Family (OpenMOSS/MOSS-TTS, Apache 2.0, released Feb 10 2026) is five speech/audio generation models unified by MOSS-Audio-Tokenizer, a 1.6B pure Transformer tokenizer trained on 3M hours. Every model is an LLM predicting discrete audio tokens; MOSS-TTS and MOSS-TTSD use a Qwen3-8B backbone. Text tokenizer → LLM → audio tokenizer decoder.

  • MOSS-Audio-Tokenizer (arXiv:2602.10934): CAT architecture, CNN-free, 32-layer RVQ, 12.5Hz frame rate, 0.125-4kbps variable bitrate via quantizer dropout (p=1.0 during training). 24kHz in, 24kHz out. 1920x temporal compression. Semantically rich tokens (competitive ASR without auxiliary encoder). SOTA reconstruction vs. open-source tokenizers.

  • MOSS-TTSD (arXiv:2603.19739): uses 16 RVQ layers (2kbps), enabling 60-minute single-session dialogue at 200 tokens/second. 1-5 speakers, zero-shot voice cloning, MusicGen-style Temporal+Depth Transformer. Outperformed Doubao and Gemini 2.5-pro in subjective evaluations. Serve with SGLang after fusing model + tokenizer.

  • MOSS-TTS-Nano: 0.1B params, runs on CPU, ONNX version available, same tokenizer as full model. MOSS-TTS 2.0 announced. 20 languages supported.

  • Deployment: fuse model + tokenizer → SGLang serve. PyTorch-free option: llama.cpp + ONNX Runtime for the audio tokenizer.

The Tokenizer Is the Foundation

MOSS-TTS's design decision to unify all audio generation around one shared tokenizer is the correct architectural choice for building a family of models rather than a collection of independent systems. Every downstream capability (long-form stability, multi-speaker dialogue, voice generation, sound effects, realtime streaming) inherits the tokenizer's properties. The tokenizer's variable bitrate enables TTSD's 60-minute context. The tokenizer's semantic richness enables ASR without separate encoders. The tokenizer's causal architecture enables streaming inference.

The CAT architecture, trained end-to-end without CNN inductive biases on 3 million hours of diverse audio, is the technical moat. The five downstream models sit on top of it.

References

  • MOSS-TTS, OpenMOSS team (MOSI.AI), arXiv:2603.18090.

  • MOSS-Audio-Tokenizer, arXiv:2602.10934.

  • MOSS-TTSD, arXiv:2603.19739.

  • OpenMOSS/MOSS-TTS repository (Apache 2.0, released February 2026).
