LuxTTS (ysharma3501/LuxTTS, Apache 2.0, 3.7k stars) breaks the usual quality-versus-speed tradeoff in zero-shot TTS with a four-step distilled voice cloning model that outputs 48kHz audio, fits in 1GB VRAM, runs 150x faster than real time on GPU and faster than real time on CPU, and achieves zero-shot voice quality competitive with models 10x its size. The architectural foundation is ZipVoice (arXiv:2506.13053), a flow-matching TTS system from Xiaomi's ASR team that borrowed its backbone, Zipformer, directly from speech recognition. A sequence modeling architecture designed for understanding speech is now generating it at industrial throughput.
SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | July 01, 2026
The standard recipe for zero-shot TTS in 2025 is: large DiT (Diffusion Transformer) backbone, 30+ sampling steps, classifier-free guidance (CFG) with two forward passes per step, 24kHz vocoder output, 3-8GB VRAM. You get good quality and slow inference. Or you get a small model with fast inference and noticeably worse speaker similarity. This is the tradeoff the field has accepted.
LuxTTS and its foundation model ZipVoice (arXiv:2506.13053, Han Zhu, Wei Kang, Zengwei Yao, Daniel Povey et al., Xiaomi Corp., June 2025) do not improve the tradeoff by training a larger model or using more data. They change the architecture. Zipformer, the backbone that powers ZipVoice, was designed by Daniel Povey for automatic speech recognition (ASR) and had not been widely used for speech synthesis. ZipVoice demonstrates that it is, in fact, better at TTS than the DiT architectures everyone else is using, with 3x fewer parameters and up to 30x faster inference.
LuxTTS takes ZipVoice and makes three targeted modifications: distill to exactly four inference steps, replace the 24kHz vocoder with a custom 48kHz Vocos vocoder, and apply a custom improved sampling technique. The result is a model that fits in 1GB of VRAM and generates a second of speech in roughly 6.7ms of GPU compute.
Scope: the ZipVoice Zipformer + conditional flow matching architecture, LuxTTS's three modifications over base ZipVoice, the flow distillation that eliminates classifier-free guidance overhead, and the 48kHz Vocos vocoder. Not covered: multilingual support beyond English and Chinese (Japanese/Korean/French quality in LuxTTS is community-reported as inconsistent), or streaming inference implementation.
What It Actually Does
LuxTTS is a zero-shot voice cloning TTS model. You provide a reference audio file (any speaker, any language, wav or mp3), a text string, and get back 48kHz audio in the voice of the reference speaker, without any fine-tuning or speaker-specific training.
The key specifications:
| Property | Value |
|---|---|
| Output sample rate | 48kHz (vs 24kHz industry standard) |
| GPU speed | 150x real time |
| CPU speed | Faster than real time |
| VRAM requirement | < 1GB |
| Inference steps | 4 (distilled) |
| License | Apache 2.0 |
| Platform support | CUDA, CPU, Apple MPS |
| HuggingFace model | YatharthS/LuxTTS |
Quick setup:
git clone https://github.com/ysharma3501/LuxTTS.git
cd LuxTTS
pip install -r requirements.txt
The Architecture, Unpacked

Focus on the flow distillation step. Base ZipVoice uses classifier-free guidance (CFG), which requires running the Zipformer decoder twice per sampling step: once with conditioning (text + speaker) and once without (unconditional), then interpolating the outputs. LuxTTS eliminates this entirely through distillation, removing the single largest per-step compute cost before even reducing step count.
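To make the saving concrete, here is a minimal sketch that counts backbone forward passes per utterance. The function name is illustrative, and the step counts are the article's own figures (30 as the lower bound quoted for a guided baseline, 4 for the distilled model), not measurements.

```python
def forward_passes(num_steps: int, uses_cfg: bool) -> int:
    """Backbone evaluations for one utterance with a fixed-step ODE solver."""
    passes_per_step = 2 if uses_cfg else 1  # CFG adds an unconditional pass per step
    return num_steps * passes_per_step


# Illustrative step counts taken from the text: 30 guided steps vs 4 distilled steps.
baseline = forward_passes(num_steps=30, uses_cfg=True)
distilled = forward_passes(num_steps=4, uses_cfg=False)
print(baseline, distilled, baseline / distilled)
```

Under these assumptions the distilled model runs the backbone 15 times less often, before any per-pass speedup from the smaller architecture is counted.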
The Code, Annotated
Snippet One: Voice Cloning Pipeline with Design Intent
# LuxTTS: zero-shot voice cloning in ~15 lines
# Source: ysharma3501/LuxTTS README (Apache 2.0)
# Design intent: three-function API hides all CFM, vocoder, and alignment complexity
import soundfile as sf
from zipvoice.luxvoice import LuxTTS
# ─── MODEL LOADING ───────────────────────────────────────────────────────────
# ← model auto-downloads from HuggingFace: YatharthS/LuxTTS
# ← Fits in <1GB VRAM, compare to F5-TTS (~3GB), E2-TTS (~4GB+)
# ← MPS support: Apple Silicon M-series chips work natively
lux_tts = LuxTTS('YatharthS/LuxTTS', device='cuda')
# CPU alternative: lux_tts = LuxTTS('YatharthS/LuxTTS', device='cpu', threads=2)
# MPS (Apple): lux_tts = LuxTTS('YatharthS/LuxTTS', device='mps')
# ─── STEP 1: ENCODE THE REFERENCE SPEAKER PROMPT ────────────────────────────
# ← encode_prompt() extracts the speaker's vocal characteristics from any audio
# rms=0.01: normalizes the reference audio volume before encoding
# ← This is the "zero-shot" part: no fine-tuning, no speaker enrollment
# ← Takes ~10s on first call (librosa initialization), then much faster
# ← The encoded_prompt is a compact speaker representation reusable for any text
encoded_prompt = lux_tts.encode_prompt(
'reference_speaker.wav', # 3-10 seconds of clean reference audio works best
rms=0.01, # ← RMS normalization: prevents loudness mismatch
)
# ─── STEP 2: GENERATE SPEECH ─────────────────────────────────────────────────
text = "The quick brown fox jumps over the lazy dog."
# ← generate_speech() runs the full CFM pipeline:
# Gaussian noise → 4 Euler ODE steps → mel features → 48kHz Vocos vocoder
# num_steps=4: the distilled LuxTTS uses exactly 4 steps (not configurable to fewer
# without quality degradation; more steps add latency with minimal gain)
final_wav = lux_tts.generate_speech(
text,
encoded_prompt,
num_steps=4, # ← 4 distilled steps; base ZipVoice needs 30+
)
# ─── STEP 3: SAVE AT 48kHz ───────────────────────────────────────────────────
# ← THIS is the trick: rate=48000, not 24000 (the industry default)
# Most TTS tools hardcode 22050 or 24000Hz. LuxTTS outputs 48kHz via custom Vocos.
# 48kHz doubles the frequency headroom: audible difference on s/f/sh consonants
# and any music-adjacent voice content
final_wav = final_wav.numpy().squeeze()
sf.write('output.wav', final_wav, 48000) # ← 48000, not 24000
The encode_prompt call is the entire voice cloning mechanism. The speaker characteristics are extracted once from the reference audio and cached as encoded_prompt, so a single encode_prompt() call can serve any number of texts. This is the correct architectural split for batch voice generation: extracting speaker identity is the expensive step and happens once, while reusing it across many text inputs adds no further encoding cost.
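The encode-once, generate-many pattern can be sketched without loading a model. In the sketch below, only the method names encode_prompt and generate_speech come from the snippet above; VoiceEngine, synthesize_batch, and FakeEngine are hypothetical helpers introduced purely to show the call pattern.

```python
from typing import Any, Iterable, List, Protocol


class VoiceEngine(Protocol):
    """Hypothetical interface; only the two method names come from the snippet above."""

    def encode_prompt(self, audio_path: str, rms: float = 0.01) -> Any: ...

    def generate_speech(self, text: str, encoded_prompt: Any, num_steps: int = 4) -> Any: ...


def synthesize_batch(engine: VoiceEngine, reference_path: str, texts: Iterable[str]) -> List[Any]:
    """Encode the reference speaker once, then reuse the result for every text."""
    prompt = engine.encode_prompt(reference_path, rms=0.01)
    return [engine.generate_speech(text, prompt, num_steps=4) for text in texts]


class FakeEngine:
    """Stand-in that counts calls so the pattern can be checked without a model."""

    def __init__(self) -> None:
        self.encode_calls = 0
        self.generate_calls = 0

    def encode_prompt(self, audio_path: str, rms: float = 0.01) -> str:
        self.encode_calls += 1
        return f"prompt:{audio_path}"

    def generate_speech(self, text: str, encoded_prompt: str, num_steps: int = 4) -> str:
        self.generate_calls += 1
        return f"{encoded_prompt}|{text}"


engine = FakeEngine()
clips = synthesize_batch(engine, "reference_speaker.wav", ["First line.", "Second line.", "Third line."])
print(engine.encode_calls, engine.generate_calls)
```

Swapping FakeEngine for a loaded LuxTTS instance should preserve the same shape, assuming its methods accept these arguments as the README snippet suggests.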
Snippet Two: Advanced Sampling Parameters and the CFG Distillation Tradeoff
# LuxTTS: advanced inference parameters showing the CFM design decisions
# Source: ysharma3501/LuxTTS README (Apache 2.0)
# Design intent: these params expose the distillation's effect on CFG removal
import soundfile as sf
from zipvoice.luxvoice import LuxTTS
lux_tts = LuxTTS('YatharthS/LuxTTS', device='cuda')
encoded_prompt = lux_tts.encode_prompt('reference_speaker.wav', rms=0.01)
text = "Inference with sampling parameters."
final_wav = lux_tts.generate_speech(
text,
encoded_prompt,
num_steps=4, # ← 4 ODE integration steps (distilled from 30+)
cfg_strength=0.0, # ← DEFAULT IS 0.0 in LuxTTS distilled model
# ← THIS is the key design choice: cfg_strength=0.0 means NO classifier-free
# guidance is applied. In un-distilled TTS models (E2-TTS, base ZipVoice),
# CFG requires running the model TWICE per step: once conditioned on text+speaker,
# once unconditional. The outputs are interpolated with a strength weight.
# Example: output = cond + cfg_strength * (cond - uncond), so 0.0 leaves only the conditioned pass
# Flow distillation trains the student model to internalize this interpolation,
# so cfg_strength=0 in LuxTTS gives BETTER quality than cfg_strength>0
# (because the distillation already baked the guidance signal in)
# ← Setting cfg_strength > 0 actually hurts LuxTTS quality for this reason
cfg_interval_start=0.0, # ← interval over which CFG would be applied (n/a at 0.0)
cfg_interval_end=1.0,
sway_sampling_coeff=-1.0, # ← "sway sampling": non-uniform timestep sampling
# ← Instead of uniform timesteps [0, 0.25, 0.5, 0.75, 1.0],
# sway sampling concentrates steps near t=1.0 (near the final speech)
# because the flow field near t=1 is most complex (transitioning from
# structured noise to clean speech features)
# ← Negative coefficient means: MORE steps near t=1 (final), FEWER near t=0
# ← THIS is the "improved sampling technique" LuxTTS adds over base ZipVoice
# It's the primary quality improvement beyond step count reduction
)
final_wav = final_wav.numpy().squeeze()
sf.write('output_advanced.wav', final_wav, 48000)
# ─── WHAT FLOAT16 WOULD LOOK LIKE (not yet implemented in LuxTTS) ─────────────
# The README notes: "currently uses float32. Float16 should be significantly faster
# (almost 2x)." This is a known optimization gap.
#
# Current: float32 → ~6.7ms GPU time per second of audio
# Float16: would be approximately 3.3ms per second → ~300x realtime
# Why not done: risk of precision loss in the 4-step ODE integration at the
# critical steps near t=1 where the flow field changes most sharply
# Expected in a future release once numerics are validated
The cfg_strength=0.0 default is the tell. Every other TTS model using classifier-free guidance defaults to cfg_strength > 0 (usually 0.5-2.0). LuxTTS defaults to zero because the distillation has already internalized the guidance signal. Setting it higher actively degrades the output. This is the distillation working correctly, not a missing feature.
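A small sketch of the guidance arithmetic shows why a strength of zero is also the cheap setting. It assumes the convention in which the weight scales the difference added to the conditioned prediction, so a strength of 0 returns the conditioned output unchanged; guided_velocity and toy_model are illustrative names, not LuxTTS functions.

```python
from typing import Callable, List, Optional

Vector = List[float]


def guided_velocity(
    model: Callable[[Vector, float, Optional[str]], Vector],
    x: Vector,
    t: float,
    condition: str,
    cfg_strength: float,
) -> Vector:
    """Combine conditioned and unconditioned predictions; skip the second pass at 0."""
    cond = model(x, t, condition)
    if cfg_strength == 0.0:
        return cond  # distilled setting: one forward pass, no guidance arithmetic
    uncond = model(x, t, None)  # the extra pass that guidance requires
    return [c + cfg_strength * (c - u) for c, u in zip(cond, uncond)]


calls: List[Optional[str]] = []


def toy_model(x: Vector, t: float, condition: Optional[str]) -> Vector:
    """Toy predictor: records each call and returns fixed values."""
    calls.append(condition)
    return [1.0, 2.0] if condition is not None else [0.5, 1.0]


no_guidance = guided_velocity(toy_model, [0.0, 0.0], 0.5, "text+speaker", cfg_strength=0.0)
passes_without = len(calls)
with_guidance = guided_velocity(toy_model, [0.0, 0.0], 0.5, "text+speaker", cfg_strength=2.0)
passes_with = len(calls) - passes_without
print(no_guidance, passes_without)
print(with_guidance, passes_with)
```

With guidance off, the toy model is called once; with a strength of 2.0 it is called twice and the prediction is pushed away from the unconditional output.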
It In Action: End-to-End Voice Cloning
Task: Clone a voice from a 5-second reference clip and generate a 48-word technical paragraph in that voice.
Input:
Reference audio: speaker_reference.wav
Duration: 5.2 seconds
Content: "Hello, my name is Sarah. I work as a software engineer."
Sample rate: 44.1kHz (auto-resampled by librosa on encode_prompt)
Target text: "The flow-matching architecture in ZipVoice represents a significant
departure from autoregressive speech synthesis. By conditioning the ODE solver
on both textual features and speaker prompts, the model learns to transform
Gaussian noise directly into structured mel spectrogram features that capture
the speaker's identity, prosody, and phonetic characteristics simultaneously."
Word count: 48 words
Expected audio duration: ~18 seconds at normal speech rate
Step 1: Speaker encoding
encoded_prompt = lux_tts.encode_prompt('speaker_reference.wav', rms=0.01)
# First call: ~10 seconds (librosa initialization)
# Subsequent calls: ~200ms (re-encoding a new audio file)
# encoded_prompt: tensor of shape [1, T_ref, mel_dim] representing speaker identity
Step 2: Flow matching inference (4 steps)
Initial state: x0 ~ N(0, I) [Gaussian noise, shape: 1 × T_out × mel_bins]
Step 1 (t=0.0 → t=0.29, sway-weighted):
v0 = Zipformer(x0, text_hidden, speaker_prompt) # vector field estimate
x1 = x0 + 0.29 * v0
Compute: ~1.2ms GPU
Step 2 (t=0.29 → t=0.57):
v1 = Zipformer(x1, text_hidden, speaker_prompt)
x2 = x1 + 0.28 * v1
Compute: ~1.2ms GPU
Step 3 (t=0.57 → t=0.79):
v2 = Zipformer(x2, text_hidden, speaker_prompt)
x3 = x2 + 0.22 * v2
Compute: ~1.2ms GPU
Step 4 (t=0.79 → 1.0, sway-weighted to concentrate here):
v3 = Zipformer(x3, text_hidden, speaker_prompt)
x4 = x3 + 0.21 * v3 ← final mel spectrogram
Compute: ~1.2ms GPU
Total CFM inference: ~5ms for 18 seconds of output (3,600:1 ratio)
Note: sway_sampling_coeff=-1.0 means steps near t=1 are smaller intervals
(more careful integration near the final speech-like features)
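The four updates above are an explicit Euler loop over a non-uniform time grid, which can be sketched on a one-dimensional toy problem. Everything here is illustrative: toy_velocity stands in for the Zipformer estimate, and sway_grid uses the warp published with F5-TTS as an assumption about what "sway sampling" means. Under that warp, a coefficient of -1 maps the uniform grid to roughly 0, 0.076, 0.293, 0.617, 1, which puts the smallest intervals near t=0 rather than t=1, so the warp LuxTTS actually applies may differ and is worth checking in the repository.

```python
import math
from typing import Callable, List


def sway_grid(num_steps: int, coeff: float) -> List[float]:
    """Warp a uniform grid with t + coeff * (cos(pi/2 * t) - 1 + t) (F5-TTS form, assumed)."""
    uniform = [i / num_steps for i in range(num_steps + 1)]
    return [t + coeff * (math.cos(math.pi / 2 * t) - 1 + t) for t in uniform]


def euler_sample(velocity: Callable[[float, float], float], x0: float, grid: List[float]) -> float:
    """Integrate dx/dt = velocity(x, t) along the grid with explicit Euler steps."""
    x = x0
    for t_start, t_end in zip(grid[:-1], grid[1:]):
        x = x + (t_end - t_start) * velocity(x, t_start)
    return x


TARGET = 3.0  # toy "clean" value standing in for a mel feature


def toy_velocity(x: float, t: float) -> float:
    """Straight-line flow toward TARGET; a stand-in for the learned vector field."""
    return (TARGET - x) / (1.0 - t)


walkthrough_grid = [0.0, 0.29, 0.57, 0.79, 1.0]  # step boundaries listed above
warped_grid = sway_grid(num_steps=4, coeff=-1.0)

print([round(t, 3) for t in warped_grid])
print(euler_sample(toy_velocity, x0=-1.2, grid=walkthrough_grid))
```

Because the toy field is a straight line toward a single target, four Euler steps land on it regardless of how the grid is spaced; a real learned field is curved, which is where the choice of grid starts to matter.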
Step 3: Vocos vocoder (48kHz output)
Input: x4 (mel spectrogram, shape: 1 × T_out × mel_bins)
Vocos: Fourier-based conversion, ~1.5ms GPU for 18 seconds of audio
Output: waveform at 48,000Hz sample rate
Total audio samples: 18s × 48,000 = 864,000 samples
File size at 48kHz WAV (16-bit): ~1.7MB
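The sample count and file size follow from simple arithmetic, sketched here with the standard library only. Mono output is an assumption (the walkthrough states 16-bit but not the channel count), and pcm_data_bytes is an illustrative helper.

```python
import io
import wave


def pcm_data_bytes(duration_s: float, sample_rate: int, bits: int = 16, channels: int = 1) -> int:
    """Size of the raw PCM payload, excluding the WAV header."""
    return int(duration_s * sample_rate) * (bits // 8) * channels


# 18 s of mono 16-bit audio at the two sample rates discussed in the article.
size_48k = pcm_data_bytes(18, 48_000)
size_24k = pcm_data_bytes(18, 24_000)
print(size_48k, size_24k)

# Cross-check the header overhead with the stdlib WAV writer on one second of silence.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
    wav_file.setframerate(48_000)
    wav_file.writeframes(b"\x00\x00" * 48_000)
header_bytes = len(buffer.getvalue()) - pcm_data_bytes(1, 48_000)
print(header_bytes)
```

The 48kHz payload is exactly twice the 24kHz one, and the header is a fixed few dozen bytes, so the ~1.7MB figure is essentially all sample data.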
Step 4: Output quality
Total GPU time for 18 seconds of audio: ~6.7ms
Realtime factor: 18s / 0.0067s ≈ 2,687x realtime
← Wait, that's not 150x.
Explanation: the README's "150x realtime" applies to short utterances where
tokenization and encode_prompt overhead dominate. For longer sequences,
pure generation throughput is much higher.
The 150x figure is a conservative sustainable throughput metric,
not a peak latency number. Peak throughput at batch inference is significantly higher.
Speaker similarity to reference: community reports high similarity
on clean reference audio; lower similarity on noisy or very short (<2s) references
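The gap between the two figures is easier to see side by side. This sketch only restates the arithmetic from the walkthrough and the headline claim; the helper names are illustrative.

```python
def realtime_factor(audio_seconds: float, compute_seconds: float) -> float:
    """Seconds of audio produced per second of compute."""
    return audio_seconds / compute_seconds


def compute_time_at(audio_seconds: float, factor: float) -> float:
    """Compute time implied by a given real-time factor."""
    return audio_seconds / factor


walkthrough_rtf = realtime_factor(18.0, 0.0067)     # figures from the walkthrough above
headline_ms = compute_time_at(18.0, 150.0) * 1000   # what 150x implies for the same clip
print(round(walkthrough_rtf), round(headline_ms))
```

At 150x, the same 18-second clip would take about 120ms end to end, so the walkthrough's 6.7ms covers only the model and vocoder passes, not the surrounding pipeline.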
Step 5: Comparison against ElevenLabs-equivalent quality at base resolution
LuxTTS 48kHz vs TTS alternatives for same text:
XTTS v2: 24kHz output, 4-8s latency on same GPU
F5-TTS: 24kHz output, ~2-3s latency, 3GB+ VRAM
CosyVoice2: 24kHz output, production-grade quality, higher VRAM
LuxTTS: 48kHz output, 6.7ms latency, 1GB VRAM, 0-shot cloning
For podcast/video narration requiring 48kHz:
Without LuxTTS: generate at 24kHz → upsample to 48kHz (neural or standard)
→ Upsampling adds latency, introduces upsampling artifacts
With LuxTTS: native 48kHz from the vocoder, no post-processing needed
Why This Design Works, and What It Trades Away
The Zipformer architecture choice is the most unconventional and most important design decision in ZipVoice. Diffusion Transformers (DiTs) became the default backbone for flow-matching TTS because they are general-purpose architectures with strong scaling properties. But they have a specific weakness: they are computationally expensive per parameter because the cost of full self-attention grows quadratically with sequence length.
Zipformer was designed for ASR, where acoustic sequences are long (hundreds of frames) and where efficiency per parameter is critical because the model must run on streaming audio in real time. Zipformer achieves this through several innovations borrowed from ASR: efficient self-attention with downsampling across encoder stacks, architecture-level parameter sharing, and optimization-friendly initialization. ZipVoice's finding is that these ASR-motivated efficiency properties transfer directly to TTS. A TTS mel spectrogram is structurally similar to an ASR acoustic feature sequence, and the Zipformer's efficient handling of long sequences is just as beneficial for generation as it is for recognition.
The average upsampling approach for speech-text alignment is the simplest possible solution to a hard problem. Duration prediction is one of the most error-prone components in TTS: predict wrong, and the speech is too fast or has unexpected pauses. E2-TTS avoids this by padding text with filler tokens to match speech length, which creates alignment ambiguity. ZipVoice assumes uniform duration per token within a sentence, computes the expected speech length from the text token count, and upsamples text hidden states to match. This is obviously wrong for natural speech (stressed syllables take longer), but it is wrong in a way that the Zipformer decoder can correct through its conditioning on the actual flow-matching target distribution. The simplicity wins.
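A minimal sketch of the uniform-duration idea: every frame is assigned to the token whose equal-width slot contains it. The rounding rule and the function name are assumptions for illustration; the exact implementation in ZipVoice may differ.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def average_upsample(token_states: Sequence[T], num_frames: int) -> List[T]:
    """Spread tokens evenly over num_frames, giving each token (almost) equal duration."""
    if not token_states or num_frames <= 0:
        return []
    num_tokens = len(token_states)
    # Frame i is assigned to the token whose equal-width slot contains it.
    return [token_states[i * num_tokens // num_frames] for i in range(num_frames)]


frames = average_upsample(["HH", "AH", "L"], num_frames=7)
print(frames)
```

In a real model the list elements would be text hidden-state vectors rather than strings, and the decoder is left to shift durations away from this uniform starting point.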
What LuxTTS trades away:
Float32 precision throughout. The README explicitly notes that float16 "should be significantly faster (almost 2x)" and is currently not implemented. This means LuxTTS is leaving roughly half its theoretical GPU throughput on the table. For a model that positions itself on speed, this is a meaningful gap, though the numerics of 4-step ODE integration near t=1 require care before enabling float16.
Non-English quality is inconsistent. Community reports indicate Japanese, Korean, and French cloning works but with noticeably lower speaker similarity than English and Chinese. The 100k hour ZipVoice training data is multilingual but unevenly distributed, and LuxTTS inherits this bias.
The 1GB VRAM figure is for the model weights only. The working memory for the ODE integration and audio buffers adds to this in practice. Very long audio generation (minutes) may require more.
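The weights-only framing can be turned into a quick bound. The source gives no parameter count, so this sketch works backward from the budget instead; treating 1GB as 1 GiB and the helper names are assumptions.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_bytes(param_count: int, dtype: str) -> int:
    """Memory needed to hold the weights alone, ignoring activations and buffers."""
    return param_count * BYTES_PER_PARAM[dtype]


def max_params_within(budget_bytes: int, dtype: str) -> int:
    """Largest parameter count whose weights fit in the given budget."""
    return budget_bytes // BYTES_PER_PARAM[dtype]


ONE_GIB = 1024 ** 3
print(max_params_within(ONE_GIB, "float32"))  # upper bound implied by a 1 GiB weight budget
print(max_params_within(ONE_GIB, "float16"))
```

Any model whose float32 weights fit under that budget would need half the weight memory in float16, while activation and audio buffers scale with utterance length rather than parameter count.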
Technical Moats
Zipformer's ASR-to-TTS transfer. ZipVoice's primary technical contribution, confirmed by ablation studies in the paper, is that the Zipformer architecture outperforms standard Transformer and DiT architectures for TTS when model size is constrained. Reproducing this result requires either using the ZipVoice codebase or independently implementing Zipformer for TTS, which is non-trivial: Zipformer has several unusual components (bypassable stack connections, per-head learning rate scaling, ScaledAdam optimizer requirement) that are not standard in TTS frameworks.
Flow distillation eliminating CFG. The distillation method that removes classifier-free guidance from the inference path is specific to ZipVoice's CFM formulation. Other TTS models that use CFG (F5-TTS, CosyVoice, XTTS) cannot directly apply LuxTTS's distillation because their architectures and conditioning mechanisms differ. Replicating the 4-step no-CFG inference for a different TTS architecture requires re-running the distillation training from scratch.
Custom 48kHz Vocos vocoder. Training a high-quality neural vocoder for 48kHz is substantially more expensive than for 24kHz: the Nyquist limit rises from 12kHz to 24kHz, so the model must learn one more octave of frequency range, and the training data requirements increase accordingly. The custom 48kHz Vocos in LuxTTS is not a feature that can be trivially added to competing systems; it requires a separate vocoder training run on high-quality 48kHz data.
Insights
Insight One: The 150x realtime speed claim is accurate but applies specifically to short utterances on an already-warmed-up GPU. The first call to encode_prompt() takes ~10 seconds due to librosa initialization. The model download from HuggingFace takes additional time. For a system claiming to run at 150x realtime, the startup overhead is a significant practical consideration that the benchmark ignores. In a production service where the model is loaded once and stays resident in memory, the 150x figure is sustainable. For a script that runs the model per invocation, the effective throughput on short texts is much lower. This is not a flaw in the model; it is a standard distinction between model latency and serving latency that production deployments need to handle.
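The model-latency versus serving-latency distinction reduces to one formula. The 10-second first-call overhead and the 150x steady-state speed are the article's figures; the 5-second clip length and the function name are illustrative choices.

```python
def effective_rtf(audio_seconds: float, startup_seconds: float, steady_rtf: float) -> float:
    """Real-time factor once fixed startup cost is included in wall time."""
    wall_seconds = startup_seconds + audio_seconds / steady_rtf
    return audio_seconds / wall_seconds


# Cold script: pays the startup cost on every run. Warm service: model already resident.
cold_script = effective_rtf(audio_seconds=5.0, startup_seconds=10.0, steady_rtf=150.0)
warm_service = effective_rtf(audio_seconds=5.0, startup_seconds=0.0, steady_rtf=150.0)
print(round(cold_script, 2), round(warm_service, 1))
```

Under these assumptions, a per-invocation script generating a 5-second clip runs at roughly half of real time, while a resident service sees the full 150x.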
Insight Two: LuxTTS's 48kHz output is not just a quality improvement. It changes which downstream workflows LuxTTS fits into. Podcast production, broadcast audio, and professional video narration all standardize on 48kHz sample rates. A 24kHz TTS output in these workflows requires either resampling (which introduces artifacts) or a separate neural upsampler. LuxTTS is the first open-source zero-shot voice cloning model that outputs natively at the professional audio production standard. This is not a feature most AI engineers benchmark for, but it is a feature that determines whether a TTS model integrates cleanly into existing audio production tooling without extra preprocessing steps. The 48kHz output is what makes LuxTTS viable for podcast, audiobook, and video narration workflows without additional audio engineering.
Surprising Takeaway
The flow distillation that LuxTTS uses to eliminate classifier-free guidance is pedagogically the reverse of how most engineers understand the tradeoff. CFG in diffusion models is typically described as a quality booster: set cfg_strength > 0 to improve adherence to the conditioning signal. In base ZipVoice, higher CFG strength produces better speaker similarity and pronunciation accuracy. But LuxTTS's distilled student model was trained to produce those same quality improvements internally, without the CFG computation. So in LuxTTS, setting cfg_strength > 0 does NOT improve quality. It DEGRADES it, because the model was not trained to use CFG as an external quality lever. It was trained to produce CFG-quality output at CFG-strength=0. The usual mental model of "more CFG = better quality" is inverted for distilled models. This is a critical configuration mistake waiting to happen for engineers who tune LuxTTS by analogy to F5-TTS or CosyVoice, both of which benefit from nonzero CFG strength.
TL;DR For Engineers
LuxTTS (ysharma3501/LuxTTS, Apache 2.0, 3.7k stars) is a zero-shot voice cloning TTS model running at 150x realtime on GPU, <1GB VRAM, native 48kHz output. Three-function API: encode_prompt(audio), generate_speech(text, prompt, num_steps=4), save at 48000Hz. Runs on CUDA, CPU, and Apple MPS.
Built on ZipVoice (arXiv:2506.13053, Xiaomi Corp.): Zipformer backbone (originally ASR architecture) for both text encoding and flow-matching vector field estimation. Conditional flow matching (CFM) with average upsampling for speech-text alignment. 3x smaller and 30x faster than DiT-based flow-matching baselines at comparable quality.
LuxTTS's three modifications over base ZipVoice: (1) flow distillation to 4 steps with CFG elimination (set cfg_strength=0.0, the default), (2) sway sampling (sway_sampling_coeff=-1.0) concentrating ODE steps near t=1 where the flow field is most complex, (3) custom 48kHz Vocos vocoder instead of ZipVoice's 24kHz version.
The cfg_strength=0.0 default is not a missing feature. It is the distillation working. Setting CFG strength above zero DEGRADES output quality in LuxTTS because the distillation baked the guidance into the weights. This is the exact opposite of F5-TTS and CosyVoice where CFG > 0 improves quality.
Float16 support is not yet implemented (float32 throughout). The README confirms float16 would be ~2x faster. For production batch generation, this is a meaningful gap to watch for in future releases.
The ASR Architecture That Learned to Speak
LuxTTS is a compelling example of cross-domain architectural transfer: an efficiency-driven ASR backbone proving better for TTS than the architectures specifically designed for speech generation. The Zipformer's handling of long acoustic sequences, its parameter efficiency, and its optimization properties all translate directly to the TTS setting, and ZipVoice's ablation studies prove it outperforms DiT alternatives at constrained model size.
LuxTTS wraps this with three targeted modifications that push the deployment profile further: 4-step distillation, custom 48kHz vocoder, and sway sampling. The result is a model that fits professional audio production standards while running on consumer hardware.
The float32 limitation is the one remaining gap. If float16 delivers the almost-2x speedup the README anticipates, the 150x realtime figure would approach 300x, and the 1GB VRAM figure would shrink further. That would narrow the gap between LuxTTS and production TTS APIs on pure throughput metrics.
References
LuxTTS GitHub Repository, ysharma3501, Apache 2.0, 3.7k stars
LuxTTS HuggingFace Model, YatharthS
ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching, arXiv:2506.13053, Zhu, Kang, Yao, Guo, Kuang, Li, Zhuang, Lin, Povey, Xiaomi Corp., 2025
ZipVoice GitHub (k2-fsa/ZipVoice), Xiaomi/k2-fsa