SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | May 15, 2026

The problem with bolting audio onto video after generation is synchronization. If the audio model generates a footstep sound independently from the video model generating a foot hitting pavement, they will not be temporally aligned in a semantically meaningful way unless you add expensive post-processing. Foley artists and sound designers know this: the audio must be generated with awareness of what the video contains at every frame, not added on top of a completed visual sequence.

LTX-2 (Lightricks/LTX-2, open weights, arXiv:2601.03233) solves this with an asymmetric dual-stream transformer that jointly denoises audio and video latents throughout the full diffusion process. The video stream (14B parameters) and audio stream (5B parameters) exchange information at every denoising step via bidirectional cross-attention, ensuring that audio generation is continuously conditioned on the current video state and vice versa. The result: synchronized video and audio with correct foley, ambient sound, and speech, generated in one pass from a text prompt.

LTX-2.3 (the latest update as of May 2026) adds native portrait generation (1080×1920 trained on portrait data, not cropped from landscape), a rebuilt VAE for sharper fine detail, a 4x larger text connector for better prompt adherence, and HDR output as a beta IC-LoRA. It is available on HuggingFace.

This newsletter dissects LTX-2 as a systems engineering document: why separate modality-specific VAEs matter, how the dual-stream block's four sequential operations achieve inter-modal synchronization, what modality-CFG enables that standard CFG cannot, and what the 14B/5B asymmetry reveals about the information density difference between video and audio.

Scope: LTX-2 architecture (arXiv:2601.03233), dual-stream transformer, modality-specific VAEs, modality-CFG, LTX-2.3 improvements. Not covered: LTX Studio or enterprise API pricing, or detailed LoRA training for style customization beyond mentioning support.

What It Actually Does

LTX-2 generates native 4K video at 50 FPS with synchronized audio from text prompts, image inputs, audio inputs (audio-to-video), or combinations. It supports up to 20 seconds of output. Open weights are available on HuggingFace.

Capability summary (LTX-2 and LTX-2.3):

Capability       | LTX-2                               | LTX-2.3
-----------------|-------------------------------------|---------------------------------------------
Max resolution   | Native 4K                           | Native 4K + 1080×1920 portrait
Max duration     | 20 seconds                          | 20 seconds
Frame rate       | 50 FPS                              | 50 FPS
Audio            | Synchronized foley, speech, ambient | Cleaner (filtered training + new vocoder)
Text adherence   | Strong                              | 4x larger text connector
Image-to-video   | Yes                                 | Improved (less Ken Burns, more real motion)
Audio-to-video   | Yes                                 | Yes
VAE              | Original                            | Rebuilt latent space, finer detail
LoRA             | Style + identity LoRAs              | IC-LoRA for HDR (beta)
Weights          | Open                                | Open

The audio is not speech-only. LTX-2 generates what the paper describes as "rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene, complete with natural background and foley elements." A desert landscape generates wind and ambient silence. A street scene generates crowd noise and traffic. A footstep generates the appropriate foley for the surface material visible in the video.

The Architecture, Unpacked

Focus on the bidirectional AV cross-attention at step 3 of every dual-stream block. This is what makes joint synchronization possible: at every denoising step, video tokens attend to audio tokens AND audio tokens attend to video tokens. The video generation knows what audio is being generated, and the audio generation knows what visual content is being generated. This bidirectional exchange happens at every layer of both streams.

The Code, Annotated

Snippet One: Dual-Stream Block with Bidirectional AV Cross-Attention

# Reconstructed from arXiv:2601.03233 architecture description
# and the Lightricks/LTX-2 GitHub repository

import torch
import torch.nn as nn
from typing import Optional

class DualStreamBlock(nn.Module):
    """
    One block of the LTX-2 dual-stream transformer.
    Processes video and audio latents in parallel with bidirectional cross-attention.

    The four sequential operations match the paper's Figure 2 exactly:
    1. Self-Attention within modality (video attends to video, audio to audio)
    2. Text Cross-Attention (both streams conditioned on text)
    3. Audio-Visual Cross-Attention (bidirectional inter-modal exchange)
    4. Feed-Forward Network (per-stream refinement)
    """

    def __init__(
        self,
        video_dim: int,  # 14B video stream hidden dim (larger)
        audio_dim: int,  # 5B audio stream hidden dim (smaller)
        text_dim: int,
        num_heads: int,
    ):
        super().__init__()
        # Step 1: Modality-specific self-attention
        # Video: 3D RoPE (spatial H, W + temporal T)
        # Audio: 1D temporal RoPE (audio is purely temporal)
        # ← The RoPE dimensionality difference is the key consequence of
        #   separate VAEs: video tokens have (t, h, w) coordinates,
        #   audio tokens have only (t,) coordinates. Shared positional
        #   encoding would require an arbitrary projection.
        self.video_self_attn = SelfAttention(video_dim, num_heads, rope_type="3D")
        self.audio_self_attn = SelfAttention(audio_dim, num_heads, rope_type="1D")

        # Step 2: Text cross-attention (both modalities conditioned on text prompt)
        self.video_text_cross_attn = CrossAttention(video_dim, text_dim, num_heads)
        self.audio_text_cross_attn = CrossAttention(audio_dim, text_dim, num_heads)

        # Step 3: BIDIRECTIONAL Audio-Visual Cross-Attention
        # ← THIS is the architectural innovation: both streams attend to each other
        # video_to_audio: video queries attend to audio keys/values
        # audio_to_video: audio queries attend to video keys/values
        # Both use 1D temporal RoPE for positional alignment (time is shared)
        # ← Why 1D temporal RoPE here (not 3D video RoPE)?
        #   Because the cross-attention is temporal matching:
        #   "what audio at time t corresponds to video frame at time t?"
        #   Spatial (H, W) positioning of video is not meaningful for audio.
        self.video_to_audio_cross_attn = CrossAttention(
            video_dim, audio_dim, num_heads, rope_type="1D_temporal"
        )
        self.audio_to_video_cross_attn = CrossAttention(
            audio_dim, video_dim, num_heads, rope_type="1D_temporal"
        )

        # Cross-modality AdaLN: SAME timestep t → BOTH streams
        # ← This is the shared denoising schedule mechanism
        # Without shared timestep conditioning, video and audio would
        # diffuse at different rates and become misaligned during denoising
        self.video_adaln = AdaptiveLayerNorm(video_dim, cond_dim=256)  # t embedding
        self.audio_adaln = AdaptiveLayerNorm(audio_dim, cond_dim=256)  # same t

        # Step 4: Modality-specific FFN
        self.video_ffn = FeedForward(video_dim)
        self.audio_ffn = FeedForward(audio_dim)

    def forward(
        self,
        video_latents: torch.Tensor,  # (B, T*H*W, video_dim)
        audio_latents: torch.Tensor,  # (B, T_audio, audio_dim)
        text_embeddings: torch.Tensor,
        timestep_embedding: torch.Tensor,  # same t for both streams
    ) -> tuple[torch.Tensor, torch.Tensor]:

        # Cross-modality AdaLN: scale/shift both streams by the SAME timestep t,
        # re-applied before every sublayer (pre-norm style)

        # Step 1: Modality-specific self-attention
        video_latents = video_latents + self.video_self_attn(
            self.video_adaln(video_latents, timestep_embedding)
        )
        audio_latents = audio_latents + self.audio_self_attn(
            self.audio_adaln(audio_latents, timestep_embedding)
        )

        # Step 2: Text cross-attention
        video_latents = video_latents + self.video_text_cross_attn(
            self.video_adaln(video_latents, timestep_embedding), text_embeddings
        )
        audio_latents = audio_latents + self.audio_text_cross_attn(
            self.audio_adaln(audio_latents, timestep_embedding), text_embeddings
        )

        # ← Step 3: BIDIRECTIONAL AV cross-attention (the core mechanism)
        # Both operations read the PRE-UPDATE state (normalized once, before step 3's
        # residual updates). This prevents order-dependency: video doesn't "see"
        # updated audio and audio doesn't "see" updated video within the same block.
        # ← THIS is the trick: parallel bidirectional exchange in one step
        video_pre = self.video_adaln(video_latents, timestep_embedding)
        audio_pre = self.audio_adaln(audio_latents, timestep_embedding)
        video_latents = video_latents + self.video_to_audio_cross_attn(video_pre, audio_pre)
        audio_latents = audio_latents + self.audio_to_video_cross_attn(audio_pre, video_pre)

        # Step 4: FFN
        video_latents = video_latents + self.video_ffn(self.video_adaln(video_latents, timestep_embedding))
        audio_latents = audio_latents + self.audio_ffn(self.audio_adaln(audio_latents, timestep_embedding))

        return video_latents, audio_latents

The pre-update latent exchange in step 3 is the design decision that makes bidirectional cross-attention order-independent. Both the video_to_audio and audio_to_video cross-attentions read from the state before either update is applied, not from a partially updated state. Neither stream therefore gains an asymmetric advantage by seeing the other's already-updated tokens within a single block.

Snippet Two: Modality-CFG and Inference

# Modality-aware Classifier-Free Guidance (modality-CFG)
# This is LTX-2's key inference-time mechanism for controlling
# audiovisual alignment and per-modality guidance strength

def modality_cfg_inference(
    model: "LTX2Model",  # the full dual-stream transformer (type shown for readability)
    video_latents_noisy: torch.Tensor,
    audio_latents_noisy: torch.Tensor,
    text_embeddings: torch.Tensor,
    null_text_embeddings: torch.Tensor,  # e.g. empty-prompt embeddings, used when text is dropped
    timestep: int,
    # Guidance scales: higher = stronger adherence to each conditioning signal
    text_guidance_scale: float = 7.5,
    video_guidance_scale: float = 3.0,  # for audio conditioned on video (V2A)
    audio_guidance_scale: float = 3.0,  # for video conditioned on audio (A2V)
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Modality-CFG: 4 forward passes per denoising step.

    Standard CFG: 2 passes (conditioned + unconditioned)
    Modality-CFG: 4 passes for independent control of 3 guidance signals

    ← THIS is the trick: by running 4 forward passes, we get independent
      estimates of how much each conditioning signal contributes to the output.
      This allows a filmmaker to say: "I want strong text adherence AND strong
      audio quality but don't need the audio to drive the video motion."
    """
    # Pass 1: Fully conditioned (text + all modality conditioning)
    v_full, a_full = model(
        video_latents_noisy, audio_latents_noisy, text_embeddings, timestep
    )

    # Pass 2: Text-only (removes modality-specific guidance)
    v_text, a_text = model(
        video_latents_noisy, audio_latents_noisy,
        text_embeddings,
        timestep,
        drop_av_conditioning=True,  # no AV cross-modal guidance signal
    )

    # Pass 3: Video-conditioned only (for A2V: audio guided by video without text)
    v_vid, a_vid = model(
        video_latents_noisy, audio_latents_noisy,
        null_text_embeddings,  # text dropped
        timestep,
        video_conditioning=True, audio_conditioning=False,
    )

    # Pass 4: Audio-conditioned only (for V2A: video guided by audio without text)
    v_aud, a_aud = model(
        video_latents_noisy, audio_latents_noisy,
        null_text_embeddings,  # text dropped
        timestep,
        video_conditioning=False, audio_conditioning=True,
    )

    # ← Compose guidance signals with independent scale factors
    # Standard CFG: output = uncond + scale * (cond - uncond)
    # Modality-CFG: extends this with separate per-modality scale factors

    video_pred = (
        v_text                                                    # text base
        + text_guidance_scale * (v_full - v_text)                 # text guidance
        + video_guidance_scale * (v_vid - v_text)                 # video self-guidance
        + audio_guidance_scale * (v_aud - v_text)                 # audio-guided video
    )

    audio_pred = (
        a_text                                                    # text base
        + text_guidance_scale * (a_full - a_text)                 # text guidance
        + audio_guidance_scale * (a_aud - a_text)                 # audio self-guidance
        + video_guidance_scale * (a_vid - a_text)                 # video-guided audio
    )

    return video_pred, audio_pred

The 4-pass modality-CFG costs 4× the compute of a single forward pass per denoising step. This is the inference cost of independent modality guidance. At production scale, this can be reduced with guidance distillation or by reducing steps via flow matching, which is why LTX-2's efficiency claim ("a fraction of [proprietary models'] computational cost") requires context: the comparison is total generation time, not per-step count.

LTX-2 In Action: End-to-End Worked Example

Input: Generate a 10-second clip of a jazz musician playing a saxophone on a rainy night street, complete with appropriate audio.

Step 1: Encoding

Text: "A jazz musician plays saxophone on a wet city street at night.
       Rain falls softly. Street lights reflect in the puddles. The
       music is melancholic and expressive. Close-up medium shot."

Video VAE: empty (text-to-video, no input video)
Audio VAE: empty (text-to-audio, no input audio)
Text encoder: multilingual → text embeddings (LTX-2.3: 4× larger connector)

Target:
  Duration:   10 seconds
  Resolution: 1920 × 1080 (or 4K for API)
  FPS:        50
  Frames:     500 (at 50 FPS, 10 seconds)

Step 2: Latent space setup

Video latent: 8× spatial compression → 240 × 135 latent grid
              8-frame temporal compression → 63 temporal tokens
              Video latent shape: (63, 240, 135) → flattened for transformer

Audio latent: 10 seconds at 44.1 kHz = 441,000 samples
              Audio VAE compresses to 1D temporal tokens
              Audio latent shape: (T_audio,) → 1D token sequence

Shared denoising timestep: t = 1000 → 0 (flow matching or DDIM schedule)
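
The arithmetic above can be sanity-checked with a few lines. The 8x spatial and temporal strides come from the numbers in this example; the audio samples-per-token rate is an illustrative assumption, not the published figure:

# Latent shape arithmetic for this worked example. The spatial/temporal strides
# match the numbers above; samples_per_token for audio is an illustrative guess.

def video_latent_shape(width, height, frames, spatial_stride=8, temporal_stride=8):
    return (
        -(-frames // temporal_stride),   # ceil(500 / 8) = 63 temporal tokens
        width // spatial_stride,         # 1920 / 8 = 240
        height // spatial_stride,        # 1080 / 8 = 135
    )

def audio_latent_length(seconds, sample_rate=44_100, samples_per_token=1_024):
    return (seconds * sample_rate) // samples_per_token

print(video_latent_shape(1920, 1080, 500))   # (63, 240, 135)
print(audio_latent_length(10))               # 430 tokens under this assumption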

Step 3: Denoising (N transformer blocks × M denoising steps)

At each denoising step t:
  Dual-stream blocks process video and audio latents:
    1. Video self-attention: saxophone player appearance, street lighting, puddles
    2. Audio self-attention: melancholic music pattern, rain texture
    3. AV cross-attention:
       Video → Audio: "the saxophonist's embouchure is moving → generate saxophone sound"
       Audio → Video: "the music is legato and expressive → the player should sway slightly"
    4. FFN refinement per stream

  Modality-CFG: 4 passes at inference for guidance composition
  Text guidance strengthens: wet streets, rain, jazz musician identity
  AV alignment: lip/embouchure sync with audio, instrument foley timing
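
Sketching how Snippets One and Two assemble into this loop (every handle here, including model, scheduler, the VAE objects, and the toy latent shapes, is a placeholder for illustration, not the repository's API):

# Illustrative joint denoising loop; interfaces are assumed, not the actual repo API.
# `model` wraps the stack of DualStreamBlocks; `scheduler` exposes `timesteps` and
# a `step(pred, t, latents) -> latents` update (flow matching or DDIM-style).
import torch

video_latents = torch.randn(1, 4096, 2048)   # toy (B, T*H*W, video_dim); real sizes are far larger
audio_latents = torch.randn(1, 430, 1024)    # toy (B, T_audio, audio_dim)

for t in scheduler.timesteps:                # the SAME t drives both streams
    video_pred, audio_pred = modality_cfg_inference(
        model,
        video_latents, audio_latents,
        text_embeddings, null_text_embeddings,
        t,
        text_guidance_scale=7.5,
        video_guidance_scale=3.0,
        audio_guidance_scale=3.0,
    )
    video_latents = scheduler.step(video_pred, t, video_latents)
    audio_latents = scheduler.step(audio_pred, t, audio_latents)

# Step 4 below: decode each stream with its own VAE
video_frames = video_vae.decode(video_latents)
waveform = audio_vae.decode(audio_latents)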

Step 4: Decoding

Video VAE decode: latent → 500 frames at target resolution and 50 FPS
Audio VAE decode (LTX-2.3 new vocoder): latent → synchronized waveform
  - Saxophone melody: pitched to match embouchure movement
  - Rain ambient: continuous background, intensity matching visible rainfall
  - Wet pavement: subtle acoustic reflections
  - Environment reverb: consistent with street acoustics

Output files:
  video.mp4:  1920×1080, 50 FPS, H.264
  audio.wav:  44.1 kHz, stereo, synchronized
  (or combined audiovisual output)
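
If you decode to raw frames and a waveform yourself, the combined output can be muxed with, for example, torchvision's write_video; the tensors below are small placeholders standing in for the VAE decoder outputs:

import torch
from torchvision.io import write_video

# Placeholder tensors standing in for the decoded outputs (tiny sizes to keep the
# example light; the real clip is 500 frames at 1080p plus a 10-second waveform).
frames = torch.zeros(50, 270, 480, 3, dtype=torch.uint8)   # (T, H, W, C), uint8
waveform = torch.zeros(2, 44_100)                           # (channels, samples), 1 s stereo

# Mux synchronized video + audio into one container.
write_video(
    "clip.mp4", frames, fps=50,
    audio_array=waveform, audio_fps=44_100, audio_codec="aac",
)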

Timing (from available benchmark references):

LTX-2 on A100 80GB (representative):
  10-second, 1080p, 50 FPS generation: ~90-120 seconds
  4K generation: ~3-5 minutes
  (Proprietary comparable models: 10-15+ minutes for similar output)

LTX-2.3 improvements vs LTX-2:
  Same generation time (optimizations in quality, not speed)
  Audio quality: cleaner (new vocoder, filtered training data)
  Fine detail: sharper (rebuilt VAE)
  Prompt adherence: stronger (4× text connector)

Why This Design Works, and What It Trades Away

The asymmetric 14B/5B parameter split is the correct allocation for the information density difference between video and audio. 4K video at 50 FPS contains billions of pixel values per second. Compressed to latent space, video tokens still carry dense spatial and temporal structure. Audio, even at 44.1 kHz, is a one-dimensional time series with far lower intrinsic dimensionality. Forcing equal capacity on both would either waste parameters on audio or underfit video. The 14B/5B split reflects measured information density, not an arbitrary ratio.
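
To make the gap concrete, a back-of-envelope token-rate comparison. The compression factors are taken from the worked example above, the audio token rate is an assumption, and the point is the order of magnitude, not the exact ratio:

# Rough latent-tokens-per-second comparison between the two streams,
# using the compression assumptions from the worked example above.

# Video: 4K (3840x2160) at 50 FPS, 8x spatial + 8-frame temporal compression
video_tokens_per_sec = (3840 // 8) * (2160 // 8) * (50 / 8)
# Audio: 44.1 kHz, assumed ~1024 samples per latent token (illustrative)
audio_tokens_per_sec = 44_100 / 1_024

print(f"video: {video_tokens_per_sec:,.0f} latent tokens/sec")   # ~810,000
print(f"audio: {audio_tokens_per_sec:,.0f} latent tokens/sec")   # ~43
print(f"ratio: {video_tokens_per_sec / audio_tokens_per_sec:,.0f}x")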

The separate modality-specific VAEs are the decision that enables everything else. A shared latent space would require either (a) projecting audio into a 3D spatial-temporal representation (meaningless for audio) or (b) collapsing video into a 1D temporal representation (losing all spatial structure). Separate VAEs allow video to use 3D RoPE in its self-attention and audio to use 1D RoPE, matching each modality's natural geometry. They also natively enable the Video-to-Audio (V2A) and Audio-to-Video (A2V) workflows without architectural changes: simply provide the encoded video or audio at initialization and generate the missing modality.
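
In code terms, the three workflows differ only in which latent starts from noise and which is produced by its VAE encoder and held as conditioning. A minimal sketch, where sample_video_noise, sample_audio_noise, and the VAE handles are placeholder names:

# Which latent is fixed vs. denoised in each workflow (sketch; placeholder names).

def init_latents(mode, video_input=None, audio_input=None):
    if mode == "t2av":   # text -> audio+video: both streams start from noise
        return sample_video_noise(), sample_audio_noise()
    if mode == "v2a":    # video -> audio: video latents are encoded and kept fixed
        return video_vae.encode(video_input), sample_audio_noise()
    if mode == "a2v":    # audio -> video: audio latents are encoded and kept fixed
        return sample_video_noise(), audio_vae.encode(audio_input)
    raise ValueError(f"unknown mode: {mode}")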

The cross-modality AdaLN with shared timestep conditioning is the mechanism that keeps both streams synchronized during denoising. Without shared timestep conditioning, a video stream at noise level t=500 attending to an audio stream at noise level t=200 would be attending to cleaner audio tokens than the video tokens, introducing a temporal bias in the cross-attention that would cause desynchronization artifacts.

What LTX-2 trades away:

Inference cost for modality-CFG. Four forward passes per denoising step is 4× the compute of standard CFG for video-only models. This is mitigated by the model's efficiency (the paper claims "a fraction of [proprietary models'] computational cost"), but the comparison is total wall-clock generation time against models that also add post-hoc audio. LTX-2 generates synchronized audio in one pass; competitors require a second model pass.

Audio quality ceiling. The 5B audio stream, while appropriate for its information density, is smaller than dedicated audio generation models. For applications requiring the highest audio quality (professional music production, high-fidelity speech), a specialized audio model may produce better results than the 5B stream. LTX-2's audio excels at synchronization and environmental coherence, not maximum audio fidelity.

Maximum duration. Twenty seconds is the current ceiling. Long-form video generation (minutes) requires sliding window or continuation approaches not built into the base model.

Technical Moats

The temporal alignment mechanism via 1D temporal RoPE in AV cross-attention. The bidirectional cross-attention between video and audio streams uses 1D temporal RoPE specifically (not 3D video RoPE) for positional alignment. This choice is precise: the cross-attention asks "which audio token at time t corresponds to which video token at time t?" Spatial (H, W) position in video is irrelevant for this question. Using 1D temporal RoPE for the inter-modal exchange ensures that audio generation aligns with the correct temporal position in video without being confused by spatial video positions. Getting this embedding choice right required understanding what the cross-modal alignment task actually requires.
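
A minimal sketch of what 1D temporal RoPE means for the cross-modal exchange: every token's rotary phase is a function of its time coordinate only, so a video token at frame t and an audio token covering time t receive identical phases regardless of the video token's (h, w) position. The frequency base and the time-alignment convention below are illustrative assumptions, not the published values:

import torch

def temporal_rope_phases(time_positions: torch.Tensor, dim: int, base: float = 10_000.0):
    # Rotary phases computed from a 1D time coordinate only, with no (h, w) component.
    # Tokens that share a time position get identical phases, so their relative
    # rotation is zero: exactly the "same moment" signal AV cross-attention needs.
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return time_positions[..., None] * freqs          # (num_tokens, dim // 2)

# Video tokens: time index = latent frame index, repeated over spatial positions
# (only a 4x4 spatial patch here to keep the toy example small).
video_time = torch.arange(63).repeat_interleave(16).float()
# Audio tokens: time index expressed in the same latent-frame units
# (an assumption about the alignment convention, for illustration).
audio_time = torch.linspace(0, 62, steps=430)

video_phases = temporal_rope_phases(video_time, dim=64)   # (63*16, 32)
audio_phases = temporal_rope_phases(audio_time, dim=64)   # (430, 32)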

Training a 19B joint model from scratch (or via curriculum). Pretraining a 19B dual-stream transformer on synchronized audiovisual data at scale is the compute moat. The model must see enough paired video-audio data with correct synchronization to learn that foley events in audio correspond to physical events in video, that speech audio corresponds to mouth movement in video, and that environmental audio corresponds to scene content. The training data curation (paired audiovisual content with temporal alignment annotations) is a substantial effort that is hard to replicate.

The modality-CFG mechanism for controllability. Standard CFG provides one guidance scale. Modality-CFG provides three independent guidance signals (text, video, audio) with separate scale factors. This enables workflows like: "generate audio that is strongly driven by the video motion (V2A mode, high video_guidance_scale) but weakly driven by the text." The mechanism is architecturally simple (4 forward passes) but requires training the model to correctly respond to each conditioning signal's presence or absence, which is a training objective that had to be designed and validated.
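
In terms of Snippet Two, that workflow is just a choice of scales (the values are illustrative):

# "Audio strongly driven by the video, weakly driven by text": illustrative scales
# plugged into the modality_cfg_inference sketch above.
video_pred, audio_pred = modality_cfg_inference(
    model,
    video_latents_noisy, audio_latents_noisy,
    text_embeddings, null_text_embeddings,
    timestep,
    text_guidance_scale=1.5,    # weak text adherence
    video_guidance_scale=6.0,   # audio follows the video content closely
    audio_guidance_scale=3.0,
)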

Insights

Insight One: LTX-2 is not a video model that adds audio. It is a joint audio-video model that generates both modalities as equally first-class outputs. Most coverage treats it as the former, which undersells the architectural difference.

Prior video models (Sora, VEO-2, other commercial systems) generate silent video. When those services provide audio in their consumer products, the audio is generated by a separate model as a second inference step, not jointly. The architectural consequence is that audio synchronization is approximate: the audio model sees the completed video and estimates what sounds were happening, rather than having bidirectional knowledge of what the video model was generating at each denoising step. LTX-2's bidirectional AV cross-attention means the audio and video were generated in conversation with each other throughout the full denoising process. This produces qualitatively different synchronization properties, especially for foley events (a door closing, footsteps, instrument sounds) where the timing must match the precise frame when the event visually occurs.

Insight Two: The "thinking token" mechanism in LTX-2's text conditioning module (documented in the technical report as a multi-token prediction mechanism for semantic stability) is the component that enables complex multi-subject prompts to be handled correctly, and it is the piece most likely to be adopted broadly by other video models.

Standard text-to-video conditioning encodes the prompt as a fixed embedding and uses cross-attention in each transformer block. This works for simple prompts but degrades for complex prompts with multiple subjects, spatial relationships, and stylistic instructions. LTX-2's text conditioning uses multi-token prediction: the text encoder produces intermediate "thinking" tokens that allow the model to reason about complex prompt structure before producing the final conditioning embeddings. LTX-2.3's 4× larger text connector is an extension of this approach. The result is what the LTX-2.3 page documents as: "Complex prompts, multiple subjects, spatial relationships, stylistic instructions, now resolve accurately. Try being more specific. The model handles it."
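
The technical report does not fully specify this mechanism. Purely as a loose sketch of the shape such a connector could take, learnable tokens appended to the encoder output and refined jointly with it before conditioning, with every detail below an assumption rather than the LTX-2 implementation:

import torch
import torch.nn as nn

class ThinkingTextConnector(nn.Module):
    """Hypothetical sketch of a multi-token text connector (not the LTX-2 code).

    Assumption: K learnable "thinking" tokens are appended to the frozen text
    encoder's output, a small transformer refines the whole sequence jointly, and
    the refined sequence is what the dual-stream blocks cross-attend to.
    """

    def __init__(self, text_dim: int, num_thinking_tokens: int = 32, depth: int = 4):
        super().__init__()
        self.thinking_tokens = nn.Parameter(torch.randn(num_thinking_tokens, text_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True)
        self.refiner = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, text_embeddings: torch.Tensor) -> torch.Tensor:
        # text_embeddings: (B, L, text_dim) from the frozen text encoder
        b = text_embeddings.shape[0]
        thinking = self.thinking_tokens.unsqueeze(0).expand(b, -1, -1)
        seq = torch.cat([text_embeddings, thinking], dim=1)
        return self.refiner(seq)   # (B, L + K, text_dim), used as conditioning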

Takeaway

LTX-2's paper (arXiv:2601.03233) claims to achieve "state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost." The key word is "comparable." The computational efficiency comes from the joint generation in one pass: the proprietary systems that LTX-2 competes with on quality all require separate model passes for video and audio, making LTX-2's total generation time significantly lower even if per-step compute is higher due to modality-CFG.

This is the correct framing that most comparisons miss. LTX-2 is not faster than VEO-2 or Sora on a per-operation basis. It is faster on a per-synchronized-audiovisual-output basis because it does not require a second inference pass for audio. When comparing generation cost, the relevant unit is "cost to produce one synchronized audiovisual clip," not "cost per video denoising step."

TL;DR For Engineers

  • LTX-2 (Lightricks, arXiv:2601.03233, open weights, January 2026) is the first open-weights model generating synchronized video and audio in one diffusion pass. Architecture: asymmetric dual-stream transformer, 14B video stream + 5B audio stream, coupled via bidirectional AV cross-attention with temporal 1D RoPE and cross-modality AdaLN for shared timestep conditioning.

  • Separate modality-specific causal VAEs encode video (3D spatial-temporal) and audio (1D temporal) into independent latent spaces. This enables 3D RoPE for video self-attention and 1D RoPE for audio, matching each modality's natural geometry. Natively supports V2A and A2V workflows without architecture changes.

  • Each dual-stream block performs four sequential operations: (1) modality-specific self-attention with appropriate RoPE, (2) text cross-attention, (3) bidirectional AV cross-attention (video↔audio), (4) per-stream FFN. Cross-modality AdaLN with shared timestep ensures both streams denoise at the same noise level.

  • Modality-CFG: 4 forward passes per denoising step for independent text, video, and audio guidance scales. Enables workflows like V2A (generate audio for existing video), A2V, and full text-to-audiovisual generation.

  • LTX-2.3 improvements (current as of May 2026): rebuilt VAE for sharper detail, 4× larger text connector, improved image-to-video with less Ken Burns, cleaner audio via filtered training + new vocoder, native portrait 1080×1920 (not cropped from landscape), HDR output as beta IC-LoRA.

Audio Was Always Part of the Video. LTX-2 Is the First Open Model to Generate Them Together.

The silent video problem was not a technical limitation that required new hardware or a new scale of training. It was an architectural choice to defer audio to a second model, accepted so universally that it became invisible. LTX-2 challenges that choice with an architecture that treats audio and video as equally first-class outputs in a single unified denoising process. The bidirectional AV cross-attention is the mechanism. The asymmetric parameter allocation is the engineering discipline. The open weights are the distribution. Whether LTX-2's specific implementation becomes the standard or serves as the demonstration that joint generation works at production quality, the silence era of video generation is over.

References

LTX-2 (arXiv:2601.03233, Lightricks, January 2026, open weights) is the first open-weights joint audio-visual diffusion model, using an asymmetric dual-stream transformer (14B video + 5B audio parameters) with separate modality-specific causal VAEs, bidirectional AV cross-attention using 1D temporal RoPE for inter-modal exchange and cross-modality AdaLN for shared timestep conditioning, and modality-aware CFG (4 forward passes) for independent guidance control. Each dual-stream block performs: self-attention (3D RoPE for video, 1D for audio), text cross-attention, bidirectional AV cross-attention, and per-stream FFN. LTX-2.3 (current) adds native portrait generation (1080×1920 trained on portrait data), a rebuilt video VAE, 4× larger text connector, improved vocoder for cleaner audio, and HDR output as a beta IC-LoRA.

Sponsored Ad

If you enjoy practical AI insights, check out SnackOnAI and support the newsletter by subscribing, sharing, and exploring our sponsored ad — it helps us keep building and delivering value 🚀

In a World of AI Agents: Intent > Identity

AI-powered bots aren’t just logging in anymore. They’re mimicking real users, slipping past identity checks, and scaling attacks faster than ever.

Thousands of companies worldwide trust hCaptcha to protect their online services from automated threats while preserving user privacy.

Now is the time to take control of your security.
