SnackOnAI Engineering · Senior AI Systems Researcher · March 2026 · Source: lightricks-ltx-video.com · License: LTX Community License

TL;DR For Founders

Video generation just went from "cloud API call with a $0.10/clip invoice" to "runs on your MacBook, costs $0.002 in electricity." Lightricks LTX-Video is a 13B-parameter Diffusion Transformer that runs on Apple Silicon MPS with no CUDA required. The snackonai/lightricks-ltx-2-av project wraps it in a single-file Python CLI with format presets (LinkedIn, TikTok, Instagram), SHA256-keyed inference caching, and optional TTS audio. Five-second 720p video from a text prompt in 3 to 15 minutes on an M4 Pro: free, private, offline.

The unlock: a 1:192 spatiotemporal compression ratio lets a DiT model fit in 24 GB of unified memory. If you are validating a media AI idea without burning API credits, this is your stack.

Why This Matters Now

We are in the middle of a quiet infrastructure inversion in generative AI. In 2023, video generation required Runway, Pika, or Stable Video Diffusion API calls: cloud-bound, rate-limited, priced per second. By early 2026, Lightricks shipped LTX-2 with fully open weights, a 19B-parameter dual-stream audio-video transformer, and a Python inference package that runs on consumer hardware.

This is the image generation arc replaying at higher complexity. The DALL-E API gave way to local Stable Diffusion; the same transition is now happening in video. But video is fundamentally harder. The state space is orders of magnitude larger, temporal consistency is a qualitatively different problem than spatial consistency, and the memory pressure of full spatiotemporal attention is brutal. Three forces converged to make local inference viable:

Compression got ruthless. LTX-Video's Video-VAE achieves a 1:192 compression ratio: 32×32 spatial downscaling plus 8× temporal compression per token. This is the architectural prerequisite for full spatiotemporal self-attention on a single 24 GB device.

DiTs replaced U-Nets. Diffusion Transformers scale more predictably, benefit more from distillation, and tolerate quantization better than convolutional architectures. The shift to DiT is what enabled 8 to 10-step distilled inference paths.

Apple Silicon closed the gap. The M4 Pro's unified memory architecture eliminates the PCIe bandwidth bottleneck between system memory and VRAM. A 24 GB M4 Pro punches meaningfully above its spec for models that use CPU offloading aggressively.

For builders: your AI media prototype no longer requires a cloud GPU instance. Your iteration loop is a local shell command.

The Real Problems Being Solved

"Generate video from text" is the product. The engineering problems are different:

Memory versus quality. A 720p, 5-second video at 24fps is 120 frames of raw float32 RGB: about 11 MB per frame, roughly 265 MB per second, 1.3 GB for the clip. Full attention over that many pixel positions would require attention matrices measured in terabytes. The entire architecture exists to make this tractable.

Temporal coherence. Generating frame 47 independently of frame 46 gives you noise, not video. The model must attend across the full temporal sequence, which means the attention matrix grows quadratically with frame count. Latent compression is not an optimization here; it is a prerequisite.

Platform portability. CUDA is unavailable on Apple Silicon. The MPS (Metal Performance Shaders) backend in PyTorch bridges this, but with hard constraints: no float16 for this model class, different memory semantics, no flash attention kernel. The generate.py wrapper handles these constraints explicitly, not incidentally.

Iteration cost. Running the model takes 3 to 15 minutes on an M4 Pro. Re-running the same prompt is economically irrational. The SHA256-keyed frame cache directly addresses this.

How It Actually Works

The Compression Insight

The fundamental insight behind LTX-Video: video generation does not need to happen in pixel space, or in a modestly compressed latent space. It can happen in a radically compressed latent space if the VAE is co-designed with the transformer rather than treated as a pre-processing step.

Standard latent diffusion models compress images roughly 8× spatially. LTX-Video compresses video at 1:192 overall: 32×32 spatial downscaling times 8× temporal is 8,192 pixels per latent position, and expanding from 3 RGB channels to 128 latent channels brings the net compression to 8,192 × 3 / 128 = 192. At 720p 24fps, a 5-second clip goes from roughly 110 million pixels to roughly 13,000 latent tokens. That is the sequence length the transformer attends over.
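A quick sanity check on that sequence-length claim, as a sketch (the `(num_frames - 1) // 8 + 1` temporal rounding is inferred from the model's 8× temporal compression with a kept first frame; `latent_token_count` is illustrative, not project code):

```python
def latent_token_count(width: int, height: int, num_frames: int) -> int:
    """Approximate DiT sequence length for LTX-Video.

    32x32 spatial downscaling; temporal compression assumed to be 8x
    with the first frame kept as an anchor (matching the 8k+1 frame rule).
    """
    w_lat = width // 32
    h_lat = height // 32
    t_lat = (num_frames - 1) // 8 + 1
    return w_lat * h_lat * t_lat

# A "720p" 5 s clip: dims floor to 1280x704, frame count pads to 121.
tokens = latent_token_count(1280, 704, 121)
print(tokens)  # 40 * 22 * 16 = 14080, the ~13-14K range cited above
```

Quadratic attention over ~14K tokens is heavy but feasible; the same clip in pixel space would be over 100 million positions.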

The mechanism that preserves quality under this extreme compression: the VAE decoder handles not just latent-to-pixel conversion but also the final denoising step. High-frequency detail reconstruction (hair, texture, edges) is delegated to the decoder. The transformer handles semantic layout, motion dynamics, and temporal structure. This division of labor is both elegant and non-obvious.

The Patchification Architecture

Standard video DiTs patchify after the VAE encoder, taking VAE latents and dividing them into patches before the transformer. LTX-Video moves patchification to the VAE input, before the encoder. The VAE learns to compress patches of raw pixels directly, producing a more information-dense latent space with lower inter-channel redundancy. The original paper verifies this via PCA of latent channels at different training stages: the off-diagonal entries of the channel autocorrelation matrix converge toward zero over training, meaning the latent space is being used efficiently rather than redundantly.

Dual-Stream Audio-Video in LTX-2

The lightricks-ltx-2-av project references Lightricks/LTX-Video via Diffusers, which exposes the LTX-2 architecture. LTX-2 extends the original model with an asymmetric dual-stream design:

Video stream: 14B parameters, wide, high-capacity, handling spatiotemporal dynamics.

Audio stream: 5B parameters, narrower, 1D temporal, operating on mel-spectrogram latents.

Coupling: bidirectional audio-video cross-attention at every transformer block, with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning.

The audio VAE operates on stereo mel spectrograms at 16 kHz. Each audio latent token represents approximately 1/25 seconds of audio. The two streams co-denoise: neither generates independently; they mutually condition each other at every layer through cross-attention.

The text encoder is Gemma, not CLIP or T5. This is a significant architectural choice. Gemma provides richer semantic understanding: temporal language, camera language, style language. The text embedding pipeline uses multi-layer feature extraction across all decoder layers (not just the final one) and "thinking token" connectors: bidirectional transformer blocks with learnable registers that perform additional contextual mixing before cross-attending into the DiT.

Modality-Aware Classifier-Free Guidance

Standard CFG runs two forward passes (conditional and unconditional) and interpolates. LTX-2 uses modality-aware CFG: independent guidance scales for the video and audio streams. You can strengthen video adherence without over-constraining audio generation, or vice versa. This is the guidance_scale parameter paired with per-modality guider params in the pipeline.
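The mechanics reduce to the standard CFG interpolation applied once per stream, each with its own scale. A minimal scalar sketch of the idea, not the Diffusers API:

```python
def modality_aware_cfg(cond_v, uncond_v, cond_a, uncond_a,
                       video_scale, audio_scale):
    """Classifier-free guidance per stream:
    guided = uncond + scale * (cond - uncond),
    with an independent scale for video and for audio."""
    video = uncond_v + video_scale * (cond_v - uncond_v)
    audio = uncond_a + audio_scale * (cond_a - uncond_a)
    return video, audio

# Scalar stand-ins for noise predictions: scale 1.0 reproduces the
# conditional prediction; a larger scale extrapolates beyond it.
v, a = modality_aware_cfg(1.0, 0.0, 1.0, 0.0, video_scale=1.0, audio_scale=3.0)
print(v, a)  # 1.0 3.0
```

The point of separating the scales: raising `video_scale` tightens prompt adherence in the video stream without pushing the audio stream into over-guided artifacts.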

The Two-Stage Production Pipeline

The snackonai project uses the single-stage LTXPipeline via Diffusers, which is appropriate for local iteration. The full LTX-2 production pipeline is two-stage:

Stage 1: DiT denoising at half resolution (384px), producing video and audio latents.

Stage 2: Spatial 2× latent upsampling via LTX2LatentUpsamplerModel, followed by distilled LoRA refinement at full resolution, then VAE decode.

The distilled LoRA in Stage 2 approximates many-step diffusion refinement in few steps. This is why production quality requires the separate ltx-2-19b-distilled-lora-384.safetensors weight file. The snackonai single-stage path trades some fine detail (hair, text, sharp edges) for a dramatically simpler local setup. For prototyping, this is the correct tradeoff.

Architecture Breakdown

System Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                     generate.py  (CLI entrypoint)                   │
│   typer CLI , argument validation , format/mode resolution          │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
              ┌────────────────▼────────────────┐
              │           Cache Layer            │
              │  SHA256( prompt | W | H | T | S )│
              │  HIT  → load frames.pt from disk │
              │  MISS → proceed to pipeline      │
              └────────────────┬─────────────────┘
                               │ MISS
              ┌────────────────▼────────────────┐
              │        Pipeline Loader           │
              │  LTXPipeline.from_pretrained()   │
              │  "Lightricks/LTX-Video"          │
              │  torch_dtype=float32 (MPS req.)  │
              │  enable_model_cpu_offload()      │
              │  enable_attention_slicing()      │
              └────────────────┬─────────────────┘
                               │
              ┌────────────────▼────────────────┐
              │          Text Encoding           │
              │  Gemma tokenizer → encoder       │
              │  Multi-layer feature extraction  │
              │  Thinking token connectors       │
              └────────────────┬─────────────────┘
                               │
              ┌────────────────▼────────────────┐
              │        Noise Initialization      │
              │  Gaussian noise in compressed    │
              │  latent space                    │
              │  Shape: [B, T_lat, H_lat, W_lat] │
              └────────────────┬─────────────────┘
                               │
              ┌────────────────▼────────────────┐
              │          Denoising Loop          │
              │  N steps (10 preview / 25 prod)  │
              │  ┌──────────────────────────┐    │
              │  │  DiT Block (×N_layers)   │    │
              │  │  Full spatiotemp attn    │    │
              │  │  Text cross-attention    │    │
              │  │  AdaLN timestep cond     │    │
              │  └──────────────────────────┘    │
              │  Scheduler: DDIM / RF-Solver     │
              └────────────────┬─────────────────┘
                               │
              ┌────────────────▼────────────────┐
              │           VAE Decoder            │
              │  Latent → pixel + final denoise  │
              │  Output: PIL frames [T, H, W, 3] │
              └────────────────┬─────────────────┘
                               │
         ┌─────────────────────┼──────────────────────┐
         │                     │                      │
┌────────▼──────┐   ┌──────────▼────────┐   ┌────────▼──────────┐
│  MP4 Encoder  │   │  GIF Generator    │   │  JPEG Thumbnail   │
│  H.264        │   │  480px, 15fps,    │   │  First frame,     │
│  CRF 23       │   │  first 3s         │   │  quality=90       │
│  yuv420p      │   └────────┬──────────┘   └───────────────────┘
└────────┬──────┘            │
         │           ┌───────▼───────────┐
         │           │   Audio Handling  │
         │           │   TTS (Coqui) OR  │
         │           │   External file   │
         │           │   ffmpeg mux      │
         │           └───────────────────┘
         │
┌────────▼──────────────────────────────────────────────────────┐
│                       Output Directory                         │
│  <ts>_<fmt>_<mode>_<id>.mp4                                   │
│  <ts>_..._thumb.jpg                                            │
│  <ts>_..._preview.gif                                          │
│  <ts>_..._meta.json                                            │
│  <ts>_..._tts.wav           (with --tts)                       │
│  <ts>_..._with_audio.mp4    (with --tts or --audio)            │
└───────────────────────────────────────────────────────────────┘

Memory Layout on MPS

Apple Silicon's unified memory means model weights, activations, and output buffers share one physical pool. The pipeline uses enable_model_cpu_offload(): weights sit on CPU when idle, move to MPS on-demand per layer, and results return to CPU after each forward pass. This is slower than keeping everything on GPU but lets a 13B+ parameter model run on 24 GB by ensuring only a few layers occupy the accelerator simultaneously.

The env variable PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 is not boilerplate. It removes the MPS allocator's default upper bound on how much unified memory PyTorch may claim, a cap that otherwise triggers OOM errors during activation spikes even when total memory is technically sufficient.

Code Walkthrough

The MPS OOM Guard

# Top of generate.py — MUST be set before torch is imported
os.environ.setdefault("PYTORCH_MPS_HIGH_WATERMARK_RATIO", "0.0")

This line must precede import torch; the variable is read when the MPS backend initializes. The default high watermark ratio caps how much of the unified memory pool the MPS allocator will use before raising an out-of-memory error. Setting it to 0.0 removes this artificial cap. Without it, a model that technically fits will OOM during large activation spikes: full spatiotemporal attention over a long video is the worst offender.

Dimension Resolution and the 32-Divisibility Constraint

FORMAT_PRESETS = {
    "linkedin":  {"width": 1280, "height": 720,  "ratio": "16:9"},
    "tiktok":    {"width": 720,  "height": 1280, "ratio": "9:16"},
    "instagram": {"width": 720,  "height": 720,  "ratio": "1:1"},
}

def _resolve_dims(fmt: str, mode: str) -> tuple[int, int]:
    preset = FORMAT_PRESETS[fmt]
    w, h = preset["width"], preset["height"]
    if mode == "preview":
        w, h = w // 2, h // 2
    # LTXPipeline REQUIRES dimensions divisible by 32
    w = (w // 32) * 32
    h = (h // 32) * 32
    return w, h

The // 32 constraint is architectural, not arbitrary. LTX-Video's patchifier divides spatial dimensions into 32×32 tiles. Non-multiples produce an incomplete tile grid and a hard error. Preview mode halves both dimensions: a 4× reduction in total pixels that roughly halves inference time, since attention complexity is quadratic in token count and token count scales with spatial area.
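Restating that logic as a self-contained check (`resolve_dims` mirrors `_resolve_dims` above, with the ratio metadata dropped) shows where the preset resolutions actually land:

```python
FORMAT_PRESETS = {
    "linkedin":  {"width": 1280, "height": 720},
    "tiktok":    {"width": 720,  "height": 1280},
    "instagram": {"width": 720,  "height": 720},
}

def resolve_dims(fmt: str, mode: str) -> tuple[int, int]:
    w, h = FORMAT_PRESETS[fmt]["width"], FORMAT_PRESETS[fmt]["height"]
    if mode == "preview":
        w, h = w // 2, h // 2
    return (w // 32) * 32, (h // 32) * 32  # floor to the 32px tile grid

# 720 is not divisible by 32, so it floors to 704 in production;
# halved to 360 for preview, it floors to 352.
print(resolve_dims("linkedin", "production"))  # (1280, 704)
print(resolve_dims("linkedin", "preview"))     # (640, 352)
```

Note the consequence: "720p" output is actually 1280×704 after flooring.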

The Cache System

def _cache_key(prompt: str, width: int, height: int, num_frames: int, steps: int) -> str:
    raw = f"{prompt}|{width}|{height}|{num_frames}|{steps}"
    return hashlib.sha256(raw.encode()).hexdigest()

def _load_from_cache(key: str) -> Optional[torch.Tensor]:
    path = CACHE_DIR / key / "frames.pt"
    if path.exists():
        return torch.load(path, map_location="cpu", weights_only=True)
    return None

The cache is content-addressed on the exact parameters that determine output: prompt, resolution, frame count, inference steps. Change any one and you get a fresh run. Identical parameters skip the entire 3 to 15-minute inference.

Note that --audio, --tts, and --output-dir are intentionally absent from the cache key. These affect only post-processing. The frames tensor is the expensive artifact; everything downstream is cheap.

weights_only=True in torch.load prevents arbitrary code execution from malicious .pt files. Include this in any tool that reads tensors from disk.
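The save side is symmetric to the load path above. A hedged sketch (the `cache_key`/`cache_path` names and the `CACHE_DIR` location are illustrative, not verbatim project code):

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "ltx-frames"  # hypothetical location

def cache_key(prompt: str, width: int, height: int,
              num_frames: int, steps: int) -> str:
    # Content-addressed on exactly the parameters that determine output
    raw = f"{prompt}|{width}|{height}|{num_frames}|{steps}"
    return hashlib.sha256(raw.encode()).hexdigest()

def cache_path(key: str) -> Path:
    return CACHE_DIR / key / "frames.pt"  # one directory per key

# Deterministic: identical parameters produce identical keys (a hit);
# changing any single parameter produces a different key (a miss).
k1 = cache_key("a mountain lake", 640, 352, 121, 10)
k2 = cache_key("a mountain lake", 640, 352, 121, 10)
k3 = cache_key("a mountain lake", 640, 352, 121, 25)
print(k1 == k2, k1 == k3)  # True False
```

The write itself is then just `cache_path(key).parent.mkdir(parents=True, exist_ok=True)` followed by `torch.save(frames, cache_path(key))`.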

Pipeline Initialization: Three Stacked Memory Optimizations

def _load_pipeline(device: torch.device):
    from diffusers import LTXPipeline

    pipe = LTXPipeline.from_pretrained(
        "Lightricks/LTX-Video",
        torch_dtype=torch.float32,  # MPS does NOT support float16 for this model
    )
    pipe.enable_model_cpu_offload()
    if hasattr(pipe, "enable_attention_slicing"):
        pipe.enable_attention_slicing(slice_size="auto")
    if hasattr(pipe, "enable_vae_slicing"):
        pipe.enable_vae_slicing()
    return pipe

enable_model_cpu_offload(): sequential layer offloading. Slower but memory-safe for 24 GB unified memory.

enable_attention_slicing("auto"): instead of computing the full Q·Kᵀ attention matrix at once, compute it in chunks. At 13,000 tokens, the full attention matrix is 13K × 13K × 4 bytes ≈ 676 MB. Slicing trades compute time for peak memory, which is the right tradeoff on MPS.
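Making that peak-memory arithmetic explicit (the slice count of 8 is illustrative; `slice_size="auto"` picks it for you):

```python
def full_attn_bytes(tokens: int, dtype_bytes: int = 4) -> int:
    """One full Q·Kᵀ score matrix (single-head view, float32)."""
    return tokens * tokens * dtype_bytes

def sliced_peak_bytes(tokens: int, num_slices: int, dtype_bytes: int = 4) -> int:
    """With slicing, only tokens/num_slices query rows are live at once."""
    return (tokens // num_slices) * tokens * dtype_bytes

full = full_attn_bytes(13_000)
sliced = sliced_peak_bytes(13_000, 8)
print(full // 10**6, sliced // 10**6)  # 676 84 (MB): same math, lower peak
```

The total compute is unchanged; only the peak allocation drops, which is exactly what a memory-bound MPS device needs.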

enable_vae_slicing(): decode VAE one frame at a time rather than batching all frames simultaneously. Critical for clips longer than a few seconds.

Inference and Frame Handling

result = pipe(
    prompt=prompt,
    width=width,
    height=height,
    num_frames=num_frames,            # int(duration * fps)
    num_inference_steps=num_inference_steps,  # 10 preview / 25 production
)

result.frames has an unstable shape across diffusers versions. The code defensively handles all three known shapes:

raw_frames = result.frames
if isinstance(raw_frames[0], list):
    pil_frames = raw_frames[0]       # [[PIL, PIL, ...]] nested list
elif isinstance(raw_frames[0], Image.Image):
    pil_frames = list(raw_frames)    # [PIL, PIL, ...] flat list
else:
    frames_tensor = raw_frames       # Tensor path
    pil_frames = None

Diffusers changes its output format between minor versions. This defensive handling is not paranoia; it is empirically necessary.
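When the tensor path fires, the frames still need converting to PIL for the encoders downstream. A hedged sketch of that branch (the `[T, H, W, C]` layout in `[0, 1]` is an assumption; inspect `result.frames` on your diffusers version before relying on it):

```python
import numpy as np
from PIL import Image

def tensor_to_pil_frames(frames) -> list:
    """Convert a [T, H, W, C] (or [B, T, H, W, C]) float array in [0, 1]
    into a list of PIL images for the MP4/GIF encoders downstream."""
    arr = np.asarray(frames)
    if arr.ndim == 5:          # batched output: take the first clip
        arr = arr[0]
    arr = (np.clip(arr, 0.0, 1.0) * 255).round().astype(np.uint8)
    return [Image.fromarray(frame) for frame in arr]

pils = tensor_to_pil_frames(np.random.rand(4, 8, 8, 3))  # tiny stand-in clip
print(len(pils), pils[0].size)  # 4 (8, 8)
```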

Output Encoding

import imageio.v3 as iio
import numpy as np

def _save_mp4(pil_frames, path, fps):
    # Convert PIL frames to HxWx3 uint8 arrays for the FFMPEG writer
    arrays = [np.asarray(frame) for frame in pil_frames]
    iio.imwrite(str(path), arrays, fps=fps, codec="libx264", plugin="FFMPEG",
                output_params=["-crf", "23", "-preset", "fast", "-pix_fmt", "yuv420p"])

CRF 23 is visually lossless quality for H.264: a rate factor, not a bitrate target. yuv420p is mandatory for compatibility with browser video players and social media ingest pipelines. Some platforms silently reject yuv444p despite it being technically superior. Always encode to yuv420p for distribution.

The audio mux uses -c:v copy (stream copy, no re-encode) and -shortest to trim to whichever stream ends first. Since TTS has no knowledge of video duration, the generated speech may be shorter or longer than the clip. -shortest handles both cases.
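That mux corresponds to an ffmpeg invocation along these lines (a sketch: the stream-copy and `-shortest` flags are the ones described above, while `-c:a aac` and the helper name are assumptions):

```python
import subprocess

def build_mux_cmd(video_path: str, audio_path: str, out_path: str) -> list[str]:
    """Build the ffmpeg mux command: copy the video stream untouched,
    trim the output to the shorter of the two streams."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", audio_path,
        "-c:v", "copy",    # stream copy: no video re-encode
        "-c:a", "aac",     # audio codec choice is an assumption here
        "-shortest",       # stop at whichever stream ends first
        out_path,
    ]

cmd = build_mux_cmd("clip.mp4", "tts.wav", "clip_with_audio.mp4")
print(" ".join(cmd))
# To execute: subprocess.run(cmd, check=True)
```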

Tradeoffs and Scaling Considerations

MPS vs. CUDA

| Dimension        | MPS (Apple Silicon)  | CUDA (A10G)          |
|------------------|----------------------|----------------------|
| float16 support  | No (float32 only)    | Yes                  |
| Memory bandwidth | ~400 GB/s (M4 Pro)   | ~600 GB/s (A10G)     |
| Inference speed  | 3 to 15 min per clip | 1 to 3 min per clip  |
| Cost per clip    | $0 after hardware    | $0.01 to $0.10       |
| Flash attention  | No                   | Yes (xformers)       |
| Batch processing | Single job           | Multi-job            |

For production batch workloads, spin up a spot A10G on RunPod ($0.35 to $0.60/hr) and make two changes to _load_pipeline:

# CUDA path
pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video",
    torch_dtype=torch.float16,   # 2× memory savings on CUDA
)
pipe.to(device)  # No offloading needed on A10G+

Remove PYTORCH_MPS_HIGH_WATERMARK_RATIO from the env. It is MPS-specific and adds confusion on CUDA systems.

Inference Step Count

Steps

Mode

Quality

Time (M4 Pro)

8 to 10

Preview / Distilled

Coherent motion, soft detail

3 to 5 min

20 to 25

Production

Sharp textures, better faces

8 to 15 min

40+

Research

Diminishing returns past ~30

20 to 35 min

The distilled Stage 2 LoRA in LTX-2 closes the quality gap between 8 and 40 steps. On CUDA with the distilled pipeline, you get near-production quality in 8 steps.

What Breaks at Scale

The single-file architecture is correct for solo local use and breaks at three points:

Concurrent jobs: no job queue. Two simultaneous instances will OOM.

Long videos past 10 seconds: quadratic attention cost becomes the bottleneck. LTX-2 officially supports up to 10-second clips; longer requires chunked generation or the LTXV-13B long-shot model.

Batch prompt files: wrapping generate.py in a shell loop over a CSV works but is unoptimized; the model reloads on every invocation. A proper batch runner would load the pipeline once and call inference in a loop.
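The batch-runner fix is small. A hedged sketch with hypothetical function names (neither `run_batch` nor the callables exist in generate.py):

```python
def run_batch(prompts, load_pipeline, generate_one):
    """Load the pipeline once, then generate in a loop, amortizing the
    multi-minute model load across every prompt."""
    pipe = load_pipeline()
    return [generate_one(pipe, p) for p in prompts]

# Stubbed demo: verify the expensive load happens exactly once.
loads = []
def fake_load():
    loads.append(1)
    return "pipe"

out = run_batch(["a", "b", "c"], fake_load, lambda pipe, p: f"{pipe}:{p}")
print(len(loads), out)  # 1 ['pipe:a', 'pipe:b', 'pipe:c']
```

In the real script, `load_pipeline` would be `_load_pipeline` and `generate_one` a function wrapping the `pipe(...)` call plus encoding.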

What Most People Get Wrong

Using float16 on MPS. MPS does not support float16 for this model. The error is a cryptic Metal shader compilation failure, not "float16 unsupported." The code sets torch.float32 explicitly. Do not change this without testing end-to-end.

Not setting PYTORCH_MPS_HIGH_WATERMARK_RATIO before import. This must precede import torch. Setting it after has no effect. Omitting it causes OOM at seemingly random points, typically during peak attention computation.

Prompting like it is DALL-E. LTX-Video conditions on a full language-model text encoder (T5-XXL in the original LTXV; Gemma in LTX-2), not CLIP. These encoders understand temporal language ("slowly pans left"), camera language ("cinematic wide shot"), and compositional instructions. "Mountain lake" produces mediocre output. "A wide aerial shot of a mountain lake at golden hour, camera slowly descending toward the surface, soft mist rising from the water, cinematic depth of field" produces something significantly better. The model is temporal; your prompt needs to describe motion and camera behavior, not just appearance.

Running production mode for every iteration. Preview mode (half resolution, 10 steps) is for everything until the motion and composition are correct. Switch to production for the final render only. The cache makes that production run the only one at that resolution and step count.

Ignoring the 8+1 frame count constraint. LTX-Video requires (num_frames - 1) % 8 == 0. At 24fps, valid frame counts are 9, 17, 25, 33... 121. Five seconds at 24fps is 120 frames: (120-1) % 8 = 7, which violates the constraint. The pipeline pads to 121 internally, but this is silent behavior. Understand it.
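The constraint is easy to make explicit instead of silent. A small helper of my own, not in generate.py:

```python
def valid_num_frames(requested: int) -> int:
    """Round up to the nearest count satisfying (n - 1) % 8 == 0."""
    if (requested - 1) % 8 == 0:
        return requested
    return ((requested - 1) // 8 + 1) * 8 + 1

# 5 s at 24 fps asks for 120 frames; the model pads to 121.
print(valid_num_frames(120), valid_num_frames(121), valid_num_frames(9))
# 121 121 9
```

Calling this before invoking the pipeline means your metadata records the frame count the model actually generated, not the one you requested.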

Confusing LTX-Video with LTX-2. The snackonai project uses Lightricks/LTX-Video via Diffusers: the original LTXV v0.9.x model. LTX-2 (dual-stream 19B architecture, native synchronized audio, two-stage distilled pipeline) lives in the separate Lightricks/LTX-2 repo with different setup requirements. For basic text-to-video locally, LTXV is the right tool. For synchronized audio-video, you need LTX-2.

Contrarian Insights

Contrarian Insight 1: The single-file CLI is a better architecture than a microservices wrapper.

Every "production AI infrastructure" tutorial will tell you to split model loading, job scheduling, and output encoding into separate services. For a tool like this, that is wrong. The model is the bottleneck: everything else is negligible. A single Python file with a SHA256 cache, three CLI flags for quality mode, and ffmpeg subprocess calls is the correct architecture for 95% of builder use cases. Premature decomposition adds operational surface area, container networking overhead, and deployment complexity for zero throughput gain. Write the single file. Ship it. Decompose only when you have evidence of a specific bottleneck.

Contrarian Insight 2: MPS inference being "slow" is often a misleading framing.

The common claim is that MPS is too slow for serious AI work. In the context of video generation for content pipelines, this misidentifies the actual constraint. If you are generating 10 clips per day for a content strategy workflow, 15 minutes per clip is perfectly acceptable. The relevant comparison is not "MPS versus A100" but "MPS versus paying $4 per clip on a cloud API." At $0.002 in electricity, you can afford 2,000 clips on local hardware for the cost of one hour of A100 cloud time. For builders validating ideas, the economics of MPS are straightforwardly superior until you hit genuine throughput requirements, which is typically much later than people assume.

Surprising Takeaway

The VAE decoder, not the transformer, is responsible for visual quality.

Most engineers assume quality lives in the denoising transformer: more parameters, more steps, better output. For LTX-Video this is partially wrong. The 1:192 compression means the transformer operates on a deeply abstract latent representation. It knows where to put things and how they should move, but it cannot represent fine texture. The VAE decoder performs the final denoising step in pixel space, reconstructing hair strands, fabric texture, and sharp edges that the transformer never had the resolution to represent. The practical implication: if your output looks compositionally correct but texturally soft, the problem is the single-stage pipeline, not the prompt and not the step count. The two-stage path with the distilled LoRA and the latent upsampler is the fix. Improving prompts or increasing steps will not solve a VAE fidelity problem.

Ecosystem Comparison

| Tool               | Model              | Local             | Audio   | Speed            | Best For                    |
|--------------------|--------------------|-------------------|---------|------------------|-----------------------------|
| snackonai/ltx-2-av | LTXV via Diffusers | MPS or CUDA       | TTS/mux | Moderate         | Mac iteration, social clips |
| Lightricks/LTX-2   | LTX-2 (19B)        | CUDA 80GB+        | Native  | Fast (distilled) | Full A/V production         |
| LTX Desktop        | LTX-2.x            | Windows/Linux GPU | Native  | Fast             | Non-technical users         |
| ComfyUI + LTXVideo | LTXV, LTX-2        | GPU               | Partial | Variable         | Workflow builders           |
| Runway Gen-3       | Proprietary        | No                | No      | Fast             | Premium cloud               |
| Wan 2.1            | Open 14B           | CUDA              | No      | Slow locally     | Research                    |
| HunyuanVideo       | Open 13B           | CUDA              | No      | Slow locally     | Research                    |
The snackonai project occupies a specific niche: Apple Silicon local inference with zero CUDA dependency. For Mac-based builders, this is the only clean path to free local video generation today. If you need the full LTX-2 audio-video stack locally, you need 80GB+ VRAM (A100 or H100 territory) or the quantized 32GB path. Plan accordingly.

Builder Mindset

Treat this as a media compute primitive. The generate.py script is roughly 300 lines: it takes a text prompt and writes an MP4 to disk. That is its complete job. The value is in the model. Your engineering problem is what you build around this primitive: automated content pipelines, social media tooling, personalized video at scale, concept visualization for pre-production. The wrapper is not the product.

The inference cache is a product feature. The SHA256-keyed frame cache is the difference between a tool you iterate with and a tool you use once. In a production product context, this pattern generalizes directly to a server-side cache keyed on prompt plus parameters. When multiple users request similar videos, a shared cache drops inference cost to near-zero on repeated prompts. This is the economics that makes video AI products viable at scale before the underlying model gets cheaper.

MPS is a development path, CUDA is a production path. An A10G at $0.35 to $0.60 per hour on RunPod generates a 5-second clip in roughly 2 minutes. At 30 clips per hour, your cost is $0.01 to $0.02 per clip. That math works for most product use cases. Develop on MPS; deploy on CUDA.

Prompt engineering is model-specific, not generic. Prompting patterns for SDXL do not transfer to LTX-Video. The model is temporal: it generates motion, not just appearance. Effective prompts describe the opening composition, the motion that unfolds, the camera behavior, and the lighting. Invest in a prompt library before investing in inference infrastructure. Ten strong prompt templates will outperform ten iterations on deployment architecture.

End-to-End: Get Running in 5 Minutes

Setup

git clone https://github.com/mohnishbasha/snackonai.git
cd snackonai/lightricks-ltx-2-av

brew install ffmpeg        # macOS
# apt install ffmpeg       # Ubuntu

bash setup.sh
source .venv/bin/activate

The first run downloads ~8 GB of model weights to ~/.cache/huggingface/. One-time cost.

First video, preview mode

python generate.py \
  --prompt "A mountain river through a pine forest at golden hour, camera tracking downstream, volumetric sunlight, cinematic 4K" \
  --format linkedin \
  --mode preview \
  --duration 5

Expected time on M4 Pro: 3 to 5 minutes.

Output:

outputs/20260327T142301Z_linkedin_preview_a3f2c1b0.mp4
outputs/20260327T142301Z_..._thumb.jpg
outputs/20260327T142301Z_..._preview.gif
outputs/20260327T142301Z_..._meta.json

TikTok vertical with TTS narration

python generate.py \
  --prompt "A founder pitching on stage, dramatic lighting, crowd visible behind" \
  --format tiktok \
  --mode preview \
  --duration 6 \
  --tts

Production quality LinkedIn with voiceover

python generate.py \
  --prompt "Product launch event, sleek modern stage, tech reveal, dramatic lighting" \
  --audio ./voiceover_final.mp3 \
  --format linkedin \
  --mode production \
  --duration 8 \
  --output-dir ./renders

Switching to CUDA for cloud production

# Two changes in _load_pipeline
pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video",
    torch_dtype=torch.float16,   # float32 on MPS, float16 on CUDA
)
pipe.to(device)  # No offloading needed on A10G+

# _get_device() priority order
def _get_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

Remove PYTORCH_MPS_HIGH_WATERMARK_RATIO from the top of the file. MPS-only env vars add confusion in CUDA environments.

CLI Quick Reference

| Flag         | Default   | Purpose                                   |
|--------------|-----------|-------------------------------------------|
| --prompt     | required  | Video description                         |
| --duration   | 5.0s      | Clip length (10s practical max)           |
| --fps        | 24        | Frame rate                                |
| --format     | linkedin  | linkedin, tiktok, instagram               |
| --mode       | preview   | preview (10 steps), production (25 steps) |
| --output-dir | ./outputs | Output location                           |
| --tts        | false     | Auto-narration from prompt text           |
| --audio      | None      | External audio file to mux                |

Resolution by format and mode:

| Format    | Preview | Production |
|-----------|---------|------------|
| linkedin  | 640×352 | 1280×704   |
| tiktok    | 352×640 | 704×1280   |
| instagram | 352×352 | 704×704    |

Note that linkedin production is 1280×704, not 1280×720: the preset's 720 height is floored to the nearest multiple of 32.

The Bottom Line

Local video generation is no longer a research curiosity. The combination of a 1:192 compression DiT, Apple Silicon unified memory, and a well-engineered CLI wrapper means you can generate production-adjacent video from text on hardware you already own, at a cost that rounds to zero.

The snackonai/lightricks-ltx-2-av project is a clean implementation of the minimum viable local video pipeline. The architecture decisions (single file, SHA256 cache, format presets, graceful MPS memory management) reflect real constraints rather than theoretical elegance. Start here, understand the primitives, and build upward.

The infrastructure inversion is already happening. The builders who understand the compression architecture, the memory management tradeoffs, and the prompt semantics will ship meaningfully faster than those who treat this as a black-box API.
