The Gap Nobody Wants to Talk About
Content creators and engineers write long-form articles that take 8–15 minutes to read. LinkedIn's highest-engagement format is a 60-second vertical video that most people will actually watch. The gap between these two formats is where most people give up and either skip video entirely or pay for a tool they don't fully control.
SnackOnAI Clips is a Python CLI that closes this gap: give it a URL or a local HTML file, and it produces a 1080×1920 MP4 with narration, styled text slides, and a watermark — in under two minutes on a laptop.
The interesting engineering isn't the video itself. It's how the pipeline handles every failure mode between "URL as input" and "MP4 as output" without requiring the caller to care about which extraction strategy worked, which LLM is running, or whether a TTS service is available.
Where Every Other Tool Falls Apart
Most video generation tools fail in one of three places.
They're cloud-only black boxes. You upload content, a remote service processes it, you get a result. No control over prompts, rendering, fonts, or output format. When the API changes, your pipeline breaks.
They conflate the stages. A monolithic "blog-to-video" tool that handles extraction, summarization, and rendering in one function is impossible to debug when it produces garbage. You can't tell whether the bad output came from a failed extraction, a hallucinated summary, or a rendering bug.
They don't handle real-world HTML. Most sites don't serve clean text. They serve JavaScript-rendered pages, paywalled content, complex ad-heavy layouts, and inconsistent DOM structures. A single extraction strategy fails on roughly 30–40% of real URLs.
SnackOnAI Clips addresses all three by treating each stage as an independently testable, replaceable component with explicit input/output contracts.
Five Stages, Five Contracts, Zero Surprises
The pipeline has five discrete stages, each isolated into its own module:
URL or local file
|
v
┌──────────────────┐
│ extractor.py │ trafilatura → readability → newspaper3k → raw text strip
└────────┬─────────┘
| ArticleContent(url, title, text, author, date)
v
┌──────────────────┐
│ summarizer.py │ OpenAI (json_object mode) or Ollama REST → fallback heuristic
└────────┬─────────┘
| Summary(headline, summary, bullets)
v
┌──────────────────┐
│ tts.py │ gTTS (free) or ElevenLabs (premium) → temp MP3
└────────┬─────────┘
| /tmp/snackonai_*.mp3 (or None if --no-tts)
v
┌──────────────────────┐
│ video_generator.py │ MoviePy slides + NumPy gradient BG + ImageMagick text
└────────┬─────────────┘
| output.mp4 (1080×1920, ≤60s, libx264/aac)
v
┌──────────────────┐
│ cli.py │ Rich progress UI, error handling, --thumbnail, --json-output
└──────────────────┘
Key data contracts:
- extractor.py → ArticleContent dataclass. Every downstream module receives the same structured object regardless of which extraction strategy succeeded.
- summarizer.py → Summary dataclass with headline, summary, and bullets. The video generator only knows about Summary — it has no knowledge of LLMs or article content.
- config.py → singleton AppConfig via get_config(). CLI flags override the singleton before any stage runs. No global state mutation mid-pipeline.
How the Code Actually Works
Content Extraction: The Three-Strategy Cascade
extractor.py implements a priority-ordered cascade:
def extract_content(url: str, cfg: ExtractorConfig | None = None) -> ArticleContent:
    # Local file short-circuit — no network required
    if is_local_input(url):
        html = _read_local_file(url)
        if url.endswith(".txt"):
            return ArticleContent(url=url, title=..., text=_clean_text(html))
        return (
            _extract_with_trafilatura(html, url)
            or _extract_with_readability(html, url)
        )

    # Remote URL: fetch once, try strategies in order
    html = _fetch_html(url, cfg)
    content = (
        _extract_with_trafilatura(html, url)
        or _extract_with_readability(html, url)
        or _extract_with_newspaper(url)
    )
    if not content:
        raise ExtractionError(...)
    return content
Three things worth noting here. First, is_local_input() checks for file:// URLs, absolute paths, and relative paths — enabling fully offline operation. Second, HTML is fetched exactly once and passed to all strategies; newspaper3k is the exception because it fetches independently (it has its own HTTP client). Third, each strategy returns None on failure rather than raising — the or chain handles fallthrough cleanly without try/except nesting.
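The return-None-on-failure convention can be sketched generically (hypothetical helper names, not the project's own code):

```python
def _try_strategy(parse, html):
    """Run one extraction strategy; any failure (exception or empty
    result) becomes None so the caller's `or` chain can fall through."""
    try:
        result = parse(html)
        return result if result and result.strip() else None
    except Exception:
        return None


def extract_text(html, strategies):
    """Return the first non-None strategy result, else raise."""
    for parse in strategies:
        text = _try_strategy(parse, html)
        if text is not None:
            return text
    raise ValueError("all extraction strategies failed")
```

Because each strategy is normalized to "result or None", adding a fourth extractor is a one-line change to the chain rather than another nested try/except.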
The _fetch_html function is wrapped with the @retry decorator from utils.py:
@retry(max_attempts=cfg.max_retries, backoff=cfg.retry_backoff,
       exceptions=(requests.RequestException,))
def _get() -> str:
    resp = requests.get(url, headers=headers, timeout=cfg.request_timeout)
    resp.raise_for_status()
    return resp.text
The @retry decorator uses exponential backoff and is composable — the same decorator wraps LLM API calls in summarizer.py with different exception types.
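A minimal version of such a composable retry decorator, assuming the signature shown above (max_attempts, backoff, exceptions), might look like:

```python
import functools
import time


def retry(max_attempts=3, backoff=0.5, exceptions=(Exception,)):
    """Retry with exponential backoff: sleep backoff * 2**attempt
    between failures, re-raising after the final attempt."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts - 1:
                        raise
                    time.sleep(backoff * (2 ** attempt))
        return wrapper
    return decorator
```

Because the exception tuple is a parameter, the same decorator can wrap requests calls in the extractor and LLM client calls in the summarizer without either knowing about the other.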
LLM Summarization: Structured Output + Validation
summarizer.py sends article text to the configured LLM and validates the response before trusting it:
def _build_user_prompt(text: str, title: str) -> str:
    body = truncate_text(text, _MAX_INPUT_CHARS)  # 6000 char cap
    return f"""
Summarize the following article for a 60-second LinkedIn vertical video.
Return ONLY this JSON structure (no code fences):
{{
  "headline": "<5-10 word punchy headline>",
  "summary": "<2-3 sentence spoken narration>",
  "bullets": ["<key point 1>", "<key point 2>", "<key point 3>"]
}}
Rules:
- headline must be ≤ 10 words, no period at end
- summary must be conversational and < 60 words
- bullets must be 3–5 items, each < 15 words
...
"""
The 6,000 character cap on input (_MAX_INPUT_CHARS = 6000) is a deliberate cost and latency control. It covers ~95% of meaningful article content without sending the full text of long-form pieces.
For OpenAI, response_format={"type": "json_object"} is used — this guarantees valid JSON back without needing to strip markdown fences (though _extract_json() handles fence stripping anyway as a defensive measure). For Ollama, "format": "json" in the request payload achieves the same result via the REST API at localhost:11434/api/chat.
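The defensive fence stripping that _extract_json() performs can be sketched roughly like this (a guess at the behavior, not the project's exact code):

```python
import json
import re


def extract_json(raw: str) -> dict:
    """Strip optional markdown code fences before parsing, for models
    that ignore the 'no code fences' instruction in the prompt."""
    cleaned = raw.strip()
    cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)  # leading fence
    cleaned = re.sub(r"\s*```$", "", cleaned)           # trailing fence
    return json.loads(cleaned)
```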
After parsing, _validate_summary_dict() checks the schema before constructing a Summary:
def _validate_summary_dict(data: dict[str, Any]) -> None:
    for key, expected_type in SUMMARY_SCHEMA.items():
        if key not in data:
            raise ValueError(f"Missing key in LLM response: {key!r}")
        if not isinstance(data[key], expected_type):
            raise ValueError(...)
    if not (3 <= len(data["bullets"]) <= 5):
        raise ValueError(...)
If validation fails or the LLM is unreachable, _fallback_summarize() kicks in — a rule-based summarizer that splits text into sentences and uses the first few as the narration and bullets. It produces mediocre but structurally valid output, which is exactly what you want from a fallback.
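A rule-based fallback along those lines might be sketched as follows (a hypothetical version matching the described behavior, not the real _fallback_summarize()):

```python
import re


def fallback_summarize(title: str, text: str) -> dict:
    """Rule-based fallback: the first sentences become the narration,
    the next few become bullets. Mediocre but structurally valid."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    bullets = sentences[2:5] or sentences[:3]
    return {
        "headline": title.rstrip(".")[:80],  # no trailing period, capped length
        "summary": " ".join(sentences[:2]),
        "bullets": bullets,
    }
```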
Video Generation: Timing Budget + Cross-Platform Text Rendering
video_generator.py is the most complex module. It builds slides as MoviePy CompositeVideoClip objects and concatenates them.
Timing budget calculation:
max_dur = float(cfg.max_duration) # 60s default
title_dur = clamp(max_dur * 0.15, 3.0, 8.0)
outro_dur = clamp(max_dur * 0.10, 2.0, 5.0)
remaining = max_dur - title_dur - outro_dur
summary_dur = clamp(remaining * 0.30, 4.0, 12.0)
bullet_budget = remaining - summary_dur
bullet_dur = clamp(bullet_budget / max(n_bullets, 1), 3.0, 12.0)
This proportional allocation guarantees the total never exceeds max_duration regardless of how many bullets the LLM produced, while enforcing minimum per-slide durations so text is actually readable.
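Wrapping the arithmetic above in a function makes that guarantee testable (clamp() is assumed to be the usual min/max combinator):

```python
def clamp(value: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, value))


def timing_budget(max_dur: float, n_bullets: int) -> dict[str, float]:
    """Allocate per-slide durations proportionally, mirroring the
    figures above: 15% title, 10% outro, 30% of the remainder summary,
    the rest split evenly across bullets."""
    title_dur = clamp(max_dur * 0.15, 3.0, 8.0)
    outro_dur = clamp(max_dur * 0.10, 2.0, 5.0)
    remaining = max_dur - title_dur - outro_dur
    summary_dur = clamp(remaining * 0.30, 4.0, 12.0)
    bullet_dur = clamp((remaining - summary_dur) / max(n_bullets, 1), 3.0, 12.0)
    return {"title": title_dur, "summary": summary_dur,
            "bullet": bullet_dur, "outro": outro_dur}
```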
Gradient backgrounds without dependencies:
Rather than loading image files, backgrounds are generated as NumPy arrays:
def _make_gradient(width, height, top_color, bottom_color) -> np.ndarray:
    frame = np.zeros((height, width, 3), dtype=np.uint8)
    for y in range(height):
        t = y / height
        for c in range(3):
            frame[y, :, c] = int(top_color[c] * (1 - t) + bottom_color[c] * t)
    return frame
This is passed directly to ImageClip(bg_arr) — no temporary files, no disk I/O, no ImageMagick involvement for the background.
Cross-platform ImageMagick handling:
This is where most similar projects fall apart on Ubuntu. MoviePy's TextClip internally calls ImageMagick to render text, and Ubuntu ships with a policy.xml that blocks writing PNGs from /tmp — exactly what TextClip does. The failure looks like a missing font, not a permissions error, which makes it nearly impossible to diagnose.
_configure_imagemagick() detects this at startup:
def _check_linux_imagemagick_policy() -> None:
    for policy_path in glob.glob("/etc/ImageMagick-*/policy.xml"):
        content = Path(policy_path).read_text()
        blocked = 'rights="none" pattern="PNG"' in content
        if blocked:
            raise RuntimeError(
                f"ImageMagick's security policy blocks PNG writing.\n\n"
                f"Fix it with:\n"
                f"  sudo sed -i '{sed_expr}' {policy_path}\n\n"
                f"Then re-run the command."
            )
It also detects whether to use magick (IMv7, macOS/Homebrew) or convert (IMv6, Ubuntu apt) and sets the MoviePy binary accordingly — zero manual configuration required from the user.
Font resolution with fallback chain:
_BOLD_FONT_CANDIDATES lists 15+ font paths across macOS (Homebrew arm64/x86, system), Ubuntu (DejaVu, Liberation, Ubuntu font family), and Windows. The first path that exists on disk wins. If none are found, _get_fonts() raises a clear error listing the install command.
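A sketch of that resolution logic, with a hypothetical three-entry candidate list standing in for the real 15+ paths:

```python
from pathlib import Path

# Hypothetical subset of the real candidate list, one entry per platform.
FONT_CANDIDATES = [
    "/System/Library/Fonts/Supplemental/Arial Bold.ttf",     # macOS
    "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf",  # Ubuntu
    r"C:\Windows\Fonts\arialbd.ttf",                         # Windows
]


def resolve_font(candidates=FONT_CANDIDATES) -> str:
    """First candidate that exists on disk wins; otherwise fail with
    an actionable install hint instead of a cryptic render error."""
    for path in candidates:
        if Path(path).is_file():
            return path
    raise RuntimeError(
        "No bold font found. On Ubuntu: sudo apt install fonts-dejavu-core"
    )
```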
Final render:
video.write_videofile(
    output_path,
    fps=cfg.fps,        # 30fps default
    codec="libx264",
    audio_codec="aac",
    preset="medium",
    bitrate="4000k",
    audio_bitrate="128k",
    threads=4,
    logger=None,        # suppress moviepy bar (Rich handles progress)
)
imageio-ffmpeg is bundled as a pip package — no system ffmpeg install required on either macOS or Ubuntu.
CLI Orchestration
cli.py wires all stages together using Rich for progress display. Each stage is wrapped in a Progress context:
# Basic usage
python snackonaiclips.py --url https://techcrunch.com/some-article

# Fully offline: local file + Ollama + no TTS
python snackonaiclips.py \
  --url ./article.html \
  --llm ollama \
  --no-tts \
  --output output.mp4

# Summary only — inspect before rendering
python snackonaiclips.py \
  --url https://example.com/post \
  --summary-only \
  --json-output summary.json

# With thumbnail + cinematic style
python snackonaiclips.py \
  --url https://example.com/post \
  --style cinematic \
  --watermark "MyBrand" \
  --thumbnail
Honest Tradeoffs and Where the Constraints Live
moviepy pinned to ==1.0.3. MoviePy 2.x has breaking API changes. The pin is intentional and documented in requirements.txt. This also forces numpy<2.0.0 since moviepy 1.x is incompatible with NumPy 2.x. Both constraints are explicit in the requirements file with comments explaining why.
NumPy gradient loop vs. vectorized operations. The _make_gradient function uses a Python loop over rows rather than a fully vectorized NumPy operation. For a 1080×1920 frame this is fast enough (~10ms) but could be np.linspace + broadcasting. The loop is more readable; at this scale it doesn't matter.
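For reference, a vectorized equivalent under those assumptions (one small difference: np.linspace includes t = 1.0, so the bottom row hits the bottom color exactly, which the y/height loop never quite does):

```python
import numpy as np


def make_gradient_vectorized(width, height, top, bottom):
    """Same vertical blend as the loop version, via broadcasting:
    a (H, 1, 1) weight column against two (3,) RGB color vectors."""
    t = np.linspace(0.0, 1.0, height, dtype=np.float32)[:, None, None]
    top = np.asarray(top, dtype=np.float32)
    bottom = np.asarray(bottom, dtype=np.float32)
    column = top * (1.0 - t) + bottom * t              # shape (H, 1, 3)
    return np.broadcast_to(column, (height, width, 3)).astype(np.uint8)
```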
gTTS requires internet. The free TTS option calls Google's API. The --no-tts flag exists precisely for offline/air-gapped deployments. ElevenLabs is higher quality but requires an API key and credits.
6,000 char LLM input cap. This covers most articles but truncates very long-form content. The tradeoff is cost and latency control vs. comprehensiveness. For newsletter-length content (1,000–3,000 words) this is not a constraint.
No streaming from Ollama. The Ollama integration uses "stream": False — the full response arrives at once. For a 512 token output this is fast, but it means the CLI shows a spinner for the full LLM latency rather than streaming tokens.
What Burns You in Production
The Ubuntu ImageMagick policy will silently destroy your CI pipeline. A vanilla Ubuntu 22.04 EC2 instance blocks MoviePy TextClip without any obvious error message. The _check_linux_imagemagick_policy() detection was written specifically because this burned hours of debugging. If you're deploying to Ubuntu: add the sed policy fix to your Dockerfile or cloud-init script on day one.
moviepy temp files accumulate. Each video render creates a temporary audio file in tempfile.gettempdir(). The code sets remove_temp=True but MoviePy doesn't always clean up on exceptions. Add a cleanup step to any batch processing loop.
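One possible cleanup step for a batch loop, assuming the /tmp/snackonai_*.mp3 naming mentioned earlier in the pipeline diagram:

```python
import tempfile
from pathlib import Path


def clean_stale_temp_audio(prefix: str = "snackonai_") -> int:
    """Remove leftover temp MP3s that a crashed render left behind.
    Returns the number of files deleted."""
    removed = 0
    for f in Path(tempfile.gettempdir()).glob(f"{prefix}*.mp3"):
        try:
            f.unlink()
            removed += 1
        except OSError:
            pass  # another process may still hold the file; skip it
    return removed
```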
LLM JSON output is non-deterministic at the schema level. Even with response_format={"type": "json_object"}, GPT-4o-mini occasionally returns bullets as a string instead of a list for short articles. The _validate_summary_dict() check catches this before it reaches the video generator. The fallback summarizer ensures a video is always produced even when validation fails.
Font paths differ between macOS ARM and Intel Homebrew. Homebrew on Apple Silicon installs to /opt/homebrew/ while Intel installs to /usr/local/. Both are in the _BOLD_FONT_CANDIDATES list. If you add a new font, add both paths.
Batch processing needs unique output paths. The default --output output.mp4 is fine for interactive use but will stomp on previous outputs in a batch job. Pass a unique path per article (e.g. derived from the URL slug) when processing multiple articles.
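A sketch of deriving that unique output path from the URL slug (hypothetical helper, not part of the CLI):

```python
import re
from urllib.parse import urlparse


def output_path_for(url: str, out_dir: str = ".") -> str:
    """Derive a filesystem-safe output name from the URL's last path
    segment, so batch runs don't overwrite each other's MP4s."""
    slug = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1] or "article"
    slug = re.sub(r"[^A-Za-z0-9_-]+", "-", slug).strip("-") or "article"
    return f"{out_dir}/{slug}.mp4"
```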
What Every Engineer Should Take Away From This
The cascade extractor pattern is the right abstraction for messy real-world HTML. No single library handles every site. The trafilatura → readability → newspaper3k chain fails gracefully at each step and always produces a structured ArticleContent object. Callers never see which strategy ran.
Validate LLM output before trusting it downstream. The _validate_summary_dict() function is only a handful of lines but prevents an entire class of bugs where a malformed LLM response produces a crashed video generator rather than a clear summarization error.
Fail fast with actionable error messages. The ImageMagick policy check raises before any article is fetched or any LLM is called. Users see the exact sed command to fix the problem in under one second, not after a 90-second pipeline run.
Timing budget arithmetic is a product decision encoded in code. The proportional allocation of 60 seconds across slides isn't a technical constraint — it's a content decision about how long each section should feel. The clamp() function enforces minimum slide durations so text is readable even when the budget is tight.
Offline-first is composable, not monolithic. Every stage has an offline alternative: local file instead of URL, Ollama instead of OpenAI, --no-tts instead of gTTS or ElevenLabs, bundled imageio-ffmpeg instead of system ffmpeg. The code doesn't have an "offline mode" — it has individual switches that compose into full offline operation.
More at snackonai.com
A production-grade Python CLI that turns any blog URL or local HTML file into a LinkedIn-ready 1080×1920 MP4 — using a five-stage pipeline of content extraction, LLM summarization, TTS voiceover, and MoviePy video rendering, all runnable locally with zero cloud dependency.
Built with modularity and failure handling as first-class concerns: each stage has an explicit data contract, a fallback strategy, and cross-platform quirks handled automatically so engineers can extend or deploy it without hitting invisible walls.