PixelRAG: Your RAG Pipeline Is Losing 40% of the Evidence Before the LLM Ever Sees It. The Fix Is Screenshots.


PixelRAG (StarTrail-org/PixelRAG, Apache 2.0) eliminates the parser entirely: it renders pages to screenshots, slices them into image tiles, embeds the tiles with a fine-tuned Qwen3-VL-Embedding model, and hands retrieved images directly to a vision-language model. On SimpleQA with a Qwen3.5-4B reader, visual RAG reaches 78.8% vs text-RAG's best of 71.6%, a 7.2 percentage point gain, from a Berkeley SkyLab + BAIR + Databricks team that includes Matei Zaharia.

SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | June 30, 2026

Every RAG pipeline has the same assumption baked in at step one: you can extract the text from a web page, and what you extract is a faithful representation of what the page contains. That assumption is wrong, and the PixelRAG paper (arXiv:2506.00077) quantifies exactly how wrong.

A state-of-the-art HTML parser like Trafilatura recovers far more text than a naive parser, but even the best extractors routinely discard more than 40% of the text that a different extractor would recover from the same page. The parser you pick, which most RAG implementations treat as a one-time configuration choice, determines a large fraction of your downstream answer quality. On SimpleQA, the difference between the worst commonly used parser and Trafilatura is nearly 10 accuracy points. That gap has nothing to do with the retrieval model, the chunk size, the embedding model, or the reader LLM. It is a parser decision made before any ML runs at all.

The problem is sharpest for structured content. Tables, infoboxes, data grids, side panels: when you linearize these into text, you destroy the two-dimensional relationships that make them readable. A table row that answers a factual question gets fragmented into keyword soup. The retrieval model finds the right page but retrieves the wrong chunk, because the chunk that contains the answer has been structurally destroyed by the parser.
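To make the linearization loss concrete, here is a minimal sketch using only Python's standard-library HTML parser as a stand-in for a naive text extractor. The sample table and the TextOnly and flatten names are illustrative assumptions, not code from the PixelRAG repository.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects text nodes and discards every tag, as a naive extractor would."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

# Hypothetical sample table, used only for illustration.
TABLE = """
<table>
  <tr><th>Season</th><th>Opponent</th><th>Result</th></tr>
  <tr><td>2004</td><td>St. Louis Cardinals</td><td>4-0</td></tr>
  <tr><td>2007</td><td>Colorado Rockies</td><td>4-0</td></tr>
</table>
"""

def flatten(html: str) -> str:
    """Return the page text with all row and column structure removed."""
    parser = TextOnly()
    parser.feed(html)
    return " ".join(parser.parts)

flat = flatten(TABLE)
print(flat)
# Season Opponent Result 2004 St. Louis Cardinals 4-0 2007 Colorado Rockies 4-0
```

Every cell survives, but nothing in the output says which value belongs to which column or where one row ends and the next begins. That association is what a chunker and an embedding model would need to retrieve the right row.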

PixelRAG (Yichuan Wang, Zhifei Li, Zirui Wang, Paul Teiletche, Lesheng Jin, Matei Zaharia, Joseph E. Gonzalez, Sewon Min, Berkeley SkyLab + BAIR + Berkeley NLP + Princeton + EPFL + Databricks + Renmin University, released May 2026) attacks this at the root: skip the parser. Render the page as a human would see it. Index the pixels. Let a vision model read them.

Scope: PixelRAG's four-component architecture (Playwright rendering, tile slicing, Qwen3-VL-Embedding + LoRA fine-tuning, FAISS retrieval), the benchmark results on SimpleQA, the token cost implications, and the Claude Code plugin. Not covered: the Visual-RAG benchmark (arXiv:2502.16636) methodology in depth, or PixelRAG's performance on non-Wikipedia corpora.

What It Actually Does

PixelRAG replaces the text extraction step in a RAG pipeline with a visual rendering step. Instead of: HTML → parser → text chunks → text embedding → text retrieval → LLM reader, the pipeline becomes: HTML → Playwright render → screenshot tiles → visual embedding → tile retrieval → VLM reader.

The output of the reader is the same: a natural language answer to a query. The path from web page to answer is entirely different. No text is extracted from the HTML at any point.

The live hosted endpoint, https://api.pixelrag.ai, serves a pre-built visual index of 8.28 million Wikipedia pages (30M+ screenshot tiles) with no setup and no API key required.

# Install
pip install pixelrag

# Render any page to screenshot tiles
pixelshot https://en.wikipedia.org/wiki/Python --output ./tiles

# Search the hosted Wikipedia index — no setup required
curl -X POST https://api.pixelrag.ai/search \
  -H "Content-Type: application/json" \
  -d '{"queries": [{"text": "What is the capital of France?"}], "n_docs": 5}'
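The same search call can be issued from Python with the standard library alone. This is a sketch that mirrors the curl command above: the URL and payload shape are taken from it, while the function names and the timeout are assumptions. The article does not document the response schema, so the helper returns the decoded JSON as-is.

```python
import json
import urllib.request

SEARCH_URL = "https://api.pixelrag.ai/search"  # endpoint quoted in the curl example

def build_search_request(query: str, n_docs: int = 5) -> urllib.request.Request:
    """Build the POST request with the same JSON body as the curl example."""
    payload = {"queries": [{"text": query}], "n_docs": n_docs}
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def search(query: str, n_docs: int = 5, timeout: float = 30.0):
    """Send the request and return the decoded JSON body, whatever its shape."""
    request = build_search_request(query, n_docs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)

# Usage (performs a live network call):
#   results = search("What is the capital of France?")
```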

Key benchmark numbers (Table 1, paper, Qwen3.5-4B reader, k=3):

Method	SimpleQA Accuracy
No retrieval (closed-book)	7.0%
Raw HTML to reader	29.0%
Best text parser (Trafilatura)	71.6%
PixelRAG (visual tiles)	78.8%

The raw HTML number is the most revealing. Feeding the model the full HTML without any parsing actually performs 42+ points WORSE than text-RAG. The tags bloat the context approximately 4x, overwhelming the reader with structure rather than content.
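A rough way to see the bloat on your own pages is to compare raw HTML length with visible-text length. The sketch below counts characters rather than model tokens, so it is only a proxy for the roughly 4x figure; the sample markup and the visible_text and markup_overhead helpers are hypothetical.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Keeps text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    return " ".join(parser.chunks)

def markup_overhead(html: str) -> float:
    """Ratio of raw HTML characters to visible-text characters."""
    return len(html) / max(len(visible_text(html)), 1)

# Hypothetical infobox fragment, for illustration only.
SAMPLE = (
    '<div class="infobox" style="float:right"><script>track()</script>'
    '<table class="wikitable"><tr><th scope="row">Capital</th>'
    '<td><a href="/wiki/Paris" title="Paris">Paris</a></td></tr></table></div>'
)

print(visible_text(SAMPLE))
print(f"{markup_overhead(SAMPLE):.1f}x characters of HTML per character of content")
```

For a token-level measurement, swap len for the reader model's tokenizer.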

The Architecture, Unpacked

Focus on the tile boundary problem: why 875×1024 matters. By fixing width at 875px, PixelRAG ensures a consistent visual "viewport" that matches how the embedding model was trained. The 1024px height is chosen to be large enough to contain most tables and infoboxes as single tiles, avoiding the two-tile split that would separate a table header from its data rows.
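The boundary arithmetic is simple enough to sketch. The helpers below are hypothetical (they are not from the PixelRAG repository) and assume tiles are cut at fixed multiples of the tile height from the top of the page.

```python
import math

TILE_HEIGHT = 1024  # tile height quoted in the article

def tile_count(page_height: int, tile_height: int = TILE_HEIGHT) -> int:
    """Number of fixed-height tiles needed to cover the page."""
    return max(1, math.ceil(page_height / tile_height))

def tiles_spanned(top: int, height: int, tile_height: int = TILE_HEIGHT) -> range:
    """Indices of the tiles that an element at [top, top + height) touches."""
    first = top // tile_height
    last = (top + height - 1) // tile_height
    return range(first, last + 1)

# A 700px-tall table that starts 1100px down the page:
print(list(tiles_spanned(1100, 700)))                   # [1]    one tile at 1024px
print(list(tiles_spanned(1100, 700, tile_height=512)))  # [2, 3] split at 512px
print(tile_count(5000))                                 # 5
```

A taller tile lowers the odds that a table straddles a cut, but it does not remove them: the same table starting at 900px would still span tiles 0 and 1.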

The Code, Annotated

Snippet One: Rendering a Page to Tiles

# PixelRAG: Playwright-based offline page rendering to screenshot tiles
# Reconstructed from StarTrail-org/PixelRAG /render/ and /chromium/ modules
# Design intent: offline rendering decoupled from live network fetches

import asyncio
from pathlib import Path
from playwright.async_api import async_playwright
from PIL import Image

TILE_WIDTH = 875    # fixed: consistent with Qwen3-VL-Embedding training viewport
TILE_HEIGHT = 1024  # fixed: large enough to keep most tables within one tile

async def render_page_to_tiles(
    html_path: str,            # local HTML file (from Kiwix ZIM or crawl)
    output_dir: Path,
    page_id: str,
) -> list[Path]:
    """
    Render a local HTML page to screenshot tiles using headless Chromium.

    ← WHY LOCAL FILES not live URLs:
      Rendering 8.28M Wikipedia pages from the live web would take ~30 days.
      PixelRAG decouples download (one-time bulk export from Kiwix ZIM)
      from rendering (offline Playwright). This lets rendering run at CPU-scale
      in parallel without network bottlenecks or rate limiting.

    ← WHY strip navigation/whitespace:
      Chrome renders navbars and footer boilerplate that carry no information
      for retrieval. Stripping them reduces tile count per page by ~20% and
      removes negative examples from the training signal.
    """
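    output_dir.mkdir(parents=True, exist_ok=True)  # make sure the tile directory exists

    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        # Match the viewport to the tile size so the 875×1024 clip below lies
        # entirely inside the viewport; with a 768px-high viewport the clip
        # would extend past it while y_offset still advanced by TILE_HEIGHT.
        page = await browser.new_page(viewport={"width": TILE_WIDTH, "height": TILE_HEIGHT})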

        # Load local HTML: file:// protocol, no network required
        # ← THIS is the decoupling trick: offline render of pre-downloaded pages
        # as_uri() builds a correctly escaped file:// URL (spaces, Windows drive letters)
        await page.goto(Path(html_path).resolve().as_uri())

        # Strip navigation, header, footer — pure content viewport
        await page.evaluate("""() => {
            for (const sel of ['nav', 'header', '#footer', '.navbox', '.mw-navigation']) {
                document.querySelectorAll(sel).forEach(el => el.remove());
            }
        }""")

        # Get total page height after DOM modification
        total_height = await page.evaluate("document.documentElement.scrollHeight")

        # Slice into tiles by scrolling the viewport
        tile_paths = []
        y_offset = 0
        tile_idx = 0

        while y_offset < total_height:
            await page.evaluate(f"window.scrollTo(0, {y_offset})")
            await page.wait_for_timeout(50)   # let CSS transitions settle

            screenshot_bytes = await page.screenshot(
                clip={"x": 0, "y": 0, "width": TILE_WIDTH, "height": TILE_HEIGHT},
                type="png",
            )

            tile_path = output_dir / f"{page_id}_tile_{tile_idx:04d}.png"
            tile_path.write_bytes(screenshot_bytes)
            tile_paths.append(tile_path)

            y_offset += TILE_HEIGHT
            tile_idx += 1

        await browser.close()
        return tile_paths

# Result for a typical Wikipedia article (~3,000 words, one infobox, two tables):
# 4-6 tiles, each 875×1024 PNG, ~150KB each
# Table in tile 2: header row + data rows stay on same tile → readable by VLM
# vs text parser: table → linearized keyword mush, correct row not retrievable

The file:// protocol is the architectural key. By pre-downloading Wikipedia as a Kiwix ZIM archive and rendering locally, PixelRAG scales to 8.28M pages without ever making a live HTTP request during the indexing phase. The rendering becomes a pure CPU/GPU compute problem with no network dependency.

Snippet Two: Visual Embedding with LoRA Fine-Tuning and FAISS Retrieval

# PixelRAG: Qwen3-VL-Embedding + LoRA fine-tuning for screenshot retrieval
# Reconstructed from StarTrail-org/PixelRAG /embed/ module
# Design intent: adapt a generic visual-language embedding model to the
# specific distribution of web page screenshots (not natural photos)

import torch
from transformers import AutoModel, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType
import faiss
import numpy as np
from pathlib import Path

# ─── EMBEDDING MODEL: Qwen3-VL-Embedding + LoRA ────────────────────────────
def load_pixel_embedder(base_model: str = "Qwen/Qwen3-VL-Embedding-8B"):
    """
    Load Qwen3-VL-Embedding with LoRA adaptation for screenshot retrieval.

    ← WHY Qwen3-VL-Embedding specifically:
      It is a multimodal embedding model: same embedding space for text queries
      and image tiles. Text query → embed → same vector space as tile images.
      This is what enables text-to-image retrieval: no modality gap.

    ← WHY LoRA (not full fine-tuning):
      The base Qwen3-VL-Embedding model was trained on natural images and
      documents. Web screenshots have a very specific distribution:
      white backgrounds, system fonts, structured layouts, dense text.
      LoRA adapts these weights cheaply (~2M additional parameters vs ~8B)
      using contrastive pairs (query, correct tile, negative tiles).
      This avoids catastrophic forgetting: the model retains its general
      visual understanding while specializing on screenshot retrieval.
    """
    model = AutoModel.from_pretrained(base_model, trust_remote_code=True)

    lora_config = LoraConfig(
        task_type=TaskType.FEATURE_EXTRACTION,
        r=16,               # rank: balance between adaptation and parameter cost
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],  # attention projections only
        lora_dropout=0.05,
    )
    # ← THIS is the trick: LoRA only updates ~2M of the 8B parameters
    #   Full fine-tuning would cost 8B gradient updates; LoRA costs ~2M
    model = get_peft_model(model, lora_config)
    return model


def embed_tile(model, image_path: str, device: str = "cuda") -> np.ndarray:
    """
    Embed a single screenshot tile to a dense vector.
    Both tiles (images) and text queries use this same model:
    text queries are embedded as text, tiles as images, into the same space.
    """
    from PIL import Image
    img = Image.open(image_path).convert("RGB")
    # Qwen3-VL-Embedding: pass image directly, returns 768-d vector
    with torch.no_grad():
        embedding = model.encode_image(img, normalize=True)  # L2-normalized
    return embedding.cpu().numpy()


def embed_query(model, query_text: str) -> np.ndarray:
    """Embed a text query to the same vector space as tile images."""
    with torch.no_grad():
        embedding = model.encode_text(query_text, normalize=True)
    return embedding.cpu().numpy()


# ─── FAISS INDEX ────────────────────────────────────────────────────────────
def build_faiss_index(tile_embeddings: np.ndarray, dim: int = 768) -> faiss.IndexFlatIP:
    """
    Build a FAISS inner-product index over the tile vectors.

    ← WHY inner product (not L2):
      Embeddings are L2-normalized → inner product = cosine similarity.
      This is the standard metric for retrieval models. On unit vectors it
      ranks results the same way L2 distance would, so the choice is about
      convention and directly interpretable scores.

    IndexFlatIP is an exact, brute-force index: every query is compared
    against every stored vector. For 30M+ tiles, production PixelRAG likely
    uses an approximate index such as FAISS IVF (inverted file) or HNSW for
    sub-second search, but the core similarity metric is the same.
    """
    index = faiss.IndexFlatIP(dim)
    # FAISS expects a C-contiguous float32 array of shape (n_tiles, dim)
    index.add(np.ascontiguousarray(tile_embeddings, dtype=np.float32))
    return index


def retrieve_tiles(
    index: faiss.IndexFlatIP,
    tile_paths: list[str],
    query_text: str,
    model,
    k: int = 3,
) -> list[str]:
    """
    At query time: embed the query, retrieve top-k tile image paths.
    These image paths are then passed directly to the VLM reader.
    No text extraction from tiles at any point.
    """
    query_vec = embed_query(model, query_text)
    query_vec = query_vec.reshape(1, -1).astype(np.float32)

    # ← Nearest-neighbor search: finds the k tiles whose embeddings best match the query
    #   (exact with IndexFlatIP; the returned values are similarity scores, not distances)
    scores, indices = index.search(query_vec, k)

    return [tile_paths[i] for i in indices[0]]


# ─── END-TO-END QUERY ────────────────────────────────────────────────────────
def answer_query(query: str, retrieved_tile_paths: list[str], vlm_reader) -> str:
    """
    Pass retrieved image tiles directly to VLM reader.
    The VLM sees pixel content: text, tables, infoboxes all as images.
    ← No OCR pre-pass, no text extraction: VLM reads pixels natively.
    """
    from PIL import Image
    images = [Image.open(p).convert("RGB") for p in retrieved_tile_paths]

    # VLM prompt: query + k images
    # The VLM's internal OCR and visual understanding handles the rest
    response = vlm_reader.generate(
        text=f"Answer the following question based on the provided web page screenshots:\n{query}",
        images=images,
        max_new_tokens=256,
    )
    return response

Having embed_tile and embed_query share one model is the central architectural bet. If text queries and image tiles lived in different embedding spaces, you would need a cross-modal alignment step, which is where most multimodal retrieval falls apart. Qwen3-VL-Embedding produces a unified space, so the LoRA contrastive fine-tuning can train directly on (query, correct tile) pairs without worrying about a modality gap.
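The contrastive objective itself can be sketched in a few lines of NumPy. The article does not specify the paper's exact loss, temperature, or negative-sampling scheme, so treat this as a standard InfoNCE formulation with in-batch negatives under those assumptions, not as PixelRAG's training code.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce_loss(query_emb: np.ndarray, tile_emb: np.ndarray,
                  temperature: float = 0.05) -> float:
    """
    InfoNCE with in-batch negatives.
    Row i of query_emb is paired with row i of tile_emb (the correct tile);
    every other tile in the batch acts as a negative for that query.
    The temperature value is an assumption.
    """
    q = l2_normalize(query_emb)
    t = l2_normalize(tile_emb)
    logits = q @ t.T / temperature                 # (batch, batch) cosine / temperature
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # cross-entropy on the diagonal

# Synthetic embeddings standing in for model outputs.
rng = np.random.default_rng(0)
tiles = rng.normal(size=(8, 32))
aligned_queries = tiles + 0.01 * rng.normal(size=(8, 32))  # queries near their tiles
shuffled_queries = np.roll(tiles, 1, axis=0)               # queries near the wrong tile

aligned_loss = info_nce_loss(aligned_queries, tiles)
shuffled_loss = info_nce_loss(shuffled_queries, tiles)
print(aligned_loss < shuffled_loss)
```

Because both inputs come out of the same model in the same vector space, the loss needs no projection head or alignment layer between modalities.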

It In Action: End-to-End Query Against the Wikipedia Index

Task: "What year did the Boston Red Sox win their first World Series after the 86-year drought?"

Step 1: Query embedding

Input text: "What year did the Boston Red Sox win their first World Series after the 86-year drought?"
Qwen3-VL-Embedding: text → 768-d vector [0.021, -0.184, 0.092, ...]
Normalized: ||v|| = 1.0
Query time: ~15ms

Step 2: FAISS ANN search over 30M Wikipedia tiles

Index size: 30M tile vectors × 768 dimensions × 4 bytes = ~92GB (float32)
Search: inner product (cosine similarity) on normalized vectors
Top-k = 3 tiles retrieved

Results:
  Tile 1: Boston_Red_Sox_wiki_tile_0003.png (score: 0.847)
    → Contains the infobox: "2004 World Series · Boston Red Sox"
    → Text parser would have extracted: "2004 World Series Boston Red Sox 4–0
       St. Louis Cardinals" — fragmented into one long chunk, not a table row

  Tile 2: 2004_World_Series_tile_0001.png  (score: 0.831)
    → Contains championship year prominently in the page header

  Tile 3: Curse_of_the_Bambino_tile_0002.png  (score: 0.814)
    → Contains table with years and context

ANN search latency: ~28ms for 30M vectors (FAISS inner product)
Total retrieval: 43ms

Step 3: VLM reading

VLM: Qwen3.5-4B
Input: query text + 3 tile images (875×1024 PNG each)
Token cost: ~3,200 tokens (text) + 3 images × ~1,800 visual tokens = ~8,600 total
vs text-RAG: 3 text chunks × ~600 tokens = ~1,800 tokens

← PixelRAG uses MORE tokens than text-RAG per query in raw mode
   But with tile compression techniques (from paper): ~3x reduction vs uncompressed
   Optimized mode: similar or lower token cost than text, at higher accuracy

VLM answer: "2004"
Correct answer: 2004

Generation latency: ~800ms on A10 GPU
Total e2e: ~850ms

Step 4: SimpleQA accuracy, why text fails this query

Text parser (Trafilatura) output for Boston Red Sox infobox:
  "World Series champions (2004, 2007, 2013, 2018)\nAmerican League champions..."
  → A flat list, no table structure
  → Correct chunk retrieved? Sometimes yes, but "2004" is buried in a list
    with other years → LLM must select from multiple numbers with no visual context

PixelRAG tile for same infobox:
  → A rendered image showing "2004 World Series" in bold header of the champions
    section, visually distinct from other championship years
  → VLM reads the visual hierarchy and correctly identifies 2004

Accuracy gap on table-heavy questions: more than 10pp vs best text parser
Accuracy gap on all SimpleQA questions: 7.2pp (78.8% vs 71.6%)

Why This Design Works, and What It Trades Away

The core insight is that HTML parsing is not a preprocessing step, it is a lossy compression. Every parser makes decisions about what constitutes "content" and what constitutes "structure," and those decisions are wrong for a non-trivial fraction of pages. Bypassing the parser by rendering pages as images delegates those decisions to a model (the VLM) that has been trained on two-dimensional visual content and can understand layout, visual hierarchy, and structured data natively.

The FAISS retrieval step is where the system's scalability comes from. FAISS ANN search over 30M vectors at ~28ms is the kind of performance that makes production deployment viable. The academic precedent for this approach is ColPali (visual document retrieval using late interaction), which demonstrated that visual page embeddings can outperform text-based embeddings for document retrieval tasks. PixelRAG applies the same insight at web scale with a hosted Wikipedia corpus.
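For intuition about what the index computes, here is exact top-k inner-product search in plain NumPy. This is the brute-force operation that a flat index performs; approximate indexes such as IVF or HNSW trade a little recall for speed. The vectors are synthetic and the function name is an assumption.

```python
import numpy as np

def top_k_inner_product(query: np.ndarray, tiles: np.ndarray, k: int = 3):
    """Exact top-k by inner product. On unit vectors this is cosine similarity."""
    scores = tiles @ query                       # one score per tile
    top = np.argpartition(-scores, k - 1)[:k]    # unordered top-k in O(n)
    top = top[np.argsort(-scores[top])]          # sort only the k winners
    return top, scores[top]

rng = np.random.default_rng(0)
tiles = rng.normal(size=(1000, 64)).astype(np.float32)
tiles /= np.linalg.norm(tiles, axis=1, keepdims=True)

# A query that sits very close to tile 7
query = tiles[7] + 0.01 * rng.normal(size=64).astype(np.float32)
query /= np.linalg.norm(query)

indices, scores = top_k_inner_product(query, tiles, k=3)
print(indices[0], scores)
```

The cost of this exact scan grows linearly with the number of tiles, which is why an index at the 30M-tile scale is usually approximate.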

The LoRA fine-tuning on screenshot data is the correct approach for adapting a visual embedding model to a domain-specific distribution. Web screenshots are structurally different from the natural images and documents in Qwen3-VL-Embedding's training distribution. Contrastive fine-tuning with (query, correct tile, negative tiles) provides direct gradient signal for the retrieval task without requiring full model retraining.

What PixelRAG trades away:

Token cost at inference is the main sacrifice. Each retrieved tile contains roughly 1,800 visual tokens when processed by the VLM, versus ~600 tokens for a text chunk. At k=3, the VLM processes ~5,400 visual tokens of retrieved content versus ~1,800 text tokens. The paper addresses this with tile compression techniques achieving ~3x token reduction, but uncompressed PixelRAG is still more expensive per query than text-RAG on token metrics.
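The token arithmetic above, as a small sketch. The per-item token counts and the 3x compression factor are the article's figures; the function name is hypothetical.

```python
def retrieved_tokens(k: int, tokens_per_item: int, compression: float = 1.0) -> int:
    """Tokens of retrieved context the reader must process for k items."""
    return round(k * tokens_per_item / compression)

K = 3
visual_raw = retrieved_tokens(K, 1_800)                          # uncompressed tiles
visual_compressed = retrieved_tokens(K, 1_800, compression=3.0)  # ~3x tile compression
text_chunks = retrieved_tokens(K, 600)                           # text-RAG chunks

print(visual_raw, visual_compressed, text_chunks)  # 5400 1800 1800
```

At these figures, a 3x reduction brings the visual context to parity with three text chunks, which is where the "similar or lower token cost" claim comes from.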

The 30-million tile FAISS index requires substantial memory. At float32, 30M × 768 dimensions = ~92GB. The hosted index compresses this, but self-hosting PixelRAG at Wikipedia scale requires significant GPU or CPU memory infrastructure that is far beyond what most teams run for a text-based FAISS index of the same corpus.
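The memory figure follows directly from vector count, dimension, and bytes per component. The float16 and int8 rows below are illustrative assumptions about possible storage formats; the article does not say how the hosted index is compressed.

```python
def index_size_gb(n_vectors: int, dim: int, bytes_per_component: int) -> float:
    """Raw vector storage in decimal gigabytes, ignoring index overhead."""
    return n_vectors * dim * bytes_per_component / 1e9

N_TILES = 30_000_000
DIM = 768

sizes = {
    "float32": index_size_gb(N_TILES, DIM, 4),
    "float16": index_size_gb(N_TILES, DIM, 2),  # assumption: half-precision storage
    "int8": index_size_gb(N_TILES, DIM, 1),     # assumption: 8-bit scalar quantization
}
for name, gb in sizes.items():
    print(f"{name}: {gb:.2f} GB")
# float32: 92.16 GB
# float16: 46.08 GB
# int8: 23.04 GB
```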

Rendering latency for new pages is high. Playwright + headless Chromium for one page takes 2-5 seconds. For a corpus that changes frequently (news sites, live documentation), the rendering latency means the index is always slightly stale. Text-based systems can update much more frequently because text extraction is orders of magnitude faster than full-page rendering.
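A back-of-envelope estimate shows why rendering has to be parallelized at corpus scale. The 2-5 second range and the 8.28M page count come from the article; the worker count is a hypothetical parameter, and the estimate ignores embedding time and scheduling overhead.

```python
SECONDS_PER_DAY = 86_400

def render_days(n_pages: int, seconds_per_page: float, workers: int = 1) -> float:
    """Wall-clock days to render n_pages with the given number of parallel workers."""
    return n_pages * seconds_per_page / workers / SECONDS_PER_DAY

N_PAGES = 8_280_000

for workers in (1, 64):  # 64 workers is an illustrative assumption
    low = render_days(N_PAGES, 2.0, workers)
    high = render_days(N_PAGES, 5.0, workers)
    print(f"{workers:>2} worker(s): {low:.1f} to {high:.1f} days")
#  1 worker(s): 191.7 to 479.2 days
# 64 worker(s): 3.0 to 7.5 days
```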

Technical Moats

The hosted 8.28M Wikipedia tile index at api.pixelrag.ai. The engineering work to render, tile, embed, and index 8.28M Wikipedia pages into a FAISS-searchable 30M-tile corpus is months of compute work. A team trying to replicate PixelRAG's results for their own benchmarking needs to either use the hosted endpoint or reproduce this indexing pipeline from scratch. The hosted endpoint is free, no API key required, which is a deliberate adoption strategy: the moat is not the API key, it is the indexed corpus and the fine-tuned Qwen3-VL-Embedding weights.

LoRA-adapted Qwen3-VL-Embedding for screenshot retrieval. The contrastive fine-tuning on (text query, web screenshot tile) pairs is what makes retrieval work at 78.8% accuracy rather than just decent accuracy. The base Qwen3-VL-Embedding model without this fine-tuning would perform worse because its training distribution emphasizes natural images. The PixelRAG weights are specific to the web screenshot distribution. A team self-hosting would need to generate or acquire a large training set of (text query, correct tile) pairs, which requires having the answer annotations from SimpleQA or a similar benchmark mapped to specific page tiles.

The Claude Code plugin ecosystem. claude plugin install pixelbrowse@pixelrag-plugins gives AI coding agents visual web access as a native tool. This is not just a demo: it is positioning PixelRAG as the default visual web retrieval layer for agent pipelines. An agent using pixelbrowse can search Wikipedia visually without being restricted to whatever text a parser would extract. As agent frameworks standardize on tool interfaces, being the established visual search tool in the Claude Code plugin registry creates an adoption flywheel that is hard for a later entrant to overcome.

Insights

Insight One: The 40% text loss from HTML parsing is not a parser quality problem. It is a structural problem that better parsers cannot solve. The best parsers (Trafilatura, Readability) lose content not because they are poorly implemented but because they make correct decisions under the constraints of text linearization: they strip navigation boilerplate (correct), collapse whitespace (correct), and flatten tables (inevitable). Flattening tables is correct from a text extraction standpoint but wrong from an information retrieval standpoint. PixelRAG does not improve the parser; it removes the linearization constraint entirely. This is why the 7.2 percentage point gap is unlikely to close as parsers improve. The gap exists because the task is fundamentally harder for text than for visual models on structured content, not because text parsers are behind the frontier.

Insight Two: PixelRAG's performance advantage is concentrated in a specific subset of questions, and the headline "18.1% improvement" obscures this. On text-heavy Wikipedia articles with no tables or infoboxes, the gap between PixelRAG and text-RAG is much smaller, because linearization does not destroy the structure when there is no structure to destroy. The paper reports 18.1% in one framing and 7.2pp in another (78.8% vs 71.6%) depending on the comparison baseline. The 7.2pp figure is the direct head-to-head against the best text parser on the same SimpleQA benchmark. The 18.1% figure likely refers to relative improvement over a weaker text baseline. Engineers building real systems should calibrate expectations: the gain is real, material, and concentrated in exactly the cases where structured web content is most important (product data, statistics, sports records, financial figures in tables).
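The two framings are easy to separate with a little arithmetic. The accuracies are the article's; the implied baseline is pure algebra on the 18.1% figure, not a number reported anywhere in the source.

```python
PIXELRAG = 78.8      # SimpleQA accuracy, %
TRAFILATURA = 71.6   # best text parser, %

def absolute_gain_pp(new: float, base: float) -> float:
    """Gain in percentage points."""
    return new - base

def relative_gain_pct(new: float, base: float) -> float:
    """Gain as a percentage of the baseline."""
    return 100 * (new - base) / base

def implied_baseline(new: float, relative_pct: float) -> float:
    """Baseline accuracy that would make `new` a relative_pct improvement."""
    return new / (1 + relative_pct / 100)

print(round(absolute_gain_pp(PIXELRAG, TRAFILATURA), 1))   # 7.2
print(round(relative_gain_pct(PIXELRAG, TRAFILATURA), 1))  # 10.1
print(round(implied_baseline(PIXELRAG, 18.1), 1))          # 66.7
```

Against Trafilatura the relative gain is about 10%, so an 18.1% relative improvement would correspond to a baseline near 66.7%, which is consistent with the reading that it was measured against a weaker text pipeline.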

Surprising Takeaway

The visual query support in PixelRAG's API (not just text-to-image but image-to-image search) is the feature that changes what retrieval-augmented generation can be. The API accepts an image as the query, not just text. This means an agent can submit a screenshot of what it sees on its screen and retrieve the most similar Wikipedia tiles. In the context of vision-language agents that observe their environment visually, this enables a retrieval pattern that has no text-RAG equivalent: "find Wikipedia pages that look like what I am looking at." A web agent that sees a table it does not understand can query PixelRAG with an image of the table and retrieve the Wikipedia page most visually similar to it. This is a different capability than "find Wikipedia pages about the topic of this table," and it suggests a broader role for visual retrieval in agent systems that interact with the web through visual observation rather than DOM parsing.

TL;DR For Engineers

PixelRAG (StarTrail-org/PixelRAG, Apache 2.0, Berkeley SkyLab + Databricks) eliminates HTML parsing from RAG by rendering pages to screenshot tiles with Playwright + headless Chromium, embedding tiles with fine-tuned Qwen3-VL-Embedding + LoRA, and retrieving via FAISS. On SimpleQA with Qwen3.5-4B reader at k=3: text-RAG best (Trafilatura) = 71.6%, PixelRAG = 78.8%.
The 40% text loss problem is structural, not fixable by better parsers. HTML parsers must linearize two-dimensional content; tables, infoboxes, and visual hierarchies lose their structure in the process. PixelRAG bypasses this by delegating reading to a VLM that handles layout natively.
Architecture: Kiwix ZIM archive → offline Playwright render → 875×1024px tiles → Qwen3-VL-Embedding (LoRA-tuned on contrastive screenshot pairs) → FAISS ANN index → top-k tile images → VLM reader. 8.28M Wikipedia pages = 30M+ tiles. Hosted at api.pixelrag.ai, no API key required.
Token cost tradeoff: each tile = ~1,800 visual tokens vs ~600 text tokens for a chunk. 3x token reduction possible with tile compression. Self-hosting at Wikipedia scale requires ~92GB memory for the FAISS float32 index.
The pixelbrowse Claude Code plugin gives AI agents visual web retrieval as a native tool. The API also supports image-as-query (visual search), enabling retrieval patterns with no text-RAG equivalent.

The Parser Was Always the Weakest Link

Text RAG's accuracy problem was never the embedding model, the retrieval strategy, or the reader LLM. It was the parser at step one, quietly discarding 40% of recoverable information before any ML ran. PixelRAG's contribution is making this loss visible by measuring it, and then eliminating it by skipping the lossy step entirely.

The 7.2pp gain on SimpleQA is not the ceiling. It is the gain achievable on a general-purpose benchmark against a corpus that is mostly text-heavy prose. On a corpus of structured web content, financial data, technical documentation with diagrams, or any domain where tables and visual layout carry the answer, the gap should be larger. That is the deployment target PixelRAG is built for.

References

PixelRAG GitHub Repository, StarTrail-org, Apache 2.0
Web Screenshots Beat Text for Retrieval-Augmented Generation, arXiv:2506.00077, Wang, Li, Zaharia, Gonzalez et al., May 2026
Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation, arXiv:2502.16636, February 2025
PixelRAG live API and demo

Summary

PixelRAG (arXiv:2506.00077, StarTrail-org/PixelRAG, Apache 2.0, Berkeley SkyLab + BAIR + Databricks, May 2026) eliminates HTML parsing from RAG by rendering web pages to screenshot tiles using offline Playwright + headless Chromium, embedding tiles with contrastively fine-tuned Qwen3-VL-Embedding + LoRA, storing 30M+ tile vectors in a FAISS index, and passing retrieved tile images directly to a VLM reader. The root problem it attacks: HTML parsers discard 40%+ of recoverable text and linearize tables into unretrievable keyword mush, with parser choice alone shifting SimpleQA accuracy by ~10 points. PixelRAG achieves 78.8% on SimpleQA (Qwen3.5-4B reader, k=3) vs the best text parser at 71.6%, a 7.2pp gain concentrated in structured-content questions. The hosted API at api.pixelrag.ai serves 8.28M Wikipedia pages as pre-indexed visual tiles with no API key required, and a Claude Code plugin (pixelbrowse) gives agents visual web retrieval including image-as-query search.
