Why This Matters Now

Every time you call DALL-E or Replicate, you're paying for three things you don't need: a round-trip across the public internet, a queue behind thousands of other requests, and a margin that funds someone else's GPU cluster.

The better path has existed for two years. Most people just haven't taken it seriously yet.

Mr. Pixel Smith is a Python CLI that generates images locally using Ollama and watermarks them automatically with Pillow.

Built by SnackOnAI and available for free, this CLI is now a core part of SnackOnAI’s workflow for generating high-quality newsletter images.

Generated by Mr. Pixel Smith with the prompt:
“A character named Mr. Pixel Smith floating among clouds and stars, holding a glowing hammer made of pixels, cinematic lighting, ultra realistic, dramatic sky, high detail, 4k”

It's small by design. But the patterns inside it — preflight service checks, subprocess error handling, graceful degradation — are the same ones that will break your production AI pipeline if you get them wrong.

The Problem Most Builders Ignore

Wrapping an inference service isn't just calling it. It's handling every state the service can be in.

Ollama, like any long-running daemon, can be missing, installed-but-not-running, running-but-hung, or running-with-the-wrong-model-loaded. Most wrappers handle none of these. They call the service, get a cryptic error or empty output, and surface a useless stack trace to the user.

The six failure domains that matter:

| Failure | Root Cause | Right Response |
| --- | --- | --- |
| Binary not found | Not installed | Exit with install link |
| Daemon not running | Not started | Exit with ollama serve hint |
| Model not pulled | Model not downloaded (shows up as empty JSON) | Exit with ollama pull hint |
| Bad input | Empty prompt, invalid dims | Validate before subprocess call |
| Corrupt output | Bad base64 from model | Surface error with context |
| Watermark crash | Font missing, wrong image mode | Fall back to original, log warning |

Fix all six or you don't have a tool, you have a demo.

How It Actually Works

The Inference Pipeline

When you send a prompt to Ollama's image model, here's what happens:

TEXT PROMPT
    |
    v
[Text Encoder]     -->  768-dim embedding vector
    |
    v
[Latent Diffusion] -->  Iterative denoising (20-50 steps)
    |
    v
[VAE Decoder]      -->  Pixel-space image
    |
    v
Base64 PNG         -->  Returned in JSON response

The width and height parameters don't just control output resolution; they control the size of the latent tensor being denoised. A 1200×628 image carries nearly 3× the pixel count of a 512×512, and at an equivalent step count compute scales at least proportionally with that. Resolution choices have real latency consequences.
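To make the latency point concrete, here's a back-of-envelope sketch of how latent size, and therefore per-step compute, scales with requested resolution. It assumes a typical VAE downsampling factor of 8; exact factors vary by model.

```python
def latent_shape(width: int, height: int, downsample: int = 8) -> tuple[int, int]:
    """Spatial dimensions of the latent tensor being denoised.
    The 8x downsampling factor is a common default, not universal."""
    return width // downsample, height // downsample

def relative_cost(w1: int, h1: int, w2: int, h2: int) -> float:
    """Approximate compute ratio at equal step count: proportional to latent area."""
    (a, b), (c, d) = latent_shape(w1, h1), latent_shape(w2, h2)
    return (a * b) / (c * d)

# A 1200x628 generation vs a 512x512 one:
print(latent_shape(1200, 628))                        # (150, 78)
print(round(relative_cost(1200, 628, 512, 512), 2))   # 2.86
```

With attention layers in the mix, the real ratio can be worse than this area-proportional estimate, so treat it as a floor.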

The Watermark Pipeline

Watermarking in Pillow is three operations composed:

  1. RGBA conversion — Forces 4-channel mode regardless of source format

  2. Tiled diagonal repeat — Text at opacity 60/255, covering the full image. Not impossible to strip out, but not trivial either.

  3. Corner stamp — High-contrast at 220/255 with a dark background rect for legibility at any image brightness

The graceful degradation path — save the original if watermarking fails — is the right call. Failing to watermark is always less bad than losing the image.

The Code That Actually Matters

Preflight Check

import shutil
import subprocess
import sys

def check_ollama() -> None:
    if not shutil.which("ollama"):
        print("Error: 'ollama' not found. Install: https://ollama.com")
        sys.exit(1)

    try:
        result = subprocess.run(
            ["ollama", "list"],
            capture_output=True, text=True, timeout=5
        )
        if result.returncode != 0:
            print("Ollama installed but not running. Run: ollama serve")
            sys.exit(1)
    except subprocess.TimeoutExpired:
        print("Ollama not responding. Is 'ollama serve' running?")
        sys.exit(1)

The 5-second timeout is deliberate. ollama list returns in under 100ms if the daemon is healthy. If it times out, image generation will too — fail fast here rather than hanging for 2 minutes.

Validated Input Loop

def get_int_input(prompt_text, default, min_val=64, max_val=4096):
    while True:
        raw = input(f"{prompt_text} [default: {default}]: ").strip()
        if raw == "":
            return default
        try:
            value = int(raw)
            if not (min_val <= value <= max_val):
                print(f"  Enter a value between {min_val} and {max_val}.")
                continue
            return value
        except ValueError:
            print("  Invalid — enter a whole number.")

The bounds [64, 4096] aren't arbitrary guardrails. Below 64px, diffusion models produce incoherent noise. Above 4096px on consumer hardware, you're likely to OOM before you get an output.

Font Fallback (The Version Most People Write Wrong)

# Wrong: raises OSError on any system without DejaVu at this exact path
font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", size)

# Right: try a prioritized list, log the fallback
import logging
from PIL import ImageFont

FONT_CANDIDATES = [
    "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf",  # Linux
    "/System/Library/Fonts/Helvetica.ttc",                   # macOS
    "C:/Windows/Fonts/arial.ttf",                            # Windows
]

def find_font(size: int):
    for path in FONT_CANDIDATES:
        try:
            return ImageFont.truetype(path, size)
        except OSError:  # IOError is an alias of OSError in Python 3
            continue
    logging.warning("No system font found. Using default bitmap font.")
    return ImageFont.load_default()

ImageFont.load_default() returns a bitmap font designed for thumbnails. At 1200px it looks terrible. Never use it as a silent default.

Architecture: What the System Actually Looks Like

USER (Terminal)
    |
    | input(): prompt, width, height, output_path
    v
┌─────────────────────────────────────┐
│          MAIN ORCHESTRATOR          │
│  1. check_ollama()  — preflight     │
│  2. get_int_input() — validated I/O │
│  3. generate_image() — subprocess   │
│  4. add_watermark() — Pillow comp.  │
│  5. write_bytes()   — disk write    │
└─────────────────────────────────────┘
         |                |
         v                v
  [Ollama Daemon]    [Pillow / PIL]
  localhost:11434    In-process CPU
  GPU/CPU inference  compositing

The data flow is a straight pipeline: str → subprocess → JSON → base64 → bytes → PIL Image → watermarked bytes → file. Each transformation has exactly one failure mode. Handle them all or handle none.
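One way to keep that "one failure mode per transformation" property honest is to wrap each stage so errors carry the stage name. A sketch of two stages from the middle of the pipeline (the decorator and stage names are illustrative, not from the tool itself):

```python
import base64
import json

class PipelineError(RuntimeError):
    """Failure at a named pipeline stage, carrying the stage name as context."""

def stage(name):
    """Decorator: re-raise any failure as a PipelineError tagged with the stage."""
    def wrap(fn):
        def inner(x):
            try:
                return fn(x)
            except Exception as exc:
                raise PipelineError(f"{name}: {exc}") from exc
        return inner
    return wrap

@stage("parse JSON")
def parse_response(raw: str) -> str:
    return json.loads(raw)["response"]

@stage("decode base64")
def decode_image(b64: str) -> bytes:
    return base64.b64decode(b64, validate=True)

# Happy path: JSON -> base64 field -> raw image bytes
raw = json.dumps({"response": base64.b64encode(b"PNG...").decode()})
assert decode_image(parse_response(raw)) == b"PNG..."
```

When something breaks, the user sees "parse JSON: ..." or "decode base64: ..." instead of a bare stack trace, which is exactly the context the article argues most wrappers throw away.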

⚡ Insight #1: Subprocess Is the Wrong Abstraction

Everyone reaches for subprocess.run() because it's familiar. It's also fragile.

The current implementation calls the Ollama CLI and parses its stdout as JSON. That works until Ollama adds a progress indicator, a warning message, or changes its output schema, and then it breaks silently, returning empty or malformed output with no error.

The REST API at http://localhost:11434/api/generate has an actual versioned contract. It supports streaming. It gives you proper HTTP error codes. It doesn't require spawning a new process on every generation.

# Better: call the REST API directly
import httpx

response = httpx.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "x/z-image-turbo",
        "prompt": prompt,
        "stream": False,  # one JSON object instead of streamed NDJSON chunks
    },
    timeout=300,
)
response.raise_for_status()
image_b64 = response.json()["response"]

Two lines cleaner, dramatically more robust. Use subprocess for quick prototypes. Use REST for anything that runs in production.

⚡ Insight #2: Local-First Is a Competitive Moat, Not Just a Cost Play

The common framing is: "use local inference to save money." That's true but undersells it.

The real advantage is architectural independence. When your product's image generation runs locally, you're immune to API rate limits, outages, pricing changes, and terms-of-service updates. You can run in air-gapped environments. You can guarantee data doesn't leave the machine. You can tune inference parameters the cloud API doesn't expose.

Founders building cost-sensitive products should be asking: "what is the total cost of API dependency?", not just the per-image price. The answer almost always favors local-first for internal tools and batch workloads, and cloud for user-facing real-time generation where quality SLAs matter.

💡 Surprising Takeaway: The Watermark Is a Product Decision, Not a Technical One

Most engineers treat watermarking as an implementation detail. It's not.

A watermark communicates provenance. It's how your brand travels when generated images get shared, reused, or embedded somewhere you didn't intend. For a content tool, the watermark is distribution. For an enterprise tool, it's audit trail. The opacity, placement, and content of the watermark are product requirements that should come from a spec, not from whatever value happened to get hardcoded during prototyping.

Adaptive opacity based on regional image luminance is the right production implementation. A watermark at 60/255 opacity is invisible on a bright sky and visible on a dark forest. Compute the mean luminance of the stamp region and adjust accordingly.
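A minimal sketch of that idea as a pure function, with illustrative thresholds (the base and boost values are assumptions to tune against real images, not measured constants):

```python
def adaptive_opacity(pixels, base: int = 60, boost: int = 160) -> int:
    """Map mean luminance of the stamp region (RGB tuples) to an alpha value.
    Brighter regions get a more opaque stamp; base/boost are illustrative."""
    if not pixels:
        return base
    # Rec. 601 luma approximation, averaged over the stamp region
    luma = sum(0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels) / len(pixels)
    # Scale opacity up as the background gets brighter
    scale = luma / 255.0
    return min(255, base + int(boost * scale))

bright_sky = [(250, 250, 250)] * 10   # near-white region
dark_forest = [(10, 20, 10)] * 10     # near-black region
assert adaptive_opacity(bright_sky) > adaptive_opacity(dark_forest)
```

In a Pillow implementation, the pixel list would come from cropping the stamp region and reading its RGB data before compositing.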

What Breaks at Scale

Mr. Pixel Smith is a single-user CLI. These assumptions fail the moment you serve multiple users:

  • Ollama is single-threaded by default — concurrent requests queue, they don't parallelize

  • subprocess.run() is blocking — a web server calling this blocks its entire thread

  • output.png is hardcoded — concurrent users stomp each other's files

  • No request correlation — impossible to trace which generation belongs to which user

The production path requires a task queue, async REST calls to Ollama, per-request unique output paths, and object storage (S3) instead of local disk.

HTTP Request
    |
    v
[FastAPI]
    |
    v
[Celery + Redis]    <-- decouple generation from request
    |
    v
[Worker] --> POST localhost:11434/api/generate
    |
    v
[S3 Upload]  <-- watermarked PNG
    |
    v
Signed URL returned to client

None of this is complicated. It's just not in scope for a CLI tool.
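One of those pieces, per-request unique output paths, is a one-liner worth getting right early. A sketch with a hypothetical key layout (adapt the prefix to your own bucket conventions):

```python
import uuid
from datetime import datetime, timezone

def output_key(user_id: str, ext: str = "png") -> str:
    """Collision-free object key per generation request.
    The date prefix aids S3 lifecycle rules; uuid4 guarantees uniqueness.
    Hypothetical layout, not from the tool itself."""
    day = datetime.now(timezone.utc).strftime("%Y/%m/%d")
    return f"generations/{day}/{user_id}/{uuid.uuid4().hex}.{ext}"

k1, k2 = output_key("user-42"), output_key("user-42")
assert k1 != k2                      # no stomping between concurrent requests
assert k1.startswith("generations/")
```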

Tool Comparison

| Tool | Best For | The Catch |
| --- | --- | --- |
| Ollama + Mr. Pixel Smith | Local, private, zero API cost | Setup required, GPU helpful |
| Stable Diffusion WebUI | Mature ecosystem, 1000s of models | Browser-based, not CLI-native |
| ComfyUI | Complex multi-step workflows | Steep learning curve |
| Replicate API | Zero setup, managed GPU | ~$0.003/image, data leaves machine |
| DALL-E 3 | Best prompt adherence | Most expensive, requires internet |
| HuggingFace Diffusers | Maximum pipeline flexibility | Write inference code yourself |

Use local when privacy, cost at scale, or offline operation matter. Use cloud when quality SLAs and latency requirements genuinely demand it.

Five Things People Get Wrong

1. Trusting stdout format. json.loads(result.stdout) breaks silently when Ollama changes its output. Use the REST API.

2. Ignoring stderr. Most subprocess wrappers only check returncode. Ollama writes diagnostics to stderr. Log it even on success.

3. Not accounting for cold start. First generation after model load is 3-5× slower. Build this into your timeout config and user messaging.

4. Hardcoding watermark opacity. A fixed opacity looks wrong on half your images. Adapt to regional luminance.

5. The font fallback trap. load_default() is a bitmap font. At 1200px it looks like 1995. Always try a system font list first.
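Point 2 is the cheapest fix on the list. A sketch of a subprocess wrapper that logs stderr even on exit code 0 (the helper name is illustrative):

```python
import logging
import subprocess
import sys

def run_logged(cmd: list[str], timeout: float = 5.0) -> str:
    """Run a command, logging stderr even when the exit code is 0."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    if result.stderr:
        # Diagnostics often land on stderr even on success -- keep them.
        logging.warning("stderr from %s: %s", cmd[0], result.stderr.strip())
    result.check_returncode()
    return result.stdout

# A process that succeeds but still writes a diagnostic to stderr:
out = run_logged([sys.executable, "-c",
                  "import sys; sys.stderr.write('warming up model\\n'); print('ok')"])
assert out.strip() == "ok"
```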

How to Think About This as a Builder

Mr. Pixel Smith is a specific instance of a pattern you'll encounter constantly: wrapping an inference service with validation, post-processing, and failure handling.

The same pattern applies to LLM APIs, speech-to-text, OCR, embedding models, any service with its own lifecycle that can be in an unknown state when your code calls it.

The quality of the wrapper often determines the user experience more than the quality of the underlying model. A great model with a brittle wrapper feels worse than a decent model with a solid one.

TL;DR for Founders

Local AI inference is production-viable right now. Ollama is a single CLI install. Pillow watermarking is CPU-only, fast, and free. The real work is the wrapper: validation, preflight checks, graceful degradation.

Build on this in this order:

  1. Swap subprocess for REST API calls to localhost:11434

  2. Make watermark opacity adaptive to image luminance

  3. Add a cross-platform font candidate list

  4. Wrap in FastAPI + Celery for multi-user serving

  5. Move output to S3 with per-request UUIDs

The economics of local inference only improve from here. Hardware gets cheaper, models get smaller, Ollama adds more features. The builders who understand this stack now will have a durable advantage over those who keep treating cloud APIs as the default.

Give it a try yourself. I’d love to see what images you come up with.
