
SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | April 27, 2026

Most AI application tutorials converge on the same architecture: call OpenAI's API, get an answer, display it. Aurum, built by Adam Chan, takes the opposite approach: three Ollama models, a PostgreSQL database with the pgvector extension, a faster-whisper Python microservice, a Next.js 16 frontend, and zero external API calls. Everything runs locally. Your data never leaves the machine. And the technical choices made to get there (which embedding model, which LLM, how to do multimodal image recognition, how to implement semantic search, how to build a voice input pipeline without a cloud transcription service) are exactly the choices you would make when building any serious local AI application.

This newsletter dissects Aurum as a systems case study: what the three-model Ollama stack does, how pgvector with HNSW indexing enables semantic search across a home inventory, how faster-whisper runs local speech-to-text on port 9000, and what the ReAct pattern from arXiv:2210.03629 explains about the LLM's natural language parsing loop.

Scope: Aurum's complete technical stack, the three Ollama models (embeddinggemma:300m, llama3.2:3b, qwen3-vl), pgvector's cosine similarity search, the faster-whisper transcription microservice, and the Prisma ORM schema. Not covered: fine-tuning any of the underlying models, or production scaling beyond single-machine local deployment.

What It Actually Does

Aurum is a Next.js 16 web application for tracking home inventory using natural language, voice, and photos. Five stars on GitHub, MIT license, built and released publicly by Adam Chan. The repo description: "AI-powered home inventory management. Track, organize, and find your belongings using natural language, voice, or photos. 100% local with Ollama, your data never leaves home."

The use case is real: "Where did I put the extra toothpaste?" "Do I have any batteries?" "Add 3 bottles of shampoo to the bathroom cabinet." These queries are trivial to answer if you have a well-maintained inventory database and a natural language interface on top of it. The hard part is making the interface work for how people actually speak about their belongings: fuzzily, conversationally, and without remembering the exact keywords they used when adding an item.

Semantic search via vector embeddings solves this. The embedding model turns both the stored item description and the query into high-dimensional vectors. Similarity search finds items where the vectors are close, even when the words are different. "I need a battery" finds items tagged "AA batteries," "Duracell," or "two AA cells in the kitchen drawer."
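
To make "close" concrete, here is a minimal illustrative sketch (not code from the Aurum repo) of the cosine similarity computation that pgvector performs inside PostgreSQL via its distance operator:

// Illustrative only: cosine similarity between two embedding vectors.
// In Aurum this comparison happens inside PostgreSQL via pgvector, not in application code.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// "I need a battery" and "AA batteries in the kitchen drawer" share almost no words,
// but their embedding vectors land close together, so their similarity score is high.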

The complete installed tech stack:

  • Next.js 16 (App Router, React 19, TypeScript)

  • PostgreSQL with pgvector extension (via Docker)

  • Prisma ORM for database access

  • Ollama running three models locally:

    • embeddinggemma:300m (semantic search embeddings)

    • llama3.2:3b (natural language parsing and chat)

    • qwen3-vl (image recognition for photo input)

  • faster-whisper Python microservice on port 9000 (voice transcription)

  • TailwindCSS and Lucide React for UI

The Architecture

Focus on the three distinct Ollama model roles. embeddinggemma handles search. llama3.2 handles intent parsing and response generation. qwen3-vl handles vision. They run as separate Ollama model instances, never sharing context. All three are served by a single Ollama instance on port 11434; only voice transcription sits outside it, on the faster-whisper service at port 9000.

The voice input path deserves attention. The browser's MediaRecorder API captures audio from the microphone and sends it as a WAV blob to the faster-whisper Python server on port 9000, which returns a text transcript. That transcript then flows through the same path as typed text input. The transcription runs locally with Whisper's small or medium model, not via OpenAI's Whisper API. This is a meaningful architectural choice: voice transcription of home inventory items almost always includes product names, brand names, and room-specific vocabulary (linen closet, medicine cabinet, pantry) that general cloud transcription handles inconsistently. Running Whisper locally gives a controlled environment with consistent, reproducible behavior.
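
A minimal browser-side sketch of this capture path, assuming the faster-whisper service exposes a /transcribe endpoint that accepts multipart audio and returns { text } (the endpoint name and response shape are assumptions, not confirmed from the repo):

// Browser-side voice capture sketch. Assumes the local faster-whisper service
// exposes POST /transcribe and returns { text: string } (endpoint and shape assumed).
// The real pipeline sends WAV; the container conversion step is omitted here.
async function recordAndTranscribe(durationMs = 5000): Promise<string> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks: Blob[] = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);

  const stopped = new Promise<void>((resolve) => { recorder.onstop = () => resolve(); });
  recorder.start();
  setTimeout(() => recorder.stop(), durationMs);
  await stopped;
  stream.getTracks().forEach((t) => t.stop());

  // Package the recorded audio and send it to the local transcription service
  const form = new FormData();
  form.append('audio', new Blob(chunks, { type: recorder.mimeType }), 'recording.webm');
  const res = await fetch('http://localhost:9000/transcribe', { method: 'POST', body: form });
  const { text } = await res.json();
  return text; // flows into the same parsing path as typed input
}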

The Code

Snippet One: Semantic Search with pgvector and Prisma (the read path)

// lib/ai.ts (simplified from the actual implementation)
import { prisma } from './prisma';

// ← THIS is the search mechanism: convert user query to a vector,
// then find items whose stored vectors are closest in cosine distance
export async function semanticSearchItems(query: string, limit: number = 5) {

  // Step 1: Generate a query embedding using embeddinggemma:300m
  // ← We use the SAME model that generated stored embeddings.
  // Mixing embedding models breaks search: the vector spaces are incompatible.
  const embedResponse = await fetch('http://localhost:11434/api/embed', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'embeddinggemma:300m',  // ← MUST match what was used at write time
      input: query,
    }),
  });

  const { embeddings } = await embedResponse.json();
  const queryVector = embeddings[0]; // 768-dimensional float array

  // Step 2: Cosine similarity search via pgvector
  // ← Prisma doesn't natively support pgvector operators, so we use $queryRaw
  // The <=> operator is pgvector's cosine DISTANCE (not similarity)
  // ORDER BY ASC means smallest distance = most similar = what we want
  const results = await prisma.$queryRaw`
    SELECT
      id,
      name,
      description,
      quantity,
      location,
      -- ← Convert to similarity score for display (1 - cosine_distance)
      1 - (embedding <=> ${queryVector}::vector) AS similarity_score
    FROM items
    ORDER BY embedding <=> ${queryVector}::vector  -- ← THIS is the semantic search
    LIMIT ${limit}
  `;

  // Step 3: Filter by minimum similarity threshold
  // ← Without a threshold, every item returns, including irrelevant ones
  // 0.5 is a reasonable floor for consumer inventory items
  return (results as any[]).filter(r => r.similarity_score > 0.5);
}

// ← Why embeddinggemma:300m and not a larger model?
// 300M parameters, runs in <200MB of RAM, generates embeddings in ~20-50ms locally.
// For 768-dim vectors over a few hundred inventory items, this is more than sufficient.
// A 1.5B embedding model would be 5x slower for negligible quality improvement
// on short item descriptions.
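
A hypothetical call site (not from the repo) to show the shape of what comes back:

// Hypothetical call site, e.g. inside the FIND branch of the chat route:
const matches = await semanticSearchItems('do I have any batteries?');
// → [{ id, name: 'AA batteries', description: 'two AA cells', quantity: 8,
//      location: 'kitchen drawer', similarity_score: 0.74 }, ...]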

The <=> operator is the entire search mechanism. Everything else, the Ollama embedding call, the Prisma raw query, the similarity threshold, exists to set up and filter that single pgvector operation.

Snippet Two: LLM Intent Parsing and Vision Description Pipeline

// app/api/chat/route.ts (simplified)
import { NextRequest } from 'next/server';
import { prisma } from '@/lib/prisma';  // ← used by the write path below (import path assumed)

// ← Types inferred from the parser's JSON contract in the system prompt
interface ParsedItem {
  name: string;
  description: string;
  quantity: number;
  location: string;
}

interface ParsedItems {
  intent: 'ADD' | 'FIND' | 'UPDATE' | 'LIST';
  items?: ParsedItem[];
  searchQuery?: string;
}

// ← Parse natural language into structured item data using llama3.2:3b
// This replaces what would be a regex or keyword parser in a naive implementation
async function parseItemFromText(userMessage: string): Promise<ParsedItems> {
  const response = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama3.2:3b',
      messages: [
        {
          role: 'system',
          // ← THIS is the ReAct-inspired prompt pattern:
          // The LLM is given a specific action type (ADD vs FIND vs UPDATE)
          // and must produce structured JSON, not open-ended text.
          // Constraining output format prevents hallucination of item fields.
          content: `You are a home inventory assistant. Parse the user message and respond with JSON only.
Detect the intent: ADD (adding items), FIND (searching), UPDATE (modifying), LIST (showing all).
For ADD intent, extract: items array with {name, description, quantity, location}.
For FIND intent, extract: searchQuery string.
Return ONLY valid JSON, no other text.
Example ADD response: {"intent": "ADD", "items": [{"name": "shampoo", "description": "3 bottles of Head & Shoulders", "quantity": 3, "location": "bathroom cabinet"}]}`
        },
        { role: 'user', content: userMessage }
      ],
      stream: false,
      // ← temperature 0 for deterministic JSON parsing. We want the same output
      // for the same input, not creative variation in field names.
      options: { temperature: 0 }
    }),
  });

  const data = await response.json();
  // ← The model outputs JSON as a string. We parse it.
  // In production this needs try/catch for malformed LLM output.
  return JSON.parse(data.message.content);
}

// ← Vision pipeline: photo → qwen3-vl description → text → LLM parser
async function describeImageWithVision(imageBase64: string): Promise<string> {
  const response = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'qwen3-vl',  // ← Qwen-VL understands images natively
      messages: [{
        role: 'user',
        // ← Ollama's native /api/chat takes raw base64 strings in an `images`
        // array on the message, not OpenAI-style image_url content blocks
        images: [imageBase64],
        // ← Prompt engineering for inventory: ask specifically for
        // quantity, brand, condition. Generic "describe this image"
        // produces poetic descriptions, not inventory-useful metadata.
        content: 'Describe this item for a home inventory system. Include: what it is, approximate quantity if multiple, any visible brand name, and condition. Be concise and specific.'
      }],
      stream: false,
    }),
  });

  const data = await response.json();
  // ← qwen3-vl returns a text description, which we then pass to parseItemFromText()
  // Vision output → LLM parser → structured JSON → database
  return data.message.content;
}

// Write path: structured item → embedding → database
async function saveItemWithEmbedding(item: ParsedItem, locationId: string) {
  // Generate embedding for the item description
  const embeddingText = `${item.name} ${item.description} ${item.location}`;

  const embedResponse = await fetch('http://localhost:11434/api/embed', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'embeddinggemma:300m',
      input: embeddingText,
    }),
  });

  const { embeddings } = await embedResponse.json();

  // ← Save item with its embedding vector for future semantic search
  await prisma.$executeRaw`
    INSERT INTO items (name, description, quantity, location_id, embedding)
    VALUES (${item.name}, ${item.description}, ${item.quantity}, ${locationId}, ${embeddings[0]}::vector)
  `;
}
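
Chained together, the photo path looks roughly like this (a sketch of the call order, not the repo's actual route handler):

// Photo input path, end to end (illustrative wiring, not the actual route handler)
async function handlePhotoUpload(imageBase64: string, locationId: string) {
  const description = await describeImageWithVision(imageBase64);  // qwen3-vl → text
  const parsed = await parseItemFromText(description);             // llama3.2:3b → structured JSON
  if (parsed.intent === 'ADD' && parsed.items) {
    for (const item of parsed.items) {
      await saveItemWithEmbedding(item, locationId);               // embeddinggemma → pgvector row
    }
  }
}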

The temperature=0 setting for the LLM parser is critical. Natural language to JSON parsing must be deterministic. Any temperature above 0 introduces random variation in field names, JSON structure, and intent classification that breaks downstream code. The vision model and the chat response model can use higher temperatures. The parser model cannot.
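
Determinism still does not guarantee syntactically valid JSON on every input. Two cheap hardening steps, sketched below under the assumption that you control the request body: request Ollama's JSON mode via its format parameter, and guard the parse with a fallback intent so a malformed response never throws past the route handler. This is defensive scaffolding, not code from the repo.

// Defensive parsing sketch (not from the Aurum repo).
// 1. Constrain output by adding format: 'json' to the /api/chat request body:
//    { model: 'llama3.2:3b', format: 'json', messages: [...], options: { temperature: 0 } }
// 2. Never let a malformed response throw past the route handler.
function safeParseIntent(raw: string): ParsedItems {
  try {
    return JSON.parse(raw) as ParsedItems;
  } catch {
    // Fall back to treating the whole message as a search query instead of failing the request
    return { intent: 'FIND', searchQuery: raw };
  }
}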

In Action: End-to-End Worked Example

Scenario: Voice input of a bathroom inventory, then semantic search for a specific item.

Input: User clicks "Start Recording" and says: "I have two toilet paper rolls, one comb, fifteen razor blade replacements, and two boxes of QTips."

Step 1: Voice transcription (faster-whisper, port 9000)

Input audio: ~6 second WAV clip, 16kHz, 96kbps
faster-whisper model: small (244MB, runs on CPU in ~1-2 seconds)
Output text: "I have two toilet paper rolls, one comb, fifteen razor blade replacements,
              and two boxes of QTips."
Transcription latency: ~1.5 seconds on M2 MacBook Air (CPU inference)

Step 2: LLM parsing (llama3.2:3b, temperature=0)

{
  "intent": "ADD",
  "items": [
    {"name": "toilet paper", "description": "toilet paper rolls", "quantity": 2, "location": "bathroom"},
    {"name": "comb", "description": "comb", "quantity": 1, "location": "bathroom"},
    {"name": "razor blade replacements", "description": "razor blade replacements", "quantity": 15, "location": "bathroom"},
    {"name": "QTips", "description": "QTips cotton swabs boxes", "quantity": 2, "location": "bathroom"}
  ]
}

LLM parsing latency: ~800ms on llama3.2:3b (Apple Silicon Metal GPU acceleration via Ollama)

Step 3: Embedding generation (embeddinggemma:300m, per item)

Input: "toilet paper toilet paper rolls bathroom"
Output: [0.023, -0.441, 0.187, ... ] (768 dimensions)
Embedding latency: ~25ms per item (embeddinggemma:300m is very fast)
4 items × ~25ms = ~100ms total embedding time

Step 4: PostgreSQL upsert with pgvector

INSERT INTO items (name, description, quantity, location_id, embedding)
VALUES
  ('toilet paper', 'toilet paper rolls', 2, 'bathroom-id', '[0.023, -0.441, ...]'::vector),
  ('comb', 'comb', 1, 'bathroom-id', '[...]'::vector),
  ('razor blade replacements', 'razor blade replacements', 15, 'bathroom-id', '[...]'::vector),
  ('QTips', 'QTips cotton swabs boxes', 2, 'bathroom-id', '[...]'::vector)

Step 5: Later query, "Do I have anything to cut my face hair?"

Query embedding: embeddinggemma:300m → query vector for "cut face hair"
pgvector search:
  SELECT name, quantity, location, 1 - (embedding <=> query_vector) AS score
  FROM items ORDER BY score DESC LIMIT 5

Results:
  razor blade replacements  | 15 | bathroom | score: 0.78  ← semantic match on "cutting"
  comb                      |  1 | bathroom | score: 0.52  ← weak match on grooming
  QTips                     |  2 | bathroom | score: 0.38  ← filtered out (< 0.5 threshold)

llama3.2:3b response generation:
"Yes! You have 15 razor blade replacements in the bathroom."
Response generation latency: ~600ms

Total latency (voice input → stored + confirmed): ~4-5 seconds
Total latency (text query → answer): ~1.2 seconds
All on local hardware, no network calls to external APIs

Why This Design Works, and What It Trades Away

The three-model separation is the correct design for this problem. Using one large multimodal model for embeddings, parsing, and vision would be worse on all three dimensions: larger models are slower for embedding generation (where speed matters more than quality), they are overkill for simple JSON parsing tasks, and mixing embedding and generation in the same model creates architectural coupling that makes model swaps impossible. Aurum's explicit model-per-role design means you can swap llama3.2:3b for gemma3:4b for the parser without touching the embedding pipeline.

The choice of embeddinggemma:300m as the embedding model reflects the correct tradeoff for this scale. EmbeddingGemma is a 300M parameter model from Google, built on the Gemma 3 architecture with T5Gemma initialization, trained on 100+ languages. It runs in less than 200MB of RAM, generates 768-dimensional embeddings, and at this scale (hundreds of home inventory items) is more than sufficient. The MTEB leaderboard shows EmbeddingGemma competitive with models twice its size. Semantic search quality over short item descriptions does not require billion-parameter embedding models.

The pgvector HNSW index is the correct choice over IVFFlat for this use case. At household inventory scale (hundreds to low thousands of items), HNSW (Hierarchical Navigable Small World graph) provides approximate nearest neighbor search with better recall than IVFFlat and no training requirement. IVFFlat requires a training step on representative data before building the index. HNSW builds incrementally as items are inserted.
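
When the index does become worthwhile, creating it is a one-time statement. A sketch using Prisma's raw SQL escape hatch (the script shape is assumed; the m and ef_construction values shown are pgvector's defaults):

// scripts/create-hnsw-index.ts (hypothetical helper script)
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

async function main() {
  // Enable pgvector (no-op if the extension is already installed)
  await prisma.$executeRawUnsafe(`CREATE EXTENSION IF NOT EXISTS vector`);

  // HNSW index using cosine distance, matching the <=> operator in the search query
  await prisma.$executeRawUnsafe(`
    CREATE INDEX IF NOT EXISTS items_embedding_hnsw_idx
    ON items USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
  `);
}

main().finally(() => prisma.$disconnect());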

What Aurum trades away:

Multi-user concurrent access. The Ollama models running locally serve one request at a time (unless OLLAMA_NUM_PARALLEL is set). For a single-user household app, this is fine. For a family of four simultaneously adding items, query latency will degrade proportionally.

Cross-device sync. Data lives in a local PostgreSQL instance. Accessing the inventory from a phone requires either running Aurum on a home server with local network access, or exporting/importing the database. There is no sync mechanism.

Voice accuracy on domain-specific terms. Faster-whisper's small model handles common English well but struggles with unusual product names, foreign words, or strong accents. The medium model improves this at the cost of ~3x more transcription latency.

Technical Moats

The full local privacy guarantee. This is the actual differentiator relative to cloud-based inventory apps. Photos of your home interior, product inventories, medication lists, and sensitive items never leave the machine. For anyone with legitimate privacy concerns about where their home data goes, a cloud-connected inventory app is not acceptable regardless of its features. Aurum's 100% local architecture is not a limitation. It is the primary product attribute.

The multimodal pipeline for zero-configuration item entry. The combination of voice → faster-whisper → LLM parser and photo → qwen3-vl → LLM parser → embedding → database, all on local hardware, is a reference implementation for any local multimodal application. Any team building a similar system (local meeting notes, private document management, physical asset tracking) can replicate this architecture with the same stack.

EmbeddingGemma's performance-per-compute ratio. At 300M parameters running in under 200MB RAM, it outperforms older embedding models many times its size on semantic similarity tasks. The MTEB benchmark shows it competitive with models at 500M-700M parameters. For local deployment where model footprint is a constraint, this is the correct embedding model choice at this scale.

Insights

Insight One: Aurum is not a home inventory app. It is a reference implementation for building any privacy-first local RAG application, and the "home inventory" framing is limiting the community's recognition of what it demonstrates.

Every design decision in Aurum generalizes: three-model Ollama stack (embedding, generation, vision), pgvector semantic search, faster-whisper local transcription, Prisma ORM for type-safe database access, Next.js App Router for the UI layer. Substitute "home inventory items" with "private documents," "medical records," "legal notes," or "personal journals" and the architecture is identical. The specific domain is trivial. The pattern, a fully local, privacy-preserving, multimodal RAG application with sub-second query response over hundreds to thousands of items, is not trivial and is underdemonstrated in open source.

Insight Two: The ReAct pattern from arXiv:2210.03629 is operating implicitly in Aurum's intent parser, and the community building local AI apps routinely implements it without recognizing it.

ReAct (Reasoning and Acting, Yao et al., ICLR 2023) demonstrated that LLMs perform better on interactive tasks when they interleave reasoning traces with action steps: think, then act, observe the result, think again. Aurum's LLM parser implements a simplified version: parse the user's intent (reasoning), classify it as ADD/FIND/UPDATE/LIST (acting), return structured JSON (output). The zero-temperature setting ensures the reasoning step is deterministic rather than creative. The structured output constraint ensures the action step is machine-executable. What differs from vanilla ReAct is the absence of an observation loop: Aurum's parser is single-step. A more capable version would implement the full loop: parse intent, query the database, observe current state, then decide whether to add, update, or inform. This would handle "Add another bottle of shampoo to the bathroom" correctly when the item already exists, rather than creating a duplicate.
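
A sketch of what that observation step could look like, layered on the helpers shown earlier (hypothetical, not part of Aurum today; the 0.85 duplicate threshold is an assumption):

// Hypothetical ADD handler with an observe-then-act step (not in Aurum today)
async function addOrUpdateItem(item: ParsedItem, locationId: string) {
  // Observe: is there already a semantically equivalent item?
  const [existing] = await semanticSearchItems(`${item.name} ${item.location}`, 1);

  if (existing && existing.similarity_score > 0.85) {
    // Act: bump the quantity on the existing row instead of inserting a duplicate
    await prisma.$executeRaw`
      UPDATE items SET quantity = quantity + ${item.quantity} WHERE id = ${existing.id}
    `;
  } else {
    // Act: no close match, create a new row with its embedding
    await saveItemWithEmbedding(item, locationId);
  }
}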

Takeaway

pgvector's cosine similarity search over 768-dimensional EmbeddingGemma vectors finds semantically related inventory items, including items described with completely different words, in under 5 milliseconds even without an HNSW index, across a typical household inventory of a few hundred items.

At this scale, the index is not the performance bottleneck. The Ollama embedding call takes ~25ms, the LLM parsing call ~800ms, and the PostgreSQL query with a sequential scan ~2ms. Building the HNSW index does not meaningfully improve query latency until the inventory reaches tens of thousands of items. For a single household, the correct optimization target is LLM parsing latency (use a smaller, faster model), not database index tuning. The community default assumption, "we need a vector database for AI search," is wrong at this scale. PostgreSQL with pgvector is sufficient for anything under a few hundred thousand items.

TL;DR For Engineers

  • Three-model Ollama stack: embeddinggemma:300m for semantic search embeddings (768-dim, <200MB RAM, ~25ms/item), llama3.2:3b at temperature=0 for deterministic intent parsing and response generation, qwen3-vl for photo-to-description vision. Each model has one role. No shared context.

  • Semantic search: user query → embedding → pgvector cosine distance search (<=>) → items ranked by similarity. Threshold at 0.5 to filter irrelevant results. Sequential scan is fast enough at household scale (hundreds of items), HNSW index recommended above 10,000 items.

  • Voice pipeline: browser MediaRecorder API → WAV blob → faster-whisper Python server (port 9000) → text transcript → identical path to typed input. Runs entirely local, no OpenAI Whisper API.

  • The intent parser must run at temperature=0. Natural language → structured JSON parsers require deterministic output. Random variation in JSON field names breaks downstream code.

  • The entire stack runs via docker-compose up (PostgreSQL), ollama serve, three ollama pull commands, and npm run dev. Setup time under 20 minutes on a machine with a GPU.

The Privacy-First Local AI Stack Is Now a Three-Hour Project

Aurum is proof that the "call OpenAI's API" tutorial pattern is not the only option, and arguably not the right option for applications where the data itself is sensitive. Three Ollama models, one PostgreSQL instance, one Python microservice, and a Next.js frontend. Every component is open source, self-contained, and runs on consumer hardware. The query response time (1.2 seconds for a text query) is competitive with cloud-connected alternatives while keeping every query, every photo, and every item description on your own machine. The pattern scales: swap the home inventory schema for any structured private data, and the architecture serves as a production-ready local RAG system. Aurum is, in effect, the tutorial that finally explains how to build this correctly, not with API calls but with a complete local stack, wrapped in the most relatable possible domain.

References

Aurum (MIT, Adam Chan) is a fully local home inventory application built on Next.js 16, PostgreSQL with pgvector, and three Ollama models: EmbeddingGemma:300m (768-dim semantic search embeddings, ~25ms per item), Llama3.2:3b (zero-temperature intent parsing and response generation), and Qwen3-VL (photo-to-description vision). Voice input runs through a local faster-whisper Python microservice on port 9000. Semantic search uses pgvector's cosine distance operator (<=>) with a 0.5 similarity threshold, achieving sub-5ms query time at household inventory scale without an HNSW index. The architecture is a portable reference implementation for any privacy-first local multimodal RAG application: substitute the domain, keep the stack.

Sponsored Ad

If you enjoy practical AI insights, check out SnackOnAI and support the newsletter by subscribing, sharing, and exploring our sponsored ad — it helps us keep building and delivering value 🚀

Analytics on Live Data Without Leaving Postgres

When analytics on Postgres slows down, most teams add a second database. TimescaleDB by Tiger Data takes a different approach: extend Postgres with columnar storage and time-series primitives to run analytics on live data, no split architecture, no pipeline lag, no new query language to learn. Start building for free. No credit card required.
