Mohinish S

Serverless Ventures | Cloud, Data & Distributed Systems | Angel & Advisor | Infra & Data Startups

SnackOnAI Blog

OpenWorker: Andrew Ng Built the AI Coworker That Delivers Finished Files, Not Chat, and Put the Agent Loop on Your Machine

Jul 31, 2026 • 11 min read

Every AI assistant tool available today delivers the same output: a text response. You paste it somewhere, format it, send it, file it, or act on it yourself.

Mohinish S

SnackOnAI Blog

CAMEL: The First Multi-Agent Framework That Used Its Own Agents to Generate the Data That Trained Its Competitors

Jul 30, 2026 • 13 min read

The dominant narrative around multi-agent frameworks is that you pick one based on features: does it support tool calling, memory, RAG, streaming?

Mohinish S

SnackOnAI Blog

Colibri: A Single C File Runs a 744B MoE on 25GB of RAM by Treating Every Expert as Data to Be Staged, Not State to Be Held

Jul 29, 2026 • 14 min read

The assumption every MoE inference system makes is wrong. It assumes the model must fit in memory.

Mohinish S

SnackOnAI Blog

SkillOpt: Microsoft Trained a Markdown File to 80.7 on SpreadsheetBench, Up From 41.8, Without Touching a Single Model Weight

Jul 28, 2026 • 14 min read

The dominant assumption in agent improvement is that better agents require better models. More parameters, more RLHF, more fine-tuning. SkillOpt (microsoft/SkillOpt, MIT, 13k stars, arXiv:2605.23904) challenges this from the opposite direction: freeze the model entirely, and train the skill document the same way you would train a neural network, with rollout batches, minibatch reflection, bounded updates, a held-out validation gate, and epoch-wise regularization.

Mohinish S

SnackOnAI Blog

Attention Residuals: Moonshot AI Fixed the Residual Connection That Every Deep LLM Has Been Getting Wrong Since 2017

Jul 27, 2026 • 14 min read

Every transformer you have ever used accumulates layer outputs the same way: add them all together with fixed weight 1. Layer 3 contributes exactly as much as layer 47. Layer 1's token embedding contributes the same as the layer right before the output.

Mohinish S

SnackOnAI Blog

LLM Observability Is Not a Dashboard Problem. It Is a Five-Layer Integration Problem That Nobody Has Solved Yet.

Jul 26, 2026 • 10 min read

You can monitor a web service with four metrics: request latency, error rate, CPU utilization, and memory usage. Those four numbers tell you almost everything you need to know. An LLM in production breaks the assumptions behind all four simultaneously. A model can produce fluent, syntactically correct output that is factually wrong.

Mohinish S

SnackOnAI Blog

SIE: Superlinked's Inference Engine Solves the Wrong Problem That Every Other Serving Stack Was Built For

Jul 25, 2026 • 13 min read

vLLM, SGLang, and TGI are built for one large model spread across many GPUs. Agents need the opposite: many small models sharing one GPU, switching on demand with sub-second cold start.

Mohinish S

SnackOnAI Blog

DevOps Open Agent: The AI Troubleshooter That Refuses To Trust Its Own AI

Jul 24, 2026 • 9 min read

Most "AI-powered" DevOps tools fail because they trust the LLM too much. This one is interesting because it treats the LLM as a hostile witness.

Mohinish S

SnackOnAI Blog

OpenShip: The Self-Hosted Deployment Platform That Builds Locally and Ships Containers, Leaving Your Servers Free to Do One Job

Jul 23, 2026 • 9 min read

Your production server runs Coolify. Coolify runs your apps. Also Coolify runs the CI. Also Coolify runs the dashboard. Also Coolify runs the build agent and the queue and the metrics collector.

Mohinish S

SnackOnAI Blog

Paperclip: The Agent Orchestration Layer That Solves the Problem Nobody Talks About, Which Is That Nobody Talked to the Agents Before Sending Them to Work

Jul 22, 2026 • 17 min read

The tagline is exact: "If OpenClaw is an employee, Paperclip is the company."

Mohinish S

SnackOnAI Tutorial

From Box to Cluster: Two DGX Sparks, One k3s Cluster, 256GB of Unified GPU Memory, and Every ARM64 Landmine You Need to Know Before You Start

Jul 21, 2026 • 12 min read

This guide, built from days of first-hand hardware experience, covers every decision and every command, including why certain approaches were tried and abandoned.

Mohinish S, +1

SnackOnAI Blog

Kimi K3 and Mooncake: Moonshot AI Shipped the World's First Open 3T-Class Model on a KVCache-Centric Inference Engine That Gets 525% More Throughput by Treating Cache as the Primary Citizen

Jul 20, 2026 • 12 min read

Every LLM serving paper optimizes for throughput. Mooncake optimizes for cache. The distinction sounds subtle. It is not. When you make KVCache (key-value cache, the memory structure that stores intermediate attention computations) the first-class citizen of your serving architecture, you stop thinking about GPU clusters as compute nodes and start thinking about them as a heterogeneous memory hierarchy.

Mohinish S

SnackOnAI Blog

ModelExpress: NVIDIA Dynamo's Rust-Based Weight Management Layer Transfers a 70B Model Between GPUs Faster Than Loading It From Disk. The JIT Cache Transfer Is the Feature Nobody Is Talking About.

Jul 19, 2026 • 16 min read

What used to be a minutes-long startup problem becomes an RDMA transfer measured in seconds.

Mohinish S

SnackOnAI Blog

Inkling: Thinking Machines Lab Built a 975B MoE With Controllable Thinking Effort, Relative Position Embeddings, and Short Convolutions on the Residual Stream. The Self-Fine-Tuning Demo Is the Real Signal.

Jul 17, 2026 • 16 min read

Inkling (thinkingmachines/Inkling, open-weights, July 15, 2026) is Thinking Machines Lab's first model release: a 975B-total/41B-active Mixture-of-Experts transformer with a 1M token context window, encoder-free multimodal inputs (audio as dMel spectrograms, vision as 40x40 pixel patches via 4-layer hMLP), controllable thinking effort (a float you pass at inference time), and 30M+ RL rollouts shaping its behavior.

Mohinish S

SnackOnAI Blog

OpenScience: The Open-Source AI Workbench Launched Five Days After Claude Science. It Supports More Models, More Skills, and Runs on Your Infrastructure. The Tradeoff Is Everything That Comes With Being Five Days Old.

Jul 16, 2026 • 15 min read

OpenScience (synthetic-sciences/openscience, Apache 2.0, v1.2.5, YC W26, openscience.sh) is a model-agnostic AI workbench for scientific research that runs the full research loop: literature review, hypothesis, code, experiment, analysis, and write-up, in one continuous session. It ships 250+ editable skills across ML, computational biology, cheminformatics, and cloud compute, plus 30+ scientific databases (UniProt, PDB, ChEMBL, arXiv, OpenAlex, Semantic Scholar) as native agent tools. Any frontier or open-weight model works with a single configuration flag; switching is per-request.

Mohinish S

SnackOnAI Blog

Atomic Task Graph: A 7B Model That Beats GPT-4 ReAct on ALFWorld and WebShop Has Nothing to Do With the 7B. It Is the Control Framework.

Jul 15, 2026 • 18 min read

ATG (arXiv:2607.01942, South China University of Technology + Tsinghua University, July 2026) is a training-free control framework that represents LLM agent planning and execution as an explicit directed acyclic graph of atomic tool-use units.

Mohinish S

SnackOnAI Blog

DeLM: The Multi-Agent Framework That Proved the Central Orchestrator Is the Bottleneck, Not the Solution

Jul 14, 2026 • 18 min read

DeLM (yuzhenmao/DeLM, arXiv:2606.10662, Stanford University, June 2026) is a decentralized multi-agent framework where parallel agents coordinate through a shared verified context and a task queue, with no central controller.

Mohinish S

SnackOnAI Blog

FlashInfer: The Attention Kernel Library That Proves the Bottleneck in LLM Inference Was Never the Model. It Was the Memory Access Pattern.

Jul 13, 2026 • 17 min read

FlashInfer (flashinfer-ai/flashinfer, Apache 2.0, 5.8k stars, MLSys 2025, arXiv:2501.01005) is a kernel library and kernel generator for LLM inference serving. Its three core contributions are a block-sparse composable format for heterogeneous KV-cache storage, a JIT-compiled customizable attention template system, and a load-balanced scheduling algorithm that works with CUDAGraph despite dynamic batching.

Mohinish S

SnackOnAI Blog

M Star: Stanford and UW Built a Universal Multimodal Serving System. The Key Insight Is That Every Model, From BAGEL to V-JEPA to Qwen3-Omni, Is Just a Graph. Every Request Is Just a Walk.

Jul 12, 2026 • 14 min read

M Star (mstar-project/mstar, arXiv:2606.12688, preprint June 2026, Stanford + University of Washington + CMU) is a universal serving runtime for composite multimodal models. Its core abstraction is the Walk Graph: a model is a directed computation graph of heterogeneous components, and every request executes as a series of Walks over that graph.

Mohinish S

SnackOnAI Blog

OpenSage: The Agent Development Kit That Lets the AI Build Its Own Agent Team Solved 39 of 50 Elite CTF Challenges. Claude Code Solved 13 of the Same 50.

Jul 11, 2026 • 13 min read

OpenSage (opensage-agent/opensage-adk, Apache 2.0, ICML 2026, arXiv:2602.16891) is the first agent development kit where the LLM creates its own sub-agents, writes its own tools, and manages its own memory at runtime, without a human pre-specifying the topology.

Mohinish S

SnackOnAI Blog

OmniRoute: The Free AI Gateway That Turns 160+ Providers Into One Endpoint, Compresses Your Tokens by Up to 95%, and Falls Back Automatically When Any of Them Fails

Jul 10, 2026 • 12 min read

OmniRoute (diegosouzapw/OmniRoute, MIT, 4.5k stars, v3.7.9) is a local AI proxy that runs on port 20128 and exposes a single OpenAI-compatible endpoint to every coding tool you use.

Mohinish S

SnackOnAI Blog

Agent-Reach: The Most Honest Description of an AI Infrastructure Tool in 2026 Is "Pure Vibe Coding." The Tool Itself Is a Serious Piece of Agent Scaffolding.

Jul 9, 2026 • 13 min read

Agent-Reach (Panniantong/Agent-Reach, MIT, 20.3k stars, v1.4.0) gives your AI agent eyes to see the internet. Its design philosophy is the clearest statement of the scaffolding-not-framework principle I have read: install the right upstream tools, register a SKILL.md so the agent knows what it has, then get completely out of the way.

Mohinish S

SnackOnAI Blog

OpenMontage: The AI Video Production System That Proves "Agent as Orchestrator" Is Not a Research Concept Anymore. It Is a Production Architecture.

Jul 8, 2026 • 13 min read

OpenMontage (calesthio/OpenMontage, AGPL-3.0, 34.5k stars) was the #1 Repository of the Day on GitHub Trending on its launch day and is the first open-source agentic video production system to compose a complete production workflow (12 pipelines, 52 tools, 500+ agent skills) from a plain-language prompt.

Mohinish S

SnackOnAI Blog

Pocket TTS: The 100M-Parameter Voice Cloning Model That Runs on CPU Is a Proof-of-Concept for Why the Entire Audio Language Model Field Chose the Wrong Token Format

Jul 7, 2026 • 14 min read

Every major audio language model, from MusicGen to AudioLM to Moshi, represents audio as sequences of discrete tokens from a lossy neural codec.

Mohinish S

SnackOnAI Blog

Handy: The Most Forkable Speech-to-Text App Is a Better Design Goal Than the Most Accurate One

Jul 6, 2026 • 12 min read

The author of Handy (cjpais/Handy, MIT, 21k stars, v0.8.3) wrote this explicitly: "Handy isn't trying to be the best speech-to-text app, it's trying to be the most forkable one."
