SnackOnAI Blog

Jul 17, 2026

Inkling: Thinking Machines Lab Built a 975B MoE With Controllable Thinking Effort, Relative Position Embeddings, and Short Convolutions on the Residual Stream. The Self-Fine-Tuning Demo Is the Real Signal.

Inkling (thinkingmachines/Inkling, open-weights, July 15, 2026) is Thinking Machines Lab's first model release: a 975B-total/41B-active Mixture-of-Experts transformer with a 1M token context window, encoder-free multimodal inputs (audio as dMel spectrograms, vision as 40x40 pixel patches via 4-layer hMLP), controllable thinking effort (a float you pass at inference time), and 30M+ RL rollouts shaping its behavior.

Jul 16, 2026

OpenScience: The Open-Source AI Workbench Launched Five Days After Claude Science. It Supports More Models, More Skills, and Runs on Your Infrastructure. The Tradeoff Is Everything That Comes With Being Five Days Old.

OpenScience (synthetic-sciences/openscience, Apache 2.0, v1.2.5, YC W26, openscience.sh) is a model-agnostic AI workbench for scientific research that runs the full research loop: literature review, hypothesis, code, experiment, analysis, and write-up, in one continuous session. It ships 250+ editable skills across ML, computational biology, cheminformatics, and cloud compute, plus 30+ scientific databases (UniProt, PDB, ChEMBL, arXiv, OpenAlex, Semantic Scholar) as native agent tools. Any frontier or open-weight model works with a single configuration flag; switching is per-request.

Jul 15, 2026

Atomic Task Graph: A 7B Model That Beats GPT-4 ReAct on ALFWorld and WebShop Has Nothing to Do With the 7B. It Is the Control Framework.

ATG (arXiv:2607.01942, South China University of Technology + Tsinghua University, July 2026) is a training-free control framework that represents LLM agent planning and execution as an explicit directed acyclic graph of atomic tool-use units.
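
To make the graph idea concrete, here is a minimal sketch of running atomic units in dependency order. It is not ATG's implementation: `run_task_graph`, the unit names, and the toy tool functions are hypothetical, and the ordering comes from Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter


def run_task_graph(units, deps):
    """Run each atomic unit once, after all of its prerequisites.

    units: name -> callable taking the dict of results so far
    deps:  name -> set of prerequisite unit names (must be acyclic)
    """
    results = {}
    # static_order() yields prerequisites first and raises CycleError
    # if the graph is not a DAG.
    for name in TopologicalSorter(deps).static_order():
        results[name] = units[name](results)
    return results


# Hypothetical three-step shopping task, each step one atomic unit.
units = {
    "search": lambda r: ["mug", "kettle"],
    "filter": lambda r: [item for item in r["search"] if item.startswith("m")],
    "buy": lambda r: f"bought {r['filter'][0]}",
}
deps = {"search": set(), "filter": {"search"}, "buy": {"filter"}}

results = run_task_graph(units, deps)
print(results["buy"])
```

Because the plan is an explicit graph rather than free-form text, a cyclic or malformed plan fails before any tool is called.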

Jul 14, 2026

DeLM: The Multi-Agent Framework That Proved the Central Orchestrator Is the Bottleneck, Not the Solution

DeLM (yuzhenmao/DeLM, arXiv:2606.10662, Stanford University, June 2026) is a decentralized multi-agent framework where parallel agents coordinate through a shared verified context and a task queue, with no central controller.
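
The coordination pattern described here can be sketched with a queue and a lock. This is an illustration under stated assumptions, not DeLM's code: `run_agents`, `solve`, and `verify` are hypothetical names, and threads stand in for agents.

```python
import queue
import threading


def run_agents(tasks, solve, verify, n_agents=3):
    """Agents pull tasks from a shared queue; only verified results
    are written to the shared context. No central controller."""
    task_queue = queue.Queue()
    for task in tasks:
        task_queue.put(task)

    shared_context = {}
    lock = threading.Lock()

    def agent():
        while True:
            try:
                task = task_queue.get_nowait()
            except queue.Empty:
                return  # queue drained, this agent stops
            with lock:
                snapshot = dict(shared_context)  # read a consistent copy
            result = solve(task, snapshot)
            if verify(task, result):
                with lock:
                    shared_context[task] = result

    threads = [threading.Thread(target=agent) for _ in range(n_agents)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared_context


# Toy example: square each number, keep only results that pass the check.
context = run_agents(
    tasks=[1, 2, 3, 4],
    solve=lambda task, ctx: task * task,
    verify=lambda task, result: result % 2 == 0,
)
print(sorted(context.items()))
```

Each agent decides what to do next by looking at the queue and the verified context, so adding agents does not add load to any single coordinator.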

Jul 13, 2026

FlashInfer: The Attention Kernel Library That Proves the Bottleneck in LLM Inference Was Never the Model. It Was the Memory Access Pattern.

FlashInfer (flashinfer-ai/flashinfer, Apache 2.0, 5.8k stars, MLSys 2025, arXiv:2501.01005) is a kernel library and kernel generator for LLM inference serving. Its three core contributions are a block-sparse composable format for heterogeneous KV-cache storage, a JIT-compiled customizable attention template system, and a load-balanced scheduling algorithm that works with CUDAGraph despite dynamic batching.

Jul 12, 2026

M Star: Stanford and UW Built a Universal Multimodal Serving System. The Key Insight Is That Every Model, From BAGEL to V-JEPA to Qwen3-Omni, Is Just a Graph. Every Request Is Just a Walk.

M Star (mstar-project/mstar, arXiv:2606.12688, preprint June 2026, Stanford + University of Washington + CMU) is a universal serving runtime for composite multimodal models. Its core abstraction is the Walk Graph: a model is a directed computation graph of heterogeneous components, and every request executes as a series of Walks over that graph.

Jul 11, 2026

OpenSage: The Agent Development Kit That Lets the AI Build Its Own Agent Team Solved 39 of 50 Elite CTF Challenges. Claude Code Solved 13 of the Same 50.

OpenSage (opensage-agent/opensage-adk, Apache 2.0, ICML 2026, arXiv:2602.16891) is the first agent development kit where the LLM creates its own sub-agents, writes its own tools, and manages its own memory at runtime, without a human pre-specifying the topology.

Jul 10, 2026

OmniRoute: The Free AI Gateway That Turns 160+ Providers Into One Endpoint, Compresses Your Tokens by Up to 95%, and Falls Back Automatically When Any of Them Fails

OmniRoute (diegosouzapw/OmniRoute, MIT, 4.5k stars, v3.7.9) is a local AI proxy that runs on port 20128 and exposes a single OpenAI-compatible endpoint to every coding tool you use.

Jul 9, 2026

Agent-Reach: The Most Honest Description of an AI Infrastructure Tool in 2026 Is "Pure Vibe Coding." The Tool Itself Is a Serious Piece of Agent Scaffolding.

Agent-Reach (Panniantong/Agent-Reach, MIT, 20.3k stars, v1.4.0) gives your AI agent eyes to see the internet. Its design philosophy is the clearest statement of the scaffolding-not-framework principle I have read: install the right upstream tools, register a SKILL.md so the agent knows what it has, then get completely out of the way.

Jul 8, 2026

OpenMontage: The AI Video Production System That Proves "Agent as Orchestrator" Is Not a Research Concept Anymore. It Is a Production Architecture.

OpenMontage (calesthio/OpenMontage, AGPL-3.0, 34.5k stars) was the #1 Repository of the Day on GitHub Trending on its launch day and is the first open-source agentic video production system to compose a complete production workflow (12 pipelines, 52 tools, 500+ agent skills) from a plain-language prompt.

Jul 7, 2026

Pocket TTS: The 100M-Parameter Voice Cloning Model That Runs on CPU Is a Proof-of-Concept for Why the Entire Audio Language Model Field Chose the Wrong Token Format

Every major audio language model, from MusicGen to AudioLM to Moshi, represents audio as sequences of discrete tokens from a lossy neural codec.

Jul 6, 2026

Handy: The Most Forkable Speech-to-Text App Is a Better Design Goal Than the Most Accurate One

The author of Handy (cjpais/Handy, MIT, 21k stars, v0.8.3) wrote this explicitly: "Handy isn't trying to be the best speech-to-text app, it's trying to be the most forkable one."

Jul 5, 2026

Meetily: The AI Meeting Assistant That Proves Privacy and Cloud Are Not a Tradeoff. They Are an Architecture Decision.

Every AI meeting tool you have seen evaluated sends your recordings to someone else's server. Meetily (Zackriya-Solutions/meetily, MIT, 15.4k stars) does not. It is a self-contained Tauri desktop application that captures audio, transcribes with Whisper or Parakeet, and summarizes via Ollama or any OpenAI-compatible endpoint, entirely on your local machine.

Jul 4, 2026

Loop Engineering: The Claude Code Paper Proves Your Agent Loop Is the Smallest Part. The Systems Around It Are Where the Real Engineering Lives.

Nine out of ten developers still prompt their AI coding agents by hand. For those who have moved to automated loops, the mistake is treating the loop itself as the engineering problem.

Jul 2, 2026

Hindsight: Agent Memory That Lifts a 20B Open-Source Model From 39% to 83.6% on Long-Horizon Benchmarks, Past Full-Context GPT-4o

Every agent memory system built in 2025 made the same mistake: treating memory as a retrieval problem. Dump facts into a vector store, fetch top-k, inject into context.

Jul 1, 2026

LuxTTS: The 48kHz Voice Cloning Model That Fits in 1GB of VRAM and Runs 150x Faster Than Real Time Is Built on an ASR Architecture Nobody Expected to Work for TTS

Every production TTS system in 2025 makes the same engineering tradeoff: quality requires a large diffusion model with many inference steps, and speed requires sacrificing one of the two.

Jun 30, 2026

PixelRAG: Your RAG Pipeline Is Losing 40% of the Evidence Before the LLM Ever Sees It. The Fix Is Screenshots.

HTML parsers are the silent quality tax on every RAG pipeline. They discard 40%+ of recoverable text from a web page, and choosing a different parser changes your final accuracy by up to 10 percentage points on SimpleQA.

Jun 29, 2026

Sakana Fugu: A 10,000-Parameter Evolved Head Coordinates GPT-5, Claude, and Gemini Better Than Any of Them Can Work Alone

Sakana Fugu (sakana.ai/fugu-beta) is not a foundation model in the conventional sense. It is a trained coordinator that orchestrates a pool of frontier models, assigning them roles and custom instructions at inference time, producing answers that outperform any individual model in the pool.

Jun 28, 2026

MLflow Made OpenTelemetry the Default Substrate for LLM Tracing. The Gen AI Semantic Conventions Are the Spec Everyone Is Converging On, Whether They Know It or Not.

Every LLM observability tool in 2026 is building on the same underlying substrate: OpenTelemetry's GenAI Semantic Conventions, a CNCF-backed standard that defines exactly which span attributes capture an LLM call, how token counts are structured, and what a tool invocation looks like in a trace. MLflow 3.6.0 shipped an OTLP endpoint at /v1/traces, dual export, and native gen_ai.* attribute support.
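
As a rough illustration of what those conventions look like on a single LLM call, the sketch below builds a plain attribute dictionary. The helper `llm_span_attributes` and the model name are hypothetical; the four `gen_ai.*` keys are attribute names from the OpenTelemetry GenAI semantic conventions. No tracing SDK or MLflow call is involved.

```python
def llm_span_attributes(model, input_tokens, output_tokens, operation="chat"):
    """Return span attributes for one LLM call, keyed by gen_ai.* names."""
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


# "example-model" is a placeholder, not a real model identifier.
attrs = llm_span_attributes("example-model", input_tokens=120, output_tokens=48)
total_tokens = (
    attrs["gen_ai.usage.input_tokens"] + attrs["gen_ai.usage.output_tokens"]
)
print(total_tokens)
```

Because the keys are standardized, any backend that understands the conventions can compute token totals or group by model without tool-specific parsing.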

Jun 27, 2026

SANA-Sprint Runs Text-to-Image in One Step. The Reason It Works Is Not the Distillation. It Is the Training Stability Fix Nobody Talks About.

Consistency distillation for text-to-image has been stuck at a fundamental tradeoff: good quality requires multiple steps, while one-step generation produces artifacts. SANA-Sprint (NVIDIA + MIT, arXiv:2503.09641) breaks this at 1024x1024 resolution with 0.1-second latency on an H100, achieving FID 7.59 and GenEval 0.72 in a single step, outperforming FLUX-schnell (12B parameters, 0.5 samples/s) while running a 0.6B model at 7.22 samples/s.

Jun 26, 2026

The Web Scraping Stack Behind AI Data Pipelines: Ten Open-Source Tools From HTTP to Agent-Driven Automation

The paid scraping industry sells access to open infrastructure you could own. The ten repos in this issue span every layer of the modern web data stack, from raw HTTP fingerprint impersonation at the bottom to AI agents that click, scroll, and log in at the top.

Jun 25, 2026

DeerFlow's Real Contribution Is Not Deep Research. It Is the Super Agent Harness Nobody Else Built.

DeerFlow v2.0 (bytedance/deer-flow, 68k stars, MIT) is a ground-up rewrite that shares zero code with v1.x. Version 1 was a single deep research pipeline. Version 2 is a super agent harness: an orchestration system that manages sub-agents, sandboxed code execution, long-term memory, and extensible skill packages, handling tasks that take minutes to hours rather than seconds.

Jun 24, 2026

The Agentic Web Stack: How x402, Agent Identity, and Micropayment Economics Form a Complete Protocol Layer

Five separate research papers and one production open standard, published between September 2025 and April 2026, each solve one piece of the same problem: how autonomous AI agents can discover, authenticate to, and pay for internet services without human intervention at each step.

Jun 23, 2026

x402 and the Agentic Market: HTTP 402 Was Defined in 1992 and Nobody Implemented It Until AI Agents Made It Necessary

HTTP status code 402, "Payment Required," dates back to the early HTTP drafts of 1992, and the HTTP/1.1 specification later carried it with an explicit note: "reserved for future use."

Jun 22, 2026

Auth.MD: The robots.txt for AI Agents That Lets Them Register for Services Without a Sign-Up Form

Your AI agent needs to sign up for a SaaS tool on your behalf. Today, the agent screen-scrapes a sign-up form, gets blocked by a CAPTCHA, or asks you to do it manually; failing that, your developer hard-codes credentials into the agent configuration.

Mohinish S

Serverless Ventures | Cloud, Data & Distributed Systems | Angel & Advisor | Infra & Data Startups