Snack On AI

Jul 17, 2026

Inkling: Thinking Machines Lab Built a 975B MoE With Controllable Thinking Effort, Relative Position Embeddings, and Short Convolutions on the Residual Stream. The Self-Fine-Tuning Demo Is the Real Signal.

Inkling (thinkingmachines/Inkling, open weights, July 15, 2026) is Thinking Machines Lab's first model release: a 975B-total/41B-active Mixture-of-Experts transformer with a 1M-token context window, encoder-free multimodal inputs (audio as dMel spectrograms, vision as 40x40-pixel patches via a 4-layer hMLP), controllable thinking effort (a float passed at inference time), and behavior shaped by 30M+ RL rollouts.
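
The 975B-total/41B-active split is what sparse expert routing buys: each token runs only the experts its router selects, so only about 41/975 ≈ 4.2% of the parameters are active for any given token. Below is a minimal top-k routing sketch in plain Python, assuming toy experts and a softmax renormalized over the selected experts only. The function name, `k=2`, and the four-expert setup are illustrative assumptions, not Inkling's published configuration.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A toy "expert": maps a hidden vector to a hidden vector of the same size.
Expert = Callable[[List[float]], List[float]]


def top_k_moe(
    x: List[float],
    router_logits: Sequence[float],
    experts: Sequence[Expert],
    k: int = 2,
) -> Tuple[List[float], List[int]]:
    """Run token x through only the k experts with the highest router logits.

    Illustrative sketch: expert count, k, and the gating rule are assumptions,
    not Inkling's actual configuration.
    """
    if len(router_logits) != len(experts):
        raise ValueError("need one router logit per expert")
    if not 1 <= k <= len(experts):
        raise ValueError("k must be between 1 and the number of experts")

    # Pick the k highest-scoring experts (sorted() is stable, so ties keep the lower index first).
    chosen = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:k]

    # Softmax over the chosen logits only, max-subtracted for numerical stability.
    peak = max(router_logits[i] for i in chosen)
    raw = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(raw)
    gates = [r / total for r in raw]

    # Gate-weighted sum of the chosen experts' outputs; unchosen experts never run.
    out = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        for j, value in enumerate(experts[idx](x)):
            out[j] += gate * value
    return out, chosen


if __name__ == "__main__":
    # Four toy experts that scale the input by 1, 2, 3, 4.
    experts = [lambda v, s=s: [s * e for e in v] for s in (1.0, 2.0, 3.0, 4.0)]
    y, used = top_k_moe([2.0], [0.0, 0.0, -5.0, -5.0], experts, k=2)
    print(used, y)  # [0, 1] [3.0]
    print(f"active fraction: {41 / 975:.3f}")  # 0.042
```

In a real MoE layer the experts are feed-forward blocks and routing happens per token inside every MoE layer; the sketch only shows why most weights sit idle for any single token.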

Jul 16, 2026

OpenScience: The Open-Source AI Workbench Launched Five Days After Claude Science. It Supports More Models, More Skills, and Runs on Your Infrastructure. The Tradeoff Is Everything That Comes With Being Five Days Old.

OpenScience (synthetic-sciences/openscience, Apache 2.0, v1.2.5, YC W26, openscience.sh) is a model-agnostic AI workbench for scientific research that runs the full research loop (literature review, hypothesis, code, experiment, analysis, and write-up) in one continuous session. It ships 250+ editable skills across ML, computational biology, cheminformatics, and cloud compute, plus 30+ scientific databases (UniProt, PDB, ChEMBL, arXiv, OpenAlex, Semantic Scholar) exposed as native agent tools. Any frontier or open-weight model can be plugged in with a single configuration flag, and the model can be switched per request.
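
One way to picture "one configuration flag, switchable per request" is a default model set once in configuration plus an optional override on each request. This is a hypothetical illustration, not OpenScience's configuration format: `WorkbenchConfig`, `resolve_backend`, the model names, and the endpoint URLs are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class WorkbenchConfig:
    """Hypothetical workbench config: one setting names the default model."""
    default_model: str
    # Hypothetical registry mapping a model name to the endpoint that serves it.
    backends: Dict[str, str] = field(default_factory=dict)


def resolve_backend(config: WorkbenchConfig, requested_model: Optional[str] = None) -> str:
    """Pick the endpoint for one request: the per-request model if given, else the default."""
    model = requested_model or config.default_model
    if model not in config.backends:
        raise KeyError(f"no backend registered for model {model!r}")
    return config.backends[model]


if __name__ == "__main__":
    cfg = WorkbenchConfig(
        default_model="local-open-weights",
        backends={
            "local-open-weights": "http://localhost:8000/v1",
            "hosted-frontier": "https://llm.example.com/v1",
        },
    )
    print(resolve_backend(cfg))                     # http://localhost:8000/v1
    print(resolve_backend(cfg, "hosted-frontier"))  # https://llm.example.com/v1
```

Keeping the override at request granularity means a single session could, for instance, draft with a local open-weight model and hand one analysis step to a hosted model without restarting.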

Jul 15, 2026

Atomic Task Graph: A 7B Model That Beats GPT-4 ReAct on ALFWorld and WebShop Has Nothing to Do With the 7B. It Is the Control Framework.

ATG (arXiv:2607.01942, South China University of Technology + Tsinghua University, July 2026) is a training-free control framework that represents LLM agent planning and execution as an explicit directed acyclic graph of atomic tool-use units.
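
An explicit DAG of atomic tool-use units can be executed with an ordinary topological sort. The sketch below assumes each atomic unit is a single tool call whose inputs are exactly the outputs of its declared dependencies, and uses Kahn's algorithm to run units once their dependencies finish. `AtomicTask`, `run_task_graph`, and the stub tools are hypothetical; ATG's actual node schema, failure handling, and any replanning are not shown.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class AtomicTask:
    """One atomic tool-use unit: a single tool call plus the units it depends on."""
    name: str
    tool: Callable[..., object]
    deps: Sequence[str] = ()


def run_task_graph(tasks: Sequence[AtomicTask]) -> Dict[str, object]:
    """Execute units in topological order (Kahn's algorithm); each tool gets its deps' outputs."""
    by_name = {t.name: t for t in tasks}
    if len(by_name) != len(tasks):
        raise ValueError("task names must be unique")
    indegree = {t.name: len(t.deps) for t in tasks}
    children: Dict[str, List[str]] = {t.name: [] for t in tasks}
    for t in tasks:
        for d in t.deps:
            if d not in by_name:
                raise ValueError(f"{t.name} depends on unknown task {d}")
            children[d].append(t.name)

    ready = deque(name for name, deg in indegree.items() if deg == 0)
    results: Dict[str, object] = {}
    while ready:
        name = ready.popleft()
        task = by_name[name]
        # Inputs are the outputs of the declared dependencies, in declaration order.
        results[name] = task.tool(*(results[d] for d in task.deps))
        for child in children[name]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    # Any unit never reached had a dependency that never finished: a cycle.
    if len(results) != len(tasks):
        raise ValueError("task graph contains a cycle")
    return results


if __name__ == "__main__":
    # Stub tools standing in for ALFWorld-style actions.
    graph = [
        AtomicTask("find_mug", lambda: "mug_1"),
        AtomicTask("go_to_sink", lambda: "sink_1"),
        AtomicTask("clean", lambda mug, sink: f"clean {mug} at {sink}", deps=("find_mug", "go_to_sink")),
    ]
    print(run_task_graph(graph)["clean"])  # clean mug_1 at sink_1
```

Because every data dependency is an explicit edge, the control flow is fixed by the graph rather than re-derived by the model at each step, which is the property the post attributes the 7B model's results to.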

Jul 14, 2026

DeLM: The Multi-Agent Framework That Proved the Central Orchestrator Is the Bottleneck, Not the Solution

DeLM (yuzhenmao/DeLM, arXiv:2606.10662, Stanford University, June 2026) is a decentralized multi-agent framework where parallel agents coordinate through a shared verified context and a task queue, with no central controller.
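
The coordination pattern (peers pulling from a shared queue and writing to a shared, verified context, with nobody assigning work) can be sketched with threads. This is an illustration under stated assumptions: agents are threads, "verified" means passing a caller-supplied check, the first verified result for a task wins, and `SharedContext`, `peer_agent`, and `run_peers` are hypothetical names. DeLM's actual verification procedure and context format are not reproduced here.

```python
import queue
import threading
from typing import Callable, Dict, List


class SharedContext:
    """Shared context that only admits results passing verification."""

    def __init__(self, verify: Callable[[str, str], bool]) -> None:
        self._verify = verify
        self._lock = threading.Lock()
        self.entries: Dict[str, str] = {}

    def commit(self, task: str, result: str) -> bool:
        if not self._verify(task, result):
            return False  # unverified results never reach the shared context
        with self._lock:
            self.entries.setdefault(task, result)  # first verified result wins
        return True


def peer_agent(tasks: "queue.Queue[str]", ctx: SharedContext, solve: Callable[[str], str]) -> None:
    """Each agent pulls its own work; no controller assigns tasks."""
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return
        try:
            ctx.commit(task, solve(task))
        finally:
            tasks.task_done()


def run_peers(
    task_list: List[str],
    n_agents: int,
    solve: Callable[[str], str],
    verify: Callable[[str, str], bool],
) -> Dict[str, str]:
    """Setup only: fill the queue, start identical peers, wait for them to drain it."""
    tasks: "queue.Queue[str]" = queue.Queue()
    for t in task_list:
        tasks.put(t)
    ctx = SharedContext(verify)
    agents = [threading.Thread(target=peer_agent, args=(tasks, ctx, solve)) for _ in range(n_agents)]
    for a in agents:
        a.start()
    for a in agents:
        a.join()
    return ctx.entries


if __name__ == "__main__":
    print(run_peers(["a", "b", "c"], 3, str.upper, lambda t, r: r == t.upper()))
```

`run_peers` only launches and joins the agents; it makes no scheduling decisions, so throughput is bounded by the queue and the verifier rather than by a single orchestrating model.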

Jul 13, 2026

FlashInfer: The Attention Kernel Library That Proves the Bottleneck in LLM Inference Was Never the Model. It Was the Memory Access Pattern.

FlashInfer (flashinfer-ai/flashinfer, Apache 2.0, 5.8k stars, MLSys 2025, arXiv:2501.01005) is a kernel library and kernel generator for LLM inference serving. Its three core contributions are a block-sparse composable format for heterogeneous KV-cache storage, a JIT-compiled customizable attention template system, and a load-balanced scheduling algorithm that works with CUDAGraph despite dynamic batching.
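
A paged KV cache is one layout a block-sparse format can describe: each request is a row, each KV page is a block column, and the batch's page table fits in CSR-style arrays. The pure-Python sketch below illustrates that layout and does not call FlashInfer; `pages_for_request`, `kv_lengths`, the array names, and the page size of 16 are assumptions chosen for the example.

```python
from typing import List, Sequence


def pages_for_request(indptr: Sequence[int], indices: Sequence[int], req: int) -> List[int]:
    """Physical page ids owned by request `req`: the CSR row slice indices[indptr[req]:indptr[req+1]]."""
    return list(indices[indptr[req]:indptr[req + 1]])


def kv_lengths(indptr: Sequence[int], last_page_len: Sequence[int], page_size: int) -> List[int]:
    """Tokens per request: every page is full except the last, which holds last_page_len tokens."""
    lengths = []
    for req in range(len(indptr) - 1):
        n_pages = indptr[req + 1] - indptr[req]
        lengths.append((n_pages - 1) * page_size + last_page_len[req] if n_pages else 0)
    return lengths


if __name__ == "__main__":
    # Three requests owning 3, 1, and 2 pages; pages need not be contiguous in memory.
    indptr = [0, 3, 4, 6]
    indices = [7, 2, 9, 4, 0, 5]
    last_page_len = [5, 16, 1]
    print(pages_for_request(indptr, indices, 0))  # [7, 2, 9]
    print(kv_lengths(indptr, last_page_len, 16))  # [37, 16, 17]
```

The uneven lengths (37, 16, 17 tokens) are the reason load balancing matters: splitting work per request would leave some thread blocks idle while others finish a long row.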

Jul 12, 2026

M Star: Stanford and UW Built a Universal Multimodal Serving System. The Key Insight Is That Every Model, From BAGEL to V-JEPA to Qwen3-Omni, Is Just a Graph. Every Request Is Just a Walk.

M Star (mstar-project/mstar, arXiv:2606.12688, preprint June 2026, Stanford + University of Washington + CMU) is a universal serving runtime for composite multimodal models. Its core abstraction is the Walk Graph: a model is a directed computation graph of heterogeneous components, and every request executes as a series of Walks over that graph.
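
The Walk Graph idea can be sketched as a registry of components plus allowed edges, where a request names the sequence of components it visits. This sketch assumes components are plain functions passing a single payload and that each step of a walk must follow a registered edge; `WalkGraph`, its methods, and the component names are hypothetical, and M Star's scheduling, cross-request batching, and data movement are not modeled.

```python
from typing import Callable, Dict, Sequence, Set, Tuple

Component = Callable[[object], object]


class WalkGraph:
    """A model as a directed graph of heterogeneous components (illustrative sketch)."""

    def __init__(self) -> None:
        self.components: Dict[str, Component] = {}
        self.edges: Set[Tuple[str, str]] = set()

    def add_component(self, name: str, fn: Component) -> None:
        self.components[name] = fn

    def connect(self, src: str, dst: str) -> None:
        if src not in self.components or dst not in self.components:
            raise KeyError("both endpoints must be registered components")
        self.edges.add((src, dst))

    def run_walk(self, walk: Sequence[str], payload: object) -> object:
        """Execute one request as a walk: visit components in order along existing edges."""
        if not walk:
            raise ValueError("a walk must visit at least one component")
        for name in walk:
            if name not in self.components:
                raise ValueError(f"unknown component {name}")
        for src, dst in zip(walk, walk[1:]):
            if (src, dst) not in self.edges:
                raise ValueError(f"no edge {src} -> {dst}")
        for name in walk:
            payload = self.components[name](payload)
        return payload


if __name__ == "__main__":
    g = WalkGraph()
    g.add_component("audio_encoder", lambda x: f"enc({x})")
    g.add_component("llm", lambda x: f"llm({x})")
    g.add_component("speech_decoder", lambda x: f"speech({x})")
    g.connect("audio_encoder", "llm")
    g.connect("llm", "speech_decoder")
    print(g.run_walk(["llm"], "hi"))                                     # llm(hi)
    print(g.run_walk(["audio_encoder", "llm", "speech_decoder"], "wav"))  # speech(llm(enc(wav)))
```

Two requests against the same graph can take different walks (text-only versus audio-in/speech-out), which is the sense in which one runtime can serve very different composite models.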