Sashiko: The AI That Reviews Linux Kernel Code Better Than Most Humans (And Everyone Knows It)

SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | June 9, 2026

The Linux kernel community spent decades building the gold standard of open-source code review. Thousands of maintainers, hundreds of specialized subsystem reviewers, mailing lists stretching back 30 years. In March 2026, an 11-stage agentic AI system quietly started outperforming that entire apparatus on the one metric that actually matters: catching bugs before they ship.

That system is Sashiko (刺し子, "little stabs"). On a completely unfiltered set of 1,000 recent upstream kernel commits whose bugs were later corrected by commits carrying Fixes: tags, it flagged 53.6% of those bugs. Every single one of them had already passed human review.

Not a research demo. Not a synthetic benchmark. Production, on the real Linux Kernel Mailing List, funded by Google, owned by the Linux Foundation.

What It Actually Does

Sashiko is not a code-generation tool, not a copilot, not a fuzzer. It does one thing: reads proposed kernel patches and produces the kind of review a senior kernel engineer would write, but faster, at scale, without getting tired, and without caring whose name is on the patch.

It monitors lore.kernel.org for new submissions to the LKML and subsystem mailing lists. When a patch arrives, Sashiko ingests it along with significant surrounding kernel git history, runs an 11-stage review protocol, and posts findings back to the web interface (email delivery is in progress). Maintainers like Andrew Morton now wait for Sashiko's output before merging into their trees.

The scope is real: Sashiko caught use-after-free bugs, lock ordering violations, missing return checks, UAPI breakages, uninitialized memory leaks, and DMA mapping errors. These are the classes of bugs that kernel fuzzers like Syzkaller find only after the code panics in CI or the bug surfaces as a CVE in production.

The stack: written in Rust, backed by SQLite, talks to Gemini 3.1 Pro or Claude via API, ingests patches from NNTP (lore.kernel.org), and maintains a local Linux kernel git tree for context retrieval.

The Architecture, Unpacked

The design centers on a deliberate architectural choice: depth over speed. Sashiko does not try to parallelize one generic review; it sequences eleven specialized review passes, each building on the prior one.

Caption: Focus on Stage 8-10, the deduplication and conflict resolution loop. This is what keeps the false positive rate under 20%.

The review-prompt base was seeded from Chris Mason's kernel-review-prompts (Mason created Btrfs). Sashiko layers its own per-subsystem prompt overrides on top, so a memory management patch gets different heuristics than a driver patch.

The tool calls available to the LLM during review are deliberately narrow: read_file, git_grep, find_function, find_callchain, find_callers. No internet access, no code execution, no mutation. This is a read-only analysis agent. The constraint is intentional: uncontrolled tool access introduces latency and unpredictability; a small expressive toolset lets you reason about worst-case cost per review.
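To make the "narrow, read-only toolset" idea concrete, here is a minimal sketch of an allowlisting dispatcher. The tool names come from the list above; the class name ReadOnlyToolbox, the audit_log attribute, and the path check are hypothetical illustrations, not Sashiko's actual Rust implementation. Only read_file is implemented, and only as a stand-in.

```python
import tempfile
from pathlib import Path

# Tool names are from the article; everything else here is a hypothetical sketch.
ALLOWED_TOOLS = {"read_file", "git_grep", "find_function", "find_callchain", "find_callers"}


class ReadOnlyToolbox:
    """Dispatches allowlisted tool calls and records each one for later audit."""

    def __init__(self, tree_root: str):
        self.root = Path(tree_root).resolve()
        self.audit_log = []  # (tool_name, args) tuples, in call order

    def dispatch(self, tool: str, **args):
        if tool not in ALLOWED_TOOLS:
            raise PermissionError(f"tool not allowed: {tool}")
        self.audit_log.append((tool, dict(args)))
        handler = getattr(self, f"_{tool}", None)
        if handler is None:
            raise NotImplementedError(f"{tool} is not implemented in this sketch")
        return handler(**args)

    def _read_file(self, path: str) -> str:
        target = (self.root / path).resolve()
        # Refuse paths that escape the kernel tree (e.g. "../../etc/passwd").
        if not target.is_relative_to(self.root):
            raise PermissionError(f"path escapes tree: {path}")
        return target.read_text()


# Usage: a throwaway directory stands in for the kernel tree.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo.c").write_text("int x;\n")
    box = ReadOnlyToolbox(tmp)
    print(box.dispatch("read_file", path="demo.c"))
    print(box.audit_log)
```

Because every call goes through one dispatch function, the worst-case behavior is bounded by the allowlist, and the audit log gives you the replayable trace the design is after.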

The Code, Annotated

Snippet 1: Claude provider configuration with prompt caching

# Settings.toml
[ai]
provider = "claude"
model = "claude-sonnet-4-5"
max_input_tokens = 180000  # ← Safety margin below 200K context limit

[ai.claude]
prompt_caching = true  # ← THIS is the trick
# Reuses the kernel context across the 11 review stages.
# Gemini 3.1 Pro runs the same config at 950K tokens.
# Without caching, each stage re-sends the full patch + context bundle,
# multiplying API cost by 11x. Caching amortizes the kernel history
# across stages at a 5-minute TTL per cached prefix.

The caching design is doing serious cost engineering here. A typical patch with full git history context might consume 100K-200K tokens. Without prompt caching, running 11 stages means paying for 1.1M-2.2M input tokens per patch. With caching, you pay full price for the context once and a discounted cached-read rate when later stages reuse it. At production LKML volume, this is the difference between financially feasible and infeasible.
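The arithmetic behind that claim can be sketched in a few lines. The 11 stages and the 100K-200K context sizes are from the text; the cache-write and cache-read multipliers (1.25x and 0.10x of the base input price) are assumptions for illustration, so check your provider's current pricing before relying on them. The sketch also ignores per-stage instructions and output tokens.

```python
def billed_token_equivalents(context_tokens, stages,
                             cache_write_mult=1.25, cache_read_mult=0.10):
    """Return (uncached, cached) input cost in full-price token equivalents.

    cache_write_mult and cache_read_mult are ASSUMED illustrative multipliers.
    """
    # Without caching, every stage re-sends the whole context at full price.
    uncached = context_tokens * stages
    # With caching, stage 1 writes the cache and the remaining stages read it.
    cached = (context_tokens * cache_write_mult
              + context_tokens * cache_read_mult * (stages - 1))
    return uncached, cached


for ctx in (100_000, 200_000):
    uncached, cached = billed_token_equivalents(ctx, stages=11)
    print(f"context={ctx:,}  uncached={uncached:,.0f}  "
          f"cached~={cached:,.0f}  ratio={uncached / cached:.1f}x")
```

Under these assumed multipliers the saving is several-fold rather than a full 11x, because cached reads are discounted, not free.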

Snippet 2: Environment setup for NNTP ingestion and review concurrency

[nntp]
server = "nntp.lore.kernel.org"
port = 119
groups = ["org.kernel.linux.kernel", "org.kernel.linux.mm"]
# ← Monitors the actual LKML + subsystem lists; no polling delay tuning exposed

[review]
concurrency = 4         # ← Parallel reviews; bounded by LLM rate limits
worktrees = 4           # ← Git worktrees for parallel kernel history reads
                        # worktrees let concurrent reviews read the same
                        # kernel repo without blocking each other on file locks

[git]
kernel_path = "/path/to/linux"  # ← Must be a full clone with history
                                 # Sashiko reads callers, callchains, file history
                                 # A shallow clone breaks context retrieval entirely

The worktrees = 4 pairing with concurrency = 4 is not accidental. Git worktrees let each parallel review get an independent view of the kernel tree without git locking overhead. Without this, concurrent reviews would serialize on the repo lock and eliminate the concurrency benefit.
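A rough sketch of that pairing, assuming a simple pool that leases one worktree to one review at a time. WorktreePool, run_reviews, and the worktree names are hypothetical; in a real setup each path would be created beforehand with something like `git worktree add --detach <path> <commit>`.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class WorktreePool:
    """Hands out worktree paths so no two reviews share one at the same time."""

    def __init__(self, paths):
        self._free = queue.Queue()
        for path in paths:
            self._free.put(path)

    @contextmanager
    def lease(self):
        path = self._free.get()  # blocks until some worktree is free
        try:
            yield path
        finally:
            self._free.put(path)  # always return the worktree to the pool


def run_reviews(patch_ids, pool, concurrency, review_fn):
    """Call review_fn(patch_id, worktree_path) for each patch, bounded in parallel."""
    def task(patch_id):
        with pool.lease() as path:
            return review_fn(patch_id, path)

    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        return list(executor.map(task, patch_ids))  # results keep input order


# Usage with a stand-in review function.
pool = WorktreePool([f"worktree-{i}" for i in range(4)])
results = run_reviews(range(8), pool, concurrency=4,
                      review_fn=lambda pid, path: (pid, path))
print(results)
```

If concurrency exceeded the number of worktrees, the extra workers would simply block in lease(), which is why matching the two numbers avoids idle threads.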

It In Action: End-to-End Worked Example

Input: A patch to mm/huge_memory.c that refactors zap_huge_pmd(), submitted to the linux-mm mailing list.

Step 1: Ingestion (under 2 seconds)

Sashiko's NNTP monitor sees the new message on org.kernel.linux.mm. It extracts the patch diff, identifies the files touched (mm/huge_memory.c, mm/mmap.c), and bundles them with context from git history.
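The "identifies the files touched" part of ingestion can be approximated by scanning the diff headers. This is a minimal sketch, not Sashiko's parser: touched_files and SAMPLE_PATCH are hypothetical names, and the regex assumes git-style headers with no spaces in paths.

```python
import re

# Matches git-style headers such as: diff --git a/mm/foo.c b/mm/foo.c
_DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)


def touched_files(patch_text: str) -> list[str]:
    """Return post-image paths named in the diff headers, in first-seen order."""
    seen = []
    for match in _DIFF_HEADER.finditer(patch_text):
        path = match.group(2)
        if path not in seen:
            seen.append(path)
    return seen


# Made-up patch fragment for illustration only.
SAMPLE_PATCH = """\
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1111111..2222222 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1,3 +1,3 @@
-old line
+new line
"""

print(touched_files(SAMPLE_PATCH))
```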

Step 2: Context expansion (~30 seconds)

Using find_callchain and find_callers, Sashiko reads:

The full definition of zap_huge_pmd() before the patch
All callers of zap_huge_pmd() in the tree (pmd spinlock holders, mmap_lock contexts)
Related functions the patch modifies or calls
The last 20 commits touching those files

Context bundle: typically 80K-150K tokens for a mid-size mm patch.
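One way to think about the bundle is as a priority-ordered packing problem under the max_input_tokens limit from Settings.toml. The sketch below is an assumption about how such packing could work, not a description of Sashiko's code: build_bundle, the section labels, and the 4-characters-per-token estimate are all hypothetical.

```python
def estimate_tokens(text: str) -> int:
    # Crude ASSUMPTION: about 4 characters per token. A real system would
    # use the model provider's tokenizer instead.
    return max(1, len(text) // 4)


def build_bundle(sections, max_input_tokens):
    """Pack (label, text) sections in priority order within a token budget.

    Returns (labels_kept, estimated_tokens_used).
    """
    kept, used = [], 0
    for label, text in sections:
        cost = estimate_tokens(text)
        if used + cost > max_input_tokens:
            continue  # too big; still try later, possibly smaller, sections
        kept.append(label)
        used += cost
    return kept, used


# Placeholder text sizes chosen only to exercise the budget logic.
sections = [
    ("patch diff", "x" * 8_000),
    ("pre-patch definition", "x" * 12_000),
    ("callers", "x" * 400_000),
    ("recent commits", "x" * 200_000),
]
print(build_bundle(sections, max_input_tokens=180_000))
print(build_bundle(sections, max_input_tokens=60_000))
```

Ordering the sections by importance means a tight budget drops the bulkiest low-priority context first instead of truncating the patch itself.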

Step 3: 11-stage review (~4-8 minutes total with Gemini 3.1 Pro)

Each stage gets the same context bundle (cached after Stage 1) plus focused instructions. Two examples:

Stage 5 (locking): checks whether the patch preserves pmd spinlock ordering, whether RCU read-side critical sections are balanced, and whether mmap_lock is held correctly across the refactored call sites
Stage 6 (security): checks for TOCTOU windows, OOB access on huge page boundary conditions, any new copy_to_user without bounds checks

Step 4: Deduplication and conflict resolution (Stages 8-9)

Stage 8 consolidates all concerns from Stages 1-7. Stage 9 compares them against dismissed concerns (situations where a later stage determined a concern from an earlier stage was a false positive due to locking context that Stage 3 could not see). Result: typically 3-7 confirmed findings, severity-ranked.
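The consolidate-then-filter logic described here can be sketched as a keyed merge. The Finding fields, the (file, function, category) dedup key, and the consolidate function are hypothetical; the source does not say how Sashiko identifies duplicates, so treat this as one plausible shape.

```python
from dataclasses import dataclass

SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class Finding:
    stage: int
    file: str
    function: str
    category: str
    severity: str
    note: str


def consolidate(findings, dismissed_keys):
    """Merge duplicate concerns, drop dismissed ones, and rank by severity."""
    merged = {}
    for f in findings:
        key = (f.file, f.function, f.category)  # ASSUMED notion of "same concern"
        if key in dismissed_keys:
            continue  # a later stage judged this concern a false positive
        best = merged.get(key)
        if best is None or SEVERITY_ORDER[f.severity] < SEVERITY_ORDER[best.severity]:
            merged[key] = f  # keep the most severe report of each concern
    return sorted(merged.values(), key=lambda f: SEVERITY_ORDER[f.severity])


# Made-up findings for illustration.
raw = [
    Finding(3, "mm/huge_memory.c", "zap_huge_pmd", "locking", "medium", "lock dropped early"),
    Finding(5, "mm/huge_memory.c", "zap_huge_pmd", "locking", "high", "pmd lock not held on error exit"),
    Finding(6, "mm/huge_memory.c", "zap_huge_pmd", "bounds", "low", "possible off-by-one"),
]
dismissed = {("mm/huge_memory.c", "zap_huge_pmd", "bounds")}
for finding in consolidate(raw, dismissed):
    print(finding.severity, finding.note)
```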

Step 5: Report generation (Stage 11)

Output is formatted as a standard LKML inline reply:

On [date], [author] wrote:
> +	if (pmd_trans_huge(*pvmw.pmd))

The refactored path drops the pmd_lock() annotation in the error exit
at line 847. The caller at mmap.c:3021 holds mmap_lock for write, but
not the pmd spinlock. This can race with split_huge_pmd() running
concurrently on the same pmd, leading to use-after-free on the pmd page.

Severity: high
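A report in that shape is easy to produce mechanically. The following is a minimal sketch assuming the layout shown above (attribution line, quoted diff lines, wrapped comment, severity footer); format_inline_reply is a hypothetical helper, not part of Sashiko.

```python
import textwrap


def format_inline_reply(date, author, quoted_lines, comment, severity, width=72):
    """Render one finding as a plain-text inline reply."""
    header = f"On {date}, {author} wrote:"
    quoted = "\n".join(f"> {line}" for line in quoted_lines)  # mailing-list quoting
    body = textwrap.fill(comment, width=width)                # wrap the comment text
    return f"{header}\n{quoted}\n\n{body}\n\nSeverity: {severity}\n"


print(format_inline_reply(
    "[date]", "[author]",
    ["+\tif (pmd_trans_huge(*pvmw.pmd))"],
    "The refactored path no longer takes the pmd lock on the error exit, "
    "so it can race with a concurrent split of the same pmd.",
    "high",
))
```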

Real numbers: Sashiko reviewed 4,765 patchsets on LKML by April 1, 2026. The 252 linux-mm reviews with findings produced 981 findings in total: 164 low severity, 243 medium, 518 high, and 56 critical (the counts exceed 252 because a single review can carry several findings). 85% of findings concern the submitted change or its direct interactions, not unrelated pre-existing code. The per-review bug hit rate on linux-mm is approximately 73.5% (vs 54.4% on the broader LKML), likely because memory management code has denser interdependencies that generate more findings.
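A quick sanity check on those linux-mm numbers, assuming the four severity counts are findings rather than reviews (they sum to more than 252, so they cannot be reviews). The variable names are illustrative.

```python
# Severity counts and review count as quoted in the article.
findings_by_severity = {"low": 164, "medium": 243, "high": 518, "critical": 56}
reviews_with_findings = 252

total_findings = sum(findings_by_severity.values())
findings_per_review = total_findings / reviews_with_findings
shares = {sev: count / total_findings for sev, count in findings_by_severity.items()}

print(f"total findings: {total_findings}")
print(f"findings per review: {findings_per_review:.2f}")
for sev, share in shares.items():
    print(f"{sev:>8}: {share:.1%}")
```

The average lands just under four findings per review, which sits inside the "typically 3-7 confirmed findings" range quoted for the deduplication step.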

Why This Design Works (and What It Trades Away)

Why it works:

The sequential 11-stage pipeline mimics expert specialization. A single generic "review this patch" prompt cannot simultaneously hold locking discipline, DMA coherency rules, RCU constraints, and UAPI backward compatibility in focus. By giving each concern its own stage, Sashiko forces the LLM to reason about the code from a single specialized angle at a time. The deduplication stages then cross-reference those focused reviews.
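In code, the "one concern per stage, each building on the last" pattern looks roughly like the loop below. This is a sketch under assumptions: run_stages and the ask_llm callback signature are hypothetical, the model call is stubbed so the example runs offline, and only the two stage focuses named in this article (locking, security) are used.

```python
from typing import Callable

# Hypothetical callback: (context, stage_instruction, concerns_so_far) -> new concerns
AskLLM = Callable[[str, str, list], list]


def run_stages(context: str, stage_instructions: list, ask_llm: AskLLM) -> list:
    """Run review stages one after another, feeding earlier concerns forward."""
    concerns = []
    for number, instruction in enumerate(stage_instructions, start=1):
        # Each stage sees the same context plus everything raised so far.
        new = ask_llm(context, instruction, list(concerns))
        concerns.extend((number, text) for text in new)
    return concerns


def fake_llm(context, instruction, concerns_so_far):
    # Stand-in for a real model call.
    return [f"{instruction}: saw {len(concerns_so_far)} prior concerns"]


result = run_stages("<patch + kernel context>", ["locking", "security"], fake_llm)
for stage_number, text in result:
    print(stage_number, text)
```

The point of the shape is that the context argument never changes between stages, which is what makes it a cacheable prefix, while the instruction narrows the model's attention.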

The read-only constraint on tool use means reproducible, auditable behavior. You can log exactly which files were read, which git commands ran, and replay the review.

Seeding from Chris Mason's prompts gave Sashiko a standing start. These prompts encode years of subsystem-specific review heuristics that would have taken months to rediscover by trial and error.

What it trades away:

Every patch with a full kernel history context bundle costs real money. Google is funding the production instance; self-hosted deployments need careful cost monitoring. The 5-minute Claude prompt cache TTL means that whenever a review stalls long enough for the cached prefix to expire between stage calls, the later stages re-pay the full context cost.

The probabilistic nature of LLM output means the same patch may get different findings on different runs. Sashiko acknowledges this explicitly. It is a probabilistic layer in a deterministic pipeline, which means you cannot treat its output as final, only as evidence.

Race-condition vulnerabilities are harder. The Patch-to-PoC study (arxiv:2602.07287) on LLM-based kernel vulnerability reproduction found that race conditions with complex thread interleavings remain challenging for LLM agents, which are good at invoking syscall sequences but struggle to reason about non-deterministic concurrent execution paths.

Technical Moats

Prompt engineering depth. The 11-stage protocol with per-subsystem overrides is not replicable by copying the README. It encodes kernel-specific review heuristics that took kernel developers years to internalize. Mason's review-prompts repo is a starting point; the actual production prompts at Sashiko are the real asset.

Context bundle construction. The find_callchain and find_callers tools retrieve exactly the context that a human reviewer would read, not a naive sliding window over the diff. Getting this right requires understanding kernel coding conventions (e.g., understanding that rcu_dereference() is only valid inside a read-side critical section, so you need to trace the callers to know whether the critical section exists).
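The rcu_dereference() example boils down to a question about the caller graph: does every path into this function pass through something that holds the read lock? Here is a toy sketch of that check. The call graph, the function names, and always_under_rcu are made up for illustration, and the model is deliberately simplified: it treats a lock-holding caller as protecting all of its callees and conservatively reports cycles as unprotected.

```python
def always_under_rcu(func, callers_of, holds_rcu_lock, _path=()):
    """True if every caller chain into func goes through a lock-holding function."""
    if func in holds_rcu_lock:
        return True
    callers = callers_of.get(func, [])
    if not callers or func in _path:
        # No callers (an unprotected entry point) or a cycle: assume unprotected.
        return False
    return all(
        always_under_rcu(caller, callers_of, holds_rcu_lock, _path + (func,))
        for caller in callers
    )


# Toy call graph (callee -> callers). All names are hypothetical.
callers_of = {
    "lookup_entry": ["reader_path", "ioctl_path"],
    "reader_path": ["syscall_read"],
    "ioctl_path": ["syscall_ioctl"],
}
holds_rcu_lock = {"reader_path"}  # functions that take rcu_read_lock() around calls

print(always_under_rcu("lookup_entry", callers_of, holds_rcu_lock))
print(always_under_rcu("lookup_entry", callers_of, holds_rcu_lock | {"syscall_ioctl"}))
```

A sliding window over the diff would never see ioctl_path, which is exactly the unprotected route a caller-aware reviewer has to find.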

Deduplication architecture. Stages 8-9 are the false-positive filters. Running stages in sequence without a deduplication pass would flood maintainers with redundant findings. The conflict resolution in Stage 9 is where the 20% false positive cap is enforced.

Linux Foundation ownership. The code belongs to the kernel community now. This is not a vendor tool that can be deprecated or gated. The compute is donated; the protocol is open.

Contrarian Insights

Insight 1: The 53.6% bug catch rate is evidence that human code review is structurally broken at scale, not that Sashiko is good.

Human reviewers missed 100% of the bugs Sashiko caught. Not because the reviewers are bad engineers, but because the Linux kernel processes thousands of patches per month and the review burden is concentrated on a handful of senior maintainers. The Purdue FLINT study (arxiv:2603.24825) quantified this: the memory management subsystem depends significantly on just a few developers for reviews, and they are already failing to keep up with submission volume. Sashiko did not raise the bar. It revealed how low the bar was.

Insight 2: Mandatory Sashiko review will make kernel development slower in the short term and that is the correct engineering call.

Andrew Morton has already signaled he will delay merges to give Sashiko time to run. Lorenzo Stoakes objected that this adds burden to overworked maintainers. Both are right. The correct framing is: the kernel community has been accepting a hidden bug tax by optimizing for review throughput. Sashiko makes the tax visible. Slowing down is not a failure mode; it is the correct response to discovering that an automated reviewer flags 53.6% of the bugs your review process let through, bugs that eventually required Fixes: tags in production commits.

Takeaway

Sashiko reads more code than the diff. The review protocol instructs the LLM to call find_callers and find_callchain on every modified function. This means roughly 1 in 5 linux-mm reviews includes findings about pre-existing bugs in code the patch author merely touched but did not write. When you submit a refactor to zap_huge_pmd() and Sashiko flags a use-after-free in a caller you did not modify, that bug ends up in your inbox. This is technically correct behavior (you are the person best positioned to understand the context), but it explains why maintainers perceive the false-positive rate as higher than the official sub-20% figure: the "false positives" are often real bugs, just not in the code you submitted.

TL;DR For Engineers

Sashiko is an 11-stage sequential agentic reviewer written in Rust, running Gemini 3.1 Pro or Claude, reviewing every LKML submission in production today
It caught 53.6% of bugs on a 1,000-commit benchmark in which every bug had already passed human review
The architecture uses prompt caching (Claude) or large context windows (Gemini at 950K tokens) to amortize 11 LLM calls per patch into a tractable cost
Read-only tool use (read_file, git_grep, find_callers, find_callchain) keeps behavior auditable and reproducible
The 73.5% per-review hit rate on linux-mm vs 54.4% on broader LKML is consistent with subsystem-specialized prompts helping, though the denser interdependencies of memory management code are the more likely driver

Conclusion: The First AI That the Linux Kernel Actually Trusts

Sashiko passed the only test that matters in the kernel community: maintainers changed their workflow because of it. Andrew Morton now waits for its output before merging. The Linux Foundation owns the code. Google funds the compute. The review protocol is open and forkable.

This is not AI as a productivity tool. It is AI as infrastructure, embedded in the most scrutinized open-source project in computing history. Whether that accelerates the next generation of kernel development or creates a new class of review theater depends entirely on whether the community treats Sashiko's output as evidence or as authority.

That distinction will define the next three years of kernel development more than any model upgrade.

