SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | May 10, 2026

The prevailing assumption in enterprise AI deployment is that frontier model APIs are the path of least resistance for complex knowledge work. Call GPT or Claude, get a good answer, pay the inference cost, accept the latency. This works until inference costs become unsustainable, which Databricks says is already happening for its Agent Bricks product.

KARL (Knowledge Agent via Reinforcement Learning, arXiv:2603.05218, Databricks AI Research, March 5, 2026) is the documented result of the alternative path: train a custom model specifically for enterprise grounded reasoning using reinforcement learning on synthetic data, and measure whether it can match frontier models on quality while beating them on cost and latency. The paper makes four distinct technical contributions: KARLBench (a six-task multi-capability evaluation suite), multi-task heterogeneous RL training, an agentic synthesis pipeline for hard-to-verify training data, and OAPL (a new off-policy RL post-training paradigm). The benchmark results claim 33% lower cost and 47% lower latency versus Claude Opus 4.6 at matched quality on KARLBench.

The caveat is important and the paper does not hide it: Databricks designed KARLBench itself and has not released the dataset for independent verification. The most independently meaningful result, which is also the most technically interesting, is something different: KARL was trained on only two of the six KARLBench task types and generalizes to all four held-out tasks without additional training. This out-of-distribution generalization is harder to game through benchmark design and is the stronger evidence that RL produced genuine reasoning capabilities rather than task-specific memorization.

This newsletter dissects KARL as a systems and training methodology document: what KARLBench's six task types reveal about enterprise search diversity, how the agentic synthesis pipeline generates hard-to-verify training data at scale, what OAPL's off-policy design solves that online GRPO could not handle for large MoE models, why trained context compression matters more than embedding model choice, and what the SFT-versus-RL ablation reveals about the limits of distillation for novel task generalization.

Scope: KARL architecture, KARLBench task taxonomy, OAPL training paradigm, agentic synthesis pipeline, context compression, and the SFT-vs-RL ablation. Not covered: Agent Bricks product integration details or the WebSailor/SearchGym papers beyond brief comparison.

What It Actually Does

KARL is a knowledge agent trained by Databricks AI Research to perform grounded reasoning (multi-step information gathering combined with complex reasoning grounded in retrieved evidence) across a diverse set of enterprise search tasks. It is built on GLM 4.5 Air (Zeng et al., 2025) as the base model and post-trained using OAPL, Databricks' new off-policy RL paradigm.

The key architecture decision: KARL is a single-tool agent. During both training and evaluation, it has access to exactly one tool: a vector search endpoint against a document corpus. This constraint is deliberate. It isolates the agent's reasoning quality from tool selection, routing, and API-calling capabilities. KARL must solve all six KARLBench task types using only iterative vector search.
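
Below is a minimal sketch of that single-tool loop at inference time. It assumes a generic vector_search callable and a hypothetical model.next_action interface; neither reflects the paper's actual API, only the shape of the constraint.

# Minimal single-tool agent loop (illustrative sketch, not the paper's code).
# The only action available to the policy is issuing another vector search.

def run_karl_episode(model, vector_search, question: str, max_steps: int = 30) -> str:
    context: list[str] = []  # retrieved chunks plus any compressed scratchpad notes
    for _ in range(max_steps):
        # The policy sees the question and its current context and emits either
        # another search query or a final answer.
        decision = model.next_action(question=question, context=context)  # assumed method
        if decision["type"] == "answer":
            return decision["text"]
        # Single tool: vector search against the document corpus, nothing else
        context.extend(vector_search(decision["query"], k=5))
    # Out of budget: answer from whatever evidence was gathered
    return model.next_action(question=question, context=context, force_answer=True)["text"]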

KARLBench's six task regimes:

Task                                       | Type                                        | Source benchmark
Constraint-driven entity search            | Boolean + filter over retrieved docs        | BrowseComp-Plus
Cross-document report synthesis            | Multi-hop, multi-source synthesis           | TREC-Biogen
Tabular numerical reasoning                | Long-doc + table extraction + arithmetic    | FinanceBench
Exhaustive entity retrieval                | Find ALL relevant entities, not just top-k  | QAMPARI
Procedural reasoning over technical docs   | Multi-step instruction following            | FreshStack
Fact aggregation over enterprise notes     | Internal knowledge base Q&A                 | PMBench

KARL was trained on BrowseComp-Plus and TREC-Biogen (the first two). The four remaining tasks were held out and never seen during training. Generalization to these held-out tasks is the central result.

The Architecture, Unpacked

Focus on the training task mismatch: KARL is trained on 2 of 6 KARLBench tasks and evaluated on all 6. The four held-out tasks were never seen during training. Generalization to these tasks, not the headline cost/latency numbers, is the primary evidence that RL developed generalizable search reasoning.

The Code, Annotated

Snippet One: OAPL Off-Policy RL Objective (design intent reconstruction)

# OAPL: Off-Policy Agentic Post-training for Large-batch RL
# (Ritter et al. 2026, concurrent with KARL paper)
# This reconstructs the key design decisions from the paper's description.

import torch

def oapl_loss(
    model,
    trajectory_batch: list[dict],  # pre-collected trajectories (off-policy)
    reward_fn,  # hard-to-verify reward: LLM judge or exact match
) -> torch.Tensor:
    """
    OAPL: Iterative large-batch off-policy RL for agentic tasks.

    Why off-policy instead of on-policy (GRPO)?
    ← THIS is the key design decision:
    Online GRPO requires tight coupling between the trainer and the inference engine.
    For large-scale MoE models, this coupling creates infrastructure complexity:
    - clipped importance weighting (to correct for policy drift)
    - data deletion (when trajectories are too off-policy)
    - router replay (to maintain load balancing in MoE)
    OAPL embraces off-policyness: it accepts that training and inference
    engines will diverge, and designs the objective to be robust to this.
    Result: no clipped IS weights, no deletion, no replay. Simpler infrastructure.

    Large-batch approach: collect many trajectories, update once per batch
    ← This is more sample-efficient than per-step online updates
    ← Easier to parallelize across many inference workers
    """
    total_loss = torch.tensor(0.0, requires_grad=True)

    for traj in trajectory_batch:
        # Each trajectory: sequence of (state, action, tool_result) triples
        # Action = one vector search query issued by the agent
        # tool_result = retrieved document chunks (context)
        states = traj['states']       # list of context states
        actions = traj['actions']     # list of search queries issued
        final_answer = traj['answer'] # agent's final synthesized answer
        corpus_id = traj['corpus_id'] # which document corpus was searched

        # Reward: hard-to-verify (no single correct answer for synthesis tasks)
        # For BrowseComp-Plus: exact entity match or LLM-based binary judge
        # For TREC-Biogen: TREC evaluation metrics (precision, recall on citations)
        # ← THIS is the critical challenge: reward signal for grounded reasoning
        #   cannot use simple string matching. LLM-as-judge introduces noise.
        reward = reward_fn(
            answer=final_answer,
            ground_truth=traj['ground_truth'],
            task_type=traj['task_type'],
        )

        # Policy gradient over full trajectory
        # Off-policy correction: unclipped importance weight against the behavior
        # policy that collected the trajectory. OAPL accepts IS weight variance
        # in exchange for stability; there is no clipping.
        # The 'behavior_log_prob' field is assumed to be stored at collection
        # time (the field name here is illustrative).
        log_probs = compute_trajectory_log_prob(model, states, actions, final_answer)
        behavior_log_prob = traj.get('behavior_log_prob', log_probs.detach().sum())
        is_weight = torch.exp(log_probs.detach().sum() - behavior_log_prob)

        # ← Multi-task: simply add losses from different task types
        # No separate task-specific routing or weighting required
        # BrowseComp-Plus loss + TREC-Biogen loss → combined gradient
        task_loss = -is_weight * reward * log_probs.sum()
        total_loss = total_loss + task_loss

    return total_loss / len(trajectory_batch)
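

def compute_trajectory_log_prob(model, states, actions, final_answer):
    """
    Helper assumed by oapl_loss above; a minimal sketch, not from the paper.

    Scores every segment the agent generated (each search query, then the final
    answer) under the current policy, conditioned on the context state it was
    generated from. Assumes a HuggingFace-style causal LM with a tokenizer
    attached as model.tokenizer; both are assumptions of this reconstruction.
    """
    segment_log_probs = []
    pairs = list(zip(states, actions)) + [(states[-1], final_answer)]
    for context, generated in pairs:
        prompt_ids = model.tokenizer(context, return_tensors="pt").input_ids
        target_ids = model.tokenizer(generated, return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, target_ids], dim=1)
        logits = model(input_ids).logits
        # logits at position i predict token i+1; slice out the target span
        target_logits = logits[:, prompt_ids.shape[1] - 1 : -1, :]
        token_log_probs = torch.log_softmax(target_logits, dim=-1).gather(
            -1, target_ids.unsqueeze(-1)
        ).squeeze(-1)
        segment_log_probs.append(token_log_probs.sum())
    return torch.stack(segment_log_probs)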


def compute_context_compression_reward(
    compressed_context: str,
    original_retrieved_docs: list[str],
    final_answer_quality: float,
) -> float:
    """
    Implicit reward for context compression via final answer quality.

    KARL learns to compress retrieved documents end-to-end.
    There is no explicit compression objective: the model receives reward
    for producing the correct final answer, and compression emerges
    because shorter, more relevant context → better final reasoning.

    Ablation result from the paper:
    Without trained compression: significantly worse performance
    ← Embedding model choice: robust (swapping doesn't hurt much)
    ← Context compression: critical (removing this collapses quality)
    """
    # ← THIS is the trick: compression is NOT a separate module
    # RL trains the agent to produce a compressed scratchpad as a byproduct
    # of maximizing final answer reward
    # The reward signal does not mention compression at all
    return final_answer_quality  # compression emerges from this signal alone

The multi-task training simplicity is the OAPL design payoff. Online GRPO for a multi-task MoE model would require careful routing, task-balanced batches, and per-task gradient management. OAPL's large-batch off-policy approach combines task losses with a simple sum. The infrastructure complexity the paper avoids is what would otherwise make multi-task RL on large MoE models impractical.
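
As a hedged usage sketch, the snippet below shows what that simplicity looks like operationally: one batch mixing trajectories from both training tasks, one task-dispatching reward function, one combined gradient. The helper names (collect_trajectories, citation_f1) and the reward logic are illustrative stand-ins, not the paper's implementations.

# Illustrative: one heterogeneous batch, one loss, no per-task machinery.

def entity_exact_match(answer: str, ground_truth: str) -> bool:
    # stand-in binary check for BrowseComp-Plus-style entity answers
    return answer.strip().lower() == ground_truth.strip().lower()

def mixed_task_reward(answer, ground_truth, task_type):
    if task_type == "browsecomp_plus":
        return float(entity_exact_match(answer, ground_truth))  # binary reward
    if task_type == "trec_biogen":
        return citation_f1(answer, ground_truth)  # graded reward; scorer is a stand-in
    raise ValueError(f"unexpected training task: {task_type}")

# collect_trajectories is assumed to wrap the rollout workers; model and
# optimizer come from the surrounding training setup.
trajectory_batch = (
    collect_trajectories(task="browsecomp_plus", n=512)
    + collect_trajectories(task="trec_biogen", n=512)
)
loss = oapl_loss(model, trajectory_batch, reward_fn=mixed_task_reward)
loss.backward()  # one combined gradient across both task types
optimizer.step()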

Snippet Two: Agentic Synthesis Pipeline and Hard-to-Verify Reward Design

# Agentic synthesis pipeline: generating training data for hard-to-verify tasks
# This is the most underappreciated contribution in the KARL paper.

class AgenticSynthesisPipeline:
    """
    Generate diverse, grounded (q, a, trajectory) triples for RL training.

    The problem with hard-to-verify tasks:
    Enterprise grounded reasoning often has no single correct answer.
    "Synthesize a report on Q3 clinical trial results from these 47 papers"
    has no exact-match ground truth. Standard RL reward functions fail here.

    KARL's approach:
    1. Use frontier LLMs with tool use to generate high-quality (q, a) pairs
    2. Verify quality by having a separate LLM judge the answer-evidence alignment
    3. Use the resulting (q, a, trajectory) triples as RL training data
    4. Bootstrap: use increasingly capable KARL models to generate harder data
    """

    def __init__(self, frontier_model, vector_search_tool, judge_model):
        self.frontier_model = frontier_model  # e.g., Claude or GPT for data gen
        self.search_tool = vector_search_tool
        self.judge_model = judge_model  # LLM-as-judge for quality verification

    def generate_training_triple(
        self,
        corpus: list[str],
        task_type: str,
    ) -> dict | None:
        """
        Generate one (query, answer, trajectory) triple via agentic exploration.
        Returns None if the generated answer fails quality verification.
        """
        # Step 1: Frontier model explores the corpus with tool use
        # ← Long-horizon reasoning: 10-50 search steps to find the answer
        # This is expensive but produces diverse, grounded training data
        trajectory = []
        context_window = []

        # Generate a query that requires multi-step search
        query = self.frontier_model.generate_query(
            corpus_sample=corpus[:10],  # give a glimpse of the corpus
            task_type=task_type,
            diversity_prompt="Generate a query that requires at least 5 search steps"
        )

        # Agentic exploration: frontier model searches and synthesizes
        max_steps = 50
        for step in range(max_steps):
            search_query = self.frontier_model.plan_next_search(
                original_query=query,
                trajectory_so_far=trajectory,
                context=context_window,
            )

            # Execute vector search (the only tool available)
            retrieved_docs = self.search_tool.search(search_query, k=5)
            trajectory.append({'query': search_query, 'results': retrieved_docs})

            # Model decides whether to continue searching or synthesize
            if self.frontier_model.should_synthesize(trajectory, query):
                break

            context_window.extend(retrieved_docs)

        # Generate final answer from full trajectory
        answer = self.frontier_model.synthesize_answer(query, trajectory)

        # Step 2: Quality verification via LLM judge
        # ← THIS is the trick: hard-to-verify tasks need LLM-as-judge
        #   rather than exact match. The judge checks:
        #   - Is the answer grounded in the retrieved documents?
        #   - Does the answer correctly address the query?
        #   - Are citations accurate (for synthesis tasks)?
        quality_score = self.judge_model.evaluate(
            query=query,
            answer=answer,
            retrieved_docs=trajectory,
            criteria=['grounding', 'accuracy', 'completeness'],
        )

        if quality_score < 0.7:  # quality threshold
            return None  # discard low-quality training examples

        return {
            'query': query,
            'answer': answer,
            'trajectory': trajectory,
            'task_type': task_type,
            'quality_score': quality_score,
        }

    def bootstrap_round(self, current_karl_model, harder_threshold: float = 0.8):
        """
        Iterative bootstrapping: use current KARL to generate harder training data.

        ← Why bootstrap? Because frontier-model-generated data may be too easy
        for a KARL model that is already performing well on BrowseComp-Plus.
        Using KARL itself to generate data creates harder queries and longer
        trajectories that push the model's current capability ceiling.
        """
        # Current KARL generates queries it almost (but not quite) gets right
        # ← Uses quality threshold to select "near-miss" examples
        # These are maximally informative for RL: the model gets signal
        # at exactly the boundary of its current capability
        pass  # full implementation requires KARL inference loop

The iterative bootstrapping loop is the mechanism that keeps training data difficulty calibrated to the model's current capability level. A frontier model generating training data for KARL's first training round produces data that is appropriate for the base GLM 4.5 Air model. By round three, KARL itself generates harder examples that push its capability ceiling further.
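
A sketch of what one bootstrap round could look like, reusing the pipeline above with KARL itself as the generator. The near-miss band, the karl_model.answer method, and the thresholds are assumptions based on the paper's description, not published code.

# Illustrative bootstrap round: KARL replaces the frontier model as the data
# generator, and only "near-miss" examples (almost solved, not quite) are kept.

def run_bootstrap_round(pipeline, karl_model, judge, corpus,
                        n_candidates=500, lower=0.4, upper=0.75):
    pipeline.frontier_model = karl_model  # current KARL generates candidate data
    selected = []
    for _ in range(n_candidates):
        triple = pipeline.generate_training_triple(corpus, task_type="browsecomp_plus")
        if triple is None:
            continue  # failed the pipeline's own quality verification
        # Have the current policy attempt the query, then judge its attempt
        attempt = karl_model.answer(triple["query"])  # assumed inference method
        attempt_score = judge.evaluate(
            query=triple["query"],
            answer=attempt,
            retrieved_docs=triple["trajectory"],
            criteria=["grounding", "accuracy", "completeness"],
        )
        # Near-miss band: hard enough to carry signal, not hopelessly hard
        if lower <= attempt_score < upper:
            selected.append(triple)
    return selected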

KARL in Action: End-to-End Worked Example

Task type: Cross-document report synthesis (TREC-Biogen, one of the two training tasks)

Input query: "What are the efficacy and safety outcomes for the use of vedolizumab versus biologics in patients with inflammatory bowel disease? Cite supporting evidence from clinical trial reports."

Agent execution (KARL, up to ~30 search steps):

Step 1: search("vedolizumab IBD efficacy clinical trial")
  → Retrieved: 5 documents about vedolizumab Phase 3 trials
  Context window so far: 5 docs, ~8,000 tokens

Step 2: search("vedolizumab versus adalimumab comparative trial IBD")
  → Retrieved: 3 comparative studies
  Context: 8 docs, ~13,000 tokens

Step 3: search("vedolizumab safety profile adverse events ulcerative colitis")
  → Retrieved: 4 safety reports
  Context: 12 docs, ~20,000 tokens

[Agent compresses context at this point]
  ← Trained context compression: agent writes a structured scratchpad
    "Key efficacy: GEMINI trials show 47.1% clinical response at week 6.
     Safety: lower incidence of opportunistic infections vs anti-TNF agents.
     Comparative: VARSITY trial shows superiority to adalimumab in UC."
  Context after compression: ~4,000 tokens (80% reduction)

Steps 4-18: [additional targeted searches for specific safety metrics,
            pediatric studies, Crohn's vs UC subgroups, mechanism of action]

Step 19: [Agent synthesizes report]

Final answer: 847-word clinical report with 23 citation references,
  covering: efficacy endpoints (clinical remission, endoscopic remission),
  safety profile (infection rates, hypersensitivity), comparative effectiveness,
  and patient selection criteria.

Evaluation: TREC-Biogen citation precision/recall metrics
  Citation precision: 0.87 (23 cited, 20 correct)
  Citation recall: 0.74 (27 relevant docs in corpus, found 20)

Out-of-distribution task: Exhaustive entity retrieval (QAMPARI, held out during training)

Query: "List ALL drugs approved by the FDA for treatment of relapsed/refractory
        multiple myeloma between 2015 and 2024."

KARL behavior (never trained on exhaustive retrieval):
  Steps 1-8: Searches for myeloma drug approvals by year
  Steps 9-15: Cross-references FDA approval dates
  Steps 16-22: Checks for completeness, searches for missed approvals
              (the exhaustive retrieval signal emerged from RL training,
               not task-specific supervision)

Output: [bortezomib (2003), carfilzomib (2012), lenalidomide (2006),
         pomalidomide (2013), daratumumab (2015), elotuzumab (2015),
         ixazomib (2015), selinexor (2019), isatuximab (2020),
         belantamab mafodotin (2020), idecabtagene vicleucel (2021),
         teclistamab (2022), talquetamab (2023), elranatamab (2023)]

Evaluation: QAMPARI F1 score
  KARL (RL-trained, OOD): 0.68
  SFT-distilled baseline (OOD): 0.51
  Claude Opus 4.6 (OOD): 0.71

← The OOD gap (0.68 vs 0.51 for SFT) is the key result.
  SFT learns to imitate the training task pattern.
  RL develops retrieval strategies that transfer.
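
For reference, exhaustive-retrieval scoring of this kind reduces to set overlap between predicted and gold entities. The sketch below is a generic precision/recall/F1 computation; the official QAMPARI scorer additionally handles aliasing and answer normalization.

# Generic set-overlap scoring for exhaustive entity retrieval (illustrative).

def entity_prf(predicted: list[str], gold: list[str]) -> tuple[float, float, float]:
    pred = {e.strip().lower() for e in predicted}
    ref = {e.strip().lower() for e in gold}
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1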

Test-time compute scaling:

KARLBench overall score (best-of-N parallel rollouts):
  N=1:   KARL < Sonnet 4.6, KARL < Opus 4.6
  N=3:   KARL > Sonnet 4.6, KARL < Opus 4.6
  N=10:  KARL ≈ Opus 4.6 (matches)
  N>10:  KARL > Opus 4.6 (surpasses)

At N=1 (baseline cost):
  Cost: 33% less than Claude Opus 4.6
  Latency: 47% lower than Claude Opus 4.6
  Quality: Below Opus 4.6 but above Sonnet 4.6

The Pareto frontier: at every point on the KARLBench cost-quality curve, KARL dominates
both Claude 4.6 variants, delivering either higher quality at the same cost or the same
quality at lower cost.
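
Best-of-N parallel rollouts are straightforward to sketch: run the agent N times in parallel and keep the answer a selector scores highest. The selector here is assumed to be an LLM-judge-style scorer; the paper does not specify its exact selection mechanism.

# Best-of-N test-time scaling (illustrative sketch).

from concurrent.futures import ThreadPoolExecutor

def best_of_n(run_episode, question: str, selector, n: int = 10) -> str:
    # run_episode could be the single-tool loop sketched earlier
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda _: run_episode(question), range(n)))
    scores = [selector.score(question, answer) for answer in candidates]  # assumed judge API
    return candidates[scores.index(max(scores))]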

Why This Design Works, and What It Trades Away

Multi-task RL training is the design decision that produces OOD generalization. Training KARL on only two task types (BrowseComp-Plus and TREC-Biogen) with a single objective that combines both losses forces the model to learn search strategies that are not task-specific. A model trained on constraint-driven entity search alone learns to search for entities under constraints. A model trained on report synthesis alone learns to gather and organize evidence for reports. KARL trained on both simultaneously learns that all six KARLBench tasks share the underlying requirement of iterative evidence gathering, and the strategies that work for gathering evidence under constraints also work for exhaustive entity retrieval.

The SFT ablation is the clearest evidence that this interpretation is correct. SFT distillation of the same two tasks produces a model that improves substantially on in-distribution tasks (69.1 → 75.3 with parallel sampling) but barely improves on OOD tasks (59.4 → 59.6 with parallel sampling). The model learned to imitate the expert behavior on training tasks rather than the underlying search strategy. RL learns the strategy. SFT learns the behavior. The distinction matters when you encounter a task outside the training distribution.

Trained context compression emerging from end-to-end RL is the most important ablation result in the paper. When the paper removes the agent's ability to compress its context mid-trajectory, quality drops significantly across all tasks. This compression is not implemented as a separate module. It is a behavior that RL induces because maintaining a compressed, relevant context window produces better final answers, which produces higher rewards. The model teaches itself to compress because compression leads to better outcomes.
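
A minimal sketch of how that compression behavior slots into the agent loop, assuming a token-budget trigger and a model.summarize call; both the threshold and the method name are illustrative, since the paper describes compression as an emergent policy behavior rather than a fixed module:

# Illustrative: compression as just another generation the policy learns to emit,
# with no separate module and no explicit compression reward.

def count_tokens(chunks: list[str]) -> int:
    # crude whitespace proxy for token count (stand-in for a real tokenizer)
    return sum(len(c.split()) for c in chunks)

def maybe_compress(model, question: str, context: list[str], token_budget: int = 6000) -> list[str]:
    if count_tokens(context) <= token_budget:
        return context
    # The policy rewrites everything retrieved so far into a short structured
    # scratchpad. Nothing in the reward mentions compression; the behavior is
    # reinforced only because a shorter, more relevant context improves the
    # final answer and therefore the reward.
    scratchpad = model.summarize(question=question, context=context)  # assumed method
    return [scratchpad]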

What KARL trades away:

KARLBench independence. Databricks designed and controls KARLBench. The 33% cost and 47% latency claims are measured on Databricks' own benchmark, which has not been released for independent verification. The OOD generalization result is the claim that is hardest to game through benchmark design, and it remains meaningful. The headline cost and latency numbers should be treated as internal benchmarks until third-party replication is available.

Single-tool constraint. KARL uses only vector search. Enterprise search pipelines typically require SQL queries, API calls, structured data lookups, and web search in addition to vector retrieval. A single-tool constraint produces cleaner research results but limits direct applicability to production enterprise stacks that require multi-tool orchestration.

Task coverage beyond the six KARLBench types. The six tasks cover a meaningful subset of enterprise knowledge work but not all of it. Tasks requiring graph traversal, time-series reasoning over structured databases, or multi-modal retrieval are not represented. The OOD generalization claim applies within the space of text-based retrieval and synthesis tasks.

Technical Moats

OAPL's robustness to trainer-inference discrepancy. Large-scale MoE models like GLM 4.5 Air cannot be loaded into the same environment as the RL trainer without significant engineering overhead. Online GRPO, which requires tight coupling between inference and training, hits infrastructure complexity walls: clipped importance weighting, data deletion policies, and router replay all exist to handle policy drift in online settings. OAPL's embrace of off-policyness eliminates this complexity. The ability to use vLLM for fast inference while training with a separate optimizer, without algorithmic corrections, is a genuine infrastructure advantage.
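
The infrastructure shape this enables can be sketched as two fully decoupled loops: inference workers that dump trajectories to storage, and a trainer that consumes whatever is there, however stale. The file-based handoff and the rollout API below are assumptions for illustration, not Databricks' actual stack.

# Illustrative decoupled layout: no trainer-inference coupling to maintain.

import json
import pathlib

def inference_worker(policy_snapshot, tasks, out_dir: str = "trajectories/") -> None:
    # Runs on the serving stack (e.g., a vLLM fleet) with a frozen policy snapshot.
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, task in enumerate(tasks):
        traj = policy_snapshot.rollout(task)  # assumed rollout API
        (out / f"{i}.json").write_text(json.dumps(traj))

def trainer_step(model, reward_fn, optimizer, in_dir: str = "trajectories/") -> None:
    # Consumes trajectories collected by an older snapshot; OAPL's objective is
    # designed to tolerate exactly this staleness.
    batch = [json.loads(p.read_text()) for p in sorted(pathlib.Path(in_dir).glob("*.json"))]
    loss = oapl_loss(model, batch, reward_fn)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()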

Iterative bootstrapping with increasingly capable models. The agentic synthesis pipeline generates training data using the current best model, which generates harder queries as the model improves. This self-improving data loop is the mechanism that keeps training data difficulty calibrated to the model's current capability ceiling. Replicating this requires both the inference infrastructure to run large-scale rollouts and the judgment infrastructure to verify generated query-answer quality, both of which Databricks has as a core platform capability.

The KARLBench benchmark itself. Whether or not the benchmark is ideal for independent comparison, the six-task taxonomy is the most comprehensive characterization of enterprise search diversity published as a single evaluation suite. Any competing enterprise knowledge agent now has a target. Building an equivalent multi-task evaluation that covers constraint-driven search, synthesis, tabular reasoning, exhaustive retrieval, procedural reasoning, and internal knowledge retrieval requires significant curation effort.

Insights

Insight One: KARL's cost-latency advantage is real but the more important claim is the OOD generalization, and industry coverage has inverted this priority.

The 33% cost and 47% latency reductions are the headline numbers in both the Databricks blog post and most coverage. They are also the claims most dependent on KARLBench design choices and least independently verifiable. The SFT-versus-RL OOD ablation (SFT OOD: 59.4→59.6 with parallel sampling, RL OOD: meaningful scaling) is the claim that stands regardless of benchmark design. If RL produces models that generalize to unseen task types while SFT produces models that do not, this distinction matters for every enterprise that does not have six clean task categories in their knowledge work. Most enterprises have task distributions that do not match any training set. The OOD generalization result is therefore the actual competitive advantage.

Insight Two: KARL demonstrates that the hard problem in training enterprise knowledge agents is not the RL algorithm. It is generating reward signals for hard-to-verify tasks, and the agentic synthesis pipeline is the solution that has received the least attention.

Reinforcement learning for coding agents has exact-match reward: code either passes the tests or it does not. Reinforcement learning for enterprise grounded reasoning does not have exact-match reward: synthesis tasks, report generation, and multi-hop entity retrieval have no single correct answer. The paper's third contribution, the agentic synthesis pipeline that generates diverse, grounded, hard-to-verify training data via frontier model exploration with LLM-as-judge quality verification, is what makes the RL training tractable. Without a mechanism to generate high-quality training data for hard-to-verify tasks, the RL algorithm is irrelevant. The pipeline is the bottleneck, and it receives two paragraphs in the abstract compared to extensive treatment of OAPL.

Takeaway

KARL's trained context compression is more important than the choice of embedding model for vector search, according to the ablation study, and this inverts the conventional wisdom about RAG optimization.

The standard enterprise RAG optimization conversation focuses heavily on embedding model selection: which embedding model produces the best retrieval quality, how much does chunk size matter, which retrieval strategy (dense vs. sparse vs. hybrid) dominates. KARL's ablation shows that swapping the embedding model produces negligible quality difference, while removing the agent's trained ability to compress its context mid-trajectory produces significant quality degradation across all tasks. The agent's learned compression of retrieved context, which emerges from end-to-end RL training with no explicit compression objective, matters more than which embedding model fetched the context in the first place. For teams spending engineering effort optimizing their RAG embedding pipeline, this is the most practically actionable finding in the paper: the reasoning model's ability to distill what it retrieves may matter more than the precision of the retrieval itself.

TL;DR For Engineers

  • KARL (Databricks, arXiv:2603.05218, March 2026) is a knowledge agent trained via RL on GLM 4.5 Air for enterprise grounded reasoning. Trained on 2 of 6 KARLBench task types, generalizes to all 4 held-out types. Pareto-optimal vs Claude 4.6 and GPT 5.2 on cost-quality and latency-quality on KARLBench.

  • Four contributions: KARLBench (6-task multi-capability eval), multi-task heterogeneous RL training, agentic synthesis pipeline for hard-to-verify data, and OAPL (iterative large-batch off-policy RL robust to trainer/inference discrepancy without clipped IS weights, data deletion, or router replay).

  • The critical ablation: SFT distillation improves OOD performance from 59.4→59.6 with parallel sampling. RL scales meaningfully OOD. SFT learns behavior, RL learns strategy. This distinction drives the OOD generalization result that is harder to game through benchmark design than cost/latency claims.

  • Trained context compression (emergent from RL with no explicit objective) is more important than embedding model choice. Ablation shows removing compression collapses quality; swapping the embedding model does not.

  • Test-time compute scaling: N=3 rollouts beats Sonnet 4.6 quality, N=10 matches Opus 4.6 quality, at 33% lower cost and 47% lower latency than Opus 4.6 at N=1 on KARLBench (Databricks' own benchmark, not independently verified).

The Benchmark Is Theirs. The Generalization Result Is Not.

KARL makes two claims: a cost-latency-quality claim on a benchmark Databricks controls, and an OOD generalization claim that is structurally harder to manipulate. The industry has been discussing the first and largely ignoring the second. The correct framing is that the OOD generalization (training on two task types and generalizing to four held-out types without additional supervision) is the evidence that Databricks' RL training approach did something real, not just something that looks good on a proprietary benchmark. If that result holds under independent evaluation, the combination of OAPL's training efficiency, the agentic synthesis pipeline's ability to generate hard-to-verify training data, and the resulting OOD transfer represents a replicable template for training purpose-built enterprise knowledge agents at lower cost than routing through frontier APIs.

References

KARL (arXiv:2603.05218, Databricks AI Research, March 2026) is a knowledge agent trained via multi-task RL on GLM 4.5 Air for enterprise grounded reasoning across six task types: constraint-driven entity search, cross-document report synthesis, tabular numerical reasoning, exhaustive entity retrieval, procedural reasoning, and fact aggregation. Trained on only two task types and generalizing to four held-out types, KARL demonstrates that multi-task RL learns transferable search strategies where SFT distillation (OOD: 59.4→59.6 with parallel sampling, vs RL's meaningful OOD scaling) merely imitates task-specific behavior. OAPL, the concurrent off-policy RL paradigm, achieves robustness to trainer-inference discrepancy without clipped importance weighting, data deletion, or router replay. Trained context compression, emerging from end-to-end RL with no explicit objective, is the most important single capability in ablations, outweighing embedding model choice.
