The rise of agentic AI systems that can plan, use tools, and handle long tasks has exposed a clear evaluation gap. While these models perform well on leaderboards, our ability to reliably measure their real-world performance has not kept up. As a result, many enterprises remain cautious about deploying them, since traditional benchmarks often fail to reflect real operational complexity.

To close this gap, new evaluation methods are needed. In response, Snorkel AI and its partners have launched a $3 million Open Benchmarks Grants program to support the creation of open, realistic benchmarks that better reflect how modern AI systems are actually used.

Core Dimensions of Next-Gen Evaluation

Future benchmarks need to evolve in several key ways to reflect how AI systems are actually used in the real world:

Environment Complexity:
Real-world AI systems operate in messy, dynamic environments, not clean, text-only settings. Agents must handle domain-specific context, multiple input types, and real toolchains. In practice, this includes incomplete documentation, rate limits, internal policies, and collaboration with humans. Benchmarks should reflect these conditions instead of testing isolated, ideal scenarios.

Autonomy Horizon:
Evaluation must go beyond short, single-step tasks and measure long-term reliability. Many real failures emerge gradually as agents drift from goals or accumulate small errors. Benchmarks should test agents over extended action sequences to assess their ability to stay aligned, recover from mistakes, and adapt to change.
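
As an illustration only, a harness for this kind of long-horizon grading might look like the sketch below; the agent and goal-check stand-ins are toy placeholders, not part of any benchmark in the program.

```python
# A minimal sketch of long-horizon evaluation (hypothetical agent/goal-check
# stubs, not from any specific benchmark): the whole action sequence is graded,
# not just the final state, so goal drift and recovery become measurable.
from dataclasses import dataclass

@dataclass
class StepRecord:
    step: int
    action: str
    on_track: bool   # did this step still serve the original goal?

def evaluate_long_horizon(agent_act, goal_check, max_steps: int = 100) -> dict:
    """Roll the agent forward many steps and grade the full trajectory."""
    records = []
    for t in range(max_steps):
        action = agent_act(t)                        # agent picks the next action
        records.append(StepRecord(t, action, goal_check(action)))
    off_track = [r for r in records if not r.on_track]
    recoveries = sum(
        1 for a, b in zip(records, records[1:]) if not a.on_track and b.on_track
    )
    return {
        "steps": len(records),
        "drift_rate": len(off_track) / max(len(records), 1),
        "recoveries": recoveries,   # times the agent got back on track after drifting
    }

# Toy stand-ins: an "agent" that wanders off the goal every fifth step.
report = evaluate_long_horizon(
    agent_act=lambda t: "explore" if t % 5 == 4 else "work_on_goal",
    goal_check=lambda action: action == "work_on_goal",
    max_steps=20,
)
print(report)  # {'steps': 20, 'drift_rate': 0.2, 'recoveries': 3}
```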

Output Complexity:
Modern agents produce complex outputs like codebases or reports, which cannot be judged using simple pass/fail metrics. Benchmarks need richer evaluation methods that assess quality, reasoning, and clarity. This includes checking whether agents manage uncertainty well and know when to involve a human.
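
As one illustration of richer scoring, a weighted rubric can replace a single pass/fail bit. The criteria and weights in the sketch below are assumptions for demonstration, not the program's actual methodology.

```python
# Illustrative sketch of rubric-based scoring: each output is graded on several
# weighted criteria, including whether the agent flagged uncertainty or asked
# for human help when it should have. Criteria and weights are made up here.
RUBRIC = {
    "correctness": 0.4,   # does the output do what was asked?
    "reasoning": 0.3,     # is the chain of reasoning sound and traceable?
    "clarity": 0.2,       # is the report or codebase readable and well organized?
    "uncertainty": 0.1,   # did the agent surface unknowns or escalate when needed?
}

def rubric_score(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion grades (each in [0, 1]) into one weighted score."""
    return sum(RUBRIC[name] * criterion_scores.get(name, 0.0) for name in RUBRIC)

# Example: a technically correct report with weak reasoning and no escalation.
print(rubric_score({"correctness": 1.0, "reasoning": 0.5, "clarity": 0.8, "uncertainty": 0.0}))
# approximately 0.71
```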

The Open Benchmarks Grants Program

To address these challenges, Snorkel AI, supported by partners like Hugging Face, Prime Intellect, Together AI, Factory HQ, Harbor, and PyTorch, announced the Open Benchmarks Grants, a $3 million initiative to support open and high-quality AI evaluation tools. Instead of giving cash, the program provides practical support such as expert data-labeling services, hands-on help from engineers, and compute credits. The goal is to help teams build realistic benchmarks that better measure how advanced AI systems perform.

Projects selected under this program will create open-source datasets, benchmarks, and evaluation tools that push AI evaluation forward. Teams work closely with Snorkel’s researchers and advisors to design strong, reliable tests, while partner organizations supply the needed platforms and infrastructure. A steering committee made up of experienced AI leaders, including university faculty, reviews proposals to ensure the work meets high scientific standards.

All results from the program must be released under open and permissive licenses, such as MIT or Apache for code and CC BY for data. This means anyone in the AI community can use, study, and improve these benchmarks. While no direct cash funding is provided, teams receive valuable services and credits, and they keep ownership of what they build, as long as it remains openly available to others.

Expert Data-as-a-Service (DaaS)

A key innovation in this program is Snorkel’s Expert Data-as-a-Service (DaaS). Traditional crowdsourced labeling often cannot capture deep domain knowledge. Instead, Snorkel mobilizes a network of specialist experts, hundreds of professionals with advanced degrees (PhDs, MDs, JDs, CPAs, etc.), to craft and verify data. Their network spans over a thousand specialized domains, from aviation to oncology.

Each benchmark dataset is built with multi-layer quality control. Domain experts author detailed examples, annotations, or chain-of-thought reasoning. Their work is peer-reviewed by other experts, and AI-powered checks flag inconsistencies. This expert-driven approach yields very high-fidelity labels, with accuracy typically cited near 99% for specialized tasks. Top experts can earn more than $3,000 per week, reflecting the intensive expertise required.

To scale further, Snorkel applies programmatic labeling. Instead of hand-labeling each example, experts write labeling functions: short code snippets that encode rules or heuristics. These functions automatically label large datasets at machine speed. In effect, programmatic labeling leverages expert insight at scale, generating orders of magnitude more labeled data than manual processes.
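
As a rough sketch of the idea, here is what programmatic labeling can look like with the open-source snorkel library; the task, labeling functions, and data below are invented for illustration and are not taken from the grants program or Snorkel's internal platform.

```python
# Illustrative programmatic labeling with the open-source `snorkel` library:
# experts encode heuristics as labeling functions, which are applied to
# unlabeled examples and combined into probabilistic labels.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NOT_CLAIM, CLAIM = -1, 0, 1  # label schema for a toy task

@labeling_function()
def lf_mentions_refund(x):
    # Heuristic from a (hypothetical) domain expert: refund language signals a claim.
    return CLAIM if "refund" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_message(x):
    # Very short messages are rarely substantive claims.
    return NOT_CLAIM if len(x.text.split()) < 4 else ABSTAIN

df_train = pd.DataFrame({"text": [
    "I would like a refund for my cancelled flight.",
    "Thanks!",
    "Please process the refund as discussed.",
]})

# Apply every labeling function to every example, then model their agreements
# and conflicts to produce one probabilistic label per example.
applier = PandasLFApplier(lfs=[lf_mentions_refund, lf_short_message])
L_train = applier.apply(df=df_train)

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=200, seed=0)
print(label_model.predict(L=L_train))
```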

Example Benchmarks & Infrastructure

The grants will fund benchmarks that embody these principles. Notable examples and tools include:

  • Terminal-Bench 2.0:
    A leading coding agent benchmark with 89 real-world engineering tasks such as builds and ML pipelines. Its tasks are calibrated so that top agents score around 50%, preserving a meaningful learning signal. Trivial “Hello World” problems are eliminated, ensuring every task reflects real engineering work and remains challenging even for frontier models.

  • Harbor Evaluation Framework:
    An open-source harness for agent benchmarks that abstracts containerized task execution. Harbor enables researchers to run thousands of parallel agent trials across different cloud environments. Each task includes instructions and a reward script, while detailed agent trajectories (tool use, observations, and actions) are logged in a standardized format. This makes cross-agent and cross-environment comparison feasible; a hypothetical sketch of this task-and-trajectory structure follows the list below.

  • Domain-Specific Benchmarks:
    The program prioritizes specialized fields. For example, Snorkel Underwrite evaluates insurance underwriting decisions using real policies, regulations, and databases curated by certified underwriters. A Theoretical Physics Benchmark poses PhD-level problems, where early results show state-of-the-art models solve fewer than 20% correctly. These benchmarks often require multi-turn reasoning and clarifying questions, mirroring real expert workflows.
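
To make the Harbor description above more concrete, the sketch below shows one hypothetical way a task (instructions plus a reward script) and a standardized trajectory log could be represented. This is not Harbor's actual schema or API; it only illustrates the structure the framework is described as recording.

```python
# Hypothetical illustration of a containerized agent task and its trajectory
# log. All names, fields, and values here are made up for illustration and do
# not reflect Harbor's real schema.
import json
from dataclasses import dataclass, asdict

@dataclass
class AgentTask:
    task_id: str
    instructions: str        # what the agent is asked to accomplish
    container_image: str     # environment the task runs in
    reward_script: str       # script that scores the final state of the container

@dataclass
class TrajectoryStep:
    step: int
    tool: str                # e.g. shell, editor, browser
    action: str              # what the agent did
    observation: str         # what came back from the environment

task = AgentTask(
    task_id="build-ml-pipeline-001",
    instructions="Fix the failing build and make the training pipeline run end to end.",
    container_image="example/ml-pipeline:latest",
    reward_script="reward.sh",
)

trajectory = [
    TrajectoryStep(0, "shell", "make build", "error: missing dependency 'numpy'"),
    TrajectoryStep(1, "shell", "pip install numpy", "Successfully installed numpy"),
]

# Logging every step in one machine-readable format is what makes
# cross-agent and cross-environment comparison feasible.
print(json.dumps({"task": asdict(task), "steps": [asdict(s) for s in trajectory]}, indent=2))
```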

Key Takeaways

As AI systems evolve into autonomous agents, evaluation has emerged as the new frontier: scaling model size alone is no longer enough without realistic, rigorous measurement. The Open Benchmarks Grants accelerate progress by providing $3 million in services and partner credits, lowering barriers for researchers to build next-generation benchmarks. Crucially, the initiative emphasizes expert-driven data over generic crowd labeling for high-stakes domains, while enforcing open-source outputs to prevent evaluation silos and enable shared, community-wide standards.

As agentic AI outpaces traditional benchmarks, Snorkel AI’s $3 million Open Benchmarks Grants aim to close the evaluation gap by funding open, expert-driven, real-world benchmarks that make advanced AI systems measurable, reliable, and trustworthy.
