RelBench v1: The Benchmark That Forced Honest Evaluation on Relational Deep Learning

SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | May 25, 2026

The problem with benchmarking ML on relational databases before 2024 was not a shortage of datasets. It was a shortage of comparability. Team A trains on data up to January and tests on February. Team B trains on data up to March and tests on April. Both report "AUROC 0.82 on user churn." The numbers are not comparable. They never were. Worse, researchers who accidentally include future data in their feature engineering (a common mistake with relational join queries) report inflated results that no production deployment can match.

RelBench v1 (Robinson, Ranjan, Hu, Huang, et al., NeurIPS 2024 Datasets Track, snap-stanford/relbench, MIT) is the benchmark that fixes this by enforcing temporal correctness rather than requesting it. It ships seven real databases, thirty tasks, standardized temporal splits, unified evaluators, and defaults that prevent leakage without requiring the user to understand the risk.

The paper's hypothesis, stated explicitly in the accompanying ICML 2024 position paper, is that end-to-end deep learning on relational databases, treating foreign key relationships as edges in a heterogeneous graph and learning representations directly from the relational structure, can produce models competitive with expert-engineered feature pipelines. RelBench v1 is the evaluation framework to test this hypothesis. The GNN reference implementation and the leaderboard are the initial answers.

Scope: RelBench v1 architecture, the seven databases and thirty tasks, temporal leakage prevention, the GNN reference implementation via PyTorch Geometric, and the user study comparing relational DL to manual feature engineering. Not covered: RelBench v2 (covered separately) or detailed GNN architecture variants beyond the reference.

What It Actually Does

RelBench v1 is a benchmark infrastructure for end-to-end machine learning on relational databases. Three installation options:

pip install relbench                 # core data loading, framework-agnostic
pip install "relbench[full]"         # + PyTorch Geometric + PyTorch Frame (quotes stop zsh from globbing the brackets)
pip install "relbench[example]"      # + example scripts for GNN training

Seven databases in v1:

Database	Domain	Key Tables	Approx Size
rel-stack	StackOverflow Q&A	posts, votes, users, badges, tags	Large (millions of posts)
rel-amazon	E-commerce reviews	reviews, products, users, categories	Large
rel-trial	Clinical trials	trials, conditions, interventions, outcomes	Medium
rel-f1	Formula 1 motorsport	races, drivers, results, constructors	Small-medium
rel-hm	H&M fashion retail	articles, customers, transactions	Large
rel-event	Event ticketing	events, venues, users, transactions	Medium
rel-avito	Online classified ads	ads, searches, visits, users	Large

Thirty tasks across these seven databases. Each task is a prediction defined on a specific (database, entity, timestamp) tuple: predict a property of an entity as of a given time, using only data available before that time.
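
To make the "only data available before that time" rule concrete, here is a minimal sketch in plain Python. The review table, its columns, and both helper functions are hypothetical and not part of the RelBench API. The sketch also assumes a strict "before" boundary, which is a convention rather than something the text above specifies. The point is only that a feature for one (entity, timestamp) row changes when the join ignores time.

```python
from datetime import date

# Hypothetical review table: (user_id, review_date). Not a RelBench object.
reviews = [
    ("u1", date(2024, 1, 5)),
    ("u1", date(2024, 1, 20)),
    ("u1", date(2024, 2, 10)),
    ("u2", date(2024, 2, 1)),
]

def count_reviews_before(user_id, cutoff):
    """Leak-free feature: count only reviews strictly before the cutoff."""
    return sum(1 for uid, d in reviews if uid == user_id and d < cutoff)

def count_reviews_all(user_id):
    """Leaky feature: a plain join and group-by that ignores time."""
    return sum(1 for uid, _ in reviews if uid == user_id)

# One task row is an (entity, timestamp) pair; features must respect the timestamp.
seed_time = date(2024, 2, 1)
safe = count_reviews_before("u1", seed_time)
leaky = count_reviews_all("u1")
print(safe, leaky)  # the leaky count includes the review written after seed_time
```

The leaky version is what an unconstrained SQL join produces, which is why the benchmark anchors every task row to a timestamp.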

The Architecture, Unpacked

The pipeline has five stages: data loading, task setup, graph construction, training, and evaluation. Focus on the first two. Their defaults prevent the two most common sources of invalid results in relational ML: temporal leakage (using future data) and label leakage (peeking at test targets). Making incorrect evaluation the path of explicit effort rather than the path of least resistance is the benchmark's central contribution.

The Code, Annotated

Snippet One: Full Pipeline from Dataset to Evaluation

# RelBench v1 complete evaluation pipeline
# Source: relbench.stanford.edu/start/ + snap-stanford/relbench (MIT)

from relbench.datasets import get_dataset
from relbench.tasks import get_task
from relbench.modeling.graph import make_pkey_fkey_graph
import numpy as np

# ── Data loading ──────────────────────────────────────────────────────────────
dataset = get_dataset("rel-amazon", download=True)
# ← download=True: verifies SHA hash on cached data, downloads if missing
# ← RELBENCH_CACHE_DIR env var overrides ~/.cache/relbench default location

# Database access: temporal leakage prevention is the DEFAULT
db = dataset.get_db()
# ← Rows with timestamp > dataset.test_timestamp are EXCLUDED by default
# This is not optional: you must explicitly opt in to include future data
# get_db(upto_test_timestamp=False) if you need full data for analysis

# ── Task setup ────────────────────────────────────────────────────────────────
task = get_task("rel-amazon", "user-churn", download=True)
# ← "user-churn": predict whether each user will return within 90 days
# ← Task handles the target column definition and temporal anchoring

train_table = task.get_table("train")  # labels visible
val_table   = task.get_table("val")    # labels visible

# ← THIS is the trick: test labels are HIDDEN by default
# You cannot accidentally look at test labels during model development
# Standard ML practice, but enforced at the API level rather than by convention
test_table  = task.get_table("test")
print([c for c in test_table.df.columns if "churn" in c.lower()])
# Output: []  ← label column absent

# ── Graph construction ────────────────────────────────────────────────────────
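# NOTE: signature as of relbench 1.x, as best we can tell; check your installed version
from relbench.modeling.utils import get_stype_proposal

col_to_stype_dict = get_stype_proposal(db)   # infer a semantic type per column
data, col_stats_dict = make_pkey_fkey_graph(
    db,
    col_to_stype_dict=col_to_stype_dict,
    # text_embedder_cfg=...,  # needed when tables contain free-text columns
)
# ← Returns the HeteroData graph plus per-column statistics for feature normalization
# ← No domain knowledge required to build this graph
# One node type per table, edges from FK links:
# reviews.user_id → users, reviews.product_id → products, etc.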

# ── Training (reference: HeteroGraphSAGE) ────────────────────────────────────
# ... train your model using data and train_table, validate on val_table ...

# ── Evaluation ───────────────────────────────────────────────────────────────
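# Validate first; touch the test split only once, at the end
val_pred = np.random.rand(len(val_table.df))    # placeholder: replace with model predictions
val_metrics = task.evaluate(val_pred, val_table)
print(val_metrics)
# ← dict mapping metric name → value for the validation split

test_pred = np.random.rand(len(test_table.df))  # placeholder: replace with model predictions
metrics = task.evaluate(test_pred)              # hidden test labels are looked up internally
print(metrics)
# ← Random placeholder scores land near AUROC 0.5; a trained model should beat that
# Same metric computation for every team → valid comparison across papers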

The task.evaluate(test_pred) call is the unification that makes RelBench results reproducible across papers. Before RelBench, two teams could implement AUROC differently (different tie-breaking, different handling of missing labels) and report incomparable numbers. The unified evaluator removes this source of variance.
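
A small sketch shows how much tie handling alone can move AUROC. The `auroc` function and its `tie_credit` parameter are hypothetical illustrations written for this article, not the RelBench evaluator. It computes AUROC as the fraction of (positive, negative) pairs ranked correctly, and the only thing that varies is the credit given to tied pairs.

```python
def auroc(labels, scores, tie_credit=0.5):
    """Pairwise AUROC: share of (positive, negative) pairs ranked correctly.

    tie_credit is the credit awarded when a positive and a negative
    receive the same score (0.5 is the usual convention).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += tie_credit
    return total / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
scores = [0.9, 0.5, 0.5, 0.1]   # one positive and one negative are tied at 0.5

standard = auroc(labels, scores, tie_credit=0.5)
strict = auroc(labels, scores, tie_credit=0.0)
print(standard, strict)  # same predictions, two different "AUROC" values
```

Models that emit coarse or quantized scores produce many ties, so the gap between conventions can be larger on real data than in this four-row example.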

Snippet Two: GNN Reference Implementation and the pkey-fkey Graph Hypothesis

# RelBench v1 GNN reference implementation
# Source: snap-stanford/relbench/examples/gnn_entity.py (MIT)
# Command: python gnn_entity.py --dataset rel-f1 --task driver-position

import torch
from relbench.datasets import get_dataset
from relbench.tasks import get_task
from relbench.modeling.graph import make_pkey_fkey_graph, get_node_train_table_input
from relbench.modeling.nn import HeteroEncoder, HeteroGraphSAGE
from relbench.modeling.utils import get_stype_proposal
# NOTE: call signatures follow relbench 1.x as best we can tell; verify them
# against your installed version. This is an annotated sketch, not the script itself.

# The core hypothesis: pkey-fkey links contain enough relational signal
# that a GNN learning over them can match manual feature engineering
# without any domain-specific join logic

dataset = get_dataset("rel-f1", download=True)
db = dataset.get_db()
task = get_task("rel-f1", "driver-position", download=True)

# rel-f1 graph (what make_pkey_fkey_graph produces):
# Node types: races (1,097), drivers (858), results (25,840),
#             constructors (212), qualifying (9,961)
# Edge types (all from FK references):
#   races → results  (via results.race_id FK)
#   drivers → results (via results.driver_id FK)
#   constructors → results (via results.constructor_id FK)
#   drivers → qualifying (via qualifying.driver_id FK)
#   races → qualifying (via qualifying.race_id FK)

col_to_stype_dict = get_stype_proposal(db)  # infer a semantic type per column
data, col_stats_dict = make_pkey_fkey_graph(
    db,
    col_to_stype_dict=col_to_stype_dict,
    # text_embedder_cfg=...,  # the reference script supplies one for free-text columns
)
# data is a PyTorch Geometric HeteroData object; each node type stores its
# table rows as a tensor frame, not as a ready-made feature matrix

# Per-table encoder: turns the raw columns of each row into a 128-dim vector
encoder = HeteroEncoder(
    channels=128,
    node_to_col_names_dict={nt: data[nt].tf.col_names_dict for nt in data.node_types},
    node_to_col_stats=col_stats_dict,
)

# ← HeteroGraphSAGE: standard heterogeneous GNN
# 2 message-passing layers: aggregates the 2-hop neighborhood in the relational graph
# Hop 1: a driver receives messages from its results rows
# Hop 2: through those results, it also sees their races and constructors
model = HeteroGraphSAGE(
    node_types=list(data.node_types),
    edge_types=list(data.edge_types),
    channels=128,           # embedding dimension
    num_layers=2,           # GNN depth = max join depth explored
    # ← 2 layers means: race features reach results, and results features
    # reach drivers. This captures: "what races has this driver competed in?"
    # without writing a single SQL join.
)

# Get training node indices, seed times, and labels for the driver-position task
train_input = get_node_train_table_input(table=task.get_table("train"), task=task)
node_type, node_indices = train_input.nodes  # (entity table name, row indices)

# Forward pass over the full graph (simplification for readability).
# The reference script instead samples a time-respecting subgraph per seed row,
# so a full-graph pass like this one would let later rows influence earlier labels.
x_dict = encoder(data.tf_dict)
embeddings = model(x_dict, data.edge_index_dict)
# embeddings[node_type]: shape [num_drivers, 128] → one embedding per driver

# Prediction: linear head on driver embeddings, one output per task-table row
head = torch.nn.Linear(128, 1)  # hypothetical regression head for this sketch
pred_position = head(embeddings[node_type][node_indices]).squeeze(-1)

# GNN baseline result (from paper and leaderboard):
# MAE ≈ 3.82 finishing positions (average error)
# Mean baseline (always predict 7th): MAE ≈ 5.1
# ← 25% improvement from relational structure alone, zero feature engineering

The 2-layer GNN performs a 2-hop neighborhood aggregation in the relational graph, which corresponds to a chain of two foreign-key joins spanning up to three tables (for example drivers, results, and races). Deeper GNNs correspond to deeper join chains. The model learns which paths through the foreign key graph carry predictive signal, replacing the manual decision about which features to engineer.
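
The correspondence between hops and joins can be sketched without any GNN library. The toy tables, the `wet` column, and the fixed multiply-then-average rule below are all hypothetical; a real GNN layer learns its message and aggregation functions instead of using a fixed rule. Under those assumptions, two rounds of neighbor aggregation give the same number as a hand-written join followed by a group-by.

```python
# Hypothetical toy tables (not the real rel-f1 schema).
races = {"r1": {"wet": 1.0}, "r2": {"wet": 0.0}}
results = [
    {"driver_id": "d1", "race_id": "r1", "position": 3.0},
    {"driver_id": "d1", "race_id": "r2", "position": 5.0},
    {"driver_id": "d2", "race_id": "r1", "position": 10.0},
]

def mean(values):
    return sum(values) / len(values)

# Hop 1 (results <- races): each result pulls a feature across its race_id FK.
hop1 = [
    {"driver_id": r["driver_id"], "msg": r["position"] * races[r["race_id"]]["wet"]}
    for r in results
]

# Hop 2 (drivers <- results): each driver averages the messages of its results.
def two_hop_feature(driver_id):
    return mean([m["msg"] for m in hop1 if m["driver_id"] == driver_id])

# Hand-written equivalent: JOIN results to races, then GROUP BY driver.
def joined_feature(driver_id):
    joined = [
        (r["position"], races[r["race_id"]]["wet"])
        for r in results
        if r["driver_id"] == driver_id
    ]
    return mean([position * wet for position, wet in joined])

print(two_hop_feature("d1"), joined_feature("d1"))
```

Adding a third hop would pull in whatever the races table references, which is the sense in which GNN depth bounds the join depth explored.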

It In Action: End-to-End Worked Example

Task: driver-position on rel-f1 (predict each driver's average finishing position over the races in the next two months)

Step 1: Environment

pip install "relbench[example]"
git clone https://github.com/snap-stanford/relbench
cd relbench/examples

Step 2: Run the reference GNN

python gnn_entity.py --dataset rel-f1 --task driver-position --epochs 20
# Downloads rel-f1 to ~/.cache/relbench (~300 MB)
# Constructs HeteroData graph: 5 node types, 8 edge types, ~85,000 edges
# Trains HeteroGraphSAGE for 20 epochs on a single GPU

Step 3: Training output

Epoch 01: Train MAE=5.89, Val MAE=5.43
Epoch 05: Train MAE=4.21, Val MAE=4.08
Epoch 10: Train MAE=3.94, Val MAE=3.91
Epoch 20: Train MAE=3.72, Val MAE=3.82  ← converged

Test evaluation:
  task.evaluate(test_pred) → {"MAE": 3.82, "RMSE": 5.14}

Step 4: What the numbers mean

Mean baseline (always predict 7th place of 20):   MAE = 5.10
Random baseline:                                   MAE = 7.4
GNN reference (2-layer HeteroGraphSAGE):           MAE = 3.82
Best leaderboard (as of mid-2025):                 MAE ≈ 3.1

Improvement of GNN over mean: (5.10 - 3.82) / 5.10 = 25%
← This 25% improvement comes entirely from relational graph structure
← Zero domain-specific features engineered by a human
← The GNN learned: race difficulty, constructor performance,
   driver historical form — all from FK traversals alone

Step 5: User study result from the v1 paper

Manual feature engineering pipeline for rel-f1:
  Domain expert joins: ~3 days of work
  Feature set: ~40 manually selected features
  Model: XGBoost on engineered features
  Result: MAE ≈ 3.4

RelBench GNN reference:
  Setup time: ~2 hours (pip install + run example)
  Feature set: none (pkey-fkey graph only)
  Model: HeteroGraphSAGE
  Result: MAE ≈ 3.82

Gap: 0.42 MAE points (GNN within 12% of expert-engineered baseline)
Effort: ~1/10th the human time
← The paper's user study: competitive accuracy at significantly lower effort
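
The two percentages quoted in Steps 4 and 5 use different denominators, which is easy to miss. The short check below recomputes them from the MAE values stated above; the variable names are ours.

```python
# MAE values as quoted in Steps 4 and 5 (lower is better).
mean_mae = 5.10
gnn_mae = 3.82
expert_mae = 3.40

# Improvement of the GNN over the mean baseline, relative to the baseline.
improvement_over_mean = (mean_mae - gnn_mae) / mean_mae

# Remaining gap between the GNN and the expert pipeline, relative to the expert.
gap_to_expert = (gnn_mae - expert_mae) / expert_mae

print(f"{improvement_over_mean:.1%} better than mean, {gap_to_expert:.1%} behind expert")
```

Both figures round to the 25% and 12% used in the text.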

Why This Design Works, and What It Trades Away

The temporal leakage prevention via defaults is the correct design pattern. Alternative designs (documentation that warns users to implement their own temporal splits, or code that requires explicit opt-in for temporal correctness) produce inconsistent results across teams. Researchers under time pressure skip the careful implementation. RelBench's design makes the careless path the correct one: get_db() and get_table("test") work correctly by default. Incorrect usage requires explicit effort.
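
The safe-by-default pattern is simple to express in code. The classes below are hypothetical stand-ins written for illustration, not the RelBench implementation. `upto_test_timestamp` is the flag named earlier in this article, while `mask_labels` is an invented parameter name. The sketch shows only the shape of the design: the zero-argument call is the correct one, and leakage requires an explicit argument.

```python
from dataclasses import dataclass

@dataclass
class ToyDataset:
    """Hypothetical stand-in for a benchmark dataset."""
    rows: list            # list of (timestamp, payload) tuples
    test_timestamp: int

    def get_db(self, upto_test_timestamp=True):
        # Safe default: rows after the test cutoff are dropped unless the
        # caller explicitly opts out.
        if upto_test_timestamp:
            return [r for r in self.rows if r[0] <= self.test_timestamp]
        return list(self.rows)

@dataclass
class ToyTask:
    """Hypothetical stand-in for a benchmark task."""
    tables: dict          # split name -> list of row dicts with a "label" key

    def get_table(self, split, mask_labels=None):
        # Safe default: labels are hidden for the test split only.
        if mask_labels is None:
            mask_labels = split == "test"
        rows = self.tables[split]
        if mask_labels:
            return [{k: v for k, v in row.items() if k != "label"} for row in rows]
        return [dict(row) for row in rows]

dataset = ToyDataset(rows=[(1, "a"), (5, "b"), (9, "c")], test_timestamp=5)
task = ToyTask(tables={
    "train": [{"id": 1, "label": 0}],
    "test": [{"id": 2, "label": 1}],
})

default_db = dataset.get_db()            # future row (9, "c") is excluded
default_test = task.get_table("test")    # no "label" key
print(default_db, default_test)
```

A caller who wants the full data has to write `get_db(upto_test_timestamp=False)`, which leaves a visible trace in the code for reviewers.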

The pkey-fkey to graph mapping is the mechanically correct translation of relational structure into a GNN-compatible format. Every foreign key IS a directed relationship: "this review was written about this product." The GNN treating FK references as edges is not an approximation. It is a lossless representation of the relational schema's explicit structure. What the GNN cannot capture (implicit business logic not encoded in FK relationships) is exactly what manual feature engineering adds. This makes the benchmark a clean test of whether explicit relational structure is sufficient.
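
The mapping from foreign keys to edges is mechanical enough to sketch in a few lines. The function below is a hypothetical illustration in plain Python, not `make_pkey_fkey_graph` itself. The table layout and the edge-type naming are assumptions made for the example. It also adds a reverse edge type for each foreign key so that information can flow in both directions.

```python
def fkeys_to_edges(tables, fkeys):
    """Turn foreign keys into edge lists.

    tables: table name -> list of row dicts
    fkeys:  list of (child_table, fk_column, parent_table, pk_column)
    Returns: (src_type, relation, dst_type) -> list of (src_row, dst_row) index pairs
    """
    edges = {}
    for child, fk_col, parent, pk_col in fkeys:
        # Map each primary-key value to the row index of its parent row.
        pk_index = {row[pk_col]: i for i, row in enumerate(tables[parent])}
        pairs = [
            (i, pk_index[row[fk_col]])
            for i, row in enumerate(tables[child])
            if row[fk_col] in pk_index  # skip dangling references
        ]
        edges[(child, fk_col, parent)] = pairs
        edges[(parent, "rev_" + fk_col, child)] = [(p, c) for c, p in pairs]
    return edges

tables = {
    "users": [{"id": 10}, {"id": 11}],
    "products": [{"id": "p1"}],
    "reviews": [
        {"user_id": 11, "product_id": "p1"},
        {"user_id": 10, "product_id": "p1"},
        {"user_id": 11, "product_id": "p1"},
    ],
}
fkeys = [
    ("reviews", "user_id", "users", "id"),
    ("reviews", "product_id", "products", "id"),
]

edges = fkeys_to_edges(tables, fkeys)

# Losslessness check: the FK column can be rebuilt from the edge list alone.
rebuilt = [tables["users"][dst]["id"] for _, dst in edges[("reviews", "user_id", "users")]]
print(rebuilt)
```

Nothing in the function knows what a review or a user is, which is the sense in which graph construction needs no domain knowledge.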

The framework-agnostic design is the correct choice for a benchmark that wants to compare methods, not architectures. By providing standardized data loading and evaluation but not mandating a model, RelBench allows GNNs, gradient boosted trees, transformers, and linear models to be compared on equal footing.
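
A minimal sketch of what "equal footing" means in practice: the evaluator only needs predictions, so anything callable can be scored. The `evaluate` function, the two toy predictors, and the `grid` column are hypothetical and exist only for this example.

```python
def mae(preds, targets):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)

def evaluate(model, rows, targets):
    """Shared evaluator: the only requirement on `model` is that it maps a row to a number."""
    return {"mae": mae([model(row) for row in rows], targets)}

train_targets = [2.0, 4.0, 9.0]
test_rows = [{"grid": 1}, {"grid": 5}, {"grid": 12}]
test_targets = [1.0, 6.0, 11.0]

mean_value = sum(train_targets) / len(train_targets)

def mean_baseline(row):
    # Ignores the input entirely and predicts the training mean.
    return mean_value

def grid_rule(row):
    # Hand-written rule: predict the finishing position equals the grid slot.
    return float(row["grid"])

baseline_metrics = evaluate(mean_baseline, test_rows, test_targets)
rule_metrics = evaluate(grid_rule, test_rows, test_targets)
print(baseline_metrics, rule_metrics)
```

Because both predictors go through the same `evaluate` call on the same rows, the two numbers are directly comparable regardless of how each model was built.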

What RelBench v1 trades away:

Seven databases is not enough for statistical confidence in cross-domain conclusions. Results on rel-f1 may reflect Formula 1-specific patterns rather than general relational learning capabilities. Seven databases produce seven data points; conclusions require more.

The reference GNN is not optimized. The paper explicitly uses a simple HeteroGraphSAGE baseline. This demonstrates feasibility but undersells what GNN methods can achieve. The leaderboard shows substantially better results from more sophisticated approaches.

Technical Moats

The temporal split implementation. The split is defined by val_timestamp and test_timestamp, which are database-specific and set to ensure meaningful held-out sets. Computing the correct timestamps for each database requires understanding the temporal distribution of the data: a test set that covers only one month of a slowly-evolving database produces artificially easy evaluation. The timestamps were set by domain experts familiar with each database's temporal dynamics.
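
The split rule itself is easy to state in code. The sketch below is a hypothetical illustration, not the RelBench implementation: the `time` field and the half-open interval convention are assumptions made for the example. Rows are assigned to a split by timestamp alone.

```python
from datetime import date

def temporal_split(rows, val_timestamp, test_timestamp):
    """Assign each labeled row to a split using only its timestamp."""
    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        if row["time"] < val_timestamp:
            splits["train"].append(row)
        elif row["time"] < test_timestamp:
            splits["val"].append(row)
        else:
            splits["test"].append(row)
    return splits

# Six hypothetical labeled rows, one per month from January to June 2024.
rows = [{"id": i, "time": date(2024, month, 1)} for i, month in enumerate(range(1, 7))]

splits = temporal_split(
    rows,
    val_timestamp=date(2024, 4, 1),
    test_timestamp=date(2024, 6, 1),
)
sizes = {name: len(part) for name, part in splits.items()}
print(sizes)
```

Because both cutoffs are fixed per database, every team evaluates on the same held-out window, which removes the "Team A versus Team B" mismatch described at the top of this article.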

The evaluator standardization. Metric implementations differ subtly. AUROC with different tie-breaking produces different numbers. NDCG at different k values is incomparable. RelBench's unified evaluator locks in the implementation, ensuring that every paper citing RelBench results is comparing the same metric with the same computation.

Insights

Insight One: RelBench's most important contribution is not the seven databases. It is demonstrating that a benchmark for relational ML can be built at all with enforced evaluation standards. Before RelBench, the community assumed that the heterogeneity of relational databases made standardization impossible. RelBench disproved this by showing that temporal splits, leakage prevention, and metric standardization can be implemented in a framework-agnostic way that works across healthcare, e-commerce, motorsport, and social networks.

Insight Two: The GNN reference implementation's 25% improvement over the mean baseline from pkey-fkey structure alone is not proof that relational deep learning works. It is proof that the task is non-trivial and that some signal exists in the relational structure. The more important comparison, GNN vs. expert-engineered features, shows the GNN within 12% at 1/10th the effort. This effort ratio is the actual argument for the approach, not the raw accuracy.

Takeaway

The v1 paper's user study measured engineer effort alongside model quality, making RelBench one of the few ML benchmarks that explicitly evaluates cost-adjusted accuracy. The finding: relational deep learning produces models within 12% of expert-engineered accuracy at approximately 1/10th the engineering time. This two-axis evaluation is more useful for production deployment decisions than a single-axis accuracy comparison, yet it is buried in the paper's appendix.

If the ML community adopted cost-adjusted accuracy as a standard evaluation axis alongside raw performance, the case for automated relational learning would be substantially stronger. RelBench v1 makes this comparison possible. Almost no coverage of the paper highlights it.

TL;DR For Engineers

RelBench v1 (arXiv:2407.20060, NeurIPS 2024 Datasets Track, MIT, snap-stanford/relbench) is a benchmark for end-to-end ML on relational databases: 7 databases, 30 tasks, temporal splits enforced by default, unified evaluators, HF Spaces leaderboard.
Critical defaults: get_db() excludes future rows, get_table("test") hides labels. Both require explicit opt-in to override. These defaults make temporal correctness the path of least resistance.
Graph construction: make_pkey_fkey_graph(db) converts every FK reference into a directed edge in a PyTorch Geometric HeteroData object. No feature engineering required to build the graph. GNN learning over this graph is equivalent to end-to-end learned multi-table joins.
GNN baseline on rel-f1 driver-position: MAE 3.82 vs mean baseline 5.1 (25% improvement). Expert-engineered XGBoost: MAE 3.4. GNN is within 12% of expert accuracy at ~1/10th the engineering effort.
Framework-agnostic: any model (XGBoost, MLP, GNN, Transformer) can be evaluated using the standardized task tables and the task.evaluate(test_pred) unified metric call.

The Default Is the Contribution

RelBench v1's lasting contribution is not any specific database or task. It is the precedent that relational ML benchmarks can be standardized: temporal splits enforced by code, test labels hidden until evaluation, metrics computed identically for all teams. The GNN results show the hypothesis is worth testing. The infrastructure ensures the tests are valid. Future work builds on both.

References

RelBench: A Benchmark for Deep Learning on Relational Databases, arXiv:2407.20060, NeurIPS 2024 Datasets Track
Position: Relational Deep Learning — Graph Representation Learning on Relational Databases, ICML 2024
snap-stanford/relbench GitHub (MIT)
RelBench website + quickstart
RelBench HuggingFace Leaderboard
PyTorch Geometric — HeteroData and HeteroGraphSAGE used in reference implementation
Why Tree-Based Models Still Outperform Deep Learning on Tabular Data, Grinsztajn et al. 2022, arXiv:2207.08815 — the context that makes RelBench's user study result meaningful

