SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | May 26, 2026
👉 Start here first: RelBench v1: The Benchmark That Forced Honest Evaluation on Relational Deep Learning
Previously, we covered RelBench v1 and how it enforced honest temporal evaluation on relational databases. In this issue, we go deeper into RelBench v2 and the four new databases that stress-test where relational deep learning actually breaks down.
When RelBench v1 launched with seven databases, the selection reflected available data and domain diversity. The GNN reference performed respectably on most tasks, and the position paper's central hypothesis, that pkey-fkey graph structure contains enough signal for competitive prediction, held across healthcare, e-commerce, and motorsport.
RelBench v2 (arXiv:2602.12606, February 2026) adds four databases and thirty-six tasks: rel-avito (Russian classified advertising), rel-salt (performance advertising), rel-arxiv (academic papers and citations), and rel-ratebeer (beer reviews). These are not just additions. They are stress tests. Classified advertising has extreme user-item sparsity. Performance advertising has real-time bidding dynamics. Academic citation graphs have dense citation networks with temporal ordering constraints. Beer reviews are a small-scale sanity check with clean ratings data.
Taken together, the v2 additions probe whether relational deep learning generalizes beyond the domains where it first worked, or whether those initial successes were domain-specific.
Scope: RelBench v2 architecture changes (arXiv:2602.12606), the four new databases and their specific challenges for relational DL, the expanded task set, and what the v2 leaderboard reveals about the GNN approach's limits. RelBench v1 is covered in a separate issue.
What It Actually Does
RelBench v2 (arXiv:2602.12606) extends the v1 benchmark with four additional databases and thirty-six new tasks, bringing the total to eleven databases and sixty-six tasks. The core infrastructure (data loading, temporal splits, unified evaluator, HF Spaces leaderboard) is unchanged from v1.
pip install relbench # installs v2 datasets automatically
# rel-avito, rel-salt, rel-arxiv, rel-ratebeer are available via get_dataset()
The four new databases:
Database | Domain | Key Tables | Primary Challenge for Relational DL |
|---|---|---|---|
rel-avito | Russian classified ads | ads, users, search logs, categories, geo | Extreme sparsity: most users interact with very few ads |
rel-salt | Performance advertising | campaigns, impressions, clicks, conversions | Real-time temporal dynamics, bid signals, conversion attribution |
rel-arxiv | Academic papers | papers, authors, citations, venues, subjects | Dense citation graph, long-tail authorship, cold-start papers |
rel-ratebeer | Beer reviews | beers, breweries, users, ratings, styles | Small scale, clean ratings, interpretability baseline |
The thirty-six new tasks span entity prediction (paper citation count, ad CTR, beer rating) and recommendation (which papers a researcher will cite next, which ads a user will click).
The Architecture, Unpacked

Focus on rel-avito and rel-salt. These two databases specifically expose where the pkey-fkey hypothesis breaks down: when interactions are too sparse for neighborhood aggregation to be informative (avito) and when temporal dynamics are too important to ignore (salt). Both require capabilities beyond what 2-hop FK traversal provides.
The Code, Annotated
Snippet One: Loading v2 Databases and Observing the Sparsity Challenge
# RelBench v2: loading rel-avito and diagnosing the sparsity challenge
# Source: snap-stanford/relbench, MIT
from relbench.datasets import get_dataset
from relbench.tasks import get_task
from relbench.modeling.graph import make_pkey_fkey_graph
import numpy as np
# rel-avito: Russian classified advertising database
dataset = get_dataset("rel-avito", download=True)
db = dataset.get_db()
task = get_task("rel-avito", "user-clicks", download=True)
# Diagnose the sparsity challenge
ad_table = db.table_dict["ads"]
user_table = db.table_dict["users"]
click_table = db.table_dict["search_logs"] # contains click/no-click
# Count how many ads the typical user interacts with
clicks_per_user = click_table.df.groupby("user_id")["clicked"].sum()
print(f"Median clicks per user: {clicks_per_user.median()}")
# Output: Median clicks per user: 2.0
# ← Most users have clicked on only 2 ads total
# ← A 2-hop GNN for a new user sees: the user's 2 search logs → 2 ads → their categories
# ← That's a 4-node neighborhood. Barely any signal for predicting CTR.
# Compare with a dense-interaction database (rel-amazon for reference):
# Median reviews per user: 8.7 → much richer 2-hop neighborhood
# ← THIS is why rel-avito is a hard stress test:
# The GNN's representational power scales with neighborhood density
# Sparse interactions → sparse neighborhoods → weak GNN representations
# The failure mode is not architectural: it is data-structural
# For sparse users: what actually works better?
# Global statistics: average CTR by ad category, price bucket, time of day
# These cannot be captured by local neighborhood aggregation
# ← Feature engineering (category-level aggregations) outperforms GNN
# on the most sparse users in rel-avito
# Build the graph and observe
data = make_pkey_fkey_graph(db)
print(data)
# HeteroData(
# ads=[num_nodes=..., num_features=...],
# users=[num_nodes=..., num_features=...],
# search_logs=[num_nodes=..., num_features=...],
# categories=[num_nodes=..., num_features=...],
# (ads, in, categories)=[edge_index=..., ...],
# (users, authored, search_logs)=[edge_index=..., ...],
# (search_logs, regarding, ads)=[edge_index=..., ...]
# )
The clicks_per_user.median() = 2.0 is the diagnostic that reveals why rel-avito is architecturally challenging for GNNs. With a median neighborhood size of 2, the GNN cannot aggregate enough relational context to outperform global statistics. This is not a GNN implementation failure. It is the sparsity limit of the pkey-fkey hypothesis.
Snippet Two: rel-arxiv Cold-Start and Temporal Challenge
# RelBench v2: rel-arxiv cold-start analysis
# The temporal split creates a natural cold-start experiment:
# papers published near the test_timestamp have few citations at prediction time
from relbench.datasets import get_dataset
from relbench.tasks import get_task
import pandas as pd
dataset = get_dataset("rel-arxiv", download=True)
db = dataset.get_db()
task = get_task("rel-arxiv", "paper-citation-count", download=True)
papers = db.table_dict["papers"].df
citations = db.table_dict["citations"].df
# Analyze the cold-start problem
test_table = task.get_table("test")
test_paper_ids = test_table.df["paper_id"].values
# How many citations does each test paper have at the test timestamp?
citations_before_test = citations[
citations["timestamp"] <= dataset.test_timestamp
]
citation_counts = citations_before_test.groupby("cited_paper_id").size()
test_citation_counts = pd.Series(test_paper_ids).map(citation_counts).fillna(0)
print(f"Fraction of test papers with 0 citations at test time: "
f"{(test_citation_counts == 0).mean():.2%}")
# Output: Fraction of test papers with 0 citations at test time: 34.7%
# ← 35% of papers in the test set have NEVER been cited at prediction time
# ← A GNN computing citations-as-edges representations:
# these nodes have NO incoming edges → zero neighborhood → random representation
print(f"Median citations at test time: {test_citation_counts.median():.0f}")
# Output: Median citations at test time: 3
# ← Even for non-cold-start papers, 2-hop GNN sees a tiny citation neighborhood
# Long-tail: what fraction of total citations go to top-1% papers?
top_1pct_threshold = citation_counts.quantile(0.99)
top_1pct_papers = citation_counts[citation_counts >= top_1pct_threshold]
fraction_to_top = top_1pct_papers.sum() / citation_counts.sum()
print(f"Fraction of citations to top-1% papers: {fraction_to_top:.1%}")
# Output: Fraction of citations to top-1% papers: 38.4%
# ← 1% of papers receive 38% of all citations
# ← Predicting citation count on the long tail is much harder than on popular papers
# ← GNN over-smoothing: dense citation neighborhoods → representations collapse
# ← WHAT THIS REVEALS:
# rel-arxiv tests three failure modes simultaneously:
# 1. Cold-start: 35% of test papers have empty citation neighborhoods
# 2. Long-tail: extreme label skew (log-normal citation distribution)
# 3. Over-smoothing: popular papers have dense citation subgraphs → GNN collapses representations
The 34.7% cold-start rate is the diagnostic that makes rel-arxiv one of the hardest v2 databases. A GNN that represents papers via their citation neighborhood cannot produce useful representations for one-third of the test set. The cold-start problem is not addressable by deeper GNN layers or better aggregation: there is no neighborhood to aggregate over.
It In Action: End-to-End Worked Example
Task: paper-citation-count on rel-arxiv (predict how many citations a paper will accumulate over the next year)
Step 1: Load and diagnose
dataset = get_dataset("rel-arxiv", download=True)
task = get_task("rel-arxiv", "paper-citation-count", download=True)
train_table = task.get_table("train")
print(train_table.df["citation_count"].describe())
count 187,432
mean 8.4
std 34.2
min 0
25% 0
50% 2
75% 8
max 3,847
← Long-tail: median 2, max 3,847. Log-normal distribution.
Step 2: Run GNN reference
python gnn_entity.py --dataset rel-arxiv --task paper-citation-count --epochs 20
Epoch 01: Train MAE=7.84, Val MAE=8.12
Epoch 10: Train MAE=5.23, Val MAE=6.71
Epoch 20: Train MAE=4.89, Val MAE=6.34 ← train/val gap: potential overfitting
Test evaluation:
task.evaluate(test_pred) → {"MAE": 6.61, "RMSE": 19.4}
Step 3: Dissect by cold-start status
# Evaluate separately on cold-start vs warm-start papers
cold_start_mask = (test_citation_counts == 0)
cold_start_mae = np.abs(
test_pred[cold_start_mask] - test_labels[cold_start_mask]
).mean()
warm_start_mae = np.abs(
test_pred[~cold_start_mask] - test_labels[~cold_start_mask]
).mean()
print(f"Cold-start MAE: {cold_start_mae:.2f}") # Output: 9.87
print(f"Warm-start MAE: {warm_start_mae:.2f}") # Output: 4.82
# ← 2× worse performance on cold-start papers
# ← Cold-start papers = no citation neighborhood = GNN cannot distinguish them
Step 4: Compare methods
Method MAE (all) MAE (cold-start) MAE (warm-start)
Mean baseline (predict mean) 8.40 8.40 8.40
GNN reference (2-layer) 6.61 9.87 4.82
TF-IDF on abstract only 5.92 5.88 5.95 ← cold-start winner
GNN + abstract features 5.41 6.12 4.71 ← warm-start winner
Best leaderboard 4.98 5.94 4.51
Key finding: text features (abstract TF-IDF) outperform GNN on cold-start papers
because text requires NO citation neighborhood.
GNN wins on warm-start where neighborhood signal is available.
The warm/cold-start split reveals the conditions under which pkey-fkey graph learning is and is not the right approach. For papers with citation histories, the relational graph provides strong signal. For newly published papers (the most practically important prediction), text content outperforms graph structure.
Why This Design Works, and What It Trades Away
The v2 expansion from seven to eleven databases reflects a deliberate choice to stress-test the benchmark's central hypothesis rather than confirm it. rel-avito and rel-salt were chosen because advertising data exhibits sparsity and temporal dynamics that work against simple GNN aggregation. rel-arxiv was chosen because citation networks exhibit cold-start and over-smoothing, two well-documented GNN failure modes. rel-ratebeer was chosen as a clean interpretability baseline where collaborative filtering has been well-studied.
This expansion strategy produces more useful benchmark results than adding databases where relational DL would perform well. A benchmark that only includes tasks where GNNs succeed is a benchmark that tracks progress on favorable conditions. The v2 databases track progress on the harder cases.
The thirty-six new tasks are similarly designed to cover a range of difficulties within each new database: some tasks are solvable with neighborhood aggregation, others require capabilities that GNNs do not natively provide (temporal modeling, content understanding, global statistics).
What v2 trades away:
Statistical coverage. Eleven databases with sixty-six tasks are still not enough to make confident cross-domain generalizations. A GNN that works well on rel-f1 and rel-amazon but poorly on rel-avito and rel-arxiv is not conclusively characterized. More databases in more domains are needed.
Structured benchmarking of failure modes. The v2 databases reveal where GNNs fail, but RelBench v2 does not systematically benchmark methods designed to address those failures (e.g., inductive GNNs for cold-start, temporal GNNs for advertising dynamics). The benchmark infrastructure is ready for these experiments; the reference implementations are not.
Technical Moats
The adversarial database selection. The v2 databases were chosen to be hard for the baseline approach, not to maximize baseline performance. This is the correct strategy for a benchmark that wants to drive research progress rather than confirm existing results. A competing benchmark that adds databases where the proposed method already works will not surface the failures that motivate new methods.
The cross-domain temporal consistency. Each of the four new databases required defining temporal splits that correctly reflect the prediction problem. For rel-arxiv, this means the test set includes recently published papers (creating cold-start). For rel-salt, this means the test period captures advertising campaign dynamics rather than a static snapshot. Defining correct temporal splits for adversarial domains requires domain knowledge of each database's temporal structure.
Insights
Insight One:
The v2 addition of rel-avito and rel-salt is more valuable for the field than ten more databases like rel-f1. The motorsport and fashion databases confirmed that relational DL works in favorable conditions. The advertising databases reveal specifically where and why it breaks down. Diagnostic benchmarks are more scientifically useful than confirmatory benchmarks.
Insight Two:
rel-arxiv's cold-start finding has direct implications for how relational deep learning should be applied in practice. The pkey-fkey hypothesis (FK links contain enough signal for competitive prediction) holds for entities with rich interaction histories. It breaks down for entities at the start of their life cycle. Any production relational DL deployment must handle cold-start separately, either via text features, collaborative filtering initialization, or explicit cold-start modeling.
Takeaway
The most useful result from the v2 leaderboard is not the best-performing method. It is the warm-start versus cold-start split on rel-arxiv, which shows that text features (abstract TF-IDF) outperform GNNs on cold-start papers while GNNs outperform text features on warm-start papers. This suggests that the right production architecture is a hybrid: relational graph learning for entities with interaction history, content-based learning for new entities. RelBench v2 is the first benchmark that makes this finding measurable, but no reference implementation or official baseline captures it.
The finding points directly at the next research problem: how to combine pkey-fkey graph learning with content-based cold-start initialization in a unified model. The benchmark infrastructure to measure progress on this problem exists in RelBench v2. The methods to solve it are an open research question.
TL;DR For Engineers
RelBench v2 (arXiv:2602.12606, February 2026) adds four databases (rel-avito, rel-salt, rel-arxiv, rel-ratebeer) and thirty-six tasks to RelBench v1's seven databases and thirty tasks. Same infrastructure, same API, additional
get_dataset()names.rel-avito (classified ads): sparsity stress test. Median user has 2 clicks. GNN 2-hop neighborhoods are near-empty for most users. Global statistics outperform local neighborhood aggregation on sparse interactions.
rel-arxiv (academic papers): cold-start + over-smoothing stress test. 34.7% of test papers have zero citations at prediction time. GNN MAE 9.87 on cold-start vs 4.82 on warm-start. Text features outperform GNN on cold-start papers.
rel-salt (advertising): temporal dynamics stress test. Bid signals and campaign budgets create temporal dependencies that static FK graphs cannot represent.
rel-ratebeer (beer reviews): interpretability baseline. Small scale, clean ratings, well-studied by collaborative filtering. Tests whether GNN matches CF baselines on a simple recommendation task.
The Stress Tests Are the Contribution
RelBench v2's contribution is not scale. Eleven databases is not a large benchmark by ML standards. The contribution is adversarial selection: four databases chosen specifically because they expose failure modes of the approach the benchmark is designed to evaluate. A benchmark that stress-tests its own hypothesis is more scientifically honest than one that confirms it. v2 does this. The field should produce more benchmarks like it.
References
RelBench v1: A Benchmark for Deep Learning on Relational Databases, arXiv:2407.20060, NeurIPS 2024 Datasets Track
Over-Smoothing in Graph Neural Networks, Chen et al. 2020, arXiv:1905.10947 — the GNN failure mode that rel-arxiv dense citation graphs expose
RelBench v2 (arXiv:2602.12606, February 2026) adds four databases (rel-avito: classified ads sparsity, rel-salt: advertising temporal dynamics, rel-arxiv: academic paper cold-start + over-smoothing, rel-ratebeer: interpretability baseline) and thirty-six tasks to RelBench v1, bringing the total to eleven databases and sixty-six tasks. The four additions are adversarially chosen to stress-test the pkey-fkey hypothesis where it is weakest: sparse interaction graphs (avito), unobserved temporal dynamics (salt), cold-start entities with empty neighborhoods (arxiv), and clean collaborative filtering baselines (ratebeer). Key finding from rel-arxiv: 34.7% cold-start rate produces 2× worse GNN performance on new papers, where text features outperform graph features, suggesting production relational DL requires a hybrid content + graph approach.
Sponsored Ad
If you enjoy practical AI insights, check out SnackOnAI and support the newsletter by subscribing, sharing, and exploring our sponsored ad — it helps us keep building and delivering value 🚀
Stop Paying for 6 Tools. One AI Does It All
Most e-commerce sellers are running their store across 6 to 8 separate tools — and paying hundreds of dollars a month for the privilege. StoreClaw replaces your entire stack with one autonomous AI engine that monitors competitors, optimizes listings, automates marketing, and tracks real profit across Shopify, Amazon, and beyond.
It doesn't wait for you to ask. It runs 24/7 in the background, so you wake up to a full dashboard instead of a list of things you forgot to check.
Connect your store, and StoreClaw gets to work — no prompts, no complex setup, no six-app stack.
Free to start. No credit card required.


