SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | May 2, 2026
The ML community has spent decades optimizing models for independent data: images, text, tabular rows in isolation. Relational databases, the actual data storage format for the majority of production ML applications, contain something fundamentally different: entities related to each other through primary-foreign key links, timestamped rows encoding temporal dynamics, and multi-modal column types mixing numerical, categorical, timestamp, and free text in the same table.
The standard approach to this data is manual feature engineering: write SQL joins to flatten the relational structure into a single table, hand-code temporal aggregations (sum of last 30-day purchases, count of events in the trailing 7 days), and feed the result to XGBoost or LightGBM. This pipeline works, takes weeks to build, encodes data leakage risks at every temporal boundary, and discards most of the predictive signal in the relational structure.
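To make the leakage risk concrete, here is a minimal sketch of one such hand-coded temporal feature in pandas. The table and column names (`reviews`, `price`, `timestamp`) are illustrative, not from any RelBench schema; the point is the two timestamp bounds, either of which is easy to get wrong.

```python
import pandas as pd

# Hypothetical reviews table; column names are illustrative.
reviews = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "timestamp": pd.to_datetime(
        ["2023-01-01", "2023-01-20", "2023-02-10", "2023-01-05"]),
    "price": [10.0, 25.0, 5.0, 40.0],
})

def trailing_30d_spend(reviews: pd.DataFrame, user_id: int,
                       seed_time: pd.Timestamp) -> float:
    """One hand-coded feature: spend in the 30 days before seed_time.

    Every feature like this must filter on timestamp < seed_time;
    forgetting the upper bound silently leaks future information.
    """
    window = reviews[
        (reviews["user_id"] == user_id)
        & (reviews["timestamp"] < seed_time)
        & (reviews["timestamp"] >= seed_time - pd.Timedelta(days=30))
    ]
    return float(window["price"].sum())

seed = pd.Timestamp("2023-02-01")
print(trailing_30d_spend(reviews, 1, seed))  # 25.0: only the Jan 20 row is in the window
```

A real pipeline repeats this pattern dozens of times across tables, and every repetition is another chance to drop the `< seed_time` bound.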
Relational Deep Learning (RDL), introduced by Fey et al. (ICML 2024, arXiv:2401.12174), proposes an alternative: convert the database directly into a heterogeneous temporal graph, extract node features from raw table rows using deep tabular models, and train a GNN end-to-end. RelBench (Robinson et al., NeurIPS 2024, arXiv:2407.20060) is the benchmark infrastructure for this approach: databases, tasks, standardized splits, an open-source RDL implementation, and a leaderboard.
This newsletter dissects RelBench as both a benchmark engineering artifact and a research result: what the heterogeneous temporal graph construction algorithm does, how temporal-aware subgraph sampling prevents time leakage, what the ResNet + Heterogeneous GraphSAGE pipeline looks like in code, and what the user study result reveals about where the bottleneck in relational ML actually sits.
Scope: RelBench v1 (NeurIPS 2024) and v2 (arXiv:2602.12606), the RDL implementation architecture, temporal split design, task types, and comparison to traditional feature engineering baselines. Not covered: GNN architecture research beyond GraphSAGE, or LLM-based relational approaches beyond the STaRK comparison.
What It Actually Does
RelBench (SNAP Lab, Stanford) is an open benchmark for training and evaluating deep learning models on relational databases, accepted to NeurIPS 2024 Datasets and Benchmarks. The repo has 352 stars and 81 forks. RelBench v2 (arXiv:2602.12606) extends the benchmark to substantially larger databases.
The benchmark provides:
Relational databases (v1): Seven real-world databases spanning e-commerce (Amazon Reviews), social networks (Stack Exchange), healthcare (MIMIC-IV), finance, sports, and more. Database sizes range from thousands to millions of rows across multiple tables.
Predictive tasks: Thirty tasks organized into three types:
Entity Classification: predict a categorical property of an entity at a given time (e.g., user churn at timestamp T)
Entity Regression: predict a numerical property of an entity at a given time (e.g., customer spend in the next 30 days)
Link Recommendation: predict which items a user will interact with next
All tasks are defined with a seed_time parameter, the temporal boundary separating available training data from prediction targets
Temporal splits: The key engineering decision. Train/validation/test splits are determined by time, not random row sampling. This prevents the most common form of data leakage in relational ML: using future information when constructing features.
Open-source RDL implementation: The reference implementation converts a RelBench database into a heterogeneous temporal graph using PyTorch Geometric, extracts initial node embeddings via PyTorch Frame's ResNet tabular model, and trains Heterogeneous GraphSAGE for node and link prediction tasks.
User study result: An experienced data scientist manually engineered features for each task (standard SQL aggregations, temporal joins, domain-specific features). RDL achieved comparable or better performance on all tasks while reducing human effort by more than an order of magnitude (over 95% reduction in model development time).
The Architecture, Unpacked

Focus on the temporal-aware subgraph sampling. This is where RelBench prevents the most common failure mode in relational ML: data leakage through temporal joins that accidentally include future information. The constraint that every edge in the sampled subgraph must have timestamp < seed_time is what makes the GNN's training signal valid.
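The per-hop constraint can be sketched in a few lines. This toy sampler is illustrative only (the names and shapes are not the actual RelBench/PyG loader API): candidate edges are filtered by timestamp before any are drawn, so future edges can never enter the subgraph.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_temporal_neighbors(edge_dst, edge_time, seed_time, num_neighbors):
    """One hop of temporal-aware neighbor sampling (toy version).

    Only edges strictly before seed_time are candidates; at most
    num_neighbors of them are drawn. Applying this filter at every hop
    is what keeps future information out of the sampled subgraph.
    """
    valid = np.flatnonzero(edge_time < seed_time)  # the leakage guard
    if len(valid) > num_neighbors:
        valid = rng.choice(valid, size=num_neighbors, replace=False)
    return edge_dst[valid]

# Four edges with Unix timestamps; two lie after the seed_time.
edge_dst = np.array([10, 11, 12, 13])
edge_time = np.array([100, 200, 900, 950])
neighbors = sample_temporal_neighbors(edge_dst, edge_time,
                                      seed_time=500, num_neighbors=8)
print(neighbors)  # [10 11]: the two future edges are never sampled
```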
The Code, Annotated
Snippet One: Database, Task, and Temporal Split Loading
import relbench
from relbench.datasets import get_dataset
from relbench.tasks import get_task
# ← Load a RelBench dataset by name. Downloads and caches automatically.
# Available: "rel-amazon", "rel-stackex", "rel-hm", "rel-trial", "rel-avito",
# "rel-event", "rel-f1"
dataset = get_dataset("rel-amazon", download=True)
# ← The dataset object represents the raw relational database.
# Tables are accessible as pandas DataFrames with typed columns.
print(dataset.get_db()) # Database: tables, column types, primary keys, foreign keys
# ← Load a specific task on this dataset.
# Tasks are defined independently from databases: different prediction targets
# on the same database. This separation enables multi-task learning research.
task = get_task("rel-amazon", "user-churn", download=True)
# ← THIS is the key design decision: temporal split by time, not random.
# train_table, val_table, test_table are seed entity tables with timestamps.
# The split boundaries are fixed, not configurable, to prevent leakage via
# careful split selection (a common form of benchmark gaming).
train_table = task.get_table("train") # rows with seed_time in train window
val_table = task.get_table("val") # rows with seed_time in val window
test_table = task.get_table("test") # rows with seed_time in test window
# train_table columns:
# - entity_id: the primary key of the seed entity (e.g., user_id)
# - seed_time: the temporal boundary for this prediction instance
# - label: the ground truth target (churn = 1 or 0, for the window after seed_time)
# ← Why multiple rows per entity? Because the same user can have multiple
# seed_times (different prediction windows). The model sees the user's history
# up to seed_time, predicts behavior in the following window.
print(train_table.head())
# user_id seed_time label
# 42 2023-01-15 0
# 42 2023-04-15 1 ← same user, different temporal snapshot
# 107 2023-01-15 0
The temporal split is the most important design choice in RelBench. Using seed_time as the split boundary, with fixed windows, prevents both the random-split leakage problem (where future rows of the same entity appear in training) and the split-gaming problem (where practitioners try different splits to find the one that flatters their model).
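The difference between the two split styles is easy to demonstrate on a toy task table (entity IDs and dates below are illustrative). A fixed temporal boundary guarantees every training row precedes every test row, which a random row split does not.

```python
import pandas as pd

# Illustrative task table: one row per (entity, seed_time) pair.
table = pd.DataFrame({
    "user_id": [42, 42, 42, 107, 107],
    "seed_time": pd.to_datetime(
        ["2023-01-15", "2023-04-15", "2023-07-15",
         "2023-01-15", "2023-07-15"]),
})

# Temporal split: one fixed boundary in time, applied to every entity alike.
boundary = pd.Timestamp("2023-06-01")
train = table[table["seed_time"] < boundary]
test = table[table["seed_time"] >= boundary]

# Every training row precedes every test row, so no feature computed from
# the training window can contain information about the test window.
# A random split of the same table would put user 42's July snapshot in
# train and its January snapshot in test with nonzero probability.
assert train["seed_time"].max() < test["seed_time"].min()
print(len(train), len(test))  # 3 2
```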
Snippet Two: Heterogeneous Temporal Graph Construction and GNN Training
import torch
from torch_geometric.data import HeteroData
from relbench.modeling.graph import make_pkey_fkey_graph
from relbench.modeling.nn import HeteroEncoder, HeteroGraphSAGE, HeteroTemporalEncoder
from relbench.modeling.utils import to_unix_time

# ← Convert the relational database to a heterogeneous temporal graph.
# make_pkey_fkey_graph does the critical work:
#   - Creates one node type per table
#   - Creates one edge type per primary-foreign key relationship
#   - Attaches timestamps to all edges from the source table's time column
#   - Does NOT aggregate or flatten: raw rows become nodes, raw PFK links become edges
data, col_stats_dict = make_pkey_fkey_graph(
    db=dataset.get_db(),
    col_to_stype_dict=task.col_to_stype_dict,  # column types per table
    # ← col_to_stype_dict declares which columns are numerical, categorical,
    # text, etc. This drives the multi-modal encoding in PyTorch Frame.
    cache_dir=f"cache/{dataset.name}/graph",
)
# data is a PyTorch Geometric HeteroData object:
#   data['user'] has a feature tensor with shape [n_users, feature_dim]
#   data[('user', 'writes', 'review')] has edge_index and edge_time tensors
#   All timestamps are in Unix time (seconds since epoch)

# ← HeteroEncoder: PyTorch Frame-based multi-modal column encoder.
# One encoder per table/node type. Handles mixed column types natively.
# Numerical → linear, categorical → embedding, text → frozen sentence transformer
encoder = HeteroEncoder(
    channels=128,                      # output embedding dim
    node_to_col_names_dict=...,        # per-node-type column names
    node_to_col_stats=col_stats_dict,  # per-column statistics for normalization
)

# ← HeteroTemporalEncoder: encodes relative time differences between
# neighboring nodes' timestamps and the seed_time.
# This gives the GNN temporal context: how long ago was this interaction?
temporal_encoder = HeteroTemporalEncoder(
    node_types=list(data.node_types),
    channels=128,
)

# ← Heterogeneous GraphSAGE: separate linear + aggregation per edge type.
# Sum aggregation (not mean) is important here:
#   - A user with 100 reviews should aggregate more signal than one with 1 review
#   - Mean aggregation would equalize them (wrong for most tasks)
gnn = HeteroGraphSAGE(
    node_types=list(data.node_types),
    edge_types=list(data.edge_types),
    channels=128,
    aggr="sum",     # ← sum, not mean: preserves degree information
    num_layers=2,   # 2-hop neighborhood
)

# Training loop (simplified; head, loader, and criterion are assumed defined)
optimizer = torch.optim.Adam(
    list(encoder.parameters())
    + list(temporal_encoder.parameters())
    + list(gnn.parameters())
    + list(head.parameters()),
    lr=0.001,
)

for batch in loader:
    # batch contains: seed node indices, seed_times, sampled subgraph, labels
    seed_time = batch['seed_time']

    # Step 1: encode raw node features → initial embeddings
    x_dict = encoder(batch.x_dict, batch.node_type_dict)

    # ← Step 2: add temporal embeddings based on time since seed_time.
    # Nodes from long ago get different temporal encoding than recent nodes.
    # THIS is what makes the GNN time-aware: not just graph structure but recency.
    x_dict = temporal_encoder(x_dict, batch.time_dict, seed_time)

    # Step 3: run the GNN over the temporally-constrained subgraph.
    # The subgraph was sampled with timestamp < seed_time (guaranteed by the loader).
    x_dict = gnn(x_dict, batch.edge_index_dict, batch.edge_attr_dict)

    # Step 4: task head (classification / regression / link scoring)
    pred = head(x_dict[task.entity_table], batch.seed_indices)
    loss = criterion(pred, batch.labels)

    optimizer.zero_grad()
    loss.backward()  # end-to-end: gradients flow through gnn → temporal_encoder → encoder
    optimizer.step()
The temporal_encoder is the underappreciated component. A standard GNN cannot distinguish between a product review written yesterday and one written five years ago. The temporal encoder adds a learned positional embedding based on time-since-seed, giving the GNN recency information that is critical for any task involving behavioral prediction.
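The idea behind the temporal encoding can be shown with a fixed sinusoidal variant. This is an illustrative sketch, not the learned `HeteroTemporalEncoder`: it maps time-since-seed (in days) to a vector, so a one-day-old review and a five-year-old review get distinct representations.

```python
import torch

def time_since_seed_encoding(node_time, seed_time, channels=8):
    """Sinusoidal encoding of (seed_time - node_time) in days.

    Illustrative only: RelBench's HeteroTemporalEncoder learns its
    encoding; this fixed variant just shows how recency can be turned
    into a feature vector the GNN can consume.
    """
    delta_days = (seed_time - node_time).float() / 86400.0  # seconds → days
    freqs = 1.0 / (10.0 ** torch.arange(channels // 2).float())
    angles = delta_days[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

# A review from yesterday vs one from ~5 years ago: distinct codes.
seed = torch.tensor(1_700_000_000)
node_time = torch.tensor([
    1_700_000_000 - 86_400,            # 1 day before seed_time
    1_700_000_000 - 5 * 365 * 86_400,  # ~5 years before seed_time
])
enc = time_since_seed_encoding(node_time, seed)
print(enc.shape)  # torch.Size([2, 8])
```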
RDL in Action: End-to-End Worked Example
Scenario: Train on rel-amazon user-churn task. Compare RDL against the LightGBM baseline that uses only single-table features (no relational joins).
Input: Amazon Reviews database. User churn task: predict whether a user will have zero reviews in the next 90 days, given their history up to seed_time.
Step 1: Data loading and graph construction
dataset = get_dataset("rel-amazon", download=True)
task = get_task("rel-amazon", "user-churn", download=True)
# Database tables:
# - product: 48,190 products (columns: product_id, category, price, brand, ...)
# - review: 1,398,119 reviews (columns: review_id, user_id, product_id,
# rating, text, timestamp)
# - user: 147,202 users (columns: user_id, registration_timestamp, ...)
# Task statistics:
# - Train: 19,450 (user, seed_time) pairs
# - Val: 5,208 pairs
# - Test: 3,206 pairs
# - Positive rate (churned users): ~40%
data, col_stats_dict = make_pkey_fkey_graph(dataset.get_db(), ...)
# Graph: 147,202 user nodes + 48,190 product nodes + 1,398,119 review nodes
# Edges: 2 edge types (user→review, review→product) × 2 directions = 4 edge types
Step 2: Training RDL (Encoder + GraphSAGE + Head)
Architecture: ResNet(128) → HeteroGraphSAGE(128, 2 layers, sum) → MLP head
Parameters: ~2.1M total
Training: 50 epochs, Adam lr=0.001, batch_size=512
Training time (RTX 4090): ~18 minutes
Best val AUROC: 0.847
Step 3: Baseline comparison
LightGBM on single-table user features (no joins):
Features: user registration age, [no behavioral features without joins]
Val AUROC: 0.681 ← misses all behavioral signal from reviews table
LightGBM with manually engineered features (data scientist baseline, ~2 weeks):
Features: 47 hand-coded temporal aggregations across 3 tables
Val AUROC: 0.842 ← comparable to RDL, but required weeks of work
RDL (Heterogeneous GraphSAGE, end-to-end):
Val AUROC: 0.847 ← better than manual engineering
Development time: < 2 hours (load dataset, configure pipeline, train)
Lines of custom feature engineering code: 0
Test AUROC: 0.839 (RDL) vs 0.836 (manual features)
Step 4: What the GNN actually learned
Ablation study (from the RelBench paper):
No GNN (only tabular encoder on user table): AUROC 0.689
GNN with 1 hop (user → review only): AUROC 0.813
GNN with 2 hops (user → review → product): AUROC 0.847
← The product table is the critical second hop:
Users who reviewed the same product categories as other churned users
are more likely to churn. The GNN extracts this signal automatically.
A feature engineer would have to manually code: "count of reviews in
categories where churn rate > X" to capture this. RDL gets it for free.
Real benchmark numbers across all rel-amazon tasks:
Task               Metric  RDL    Manual FE  Single-Table
user-churn         AUROC   0.847  0.842      0.681
user-ltv (90-day)  MAE     14.23  14.71      19.88
item-churn         AUROC   0.761  0.748      0.623
item-ltv           MAE     8.44   9.12       13.21
Why This Design Works, and What It Trades Away
The core insight behind RDL and RelBench is that primary-foreign key links are not just schema metadata. They are edges in a graph that encode predictive signal. A user who bought product A also bought products B and C; products B and C share category D; users in category D have high churn rates. This chain of relational signal is invisible to any model that flattens the database into a single table before training. The GNN, operating on the full relational graph, can propagate this signal across multiple hops automatically.
The temporal-aware subgraph sampling is the engineering contribution that makes this possible in practice, not just in theory. Without temporal constraints, the GNN leaks future information: it sees edges (reviews written after seed_time) that would not be available at prediction time. The loader-level enforcement of timestamp < seed_time, at every hop, is what makes the benchmark's results trustworthy. Most relational ML papers before RelBench did not enforce this rigorously.
The use of PyTorch Frame for tabular encoding is the correct choice for handling multi-modal columns. A user table might have registration_date (timestamp), country (categorical), email_domain (text), and lifetime_value (numerical). These four column types require four different encoding strategies. PyTorch Frame handles this with a unified interface, and its ResNet tabular model provides a strong initial embedding without requiring manual feature normalization.
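The per-column-type dispatch can be sketched in plain PyTorch. This is not the PyTorch Frame API; the module and dimension choices below are illustrative. Each semantic type gets its own encoder, and the per-column outputs are concatenated into a single row embedding.

```python
import torch
import torch.nn as nn

class TableEncoder(nn.Module):
    """Minimal sketch of multi-modal column encoding (illustrative,
    not the PyTorch Frame API): one encoder per semantic type,
    outputs concatenated into a single row embedding."""

    def __init__(self, num_categories: int, text_dim: int, out_dim: int):
        super().__init__()
        self.numerical = nn.Linear(1, out_dim)        # e.g. lifetime_value
        self.categorical = nn.Embedding(num_categories, out_dim)  # e.g. country
        self.text = nn.Linear(text_dim, out_dim)      # frozen sentence embedding in

    def forward(self, num_col, cat_col, text_col):
        parts = [
            self.numerical(num_col.unsqueeze(-1)),  # [batch, out_dim]
            self.categorical(cat_col),              # [batch, out_dim]
            self.text(text_col),                    # [batch, out_dim]
        ]
        return torch.cat(parts, dim=-1)             # [batch, 3 * out_dim]

enc = TableEncoder(num_categories=50, text_dim=384, out_dim=16)
row = enc(torch.randn(4),                # numerical column, 4 rows
          torch.randint(0, 50, (4,)),    # categorical column
          torch.randn(4, 384))           # precomputed text embeddings
print(row.shape)  # torch.Size([4, 48])
```

PyTorch Frame generalizes this dispatch across arbitrary column sets via its stype declarations, which is what `col_to_stype_dict` feeds in Snippet Two.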
What RelBench trades away:
Inference speed. A trained GNN on a full relational database requires subgraph sampling at inference time, which touches multiple tables. LightGBM scoring on pre-computed features is milliseconds; RDL inference over the graph is seconds to minutes for large databases. LIGHTRDL (arXiv:2504.04934) addresses this with a hybrid approach achieving up to 526x inference speedup by caching tabular model outputs, at the cost of some accuracy.
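The caching idea behind such hybrid approaches can be sketched in a few lines. The function names below are illustrative, not LIGHTRDL's API: the expensive deep tabular encoding is computed once per node offline, so the online path only aggregates cached vectors.

```python
import numpy as np

cache: dict[int, np.ndarray] = {}

def embed_row(node_id: int) -> np.ndarray:
    """Stand-in for the expensive deep tabular encoder (hypothetical)."""
    rng = np.random.default_rng(node_id)
    return rng.standard_normal(8)

def cached_embedding(node_id: int) -> np.ndarray:
    if node_id not in cache:
        cache[node_id] = embed_row(node_id)  # pay the encoding cost once, offline
    return cache[node_id]

def score_user(user_id: int, neighbor_ids: list[int]) -> float:
    """Online path: aggregate cached neighbor embeddings.

    No deep encoder calls at query time; only the cheap aggregation
    (here a sum + mean as a toy scoring head) runs per request.
    """
    agg = sum(cached_embedding(n) for n in neighbor_ids)
    return float(agg.mean())

score_user(1, [10, 11, 12])
print(len(cache))  # 3 distinct neighbor embeddings now cached
```

The accuracy cost comes from freezing those cached embeddings: they no longer reflect rows added after the cache was built.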
Scale beyond the benchmark. RelBench v1 covers databases with millions of rows. Industrial relational databases have hundreds of millions to billions of rows. The temporal graph construction and subgraph sampling become bottlenecks at that scale. RelBench v2 (arXiv:2602.12606) scales to larger databases but the gap between benchmark scale and production scale remains.
Model interpretability. A decision tree over engineered features is inspectable. A two-hop Heterogeneous GraphSAGE is not. Enterprise ML teams with compliance requirements may not be able to adopt RDL regardless of accuracy improvements.
Technical Moats
The temporal split infrastructure. Most relational ML papers before RelBench used random train/val/test splits, which leak future temporal information through shared entity features. RelBench's fixed temporal splits with seed_time-level granularity are the correct design for any behavioral prediction task, and replicating this correctly requires understanding where time leakage can enter the pipeline (at graph construction, at sampling, at label computation). The code to do this correctly is non-trivial and is not available anywhere else as a standardized open-source implementation.
The multi-task, multi-database leaderboard. STaRK (arXiv:2404.13207) benchmarks LLM retrieval on textual and relational knowledge bases, but covers a different problem class (retrieval, not predictive modeling). RelBench's 30 predictive tasks across 7 databases, with standardized evaluation and a live leaderboard, provide a comprehensive evaluation surface that no competing benchmark offers for this specific problem.
The user study evidence. Most benchmark papers do not include a rigorous comparison against the "expert baseline" of what a professional data scientist would produce given the same task. RelBench does. The >95% reduction in development time result (from weeks to hours) is the most compelling economic argument for RDL adoption, and it comes from an actual controlled study, not a thought experiment.
Insights
Insight One: RelBench is not a GNN benchmark. It is a benchmark for replacing feature engineering with end-to-end learning, and most GNN researchers are not its target audience.
The community reception to RelBench has largely treated it as "GNNs on databases," which frames it as a graph ML research problem. The actual research question is: can end-to-end deep learning replace the data scientist feature engineering loop for enterprise ML on relational databases? The GNN is the mechanism, not the point. The user study comparison against a professional data scientist, not against other GNN architectures, is the most important experiment in the paper. Teams working on graph ML benchmarks are not the primary audience. Teams working on enterprise ML pipelines that currently involve weeks of SQL aggregation are.
Insight Two: The single-table LightGBM baseline is the most honest baseline in any relational ML paper, and RelBench is one of the few benchmarks that includes it prominently.
A common benchmark design error in graph ML is to compare against other GNN architectures, all of which use the same relational structure. This answers "which GNN architecture is better" rather than "is the relational structure useful at all." RelBench explicitly includes the single-table LightGBM baseline (using only features from the seed entity's own table, no joins), which shows the information gain from the relational structure. The consistent 15-30% AUROC improvement from using relational neighbors is the evidence that the GNN is learning something real, not just overfitting to a complex model on the same features.
Takeaway
The most important finding in RelBench is not which GNN architecture wins. It is that the data scientist spent two weeks building features that a 2-hour RDL pipeline matched or beat. The bottleneck in enterprise relational ML is not model quality. It is feature engineering throughput.
The user study used an experienced data scientist who knew the datasets and understood the prediction tasks. That data scientist wrote 47 aggregation features using SQL and pandas over approximately two weeks. The RDL pipeline, configured in under two hours with zero custom feature engineering code, matched or exceeded those features on every task. This result has significant implications that the ML research community has not fully absorbed: the human labor cost of feature engineering in relational ML is not a minor friction. It is the primary bottleneck for most enterprise teams. A system that eliminates it while maintaining accuracy is not a marginal improvement. It changes the economics of building ML on structured data.
TL;DR For Engineers
RelBench (NeurIPS 2024) is a benchmark for end-to-end deep learning on relational databases, covering 7 databases and 30 tasks. The core method (Relational Deep Learning, RDL) converts a relational database to a heterogeneous temporal graph, encodes raw table rows with ResNet via PyTorch Frame, and trains Heterogeneous GraphSAGE via PyTorch Geometric.
Temporal-aware subgraph sampling (timestamp < seed_time at every hop) is the critical implementation detail that prevents time leakage. Standard graph sampling without this constraint produces invalid training signal for temporal tasks.
User study result: RDL matches or beats a professional data scientist's manually engineered features across all tasks while reducing human effort by >95% (weeks to hours). The bottleneck is feature engineering throughput, not model capacity.
Single-table LightGBM vs. RDL gap is 15-30% AUROC across rel-amazon tasks, showing the relational structure contributes substantial predictive signal.
The main RDL limitation for production deployment: inference requires live subgraph sampling, making it seconds-to-minutes per query. LIGHTRDL (arXiv:2504.04934) achieves 526x inference speedup with hybrid caching at some accuracy cost.
Manual Feature Engineering Is the Bottleneck. RelBench Proves It.
RelBench makes a precise claim: end-to-end deep learning on relational databases outperforms manually engineered features at a fraction of the development cost. The user study provides the evidence. The 30-task leaderboard provides the reproducibility. The open-source implementation provides the starting point.
The ML research community should stop treating this as a GNN paper and start treating it as an enterprise ML infrastructure paper. The question it answers, "can we automate the most expensive part of building ML on relational data," has immediate practical relevance for any team running production models on a database. The answer, based on RelBench's evidence, is yes, with documented tradeoffs on inference speed that the research community is actively addressing.
References
RelBench GitHub Repository, 352 stars, NeurIPS 2024 Datasets and Benchmarks
RelBench: A Benchmark for Deep Learning on Relational Databases, arXiv:2407.20060, Robinson, Ranjan, Hu, Huang, Han, et al., NeurIPS 2024
Position: Relational Deep Learning, arXiv:2401.12174, Fey, Hu, Huang, Lenssen, et al., ICML 2024
STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases, arXiv:2404.13207, Wu, Tang, Zhao, et al., 2024
LIGHTRDL: Boosting Relational Deep Learning with Pretrained Tabular Models, arXiv:2504.04934, 526x inference speedup on RelBench
PyTorch Geometric, Fey and Lenssen, 2019
PyTorch Frame, Hu et al., 2024, the tabular encoding library
Sponsored Ad
If you enjoy practical AI insights, check out SnackOnAI and support the newsletter by subscribing, sharing, and exploring our sponsored ad — it helps us keep building and delivering value 🚀
Your agent needs more than 2 projects
You prompt. The agent builds. Then it asks for a database.
Ghost is postgres made for this. Spin one up in seconds. Fork it like a branch. Delete it when you're done. Pay nothing when it's idle.
Your agent gets full SQL, MCP support, and as many databases as it needs. No dashboards. No provisioning. No forgotten dev databases draining your card at month end.
Build a weekend app. Fork the schema three different ways. Throw two of them out. Ghost doesn't care. The next prompt can spin up a fresh one.
You're already vibe-coding the app. Stop wiring up the backend.
Unlimited databases. Unlimited forks. 100 compute hours a month. 1 TB of storage. Free.