Kunlun: Why Meta's Ads Models Were Wasting 83% of Their GPU, and How They Fixed It

SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | May 19, 2026

The scaling law hypothesis for LLMs is now industry consensus: more compute, more data, more parameters, better model. The relationship is a predictable power law. You can budget compute and forecast quality improvement. This is enormously valuable for capacity planning.

Recommendation systems are the ML workload that drives most of the revenue at Google, Meta, TikTok, and Amazon. They are also the workload where scaling laws have historically not held. You spend more compute, the model gets better, but the relationship is not predictable. Sometimes it plateaus. Sometimes it exhibits diminishing returns. The root cause, according to Kunlun (arXiv:2602.10016, Meta, February 2026), is that the models were wasting most of their compute budget before it ever reached useful calculation.

The indictment is Model FLOPs Utilization (MFU), the share of a GPU's peak FLOPs that a model actually uses. Prior recommendation models reach only 3-15% MFU, while language models running on the same hardware achieve 40-60%. The recommendation models spend the vast majority of their GPU cycles on memory-bound operations, irregular tensor shapes, and sequential dependencies that leave the tensor cores idle. You cannot derive a clean scaling law on a compute budget that is at least 85% waste.
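
To make these percentages concrete, here is a minimal sketch of the arithmetic behind MFU and the idle share it implies. The function names are illustrative and not taken from the paper; the MFU values are the ones quoted in this article.

```python
def mfu(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs Utilization: useful model FLOPs per second over hardware peak."""
    if peak_flops_per_s <= 0:
        raise ValueError("peak_flops_per_s must be positive")
    return achieved_flops_per_s / peak_flops_per_s


def idle_fraction(mfu_value: float) -> float:
    """Share of peak compute that is not spent on useful model math."""
    return 1.0 - mfu_value


# MFU figures quoted in this article (not independent measurements).
quoted = [
    ("RecSys, low end", 0.03),
    ("RecSys, high end", 0.15),
    ("Kunlun baseline", 0.17),
    ("Kunlun", 0.37),
    ("LLM, low end", 0.40),
    ("LLM, high end", 0.60),
]
for label, value in quoted:
    print(f"{label:18s} MFU={value:.0%}  idle={idle_fraction(value):.0%}")
```

The 83% in the headline is the idle share at 17% MFU; at 3-15% MFU the idle share is 85-97%.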

Kunlun's answer is model-efficiency codesign: redesign the fundamental building blocks so that the FLOPs you allocate actually get used, then allocate them intelligently across components. The result: MFU from 17% to 37% on NVIDIA B200 GPUs, 2× scaling efficiency over prior best approaches, and a predictable power-law relationship between compute and NE (Normalized Entropy, the evaluation metric for CTR prediction). Kunlun is now deployed in major Meta Ads models.

This newsletter dissects Kunlun as a systems engineering document: what GDPA replaces and why the LLM attention kernel was the wrong starting point, how HSP addresses the pooling bottleneck, what CompSkip and Event-level Personalization mean for compute allocation, and why the PyTorch GDPA kernel blog post (March 2026) is the most technically precise public artifact in the entire Kunlun ecosystem.

Scope: Kunlun architecture (arXiv:2602.10016), GDPA kernel design (PyTorch blog, March 2026), the Wukong context, CompSkip, Event-level Personalization, and the scaling law results. Not covered: Meta GEM (Generative Ads Model) beyond its relationship to Kunlun, or the full HSTU architecture.

What It Actually Does

Kunlun is a unified architecture for large-scale click-through rate (CTR) prediction models, deployed in Meta Ads. CTR prediction, the task of estimating the probability that a user clicks on a given ad, is the backbone of Meta's revenue. Getting this right has direct P&L implications.
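
Since NE is the metric every later result is stated in, a short sketch of how it is usually computed may help. This assumes the commonly used definition (average log loss divided by the entropy of the background click rate); the article does not reproduce the paper's exact formula, so treat the function below as an illustration rather than Meta's implementation.

```python
import math


def normalized_entropy(labels, preds, eps=1e-12):
    """NE = average log loss / entropy of the background CTR. Lower is better.

    labels: iterable of 0/1 click labels
    preds:  iterable of predicted click probabilities
    """
    labels = list(labels)
    preds = list(preds)
    if not labels or len(labels) != len(preds):
        raise ValueError("labels and preds must be non-empty and equal length")

    # Average binary cross-entropy of the model's predictions.
    logloss = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)
        logloss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    logloss /= len(labels)

    # Entropy of always predicting the empirical click rate.
    ctr = min(max(sum(labels) / len(labels), eps), 1.0 - eps)
    base = -(ctr * math.log(ctr) + (1.0 - ctr) * math.log(1.0 - ctr))
    return logloss / base


labels = [1, 0, 0, 0]
print(normalized_entropy(labels, [0.25] * 4))            # predicting the base rate
print(normalized_entropy(labels, [0.7, 0.1, 0.1, 0.1]))  # a more informative model
```

A model that only predicts the background click rate scores NE = 1, so gains are read as how far below 1 the model gets.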

The two-bottleneck diagnosis:

Inefficient modules (3-15% MFU): Recommendation models process heterogeneous feature spaces: sparse ID features (user embeddings, item embeddings), dense continuous features, and sequential user behavior histories. The resulting tensor shapes are irregular, embedding dimensions are small, and sequences are short and jagged. LLM attention kernels assume long, dense, uniform sequences. These assumptions fail on RecSys workloads, leaving most compute idle.
Inefficient resource allocation: Naively scaling all components equally produces diminishing returns. Different event types (click, purchase, video view) have different information densities. Different layers have different computational requirements. Uniform scaling wastes budget on low-value components.
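
One way to see why jagged sequences hurt a kernel that expects uniform shapes is to count how much of a padded batch is padding. The sketch below is an illustration only; the history lengths are hypothetical and not drawn from Meta's traffic.

```python
def padding_waste(lengths):
    """Fraction of token slots that are padding when a jagged batch
    is padded to the length of its longest sequence."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    padded_slots = max(lengths) * len(lengths)
    if padded_slots == 0:
        return 0.0
    return 1.0 - sum(lengths) / padded_slots


# Hypothetical per-user history lengths in one batch.
history_lengths = [3, 64, 8, 1, 17, 128, 5, 2]
print(f"padded slots wasted: {padding_waste(history_lengths):.1%}")
```

Every padded slot still costs memory traffic and tensor-core cycles in a dense kernel, which is the kind of overhead the first bottleneck describes.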

Kunlun's five technical contributions:

Component	Level	What It Does
GDPA (Generalized Dot-Product Attention)	Low-level	Replaces softmax with custom activations (GELU, SiLU) for unified interaction pattern
HSP (Hierarchical Seed Pooling)	Low-level	Efficient sequence summarization via hierarchical pooling
Sliding Window Attention	Low-level	Linear-complexity attention for sequence modeling
CompSkip	High-level	Layer-wise component selection: not every layer needs every module
Event-level Personalization	High-level	Importance-based resource allocation across heterogeneous event types

Deployment: Kunlun is deployed in Meta's GEM (Generative Ads Model), Meta's largest RecSys training foundation model.

The Architecture, Unpacked

The defining feature of the architecture is its bidirectional information flow. Unlike prior architectures that processed sequence and non-sequence features in separate streams, Kunlun maintains continuous exchange between the two paths within each layer. Sequence summaries inform context-aware interaction; context features guide what the sequence model attends to. This is the architectural choice that enables depth-wise scaling to produce consistent NE gains.

The Code, Annotated

Snippet One: GDPA Kernel Design and Why the LLM Kernel Was Wrong

# GDPA: Generalized Dot-Product Attention
# Source: PyTorch blog (March 2026), Meta FAIR, based on Kunlun + InterFormer
# github.com/facebookresearch/ads_model_kernel_library/blob/main/gdpa/README.md

# Standard dot-product attention (LLM style):
# Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) · V
# This assumes:
#   - Long, dense, uniform sequences (LLM tokens)
#   - softmax for probability normalization
#   - K and V have similar lengths to Q

# GDPA (Generalized Dot-Product Attention):
# Attention(Q, K, V) = activation(QK^T / sqrt(d_k)) · V
# where activation ∈ {GELU, SiLU, identity, ...}
#
# Why replace softmax?
# RecSys interaction patterns (self-attention, PMA, PFFN) do NOT need
# probability normalization. Forcing softmax:
#   - Adds unnecessary normalization overhead
#   - Forces a warp specialization design optimized for softmax correction
#   - The softmax correction stage requires dedicated warps
#
# By removing softmax, GDPA can:
# ← ELIMINATE the correction warp stage (saves 4 warps per kernel)
# ← FREE 16 registers per activation warp (available for compute)
# ← Unify self-attention, PMA, and PFFN into ONE kernel

import torch
import triton
import triton.language as tl

# GDPA forward pass (simplified from production implementation)
# Key design choices annotated below
@triton.jit
def gdpa_fwd_kernel(
    Q_ptr, K_ptr, V_ptr, Out_ptr,
    seq_len_q, seq_len_kv,  # ← SHORT and ASYMMETRIC in RecSys (e.g., 64 vs 8)
    batch_size, num_heads, d_model,
    # Jagged inputs: each sequence has a different actual length
    seq_start_q, seq_start_kv,  # ← THIS is the challenge: no assumption of uniform length
    activation: tl.constexpr,   # GELU, SILU, or IDENTITY
    BLOCK_Q: tl.constexpr,
    BLOCK_KV: tl.constexpr,     # Outer-loop scheduling vs inner-loop (see below)
):
    """
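    Schematic GDPA forward kernel for RecSys training (simplified from the
    production kernel; the `...` in loads and stores elides masks and strides).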

    Key differences from FlashAttention (designed for LLMs):

    1. NO SOFTMAX CORRECTION STAGE:
       Standard FA has: compute(scores) → softmax(scores) → correct(running_max)
       GDPA has:       compute(scores) → activation(scores)
       ← Eliminates the correction warp group entirely
       ← Saves 4 warps, gains 16 registers per activation warp

    2. OUTER-LOOP SCHEDULING (not inner-loop):
       FA assumes long K/V: the inner K/V loop amortizes setup cost across many iters.
       RecSys K/V is SHORT (8-128 tokens): the inner loop runs 1-2 times.
       Pipeline setup cost dominates → poor occupancy.
       ← Fix: schedule outer loop over batch/head dimensions
         to amortize setup cost even when K/V is short.

    3. JAGGED INPUT HANDLING:
       LLM sequences are padded to uniform length → simple indexing.
       RecSys sequences have variable actual lengths → pointer arithmetic.
       ← software-level tile scheduling accounts for actual length per sample.
    """

    # Warp specialization: simplified (no correction warps)
    pid = tl.program_id(axis=0)  # maps to (batch, head, query_block)

    # Load Q block
    q_block = tl.load(Q_ptr + pid * BLOCK_Q * d_model, ...)

    # Inner loop over K/V blocks (short in RecSys → few iterations)
    acc = tl.zeros([BLOCK_Q, d_model], dtype=tl.float32)
    for kv_start in range(0, seq_len_kv, BLOCK_KV):
        k_block = tl.load(K_ptr + kv_start * d_model, ...)
        v_block = tl.load(V_ptr + kv_start * d_model, ...)

        # Attention scores
        # (the 1/sqrt(d_k) scaling from the formula above is omitted for brevity)
        scores = tl.dot(q_block, tl.trans(k_block))  # [BLOCK_Q, BLOCK_KV]

        # ← GDPA: activation instead of softmax
        # (gelu / silu stand for element-wise device helpers not shown here)
        if activation == "GELU":
            scores = gelu(scores)     # No running max needed, no correction
        elif activation == "SILU":
            scores = silu(scores)
        # else: identity (just the raw dot product)

        acc += tl.dot(scores, v_block)

    tl.store(Out_ptr + pid * BLOCK_Q * d_model, acc, ...)

Removing the softmax correction stage is why this matters: eliminating those 4 warps frees 16 additional registers per activation warp, which lets the kernel sustain higher tensor core utilization on the short-sequence workloads that dominate RecSys. The kernel achieves 97% tensor core utilization, whereas the starting-point kernel ran 2.6× slower on real-world traffic than on the synthetic benchmark.
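
The reason no correction stage is needed can be checked with a small reference implementation. Softmax needs a row-wise maximum and sum over all K/V blocks, so block-wise kernels must rescale partial results as new blocks arrive. With an element-wise activation, each K/V block's contribution is independent and can simply be added. The sketch below is plain Python for clarity, not the production kernel; `gdpa` and its arguments are illustrative names.

```python
import math


def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    return x / (1.0 + math.exp(-x))


ACTIVATIONS = {"gelu": gelu, "silu": silu, "identity": lambda x: x}


def gdpa(q, k, v, activation="gelu", block_kv=None):
    """Reference GDPA: out = activation(Q K^T / sqrt(d)) V.

    q: [Nq][d], k: [Nk][d], v: [Nk][dv] as nested lists.
    block_kv: process K/V in chunks of this size (None = one chunk).
    """
    act = ACTIVATIONS[activation]
    scale = 1.0 / math.sqrt(len(q[0]))
    n_kv = len(k)
    block = block_kv or max(n_kv, 1)
    out = [[0.0] * len(v[0]) for _ in q]
    for start in range(0, n_kv, block):          # loop over K/V blocks
        stop = min(start + block, n_kv)
        for i, q_row in enumerate(q):
            for j in range(start, stop):
                score = act(scale * sum(a * b for a, b in zip(q_row, k[j])))
                # Plain accumulation: no running max, no rescaling of `out`.
                for c, v_jc in enumerate(v[j]):
                    out[i][c] += score * v_jc
    return out


q = [[1.0, 0.0], [0.5, -0.5]]
k = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
full = gdpa(q, k, v, activation="gelu")
chunked = gdpa(q, k, v, activation="gelu", block_kv=1)
print(full)
print(chunked)  # matches `full` without any correction step
```

Running the same comparison with softmax would require carrying the running maximum and denominator between chunks, which is the work the correction warps do in an LLM-style kernel.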

Snippet Two: CompSkip Layer-Wise Component Selection and Event-level Personalization

# CompSkip: layer-wise component selection
# Source: Kunlun paper (arXiv:2602.10016), Section 3.2
# The insight: not every layer benefits equally from every module

import torch
import torch.nn as nn

class KunlunLayer(nn.Module):
    """
    One Kunlun layer with CompSkip applied.

    CompSkip observations from ablations:
    - Early layers: feature interaction (FM, MLP) provides most NE gain
    - Late layers: sequence modeling (attention, PFFN) provides most NE gain
    - Middle layers: balanced contribution from both

    CompSkip:
    - At each layer, independently decide whether to run transformer block,
      interaction block, or both
    - The decision is made based on a learned gate or a fixed allocation
      derived from the resource budget
    ← THIS is the trick: the MFU gain from GDPA + HSP creates a compute
      budget surplus. CompSkip decides how to allocate that surplus
      instead of naively scaling all components uniformly.
    """

    def __init__(
        self,
        d_model: int,
        use_transformer: bool = True,  # CompSkip: skip if not useful this layer
        use_interaction: bool = True,   # CompSkip: skip if not useful this layer
    ):
        super().__init__()
        self.use_transformer = use_transformer
        self.use_interaction = use_interaction

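        # KunlunTransformerBlock and KunlunInteractionBlock are placeholders for
        # the paper's blocks; they are not defined in this snippet.
        if use_transformer:
            self.transformer_block = KunlunTransformerBlock(d_model)
        if use_interaction:
            self.interaction_block = KunlunInteractionBlock(d_model)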

    def forward(
        self,
        non_seq_features: torch.Tensor,   # context features [B, d]
        seq_features: torch.Tensor,        # user history [B, N, d]
    ) -> tuple[torch.Tensor, torch.Tensor]:

        seq_out = seq_features
        ctx_out = non_seq_features

        if self.use_transformer:
            # Sequence modeling: attend over user history
            # Guided by non-seq context (bidirectional flow)
            seq_out = self.transformer_block(seq_features, non_seq_features)

        if self.use_interaction:
            # Feature interaction: incorporate sequence summary into context
            # HSP summarizes seq_out → fixed-size seed vectors → FM-style interaction
            ctx_out = self.interaction_block(non_seq_features, seq_out)

        # Bidirectional update: both paths updated with each other's information
        return ctx_out, seq_out


class EventLevelPersonalization(nn.Module):
    """
    Event-level personalization: allocate more compute to high-value events.

    In RecSys, event types have different information density:
    - Purchase (conversion): very high signal, rare
    - Add-to-cart: high signal
    - Video watch (50%): medium signal
    - Impression without click: low signal

    Problem with uniform allocation:
    If all events share the same model layers (same compute budget),
    high-signal events are under-served and low-signal events waste compute.

    Event-level personalization:
    ← Each event type gets a DIFFERENT computational depth
    ← High-value events: more layers, more capacity
    ← Low-value events: fewer layers, less capacity

    This is NOT just different loss weights.
    It's different model CAPACITY per event type within the same model.
    """

    def __init__(self, event_type_configs: dict[str, int]):
        super().__init__()
        # event_type_configs: {"purchase": 8, "click": 6, "impression": 4}
        # Value = number of Kunlun layers allocated to this event type
        self.event_configs = event_type_configs

        # Each event type has its own tower of Kunlun layers.
        # ← In this simplified sketch the towers are fully independent; sharing
        #   early layers across event types would need a common trunk (not shown).
        self.event_towers = nn.ModuleDict({
            event: nn.ModuleList([
                KunlunLayer(d_model=256) for _ in range(depth)
            ])
            for event, depth in event_type_configs.items()
        })

    def forward(
        self,
        features: torch.Tensor,
        event_type: str,
    ) -> torch.Tensor:
        tower = self.event_towers[event_type]
        ctx = features[:, :256]    # non-sequence features
        seq = features[:, 256:]    # sequence features (reshaped)
        seq = seq.view(seq.shape[0], -1, 256)  # [B, N, d]

        for layer in tower:
            ctx, seq = layer(ctx, seq)

        return ctx  # final context for CTR prediction

Event-level personalization is the high-level insight that makes Kunlun's resource allocation efficient. The baseline (all events share the same model) wastes capacity on low-signal impressions and under-serves high-signal conversions. Giving purchase events more layers, and impression events fewer layers, within the same unified model is the allocation decision that maximizes NE per FLOP.
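
A back-of-the-envelope sketch shows how per-event depth changes average compute. It assumes every layer costs the same and that cost scales with depth, which is a simplification. The depth values come from the example config in the snippet; the traffic mix is hypothetical.

```python
from typing import Dict


def mean_depth(traffic_share: Dict[str, float], depth: Dict[str, int]) -> float:
    """Average number of layers executed per example, weighted by traffic share."""
    total = sum(traffic_share.values())
    if total <= 0:
        raise ValueError("traffic shares must sum to a positive value")
    return sum(share * depth[event] for event, share in traffic_share.items()) / total


# Depths from the snippet's example config.
per_event_depth = {"purchase": 8, "click": 6, "impression": 4}
# Hypothetical traffic mix: most examples are low-signal impressions.
traffic = {"purchase": 0.01, "click": 0.09, "impression": 0.90}

personalized = mean_depth(traffic, per_event_depth)
uniform = mean_depth(traffic, {event: 6 for event in per_event_depth})
print(f"uniform 6 layers : {uniform:.2f} layers/example")
print(f"per-event depths : {personalized:.2f} layers/example")
```

Under this mix the per-event config spends fewer layers per example on average while giving purchases more depth than the uniform model does.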

It In Action: End-to-End Worked Example

Task: CTR prediction for a user who has purchased 3 items in the past 30 days and is now viewing an ad for a product in the same category.

Input features:

Event type: "click" (6 Kunlun layers)

Dense features (continuous):
  - user_age_bucket: 0.34 (normalized)
  - time_of_day: 0.67
  - days_since_last_purchase: 0.23
  - ...
  → W_dense · x_dense → unified embedding ∈ R^256

Sparse features (categorical):
  - user_id embedding: [u_1, ..., u_256] ∈ R^256
  - item_id embedding: [i_1, ..., i_256] ∈ R^256
  - category_id embedding: [c_1, ..., c_256] ∈ R^256
  → concatenate + project → unified embedding ∈ R^256

Sequence features (user purchase history):
  - item_1 embedding: R^256, timestamp: t_1
  - item_2 embedding: R^256, timestamp: t_2
  - item_3 embedding: R^256, timestamp: t_3
  → 3 sequence tokens (short! N=3 vs LLM N=128k)
  → Sliding Window Attention can process entire sequence at once

Forward pass through 6 Kunlun layers:

Layer 1 (interaction-heavy via CompSkip):
  Interaction block: FM-style interaction across dense + sparse + seq embeddings
  → captures first-order cross-feature interactions
  CompSkip: transformer_block SKIPPED (early layer, interaction more valuable)

Layer 2-3 (balanced via CompSkip):
  Transformer block:
    Sliding Window Attention over 3 sequence tokens
    ← N=3: inner loop runs once per query, outer-loop scheduling used
    GDPA kernel: GELU activation (not softmax)
    → produces sequence-aware item representations
  Interaction block:
    HSP: K=4 seed vectors aggregate 3 sequence tokens
    → [B, 4, 256]: hierarchical user interest summary
    Personalized weight generation from non-seq context
    → weights sequence differently based on user context

Layer 4-6 (sequence-heavy via CompSkip):
  Transformer block: refines sequence model with full context
  Interaction block: final feature interaction with richer representations

Output: single logit → sigmoid → P(click) = 0.247
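
The per-layer decisions in this walkthrough can be written down as a fixed CompSkip schedule. This is a minimal sketch of the fixed-allocation variant that mirrors the example above (first layer skips the transformer block, later layers run both); the paper may instead use a learned gate, and `LayerPlan` and `compskip_schedule` are hypothetical names.

```python
from typing import List, NamedTuple


class LayerPlan(NamedTuple):
    use_transformer: bool
    use_interaction: bool


def compskip_schedule(num_layers: int, interaction_only: int = 1) -> List[LayerPlan]:
    """Fixed allocation: the first `interaction_only` layers skip the
    transformer block; every remaining layer runs both blocks."""
    if not 0 <= interaction_only <= num_layers:
        raise ValueError("interaction_only must be between 0 and num_layers")
    return [
        LayerPlan(use_transformer=(i >= interaction_only), use_interaction=True)
        for i in range(num_layers)
    ]


def block_executions(plan: List[LayerPlan]) -> int:
    """Number of blocks actually run across the whole stack."""
    return sum(int(p.use_transformer) + int(p.use_interaction) for p in plan)


plan = compskip_schedule(num_layers=6)
for index, layer in enumerate(plan, start=1):
    print(index, layer)
print("blocks run:", block_executions(plan), "of", 2 * len(plan))
```

Each `LayerPlan` maps directly onto the `use_transformer` and `use_interaction` flags of the `KunlunLayer` constructor shown earlier.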

Scaling law measurement:

Without Kunlun (baseline, 17% MFU):
  FLOPs budget: 100 GFLOPs
  NE improvement: δNE = 0.12%
  Scaling curve: non-monotonic, plateaus early

With Kunlun (37% MFU):
  FLOPs budget: 100 GFLOPs (same budget)
  NE improvement: δNE = 0.24% (2× per FLOP)
  Scaling curve: follows power law: NE ∝ C^(-α), α ≈ 0.07 (NE falls as compute C grows)

At 180 GFLOPs scale:
  Kunlun: 154.9 GFLOPs actual compute, NE improvement: +δ
  Prior SOTA: 172.3 GFLOPs actual compute, NE improvement: δ (baseline)
  Kunlun achieves better NE with LOWER actual FLOPs
  ← Better algorithmic efficiency, not just better hardware utilization
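
What "predictable power law" buys in practice is the ability to fit a line in log-log space and extrapolate. The sketch below fits NE = a · C^b by least squares on synthetic points generated with the exponent quoted above; the data are not measurements from the paper, and the function names are illustrative.

```python
import math


def fit_power_law(compute, ne):
    """Least-squares fit of log(ne) = log(a) + b * log(compute). Returns (a, b)."""
    if len(compute) != len(ne) or len(compute) < 2:
        raise ValueError("need at least two (compute, ne) pairs")
    xs = [math.log(c) for c in compute]
    ys = [math.log(n) for n in ne]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict_ne(a, b, compute):
    return a * compute ** b


# Synthetic curve: NE = 0.9 * C^(-0.07), with C in GFLOPs (illustrative only).
budgets = [10.0, 20.0, 40.0, 80.0, 160.0]
observed = [0.9 * c ** -0.07 for c in budgets]

a, b = fit_power_law(budgets, observed)
print(f"a={a:.3f}, b={b:.3f}")
print(f"forecast NE at 320 GFLOPs: {predict_ne(a, b, 320.0):.4f}")
```

The fitted exponent is negative because NE is a lower-is-better metric. With a non-monotonic curve like the baseline's, the same fit would leave large residuals and the forecast would be unreliable.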

Why This Design Works, and What It Trades Away

The MFU gap between RecSys and LLMs (3-15% vs 40-60%) is the fundamental engineering problem Kunlun addresses, and it explains why recommendation system scaling laws could not be derived before: you cannot fit a clean power law to a process that is at least 85% noise. The noise in this case is compute wasted on memory-bound operations, irregular tensor shapes, and sequential bottlenecks.

Kunlun's solution is not a single magic component. It is the composition of five targeted fixes, each addressing a specific efficiency gap:

GDPA eliminates the softmax correction overhead and unifies three formerly separate interaction modules (self-attention, PMA, PFFN) under one kernel, enabling a single fused implementation that achieves 97% tensor core utilization.
HSP replaces flat pooling with hierarchical seed aggregation, providing richer sequence summaries at lower cost.
Sliding Window Attention reduces sequence modeling from O(N²) to O(N×window) for the short sequences that dominate RecSys workloads.
CompSkip eliminates the assumption that all layers need all components.
Event-level Personalization eliminates the assumption that all event types deserve equal compute.
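
The O(N²) versus O(N×window) claim for Sliding Window Attention can be checked by counting query-key pairs. The sketch assumes a causal window in which each token attends to itself and the preceding window-1 tokens; the paper's exact windowing may differ, and the window size of 64 is a hypothetical value.

```python
def causal_pairs(n: int) -> int:
    """Query-key pairs in full causal attention over n tokens."""
    return n * (n + 1) // 2


def sliding_window_pairs(n: int, window: int) -> int:
    """Query-key pairs when each token attends to itself and at most
    `window - 1` tokens before it."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return sum(min(i + 1, window) for i in range(n))


for n in (3, 64, 1024):
    full = causal_pairs(n)
    windowed = sliding_window_pairs(n, window=64)
    print(f"N={n:5d}  full={full:7d}  window=64 -> {windowed:7d}")
```

For sequences no longer than the window the two counts are identical, so the saving only appears once histories grow past the window size.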

The efficiency gains from the low-level fixes (GDPA, HSP, Sliding Window) create a compute budget surplus. The high-level fixes (CompSkip, Event-level Personalization) determine how that surplus is allocated to maximize NE per FLOP. This two-level codesign is the reason Kunlun achieves 2× scaling efficiency: the gains compound.

What Kunlun trades away:

Generalizability beyond Meta's feature distribution. The GDPA kernel is optimized for short, asymmetric, jagged sequences with large batch sizes. These are the characteristic shapes of Meta's ad recommendation workload. A RecSys team with different feature distributions (longer sequences, more uniform lengths, smaller batches) may see different efficiency gains.

Architecture complexity. Five independent improvements, each with its own hyperparameters (number of seeds in HSP, window size in Sliding Window Attention, CompSkip allocation per layer, event-type compute budgets) increase the hyperparameter space. The paper reports ablation results for each component, but combining them optimally for a new system requires careful validation.

Architectural debt from heterogeneous event towers. Event-level personalization gives different event types different computational depths. When the set of event types changes, the model architecture must be updated. This is a maintenance challenge that uniform architectures avoid.

Technical Moats

The GDPA kernel on NVIDIA B200. The GDPA kernel is co-authored by Tri Dao (FlashAttention author) and the Meta team and achieves 97% tensor core utilization. The production code is available at github.com/facebookresearch/ads_model_kernel_library, and the kernel design is published, so the moat is not secrecy. It is the production-driven optimization knowledge: the real-world traffic distribution, the specific shape distributions of Meta's workloads, and the B200 cluster configurations that motivated each design choice. A team without access to a comparable real-world data distribution will struggle to validate whether its GDPA implementation is correctly tuned.

The 37% MFU baseline for comparison. Kunlun at 37% MFU is still below LLM efficiency levels (40-60%). It is, however, roughly 2× the 17% it started from, and it comes with a predictable power-law scaling relationship. Any system that claims to challenge Kunlun's results must match both the MFU improvement and the predictable scaling law, not just one.

The integration with GEM. Kunlun is deployed in Meta's Generative Ads Model, the system described as "the central brain accelerating ads recommendation AI innovation." The production deployment provides implicit validation that the architecture is robust to the full diversity of real-world traffic, not just the training distribution used in the paper.

Insights

Insight One: The reason recommendation system scaling laws have been elusive is not that recommendation models are fundamentally harder to scale than LLMs. It is that nobody cleaned up the compute waste before trying to derive the law. Kunlun's contribution is methodological: it shows that clean scaling laws appear once you fix MFU, rather than introducing a new architectural principle.

The Wukong paper (arXiv:2403.02545, Meta 2024) was the first to establish a scaling law for large-scale recommendation using a unified architecture (FM-based interaction stacking). Kunlun extends this by solving the MFU problem that was preventing Wukong's scaling from being fully efficient. The intellectual contribution is not "we found a new way to process features." It is "we found that the existing approaches were wasting most of their compute, and when you fix that, the scaling law appears as predicted." This is the same insight that motivated FlashAttention for LLMs: attention was bottlenecked by memory bandwidth, not compute. Fix the bandwidth bottleneck and the compute hardware's full potential becomes accessible.

Insight Two: Event-level personalization, giving different event types different computational depths within a single model, is the most practically impactful contribution in the paper and the one most likely to be adopted independently of the rest of the Kunlun architecture. It requires no kernel-level work and directly addresses a known inefficiency in most production RecSys stacks.

Most production recommendation systems use a single model for all event types, possibly with different loss weights or multi-task heads, but with the same number of layers and the same computational budget per example. Event-level personalization challenges this: the signal from a purchase is qualitatively different from the signal from a video impression, and giving them the same model capacity is a fundamental mismatch. The specific mechanism (separate towers of Kunlun layers per event type, with different depths) is straightforward to implement on top of any modern RecSys framework. The insight generalizes well beyond Kunlun: any team using a single-depth architecture for multi-event CTR prediction is likely under-serving high-signal events and over-serving low-signal ones.

Takeaway

The GDPA kernel on NVIDIA B200 achieves 97% tensor core utilization, which is higher than the typical utilization for LLM attention kernels on the same hardware. The starting point was a kernel whose real-world forward pass ran 2.6× slower than its synthetic benchmark; the redesigned kernel now outperforms LLM-oriented kernels on this specific workload. The key was recognizing that the problem was not the hardware. It was the assumption that LLM-style kernel design would transfer to RecSys workloads.

The PyTorch GDPA blog post (March 2026) documents the performance gap between real-world RecSys traffic and the synthetic benchmark: the real-world forward pass was 2.6× slower than the benchmark, with worst-case gaps reaching 4×. This gap existed because LLM kernels optimize for long, dense, uniform sequences, and RecSys traffic is short, sparse, and jagged. The GDPA kernel redesigns the pipeline for these characteristics: outer-loop scheduling (not inner-loop), elimination of the softmax correction stage, and software-level tile scheduling for jagged inputs. The resulting kernel achieves better tensor core utilization on RecSys workloads than the LLM kernels it replaced. The lesson: kernel specialization for workload-specific characteristics matters as much as hardware selection itself.
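
Software-level tile scheduling for jagged inputs can be illustrated on the host side: build the list of query tiles from each sample's actual length instead of from the padded maximum. This is a sketch of the idea only. The production kernel does its scheduling on the GPU, and the function names, the packed-offset layout, and the example lengths here are assumptions.

```python
from itertools import accumulate
from typing import List, Tuple


def offsets_from_lengths(lengths: List[int]) -> List[int]:
    """Cumulative start offsets into a packed (jagged) token buffer."""
    return list(accumulate(lengths, initial=0))


def build_tile_schedule(seq_offsets: List[int], block_q: int) -> List[Tuple[int, int, int]]:
    """Return (sample, row_start, row_end) tiles that cover only real tokens."""
    if block_q < 1:
        raise ValueError("block_q must be at least 1")
    tiles = []
    for sample in range(len(seq_offsets) - 1):
        start, end = seq_offsets[sample], seq_offsets[sample + 1]
        for row in range(start, end, block_q):
            tiles.append((sample, row, min(row + block_q, end)))
    return tiles


def padded_tile_count(lengths: List[int], block_q: int) -> int:
    """Tiles needed if every sample were padded to the longest length."""
    longest = max(lengths)
    return -(-longest // block_q) * len(lengths)


lengths = [3, 64, 8, 0, 130]  # hypothetical jagged batch
tiles = build_tile_schedule(offsets_from_lengths(lengths), block_q=64)
for tile in tiles:
    print(tile)
print("jagged tiles:", len(tiles), "padded tiles:", padded_tile_count(lengths, 64))
```

Empty samples produce no tiles at all, and short samples produce one partial tile, so the work list tracks real tokens rather than the longest sequence in the batch.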

TL;DR For Engineers

Kunlun (arXiv:2602.10016, Meta, Feb 2026) raises RecSys model MFU from 17% → 37% on NVIDIA B200 GPUs and doubles scaling efficiency over prior SOTA. Deployed in Meta GEM (Meta's largest Ads RecSys training foundation model). The result: a predictable power-law scaling relationship between compute and NE, enabling reliable capacity planning.
Root cause of poor RecSys scaling: 3-15% MFU (vs 40-60% for LLMs) due to heterogeneous feature spaces, small embedding dimensions, short and jagged sequences, and irregular tensor shapes. LLM attention kernels assume long, dense, uniform sequences and fail on RecSys workloads.
Five technical contributions: GDPA (replaces softmax with GELU/SiLU activations, unifies self-attn/PMA/PFFN into one kernel, eliminates softmax correction warps, achieves 97% tensor core utilization), HSP (hierarchical seed pooling), Sliding Window Attention, CompSkip (layer-wise module selection), Event-level Personalization (different compute depth per event type).
GDPA kernel co-authored with Tri Dao, published at PyTorch blog (March 2026), code at github.com/facebookresearch/ads_model_kernel_library. 2× forward speedup, 1.6× backward speedup, 30% end-to-end throughput improvement.
Event-level personalization is the most independently replicable contribution: give high-signal events (purchases) more layers than low-signal events (impressions) within a unified model. Directly addresses the mismatch between event information density and model capacity allocation in production RecSys stacks.

Clean the Waste, Then Derive the Law

Kunlun's central result is not a new accuracy record or a new model architecture. It is a methodological result: recommendation system scaling laws become predictable once you fix the compute waste. The 17% → 37% MFU improvement is the evidence that the waste was fixable. The predictable power-law scaling is the evidence that the methodology was correct. The deployment in Meta Ads is the evidence that it works in production.

The RecSys community has spent years unable to derive reliable scaling laws, attributing this to the fundamental heterogeneity of recommendation features. Kunlun's answer is more direct: the scaling laws were there. The compute efficiency was not.

References

Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design, arXiv:2602.10016, Meta, February 2026
Generalized Dot-Product Attention: Tackling Real-World Challenges in GPU Training Kernels, PyTorch Blog, March 2026
GDPA Kernel Code, GitHub, github.com/facebookresearch/ads_model_kernel_library
Wukong: Towards a Scaling Law for Large-Scale Recommendation, arXiv:2403.02545, Meta, 2024 — the predecessor that established scaling laws for FM-based RecSys
UniMixer: A Unified Architecture for Scaling Laws in Recommendation Systems, arXiv:2604.00590, 2026 — provides a unified theoretical framework for Wukong, Kunlun, and other scaling approaches
Understanding Scaling Laws for Recommendation Models, arXiv:2208.08489 — foundational analysis of RecSys scaling properties
Meta GEM: Meta's Generative Ads Model — the production system where Kunlun is deployed
FlashAttention-4, Tri Dao — the LLM attention kernel that GDPA builds on and diverges from
