SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | June 10, 2026

Every major vision-language model that does object detection has the same bottleneck hiding in plain sight: it generates bounding box coordinates the same way a drunk person reads off a phone number, one digit at a time, left to right, with no awareness that x1, y1, x2, y2 are four sides of the same geometric fact.

NVIDIA's LocateAnything (arXiv:2605.27365) kills this bottleneck. It predicts each complete bounding box in a single parallel forward pass, trains on 138 million samples and 785 million boxes, and posts 12.7 boxes per second (BPS) throughput on a single H100, compared to 1.1 BPS for Qwen3-VL and 5.0 BPS for Rex-Omni. It is also more accurate: +3.8% mean F1 on LVIS and +14.5 points on M6Doc layout grounding over prior VLM-based approaches.

Speed and accuracy improving together is not the normal tradeoff story. Understanding why requires going inside the decoding architecture.

Scope: the Parallel Box Decoding (PBD) mechanism, the four-block output formulation, the three inference modes, and the LocateAnything-Data construction. Not covered: the full training recipe details or the embodied agent integration in the Eagle repo.

What It Actually Does

LocateAnything is a unified visual grounding and detection model. You give it an image and a natural-language query ("find all pedestrians," "locate the submit button," "where is the price tag?"), and it returns bounding boxes, tightly localized and ranked.

The model supports six task categories from one unified backbone: general object detection, GUI element grounding, referring expression comprehension, text/OCR localization, layout grounding, and point-based localization. Prior models either trained separate heads per task or paid severe latency penalties to unify them under autoregressive token generation.

The backbone is Moon-ViT (vision encoder, native resolution) plus Qwen2.5 (language decoder), bridged by an MLP projector. The novel component is not the backbone. It is what happens at the output: instead of emitting coordinate tokens one by one, LocateAnything emits fixed-length atomic blocks, one complete box per block, in a single parallel step.

The Architecture, Unpacked

Caption: Focus on the BOX BLOCK output unit. All four coordinates (x1, y1, x2, y2) are emitted together in one parallel step. This is the entire architectural bet.

The four block types define a complete output grammar. A Semantic Block names the category. A Box Block carries the four coordinates as a unit. A Negative Block signals no match was found (critical for open-world detection where negative queries are common). An End Block terminates the sequence. The LM emits these in order: Semantic → Box (or Negative) → repeat → End.
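The grammar above is small enough to write down as a validator. This is a minimal sketch, not the model's actual decoding code: the `Block` enum and `is_valid_sequence` are hypothetical names, and it assumes that several Box blocks may follow one Semantic block, which the ordering rule suggests but does not spell out.

```python
from enum import Enum


class Block(Enum):
    SEMANTIC = "semantic"
    BOX = "box"
    NEGATIVE = "negative"
    END = "end"


def is_valid_sequence(blocks):
    """Check the ordering Semantic -> (Box... | Negative) -> repeat -> End."""
    if not blocks or blocks[-1] is not Block.END:
        return False
    prev = None
    for i, block in enumerate(blocks):
        if block is Block.END:
            # End must be last and cannot directly follow an unanswered Semantic
            if i != len(blocks) - 1 or prev is Block.SEMANTIC:
                return False
        elif block is Block.SEMANTIC:
            # A new category cannot start before the previous one got a result
            if prev is Block.SEMANTIC:
                return False
        elif block is Block.BOX:
            # A box needs a category: it follows a Semantic block or another Box
            if prev not in (Block.SEMANTIC, Block.BOX):
                return False
        elif block is Block.NEGATIVE:
            # Negative stands in for the boxes, so it follows Semantic only
            if prev is not Block.SEMANTIC:
                return False
        prev = block
    return True


S, B, N, E = Block.SEMANTIC, Block.BOX, Block.NEGATIVE, Block.END
print(is_valid_sequence([S, B, B, S, N, E]))  # two boxes, then a negative query
print(is_valid_sequence([B, E]))              # box without a category
```

A check like this is the kind of thing a format validator can run cheaply on every emitted block, since each rule only looks at the previous block type.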

The x-y Corner Order sorting (top-left to bottom-right) was validated in ablation to be the highest-accuracy box ordering strategy among four tested orderings. The model learns a spatial prior from the ordering; arbitrary output ordering hurts F1.
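To make the ordering concrete, here is one plausible reading of it as a sort key. The exact tie-breaking rule from the paper is not reproduced in this article, so treat the key below (top-left corner first, x before y, then the bottom-right corner) as an assumption, and `corner_order` as a hypothetical helper name.

```python
def corner_order(boxes):
    """Sort (x1, y1, x2, y2) boxes by top-left corner, then bottom-right corner.

    Assumed key: x1, then y1, then x2, then y2. The paper's exact rule may differ.
    """
    return sorted(boxes, key=lambda b: (b[0], b[1], b[2], b[3]))


boxes = [
    (300, 40, 380, 200),
    (10, 50, 90, 210),
    (10, 20, 60, 100),
]
for box in corner_order(boxes):
    print(box)
```

Whatever the precise key, the point is that training targets are emitted in one deterministic spatial order, so the model can learn where the next box is likely to be.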

The Code, Annotated

Snippet 1: HuggingFace inference with Hybrid Mode

# Source: nvidia/LocateAnything-3B on HuggingFace
# Pattern: load model, run grounding query, parse block-format output

from transformers import AutoModelForCausalLM, AutoProcessor
import torch

# Load the 3B model from HuggingFace
# ← Native resolution processing preserved in the processor config
#   DO NOT resize the image before passing: Moon-ViT is resolution-aware
model = AutoModelForCausalLM.from_pretrained(
    "nvidia/LocateAnything-3B",
    torch_dtype=torch.bfloat16,  # ← bfloat16 preferred for H100/A100
    device_map="cuda",
    trust_remote_code=True,      # ← required for custom PBD decoding head
)
processor = AutoProcessor.from_pretrained(
    "nvidia/LocateAnything-3B",
    trust_remote_code=True,
)

# Prepare input: image + natural-language query
from PIL import Image
image = Image.open("street_scene.jpg")  # high-res is fine; native resolution

query = "Locate all pedestrians"

inputs = processor(
    images=image,
    text=query,
    return_tensors="pt",
).to("cuda")

# Run inference in Hybrid Mode (default)
# ← THIS is the trick: the model internally decides per-block
#   whether to use MTP (parallel) or NTP (autoregressive)
#   based on format and spatial consistency checks
#   No user-facing parameter needed: hybrid is default at generate() time
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,      # ← enough for ~20-30 boxes at 4 coords each
        do_sample=False,         # ← greedy for deterministic localization
        # decoding_mode="hybrid" is the default; also accepts "fast" or "slow"
    )

# Decode the block-format output
# ← output is NOT plain text coordinates; it is a structured block sequence
#   The processor.decode_boxes() parses Semantic + Box blocks into Python dicts
raw_text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
boxes = processor.decode_boxes(raw_text)
# Returns: [{"label": "pedestrian", "box": [x1, y1, x2, y2], "score": ...}, ...]
print(f"Found {len(boxes)} pedestrians")
for b in boxes:
    print(f"  {b['label']}: {b['box']}")  # ← box is already a list; no extra brackets

The trust_remote_code=True flag is not a convenience option. The PBD decoding head is custom code that the standard HuggingFace generate() does not know about natively. It must be loaded from the model repo.

Snippet 2: Hybrid Mode fallback trigger (from paper design intent)

# Reconstructed from paper Section 3.3 (On-Demand Inference Mechanism)
# This shows the design logic of how Hybrid Mode decides to fall back to NTP

def hybrid_decode_block(model, prefix_ids, block_idx):
    """
    Attempt parallel (MTP) decoding for one box block.
    Fall back to sequential (NTP) decoding if validation fails.
    """

    # ATTEMPT: parallel decode the full box in one step
    # ← MTP head predicts all 4 coordinates simultaneously
    #   from the last verified prefix
    parallel_output = model.mtp_head(prefix_ids)
    # parallel_output = {"x1": tok, "y1": tok, "x2": tok, "y2": tok}

    # VALIDATE: check two failure conditions
    # Condition 1: Format Irregularity
    # ← malformed token syntax at category boundaries
    #   e.g., x2 token falls in the Semantic vocabulary range
    format_ok = validate_token_format(parallel_output)

    # Condition 2: Spatial Ambiguity
    # ← intermediate coordinate lands between densely arranged objects
    #   where the parallel head is statistically unreliable
    #   (learned threshold, not a fixed rule)
    spatial_ok = validate_spatial_consistency(parallel_output)

    if format_ok and spatial_ok:
        # FAST PATH: accept the parallel box, move to next block
        return parallel_output                # ← 1 forward pass for this box

    else:
        # SLOW PATH: discard the parallel output entirely
        # Revert to last verified prefix and re-decode autoregressively
        # ← THIS is the trick: NTP is applied ONLY to the one bad block,
        #   not to the entire remaining sequence
        #   All subsequent blocks still use MTP
        ntp_output = model.ntp_decode(prefix_ids, block_idx)
        return ntp_output                     # ← 4 sequential steps for this box

The localized NTP re-decode is the key insight of Hybrid Mode. The penalty for a bad parallel block is 4 extra sequential steps for that block only, not a full regression to sequential decoding for the entire sequence. This is why Hybrid Mode hits 13.2 BPS on COCO vs. 16.9 BPS for pure Fast Mode while maintaining near-Slow-Mode accuracy at 51.6 vs. 52.1 F1.
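The per-block penalty is easy to see in a toy version that runs without a model. Everything here is a stand-in: the coordinate-token range `0..999` is hypothetical, `spatial_ok` is a plain geometric rule rather than the learned check the paper describes, and the "model" is just two lists of precomputed answers.

```python
COORD_MIN, COORD_MAX = 0, 999  # hypothetical coordinate-token id range


def format_ok(box):
    """All four tokens must fall in the (assumed) coordinate vocabulary."""
    return len(box) == 4 and all(COORD_MIN <= t <= COORD_MAX for t in box)


def spatial_ok(box):
    """Stand-in for the learned spatial check: require a non-degenerate box."""
    x1, y1, x2, y2 = box
    return x2 > x1 and y2 > y1


def hybrid_decode(parallel_guesses, sequential_answers):
    """Return (boxes, forward_passes) for a list of box blocks.

    parallel_guesses[i]:   what the parallel head proposes for block i
    sequential_answers[i]: what token-by-token decoding would produce
    """
    boxes, passes = [], 0
    for guess, fallback in zip(parallel_guesses, sequential_answers):
        passes += 1                  # one parallel pass for the attempt
        if format_ok(guess) and spatial_ok(guess):
            boxes.append(guess)
        else:
            passes += 4              # re-decode this block only, 4 steps
            boxes.append(fallback)
    return boxes, passes


guesses = [(312, 847, 689, 871), (500, 200, 450, 260), (40, 60, 120, 180)]
answers = [(312, 847, 689, 871), (400, 200, 450, 260), (40, 60, 120, 180)]
boxes, passes = hybrid_decode(guesses, answers)
print(boxes[1], passes)  # the second block fell back; 7 passes instead of 12
```

Only the middle block fails validation (its x2 is left of its x1), so the run costs 1 + 5 + 1 = 7 passes. Pure sequential decoding of the same three boxes would cost 12.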

It In Action: End-to-End Worked Example

Input: A high-resolution screenshot of a dense UI (ScreenSpot-Pro benchmark format). Query: "Locate the search input field."

Step 1: Vision encoding (~15ms on H100)

Moon-ViT processes the screenshot at native resolution. A 1920x1080 screenshot becomes a grid of visual tokens preserving exact pixel-level spatial detail. No downsampling: this is where "native resolution" matters for GUI grounding, where UI elements can be as small as 16x16 pixels.

Step 2: LM forward pass with block generation

Qwen2.5 receives the visual token stream and the tokenized query. It begins emitting block sequences.

Block emission for one hit:

<sem> search_input </sem>
<box> <312> <847> <689> <871> </box>

With PBD, all four coordinate tokens are predicted in one MTP forward pass. NTP, by contrast, emits <312>, then <847>, then <689>, then <871> across four sequential steps, each waiting on the one before it.

Step 3: Hybrid Mode validation

The format check passes (all four tokens fall in the coordinate vocabulary). The spatial check passes (no overlapping UI element at this coordinate range detected in context). Block accepted from Fast Mode.

Step 4: Output

[
  {
    "label": "search_input",
    "box": [312, 847, 689, 871],
    "confidence": 0.94
  }
]
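For readers who want to see the hop from the Step 2 block text to the Step 4 structure, a small regex parser covers the single-box case. This is a sketch under assumptions: `parse_blocks` is a hypothetical helper, not the processor's own parser, and it only handles the `<sem>` and `<box>` tag shapes shown in this example, with one box per semantic block. The confidence value is not part of the block text, so it is not reconstructed here.

```python
import re

# Matches one <sem>...</sem> block followed by one <box> with four <N> tokens.
BLOCK_RE = re.compile(
    r"<sem>\s*(?P<label>.+?)\s*</sem>\s*"
    r"<box>\s*<(\d+)>\s*<(\d+)>\s*<(\d+)>\s*<(\d+)>\s*</box>"
)


def parse_blocks(text):
    """Turn block-format text into a list of {"label", "box"} dicts."""
    results = []
    for match in BLOCK_RE.finditer(text):
        # Group 1 is the named label group; groups 2..5 are the coordinates.
        x1, y1, x2, y2 = (int(match.group(i)) for i in range(2, 6))
        results.append({"label": match.group("label"), "box": [x1, y1, x2, y2]})
    return results


raw = "<sem> search_input </sem>\n<box> <312> <847> <689> <871> </box>"
print(parse_blocks(raw))
```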

Real numbers from ScreenSpot-Pro benchmark: LocateAnything achieves 60.3 mean F1 on ScreenSpot-Pro, beating Qwen3-VL-30B-A3B (a 30B parameter model) with a 3B parameter model. Throughput: 12.7 BPS on a single H100. Textual-digit Qwen3-VL: 1.1 BPS. That is an 11.5x throughput advantage at higher accuracy with one-tenth the parameters.

For dense detection (300 boxes in one image), LocateAnything scales to ~25 BPS because the parallel speedup compounds as box count increases. NTP at 300 boxes suffers severe latency collapse. PBD does not.

Why This Design Works (and What It Trades Away)

Why it works:

The key insight is that a bounding box is not a sequence of four independent numbers. It is one geometric fact: a rectangle. The x1 coordinate implies constraints on x2 (x2 > x1). The y1 coordinate implies constraints on y2 (y2 > y1). Sequential generation ignores these constraints during decoding. PBD enforces them implicitly because the model learns to emit the four coordinates as a jointly conditioned unit during training.

The box-aligned training target is the second critical piece. Generic MTP randomly chunks token sequences, which means bounding box coordinates can be split across chunks, forcing the model to learn arbitrary partial-box distributions. LocateAnything explicitly aligns MTP block boundaries with box boundaries. The model never sees a training example where a chunk boundary falls inside a box. This eliminates a class of spurious correlations that hurt both accuracy and reliability.
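The difference between the two chunking schemes shows up on even a ten-token sequence. The sketch below uses fixed-size chunking as a simple stand-in for generic MTP chunking (the article describes it as random; the failure mode is the same), and both function names are hypothetical. Integers play the role of coordinate tokens and strings play the role of category tokens.

```python
def fixed_chunks(tokens, size):
    """Structure-blind chunking: cut every `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def box_aligned_chunks(tokens, is_coord):
    """Cut so that each run of 4 coordinate tokens is exactly one chunk."""
    chunks, i = [], 0
    while i < len(tokens):
        if is_coord(tokens[i]):
            chunks.append(tokens[i:i + 4])   # one whole box
            i += 4
        else:
            chunks.append(tokens[i:i + 1])   # non-coordinate token on its own
            i += 1
    return chunks


tokens = ["car", 10, 20, 30, 40, "dog", 50, 60, 70, 80]
is_coord = lambda t: isinstance(t, int)

print(fixed_chunks(tokens, 4))
print(box_aligned_chunks(tokens, is_coord))
```

With a chunk size of 4, the structure-blind version splits both boxes across chunk boundaries, so the training target for a chunk is a partial box glued to part of its neighbor. The box-aligned version never produces such a chunk.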

The 138M training samples and 785M boxes in LocateAnything-Data close the domain gap. PBD alone does not explain the accuracy gains over Rex-Omni; the training data diversity (six task domains, 12M unique images) is doing a significant share of the work on benchmark generalization.

What it trades away:

The custom PBD decoding head is not compatible with standard HuggingFace generate() without trust_remote_code=True. Deployment in environments with locked inference serving (e.g., some cloud APIs that strip remote code) requires additional integration work.

Hybrid Mode's fallback trigger (format and spatial consistency checks) is a learned threshold, not a provably tight bound. In extreme edge cases (very dense scenes with hundreds of overlapping objects), the frequency of NTP fallbacks can reduce effective throughput toward Slow Mode territory.
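A back-of-envelope model makes the fallback cost tangible. The assumptions are strong and are mine, not the paper's: decoding time is taken to scale with the number of forward passes per box, a failed block costs one parallel attempt plus four sequential steps, and everything else (vision encoding, semantic tokens, validation overhead) is ignored. The function names are hypothetical.

```python
def passes_per_box(fallback_rate, coords_per_box=4):
    """Expected forward passes per box: 1 parallel attempt + re-decode on failure."""
    return 1 + fallback_rate * coords_per_box


def relative_throughput(fallback_rate):
    """Throughput relative to a zero-fallback run, if time scales with passes."""
    return 1 / passes_per_box(fallback_rate)


def implied_fallback_rate(hybrid_bps, fast_bps, coords_per_box=4):
    """Invert the toy model: which fallback rate would explain a throughput ratio?"""
    return (fast_bps / hybrid_bps - 1) / coords_per_box


for rate in (0.0, 0.1, 0.5, 1.0):
    print(rate, round(relative_throughput(rate), 2))

# COCO figures quoted earlier in this article: 13.2 BPS hybrid vs. 16.9 BPS fast
print(round(implied_fallback_rate(13.2, 16.9), 3))
```

Under this toy model, the COCO gap between Hybrid and Fast Mode would correspond to roughly 7% of blocks falling back. That is an estimate derived from the model above, not a number reported in the paper. The same model also shows the worst case: if every block falls back, each box costs five passes, slightly worse than plain sequential decoding.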

The model is 3B parameters. For on-device robotics use cases (the stated primary target for Fast Mode), 3B still requires a mid-tier GPU or NPU. A sub-1B distilled variant has not been released.

Technical Moats

Box-aligned MTP training formulation. The core innovation is not the inference-time parallel decoding; it is the training objective that aligns MTP block boundaries with box boundaries. Any team applying generic MTP to coordinate generation without this alignment gets the accuracy degradation shown in Figure 2 of the paper. The ablation shows generic SDLM-B6 MTP at 5.5 BPS and 47.8 F1 vs. PBD at 16.9 BPS and 52.4 F1 in Fast Mode. The accuracy gap persists even when throughput is matched.

LocateAnything-Data at 138M queries and 785M boxes. Building a training set of this scale and domain diversity (general OD, GUI, OCR, layout, referring, point) is not a weekend project. The data engine that produced it is not open-sourced in the current release. Models trained only on public grounding datasets will not reproduce these benchmark numbers.

Native resolution Moon-ViT encoder. Preserving full spatial detail in the visual token stream is non-negotiable for GUI grounding and OCR localization, where small elements are the entire task. Downsampled encoders lose the signal before it reaches the LM decoder.

Contrarian Insights

Insight 1: The throughput numbers are only impressive if you are bottlenecked at decoding, and most production vision pipelines are not.

12.7 BPS means LocateAnything can decode about 13 boxes per second in single-stream inference on an H100. A real-time robotics application at 30 FPS needs to decode one frame every 33ms, and that frame might contain 20-50 objects. At 12.7 BPS, locating 50 boxes takes ~4 seconds. The throughput advantage matters primarily for batch offline workloads (dataset curation, UI automation at scale) and for small box counts at low latency. For genuine real-time dense detection, even LocateAnything's Fast Mode is not close to frame-rate capable on complex scenes without batching optimizations not described in the paper. The ~25 BPS scaling claim for 300 boxes assumes the latency per box is constant, which the scaling chart shows is approximately true but not guaranteed across scene types.
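The arithmetic behind this argument fits in a few lines. The throughput figures are the ones quoted in this article; the helper names are made up for illustration, and the calculation assumes single-stream decoding with no batching.

```python
def seconds_for_boxes(num_boxes, boxes_per_second):
    """Time to decode `num_boxes` at a fixed single-stream throughput."""
    return num_boxes / boxes_per_second


def boxes_within_budget(budget_seconds, boxes_per_second):
    """How many boxes fit into a latency budget at that throughput."""
    return budget_seconds * boxes_per_second


frame_budget = 1 / 30  # seconds per frame at 30 FPS

print(round(seconds_for_boxes(50, 12.7), 2))              # 3.94 s for 50 boxes
print(round(boxes_within_budget(frame_budget, 12.7), 2))  # 0.42 boxes per frame
print(round(12.7 / 1.1, 1))                               # 11.5x over Qwen3-VL
```

At the quoted throughput, less than half a box fits into one 33ms frame, which is why the speedup matters more for offline batch work than for frame-rate detection.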

Insight 2: Beating Qwen3-VL-30B-A3B at GUI grounding with a 3B model says more about the benchmark than the models.

ScreenSpot-Pro is a grounding benchmark for individual UI element localization from a text description. It tests spatial precision on static screenshots. Qwen3-VL-30B's strength is broad multimodal reasoning, chain-of-thought, and multi-step interaction, not high-frequency tight-box localization. LocateAnything is a specialist trained on 16.5% GUI data out of 138M samples. Specialists beat generalists on the specialist's home turf. The benchmark does not test whether LocateAnything can understand what a UI is for, navigate it across turns, or handle dynamic DOM changes. Those are where the 30B generalist wins.

Surprising Takeaway

PBD improves accuracy as well as speed, and that is theoretically expected but practically surprising. The usual ML assumption is that faster decoding trades away accuracy. PBD violates this because the accuracy improvement comes from the training objective, not from inference. Box-aligned MTP supervision eliminates spurious cross-box correlations that hurt the base NTP model's F1. PBD Slow Mode (pure NTP at inference, but trained with box-aligned MTP supervision) achieves 52.1 F1 on COCO, compared to 50.1 F1 for standard NTP with no PBD training. The accuracy gain costs nothing at inference time: it was being left on the table by training with the wrong loss structure.

TL;DR For Engineers

  • LocateAnything treats each bounding box (x1, y1, x2, y2) as an atomic unit predicted in one parallel LM forward pass, not four sequential token steps

  • 12.7 BPS on a single H100 vs. 1.1 BPS for Qwen3-VL and 5.0 BPS for Rex-Omni, with higher F1 across all major benchmarks

  • Hybrid Mode falls back to NTP only on the specific box block that fails format or spatial checks, not on the full sequence

  • The accuracy improvement over NTP comes from the box-aligned training objective, not just from inference-time parallelism

  • Training dataset: 138M queries, 785M boxes, 12M images, six task domains. Data scale is doing meaningful work alongside the architecture

Conclusion: The Geometry Was Never a Sequence

Treating a bounding box as a sequence of coordinate tokens was a convenient hack when VLMs first adopted generative detection. LocateAnything is the first model to formally study what happens when you stop pretending and align the decoding structure with the actual geometry.

The result is that speed and accuracy improve together, not because of a bigger model or more compute, but because the previous formulation was fighting the structure of the problem. A box is one thing. Decoding it as one thing is correct.

Whether PBD becomes the default coordinate representation for all VLM-based detection depends on whether LocateAnything-Data gets released and whether the custom decoding head gets upstreamed into standard inference frameworks. Both are open questions as of June 2026.

References

LocateAnything (NVIDIA, 2026) introduces Parallel Box Decoding, predicting complete bounding boxes as atomic four-coordinate units in a single LM forward pass instead of token-by-token. It achieves 12.7 BPS on an H100 (11.5x over Qwen3-VL) while improving localization F1 across LVIS, COCO, ScreenSpot-Pro, and document benchmarks, trained on 138M samples and 785M boxes across six task domains.
