SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | April 30, 2026
Every AI research paper needs figures. Methodology diagrams that explain how the system works. Statistical plots that communicate what the experiments proved. Illustrations that tell a reviewer something on the first pass, before they read a word. Researchers who cannot draw, and most cannot, spend hours in PowerPoint, Keynote, or Inkscape producing figures that are worse than what a professional illustrator would make. The alternative, hiring an illustrator, costs money and coordination overhead that the academic workflow was not designed to absorb.
PaperBanana (Peking University, Google Cloud AI Research, arXiv:2601.23265, January 2026) treats this as an agentic pipeline problem: five specialized LLM and VLM agents, a reference-driven retrieval system, an iterative self-critique loop, and a benchmark of 292 methodology diagrams from NeurIPS 2025 to measure against. The result is a system that outperforms every baseline on all four evaluation dimensions, with a conciseness improvement of 37.2% being the most dramatic signal.
This newsletter dissects PaperBanana as a systems document: what the five-agent pipeline does at each stage, how the generative retrieval approach works, why the Stylist agent's inductive approach to style guidelines solves a problem that manual specification cannot, how the Critic agent's self-critique loop closes, and what the image-generation vs. code-generation tradeoff means for statistical plots.
Scope: PaperBanana's five-agent architecture, PaperBananaBench evaluation, image vs. code generation for plots, style enhancement application, and comparison to FigAgent and LIDA. Not covered: the underlying image generation model (Nano-Banana-Pro) internals, or non-academic illustration systems.
What It Actually Does
PaperBanana is an agentic framework from Peking University and Google Cloud AI Research that takes a methodology description and figure caption as input and produces a publication-ready academic illustration as output. The framework orchestrates five specialized agents, retrieves reference examples from a curated database, and refines output through a self-critique loop.
The evaluation benchmark, PaperBananaBench, contains 584 samples (292 test, 292 reference) curated from NeurIPS 2025 papers. Average source context length: 3,020 words. Average caption length: 70 words. Evaluation is VLM-as-a-Judge across four dimensions: faithfulness (does the diagram accurately represent the described method?), conciseness (are elements necessary or cluttered?), readability (is the visual hierarchy clear?), and aesthetics (does it look like a modern AI paper figure?).
Benchmark results vs. leading baselines:
| Dimension | Improvement vs. best baseline | Notes |
|---|---|---|
| Faithfulness | +2.8% | |
| Conciseness | +37.2% | most dramatic |
| Readability | +12.9% | |
| Aesthetics | +6.6% | |
| Overall | +17.0% | |
The conciseness improvement (+37.2%) is the most diagnostic signal. Baselines, including vanilla Nano-Banana-Pro with direct prompting, produce diagrams with "outdated color tones and overly verbose content." PaperBanana's Stylist and Critic agents specifically target these failure modes, producing diagrams that are more concise while maintaining faithfulness.
The Architecture
[Figure: PaperBanana's five-agent pipeline — the Linear Planning Phase (Retriever → Planner → Stylist) feeding the Iterative Refinement Loop (Visualizer ⇄ Critic).]
Focus on the two-phase structure. The Linear Planning Phase (Retriever → Planner → Stylist) runs once and produces the content plan and style guidelines. The Iterative Refinement Loop (Visualizer ⇄ Critic) runs for multiple rounds until the Critic is satisfied. The key design decision is the Stylist's inductive synthesis of style guidelines from reference images, rather than manual specification.
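To make the two-phase structure concrete, here is a minimal orchestration sketch. It uses the agent functions sketched in "The Code" below (retrieve_reference_examples, extract_style_guidelines, critic_review); plan_components and visualize are hypothetical stand-ins for the Planner (an LLM) and the Visualizer (Nano-Banana-Pro), not PaperBanana's actual API.

# Minimal end-to-end orchestration sketch (assumed interfaces).
# plan_components and visualize are hypothetical stand-ins.
def generate_figure(source_context, caption, reference_database, vlm_client, max_rounds=3):
    # --- Linear Planning Phase: runs exactly once ---
    refs = retrieve_reference_examples(source_context, caption, reference_database)
    plan = plan_components(source_context, caption, refs)  # Planner (LLM)
    style = extract_style_guidelines([r["image_path"] for r in refs], vlm_client)  # Stylist

    # --- Iterative Refinement Loop: Visualizer ⇄ Critic (2-3 rounds typical) ---
    feedback = ""
    image_path = None
    for _ in range(max_rounds):
        image_path = visualize(plan, style, feedback)  # Visualizer (image generation)
        satisfied, feedback = critic_review(image_path, source_context, caption, vlm_client)
        if satisfied:
            break
    return image_path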
The Code
Snippet One: Generative Retrieval (Retriever Agent Pattern)
# PaperBanana Retriever Agent: VLM-based generative retrieval.
# This is NOT keyword search. The VLM reasons about which reference
# examples best match the current paper's domain AND diagram type.
import anthropic  # or equivalent VLM client

def retrieve_reference_examples(
    source_context: str,
    caption: str,
    reference_database: list[dict],  # each: {context, caption, image_path, domain, diagram_type}
    n_examples: int = 3,
) -> list[dict]:
    """
    Generative retrieval: the VLM selects the best reference examples
    by reasoning about structural and domain similarity.

    Why generative retrieval over embedding-based search?
    ← Embedding models compress context to a fixed-size vector,
      losing fine-grained structural information.
      A methodology diagram for a "multi-agent reasoning" paper
      structurally resembles one for a "protein folding pipeline"
      (both show stages with data flow), but their text embeddings
      differ significantly.
      The VLM can reason "this is a pipeline diagram with 4 stages"
      and match it to other pipeline diagrams regardless of topic.
    """
    # Build candidate metadata for the VLM to reason over.
    # ← We pass (context, caption) pairs, not images, for efficiency:
    #   the VLM selects based on structural description; images are retrieved after.
    candidate_metadata = [
        f"[{i}] Domain: {ref['domain']}, Type: {ref['diagram_type']}\n"
        f"Caption: {ref['caption'][:100]}...\n"
        f"Context summary: {ref['context'][:200]}..."
        for i, ref in enumerate(reference_database)
    ]
    retrieval_prompt = f"""
You are selecting reference academic diagrams to guide figure generation.

Current paper needs a figure with:
- Caption: {caption}
- Research domain: [infer from context below]
- Diagram type: [infer from context and caption]

Source context summary:
{source_context[:500]}...

Available references:
{chr(10).join(candidate_metadata)}

Select the {n_examples} references that BEST match this figure in terms of:
1. Visual diagram structure (pipeline vs. architecture vs. comparison vs. flowchart)
2. Research domain (prioritize structural match over topic similarity)
3. Complexity level (number of components, depth of hierarchy)

Return ONLY the indices of your selections, comma-separated: e.g., "2, 7, 14"
"""
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=100,
        messages=[{"role": "user", "content": retrieval_prompt}],
    )
    # Parse selected indices, discarding anything that is not a valid index.
    # ← THIS is the trick: VLM reasoning over metadata enables
    #   structure-aware retrieval without computing image embeddings.
    selected_indices = [
        int(idx.strip())
        for idx in response.content[0].text.split(",")
        if idx.strip().isdigit() and int(idx.strip()) < len(reference_database)
    ]
    return [reference_database[i] for i in selected_indices[:n_examples]]
The generative retrieval approach is the correct design for academic diagram retrieval because visual structure similarity, not topic similarity, is what matters for style and layout guidance. A "multi-agent reasoning pipeline" and a "protein folding pipeline" share diagram structure despite having nothing in common semantically.
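A hypothetical usage sketch against a toy reference database (field values are illustrative, not entries from PaperBananaBench; methodology_text and figure_caption are assumed to hold the paper's inputs):

refs = [
    {"context": "We propose a four-stage multi-agent pipeline...",
     "caption": "Overview of our coordination framework.",
     "image_path": "refs/047.png",
     "domain": "Agent & Reasoning",
     "diagram_type": "pipeline"},
    # ...remaining reference entries
]
selected = retrieve_reference_examples(
    source_context=methodology_text,  # ~3,000 words of methodology prose (hypothetical variable)
    caption=figure_caption,           # the target figure caption (hypothetical variable)
    reference_database=refs,
    n_examples=3,
)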
Snippet Two: Stylist Agent, Inductive Style Extraction
# Stylist Agent: inductive style guideline synthesis from reference images.
# The key insight: manual style definitions are always incomplete.
# The VLM can extract richer style guidelines from observing examples.
import base64

def extract_style_guidelines(
    reference_images: list[str],  # paths to reference diagram images (PNG assumed)
    vlm_client,
) -> str:
    """
    Inductive style synthesis: the VLM observes reference images and
    extracts style guidelines WITHOUT any pre-specified schema.

    Why inductive rather than deductive (manual specification)?
    ← Modern AI paper diagrams use conventions that are hard to articulate:
      - Specific blue-purple gradient color palettes
      - Sans-serif labels at exactly the right font size relative to the diagram
      - Rounded rectangle boxes vs. sharp corners for different component types
      - Specific arrow styles (solid vs. dashed, thick vs. thin, with vs. without labels)
      Manual definitions miss these nuances. The VLM extracts them by observation.
    """
    # Load and base64-encode reference images for VLM vision input.
    encoded_images = []
    for img_path in reference_images:
        with open(img_path, "rb") as f:
            img_data = base64.b64encode(f.read()).decode("utf-8")
        encoded_images.append({
            "type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": img_data},
        })

    style_prompt = """Analyze these academic diagrams from top AI papers and extract a comprehensive style guide.

For each visual element, document:
1. COLOR PALETTE: Exact color choices for backgrounds, borders, fills, text, arrows
2. TYPOGRAPHY: Font style, weight, size hierarchy (title vs. label vs. annotation)
3. SHAPES: Box styles (rounded corners, sharp, oval), shadow usage, border weight
4. ARROWS: Types used (solid, dashed, dotted), thickness, colors, label placement
5. SPACING: Padding inside boxes, gaps between elements, overall density
6. LAYOUT: Alignment conventions, grid usage, grouping patterns
7. ICONS/VISUAL ELEMENTS: Style of any icons, illustrations, or decorative elements

Return a comprehensive style guide that a designer could follow to produce a
diagram matching these examples, without seeing the examples themselves.
Be specific: hex codes, pixel measurements where visible, exact descriptors."""

    message_content = encoded_images + [{"type": "text", "text": style_prompt}]
    response = vlm_client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1500,
        messages=[{"role": "user", "content": message_content}],
    )
    # ← THIS is the trick: the returned style guide is a natural-language
    #   specification that the Visualizer agent uses as a prompt addition.
    #   No structured schema is needed: the VLM both extracts and applies guidelines.
    return response.content[0].text
# Critic agent feedback loop (round t).
def critic_review(
    generated_image_path: str,
    source_context: str,
    caption: str,
    vlm_client,
) -> tuple[bool, str]:
    """
    Returns (satisfied: bool, feedback: str).
    ← satisfied=True triggers termination of the refinement loop.
    ← feedback is passed to the Visualizer for round t+1.
    """
    with open(generated_image_path, "rb") as f:
        img_data = base64.b64encode(f.read()).decode("utf-8")

    review_prompt = f"""You are reviewing this academic diagram for publication readiness.
Compare it against the source context and caption below.

Source context (what the figure should illustrate):
{source_context[:1000]}

Figure caption: {caption}

Evaluate:
1. FAITHFULNESS: Are all key components of the method shown? Are relationships correct?
2. ACCURACY: Are labels correct? Do arrows point in the right direction?
3. COMPLETENESS: Is anything missing that the caption promises?
4. CLARITY: Are there elements that are confusing, redundant, or obscure the main message?

If the figure is publication-ready, respond with: APPROVED
If revisions are needed, respond with: REVISE
Then on the next line, provide specific, actionable feedback for improvement.
Do not comment on aesthetics, only on content correctness and completeness."""

    response = vlm_client.messages.create(
        model="claude-opus-4-6",
        max_tokens=500,
        messages=[{"role": "user", "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img_data}},
            {"type": "text", "text": review_prompt},
        ]}],
    )
    result = response.content[0].text.strip()
    satisfied = result.startswith("APPROVED")
    feedback = "" if satisfied else result
    return satisfied, feedback
The separation between style (Stylist) and content (Critic) feedback is the critical design decision. The Critic checks only content correctness and completeness. Aesthetics are fixed in the Planning Phase by the Stylist. Without this separation, the Critic's aesthetic preferences would conflict with the Stylist's guidelines across refinement rounds.
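A sketch of how this separation plays out in the Visualizer's prompt assembly. This is a hypothetical template, not PaperBanana's actual prompt: the point is that the style block is frozen after the Planning Phase and only the Critic's content feedback varies across rounds.

def build_visualizer_prompt(plan: str, style_guidelines: str, critic_feedback: str) -> str:
    # Style is fixed once by the Stylist; the Critic never touches it.
    prompt = (
        "Generate an academic methodology diagram.\n\n"
        f"CONTENT PLAN (from Planner):\n{plan}\n\n"
        f"STYLE GUIDELINES (from Stylist, fixed for all rounds):\n{style_guidelines}\n"
    )
    # Content-only Critic feedback is the sole thing that changes between rounds.
    if critic_feedback:
        prompt += f"\nREVISIONS REQUIRED (from Critic, previous round):\n{critic_feedback}\n"
    return prompt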
In Action: End-to-End Worked Example
Input: Methodology section of a multi-agent reinforcement learning paper (approximately 3,000 words), figure caption: "Overview of our framework: agents observe shared environment state, compute local policies, and coordinate via a central critic."
Step 1: Retrieval (Retriever Agent)
Reference database: 292 NeurIPS 2025 methodology diagrams
VLM analysis:
Domain inference: "multi-agent RL" → matches "Agent & Reasoning" category
Diagram type: "pipeline with central coordination" → matches architecture diagrams
Structure: hub-and-spoke with multiple input nodes → visual structure priority
Selected references (3):
[47] Multi-agent coordination paper: hub diagram, 4 agents + 1 coordinator
[112] Distributed RL paper: similar color palette, rounded box style
[203] Actor-critic framework: shows policy + value head separation
Retrieval latency: ~2-3s (VLM inference over 292 metadata entries)
Step 2: Planning (Planner Agent)
Plan P (abbreviated):
Components:
- Environment (top center, large box)
- 4 Agent nodes (left and right, satellite arrangement)
- Central Critic (bottom center, distinguished box)
- Arrows: Environment → Agents (state observation, dashed)
- Arrows: Agents → Central Critic (local policies, solid)
- Arrows: Central Critic → Agents (value estimates, dashed)
Layout: radial arrangement, agents equidistant from center
Labels: component names + short description below each
Planning latency: ~3-4s
Step 3: Style Guidelines (Stylist Agent)
Extracted guidelines G (abbreviated):
Colors: primary #4A90D9 (blue), secondary #7B68EE (medium purple),
background #F8F9FA, border #2C3E50
Typography: Helvetica Neue, 14pt component labels, 11pt sub-labels
Boxes: 12px border radius, 1.5pt border, subtle drop shadow (2px offset)
Arrows: 2pt stroke, arrowhead 8px, dashed = information flow, solid = action
Spacing: 40px gap between components, 20px padding inside boxes
Style extraction latency: ~4-5s (VLM processing 3 reference images)
Step 4: Visualization → Critique → Refinement (Iterative Loop)
Round 1:
Visualizer generates initial diagram via Nano-Banana-Pro image generation
Critic feedback: "Central Critic box is not visually distinguished from Agent boxes.
Arrows from Agents to Critic should be labeled with 'local policy'.
Missing: environment observation labels."
Result: REVISE
Round 2:
Visualizer incorporates feedback: distinguished Critic box with darker border,
added arrow labels, added observation labels
Critic review: APPROVED
Total refinement rounds: 2 (typical range: 2-3 for methodology diagrams)
Total end-to-end latency: ~45-90 seconds (Nano-Banana-Pro image gen is primary cost)
Step 5: Output quality assessment (PaperBananaBench metrics)
VLM-as-a-Judge scores vs. human reference illustration:
Faithfulness: 4.2/5 (+0.3 vs. vanilla baseline)
Conciseness: 4.5/5 (+1.4 vs. vanilla baseline — the most dramatic improvement)
Readability: 4.1/5 (+0.5 vs. baseline)
Aesthetics: 3.9/5 (+0.3 vs. baseline)
Human judge correlation with VLM-as-a-Judge: verified in paper experiments
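A minimal sketch of VLM-as-a-Judge scoring as used in Step 5, assuming the base64 import and vlm_client interface from the snippets above; the exact rubric prompts used in PaperBananaBench are not reproduced here.

DIMENSIONS = ["faithfulness", "conciseness", "readability", "aesthetics"]

def judge_figure(image_path: str, source_context: str, caption: str, vlm_client) -> dict[str, float]:
    with open(image_path, "rb") as f:
        img_data = base64.b64encode(f.read()).decode("utf-8")
    scores = {}
    for dim in DIMENSIONS:
        response = vlm_client.messages.create(
            model="claude-opus-4-6",
            max_tokens=10,
            messages=[{"role": "user", "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img_data}},
                {"type": "text", "text":
                    f"Rate the {dim} of this figure for the caption '{caption}' and the "
                    f"method described below, on a 1-5 scale. Reply with a single number.\n\n"
                    f"{source_context[:1000]}"},
            ]}],
        )
        # A production judge would validate or re-prompt on malformed output.
        scores[dim] = float(response.content[0].text.strip())
    return scores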
Why This Design Works, and What It Trades Away
The two-phase architecture (Linear Planning → Iterative Refinement) is the correct design because planning and generation are fundamentally different optimization problems. Planning requires reading comprehension (extract the key components and relationships from 3,000 words of methodology text) and structural reasoning (how should these components be arranged to communicate the method?). Generation requires producing a visual artifact that matches a specification. Mixing these into a single prompt produces diagrams that are either faithful to the content but visually poor, or visually polished but missing key components. Separating them into specialized agents, each with their own prompt, context, and objective, improves both dimensions independently.
The Stylist's inductive approach to style guideline synthesis is the most important architectural decision in the Linear Planning Phase. Manual specification of "academic diagram style" is impossible to complete. Color palettes vary by year (modern AI papers use significantly different palettes than papers from 2019-2021). Arrow conventions differ by diagram type. Icon styles have shifted toward flat design. Rather than specifying these rules by hand (which would immediately become outdated), the Stylist agent observes retrieved reference images and extracts the current stylistic conventions from examples. This is the same approach human designers use when asked to "match this style": observe the examples, extract the patterns, apply them.
The image generation approach for methodology diagrams versus the code generation approach for statistical plots is the correct split and the most practically significant design decision in the paper.
What PaperBanana trades away:
Exact data accuracy for statistical plots generated by image generation (not code). The image-generation-for-plots experiment in the paper shows a clear tradeoff: image generation produces more visually appealing plots but underperforms code-based approaches in content fidelity, specifically in accurately representing numerical values. For publication, data accuracy is non-negotiable. PaperBanana uses code generation (Python/matplotlib) for statistical plots precisely for this reason (see the sketch after this list). Image generation is reserved for methodology diagrams where content fidelity is expressed visually (component relationships, data flow directions) rather than numerically.
Multi-round refinement latency. With Nano-Banana-Pro as the image generation backbone, each Visualizer round takes 20-40 seconds. Two to three rounds totals 45-90 seconds per diagram. For a paper with 8-10 figures, this is 6-15 minutes of total generation time. This is acceptable for the final paper submission workflow but not for rapid iteration during research.
Novelty of diagram types outside the training distribution. PaperBananaBench is curated from NeurIPS 2025 papers. Diagram types common in other venues (medical imaging figures, protein structure visualizations, hardware architecture diagrams) may not be well-represented in the reference database. The retrieval quality degrades when the paper's diagram type does not have close structural analogues in the reference set.
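To illustrate the code path for plots, a minimal sketch of the split. The exec-based execution is a simplification (a real system needs a sandbox), and the prompt wording and function name are assumptions, not PaperBanana's actual API; the point is that numbers flow from the data into the figure exactly, rather than being approximated at the pixel level.

def generate_statistical_plot(data_description: str, data: dict, llm_client, out_path: str = "plot.png") -> str:
    # Ask the LLM for matplotlib CODE rather than an image: the values in
    # `data` are rendered deterministically, so numerical fidelity is exact.
    response = llm_client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1000,
        messages=[{"role": "user", "content":
            f"Write a self-contained matplotlib script that plots the data below "
            f"and saves the figure to '{out_path}'. Return only Python code.\n\n"
            f"Description: {data_description}\nData: {data}"}],
    )
    generated_code = response.content[0].text  # strip markdown fences in practice
    # WARNING: model-generated code must run in a real sandbox in production.
    exec(generated_code, {})
    return out_path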
Technical Moats
PaperBananaBench is the first benchmark for this task. 292 test cases from NeurIPS 2025, manually curated, covering diverse diagram types. The VLM-as-a-Judge evaluation framework with verified human correlation is both the evaluation infrastructure for PaperBanana and a contribution to the field. Any competing system now has a clear evaluation target. Prior to this benchmark, evaluation was anecdotal.
The reference-driven workflow encodes stylistic knowledge that is otherwise implicit. The 292 reference examples in the benchmark represent the current stylistic norms of top AI venue publications. The Stylist agent's ability to extract these norms inductively means the system automatically stays current: update the reference database with the next year's NeurIPS papers and the style guidelines update accordingly, without any manual curation.
The two-path approach (image generation for diagrams, code generation for plots) is the correct engineering decision, and competing systems that do not make this distinction will underperform on one task or the other. Methodology diagrams require spatial reasoning and visual element composition that image generation models handle well. Statistical plots require exact numerical representation that code generation handles well. Building a single-path system for both tasks accepts a degradation on one. PaperBanana accepts the engineering overhead of two paths in exchange for better overall quality.
Insights
Insight One: PaperBanana is not an automation tool for researchers who cannot draw. It is infrastructure for fully autonomous AI scientists, and the community discussion is dramatically underselling its actual implications.
The paper explicitly frames itself within the autonomous AI scientist context: "Autonomous scientific discovery is a long-standing pursuit." The gap it fills is not "helping researchers make better PowerPoints." The gap is: can a system that autonomously discovers a research result also autonomously produce the visual communication artifacts (methodology diagrams, result plots) needed to publish that result? The answer PaperBanana provides is yes, with +17% overall quality over baselines and parity or better on human-judged aesthetics. A fully autonomous research pipeline that cannot produce publication-ready figures is not actually autonomous. PaperBanana removes that dependency. The community reaction to this paper has focused on the "AI draws diagrams" framing and missed the "autonomous AI scientist completes the loop" framing.
Insight Two: The conciseness improvement (+37.2% over baselines) is not a stylistic improvement. It is a content quality improvement, and the benchmark design makes this distinction meaningful.
Conciseness in PaperBananaBench is evaluated as: are the elements in the generated diagram necessary, or are there redundant components that clutter the visual without adding information? A +37.2% improvement means baseline systems produce significantly more visual noise, components the methodology description does not require, than PaperBanana. This failure mode, visual verbosity, is not correctable by better image generation models alone. It requires the Planning Phase to explicitly reason about what is necessary before generating anything. The Planner agent's role in producing a component list before the Visualizer runs is the mechanism that drives this improvement. Direct prompting of image generation models, without the planning step, produces verbose outputs because the generation model has no explicit constraint on component count. PaperBanana's architecture makes the constraint explicit.
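A sketch of how an explicit-necessity constraint can be expressed in a Planner-style prompt (hypothetical wording, assuming source_context and caption as in the worked example; the paper's actual Planner prompt is not reproduced):

planner_prompt = f"""Extract the components needed for a methodology diagram.

Methodology text:
{source_context}

Figure caption: {caption}

Rules:
- List ONLY components the caption and methodology require. For each, justify
  in one sentence why the figure is incomplete without it.
- If two components convey the same information, merge them.
- Flag excessive component counts as a sign of visual verbosity.

Return: components, relationships (arrows with direction and label), and a layout hint.
"""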
Takeaway
PaperBanana's own methodology diagram in the paper (Figure 2) was generated by PaperBanana itself. The system is self-documenting: the illustration that explains the framework was produced by the framework.
This is not a marketing detail. It is a calibration point. If the system can produce the diagram that accurately illustrates its own architecture at publication quality, the evaluation of its outputs on PaperBananaBench is grounded in a concrete, verifiable example that readers can inspect directly. The paper's self-referential demonstration is the clearest possible argument that the system works as described. Compare this to papers that present human-designed figures to explain systems that are supposed to automate figure design: the gap between what the system can produce and what the paper presents is itself the signal.
TL;DR For Engineers
PaperBanana is a five-agent pipeline: Retriever (VLM generative retrieval over structural metadata), Planner (content plan from methodology text), Stylist (inductive style guidelines from reference images), Visualizer (image generation for diagrams, code generation for plots), Critic (self-critique loop). Linear Planning Phase runs once; Refinement Loop runs 2-3 rounds.
PaperBananaBench: 292 test cases from NeurIPS 2025, VLM-as-a-Judge across faithfulness, conciseness, readability, aesthetics. PaperBanana outperforms all baselines: +37.2% conciseness, +17% overall, +2.8% faithfulness, +12.9% readability, +6.6% aesthetics.
The Stylist's inductive style extraction from reference images, not manual specification, is the correct design for keeping style guidelines current as publication norms evolve.
Image generation (methodology diagrams): high visual quality, appropriate for spatial/relational content. Code generation (statistical plots): mandatory for data accuracy. Using image generation for plots trades fidelity for presentation. Do not do this for publication.
End-to-end latency: ~45-90 seconds per figure. Primary cost is Nano-Banana-Pro image generation. The planning phase adds ~10-15 seconds but is the mechanism for the conciseness improvement.
The Autonomous AI Scientist Now Has a Figure Department
PaperBanana closes one of the last remaining gaps in the autonomous research workflow: visual communication. A system that can discover a novel result, run experiments, analyze data, and write up findings, but cannot produce the figures needed to publish that work, is not autonomous. It is dependent on a human at the final step. PaperBanana removes that dependency. The +17% overall improvement over baselines, the self-referential Figure 2, and the PaperBananaBench benchmark are the evidence. The engineering decisions that produce those improvements, the two-phase architecture, the inductive Stylist, the separated Critic, and the image vs. code path split, are the decisions that other teams building research automation systems should study carefully.
References
PaperBanana: Automating Academic Illustration for AI Scientists, arXiv:2601.23265, Zhu, Meng, Song, Wei, Li, Pfister, Yoon (Peking University, Google Cloud AI Research), January 2026
AI Scientist v2: Towards Fully Automated Open-Ended Scientific Discovery, cited for autonomous AI scientist context
Sponsored Ad
If you enjoy practical AI insights, check out SnackOnAI and support the newsletter by subscribing, sharing, and exploring our sponsored ad — it helps us keep building and delivering value 🚀
Analytics on Live Data Without Leaving Postgres
When analytics on Postgres slows down, most teams add a second database. TimescaleDB by Tiger Data takes a different approach: extend Postgres with columnar storage and time-series primitives to run analytics on live data, no split architecture, no pipeline lag, no new query language to learn. Start building for free. No credit card required.