Smolagents: The Agent Framework That Proves JSON Tool Calling Was the Wrong Abstraction All Along


SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | May 22, 2026

The agent framework landscape converged on a pattern: the LLM selects a tool name and generates arguments as JSON, the framework parses the JSON, calls the function, and feeds the result back. This pattern works. It is also unnecessarily limited.

Consider storing the output of a generate_image action. In JSON tool-calling, you return some identifier string and hope the LLM can correctly reference it later. In Python code, you write image = generate_image(...) and use image directly in subsequent operations. The object management problem simply does not exist in code.
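The contrast above can be made concrete with a small sketch. Everything here is hypothetical and written only for illustration: FakeImage, the registry, the "img-N" handle format, and the two tool functions are assumptions, not part of smolagents or any other framework.

```python
import json

class FakeImage:
    """Hypothetical stand-in for an image object."""
    def __init__(self, prompt: str, width: int = 512):
        self.prompt = prompt
        self.width = width

    def resize(self, width: int) -> "FakeImage":
        return FakeImage(self.prompt, width)

def generate_image(prompt: str) -> FakeImage:
    return FakeImage(prompt)

# --- JSON tool calling: objects live in a registry, the LLM only sees ID strings ---
registry = {}

def generate_image_tool(prompt: str) -> str:
    handle = f"img-{len(registry)}"
    registry[handle] = generate_image(prompt)
    return handle

def resize_image_tool(image_id: str, width: int) -> str:
    handle = f"img-{len(registry)}"
    registry[handle] = registry[image_id].resize(width)
    return handle

TOOLS = {"generate_image": generate_image_tool, "resize_image": resize_image_tool}

def dispatch(llm_output: str) -> str:
    call = json.loads(llm_output)                # parse the JSON the LLM produced
    return TOOLS[call["tool"]](**call["args"])   # call the tool, return a string

first = dispatch('{"tool": "generate_image", "args": {"prompt": "a red bicycle"}}')
# The LLM must now copy the handle string exactly into its next call.
second = dispatch(json.dumps({"tool": "resize_image",
                              "args": {"image_id": first, "width": 128}}))

# --- Code action: the object is simply a variable ---
image = generate_image("a red bicycle")
thumb = image.resize(128)
```

In the JSON path, correctness depends on the model repeating the handle string without error; in the code path, the interpreter keeps the reference.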

Smolagents (Hugging Face, Apache 2.0, 27.3k stars, 2.6k forks) is the answer to "what if we took the code-actions insight seriously and built the simplest possible agent framework on top of it?" The core agents.py file is approximately 1,000 lines. The framework supports any LLM (HF Hub, OpenAI, Anthropic, or any LiteLLM-compatible endpoint), allows tools to be shared and loaded from the HF Hub, and provides E2B sandbox support for secure code execution.

This newsletter dissects smolagents as a systems design document: why code is a better action space than JSON, how the CodeAgent and ToolCallingAgent differ at the implementation level, what the multi-agent orchestration pattern looks like in practice, and when the added power of code actions creates risk that the sandboxed execution model has to address.

Scope: smolagents core architecture (CodeAgent, ToolCallingAgent, multi-agent), tool system, model integration, and the code-vs-JSON design choice. Not covered: the full smolagents ecosystem (gradio-tools, RAG integration, document processing) beyond brief mention, or deep dives into specific LLM providers.

What It Actually Does

Smolagents is a Python library for building LLM agents. Three lines to get started:

from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=HfApiModel())
agent.run("How many seconds would it take for a leopard at full speed to run through Pont des Arts?")

The agent writes Python code at each step, executes it, observes the result, and continues until the task is complete. The model is the reasoning engine. The Python runtime is the execution environment. Tools are Python functions the agent can call.

Agency is a spectrum, not a binary. Smolagents documents this explicitly:

Agency Level	Description	Pattern
☆☆☆	LLM output has no impact on flow	`process_llm_output(llm_response)`
★☆☆	LLM determines basic control flow	`if llm_decision(): path_a()`
★★☆	LLM determines function execution	`run_function(llm_chosen_tool, args)`
★★★	LLM controls iteration and continuation	`while llm_should_continue(): next_step()`
★★★	One agentic workflow starts another	`if llm_trigger(): execute_agent()`

The CodeAgent implements the multi-step pattern (`while llm_should_continue(): next_step()`). Multi-agent orchestration implements the last row, where one agentic workflow starts another. Both sit at the top of the three-star scale.

Two agent types, one framework:

CodeAgent: writes and executes Python at each step. Code can compose multiple tool calls, use loops, store intermediate results, define functions. Runs in E2B sandbox when security is required.
ToolCallingAgent: writes tool calls as JSON/text blobs. Compatible with any model that supports standard function calling. Lower capability ceiling, but no model-generated code is executed, so the risk is limited to what the tools themselves do.

The Architecture, Unpacked

The part of the architecture to focus on is the Python execution environment in CodeAgent. Variable persistence across steps is the architectural advantage that makes complex multi-step tasks tractable: the agent stores intermediate results as Python variables and references them in subsequent steps. JSON tool-calling must serialize everything to strings between steps, which limits composability.
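A minimal sketch of variable persistence, assuming each string below stands in for the code an LLM might emit at one step. It uses plain exec with a shared dictionary purely for illustration; smolagents uses its own interpreter, and exec is not a sandbox.

```python
# Each string stands in for the code a model might generate at one step.
steps = [
    "population = 13_960_000",
    "area_km2 = 2194",
    "density = population / area_km2",
]

namespace: dict = {}          # shared state that survives between steps
for code in steps:
    exec(code, namespace)     # illustration only; plain exec is not a sandbox

# Step 3 used names defined in steps 1 and 2 without any string round-trip.
print(round(namespace["density"]))  # → 6363
```

The third step never re-reads the earlier values from text; it refers to them by name, and they keep their Python types.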

The Code, Annotated

Snippet One: CodeAgent vs ToolCallingAgent on the Same Task

# Task: find the current population of Tokyo and compute how many people
# that is per square kilometer (area: 2,194 km²)
# This demonstrates why code actions outperform JSON tool-calling for multi-step tasks

from smolagents import CodeAgent, ToolCallingAgent, DuckDuckGoSearchTool, HfApiModel

model = HfApiModel()
tools = [DuckDuckGoSearchTool()]

# ─── CodeAgent approach ──────────────────────────────────────────────────────
code_agent = CodeAgent(tools=tools, model=model)

# The agent generates Python code like this at each step:
#
# Step 1 (LLM generates):
#   population_result = web_search("Tokyo population 2026")
#   print(population_result)
#
# Step 2 (LLM generates, using the variable from step 1):
#   population = 13_960_000  # extracted from search result
#   area_km2 = 2194
#   density = population / area_km2
#   # ← THIS is the trick: the variable 'population_result' from step 1
#   #   is still in scope. The agent can use it directly without
#   #   re-serializing to string and re-parsing.
#   print(f"Population density: {density:.0f} people/km²")

# ─── ToolCallingAgent approach ───────────────────────────────────────────────
tool_agent = ToolCallingAgent(tools=tools, model=model)

# The agent generates JSON like this at each step:
#
# Step 1 (LLM generates):
#   {"tool": "web_search", "args": {"query": "Tokyo population 2026"}}
#   → framework calls web_search, returns string
#   → string added to memory: "The population of Tokyo is approximately 13.96 million"
#
# Step 2 (LLM generates):
#   {"tool": "calculator", "args": {"expression": "13960000 / 2194"}}
#   ← The LLM had to re-parse "13.96 million" from the string result
#   ← No variable persistence: the intermediate value doesn't exist as a number
#   ← If the LLM misreads "13.96 million" as 13.96, the result is wrong
#   ← No composability: can't do complex transformations without a calculator tool

# CODE WINS because:
# 1. Python holds population as 13_960_000 (an integer), not a string
# 2. The division runs in the Python runtime, so no calculator tool is needed
# 3. Intermediate results are typed Python objects, not strings in LLM memory

The population example is intentionally simple to make the architectural difference visible. In real multi-step tasks (data analysis, file manipulation, API chaining), the advantage compounds: code agents can write loops, catch exceptions, define helper functions, and compose tool outputs in ways that JSON agents literally cannot express.

Snippet Two: Tool Definition, Multi-Agent Orchestration, and Secure Execution

# smolagents tool system and multi-agent patterns
# Source: smolagents documentation and README
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel, tool

# ─── Tool definition ──────────────────────────────────────────────────────────
# Tools are Python functions with a @tool decorator.
# The decorator extracts name, description, and type hints for the LLM's context.
# ← THIS is the design choice: no separate schema definition, no JSON schema files
#   The function signature IS the tool specification

@tool
def get_stock_price(ticker: str) -> float:
    """
    Retrieves the current stock price for a given ticker symbol.

    Args:
        ticker: The stock ticker symbol (e.g., 'AAPL', 'GOOGL')

    Returns:
        The current stock price in USD
    """
    # In production: call a real stock API
    return 175.42  # example

@tool
def calculate_portfolio_value(ticker: str, shares: int, price: float) -> dict:
    """
    Calculates the total value of a stock position.

    Args:
        ticker: Stock ticker symbol
        shares: Number of shares held
        price: Current price per share

    Returns:
        Dictionary with ticker, total_value, and shares fields
    """
    # ← Functions can return complex objects (dict, list, even images)
    # JSON tool-calling agents must serialize these to strings and re-parse later
    # CodeAgent stores the returned dict directly in the Python namespace
    return {"ticker": ticker, "total_value": shares * price, "shares": shares}

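# Tool loading from HF Hub (shareable tools):
# from smolagents import load_tool
# weather_tool = load_tool("m-ric/smolagents-weather-tool", trust_remote_code=True)
# ← Any tool pushed to HF Hub is loadable with one line; trust_remote_code=True
#   acknowledges that the downloaded tool code will run on your machine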

# ─── Multi-agent orchestration ────────────────────────────────────────────────
# The manager agent uses subagents as tools.
# A subagent can be any agent wrapped to accept a string and return a string.

stock_analyst = CodeAgent(
    tools=[get_stock_price, calculate_portfolio_value],
    model=HfApiModel(),
    name="stock_analyst",
    description="Analyzes stock prices and calculates portfolio values",
)

web_researcher = CodeAgent(
    tools=[DuckDuckGoSearchTool()],
    model=HfApiModel(),
    name="web_researcher",
    description="Searches the web for financial news and company information",
)

# ← Multi-agent: the manager calls subagents by name, the same way it calls tools
# The manager writes code like:
#   analysis = stock_analyst("What is the value of 100 AAPL shares?")
#   news = web_researcher("Find recent AAPL earnings news")
#   print(f"Analysis: {analysis}\nContext: {news}")
manager_agent = CodeAgent(
    tools=[],
    model=HfApiModel(),
    managed_agents=[stock_analyst, web_researcher],  # ← subagents are registered here
)

# ─── Secure execution with E2B ────────────────────────────────────────────────
# When code execution must be sandboxed (untrusted tool code, user inputs):

# ← E2B runs the agent's generated code in a remote cloud sandbox
# The agent's Python code runs in an isolated environment, not the host process
# Critical for production deployments where agent-generated code could be malicious
# Requires the e2b extra (pip install "smolagents[e2b]") and an E2B API key;
# older releases exposed this option as use_e2b_executor=True
secure_agent = CodeAgent(
    tools=[get_stock_price],
    model=HfApiModel(),
    executor_type="e2b",  # ← generated code executes in the E2B cloud sandbox
)

The @tool decorator's design is intentional minimalism: the function docstring becomes the tool description the LLM reads, and the type hints become the argument schema. There is no separate schema definition step. This is why the entire tool system integrates naturally with existing Python functions: any documented function with type hints is one decorator away from being an agent tool.
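To illustrate the idea that the signature is the specification, here is a minimal sketch of deriving a tool spec from a plain function with the standard library. The describe_tool helper and the shape of the returned dictionary are assumptions made for this example; they are not how the @tool decorator is implemented.

```python
import inspect
from typing import get_type_hints

def describe_tool(func) -> dict:
    """Build a tool spec from a function's signature and docstring (hypothetical helper)."""
    hints = get_type_hints(func)
    return_type = hints.pop("return", None)
    doc = (inspect.getdoc(func) or "").strip()
    return {
        "name": func.__name__,
        "description": doc.split("\n")[0],                      # first docstring line
        "inputs": {name: t.__name__ for name, t in hints.items()},
        "output": return_type.__name__ if return_type else None,
    }

def get_stock_price(ticker: str) -> float:
    """Retrieves the current stock price for a given ticker symbol."""
    return 175.42  # example value

spec = describe_tool(get_stock_price)
print(spec)
```

Nothing outside the function itself is consulted, which is why an existing documented, type-hinted function needs no extra schema file.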

It In Action: End-to-End Worked Example

Task: "Find the top 3 Python libraries by GitHub stars released in 2025, and generate a bar chart comparing them."

Agent: CodeAgent with DuckDuckGoSearchTool, with matplotlib permitted through additional_authorized_imports

Step 1 (LLM generates code):

# Agent-generated code, Step 1
results = web_search("top Python libraries GitHub stars released 2025")
print(results)

Observation: "Based on GitHub trending data: 1) uv (astral-sh/uv): 38,200 stars,
2) ruff (astral-sh/ruff): 34,100 stars, 3) marimo: 10,800 stars"

Step 2 (LLM generates code, using step 1 result):

# Agent-generated code, Step 2
# ← Variable 'results' from step 1 is still in scope
# LLM parses the output and creates structured data in Python

libraries = [
    {"name": "uv", "stars": 38200},
    {"name": "ruff", "stars": 34100},
    {"name": "marimo", "stars": 10800},
]

# Verify: print the data structure (agent can inspect intermediate state)
for lib in libraries:
    print(f"{lib['name']}: {lib['stars']:,} stars")

Observation:
uv: 38,200 stars
ruff: 34,100 stars
marimo: 10,800 stars

Step 3 (LLM generates code, composes everything):

# Agent-generated code, Step 3
import matplotlib.pyplot as plt

names = [lib["name"] for lib in libraries]  # ← uses 'libraries' from step 2
stars = [lib["stars"] for lib in libraries]

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(names, stars, color=["#ff6b35", "#004e89", "#1a936f"])
ax.set_title("Top Python Libraries by GitHub Stars (2025)", fontsize=14)
ax.set_ylabel("Stars")

for bar, count in zip(bars, stars):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 200,
            f"{count:,}", ha="center", fontsize=11)

plt.tight_layout()
plt.savefig("python_stars_2026.png", dpi=150)
# ← The agent saves the chart as a file
# ← A JSON agent would have needed a separate charting tool with specific parameters
# ← Here: the agent uses the full matplotlib API, no charting tool needed
print("Chart saved to python_stars_2026.png")

Observation: Chart saved to python_stars_2026.png
Final answer: Created bar chart comparing uv (38,200 stars), ruff (34,100 stars),
and marimo (10,800 stars). Chart saved to python_stars_2026.png.

Token usage (approximate):

Step 1: 800 tokens (system prompt + task + generated code)
Step 2: 1,400 tokens (added step 1 code + observation)
Step 3: 2,100 tokens (added step 2 code + observation)
Total: ~4,300 tokens
Execution time: ~12 seconds (search latency dominates)

The same task with a ToolCallingAgent would require a pre-built charting tool with a specific API ({"tool": "create_bar_chart", "args": {"labels": [...], "values": [...], "title": "..."}}). If you need a legend, colors, or any customization not in the tool's parameter schema, you cannot express it.
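The schema limit can be sketched in a few lines. The create_bar_chart tool and its three-argument schema are hypothetical, taken from the example call above; the validate_call helper is an assumption for illustration.

```python
# Hypothetical fixed schema for the create_bar_chart tool from the example above.
ALLOWED_ARGS = {"labels", "values", "title"}

def validate_call(call: dict) -> list:
    """Return the argument names that the fixed schema cannot express."""
    return sorted(set(call["args"]) - ALLOWED_ARGS)

call = {
    "tool": "create_bar_chart",
    "args": {
        "labels": ["uv", "ruff"],
        "values": [38200, 34100],
        "title": "Stars",
        "legend": True,                      # not in the schema
        "colors": ["#ff6b35", "#004e89"],    # not in the schema
    },
}

unsupported = validate_call(call)
print(unsupported)  # → ['colors', 'legend']
```

Supporting the two rejected arguments means changing and redeploying the tool, whereas the code agent in Step 3 simply passed color= to ax.bar.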

Why This Design Works, and What It Trades Away

The code-actions thesis has research backing. The paper "Executable Code Actions Elicit Better LLM Agents" (cited in the smolagents blog post) provides systematic evidence that LLMs generate better agent behavior when actions are expressed as code rather than JSON tool calls. The reasoning is straightforward: LLMs are trained on enormous quantities of code. Python expressions are part of their training distribution. JSON schemas for tool calls are much more limited and domain-specific.

The ~1,000 line agents.py implementation is the correct architecture for a general-purpose agent library. More abstract frameworks add complexity that does not pay for itself: custom DSLs for defining agent workflows, complex state machines for multi-step planning, elaborate memory systems. Smolagents' answer is: the LLM is the state machine. The Python execution environment is the memory. The loop is the workflow. These are already well-understood and well-tested abstractions.

The agency spectrum is the correct mental model for deciding when to use agents at all. The smolagents blog is explicit: "For the sake of simplicity and robustness, it's advised to regularize towards not using any agentic behaviour." If a deterministic workflow handles 95% of cases, code it deterministically. Agents are for cases where the workflow cannot be predetermined.

What smolagents trades away:

Security surface. Code execution is inherently more powerful and more dangerous than JSON tool dispatch. Without a sandbox, a CodeAgent that is given a malicious or adversarial task runs model-generated Python inside the host process. The local interpreter restricts which modules can be imported, but that is a guardrail rather than an isolation boundary. The E2B sandbox addresses this for production deployments, but adds latency (network round-trip to E2B) and cost. Teams deploying CodeAgent without sandboxing must trust their prompt and tool design to prevent unintended execution.

Determinism. CodeAgent produces different Python code on every run for the same task. The code is functional (it completes the task) but not identical. Testing, debugging, and auditing agent behavior is harder than auditing a deterministic pipeline. JSON tool-calling agents are more reproducible because the tool dispatch is explicit and parseable.

Model capability threshold. Writing useful Python requires a capable model. For ToolCallingAgent, a weaker model can still select the right tool and fill in argument values. For CodeAgent, the model must write syntactically correct, semantically appropriate Python that composes multiple tool calls, handles results correctly, and terminates cleanly. Smaller or weaker models fail at this much more often.
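Two of the trade-offs above, the security surface and the model capability threshold, both point toward inspecting generated code before running it. The sketch below is one hedged way to do that with the standard ast module. The preflight function, the allow-list contents, and the error message format are assumptions for illustration; a static check like this is easy to bypass and does not replace a sandbox.

```python
import ast
from typing import Optional

AUTHORIZED_IMPORTS = {"math", "statistics"}  # hypothetical allow-list

def preflight(code: str) -> Optional[str]:
    """Return an error message to feed back to the model, or None if the code may run."""
    try:
        tree = ast.parse(code)                      # catches syntactically invalid output
    except SyntaxError as err:
        return f"SyntaxError on line {err.lineno}: {err.msg}"
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module.split(".")[0])
    blocked = sorted(imported - AUTHORIZED_IMPORTS)
    if blocked:
        return f"Import not allowed: {', '.join(blocked)}"
    return None

ok = preflight("import math\nprint(math.pi)")
denied = preflight("import os\nfrom subprocess import run")
broken = preflight("x = (1")
print(ok, denied, broken, sep="\n")
```

Returning the message instead of raising lets the agent loop append it to memory as an observation, so the model gets a chance to correct its own code on the next step.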

Technical Moats

The HF Hub tool ecosystem. The push_to_hub() / load_tool() integration makes every smolagents tool a shareable artifact. Any developer who builds a useful tool can publish it to HF Hub, and any other smolagents user can load it in one line. This is the same network effect that made HF Hub the dominant model repository. At 27k stars and growing, the tool ecosystem is building momentum.

The ~1,000 line core. Frameworks that stay minimal survive. Frameworks that grow to 50,000 lines of custom abstractions become maintenance burdens that are eventually replaced by something simpler. Smolagents' explicit design philosophy of keeping core code minimal is both a practical advantage (easy to understand, easy to contribute to) and a signal about the framework's long-term defensibility. The community can read and understand the entire framework in an afternoon.

The LiteLLM integration. Supporting any LLM through a single model parameter removes the provider lock-in that plagues most agent frameworks. A team that starts with HF Inference API can migrate to OpenAI or Anthropic by changing one constructor argument, with zero changes to agent logic or tool code.

Insights

Insight One: The code-vs-JSON debate is not primarily about the agent's capabilities. It is about the ceiling on what tasks an agent can express. JSON tool-calling has a hard ceiling defined by the tool schema. Code has no ceiling except the Python language itself.

A JSON agent can only do what its tools can do, expressed in the parameters those tools accept. A CodeAgent can compose tools in loops, define intermediate computations, handle exceptions, branch on results, and call any Python library that is importable in its execution environment. The JSON agent ceiling is the tool API surface. The CodeAgent ceiling is Python. For simple tasks, this difference is irrelevant. For complex tasks (data analysis, multi-API orchestration, conditional workflows), the difference is the difference between possible and impossible.
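As a rough illustration of what a single code action can express, the sketch below loops over inputs, handles a failing tool call, and branches on the results. The stub get_stock_price, its prices, and the holdings are made up for this example.

```python
# Stub tool standing in for a real API call; names and prices are made up.
def get_stock_price(ticker: str) -> float:
    prices = {"AAPL": 175.42, "GOOGL": 140.10}
    if ticker not in prices:
        raise KeyError(f"unknown ticker: {ticker}")
    return prices[ticker]

# What one code action could look like: a loop, error handling, and a branch.
holdings = {"AAPL": 100, "GOOGL": 50, "XXXX": 10}
values, failures = {}, []
for ticker, shares in holdings.items():
    try:
        values[ticker] = shares * get_stock_price(ticker)
    except KeyError:
        failures.append(ticker)       # keep going instead of aborting the step

largest = max(values, key=values.get) if values else None
print(largest, failures)
```

With JSON tool calls, the same work would take one model round-trip per ticker, plus another to compare the results.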

Insight Two: Most production agent deployments should be ToolCallingAgent, not CodeAgent, and most developers deploying smolagents for the first time will do the opposite.

The excitement around code actions (and it is justified for the right use cases) creates a framing problem: CodeAgent sounds more capable, therefore use CodeAgent. The correct framing is: CodeAgent is more powerful and more dangerous. Use CodeAgent when the task genuinely requires composing tools in ways that cannot be expressed as a sequence of tool calls with fixed schemas. Use ToolCallingAgent when the task involves calling a fixed set of tools with well-defined inputs and outputs. In production, the vast majority of agent tasks are the latter. The framework makes ToolCallingAgent just as easy to use, but the marketing gravity of "agents that think in code" pulls practitioners toward the more powerful, harder-to-control option.

Takeaway

The multi-step agent loop in smolagents reduces to exactly four lines of pseudocode, and this reduction is intentional, not accidental. It is the entire architecture.

memory = [user_defined_task]
while llm_should_continue(memory):
    action = llm_get_next_action(memory)
    memory += [action, execute_action(action)]

The smolagents blog publishes this explicitly as the canonical description of multi-step agents. Most agent frameworks implement this same loop with varying degrees of abstraction: LangChain's chains, AutoGPT's planning loop, CrewAI's crew execution, and OpenAI's Assistants API all follow the pattern. What makes smolagents notable is not that it implements the pattern differently, but that it implements it in a very small core while supporting many model backends. The design philosophy is: if you understand those four lines, you understand the framework. Everything else is tool integration and LLM adapter plumbing.
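The four-line loop can be run as-is once its three helper functions exist. In the sketch below they are scripted stand-ins: SCRIPT replaces the model, and execute_action uses plain exec with a shared dictionary. All of these are assumptions for illustration, not smolagents internals.

```python
# Scripted stand-in for the LLM; a real agent would call a model here.
SCRIPT = ["x = 6 * 7", "final_answer = x"]

def steps_taken(memory: list) -> int:
    # memory holds the task plus two entries (action, observation) per step
    return (len(memory) - 1) // 2

def llm_should_continue(memory: list) -> bool:
    return steps_taken(memory) < len(SCRIPT)

def llm_get_next_action(memory: list) -> str:
    return SCRIPT[steps_taken(memory)]

namespace: dict = {}

def execute_action(action: str) -> str:
    exec(action, namespace)           # illustration only; not a sandbox
    return f"ok: {action}"

# The loop itself, unchanged from the pseudocode.
user_defined_task = "compute six times seven"
memory = [user_defined_task]
while llm_should_continue(memory):
    action = llm_get_next_action(memory)
    memory += [action, execute_action(action)]

print(namespace["final_answer"])  # → 42
```

Swapping the scripted functions for real model calls and a real executor changes the helpers, not the loop.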

TL;DR For Engineers

Smolagents (HF, Apache 2.0, 27.3k stars, ~1,000 line core) is an agent framework with a clear thesis: Python code is a better action language than JSON. CodeAgent writes and executes Python at each step. ToolCallingAgent writes JSON tool calls. Both share the same four-line agent loop.
Code wins for multi-step tasks because of variable persistence (intermediate results are typed Python objects, not strings), composability (nest function calls, write loops), and generality (anything Python can do). JSON wins for simplicity, reproducibility, and safety.
E2B sandbox for secure code execution. LiteLLM integration for any LLM. HF Hub for shareable tools. push_to_hub() / load_tool() in one line each.
Multi-agent: any agent with a name and description can be handed to a manager agent, which calls it like a tool. The manager writes analysis = stock_analyst("...") in its Python code. Each subagent has independent memory; results pass as strings.
Use ToolCallingAgent for production workflows with fixed tool APIs. Use CodeAgent for tasks requiring composition, loops, or Python libraries not expressible in tool schemas. Most production deployments should default to the former.

The Four Lines Are the Architecture

Smolagents made a bet: keep the core as close to the four-line agent loop as possible, implement code actions because the research says they are better, and trust the HF ecosystem to provide the tools and models. The 27k stars suggest the bet is paying off. The framework is readable, the design is defensible, and the code-actions thesis has empirical backing.

The competition (LangChain, AutoGPT, CrewAI) has more features, more integrations, and more complexity. Smolagents has clarity. For teams who know what they are building and want to own their agent logic, clarity wins.

References

Smolagents GitHub Repository, 27.3k stars, Apache 2.0
Introducing smolagents: Hugging Face Blog, Aymeric Roucher, Merve Noyan, Thomas Wolf, December 2024
Smolagents Documentation
Executable Code Actions Elicit Better LLM Agents, arXiv:2402.01030 — the primary research backing the code-actions design choice
ReAct: Synergizing Reasoning and Acting in Language Models, arXiv:2210.03629 — the foundational ReAct paper that smolagents builds on
Toolformer: Language Models Can Teach Themselves to Use Tools, arXiv:2302.04761
Tree of Thoughts: Deliberate Problem Solving with Large Language Models, arXiv:2305.10601
E2B Code Interpreter — the sandbox execution environment for secure CodeAgent deployment
LiteLLM — the multi-provider LLM integration layer smolagents uses

