This newsletter dissects the specific attribute schema, the MLflow OTel integration architecture, the tradeoffs in the five-layer observability stack, and what it means that Google Cloud, AWS, Azure, Datadog, and MLflow all accept the same span format.

SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | June 28, 2026

The LLM observability market has fragmented badly: LangSmith, Langfuse, Helicone, Arize, Weave, AgentOps, BrainTrust, Phoenix, Honeycomb, Datadog LLM Observability, and MLflow all capture LLM traces. All of them have slightly different data models, different attribute names for the same concepts, and different APIs for ingesting spans. Teams that switch from LangChain to DSPy or from OpenAI to Anthropic find their traces incompatible with their existing dashboards.

OpenTelemetry's GenAI Semantic Conventions, maintained by the OpenTelemetry GenAI Special Interest Group (SIG) under the CNCF, solve this at the protocol level. Instead of each vendor inventing total_tokens vs usage.total vs llm.token.count, the standard defines gen_ai.usage.input_tokens and gen_ai.usage.output_tokens. Instead of each framework exporting model vs model_name vs llm.model, the standard defines gen_ai.request.model. When all instrumentation agrees on the schema, any compliant backend can display any compliant trace.
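To make the naming problem concrete, here is a minimal sketch of the normalization that every dashboard has to do when instrumentation does not agree on a schema. The alias table and the two example payloads are hypothetical; only the gen_ai.* target names come from the conventions described above.

```python
# Hypothetical alias table: left-hand keys are vendor-specific spellings,
# right-hand values are the GenAI semconv attribute names.
ALIASES = {
    "model": "gen_ai.request.model",
    "model_name": "gen_ai.request.model",
    "llm.model": "gen_ai.request.model",
    "prompt_tokens": "gen_ai.usage.input_tokens",
    "completion_tokens": "gen_ai.usage.output_tokens",
}


def normalize_attributes(raw: dict) -> dict:
    """Rename known vendor-specific keys to gen_ai.* names; pass other keys through."""
    return {ALIASES.get(key, key): value for key, value in raw.items()}


# Two made-up payloads that describe the same kind of call with different key names.
vendor_a = {"model_name": "gpt-4o", "prompt_tokens": 120, "completion_tokens": 35}
vendor_b = {"llm.model": "gpt-4o", "prompt_tokens": 250, "completion_tokens": 85}

normalized = [normalize_attributes(vendor_a), normalize_attributes(vendor_b)]
# After normalization both records expose the same set of keys.
same_schema = set(normalized[0]) == set(normalized[1])
```

When every producer emits the gen_ai.* names directly, this mapping layer disappears, which is the point of standardizing at the instrumentation level.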

MLflow 3.6.0 is the most significant adoption of this standard by a major ML platform. The integration is bidirectional: MLflow can ingest any OTel GenAI semconv trace from external tools, and MLflow can export its own traces in the OTel GenAI semconv format for consumption by any OTLP-compatible backend. The OTLP endpoint at /v1/traces means MLflow is now a first-class OTel collector target, not just an MLflow-specific sink.

The research framing: AI Observability for LLM Systems (arXiv:2604.26152, April 2026) defines a five-layer observability taxonomy covering the full stack from GPU kernels to model internals. AgentOps (arXiv:2411.05285, Dong et al., Nov 2024) provides the artifact taxonomy for what needs to be traced throughout an agent's full lifecycle. Both papers converge on the same conclusion as the OTel GenAI SIG: the defining open problem is not any individual layer of observability but the integration challenge connecting model-level signals with infrastructure-level signals into coherent operational intelligence.

Scope: MLflow's OTel architecture (OTLP endpoint, dual export, semconv attribute mapping), the GenAI Semantic Conventions attribute schema, the five-layer observability taxonomy from arXiv:2604.26152, and the AgentOps artifact taxonomy from arXiv:2411.05285. Not covered: MLflow's full experiment tracking and model registry features, or any evaluation/quality monitoring beyond observability infrastructure.

What It Actually Does

MLflow's OTel integration has three distinct modes depending on your use case:

Mode 1: MLflow as an OTel ingestion target. Any OTel-instrumented application (in any language: Java, Go, Rust, Python) sends traces to MLflow's OTLP endpoint. MLflow stores and displays them as first-class traces.

Mode 2: MLflow exporting to OTel backends. MLflow instruments your Python/TypeScript code and exports traces to Datadog, Grafana Tempo, Jaeger, AWS X-Ray, or any OTLP-compatible backend in the standard gen_ai.* format.

Mode 3: Dual export. MLflow stores traces in its own format for the MLflow UI and simultaneously exports in OTel GenAI semconv format to an external collector. One trace, two destinations, zero duplication in your code.

The OTel GenAI Semantic Convention attribute schema (the actual field names):

| Attribute | Type | What it captures |
| --- | --- | --- |
| gen_ai.operation.name | string | "chat", "text_completion", "embeddings" |
| gen_ai.request.model | string | "gpt-4o", "claude-3-5-sonnet" |
| gen_ai.usage.input_tokens | int | prompt token count |
| gen_ai.usage.output_tokens | int | completion token count |
| gen_ai.input.messages | JSON string | full prompt messages (opt-in, privacy) |
| gen_ai.output.messages | JSON string | full completion messages (opt-in, privacy) |
| gen_ai.system_instructions | string | system prompt (opt-in) |
| gen_ai.response.finish_reasons | string[] | "stop", "tool_calls", "length" |

Note: prompt content is not captured by default. Only metadata (model, token counts, latency, finish reasons) is captured unless you explicitly opt in. This is a deliberate privacy design choice in the OTel spec.
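A minimal sketch of that metadata-versus-content split, assuming a plain dictionary of span attributes. The helper name and the capture_content flag are hypothetical, not part of MLflow or the OTel SDK; the three content attribute names are the opt-in ones from the table.

```python
# Attributes that carry prompt or completion text (the opt-in rows above).
CONTENT_ATTRIBUTES = frozenset({
    "gen_ai.input.messages",
    "gen_ai.output.messages",
    "gen_ai.system_instructions",
})


def filter_span_attributes(attributes: dict, capture_content: bool = False) -> dict:
    """Drop content-bearing attributes unless the operator explicitly opts in."""
    if capture_content:
        return dict(attributes)
    return {k: v for k, v in attributes.items() if k not in CONTENT_ATTRIBUTES}


span_attributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.usage.input_tokens": 120,
    "gen_ai.input.messages": '[{"role": "user", "content": "my address is ..."}]',
}

default_view = filter_span_attributes(span_attributes)            # metadata only
debug_view = filter_span_attributes(span_attributes, capture_content=True)
```

The default path keeps everything a cost or latency dashboard needs while leaving the prompt text out of the exported span.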

The Architecture, Unpacked

Focus on the dual export path: one trace producing two parallel streams with different attribute schemas. This is the mechanism that makes "write once, observe everywhere" work in practice. Your MLflow UI shows the trace with its full MLflow rendering; your Datadog or Grafana backend receives the same trace in the vendor-neutral gen_ai.* format.

The Code, Annotated

Snippet One: Auto-Tracing and Dual Export Configuration

# MLflow OTel integration: minimal setup for full LLM observability
# Source: mlflow/mlflow docs and mlflow.org/blog/opentelemetry-tracing-support/
# This shows the three ways spans enter the system and how dual export works

import os
import mlflow
from openai import OpenAI

# ─── SETUP: CONFIGURE DUAL EXPORT ─────────────────────────────────────────────
# Default: all traces go to MLflow Tracking Server only
# With dual export: traces go to MLflow AND a standard OTel collector

os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"] = "http://collector:4318/v1/traces"
# ← This tells MLflow where to send the OTel copy of traces
# The collector can be Datadog Agent, Grafana Agent, Jaeger, AWS ADOT, etc.

os.environ["MLFLOW_TRACE_ENABLE_OTLP_DUAL_EXPORT"] = "true"
# ← THIS is the trick: without this, OTLP export REPLACES MLflow storage
#   With this set, MLflow stores internally AND exports via OTLP simultaneously
#   Your MLflow UI works normally; your OTel backend gets the same data

os.environ["MLFLOW_ENABLE_OTEL_GENAI_SEMCONV"] = "true"
# ← Exports traces in gen_ai.* format (OTel GenAI standard) instead of mlflow.* format
# ← Required for backends that speak OTel GenAI semconv (Datadog, Grafana, etc.)
# ← Without this: MLflow exports its own mlflow.* attribute format
#   which Datadog and others may not render correctly

# MLflow server connection
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("production-llm-pipeline")

# ─── AUTO-TRACING: one line instruments all OpenAI calls ──────────────────────
mlflow.openai.autolog()
# ← This patches the OpenAI client so every chat.completions.create() call
#   automatically becomes a span with:
#     gen_ai.operation.name = "chat"
#     gen_ai.request.model = "gpt-4o"
#     gen_ai.usage.input_tokens = <actual count from API response>
#     gen_ai.usage.output_tokens = <actual count>
#     span.duration = measured wall-clock latency
#   No changes to your OpenAI call code required.

client = OpenAI()

# ─── WRAPPING MULTI-STEP LOGIC ────────────────────────────────────────────────
@mlflow.trace(span_type="AGENT")
# ← Creates a parent span named "research_pipeline" of type AGENT
# ← Every OpenAI call inside this function becomes a CHILD span
# ← The resulting tree: [research_pipeline] → [chat: generate_query] → [chat: summarize]
def research_pipeline(question: str) -> str:
    # Child span 1: query generation
    query_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Generate a search query for the given question"},
            {"role": "user", "content": question}
        ]
    )
    query = query_response.choices[0].message.content
    # ← This call is auto-traced: gen_ai.usage.input_tokens captured from response.usage

    # Child span 2: synthesis
    summary_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Summarize search results concisely"},
            {"role": "user", "content": f"Query: {query}\nResults: [simulated results]"}
        ]
    )
    return summary_response.choices[0].message.content

result = research_pipeline("What is the latest research on protein folding?")
# Trace structure created:
# [AGENT: research_pipeline] (100ms total)
#   ├── [LLM: gpt-4o chat] (45ms, 120 input tokens, 35 output tokens)
#   └── [LLM: gpt-4o chat] (55ms, 250 input tokens, 85 output tokens)
# Simultaneously exported to OTel collector in gen_ai.* format

The MLFLOW_TRACE_ENABLE_OTLP_DUAL_EXPORT=true flag is the operational key. Without it, configuring an OTLP endpoint makes MLflow stop writing to its own storage. This is a footgun that breaks the MLflow UI for teams that try to add OTel export. With dual export, you get both, and the attribute translation from mlflow.* to gen_ai.* happens at export time so the MLflow internal representation is unchanged.
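One way to guard against that footgun is a startup check that inspects the three environment variables used in Snippet One. This is a sketch under the behavior described above; check_export_config is a hypothetical helper, not an MLflow API, and the messages are illustrative.

```python
import os


def check_export_config(env=None) -> list:
    """Return human-readable warnings for risky combinations of export settings."""
    env = os.environ if env is None else env
    endpoint = env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT", "")
    dual = env.get("MLFLOW_TRACE_ENABLE_OTLP_DUAL_EXPORT", "").lower() == "true"
    semconv = env.get("MLFLOW_ENABLE_OTEL_GENAI_SEMCONV", "").lower() == "true"

    problems = []
    if endpoint and not dual:
        problems.append("OTLP endpoint set without dual export: MLflow storage may be bypassed")
    if endpoint and not semconv:
        problems.append("OTLP endpoint set without GenAI semconv: backend receives mlflow.* attributes")
    if dual and not endpoint:
        problems.append("dual export enabled but no OTLP endpoint configured")
    return problems


# Example: endpoint configured, but neither flag set.
problems = check_export_config(
    {"OTEL_EXPORTER_OTLP_TRACES_ENDPOINT": "http://collector:4318/v1/traces"}
)
for problem in problems:
    print("WARNING:", problem)
```

Running this once at process start turns a silent UI regression into a visible warning.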

Snippet Two: Native OTel Instrumentation and the GenAI Semconv Schema

# Native OTel instrumentation with GenAI Semantic Conventions
# Source: mlflow.org docs + Databricks OTel span attribute docs
# Use this when: auto-tracing doesn't cover your framework, or you need custom spans

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
import json
import os    # needed for the os.getenv() content-logging check below
import time

# ─── SETUP: Direct OTel instrumentation pointing at MLflow ────────────────────
# This is the path for: non-Python languages (Java/Go/Rust), custom frameworks,
# or applications that already have OTel instrumentation for HTTP/DB/etc.

provider = TracerProvider()
exporter = OTLPSpanExporter(
    endpoint="http://mlflow-server:5000/v1/traces",
    headers={"x-mlflow-experiment-id": "123456789"}
    # ← x-mlflow-experiment-id routes this trace to the correct MLflow experiment
    # ← This header is how MLflow knows which experiment a foreign OTel trace belongs to
)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-llm-agent")

def trace_llm_call(messages: list[dict], model: str = "gpt-4o") -> dict:
    """
    Example of manually setting OTel GenAI Semantic Convention attributes.
    This is what auto-tracing does for you automatically.
    Use this pattern for: custom LLM wrappers, non-OpenAI providers,
    or when you need to add business-specific attributes alongside the standard ones.
    """
    with tracer.start_as_current_span("llm-call") as span:
        # ← Required OTel GenAI semconv attributes (metadata, always safe to log)
        span.set_attribute("gen_ai.operation.name", "chat")     # type of operation
        span.set_attribute("gen_ai.request.model", model)       # model identifier
        span.set_attribute("gen_ai.provider.name", "openai")    # provider name (older semconv versions used gen_ai.system)

        start = time.time()
        # ... your LLM call here ...
        response = {"content": "...", "usage": {"prompt_tokens": 120, "completion_tokens": 45}}
        latency_ms = (time.time() - start) * 1000

        # ← Token counts: these ARE always safe to log (no PII)
        span.set_attribute("gen_ai.usage.input_tokens",  response["usage"]["prompt_tokens"])
        span.set_attribute("gen_ai.usage.output_tokens", response["usage"]["completion_tokens"])
        # ← These flow to dashboards: "average input tokens per call", "total cost per day"

        # ← Content logging is OPT-IN (privacy risk: prompts may contain PII)
        # Only enable in dev/debug, never by default in production
        if os.getenv("MLFLOW_ENABLE_CONTENT_TRACING"):
            span.set_attribute("gen_ai.input.messages",  json.dumps(messages))   # ← opt-in
            span.set_attribute("gen_ai.output.messages", json.dumps([             # ← opt-in
                {"role": "assistant", "content": response["content"]}
            ]))

        return response


def trace_agent_tool_call(tool_name: str, tool_input: dict, tool_output: dict):
    """
    AgentOps taxonomy (arXiv:2411.05285): beyond LLM calls, you need to trace
    tool invocations, memory access, and state transitions in agents.
    """
    with tracer.start_as_current_span(f"tool-call:{tool_name}") as span:
        span.set_attribute("gen_ai.operation.name", "execute_tool")  # ← semconv operation name for tool execution

        # AgentOps artifacts (arXiv:2411.05285): these are the required trace artifacts
        # for full agent observability:
        # - Input/output: what the agent sent to the tool, what it received
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("tool.input",  json.dumps(tool_input))   # ← what was called
        span.set_attribute("tool.output", json.dumps(tool_output))  # ← what came back

        # - Resource consumption: critical for cost attribution
        # - Safety markers: flag sensitive operations (write_file, execute_code)
        if tool_name in ["write_file", "execute_code", "send_email"]:
            span.set_attribute("agent.safety.sensitive_operation", True)
            # ← This is where AgentOps taxonomy meets practical monitoring:
            #   filter for agent.safety.sensitive_operation = true to audit dangerous calls


# ─── COMBINING OTEL AUTO-INSTRUMENTATION WITH MLFLOW TRACING ─────────────────
# MLflow 3.6.0 supports combining both in one trace:
# FastAPI HTTP span (from OTel auto-instrumentation) + MLflow LLM spans = single trace

# This produces ONE unified trace from HTTP request through to LLM calls:
# [http_request: POST /api/chat] (300ms)
#   └── [AGENT: process_request] (280ms)
#         ├── [LLM: gpt-4o chat] (150ms, 120 input tokens)
#         └── [tool-call: search_web] (80ms)

The opt-in content logging design is a critical correctness point. By default, gen_ai.input.messages and gen_ai.output.messages are NOT recorded even when auto-tracing is enabled. This is not a missing feature; it is an explicit privacy design decision in the OTel GenAI spec. Production LLM pipelines often contain PII in prompts. Teams that assume all prompt content is automatically captured will find their token-based metrics working but their content-based debugging empty unless they explicitly set the content logging opt-in.

See It In Action: End-to-End Agent Observability Pipeline

Task: Set up a complete observability pipeline for a production RAG agent, capturing traces from the HTTP layer through LLM calls through tool invocations, with dual export to MLflow and Datadog.

Step 1: Infrastructure setup

# Start MLflow tracking server with OTLP endpoint enabled
mlflow server \
  --host 0.0.0.0 \
  --port 5000
# MLflow 3.6.0 OTLP endpoint is automatic: http://localhost:5000/v1/traces

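# Configure environment for dual export
# Assumes a Datadog Agent with OTLP ingestion enabled, listening on the
# standard OTLP/HTTP port; adjust host and port for your deployment
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://localhost:4318/v1/traces"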
export MLFLOW_TRACE_ENABLE_OTLP_DUAL_EXPORT="true"
export MLFLOW_ENABLE_OTEL_GENAI_SEMCONV="true"
export MLFLOW_TRACKING_URI="http://localhost:5000"

Step 2: RAG agent with full observability

import mlflow
from openai import OpenAI

mlflow.set_experiment("rag-agent-production")
mlflow.openai.autolog()           # auto-trace all OpenAI calls
mlflow.langchain.autolog()        # auto-trace LangChain retriever

client = OpenAI()                 # reads OPENAI_API_KEY from the environment
# `retriever` is assumed to be an existing LangChain retriever built from
# your vector store; it is not defined in this snippet.

@mlflow.trace(span_type="AGENT")
def rag_query(user_question: str) -> str:
    """Full RAG pipeline: retrieval → synthesis → return"""
    # Auto-traced: each LLM call becomes a child span
    docs = retriever.get_relevant_documents(user_question)   # LangChain: auto-traced
    context = "\n".join(d.page_content for d in docs)

    response = client.chat.completions.create(               # OpenAI: auto-traced
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using the context provided."},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {user_question}"}
        ]
    )
    return response.choices[0].message.content

Step 3: Trace produced (real structure from one call)

[AGENT: rag_query]  duration: 847ms  status: OK
├── [CHAIN: retriever.get_relevant_documents]  duration: 312ms
│     ├── [EMBEDDING: text-embedding-3-small]
│     │     gen_ai.usage.input_tokens: 12
│     │     gen_ai.usage.output_tokens: 0  (embeddings don't have output tokens)
│     └── [TOOL: vector_store_search]
│           tool.name: "pinecone_query"
│           matches_returned: 5
│
└── [LLM: gpt-4o chat]  duration: 512ms
      gen_ai.operation.name: "chat"
      gen_ai.request.model: "gpt-4o"
      gen_ai.usage.input_tokens: 1847
      gen_ai.usage.output_tokens: 124
      gen_ai.response.finish_reasons: ["stop"]
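Because token usage lives on the leaf spans, a per-request total means walking the tree. A minimal sketch, assuming the trace above has been loaded into nested dictionaries; this shape is an illustration, not MLflow's trace object model.

```python
# The Step 3 trace as nested dictionaries (hypothetical in-memory shape).
trace_tree = {
    "name": "AGENT: rag_query",
    "attributes": {},
    "children": [
        {
            "name": "CHAIN: retriever.get_relevant_documents",
            "attributes": {},
            "children": [
                {
                    "name": "EMBEDDING: text-embedding-3-small",
                    "attributes": {"gen_ai.usage.input_tokens": 12,
                                   "gen_ai.usage.output_tokens": 0},
                    "children": [],
                },
                {
                    "name": "TOOL: vector_store_search",
                    "attributes": {"tool.name": "pinecone_query"},
                    "children": [],
                },
            ],
        },
        {
            "name": "LLM: gpt-4o chat",
            "attributes": {"gen_ai.usage.input_tokens": 1847,
                           "gen_ai.usage.output_tokens": 124},
            "children": [],
        },
    ],
}


def total_tokens(span: dict) -> tuple:
    """Recursively sum (input_tokens, output_tokens) over a span and its descendants."""
    attrs = span["attributes"]
    input_sum = attrs.get("gen_ai.usage.input_tokens", 0)
    output_sum = attrs.get("gen_ai.usage.output_tokens", 0)
    for child in span["children"]:
        child_in, child_out = total_tokens(child)
        input_sum += child_in
        output_sum += child_out
    return input_sum, output_sum


request_input, request_output = total_tokens(trace_tree)
print(request_input, request_output)  # 1859 124
```

Spans without token attributes (the agent, chain, and tool spans here) contribute zero, so the same function works at any level of the tree.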

Step 4: Dual export destinations and what each sees

MLflow Tracking Server (http://localhost:5000):
  Format: mlflow.* attributes
  UI: full trace tree with custom MLflow rendering
  Search: mlflow.search_traces(filter_string="attributes.`gen_ai.request.model` = 'gpt-4o'")
  Use for: debugging, evaluation, experiment comparison

Datadog (via OTLP exporter):
  Format: gen_ai.* attributes (OTel GenAI semconv)
  Metrics from spans:
    gen_ai.usage.input_tokens  → cost dashboard: $X per 1M tokens
    gen_ai.usage.output_tokens → separate cost line
    span.duration              → p50/p90/p99 latency by model
  Alerts: p99 latency > 2s, error_rate > 1%, cost > $X/hour
  Use for: SRE alerting, cost attribution, SLO tracking
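The latency percentiles behind those alerts are normally computed by the backend, but the arithmetic is simple enough to sketch. This uses the nearest-rank method on a made-up list of span durations; the 2000 ms threshold mirrors the "p99 latency > 2s" alert above, and real backends may use a different percentile estimator.

```python
import math


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical span durations in milliseconds for one model.
durations_ms = [512, 430, 610, 2250, 480, 505, 495, 700, 520, 460]

p50 = percentile(durations_ms, 50)
p90 = percentile(durations_ms, 90)
p99 = percentile(durations_ms, 99)
latency_alert = p99 > 2000  # fires because one slow call dominates the tail

print(p50, p90, p99, latency_alert)  # 505 700 2250 True
```

The gap between p50 and p99 here is why tail percentiles, not averages, are the usual alerting signal for LLM calls.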

Step 5: Querying the trace data

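# MLflow SDK query: find slow gpt-4o traces (execution time over 1 second)
import mlflow

experiment = mlflow.get_experiment_by_name("rag-agent-production")
traces = mlflow.search_traces(
    experiment_ids=[experiment.experiment_id],
    filter_string="attributes.`gen_ai.request.model` = 'gpt-4o' AND execution_time_ms > 1000",
    max_results=50,
)
# Returns: DataFrame with one row per matching trace (capped at max_results)
# Add a timestamp condition to the filter to restrict the window to the last 24 hours

# Token cost for the matched traces
# Assumes token counts are available as flattened columns; the exact column
# names depend on your MLflow version and how span attributes are extracted
total_input_tokens  = traces["attributes.gen_ai.usage.input_tokens"].sum()
total_output_tokens = traces["attributes.gen_ai.usage.output_tokens"].sum()
# Assumed prices: $2.50 per 1M input tokens, $10.00 per 1M output tokens
cost_usd = (total_input_tokens / 1_000_000 * 2.50) + (total_output_tokens / 1_000_000 * 10.00)
print(f"GPT-4o cost for matched traces: ${cost_usd:.2f}")
# Example output (illustrative): GPT-4o cost for matched traces: $4.73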

The token cost query demonstrates why the GenAI Semantic Conventions matter for operations. gen_ai.usage.input_tokens and gen_ai.usage.output_tokens are standardized field names that MLflow, Datadog, and any other OTel-compatible backend all understand. You define the aggregation once in terms of these two fields; only the query syntax changes with the backend you route traces to.
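The cost arithmetic itself can be isolated from any backend. A minimal sketch, using the same per-million-token prices as the query above as an assumed price table, applied to the gpt-4o span from Step 3 (1847 input tokens, 124 output tokens):

```python
# Assumed prices in USD per 1M tokens; check your provider's current price list.
PRICES_PER_MILLION = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call from its gen_ai.usage.* token counts."""
    price = PRICES_PER_MILLION[model]
    return (input_tokens / 1_000_000 * price["input"]
            + output_tokens / 1_000_000 * price["output"])


cost = estimate_cost_usd("gpt-4o", input_tokens=1847, output_tokens=124)
print(f"${cost:.6f}")  # $0.005858
```

Input tokens dominate the bill for this RAG call (about $0.0046 of the total), which is typical when retrieved context is stuffed into the prompt.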

Why This Design Works, and What It Trades Away

The attribute translation approach (MLflow stores in mlflow.*, exports in gen_ai.*) is the correct design for a system that needs to preserve backward compatibility while adopting a new standard. MLflow's internal mlflow.* attribute format existed before OTel GenAI semconv was mature. Requiring all existing MLflow users to change their attribute format would have broken backward compatibility. Translation at export time means existing MLflow traces and dashboards continue to work, while new integrations get the standard format.

The OTLP endpoint at /v1/traces is the correct integration point for language-agnostic tracing. The alternative, requiring every language (Java, Go, Rust) to have an MLflow SDK, would never achieve the breadth of ecosystem support that OTel auto-instrumentation already has. By accepting standard OTLP input, MLflow becomes a valid target for any instrumented application in any language, effectively expanding its addressable ecosystem to everything the OTel community has already instrumented.
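To show how little a client needs in order to speak OTLP, here is a sketch that builds an OTLP/JSON trace export body using only the standard library. The field names follow the OTLP JSON encoding (hex trace and span IDs, nanosecond timestamps as strings). The helper names and the scope name are hypothetical, and whether a particular server accepts the JSON encoding rather than protobuf is something to verify before relying on it.

```python
import json
import secrets
import time


def otlp_attribute(key: str, value) -> dict:
    """Encode one attribute in the OTLP/JSON key/AnyValue shape."""
    if isinstance(value, bool):
        return {"key": key, "value": {"boolValue": value}}
    if isinstance(value, int):
        return {"key": key, "value": {"intValue": str(value)}}  # int64 travels as a string
    if isinstance(value, float):
        return {"key": key, "value": {"doubleValue": value}}
    return {"key": key, "value": {"stringValue": str(value)}}


def build_otlp_request(span_name: str, attributes: dict, start_ns: int, end_ns: int,
                       service_name: str = "my-llm-agent") -> dict:
    """Wrap a single span in the resourceSpans / scopeSpans envelope."""
    span = {
        "traceId": secrets.token_hex(16),   # 16 random bytes, hex-encoded
        "spanId": secrets.token_hex(8),     # 8 random bytes, hex-encoded
        "name": span_name,
        "kind": 3,                          # SPAN_KIND_CLIENT
        "startTimeUnixNano": str(start_ns),
        "endTimeUnixNano": str(end_ns),
        "attributes": [otlp_attribute(k, v) for k, v in attributes.items()],
    }
    return {
        "resourceSpans": [{
            "resource": {"attributes": [otlp_attribute("service.name", service_name)]},
            "scopeSpans": [{"scope": {"name": "manual-otlp-sketch"}, "spans": [span]}],
        }]
    }


now_ns = time.time_ns()
request_body = build_otlp_request(
    "chat gpt-4o",
    {"gen_ai.operation.name": "chat",
     "gen_ai.request.model": "gpt-4o",
     "gen_ai.usage.input_tokens": 120},
    start_ns=now_ns - 512_000_000,
    end_ns=now_ns,
)
encoded = json.dumps(request_body)  # body for an HTTP POST to a /v1/traces endpoint
```

Any language that can produce this structure and make an HTTP request can participate, which is the argument for accepting OTLP instead of shipping per-language SDKs.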

The content-off-by-default design in the OTel GenAI semconv is correct for production deployments. User prompts frequently contain PII (names, addresses, health information). Capturing this by default would create compliance problems (GDPR, HIPAA) for any enterprise deploying LLM applications. The OTel design externalizes this decision to the operator, which is where it belongs.

What this design trades away:

The five-layer observability taxonomy from arXiv:2604.26152 reveals the gap: OTel GenAI semconv covers Layer 3 (trace-level) cleanly. Layers 1 (GPU/infrastructure) and 2 (system-level metrics) are handled by standard OTel metrics instrumentation. But Layers 4 (internal state, propositional probes, activations) and 5 (confidence calibration, uncertainty) are not part of the OTel schema and may never be, because they require model internals access that the OTel SDK is not designed to provide. The paper identifies "connecting model-level confidence signals with infrastructure-level anomalies" as the defining open problem. MLflow + OTel solves Layers 1-3. Nobody has solved Layers 4-5 in a production-ready open-source tool.

The AgentOps taxonomy (arXiv:2411.05285) identifies artifacts that span tracing alone cannot capture: long-term memory access patterns, agent state transitions between tasks, the relationship between a decision made in tool call N and the failure that occurred in tool call N+15. Distributed traces represent point-in-time spans. Agent behavior is longitudinal. The OTel span model is built for request-response patterns; agent reasoning forms graphs with cycles, backtracking, and conditional branches that do not map cleanly onto the span parent-child model.
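The structural mismatch can be stated precisely: a span hierarchy is a forest, so every step has at most one parent and no step can lead back to an ancestor. A minimal sketch that tests whether a set of agent step transitions satisfies that constraint; the step names are made up for illustration.

```python
from collections import defaultdict


def fits_span_tree(edges: list) -> bool:
    """Return True if (parent, child) edges form a forest: one parent per node, no cycles."""
    parents = {}
    children = defaultdict(list)
    nodes = set()
    for parent, child in edges:
        nodes.update((parent, child))
        if child in parents:          # a second parent cannot be expressed as parent-child
            return False
        parents[child] = parent
        children[parent].append(child)

    # With at most one parent per node, any cycle is unreachable from the roots,
    # so the structure is a forest exactly when a walk from the roots covers every node.
    stack = [node for node in nodes if node not in parents]
    seen = set()
    while stack:
        node = stack.pop()
        seen.add(node)
        stack.extend(children[node])
    return seen == nodes


# A linear pipeline maps onto spans without trouble.
linear_run = [("plan", "search"), ("search", "summarize")]
# An agent that evaluates its result and returns to planning creates a cycle.
backtracking_run = [("plan", "search"), ("search", "evaluate"), ("evaluate", "plan")]

print(fits_span_tree(linear_run))        # True
print(fits_span_tree(backtracking_run))  # False
```

In practice instrumentation flattens the second case by emitting each revisit as a new span, which preserves timing but loses the fact that the agent returned to an earlier state.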

Technical Moats

The OTel GenAI SIG's semantic conventions are an active standards process. The SIG is developing conventions for multi-agent systems covering tasks, actions, agent teams, memory access, and artifact tracking (as of OTel GenAI semconv 1.37+). This is not a committee writing theory; it is practitioners from Google, Microsoft, Datadog, and others defining the specific attribute names that will appear in spans for agent memory writes, sub-agent spawning, and decision point logging. MLflow's early adoption of the incoming standards positions it to be compatible with these additions without schema migration work.

Dual export without SDK changes. The MLFLOW_TRACE_ENABLE_OTLP_DUAL_EXPORT environment variable is operationally significant because it means zero code changes to switch from single-destination to dual-destination tracing. Teams can add Datadog observability to existing MLflow-instrumented applications by setting two environment variables, not by refactoring their instrumentation. This is the correct deployment model for production observability tools: changes in telemetry routing should not require application code changes.

OTel as the interoperability bridge for agent frameworks. MLflow's ingest from Google ADK, LiveKit Agents, and Spring AI works because those frameworks emit OTel-compliant gen_ai.* spans. MLflow does not need to maintain integration code for each of these frameworks; any framework that implements the OTel GenAI semconv becomes automatically compatible. The investment in the OTel standard is shared across the entire ecosystem rather than requiring each tool to maintain N×M bilateral integrations.

Insights

Insight One: The OTel GenAI Semantic Conventions are quietly becoming the LLM observability lingua franca, and most teams adopting them are not explicitly choosing to do so. When you call mlflow.openai.autolog() or instrument with LangChain and enable semconv export, the spans leaving MLflow are in the OTel GenAI semconv format. When Datadog's Agent Observability says "supports OTel 1.37+ GenAI semantic convention-compliant spans," it means the same spans. When Google ADK and LiveKit emit traces that MLflow accepts, it works because both sides speak the same protocol. The adoption is happening bottom-up through tool integrations, not top-down through explicit standardization decisions. Teams that understand this can deliberately choose to route their traces to different backends, run A/B tests across observability platforms, and avoid vendor lock-in. Teams that do not understand it are already locked in to the OTel schema without knowing it.

Insight Two: The five-layer taxonomy from arXiv:2604.26152 exposes the part of LLM observability that MLflow + OTel explicitly does NOT solve, and this is the right architectural choice. Layers 1-3 (infrastructure, system, trace) are well-handled by OTel. Layers 4-5 (model internal state, confidence calibration) require model internals access via interpretability tools, probing classifiers, or model-internal hooks that the OTel SDK was never designed to support. The paper from MIT on confidence calibration via reinforcement learning, the UC Berkeley work on propositional probes, and the OpenAI work on chain-of-thought monitorability are all Layer 4-5 problems. MLflow + OTel is the correct solution for Layers 1-3 and the wrong solution for Layers 4-5. Any engineer building production LLM observability should understand this boundary explicitly: you will not catch hallucination or miscalibration in your OTel spans no matter how many attributes you log.

Surprising Takeaway

Claude Code exports OTel metrics and log events, with trace support currently in beta. VS Code Copilot emits traces, metrics, and events for every agent interaction. OpenAI Codex exports structured log events and OTel metrics for API requests, tool calls, and sessions. This means the AI coding tools that engineers use every day are themselves OTel-instrumented, and any OTLP-compatible backend can receive their telemetry. An engineer with MLflow running locally could point their VS Code Copilot OTel exporter at the MLflow OTLP endpoint and see traces for every Copilot interaction in the same MLflow UI they use to debug their own LLM applications. The observability substrate that MLflow is building is not just for the LLM applications you are shipping to users; it is the same standard used by the LLM tools you are using to build those applications.

TL;DR For Engineers

  • MLflow 3.6.0 added an OTLP endpoint at /v1/traces, dual export via MLFLOW_TRACE_ENABLE_OTLP_DUAL_EXPORT=true, and OTel GenAI semconv export via MLFLOW_ENABLE_OTEL_GENAI_SEMCONV=true. MLflow is now a bidirectional OTel participant: ingests from any OTel-instrumented app in any language, exports to any OTLP-compatible backend.

  • The OTel GenAI Semantic Conventions define the standard attribute schema: gen_ai.operation.name, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens. These are adopted by Google Cloud, AWS, Azure, Datadog, Grafana, and MLflow. Content logging (gen_ai.input.messages, gen_ai.output.messages) is opt-in for privacy reasons.

  • Auto-tracing: mlflow.openai.autolog(), mlflow.langchain.autolog(), mlflow.dspy.autolog(). Manual tracing: @mlflow.trace decorator. Native OTel: set OTEL_EXPORTER_OTLP_TRACES_ENDPOINT to MLflow's endpoint and add x-mlflow-experiment-id header.

  • The five-layer observability taxonomy (arXiv:2604.26152) shows what OTel covers (Layers 1-3: infrastructure, system, trace) and what it cannot (Layers 4-5: model internal state, confidence calibration). Hallucination detection and miscalibration monitoring are not solved by OTel span instrumentation.

  • The AgentOps taxonomy (arXiv:2411.05285) identifies artifacts spans cannot capture: long-range memory access patterns, state transition relationships across many tool calls. The span model is built for request-response; agent reasoning is longitudinal and tree-structured with backtracking.

The Observability Substrate Is Set. The Integration Problem Is Not.

MLflow + OTel GenAI Semantic Conventions solve the plumbing: standard attribute names, OTLP transport, bidirectional ingestion and export, language-agnostic instrumentation. This is the correct foundation and the field is converging on it quickly.

What remains unsolved, as both research papers identify, is the cross-layer integration: connecting a spike in gen_ai.usage.input_tokens with a downstream drop in eval scores with a concurrent GPU memory pressure alert. The telemetry for all three signals is available. The causal link between them is not. That is the observability problem for 2026 and beyond, and no tool, including MLflow, has solved it yet.
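A sketch of what the naive version of that integration looks like: bucket anomaly timestamps from each layer into fixed windows and report the windows where all three fire together. The signal names, timestamps, and the 60-second window are made up, and co-occurrence of this kind is only a starting point for investigation, not evidence of a causal link.

```python
def co_occurring_windows(events_by_signal: dict, window_s: int = 60) -> list:
    """Return start times of windows in which every signal reported an anomaly."""
    window_sets = [
        {int(ts // window_s) for ts in timestamps}
        for timestamps in events_by_signal.values()
    ]
    if not window_sets:
        return []
    shared = set.intersection(*window_sets)
    return sorted(index * window_s for index in shared)


# Hypothetical anomaly timestamps (seconds since some epoch) from three layers.
anomalies = {
    "input_token_spike": [1205, 1810],   # from gen_ai.usage.input_tokens
    "eval_score_drop": [1230, 2500],     # from an evaluation pipeline
    "gpu_memory_pressure": [1215, 1815], # from infrastructure metrics
}

suspicious = co_occurring_windows(anomalies)
print(suspicious)  # [1200]
```

The window starting at 1200 is flagged because all three signals fired within it; the window at 1800 is not, because the eval drop is missing. Deciding which signal caused which is the part that remains open.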

Summary

MLflow 3.6.0 made OpenTelemetry the default substrate for LLM tracing by shipping a bidirectional OTel integration: an OTLP ingestion endpoint at /v1/traces that accepts traces from any OTel-instrumented application in any language, dual export (MLFLOW_TRACE_ENABLE_OTLP_DUAL_EXPORT=true) that routes traces simultaneously to both MLflow's internal storage and any OTLP-compatible backend, and native support for OTel GenAI Semantic Conventions (gen_ai.* attribute format) adopted across Google Cloud, AWS, Azure, and Datadog. The OTel GenAI spec standardizes the attribute schema for LLM observability (model name, token counts, operation type), with content logging opt-in for privacy. The five-layer AI observability taxonomy (arXiv:2604.26152) identifies what OTel covers (infrastructure through trace-level) and what it cannot (model internal state, confidence calibration), while the AgentOps taxonomy (arXiv:2411.05285) highlights that span-based tracing was designed for request-response patterns, not the longitudinal reasoning trees of production agents.
