Design Philosophy
The design decisions behind AgentTel — what trade-offs were considered and why we landed where we did.
Core Principle
Telemetry should carry enough context for AI agents to reason and act autonomously. Every design choice flows from this: if an agent receives a span, it should be able to answer "what is this?", "is it healthy?", "who owns it?", and "what should I do?" without additional lookups.
Configuration Over Annotations
The Trade-off
Operational metadata — runbook URLs, escalation levels, SLO targets — changes more frequently than code. Embedding it in @AgentOperation annotations couples operational concerns to the development lifecycle: changing a runbook URL requires a code change, rebuild, and redeploy.
The Decision
AgentTel supports both YAML configuration and annotations, with config taking priority:
    agenttel:
      operations:
        "[POST /api/payments]":
          retryable: true
          runbook-url: https://wiki/runbooks/process-payment
          escalation-level: page_oncall
- YAML config is the recommended path: operational metadata lives in application.yml or agenttel.yml, deployable via ConfigMap without code changes
- Annotations remain available for teams that prefer co-located metadata or need compile-time validation
- When both exist, config wins — this lets platform teams override developer-set defaults
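The precedence rule can be sketched in plain Java. This is a minimal illustration, not AgentTel's implementation: the `OperationMetadata` record and its `resolve` method are hypothetical names, and merging field by field (rather than replacing the whole block) is an assumption.

```java
import java.util.Optional;

/** Hypothetical holder for the metadata fields shown in the YAML example. */
record OperationMetadata(
    Optional<Boolean> retryable,
    Optional<String> runbookUrl,
    Optional<String> escalationLevel) {

  /** Assumed field-by-field merge: a config value wins, else the annotation value is used. */
  static OperationMetadata resolve(OperationMetadata fromConfig, OperationMetadata fromAnnotation) {
    return new OperationMetadata(
        fromConfig.retryable().or(fromAnnotation::retryable),
        fromConfig.runbookUrl().or(fromAnnotation::runbookUrl),
        fromConfig.escalationLevel().or(fromAnnotation::escalationLevel));
  }

  public static void main(String[] args) {
    // Values a developer might have set on an annotation.
    OperationMetadata annotation = new OperationMetadata(
        Optional.of(false),
        Optional.of("https://wiki/runbooks/process-payment"),
        Optional.empty());
    // Values a platform team supplies through YAML.
    OperationMetadata config = new OperationMetadata(
        Optional.of(true),
        Optional.empty(),
        Optional.of("page_oncall"));

    // retryable comes from config, runbookUrl from the annotation, escalationLevel from config.
    System.out.println(resolve(config, annotation));
  }
}
```

Under this assumption, a platform team can override a single field without restating everything the developer set.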
Why Not Annotations Only?
Developers write code, but SREs and platform teams own operational context. Forcing developers to encode escalation policies they don't own creates stale data risk and adoption friction.
Library + Zero-Code: Two Integration Paths
The Trade-off
A library dependency (Spring Boot starter) enables deep integration — annotations, AOP, auto-configuration — but requires code changes. A javaagent extension requires zero code changes but can't capture application-specific knowledge like "this endpoint is idempotent."
The Decision
AgentTel provides both:
| Mode | Module | Code Changes | Depth |
|---|---|---|---|
| Spring Boot Starter | agenttel-spring-boot-starter | Add dependency + YAML config | Full: annotations, AOP, auto-config |
| JavaAgent Extension | agenttel-javaagent | Zero (just a JVM flag + YAML) | Topology, baselines, decisions from config |
The javaagent extension uses OTel's AutoConfigurationCustomizerProvider SPI to register the same SpanProcessor and ResourceProvider as the Spring Boot starter, but reads config from agenttel.yml instead of Spring's property binding.
Why Both?
Different teams have different constraints. A platform team rolling out observability across 200 services needs zero-code. A payments team building a critical service wants the full annotation-driven experience. AgentTel shouldn't force a choice.
Topology on Resource, Not Spans
The Trade-off
Putting topology attributes (team, tier, domain, on_call_channel) on every span makes each span self-contained — an agent can reason about any span in isolation. But topology is identical for every span from the same service, so this duplicates data on every export.
The Decision
Topology lives on OTel Resource attributes, set once per service instance at startup via AgentTelResourceProvider. Baselines, decisions, and anomaly scores remain on span attributes because they vary per operation.
| Level | What | Why |
|---|---|---|
| Resource (once per service) | Topology: team, tier, domain, dependencies | Same for every span — set once |
| Span (per operation) | Baselines, decisions, anomaly scores | Varies by endpoint |
At 10K spans/second, this avoids duplicating ~150 bytes of topology data per span (~1.5 MB/s saved).
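The savings figure is a straightforward product, worked through below. The class and method names are illustrative only, and the result is expressed in decimal megabytes (1 MB = 1,000,000 bytes).

```java
import java.util.Locale;

final class TopologyOverhead {
  /** Bytes per second spent if topology attributes were repeated on every span. */
  static long duplicatedBytesPerSecond(long spansPerSecond, long topologyBytesPerSpan) {
    return spansPerSecond * topologyBytesPerSpan;
  }

  public static void main(String[] args) {
    long bytes = duplicatedBytesPerSecond(10_000, 150);
    // prints: 1500000 bytes/s = 1.5 MB/s
    System.out.printf(Locale.ROOT, "%d bytes/s = %.1f MB/s%n", bytes, bytes / 1_000_000.0);
  }
}
```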
Why Not All on Spans?
Self-contained spans are convenient but wasteful. OTel backends already associate Resource attributes with every span from that service — the data is available without duplication. Agents querying via MCP tools get the full picture because AgentContextProvider merges Resource and span data.
Operation Profiles: Convention Over Configuration
The Trade-off
Without profiles, every operation needs its own full config block — a service with 50 endpoints would have hundreds of lines of repetitive YAML. But auto-deriving everything from conventions (e.g., "all GETs are retryable") risks making wrong assumptions.
The Decision
Profiles define reusable operational defaults. Operations reference a profile and optionally override specific values:
    agenttel:
      profiles:
        critical-write:
          retryable: false
          escalation-level: page_oncall
          expected-latency-p99: 500ms
      operations:
        "[POST /api/payments]":
          profile: critical-write
          runbook-url: https://wiki/runbooks/process-payment  # override
Resolution order: profile defaults < per-operation overrides
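The resolution order can be sketched as a two-layer merge. This is a simplified illustration using string maps, not AgentTel's actual resolver: `ProfileResolver` is a hypothetical name, and failing on an unknown profile is an assumed behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ProfileResolver {
  /** Starts from the referenced profile's defaults, then applies per-operation values on top. */
  static Map<String, String> resolve(
      Map<String, Map<String, String>> profiles, Map<String, String> operation) {
    Map<String, String> resolved = new LinkedHashMap<>();
    String profileName = operation.get("profile");
    if (profileName != null) {
      Map<String, String> defaults = profiles.get(profileName);
      if (defaults == null) {
        // Assumption: a reference to an undefined profile is treated as a config error.
        throw new IllegalArgumentException("Unknown profile: " + profileName);
      }
      resolved.putAll(defaults);
    }
    operation.forEach((key, value) -> {
      if (!key.equals("profile")) {
        resolved.put(key, value); // operation-level value overrides the profile default
      }
    });
    return resolved;
  }

  public static void main(String[] args) {
    Map<String, Map<String, String>> profiles = Map.of(
        "critical-write", Map.of(
            "retryable", "false",
            "escalation-level", "page_oncall",
            "expected-latency-p99", "500ms"));
    Map<String, String> operation = Map.of(
        "profile", "critical-write",
        "runbook-url", "https://wiki/runbooks/process-payment");

    // The result holds the three profile defaults plus the operation's runbook-url.
    System.out.println(resolve(profiles, operation));
  }
}
```

An operation that sets a key already present in its profile, for example a tighter `expected-latency-p99`, replaces the default for that operation only.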
Why Not Pure Convention?
Conventions work for simple cases but break for domain-specific knowledge. A POST endpoint might be idempotent (payment with idempotency key) or not (event emission). Only the team that owns the service knows. Profiles balance brevity with explicit intent.
Span Enrichment Architecture
The Trade-off
Enrichment can happen at three points: SDK level (in-process), export time (in-process but deferred), or collector level (server-side). Each has different trade-offs for latency, accuracy, and coupling.
The Decision
AgentTel uses a two-phase enrichment model:
- AgentTelSpanProcessor.onStart() sets topology, baselines, and decision attributes when the span begins. These are immediately available to application code and downstream processors.
- AgentTelEnrichingSpanExporter runs at export time to add computed attributes that require the full span: error classification, causality analysis, severity assessment, baseline confidence.
Phase 1 runs on the request thread (fast, attribute-setting only). Phase 2 runs on the export thread (can do heavier computation without blocking requests).
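The split between the two threads can be simulated with the standard library alone. The sketch below does not use OpenTelemetry types or AgentTel classes: `FakeSpan`, the `example.*` attribute keys, and the status-code thresholds are all hypothetical stand-ins chosen to show the shape of the flow.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class TwoPhaseEnrichment {
  /** Stand-in for a span: a bag of attributes plus the outcome recorded when it ends. */
  static final class FakeSpan {
    final Map<String, String> attributes = new ConcurrentHashMap<>();
    volatile int httpStatus;
  }

  /** Phase 1: copies precomputed attributes on the calling (request) thread. */
  static void onStart(FakeSpan span, Map<String, String> precomputed) {
    span.attributes.putAll(precomputed);
  }

  /** Phase 2: needs the finished span, so it is invoked later from the export thread. */
  static void enrichAtExport(FakeSpan span) {
    String errorClass = span.httpStatus >= 500 ? "server_error"
        : span.httpStatus >= 400 ? "client_error"
        : "none";
    span.attributes.put("example.error.class", errorClass);
  }

  public static void main(String[] args) throws Exception {
    ExecutorService exportThread = Executors.newSingleThreadExecutor();
    try {
      FakeSpan span = new FakeSpan();
      onStart(span, Map.of("example.retryable", "false")); // request thread
      span.httpStatus = 503;                                // the operation finishes
      Future<?> exported = exportThread.submit(() -> enrichAtExport(span));
      exported.get();                                       // wait for the export thread
      System.out.println(span.attributes);
    } finally {
      exportThread.shutdown();
    }
  }
}
```

Phase 1 only copies values that are already known, so the request thread does no computation. Anything that depends on the span's outcome is deferred to the export thread.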
Why Not Collector-Side?
Collector-side enrichment is language-agnostic and centralized, but it requires the collector to know about your service topology, baselines, and operational decisions — which means another config surface to manage. SDK-level enrichment keeps everything in one place (the application's config) and works with any OTel backend without a custom collector pipeline.
Future Considerations
These are areas under consideration but not yet implemented:
- Service catalog integration — pull operational metadata from Backstage, OpsLevel, or Cortex instead of YAML config
- OTel Collector processor — a Go-based collector processor for language-agnostic enrichment at the collector level
- Sampling-aware enrichment — skip enrichment for spans that will be sampled away
- Convention-over-configuration — auto-derive operational defaults from HTTP method, route patterns, and framework metadata