Design Considerations
This document captures open design questions, trade-offs, and future direction for AgentTel's instrumentation approach. It is intended to guide the next iteration of the library.
1. Library Dependency vs. Zero-Code Agent
Current State
Users add agenttel-spring-boot-starter as a compile-time dependency, configure via application.yml, and optionally annotate code with @AgentOperation.
Pros
- Deep integration: annotations capture application-specific knowledge (retryable, idempotent, runbook URLs) that no external agent could infer
- Type-safe: compile-time checks on attribute names, enum values, annotation parameters
- Auto-configuration: Spring Boot starter makes setup low-friction for Spring apps
- No sidecar or extra process: runs in-process alongside the application
Cons
- Requires code changes: developers must add the dependency and (optionally) annotations. This is more friction than vendor agents (Datadog, New Relic), which attach as javaagents at runtime with zero code changes
- Tied to build lifecycle: upgrading AgentTel requires a code change, rebuild, and redeploy
- Not language-agnostic: the library approach is Java/JVM-specific; each language needs its own implementation
Future Direction
Consider offering a hybrid model:
- Javaagent mode: A javaagent that attaches at runtime and auto-discovers topology from OTel resource attributes, Spring bean metadata, and HTTP route patterns. This would provide topology enrichment with zero code changes.
- Library mode (current): For teams that want the full annotation-driven experience with compile-time safety.
- OTel Collector processor: A server-side component (written in Go) that enriches spans at the collector level using external configuration. This would be fully language-agnostic.
2. Operational Knowledge Hardcoded in Source Code
Current State
@AgentOperation annotations embed operational metadata directly in source code:
@AgentOperation(
    expectedLatencyP50 = "45ms",
    expectedLatencyP99 = "200ms",
    retryable = true,
    runbookUrl = "https://wiki/runbooks/process-payment",
    escalationLevel = EscalationLevel.PAGE_ONCALL,
    safeToRestart = false
)
Pros
- Co-located: operational knowledge lives next to the code it describes, making it easy to discover
- Versioned: annotation values are tracked in git alongside the code
- IDE support: autocomplete, refactoring, and compile-time validation
- Works today: no external infrastructure needed
Cons
- Operational metadata changes more frequently than code: runbook URLs move, SLO targets shift, escalation policies evolve — but code doesn't get redeployed for operational changes
- Wrong audience: developers write the code, but SREs and platform teams own the operational context (runbooks, escalation, SLOs). Annotations force developers to encode knowledge they may not have.
- Stale data risk: if a runbook URL changes and nobody updates the annotation, agents act on outdated information
- Coupling: application code becomes coupled to operational concerns
Future Direction
Externalize operational metadata so it can change without code deployments:
- YAML/config-driven approach (near-term): Extend application.yml to accept per-operation decision metadata, similar to how topology and dependencies are already configured. This keeps metadata in config, deployable via ConfigMap or config service without code changes.
- Service catalog integration (medium-term): Pull operational metadata from an external service catalog (Backstage, OpsLevel, Cortex) at startup or on a refresh interval. The catalog becomes the single source of truth.
- Inferred from observed data (long-term): Baselines should come from actual traffic patterns (the RollingBaselineProvider already does this). Decision metadata like "retryable" could potentially be inferred from retry patterns in traces.
- Annotations as optional overrides: Keep annotations for cases where developers want to explicitly declare operational intent, but make them optional, so the system works fully from external config.
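The interplay between external configuration and annotations can be sketched as a small resolver. This is a minimal illustration, not AgentTel's actual implementation: OperationMetadata, MetadataResolver, and their fields are hypothetical names, and the rule that external config wins over annotation values is an assumption based on the v0.2 notes at the end of this document.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical value type; the fields mirror a subset of the @AgentOperation parameters shown above.
// A null field means "not specified by this source".
record OperationMetadata(Boolean retryable, String runbookUrl, String escalationLevel) {

    // Field-by-field merge: values from this instance win, gaps are filled from the fallback.
    OperationMetadata orElse(OperationMetadata fallback) {
        return new OperationMetadata(
                retryable != null ? retryable : fallback.retryable,
                runbookUrl != null ? runbookUrl : fallback.runbookUrl,
                escalationLevel != null ? escalationLevel : fallback.escalationLevel);
    }
}

// Hypothetical resolver: external config (for example, parsed from YAML) is consulted first,
// and annotation-derived values only fill in what the config leaves unset.
final class MetadataResolver {
    private final Map<String, OperationMetadata> fromConfig;
    private final Map<String, OperationMetadata> fromAnnotations;

    MetadataResolver(Map<String, OperationMetadata> fromConfig,
                     Map<String, OperationMetadata> fromAnnotations) {
        this.fromConfig = Map.copyOf(fromConfig);
        this.fromAnnotations = Map.copyOf(fromAnnotations);
    }

    Optional<OperationMetadata> resolve(String operationName) {
        OperationMetadata config = fromConfig.get(operationName);
        OperationMetadata annotation = fromAnnotations.get(operationName);
        if (config == null) {
            return Optional.ofNullable(annotation);
        }
        return Optional.of(annotation == null ? config : config.orElse(annotation));
    }
}
```

With this shape, an SRE could update a runbook URL in config while the annotation's other values continue to apply until they are overridden as well.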
3. Source Code Bloat from Instrumentation
Current State
Each annotated endpoint adds 5-10 lines of annotation metadata:
@AgentOperation(
    expectedLatencyP50 = "45ms",
    expectedLatencyP99 = "200ms",
    expectedErrorRate = 0.001,
    retryable = true,
    idempotent = true,
    runbookUrl = "https://wiki/runbooks/process-payment",
    fallbackDescription = "Returns cached pricing",
    escalationLevel = EscalationLevel.PAGE_ONCALL,
    safeToRestart = false
)
@PostMapping
public ResponseEntity<PaymentResult> processPayment(...) {
Pros
- Explicit: every enrichment is visible and intentional
- Discoverable: grep for @AgentOperation to find all instrumented endpoints
- Familiar: follows the same pattern as @Transactional, @Cacheable, etc.
Cons
- Visual noise: a service with 50+ endpoints would have hundreds of lines of annotation metadata
- Repetitive: many operations share similar operational profiles (same escalation level, same retry policy)
- Maintenance burden: more annotation parameters = more things to keep current
- Barrier to adoption: teams may resist adding "yet another annotation" to their controllers
Future Direction
- Config-driven (no annotations needed): As described in section 2, move to YAML configuration. Zero source code changes needed.
- Profile-based annotations: Define operation profiles (e.g., @AgentOperation(profile = "critical-write")) that map to a set of defaults, reducing per-method annotation verbosity.
- Convention-over-configuration: Auto-derive operational defaults from existing metadata. For example, a POST endpoint is likely not idempotent, a GET is likely retryable.
- Auto-detection: The AgentTelAnnotationBeanPostProcessor already scans Spring MVC annotations. It could go further, inferring operation names without any AgentTel-specific annotations at all.
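The convention-over-configuration idea could look roughly like the sketch below, which derives defaults from the HTTP method alone. ConventionDefaults is a hypothetical type, not part of AgentTel. The GET and POST cases follow the examples given above; treating PUT and DELETE as idempotent follows general HTTP semantics, and marking them retryable is an additional assumption that a real implementation might not share.

```java
import java.util.Locale;

// Hypothetical defaults derived purely from the HTTP method.
// Explicit config or annotations would be expected to override these heuristics.
record ConventionDefaults(boolean retryable, boolean idempotent) {

    static ConventionDefaults forHttpMethod(String method) {
        switch (method.toUpperCase(Locale.ROOT)) {
            case "GET":
            case "HEAD":
            case "OPTIONS":
            case "PUT":
            case "DELETE":
                // Safe or idempotent by HTTP semantics: repeating the request
                // should leave the server in the same state.
                return new ConventionDefaults(true, true);
            default:
                // POST, PATCH, and anything unrecognized: assume a retry
                // could duplicate side effects.
                return new ConventionDefaults(false, false);
        }
    }
}
```

Defaulting unknown methods to the conservative case keeps a wrong guess from encouraging an agent to retry a non-idempotent call.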
4. Span Size and Telemetry Overhead
Current State
AgentTel adds up to 15 extra attributes per span:
| Category | Attributes | Approx. Bytes |
|---|---|---|
| Topology | team, tier, domain, on_call_channel | ~150 |
| Baseline | latency_p50_ms, latency_p99_ms, error_rate, source | ~120 |
| Decision | retryable, idempotent, runbook_url, fallback_available, fallback_description, escalation_level, safe_to_restart | ~300 |
| Total | 15 attributes | ~500-800 bytes/span |
At 10K spans/second, this is an additional 5-8 MB/s of telemetry data.
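The estimate is the per-span overhead multiplied by the span rate. A quick check of the arithmetic, using the table's 500-800 byte range and decimal megabytes (1 MB = 1,000,000 bytes); the class and method names are illustrative only:

```java
// Back-of-the-envelope check of the bandwidth figures above.
final class TelemetryOverhead {

    // Extra bandwidth in MB/s (decimal megabytes) for a given per-span overhead and span rate.
    static double megabytesPerSecond(long extraBytesPerSpan, long spansPerSecond) {
        return (extraBytesPerSpan * spansPerSecond) / 1_000_000.0;
    }

    public static void main(String[] args) {
        long spansPerSecond = 10_000;
        System.out.println(megabytesPerSecond(500, spansPerSecond)); // 5.0
        System.out.println(megabytesPerSecond(800, spansPerSecond)); // 8.0
    }
}
```

Sustained over a day (86,400 seconds), that is roughly 432 to 691 GB of additional uncompressed data.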
Pros
- Agent-actionable: every span carries enough context for an AI agent to reason about it without additional lookups
- Self-contained: no need to join span data with external metadata at query time
- Standard OTel: uses native span attributes, compatible with any OTel backend
Cons
- Redundant data: topology attributes (team, tier, domain, on_call_channel) are identical for every span from the same service. They are already available as OTel Resource attributes — duplicating them on every span wastes bandwidth and storage.
- Storage cost at scale: at high throughput, the extra bytes per span compound significantly in backends like Jaeger, Tempo, or Elasticsearch
- Collector/exporter bandwidth: more data per span means more network traffic to the collector
- Query performance: larger spans can slow down trace search and analysis in some backends
Future Direction
- Move topology to Resource attributes (near-term): OTel Resource attributes are set once per service instance and associated with every span by the SDK, rather than repeated on each span. Topology (team, tier, domain, on_call_channel) belongs here. This removes 4 redundant attributes from every span.
- Selective enrichment (near-term): Not every span needs decision metadata. Enrich only spans for operations that have registered @AgentOperation metadata. Internal framework spans (Spring dispatcher, Tomcat, etc.) should only get topology via Resource.
- Enrichment at the collector (medium-term): Instead of adding attributes at the SDK level, an OTel Collector processor could enrich spans server-side using external configuration. This moves the cost from the application to the collector infrastructure and allows centralized management.
- Sampling-aware enrichment (long-term): If spans are going to be sampled away, there's no point enriching them. Integrate with OTel's sampling decisions to skip enrichment for spans that won't be exported.
- Compression: Some backends compress span data. The text-heavy attributes (runbook URLs, fallback descriptions) compress well. Measure actual wire-size impact rather than assuming worst case.
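The selective and sampling-aware ideas above reduce to a simple gate in front of the enrichment step. The sketch below is a hypothetical policy class, not AgentTel code, and it deliberately avoids the OpenTelemetry API: the sampled flag stands in for whatever sampling decision the SDK exposes, and the operation names are made up for illustration.

```java
import java.util.Set;

// Hypothetical policy combining selective and sampling-aware enrichment.
final class EnrichmentPolicy {
    private final Set<String> operationsWithMetadata;

    EnrichmentPolicy(Set<String> operationsWithMetadata) {
        this.operationsWithMetadata = Set.copyOf(operationsWithMetadata);
    }

    // Decision attributes are added only to sampled spans whose operation
    // has registered metadata; everything else relies on Resource attributes.
    boolean shouldAddDecisionAttributes(String operationName, boolean sampled) {
        return sampled && operationsWithMetadata.contains(operationName);
    }
}
```

Under this policy, an internal framework span or an unsampled span would carry no decision attributes at all, so the per-span cost is paid only where an agent can use it.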
Summary: Evolution Path
| Current (v0.1) | Near-term (v0.2) | Long-term (v1.0) |
|---|---|---|
| Library dependency | ✅ Library + javaagent mode | + OTel Collector processor |
| Annotations in code | ✅ YAML config (no code) | Service catalog integration |
| All attrs on every span | ✅ Topology as Resource attrs | Selective + collector-side |
| Full annotation params | ✅ Operation profiles | Convention-over-config |
Implemented in v0.2:
- agenttel.operations YAML config — define per-operation baselines and decision metadata entirely in application.yml, making @AgentOperation annotations optional. When both are present, YAML config takes priority.
- Topology moved to OTel Resource attributes — agenttel.topology.* attributes are set once per service instance via AgentTelResourceProvider (OTel SPI), no longer duplicated on every span.
- agenttel-javaagent-extension module — zero-code OTel javaagent extension. Drop into -Dotel.javaagent.extensions path with an agenttel.yml config file. No Spring dependency, works with any JVM app.
- Operation profiles — reusable sets of operational defaults (agenttel.profiles in YAML, @AgentOperation(profile = "...") in annotations). Define common patterns once, reference from operations.
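To make the profile idea concrete, here is a minimal sketch of how a named profile might supply defaults that an individual operation can override. Profile, ProfileRegistry, and the single-field override are assumptions for illustration; the actual agenttel.profiles schema and resolution rules are not described in this document.

```java
import java.util.Map;

// Hypothetical profile model: a named bundle of operational defaults.
record Profile(boolean retryable, String escalationLevel, boolean safeToRestart) {}

final class ProfileRegistry {
    private final Map<String, Profile> profiles;

    ProfileRegistry(Map<String, Profile> profiles) {
        this.profiles = Map.copyOf(profiles);
    }

    // Start from the named profile, then apply a per-operation escalation override if one is given.
    Profile resolve(String profileName, String escalationOverride) {
        Profile base = profiles.get(profileName);
        if (base == null) {
            throw new IllegalArgumentException("Unknown profile: " + profileName);
        }
        return escalationOverride == null
                ? base
                : new Profile(base.retryable(), escalationOverride, base.safeToRestart());
    }
}
```

Failing fast on an unknown profile name is a design choice in this sketch: a typo in a profile reference surfaces at startup instead of silently producing spans without decision metadata.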
The core principle remains: telemetry should carry enough context for AI agents to reason and act autonomously. The question is how that context gets into the telemetry — and the answer should evolve from "developers annotate code" to "the platform injects it automatically."