Design Considerations¶

This document captures open design questions, trade-offs, and future direction for AgentTel's instrumentation approach. It is intended to guide the next iteration of the library.

1. Library Dependency vs. Zero-Code Agent¶

Current State¶

Users add agenttel-spring-boot-starter as a compile-time dependency, configure via application.yml, and optionally annotate code with @AgentOperation.

Pros¶

Deep integration: annotations capture application-specific knowledge (retryable, idempotent, runbook URLs) that no external agent could infer
Type-safe: compile-time checks on attribute names, enum values, annotation parameters
Auto-configuration: Spring Boot starter makes setup low-friction for Spring apps
No sidecar or extra process: runs in-process alongside the application

Cons¶

Requires code changes: developers must add the dependency and (optionally) annotations — this is more friction than vendor agents (Datadog, New Relic) which attach as javaagents at runtime with zero code changes
Tied to build lifecycle: upgrading AgentTel requires a code change, rebuild, and redeploy
Not language-agnostic: the library approach is Java/JVM-specific; each language needs its own implementation

Future Direction¶

Consider offering a hybrid model: - Javaagent mode: A javaagent that attaches at runtime and auto-discovers topology from OTel resource attributes, Spring bean metadata, and HTTP route patterns. This would provide topology enrichment with zero code changes. - Library mode (current): For teams that want the full annotation-driven experience with compile-time safety - OTel Collector processor: A server-side component (written in Go) that enriches spans at the collector level using external configuration. This would be fully language-agnostic.

2. Operational Knowledge Hardcoded in Source Code¶

Current State¶

@AgentOperation annotations embed operational metadata directly in source code:

@AgentOperation(
    expectedLatencyP50 = "45ms",
    expectedLatencyP99 = "200ms",
    retryable = true,
    runbookUrl = "https://wiki/runbooks/process-payment",
    escalationLevel = EscalationLevel.PAGE_ONCALL,
    safeToRestart = false
)

Pros¶

Co-located: operational knowledge lives next to the code it describes, making it easy to discover
Versioned: annotation values are tracked in git alongside the code
IDE support: autocomplete, refactoring, and compile-time validation
Works today: no external infrastructure needed

Cons¶

Operational metadata changes more frequently than code: runbook URLs move, SLO targets shift, escalation policies evolve — but code doesn't get redeployed for operational changes
Wrong audience: developers write the code, but SREs and platform teams own the operational context (runbooks, escalation, SLOs). Annotations force developers to encode knowledge they may not have.
Stale data risk: if a runbook URL changes and nobody updates the annotation, agents act on outdated information
Coupling: application code becomes coupled to operational concerns

Future Direction¶

Externalize operational metadata so it can change without code deployments:

YAML/config-driven approach (near-term): Extend application.yml to accept per-operation decision metadata, similar to how topology and dependencies are already configured:
```
agenttel:
  operations:
    "POST /api/payments":
      retryable: true
      runbook-url: https://wiki/runbooks/process-payment
      escalation-level: page_oncall
```
This keeps metadata in config, deployable via ConfigMap or config service without code changes.
Service catalog integration (medium-term): Pull operational metadata from an external service catalog (Backstage, OpsLevel, Cortex) at startup or on a refresh interval. The catalog becomes the single source of truth.
Inferred from observed data (long-term): Baselines should come from actual traffic patterns (the RollingBaselineProvider already does this). Decision metadata like "retryable" could potentially be inferred from retry patterns in traces.
Annotations as optional overrides: Keep annotations for cases where developers want to explicitly declare operational intent, but make them optional — the system works fully from external config.

3. Source Code Bloat from Instrumentation¶

Current State¶

Each annotated endpoint adds 5-10 lines of annotation metadata:

@AgentOperation(
    expectedLatencyP50 = "45ms",
    expectedLatencyP99 = "200ms",
    expectedErrorRate = 0.001,
    retryable = true,
    idempotent = true,
    runbookUrl = "https://wiki/runbooks/process-payment",
    fallbackDescription = "Returns cached pricing",
    escalationLevel = EscalationLevel.PAGE_ONCALL,
    safeToRestart = false
)
@PostMapping
public ResponseEntity<PaymentResult> processPayment(...) {

Pros¶

Explicit: every enrichment is visible and intentional
Discoverable: grep for @AgentOperation to find all instrumented endpoints
Familiar: follows the same pattern as @Transactional, @Cacheable, etc.

Cons¶

Visual noise: a service with 50+ endpoints would have hundreds of lines of annotation metadata
Repetitive: many operations share similar operational profiles (same escalation level, same retry policy)
Maintenance burden: more annotation parameters = more things to keep current
Barrier to adoption: teams may resist adding "yet another annotation" to their controllers

Future Direction¶

Config-driven (no annotations needed): As described in section 2, move to YAML configuration. Zero source code changes needed.
Profile-based annotations: Define operation profiles (e.g., @AgentOperation(profile = "critical-write")) that map to a set of defaults, reducing per-method annotation verbosity.
Convention-over-configuration: Auto-derive operational defaults from existing metadata. For example, a POST endpoint is likely not idempotent, a GET is likely retryable.
Auto-detection: The AgentTelAnnotationBeanPostProcessor already scans Spring MVC annotations. It could go further — inferring operation names without any AgentTel-specific annotations at all.

4. Span Size and Telemetry Overhead¶

Current State¶

AgentTel adds up to 15 extra attributes per span:

Category	Attributes	Approx. Bytes
Topology	team, tier, domain, on_call_channel	~150
Baseline	latency_p50_ms, latency_p99_ms, error_rate, source	~120
Decision	retryable, idempotent, runbook_url, fallback_available, fallback_description, escalation_level, safe_to_restart	~300
Total	15 attributes	~500-800 bytes/span

At 10K spans/second, this is an additional 5-8 MB/s of telemetry data.

Pros¶

Agent-actionable: every span carries enough context for an AI agent to reason about it without additional lookups
Self-contained: no need to join span data with external metadata at query time
Standard OTel: uses native span attributes, compatible with any OTel backend

Cons¶

Redundant data: topology attributes (team, tier, domain, on_call_channel) are identical for every span from the same service. They are already available as OTel Resource attributes — duplicating them on every span wastes bandwidth and storage.
Storage cost at scale: at high throughput, the extra bytes per span compound significantly in backends like Jaeger, Tempo, or Elasticsearch
Collector/exporter bandwidth: more data per span means more network traffic to the collector
Query performance: larger spans can slow down trace search and analysis in some backends

Future Direction¶

Move topology to Resource attributes (near-term): OTel Resource attributes are set once per service instance and attached to every span by the SDK — not repeated per span. Topology (team, tier, domain, on_call_channel) belongs here. This removes 4 redundant attributes from every span.
Selective enrichment (near-term): Not every span needs decision metadata. Enrich only spans for operations that have registered @AgentOperation metadata. Internal framework spans (Spring dispatcher, Tomcat, etc.) should only get topology via Resource.
Enrichment at the collector (medium-term): Instead of adding attributes at the SDK level, an OTel Collector processor could enrich spans server-side using external configuration. This moves the cost from the application to the collector infrastructure and allows centralized management.
Sampling-aware enrichment (long-term): If spans are going to be sampled away, there's no point enriching them. Integrate with OTel's sampling decisions to skip enrichment for spans that won't be exported.
Compression: Some backends compress span data. The text-heavy attributes (runbook URLs, fallback descriptions) compress well. Measure actual wire-size impact rather than assuming worst case.

Summary: Evolution Path¶

Current (v0.1)                    Near-term (v0.2)                 Long-term (v1.0)
─────────────────────────────────────────────────────────────────────────────────────
Library dependency        →   ✅ Library + javaagent mode   →   + OTel Collector processor
Annotations in code       →   ✅ YAML config (no code)     →   Service catalog integration
All attrs on every span   →   ✅ Topology as Resource attrs →   Selective + collector-side
Full annotation params    →   ✅ Operation profiles         →   Convention-over-config

Implemented in v0.2: - agenttel.operations YAML config — define per-operation baselines and decision metadata entirely in application.yml, making @AgentOperation annotations optional. When both are present, YAML config takes priority. - Topology moved to OTel Resource attributes — agenttel.topology.* attributes are set once per service instance via AgentTelResourceProvider (OTel SPI), no longer duplicated on every span. - agenttel-javaagent-extension module — zero-code OTel javaagent extension. Drop into -Dotel.javaagent.extensions path with an agenttel.yml config file. No Spring dependency, works with any JVM app. - Operation profiles — reusable sets of operational defaults (agenttel.profiles in YAML, @AgentOperation(profile = "...") in annotations). Define common patterns once, reference from operations.

The core principle remains: telemetry should carry enough context for AI agents to reason and act autonomously. The question is how that context gets into the telemetry — and the answer should evolve from "developers annotate code" to "the platform injects it automatically."