Agent Layer

The agenttel-agent module provides the interface layer between AI agents and your application's telemetry. It packages real-time system state into structured formats that LLMs can consume, and provides tools for agents to query health, diagnose incidents, and execute remediation.


Overview

graph TB
    Agent["AI Agent / LLM<br/><small>Claude, GPT, custom agent</small>"]

    subgraph MCP["MCP Server (9 tools)"]
        Tools1["get_service_health · get_incident_context<br/>list_remediation_actions · execute_remediation<br/>get_recent_agent_actions"]
        Tools2["get_slo_report · get_executive_summary<br/>get_trend_analysis · get_cross_stack_context"]
    end

    ACP["AgentContextProvider<br/><small>Single entry point for all agent queries</small>"]

    subgraph Components["Components"]
        HA["Health<br/>Aggregator"]
        IC["Incident<br/>Context"]
        RR["Remediation<br/>Registry"]
        AT["Action<br/>Tracker"]
        CF["Context<br/>Formatter"]
    end

    subgraph Reporting["Reporting"]
        SLO["SLO Report<br/>Generator"]
        TR["Trend<br/>Analyzer"]
        ES["Executive<br/>Summary"]
        CS["Cross-Stack<br/>Context"]
    end

    Agent -->|"JSON-RPC (MCP Protocol)"| MCP
    MCP --> ACP
    ACP --> HA
    ACP --> IC
    ACP --> RR
    ACP --> AT
    ACP --> CF
    ACP --> SLO
    ACP --> TR
    ACP --> ES
    ACP --> CS

    style Agent fill:#4338ca,stroke:#6366f1,color:#fff
    style MCP fill:#6366f1,stroke:#818cf8,color:#fff
    style ACP fill:#7c3aed,stroke:#a78bfa,color:#fff
    style Components fill:#1e1b4b,stroke:#4338ca,color:#e0e7ff
    style Reporting fill:#1e1b4b,stroke:#4338ca,color:#e0e7ff
    style HA fill:#818cf8,stroke:#a5b4fc,color:#1e1b4b
    style IC fill:#818cf8,stroke:#a5b4fc,color:#1e1b4b
    style RR fill:#818cf8,stroke:#a5b4fc,color:#1e1b4b
    style AT fill:#818cf8,stroke:#a5b4fc,color:#1e1b4b
    style CF fill:#818cf8,stroke:#a5b4fc,color:#1e1b4b
    style SLO fill:#818cf8,stroke:#a5b4fc,color:#1e1b4b
    style TR fill:#818cf8,stroke:#a5b4fc,color:#1e1b4b
    style ES fill:#818cf8,stroke:#a5b4fc,color:#1e1b4b
    style CS fill:#818cf8,stroke:#a5b4fc,color:#1e1b4b

MCP Server

The MCP (Model Context Protocol) server exposes telemetry as tools that AI agents can invoke over HTTP using the JSON-RPC 2.0 protocol.

Starting the Server

McpServer mcp = new AgentTelMcpServerBuilder()
    .port(8081)
    .contextProvider(agentContextProvider)
    .remediationExecutor(remediationExecutor)
    .build();

mcp.start();

Endpoints

| Endpoint | Method | Description |
|---|---|---|
| POST /mcp | JSON-RPC 2.0 | MCP tool listing and invocation |
| GET /health | HTTP | Server health check |

JSON-RPC Methods

| Method | Description |
|---|---|
| initialize | MCP handshake: returns protocol version and capabilities |
| tools/list | List all available tools with their schemas |
| tools/call | Invoke a specific tool with arguments |
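
As a rough illustration of the tools/call method, the sketch below builds a JSON-RPC 2.0 request body and posts it to the /mcp endpoint with the JDK's built-in HttpClient. The params shape (a tool name plus an arguments object) follows the MCP specification; the class name McpJsonRpcClient, its helper methods, and the localhost URL are assumptions made for this example and are not part of agenttel-agent.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical client helper, written only for this example.
class McpJsonRpcClient {

    /** Minimal escaping for backslashes and double quotes (control characters are not handled). */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds a JSON-RPC 2.0 "tools/call" request body with string-valued arguments. */
    static String toolsCall(int id, String toolName, Map<String, String> arguments) {
        StringJoiner args = new StringJoiner(",", "{", "}");
        // TreeMap gives a stable key order, which keeps the output predictable.
        for (Map.Entry<String, String> e : new TreeMap<>(arguments).entrySet()) {
            args.add("\"" + escape(e.getKey()) + "\":\"" + escape(e.getValue()) + "\"");
        }
        return "{\"jsonrpc\":\"2.0\",\"id\":" + id
            + ",\"method\":\"tools/call\",\"params\":{\"name\":\""
            + escape(toolName) + "\",\"arguments\":" + args + "}}";
    }

    /** POSTs the body to the /mcp endpoint and returns the raw response body. */
    static String post(String baseUrl, String body) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/mcp"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        return HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString())
            .body();
    }

    public static void main(String[] args) throws Exception {
        String body = toolsCall(1, "get_service_health", Map.of("format", "json"));
        System.out.println(body);
        // Uncomment when a server is listening (port 8081 is assumed here):
        // System.out.println(post("http://localhost:8081", body));
    }
}
```

A production client would normally use a JSON library instead of string concatenation; the hand-built body is used here only to keep the example dependency-free.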

Built-in Tools

get_service_health

Returns current service health including operation metrics, dependency status, and SLO budget.

Parameters:

| Name | Type | Required | Description |
|---|---|---|---|
| format | string | No | "text" (default) or "json" |

Example text output:

SERVICE: payment-service | STATUS: DEGRADED | 2025-01-15T14:30:00Z
OPERATIONS:
  POST /api/payments: err=5.2% p50=312ms p99=1200ms [ELEVATED]
  GET /api/prices: err=0.1% p50=12ms p99=45ms
DEPENDENCIES:
  postgres: err=0.0% avg=8ms
  stripe-api: err=12.3% avg=2100ms
SLOs:
  payment-availability: budget=22.0% burn=0.8x

get_incident_context

Returns a complete incident diagnosis for a specific operation — what's happening, what changed, what's affected, and what to do.

Parameters:

| Name | Type | Required | Description |
|---|---|---|---|
| operation_name | string | Yes | The operation to diagnose |

Example output:

=== INCIDENT inc-a3f2b1c4 ===
SEVERITY: HIGH
TIME: 2025-01-15T14:30:00Z
SUMMARY: POST /api/payments experiencing elevated error rate (5.2%)

## WHAT IS HAPPENING
Operation: POST /api/payments
Error Rate: 5.2% (baseline: 0.1%)
Latency P50: 312ms (baseline: 45ms)
Anomaly Score: 0.85
Service Health: DEGRADED
Patterns: ERROR_RATE_SPIKE

## WHAT CHANGED
Last Deploy: v2.1.0 at 2025-01-15T14:00:00Z
  [deployment] Deployed version v2.1.0 (2025-01-15T14:00:00Z)
  [config_change] Updated rate limit to 500 rps (2025-01-15T13:45:00Z)

## WHAT IS AFFECTED
Scope: operation_specific
User-Facing: YES
Affected Ops: POST /api/payments
Affected Deps: stripe-api
Affected Consumers: checkout-service

## SUGGESTED ACTIONS
Escalation: page_oncall
  - [HIGH] rollback_deployment: Rollback to previous version (NEEDS APPROVAL)
  - [MEDIUM] enable_circuit_breakers: Circuit break stripe-api

## SIMILAR PAST INCIDENTS
  inc-2024-dec-03: stripe-api timeout → Increased timeout to 10s

list_remediation_actions

Lists available remediation actions for a specific operation.

Parameters:

| Name | Type | Required | Description |
|---|---|---|---|
| operation_name | string | Yes | Operation to get actions for |

execute_remediation

Executes a remediation action. Actions that require approval must include the approved_by field.

Parameters:

| Name | Type | Required | Description |
|---|---|---|---|
| action_name | string | Yes | Name of the action to execute |
| reason | string | Yes | Reason for executing this action |
| approved_by | string | No | Required for actions needing approval |
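
The approval rule can be pictured as a small pre-check run before an action executes. The following is a minimal sketch, assuming that a call for an approval-gated action is rejected when approved_by is missing or blank; the ApprovalGate class is hypothetical, and the real executor may handle this case differently.

```java
import java.util.Optional;

// Hypothetical helper for illustration; not an AgentTel API.
class ApprovalGate {

    /**
     * Returns an empty Optional when the call may proceed,
     * or a rejection message explaining what is missing.
     */
    static Optional<String> check(boolean requiresApproval, String reason, String approvedBy) {
        if (reason == null || reason.isBlank()) {
            return Optional.of("reason is required");
        }
        if (requiresApproval && (approvedBy == null || approvedBy.isBlank())) {
            return Optional.of("approved_by is required for this action");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Approval-gated action without an approver: rejected.
        System.out.println(check(true, "Error rate spike after v2.1.0", null));
        // Auto-approved action: allowed without approved_by.
        System.out.println(check(false, "stripe-api error rate at 12%", null));
    }
}
```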

get_recent_agent_actions

Returns the audit trail of recent agent decisions and actions.

get_slo_report

Returns SLO compliance report across all tracked operations — budget remaining, burn rate, and compliance status.

Parameters:

| Name | Type | Required | Description |
|---|---|---|---|
| format | string | No | "text" (default) or "json" |

Example text output:

=== SLO REPORT ===
Generated: 2025-01-15T14:30:00Z
Total SLOs: 2

SUMMARY: 1 healthy, 1 at risk, 0 violated

  [HEALTHY] payment-availability
    Target: 99.90%  Actual: 99.95%  Budget: 50.0%  Burn: 0.5x  Requests: 10000  Failed: 5
  [AT_RISK] payment-latency-p99
    Target: 200ms  Actual: 312ms  Budget: 22.0%  Burn: 0.8x  Requests: 10000  Failed: 520
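
The Budget and Burn figures in the payment-availability row are consistent with the standard error-budget arithmetic, sketched below. This is an illustration under assumptions: the ErrorBudget class is hypothetical, and the library may compute these values over rolling windows or with a different formula (the latency row, for instance, is not reproduced by this calculation).

```java
// Hypothetical illustration of error-budget arithmetic; not an AgentTel API.
class ErrorBudget {

    /** Fraction of the error budget still unspent, clamped to the range [0, 1]. */
    static double remaining(double target, long totalRequests, long failedRequests) {
        double allowedFailures = totalRequests * (1.0 - target);
        if (allowedFailures <= 0.0) {
            // No traffic yet (or a 100% target): untouched unless something failed.
            return failedRequests == 0 ? 1.0 : 0.0;
        }
        double spent = failedRequests / allowedFailures;
        return Math.max(0.0, Math.min(1.0, 1.0 - spent));
    }

    /** Observed error rate divided by the error rate the target allows. */
    static double burnRate(double target, long totalRequests, long failedRequests) {
        if (totalRequests == 0 || target >= 1.0) {
            return 0.0;
        }
        double observed = (double) failedRequests / totalRequests;
        return observed / (1.0 - target);
    }

    public static void main(String[] args) {
        // payment-availability row: target 99.90%, 10000 requests, 5 failed.
        System.out.printf("budget=%.1f%% burn=%.1fx%n",
            remaining(0.999, 10_000, 5) * 100.0,
            burnRate(0.999, 10_000, 5));
    }
}
```

With a 99.90% target, 10000 requests allow 10 failures; 5 failures spend half of that allowance, which gives 50% budget remaining and a burn rate of 0.5x.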

get_executive_summary

Returns a high-level executive summary of service health (~300 tokens), optimized for LLM context windows.

Parameters: None.

Example output:

=== EXECUTIVE SUMMARY ===
Service: payment-service | Status: DEGRADED | 2025-01-15T14:30:00Z

STATUS: 1 operation degraded. POST /api/payments error rate elevated (5.2%).

TOP ISSUES:
  1. POST /api/payments: err=5.2% (baseline 0.1%), p50=312ms (baseline 45ms)

SLO BUDGET: 1/2 healthy, 1 at risk (payment-latency-p99: 22% remaining)

OPERATIONS: 2 tracked, 10,000 total requests

get_trend_analysis

Returns latency, error rate, and throughput trends for an operation over a time window with direction indicators.

Parameters:

| Name | Type | Required | Description |
|---|---|---|---|
| operation_name | string | Yes | Operation name to analyze trends for |
| window_minutes | string | No | Time window in minutes (default: "30") |

Example output:

=== TREND ANALYSIS: POST /api/payments ===
Window: 30 minutes | Samples: 12

LATENCY P50: 45ms → 312ms  ↑ RISING (+593%)
LATENCY P99: 200ms → 1200ms  ↑ RISING (+500%)
ERROR RATE: 0.1% → 5.2%  ↑ RISING (+5100%)
THROUGHPUT: 180 rpm → 165 rpm  ↓ FALLING (-8%)

ASSESSMENT: Operation is degrading. Latency and error rate are both rising sharply.
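
The percentages in this output are plain relative changes between the first and last values in the window. The sketch below reproduces that arithmetic; the TrendDirection class, the STABLE label, and the 5% stability band are assumptions introduced for illustration, since the analyzer's actual thresholds are not documented here.

```java
// Hypothetical illustration of the direction indicators; not an AgentTel API.
class TrendDirection {

    /** Relative change from start to end, in percent. */
    static double percentChange(double start, double end) {
        if (start == 0.0) {
            throw new IllegalArgumentException("start must be non-zero");
        }
        return (end - start) / start * 100.0;
    }

    /** Labels the change; changes smaller than the band count as STABLE (assumed label). */
    static String label(double start, double end, double stableBandPercent) {
        double pct = percentChange(start, end);
        if (Math.abs(pct) < stableBandPercent) {
            return "STABLE";
        }
        return pct > 0 ? "RISING" : "FALLING";
    }

    public static void main(String[] args) {
        // Latency P50 from the example: 45ms to 312ms.
        System.out.println(label(45, 312, 5.0) + " " + Math.round(percentChange(45, 312)) + "%");
        // Throughput from the example: 180 rpm to 165 rpm.
        System.out.println(label(180, 165, 5.0) + " " + Math.round(percentChange(180, 165)) + "%");
    }
}
```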

get_cross_stack_context

Returns correlated frontend and backend context for an operation — traces the full user-to-database path when agenttel-web is connected.

Parameters:

| Name | Type | Required | Description |
|---|---|---|---|
| operation_name | string | Yes | Backend operation name to get cross-stack context for |

Example output (with frontend connected):

=== CROSS-STACK CONTEXT: POST /api/payments ===

## FRONTEND (User Experience)
  Route: /checkout/payment
  Page Load P50: 850ms (baseline: 800ms)
  API Call P50: 520ms (baseline: 300ms)
  Journey: checkout (step 4/5)
  Funnel Health: 62% completion (baseline: 65%)
  Anomalies: slow_page_load
  Affected Users: ~120 in last 15 min

## BACKEND (payment-service)
  Operation: POST /api/payments
  Error Rate: 5.2% (baseline: 0.1%)
  Latency P50: 312ms (baseline: 45ms)
  Deviation: ELEVATED

## SLO STATUS
  payment-availability: 99.95% (target: 99.9%) budget=50.0%
  payment-latency-p99: 312ms (target: 200ms) budget=22.0%

## CORRELATION
  Frontend → Backend trace linking: active
  Browser trace IDs correlated with backend spans via W3C Trace Context

Example output (without frontend):

## FRONTEND (User Experience)
  Status: No frontend telemetry connected
  Note: Connect agenttel-web SDK to enable cross-stack correlation.
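
The correlation noted in the connected example relies on the W3C Trace Context traceparent header, which carries the trace ID from the browser to the backend. As a minimal sketch of what that header contains, the parser below handles version 00 only; the TraceParent class is hypothetical and is not how AgentTel itself reads the header (OpenTelemetry propagators normally do this).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for illustration; handles traceparent version 00 only.
class TraceParent {

    record Ids(String traceId, String parentSpanId, boolean sampled) {}

    // version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags
    private static final Pattern FORMAT =
        Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    static Optional<Ids> parse(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String traceId = m.group(1);
        String parentSpanId = m.group(2);
        // The specification treats all-zero IDs as invalid.
        if (traceId.chars().allMatch(c -> c == '0')
                || parentSpanId.chars().allMatch(c -> c == '0')) {
            return Optional.empty();
        }
        // Bit 0 of the flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) == 1;
        return Optional.of(new Ids(traceId, parentSpanId, sampled));
    }

    public static void main(String[] args) {
        System.out.println(parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"));
    }
}
```

When the frontend and backend spans share the same trace ID, they can be joined into a single user-to-database path.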

Registering Custom Tools

McpServer server = builder.build();

server.registerTool(
    new McpToolDefinition(
        "search_logs",
        "Search recent application logs for a pattern",
        Map.of("query", new ParameterDefinition("string", "Search query"),
               "timeframe", new ParameterDefinition("string", "Time range (e.g., '1h', '30m')")),
        List.of("query")
    ),
    args -> logService.search(args.get("query"), args.get("timeframe"))
);

Service Health Aggregation

ServiceHealthAggregator maintains real-time health metrics computed from span data.

Recording Metrics

ServiceHealthAggregator health = new ServiceHealthAggregator(rollingBaselines, sloTracker);

// Called from SpanProcessor or interceptor
health.recordSpan("POST /api/payments", 312.0, false);
health.recordDependencyCall("stripe-api", 2100.0, true);
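
To make the idea of computing health from span data concrete, here is a deliberately simplified in-memory aggregator. It is a sketch under assumptions: MiniHealthAggregator is a hypothetical class, it treats the third argument as an error flag, and it keeps every sample forever, whereas a real aggregator would presumably use rolling windows.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, unbounded aggregator for illustration; not the AgentTel implementation.
class MiniHealthAggregator {

    private static final class Stats {
        final List<Double> latenciesMs = new ArrayList<>();
        long errors;
    }

    private final Map<String, Stats> byOperation = new ConcurrentHashMap<>();

    /** Records one finished span for the given operation. */
    void recordSpan(String operation, double latencyMs, boolean error) {
        Stats s = byOperation.computeIfAbsent(operation, k -> new Stats());
        synchronized (s) {
            s.latenciesMs.add(latencyMs);
            if (error) {
                s.errors++;
            }
        }
    }

    /** Errors divided by total spans; 0.0 when nothing has been recorded. */
    double errorRate(String operation) {
        Stats s = byOperation.get(operation);
        if (s == null) {
            return 0.0;
        }
        synchronized (s) {
            return s.latenciesMs.isEmpty() ? 0.0 : (double) s.errors / s.latenciesMs.size();
        }
    }

    /** Nearest-rank median latency; NaN when nothing has been recorded. */
    double p50Ms(String operation) {
        Stats s = byOperation.get(operation);
        if (s == null) {
            return Double.NaN;
        }
        List<Double> sorted;
        synchronized (s) {
            sorted = new ArrayList<>(s.latenciesMs);
        }
        if (sorted.isEmpty()) {
            return Double.NaN;
        }
        Collections.sort(sorted);
        int rank = (int) Math.ceil(0.5 * sorted.size()); // 1-based rank
        return sorted.get(rank - 1);
    }

    public static void main(String[] args) {
        MiniHealthAggregator health = new MiniHealthAggregator();
        health.recordSpan("POST /api/payments", 312.0, false);
        health.recordSpan("POST /api/payments", 1200.0, true);
        System.out.println(health.errorRate("POST /api/payments"));
        System.out.println(health.p50Ms("POST /api/payments"));
    }
}
```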

Querying Health

// Full service summary
ServiceHealthSummary summary = health.getHealthSummary("payment-service");
// summary.status()       → DEGRADED
// summary.operations()   → List<OperationSummary>
// summary.dependencies() → List<DependencySummary>

// Single operation
Optional<OperationSummary> op = health.getOperationHealth("POST /api/payments");
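// The result is an Optional, so unwrap it before reading fields:
// op.map(OperationSummary::errorRate)       → Optional[0.052]
// op.map(OperationSummary::latencyP50Ms)    → Optional[312.0]
// op.map(OperationSummary::deviationStatus) → Optional[elevated]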

Health Status Determination

| Condition | Status |
|---|---|
| Any SLO with < 10% budget remaining | CRITICAL |
| Any operation with > 10% error rate (100+ requests) | CRITICAL |
| Any operation with > 1% error rate (100+ requests) | DEGRADED |
| Any dependency with > 50% error rate (10+ calls) | DEGRADED |
| Any SLO with < 50% budget remaining | DEGRADED |
| None of the above | HEALTHY |
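
Read top to bottom, the table behaves like a rule chain in which CRITICAL conditions are checked before DEGRADED ones. The sketch below encodes that reading; the HealthStatusRules class and its record types are hypothetical, and the precedence order and the interpretation of "100+" as "at least 100" are assumptions.

```java
import java.util.List;

// Hypothetical encoding of the rules above; not the AgentTel implementation.
class HealthStatusRules {

    enum Status { HEALTHY, DEGRADED, CRITICAL }

    // Rates and budgets are fractions: 0.052 means 5.2%.
    record OperationStats(double errorRate, long requests) {}
    record DependencyStats(double errorRate, long calls) {}

    static Status evaluate(List<Double> sloBudgetsRemaining,
                           List<OperationStats> operations,
                           List<DependencyStats> dependencies) {
        // CRITICAL rules are evaluated first so they win over DEGRADED ones.
        boolean critical =
            sloBudgetsRemaining.stream().anyMatch(b -> b < 0.10)
            || operations.stream().anyMatch(o -> o.requests() >= 100 && o.errorRate() > 0.10);
        if (critical) {
            return Status.CRITICAL;
        }
        boolean degraded =
            operations.stream().anyMatch(o -> o.requests() >= 100 && o.errorRate() > 0.01)
            || dependencies.stream().anyMatch(d -> d.calls() >= 10 && d.errorRate() > 0.50)
            || sloBudgetsRemaining.stream().anyMatch(b -> b < 0.50);
        return degraded ? Status.DEGRADED : Status.HEALTHY;
    }

    public static void main(String[] args) {
        Status status = evaluate(
            List.of(0.50, 0.22),
            List.of(new OperationStats(0.052, 5_000), new OperationStats(0.001, 5_000)),
            List.of(new DependencyStats(0.0, 200), new DependencyStats(0.123, 200)));
        System.out.println(status);
    }
}
```

The minimum-sample guards (100 requests, 10 calls) keep a handful of failures on a low-traffic operation from flipping the whole service status.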

Incident Context Builder

IncidentContextBuilder assembles a complete incident package from current system state.

Structure

Every IncidentContext contains four sections designed for LLM reasoning:

| Section | Record | Contents |
|---|---|---|
| What Is Happening | WhatIsHappening | Operation name, current vs baseline metrics, detected patterns, anomaly score |
| What Changed | WhatChanged | Recent deployments, config changes, with timestamps |
| What Is Affected | WhatIsAffected | Affected operations, dependencies, consumers, impact scope, user-facing flag |
| What To Do | WhatToDo | Runbook URL, escalation level, suggested actions with confidence and approval requirements |

Each IncidentContext also carries a severity (LOW, MEDIUM, HIGH, or CRITICAL) and a list of similar past incidents.

Severity Determination

| Condition | Severity |
|---|---|
| Service health is CRITICAL | CRITICAL |
| Cascade failure pattern detected | CRITICAL |
| Error rate > 10% | HIGH |
| Service health is DEGRADED | MEDIUM |
| Default | LOW |
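
Taken in order, these rows amount to a first-match-wins chain. A minimal sketch of that reading follows; the SeverityRules class and its enums are hypothetical, and the top-down evaluation order is an assumption based on the table layout.

```java
// Hypothetical encoding of the severity table; not the AgentTel implementation.
class SeverityRules {

    enum Health { HEALTHY, DEGRADED, CRITICAL }
    enum Severity { LOW, MEDIUM, HIGH, CRITICAL }

    /** errorRate is a fraction: 0.12 means 12%. The first matching rule wins. */
    static Severity determine(Health health, boolean cascadeFailureDetected, double errorRate) {
        if (health == Health.CRITICAL || cascadeFailureDetected) {
            return Severity.CRITICAL;
        }
        if (errorRate > 0.10) {
            return Severity.HIGH;
        }
        if (health == Health.DEGRADED) {
            return Severity.MEDIUM;
        }
        return Severity.LOW;
    }

    public static void main(String[] args) {
        System.out.println(determine(Health.DEGRADED, false, 0.12));
        System.out.println(determine(Health.HEALTHY, true, 0.0));
    }
}
```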

Change Tracking

IncidentContextBuilder builder = new IncidentContextBuilder(
    healthAggregator, topology, rollingBaselines, remediationRegistry);

// Record changes for correlation
builder.recordDeployment("v2.1.0", "2025-01-15T14:00:00Z");
builder.recordConfigChange("Updated rate limit to 500 rps");

// Record historical incidents for pattern matching
builder.recordHistoricalIncident("inc-2024-dec-03", "2024-12-03T10:00:00Z",
    "Increased timeout to 10s", "stripe-api timeout");

Remediation Framework

Defining Actions

RemediationAction rollback = RemediationAction.builder("rollback_deployment", "POST /api/payments")
    .description("Rollback to previous known-good version")
    .type(RemediationAction.ActionType.ROLLBACK)
    .requiresApproval(true)
    .command("kubectl rollout undo deployment/payment-service")
    .build();

RemediationAction circuitBreak = RemediationAction.builder("circuit_break_stripe", "POST /api/payments")
    .description("Enable circuit breaker on stripe-api dependency")
    .type(RemediationAction.ActionType.CIRCUIT_BREAKER)
    .requiresApproval(false)
    .build();

Registering Actions

RemediationRegistry registry = new RemediationRegistry();

// Operation-specific actions
registry.register(rollback);
registry.register(circuitBreak);

// Global actions (apply to all operations)
registry.registerGlobal(RemediationAction.builder("enable_debug_logging", "*")
    .description("Enable DEBUG logging for 5 minutes")
    .type(RemediationAction.ActionType.CUSTOM)
    .requiresApproval(false)
    .build());

Executing Actions

RemediationExecutor executor = new RemediationExecutor(registry, actionTracker);

// Auto-approved action
RemediationResult result = executor.execute("circuit_break_stripe", "stripe-api error rate at 12%");

// Action requiring approval
RemediationResult approvedResult = executor.executeApproved(
    "rollback_deployment",
    "Error rate spike after v2.1.0 deployment",
    "oncall-engineer@company.com"
);

Action Types

| Type | Description |
|---|---|
| RESTART | Rolling restart of service instances |
| SCALE | Horizontal or vertical scaling |
| ROLLBACK | Deployment rollback |
| CIRCUIT_BREAKER | Enable/modify circuit breaker |
| RATE_LIMIT | Adjust rate limiting |
| CACHE_FLUSH | Flush application caches |
| CUSTOM | Domain-specific action |

Agent Action Tracking

Every decision and action taken by an AI agent is recorded as an OpenTelemetry span for full auditability.

Recording Actions

AgentActionTracker tracker = new AgentActionTracker(openTelemetry);

// Simple action record
tracker.recordAction("scale_up", "High latency detected",
    Map.of("instances", "3", "reason", "p50 > 2x baseline"));

// Decision with rationale
tracker.recordDecision(
    "response_strategy",
    "Error rate rising but not critical — prefer conservative approach",
    "increase_timeout",
    List.of("increase_timeout", "add_retry", "circuit_break", "rollback")
);

// Traced action (captures success/failure)
String result = tracker.traceAction("compute_recommendation", "Need action plan", () -> {
    // Complex computation...
    return "scale_up";
});

Span Attributes

Each tracked action creates a span with:

| Attribute | Description |
|---|---|
| agenttel.agent.action.name | Action identifier |
| agenttel.agent.action.reason | Why the action was taken |
| agenttel.agent.action.status | "completed", "success", or "failed" |
| agenttel.agent.action.type | "action", "decision", or "traced_action" |
| agenttel.agent.decision.rationale | Reasoning (for decisions) |
| agenttel.agent.decision.chosen | Selected option (for decisions) |
| agenttel.agent.decision.options | All options considered (for decisions) |

Context Formatters

ContextFormatter produces prompt-optimized output in multiple formats, each designed for a specific context window budget.

Compact Health (~200 tokens)

Use as a system prompt prefix or quick status check.

String compact = ContextFormatter.formatHealthCompact(healthSummary);

Full Incident (~800 tokens)

Use when an agent needs to diagnose and act on an incident.

String full = ContextFormatter.formatIncidentFull(incidentContext);

Compact Incident (~100 tokens)

Use for notifications or alert summaries.

String brief = ContextFormatter.formatIncidentCompact(incidentContext);

JSON Health

Use for structured tool results that agents can parse.

String json = ContextFormatter.formatHealthAsJson(healthSummary);

AgentContextProvider

The single entry point for all agent queries. Wires together all components, including the reporting layer.

AgentContextProvider provider = new AgentContextProvider(
    healthAggregator,
    incidentContextBuilder,
    remediationRegistry,
    topology,
    patternMatcher,
    rollingBaselines,
    actionTracker
);

// Wire in reporting components
provider.setReportingComponents(
    sloReportGenerator,
    trendAnalyzer,
    executiveSummaryBuilder,
    crossStackContextBuilder
);

// Core queries
String health = provider.getHealthSummary();
String incident = provider.getIncidentContext("POST /api/payments");
String actions = provider.getAvailableActions("POST /api/payments");
String audit = provider.getRecentActions();

// Reporting queries
String sloReport = provider.getSloReport();
String trends = provider.getTrendAnalysis("POST /api/payments", 30);
String executive = provider.getExecutiveSummary();
String crossStack = provider.getCrossStackContext("POST /api/payments");

// JSON for structured tool results
String healthJson = provider.getHealthSummaryJson();

// Raw objects for programmatic access
IncidentContext ctx = provider.getIncidentContextObject("POST /api/payments");

Integration with Spring Boot

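In a Spring Boot application, the agent layer can be wired up with a standard @Configuration class:

import java.io.IOException;

import io.opentelemetry.api.OpenTelemetry;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
// AgentTel imports are omitted here; use the packages of your AgentTel version.

// Note: mcpServer(...) also needs AgentContextProvider and RemediationExecutor
// beans, which must be defined elsewhere in the application context.
@Configuration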
public class AgentConfig {

    @Bean
    public ServiceHealthAggregator serviceHealthAggregator(
            RollingBaselineProvider baselines, SloTracker sloTracker) {
        return new ServiceHealthAggregator(baselines, sloTracker);
    }

    @Bean
    public AgentActionTracker agentActionTracker(OpenTelemetry otel) {
        return new AgentActionTracker(otel);
    }

    @Bean
    public RemediationRegistry remediationRegistry() {
        RemediationRegistry registry = new RemediationRegistry();
        // Register your actions...
        return registry;
    }

    @Bean
    public McpServer mcpServer(AgentContextProvider provider,
                                RemediationExecutor executor) throws IOException {
        McpServer server = new AgentTelMcpServerBuilder()
            .port(8081)
            .contextProvider(provider)
            .remediationExecutor(executor)
            .build();
        server.start();
        return server;
    }
}