Skip to content

Anomaly Detection

Real-time deviation detection on both backend (JVM) and frontend (browser) — using z-score analysis, pattern matching, and rolling baselines.


Why Anomaly Detection for Agents?

Traditional alerting relies on static thresholds ("alert if error rate > 5%"). This creates two problems for AI agents:

  • Alert fatigue — thresholds are either too sensitive (noisy) or too loose (miss real issues)
  • No context — an alert says "threshold breached" but not "how abnormal is this compared to recent behavior?"

AgentTel's anomaly detection compares every span against what's actually normal for that operation, using rolling baselines learned from live traffic. An agent gets a continuous anomaly score, not a binary alert.


How It Works

Z-Score Detection

For every operation span, AgentTel computes a z-score — how many standard deviations the observed latency is from the rolling baseline mean:

z_score = (observed_latency - baseline_mean) / baseline_stddev

If abs(z_score) > threshold (default: 3.0), the span is classified as anomalous.

Z-Score Meaning Anomaly Score
< 3.0 Normal 0.0
3.0 Threshold (3 sigma) 0.75
4.0 Significant deviation 1.0
> 4.0 Extreme deviation 1.0 (capped)

The anomaly score is normalized to 0.0–1.0 using min(1.0, abs(z_score) / 4.0), giving agents a continuous severity signal.

Pattern Matching

Beyond individual span anomalies, the PatternMatcher detects higher-level incident patterns from accumulated span data:

Pattern Value How It's Detected
Cascade Failure cascade_failure Multiple dependent services failing simultaneously
Latency Degradation latency_degradation Sustained latency increase beyond baseline
Error Rate Spike error_rate_spike Sudden increase in error rate beyond baseline
Memory Leak memory_leak Steadily increasing latency with increasing error rate
Thundering Herd thundering_herd Sudden spike in request rate after recovery
Cold Start cold_start High latency on first requests after deployment

Pattern detection triggers an agenttel.anomaly.detected event with the pattern field set, which an agent can use to select the right playbook.


Baselines

Anomaly detection requires knowing what "normal" looks like. AgentTel provides three baseline sources, chained with fallback:

Learned automatically from live traffic using a lock-free ring buffer sliding window:

agenttel:
  baselines:
    source: rolling
    rolling-window-size: 1000    # samples per operation
    rolling-min-samples: 10      # minimum before baseline is valid

The RollingBaselineProvider tracks per-operation metrics: - P50 and P99 latency - Mean and standard deviation (used for z-score) - Error rate - Throughput (requests per second)

Cold Start

Until rolling-min-samples observations are collected, the rolling baseline is not considered valid. During this period, static baselines (from config) are used as fallback.

Static Baselines

Declared in YAML config or @AgentOperation annotations:

agenttel:
  operations:
    "[POST /api/payments]":
      expected-latency-p50: "45ms"
      expected-latency-p99: "200ms"
      expected-error-rate: 0.001

Default Baselines

If neither rolling nor static baselines are available, hardcoded defaults are used (P50=100ms, P99=500ms, error_rate=0.01). These are intentionally conservative to avoid false positives.

Resolution order: rolling baselines > static baselines > defaults


Configuration

agenttel:
  anomaly-detection:
    enabled: true          # master switch (default: true)
    z-score-threshold: 3.0 # deviation threshold (default: 3.0)
  baselines:
    source: rolling        # "rolling" or "static" (default: rolling)
    rolling-window-size: 1000
    rolling-min-samples: 10
Property Default Description
anomaly-detection.enabled true Enable/disable anomaly detection
anomaly-detection.z-score-threshold 3.0 Z-score threshold for anomaly classification
baselines.source rolling Primary baseline source
baselines.rolling-window-size 1000 Ring buffer size per operation
baselines.rolling-min-samples 10 Minimum samples before baseline is valid

See Configuration Reference for the full list.

Python Usage

The Python SDK provides the same anomaly detection:

from agenttel.anomaly.detector import AnomalyDetector
from agenttel.anomaly.patterns import PatternMatcher
from agenttel.baseline.rolling import RollingBaselineProvider

# Anomaly detector with configurable threshold
detector = AnomalyDetector(z_score_threshold=3.0)
result = detector.detect(current_value=312.0, baseline_mean=45.0, baseline_stddev=20.0)
# result.is_anomaly = True, result.z_score = 13.35, result.score = 1.0

# Rolling baselines learned from traffic
baselines = RollingBaselineProvider(window_size=1000, min_samples=10)
baselines.record("POST /api/payments", latency_ms=45.0, is_error=False)

# Pattern matching
matcher = PatternMatcher()
patterns = matcher.detect(
    current_latency=312.0, baseline_p50=45.0,
    current_error_rate=0.052, baseline_error_rate=0.001,
    failing_dependencies=["stripe-api", "postgres"]
)
# patterns = [AnomalyPattern.LATENCY_DEGRADATION, AnomalyPattern.CASCADE_FAILURE]

When using FastAPI integration, anomaly detection is automatic:

from agenttel.fastapi import instrument_fastapi

app = FastAPI()
instrument_fastapi(app)  # Anomaly detection runs on every span

Span Attributes

When an anomaly is detected, these attributes appear on the span:

Attribute Type Example Description
agenttel.anomaly.detected boolean true Whether this span is anomalous
agenttel.anomaly.score double 0.85 Severity score (0.0–1.0)
agenttel.anomaly.latency_z_score double 4.2 Z-score deviation
agenttel.anomaly.pattern string error_rate_spike Detected incident pattern (if any)

See Attribute Dictionary — Anomaly for the complete reference.


Events

Anomaly detection emits structured events via the OTel Logs API:

agenttel.anomaly.detected

Emitted when a span's latency z-score exceeds the threshold OR when the PatternMatcher detects an incident pattern.

{
  "operation": "POST /api/payments",
  "latency_ms": 312.0,
  "anomaly_score": 0.85,
  "z_score": 4.2
}

An agent subscribed to these events can immediately invoke get_incident_context to get the full diagnosis.

See Event Catalog — anomaly.detected for the full event specification.


Frontend Anomaly Detection

The browser SDK runs its own anomaly detection for user-experience patterns:

Pattern Detection Attributes
Rage clicks 3+ clicks on same target in 2s client.anomaly.pattern=rage_click
API failure cascade 3+ API failures in 5s client.anomaly.pattern=api_cascade
Slow page load Exceeds P99 baseline client.anomaly.pattern=slow_load
Error loop Same error 3+ times in 10s client.anomaly.pattern=error_loop

See Frontend Telemetry for the full browser SDK guide.


Further Reading