Sample investigation

This is a real investigation performed on RunLLM’s production infrastructure. Potentially sensitive information has been redacted.

Elevated 500 rate — /api/external-tools/65/config (Jira)

Triggered by RunLLM Predictive Issue Detection

Over the trailing 45 minutes, predictive detection measured HTTP 500s on GET /api/external-tools/65/config climbing from a near-zero baseline to 18–26% of requests in a 22-minute window (14:41–15:03 UTC). The spike was isolated to external tool 65 (Jira) and customer_id CUST-8291-X; other tools and tenants stayed flat over the same interval.
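For context on the numbers above: the fraction predictive detection tracks is simply 500-class responses over total requests per window. A toy computation with invented counts (not real traffic data):

```python
def error_fraction(status_counts: dict[int, int]) -> float:
    """Share of requests in a window that returned HTTP 500."""
    total = sum(status_counts.values())
    return status_counts.get(500, 0) / total if total else 0.0

# Invented per-window counts for illustration only
print(round(error_fraction({200: 74, 500: 26}), 2))  # 0.26
```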

Generate hypotheses


Hypothesis A — Schema Drift

Tool call: Grafana · Prometheus query

POST /api/ds/query HTTP/1.1
Host: grafana.runllm.internal
Content-Type: application/json
Cookie: grafana_session=***

{
  "queries": [
    {
      "datasource": { "type": "prometheus", "uid": "PBFA97CFB590B2093" },
      "expr": "sum(rate(http_server_requests_seconds_count{cluster=\"runllm-prod\",job=\"app/server\",route=\"/api/external-tools/65/config\",status=\"500\"}[5m]))",
      "refId": "A",
      "legendFormat": "500s — route"
    }
  ],
  "from": "now-2h",
  "to": "now"
}

Result (summary)

Non-zero 500 rate on the config read route beginning shortly after deploy marker 37ceda3; peak of ~20% of requests in the incident window vs a flat baseline the prior day.

Full data
{
  "results": {
    "A": {
      "frames": [
        {
          "schema": {
            "name": "http_server_requests_seconds_count",
            "fields": [
              { "name": "Time", "type": "time", "typeInfo": { "frame": "time.Time" } },
              { "name": "Value", "type": "number", "labels": { "route": "/api/external-tools/65/config", "status": "500" } }
            ]
          },
          "data": {
            "values": [
              [1704129600000,1704130200000,1704130800000,1704131400000,1704132000000],
              [0,0.02,0.19,0.24,0.21]
            ]
          }
        }
      ],
      "error": null
    }
  }
}
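If you script against this endpoint, note that Grafana data frames come back columnar: one array of timestamps, one of values. A small helper to flatten a /api/ds/query response shaped like the payload above:

```python
def extract_series(resp: dict, ref_id: str = "A") -> list[tuple[int, float]]:
    """Flatten a Grafana data-frame response into (timestamp_ms, value) pairs."""
    frame = resp["results"][ref_id]["frames"][0]
    timestamps, values = frame["data"]["values"]
    return list(zip(timestamps, values))

# Using the response body shown above:
resp = {
    "results": {
        "A": {
            "frames": [{
                "data": {
                    "values": [
                        [1704129600000, 1704130200000, 1704130800000,
                         1704131400000, 1704132000000],
                        [0, 0.02, 0.19, 0.24, 0.21],
                    ]
                }
            }],
            "error": None,
        }
    }
}

series = extract_series(resp)
peak = max(value for _, value in series)
print(peak)  # 0.24, matching the ~20% peak in the summary
```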

Generated report (5 actions)

Evidence sources
  • Grafana — request & route metrics (runllm-prod / app/server) tied to the deploy window and 500 rate

Full report

Alert summary

A GET /api/external-tools/65/config request in GKE (runllm-prod, app/server, us-central1) returned HTTP 500 due to a Pydantic validation failure when parsing the Jira external tool config read from Vault (surfaced via GCP Errors).

Investigation plan

Treat this as a schema-drift check: align the deploy of commit 37ceda3 with the error window, compare Vault payloads for tool 65 against the tightened JiraToolConfig model (ConfigDict(extra="forbid")), and use Grafana route metrics to confirm the blast radius matches a single read path.

Root cause hypothesis

The handler crashes when the payload returned by read_from_vault(...) is passed to JiraToolConfig(**data), whose model now sets ConfigDict(extra="forbid"). The raised pydantic_core.ValidationError lists the unrecognized keys jira_pat, api_version, and fast_thread_token. Commit 37ceda3 (PR #4275) tightened the JiraToolConfig schema; Grafana route metrics show a ~20% failure rate on the day of the deploy vs 0% the prior day.
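The failure mode reproduces in isolation. A minimal sketch, assuming Pydantic v2 and hypothetical field names for the tightened model (the real JiraToolConfig fields are not shown in this report):

```python
from pydantic import BaseModel, ConfigDict, ValidationError

class JiraToolConfig(BaseModel):
    # Mirrors the tightened schema's strictness; field names are illustrative.
    model_config = ConfigDict(extra="forbid")
    base_url: str
    project_key: str

# Legacy payload shape still stored in Vault (values redacted/invented)
legacy_payload = {
    "base_url": "https://example.atlassian.net",
    "project_key": "OPS",
    "jira_pat": "***",
    "api_version": "2",
    "fast_thread_token": "***",
}

try:
    JiraToolConfig(**legacy_payload)
except ValidationError as exc:
    # extra="forbid" rejects every key the model does not declare
    unknown = sorted(str(err["loc"][0]) for err in exc.errors()
                     if err["type"] == "extra_forbidden")
    print(unknown)  # ['api_version', 'fast_thread_token', 'jira_pat']
```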

Remediation suggestions

Implement backward-compatible read handling: map legacy Vault keys to the new fields, or migrate stored payloads to the updated schema before enforcing extra="forbid" on read paths.
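One way to implement the first option is a normalization step that runs before strict validation. A sketch with a hypothetical key mapping (the actual legacy-to-new field correspondence isn't specified in this report):

```python
# Hypothetical mapping from legacy Vault keys to new-schema fields
LEGACY_KEY_MAP = {
    "jira_pat": "api_token",
    "fast_thread_token": "thread_token",
}
# Keys the new schema intentionally dropped; safe to discard on read
DROPPED_KEYS = {"api_version"}

def normalize_vault_payload(data: dict) -> dict:
    """Rewrite legacy keys so a strict (extra="forbid") model accepts the payload."""
    normalized = {}
    for key, value in data.items():
        if key in DROPPED_KEYS:
            continue
        normalized[LEGACY_KEY_MAP.get(key, key)] = value
    return normalized

print(normalize_vault_payload(
    {"base_url": "https://example.atlassian.net", "jira_pat": "***", "api_version": "2"}
))
# {'base_url': 'https://example.atlassian.net', 'api_token': '***'}
```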

This is a static demo of RunLLM.
