Sample investigation

This is a real investigation performed on RunLLM’s production infrastructure. Potentially sensitive information has been redacted.

Elevated 500 rate — /api/external-tools/65/config (Jira)

Triggered by RunLLM Predictive Issue Detection

Over the trailing 45 minutes, predictive detection measured HTTP 500s on GET /api/external-tools/65/config climbing from a near-zero baseline to 18–26% of requests in a 22-minute window (14:41–15:03 UTC). The spike was isolated to external tool 65 (Jira) and customer_id CUST-8291-X; other tools and tenants stayed flat over the same interval.
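For context on the numbers above: the fraction predictive detection tracks is simply 500-class responses over total requests per window. A toy computation with invented counts (not real traffic data):

```python
def error_fraction(status_counts: dict[int, int]) -> float:
    """Share of requests in a window that returned HTTP 500."""
    total = sum(status_counts.values())
    return status_counts.get(500, 0) / total if total else 0.0

# Invented per-window counts for illustration only
print(round(error_fraction({200: 74, 500: 26}), 2))  # 0.26
```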

Generate hypotheses


Hypothesis A — Schema Drift

Tool call: Grafana · Prometheus query

POST /api/ds/query HTTP/1.1
Host: grafana.runllm.internal
Content-Type: application/json
Cookie: grafana_session=***

{
  "queries": [
    {
      "datasource": { "type": "prometheus", "uid": "PBFA97CFB590B2093" },
      "expr": "sum(rate(http_server_requests_seconds_count{cluster=\"runllm-prod\",job=\"app/server\",route=\"/api/external-tools/65/config\",status=\"500\"}[5m]))",
      "refId": "A",
      "legendFormat": "500s — route"
    }
  ],
  "from": "now-2h",
  "to": "now"
}

Result (summary)

Non-zero 500 rate on the config read route beginning shortly after deploy marker 37ceda3; peak of ~20% of requests in the incident window vs a flat baseline the prior day.

Full data
{
  "results": {
    "A": {
      "frames": [
        {
          "schema": {
            "name": "http_server_requests_seconds_count",
            "fields": [
              { "name": "Time", "type": "time", "typeInfo": { "frame": "time.Time" } },
              { "name": "Value", "type": "number", "labels": { "route": "/api/external-tools/65/config", "status": "500" } }
            ]
          },
          "data": {
            "values": [
              [1704129600000,1704130200000,1704130800000,1704131400000,1704132000000],
              [0,0.02,0.19,0.24,0.21]
            ]
          }
        }
      ],
      "error": null
    }
  }
}
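If you script against this endpoint, note that Grafana data frames come back columnar: one array of timestamps, one of values. A small helper to flatten a /api/ds/query response shaped like the payload above:

```python
def extract_series(resp: dict, ref_id: str = "A") -> list[tuple[int, float]]:
    """Flatten a Grafana data-frame response into (timestamp_ms, value) pairs."""
    frame = resp["results"][ref_id]["frames"][0]
    timestamps, values = frame["data"]["values"]
    return list(zip(timestamps, values))

# Using the response body shown above:
resp = {
    "results": {
        "A": {
            "frames": [{
                "data": {
                    "values": [
                        [1704129600000, 1704130200000, 1704130800000,
                         1704131400000, 1704132000000],
                        [0, 0.02, 0.19, 0.24, 0.21],
                    ]
                }
            }],
            "error": None,
        }
    }
}

series = extract_series(resp)
peak = max(value for _, value in series)
print(peak)  # 0.24, matching the ~20% peak in the summary
```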

Generated report (5 actions)

Evidence sources
  • Grafana — request & route metrics (runllm-prod / app/server) tied to the deploy window and 500 rate

Full report

Alert summary

A GET /api/external-tools/65/config request in GKE (runllm-prod, app/server, us-central1) returned HTTP 500 due to a Pydantic validation failure when parsing the Jira external tool config read from Vault (surfaced via GCP Errors).

Investigation plan

Treat this as a schema-drift check: align the deploy of commit 37ceda3 with the error window, compare Vault payloads for tool 65 against the tightened JiraToolConfig model (ConfigDict(extra="forbid")), and use Grafana route metrics to confirm the blast radius matches a single read path.

Root cause hypothesis

The handler crashes when the payload returned by read_from_vault(...) is passed to JiraToolConfig(**data), whose model now sets ConfigDict(extra="forbid"). The raised pydantic_core.ValidationError lists the unrecognized keys jira_pat, api_version, and fast_thread_token. Commit 37ceda3 (PR #4275) tightened the JiraToolConfig schema; Grafana route metrics show a ~20% failure rate on the day of the deploy vs 0% the prior day.
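The failure mode reproduces in isolation. A minimal sketch, assuming Pydantic v2 and hypothetical field names for the tightened model (the real JiraToolConfig fields are not shown in this report):

```python
from pydantic import BaseModel, ConfigDict, ValidationError

class JiraToolConfig(BaseModel):
    # Mirrors the tightened schema's strictness; field names are illustrative.
    model_config = ConfigDict(extra="forbid")
    base_url: str
    project_key: str

# Legacy payload shape still stored in Vault (values redacted/invented)
legacy_payload = {
    "base_url": "https://example.atlassian.net",
    "project_key": "OPS",
    "jira_pat": "***",
    "api_version": "2",
    "fast_thread_token": "***",
}

try:
    JiraToolConfig(**legacy_payload)
except ValidationError as exc:
    # extra="forbid" rejects every key the model does not declare
    unknown = sorted(str(err["loc"][0]) for err in exc.errors()
                     if err["type"] == "extra_forbidden")
    print(unknown)  # ['api_version', 'fast_thread_token', 'jira_pat']
```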

Remediation suggestions

Implement backward-compatible read handling: map legacy Vault keys to the new fields, or migrate stored payloads to the updated schema before enforcing extra="forbid" on read paths.
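One way to implement the first option is a normalization step that runs before strict validation. A sketch with a hypothetical key mapping (the actual legacy-to-new field correspondence isn't specified in this report):

```python
# Hypothetical mapping from legacy Vault keys to new-schema fields
LEGACY_KEY_MAP = {
    "jira_pat": "api_token",
    "fast_thread_token": "thread_token",
}
# Keys the new schema intentionally dropped; safe to discard on read
DROPPED_KEYS = {"api_version"}

def normalize_vault_payload(data: dict) -> dict:
    """Rewrite legacy keys so a strict (extra="forbid") model accepts the payload."""
    normalized = {}
    for key, value in data.items():
        if key in DROPPED_KEYS:
            continue
        normalized[LEGACY_KEY_MAP.get(key, key)] = value
    return normalized

print(normalize_vault_payload(
    {"base_url": "https://example.atlassian.net", "jira_pat": "***", "api_version": "2"}
))
# {'base_url': 'https://example.atlassian.net', 'api_token': '***'}
```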

This is a static demo of RunLLM.
