Production Incident Response

AI → Human

Structured incident handling with AI triage and human oversight.

5 nodes · 5 edgesdevops

eventagenthumancli

Visual

Alert Triggeredevent

PagerDuty/Grafana alert fires.

↓sequential→ AI Triage

AI Triageagent

AI analyzes logs, metrics, and recent deploys.

↓sequential→ Engineer Investigation

↓timeout→ Engineer Investigation

Engineer Investigationhuman

On-call engineer validates AI triage.

↓conditional→ Apply Mitigation

Apply Mitigationcli

Rollback, scale up, or apply hotfix.

↓sequential→ Generate Postmortem

Generate Postmortemagent

AI drafts postmortem from .osoplog data.

Open in Editor

uc-incident-response.osop.yaml

osop_version: "1.0"
id: "incident-response"
name: "Production Incident Response"
description: "Structured incident handling with AI triage and human oversight."

nodes:
  - id: "alert"
    type: "event"
    name: "Alert Triggered"
    description: "PagerDuty/Grafana alert fires."

  - id: "triage"
    type: "agent"
    subtype: "llm"
    name: "AI Triage"
    description: "AI analyzes logs, metrics, and recent deploys."
    security:
      risk_level: "medium"

  - id: "investigate"
    type: "human"
    subtype: "input"
    name: "Engineer Investigation"
    description: "On-call engineer validates AI triage."

  - id: "mitigate"
    type: "cli"
    subtype: "script"
    name: "Apply Mitigation"
    description: "Rollback, scale up, or apply hotfix."
    security:
      risk_level: "high"
      approval_gate: true

  - id: "postmortem"
    type: "agent"
    subtype: "llm"
    name: "Generate Postmortem"
    description: "AI drafts postmortem from .osoplog data."

edges:
  - from: "alert"
    to: "triage"
    mode: "sequential"
  - from: "triage"
    to: "investigate"
    mode: "sequential"
  - from: "investigate"
    to: "mitigate"
    mode: "conditional"
    when: "investigation.confirmed == true"
  - from: "mitigate"
    to: "postmortem"
    mode: "sequential"
  - from: "triage"
    to: "investigate"
    mode: "timeout"
    timeout_sec: 300
    label: "Escalate if >5min"

←

AI Quality Inspection

AI Anomaly Detection & Human Triage

→