
Building an Outage Agent That Handles Repeated Production Incidents

TL;DR

The Outage Agent is a production-ready AI system that accepts alert messages through an API endpoint, reasons about them using Groq, enriches its understanding with Exa when necessary, and decides whether to handle the issue or escalate it to humans. Tensorlake Applications orchestrates the workflow with durable execution and built-in observability.

Production alerts are easy to generate but hard to interpret. An error spike or a sudden surge in failures does not immediately tell you whether the issue is harmless, already known, or serious enough to wake someone up. This constant need to manually triage alerts is what makes on-call work exhausting.

The Outage Agent is designed to handle that first decision. When an alert comes in, it analyzes the situation in context, looks for similar past incidents, checks external signals when needed, and decides whether the issue can be handled safely or must be escalated. Engineers stay in the loop, but only when it actually matters.

Built on Tensorlake Applications, the agent runs as a reliable serverless workflow that tracks every step and improves as it sees more incidents.

Here is the repo: https://github.com/tensorlakeai/examples/tree/main/outage-agent

Why Tensorlake Applications?

Building an outage agent is less about intelligence and more about reliable execution. Alerts arrive at unpredictable times, workflows span multiple steps, and failures can happen anywhere along the way. Tensorlake Applications handles this complexity by turning the agent into a structured, durable workflow.

Key reasons for using Tensorlake Applications include:

  • Application-level entry point with @application

    The @application decorator defines the outage agent as a deployable service with a clear starting point. This makes the agent callable through a stable API endpoint rather than a script that must be run manually.

  • Durable workflow steps with @function

    Each logical step such as reasoning, searching, or decision making is wrapped in a @function decorator. Tensorlake automatically persists state between these steps and retries them safely if something fails.

  • Serverless execution by default

    Once deployed, the application scales automatically as alerts come in. No infrastructure setup or manual scaling is required.

  • Secure and observable runtime

    Secrets are injected securely at runtime, and every function call is fully logged and traceable.

    While this demo does not execute LLM-generated code, Tensorlake Applications supports secure code sandboxing by design. Each @function runs in its own isolated container, allowing agents that generate code to execute it safely, without access to secrets or the main agentic loop by default.
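The decorator pattern described above can be sketched as follows. This is an illustrative sketch only: since the post does not reproduce the SDK's import paths or signatures, the sketch uses no-op stand-in decorators purely to show the shape of the agent; see the linked repository for the real code.

```python
# Illustrative sketch only: `application` and `function` here are no-op
# stand-ins for the Tensorlake decorators, so the structure of the agent
# is visible without reproducing the SDK's actual imports or signatures.
def application(fn=None, **_kwargs):
    return fn if fn is not None else (lambda f: f)

def function(fn=None, **_kwargs):
    return fn if fn is not None else (lambda f: f)

@function
def analyze_alert(alert: str) -> dict:
    # In the real agent this step sends context to Groq for reasoning.
    return {"alert": alert, "known_pattern": "auth" in alert.lower()}

@function
def decide(analysis: dict) -> dict:
    # Escalate anything the agent does not recognize.
    return {"escalate": not analysis["known_pattern"], "analysis": analysis}

@application
def outage_agent(alert: str) -> dict:
    # On the platform, each @function step is durably persisted and
    # retried; locally, the calls simply chain together.
    return decide(analyze_alert(alert))
```

The key point is that each logical step is an independently retryable unit, so a failure in one step does not discard the work done by the previous ones.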

The End-to-End Flow of an Outage

Once an alert comes in, the agent moves through a clear, step-by-step flow to reach a decision.

Step 1: Alert Submission

An alert is submitted to a Tensorlake application endpoint, which acts as the single entry point into the system.

Step 2: Workflow Orchestration

Tensorlake assigns a request ID, stores the input, and orchestrates the entire outage workflow. It manages execution order, retries failures, and keeps the run observable from start to finish.

Step 3: Context and Reasoning

The Outage Agent gathers logs, metrics, past incidents, and optional external signals from Exa through its own application code. It then assembles this context and sends it to Groq for reasoning.

Step 4: Decision and Execution

Based on confidence and risk, the Outage Agent decides whether the issue can be handled safely or must be escalated. Any actions are executed through controlled tools defined in the agent's code, never directly by the LLM.
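The confidence-and-risk gate in this step can be sketched as a small pure function. The thresholds and category names below are illustrative assumptions, not the agent's actual values; the real agent derives these signals from the LLM's structured output.

```python
def should_escalate(confidence: float, risk: str, known_pattern: bool) -> bool:
    """Return True when a human must review the incident.

    Illustrative sketch: thresholds and risk categories are assumptions,
    not values taken from the repository.
    """
    HIGH_RISK = {"data_loss", "payment_failure", "security_breach"}
    if risk in HIGH_RISK:
        return True          # never auto-handle high-risk categories
    if not known_pattern:
        return True          # unfamiliar incidents go to a human
    return confidence < 0.8  # only act autonomously with high confidence
```

The design choice worth noting is the ordering: risk category is checked before confidence, so a high-confidence answer can never override a high-risk classification.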

Step 5: Verification and Memory

The Outage Agent verifies outcomes and records the full incident details, making them available for observability, debugging, and future incident analysis.

Tensorlake ties everything together by acting as the execution and coordination layer. It separates the UI from reasoning, enforces safe execution boundaries around LLMs, and provides durability and observability by default. This allows the Outage Agent to behave like a real production system rather than a fragile demo.

Outage Agent workflow diagram showing the end-to-end flow

Requirements

Install dependencies: pip install -r requirements.txt

The project requires three API keys:

  • Tensorlake API key: For authentication to your deployed endpoint
  • Groq API key: For LLM reasoning (Llama-3.3-70B)
  • Exa API key: For real-time web search

Local Development and Testing of Outage Agent

Before deploying the Outage Agent to Tensorlake Cloud, it is useful to run and test everything locally. Tensorlake Applications is designed so that the same code runs locally and in the cloud without modification, which makes development and debugging much easier.

Setting Environment Variables

The project relies on three API keys. For local development, these keys must be set as environment variables so the agent can authenticate and call external services.

Set them based on your operating system.

macOS / Linux

export TENSORLAKE_API_KEY="your_tensorlake_key"
export GROQ_API_KEY="your_groq_key"
export EXA_API_KEY="your_exa_key"

Windows (PowerShell)

$env:TENSORLAKE_API_KEY="your_tensorlake_key"
$env:GROQ_API_KEY="your_groq_key"
$env:EXA_API_KEY="your_exa_key"

Windows (Command Prompt)

set TENSORLAKE_API_KEY=your_tensorlake_key
set GROQ_API_KEY=your_groq_key
set EXA_API_KEY=your_exa_key

Once the variables are set, the agent is ready to run locally.
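A quick way to confirm the variables are actually visible to Python before running the agent is a fail-fast check like the one below. This is a generic snippet, not part of the repository.

```python
import os

# The three keys the project requires, as listed above.
REQUIRED = ("TENSORLAKE_API_KEY", "GROQ_API_KEY", "EXA_API_KEY")

def missing_keys(env=os.environ) -> list:
    """Return the names of any required keys that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

# Fail fast with a clear message instead of a confusing auth error later.
if missing_keys():
    print("Set these variables first:", ", ".join(missing_keys()))
```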

Next, we'll walk through the core logic in outage_agent.py.

Repo: https://github.com/tensorlakeai/examples/blob/main/outage-agent/outage_agent.py

Note on implementation details

  • gather_internal_context is a placeholder; in production it would query systems like Kubernetes, Datadog, or internal observability tools for live logs and metrics.
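For context, a production-shaped gather_internal_context might look like the stub below. The returned schema is an assumption for illustration, and the hard-coded values stand in for real queries to whatever monitoring stack you run.

```python
def gather_internal_context(service: str) -> dict:
    """Placeholder-shaped context gatherer.

    A production version would replace the hard-coded values with real
    queries (e.g. Kubernetes pod status, Datadog metrics, log search).
    The returned shape is illustrative, not the repository's schema.
    """
    return {
        "service": service,
        "recent_error_rate": 0.18,          # from your metrics backend
        "pod_restarts_last_hour": 3,        # from your orchestrator
        "similar_past_incidents": [         # from your incident store
            {"id": "INC-1042", "resolution": "rolled back bad deploy"},
        ],
    }
```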

Running the Agent Locally

Once the environment variables are set, you can run the Outage Agent directly from the repository.

python outage_agent.py

This runs the full outage workflow locally using a sample alert. The agent analyzes the alert, gathers context, reasons about the issue, and produces a decision exactly as it would in production. No deployment is required at this stage.

Local testing with test_outage_agent_local.py

For more controlled testing, the repository includes a dedicated test script.

python test_outage_agent_local.py

This script sends a predefined alert through the local Tensorlake application runtime and prints the structured output. It is useful for validating changes to the agent logic and confirming that reasoning, escalation decisions, and outputs behave as expected before moving to deployment.

Local testing output showing the outage agent response

Once local testing is complete, the agent is ready to be deployed as a live endpoint in the Tensorlake cloud.

Deployment to Tensorlake Cloud

Login

Authenticate with Tensorlake from your terminal:

tensorlake login

Set Secrets

Store the Groq and Exa API keys securely. These are injected at runtime and never hardcoded.

tensorlake secrets set GROQ_API_KEY "your_groq_key"
tensorlake secrets set EXA_API_KEY "your_exa_key"

Verify that the secrets are set correctly:

tensorlake secrets list

Deploy the Agent

Deploy the application using the same outage_agent.py file:

tensorlake deploy outage_agent.py

On successful deployment, Tensorlake returns a permanent endpoint for the agent:

https://api.tensorlake.ai/applications/outage_agent

Deployment success message showing the outage agent endpoint

The Endpoint: What It Is and What It Does

The endpoint is the live entry point to the Outage Agent in the cloud.

It:

  • Accepts POST requests containing raw alert text
  • Returns a request_id immediately
  • Runs the full outage workflow asynchronously
  • Persists progress and retries failed steps automatically
  • Logs every step for observability and debugging

The final result can be retrieved by polling:

/requests/{request_id}/output

💡 To invoke the endpoint directly, use the following cURL command:

curl https://api.tensorlake.ai/applications/outage_agent \
  -H "Authorization: Bearer $TENSORLAKE_API_KEY" \
  -H "Content-Type: application/json" \
  --data '{"alert":"Authentication failures increased 800% in the last 10 minutes"}'

Before running the command, make sure to:

  • Replace the endpoint URL if your deployed application name is different

  • Set TENSORLAKE_API_KEY to your own API key

  • Update the JSON payload

    {"alert":"Authentication failures increased 800% in the last 10 minutes"}
    

    with the alert message you want the agent to analyze

The response will return a request_id, which can be used to track and retrieve the agent's output.
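The same call can be assembled in Python. The helper below mirrors the cURL example above without sending anything, so the pieces are easy to inspect; how the polling path joins onto the base URL, and the library you use to actually send the request, are assumptions to verify against the API docs.

```python
import json

# Replace with your own deployed application's endpoint if it differs.
BASE = "https://api.tensorlake.ai/applications/outage_agent"

def build_submission(alert: str, api_key: str) -> dict:
    """Assemble the POST request shown in the cURL example above."""
    return {
        "url": BASE,
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"alert": alert}),
    }

def poll_url(request_id: str) -> str:
    """URL for retrieving the final output of an asynchronous run.

    Assumption: the /requests/{request_id}/output path hangs off the
    application endpoint; check the Tensorlake docs for the exact route.
    """
    return f"{BASE}/requests/{request_id}/output"
```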

API response showing request_id from the outage agent endpoint

Testing the Remote Agent

To validate the deployment, run the provided test script:

python test_remote_outage_agent.py

This script sends a sample alert to the deployed endpoint, confirms that a request_id is created, and verifies that the workflow completes successfully.

Remote testing output showing the outage agent response

At this point, the Outage Agent is live and ready to be triggered by a UI or an alerting system.

The Streamlit User Interface

The Outage Agent includes a simple Streamlit-based interface that makes it easy to interact with the system without calling the API directly. The UI is designed for testing, demos, and manual triage, while all execution and decision making happens in the backend.

The interface code lives in streamlit_app.py in the repository.

The Streamlit app allows users to:

  • Enter their own Tensorlake API key
  • Submit alert messages for analysis
  • View real-time progress with a spinner while the agent runs
  • See clear green or red banners indicating whether escalation is required
  • Reuse previous alerts with a single click

To run the interface locally:

streamlit run streamlit_app.py

Streamlit UI interface for the outage agent

Agent Outcomes and Escalation Logic

For every alert, the agent produces one of two outcomes.

🟢 Handled Automatically (Green)

This outcome is used when the agent has high confidence, recognizes a known pattern, and determines that the risk is low. The response includes the suspected root cause, the actions taken or recommended, and a verification signal showing improvement. These issues are safe for the agent to handle without human involvement, such as a transient spike in authentication failures.

Dashboard showing handled automatically outcome

🔴 Escalation Required (Red)

This outcome is returned when the issue carries significant risk or uncertainty. Scenarios such as potential data loss, payment failures, or security breaches require immediate human review. In these cases, the agent still provides analysis and context but explicitly avoids taking action and forces escalation.

Dashboard showing escalation required outcome

This approach reduces on-call noise while ensuring that critical issues are never handled blindly.
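In code, the two banners reduce to a single boolean on the agent's structured result. The field names below are illustrative assumptions, not the repository's exact schema.

```python
# Illustrative result payloads (field names are assumptions, not the
# repository's exact schema).
handled = {
    "escalate": False,
    "root_cause": "transient spike in authentication failures",
    "actions": ["restarted auth worker pool"],
    "verified": True,
}
escalated = {
    "escalate": True,
    "reason": "possible security breach; human review required",
}

def banner(result: dict) -> str:
    """Map an agent result to the green/red banner shown in the UI.

    Missing or malformed results default to red: when in doubt, escalate.
    """
    return "red" if result.get("escalate", True) else "green"
```

Defaulting the missing case to red keeps the safety property one-sided: a bug in the agent's output can only produce a false escalation, never a silently swallowed incident.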

Observability and Debugging

Every execution of the Outage Agent is fully observable through the Tensorlake dashboard. Each run is tracked with a unique request ID, along with its timestamp, input, and final output.

The dashboard provides visibility into every step of the workflow, including detailed logs, execution duration, and success or failure status. If a tool call or reasoning step fails, the full execution trace makes it easy to pinpoint where and why the failure occurred.

Tensorlake observability dashboard showing request details

Because each step is durable, progress is never lost. Even during development and testing, failures did not require restarting the outage agent from scratch, which made iteration faster and debugging significantly easier.

Final Verdict

Tensorlake Applications turns what would normally be a complex and fragile agent into a simple, reliable service. Durable execution, serverless scaling, and built-in observability remove the operational overhead, allowing the Outage Agent to focus purely on decision-making and safety.

With a few commands, the agent can be deployed as a live endpoint and start triaging alerts immediately. Engineers stay in control, but no longer need to investigate every signal by hand.

Clone the repository, deploy in minutes, and reduce on-call noise without taking unnecessary risks.

→ Quickstart: https://docs.tensorlake.ai/applications/quickstart

Your AI on-call engineer is ready to handle the noise and wake you only when it truly matters.

Arindam Majumder

Developer Advocate at Tensorlake

I’m a developer advocate, writer, and builder who enjoys breaking down complex tech into simple steps, working demos, and content that developers can act on. My blogs have crossed a million views across platforms, and I create technical tutorials on YouTube focused on AI, agents, and practical workflows. I contribute to open source, explore new AI tooling, and build small apps and prototypes to show developers what’s possible with today’s models.
