I’ve been thinking a lot about how to actually run AI agents in production lately. Not the model selection part or the prompt engineering… the simple infrastructure question of how this thing gets hosted and triggered. Recently I’ve been neck deep in exploring this, along with some of the emerging frameworks for hosting and deploying agents.

When you strip away the hype, an agent is just a program that calls an LLM and takes actions. But the deployment model you choose shapes everything: how it handles failures, how it scales, what it costs, and how debuggable it is when things go sideways. After reading through a bunch of what others are writing (Anthropic’s guide on building effective agents, Temporal’s take on shippable AI systems, Chip Huyen’s deep dive on agents) and building a few things myself, I’ve landed on seven hosting patterns that keep showing up.

These aren’t mutually exclusive. Real systems combine them. But I find it useful to name them explicitly because the hosting pattern you pick has real consequences for reliability, cost, and operational complexity.

1. Scheduled Agent (Cron)

The simplest pattern. Your agent runs on a timer, does its work, writes results somewhere, and exits. No persistent process, no event bus, no complexity.

This works well for agents that need to periodically check something, summarize recent activity, or generate reports. The agent is stateless between runs — it reads what it needs from a database or API, does its thing, and shuts down.

# scheduled_agent.py — runs via cron, e.g. "0 */6 * * *"
import anthropic
import json
from datetime import datetime, timedelta

client = anthropic.Anthropic()

def check_recent_incidents():
    # Pull last 6 hours of alerts from your monitoring system
    incidents = fetch_incidents(since=datetime.now() - timedelta(hours=6))
    if not incidents:
        return

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Summarize these incidents and flag anything that needs follow-up:\n{json.dumps(incidents)}"
        }]
    )

    summary = response.content[0].text
    post_to_slack("#ops-summary", summary)
    save_to_db(summary, incidents)

if __name__ == "__main__":
    check_recent_incidents()

When to use it: Data summarization, periodic monitoring, report generation, cleanup tasks. Anything where “check every N minutes/hours” is good enough.

Trade-offs: No real-time responsiveness. If you need the interval to be very short, you’re approaching a daemon and should probably just run one. State management is on you — the agent needs to figure out what’s new since last run.
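One way to handle the “what’s new since last run” problem is a persisted cursor: record the timestamp of the last successful run and only advance it after the run completes, so a crash re-reads the same window instead of dropping it. A minimal sketch (the file path and helper names are mine, not from the example above):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

CURSOR_FILE = Path("/tmp/agent_cursor.json")  # illustrative; use a DB row in production

def load_cursor(default: str) -> str:
    """Return the ISO timestamp of the last successful run."""
    if CURSOR_FILE.exists():
        return json.loads(CURSOR_FILE.read_text())["last_run"]
    return default

def save_cursor(ts: str) -> None:
    CURSOR_FILE.write_text(json.dumps({"last_run": ts}))

def run_once(fetch_incidents):
    # Advance the cursor only after the run succeeds: a crash mid-run
    # means the next invocation re-processes the same window.
    since = load_cursor(default="1970-01-01T00:00:00+00:00")
    incidents = fetch_incidents(since=since)
    # ... summarize, post to Slack, save to DB ...
    save_cursor(datetime.now(timezone.utc).isoformat())
    return incidents
```

The same shape works with a database row or an S3 object instead of a local file; the point is that the cursor write is the commit.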

2. Event-Driven Agent (Reactive)

The agent activates in response to an external event: a webhook, a queue message, a database change. It processes the event, takes action, and exits. The key difference from cron is that work happens when something happens, not on a timer.

This is the bread and butter of most production agent deployments I’ve seen. A customer submits a support ticket, a PR gets opened, a payment fails — something happens, and an agent responds.

# event_driven_agent.py — triggered by SQS message
import json
import anthropic

client = anthropic.Anthropic()

TOOLS = [{
    "name": "create_jira_ticket",
    "description": "Create a Jira ticket for the engineering team",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "description": {"type": "string"},
            "priority": {"type": "string", "enum": ["low", "medium", "high", "critical"]}
        },
        "required": ["title", "description", "priority"]
    }
}]

def handle_event(event):
    """Process an incoming support ticket event."""
    ticket = json.loads(event["body"])

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system="You are a support triage agent. Analyze tickets and create Jira issues for engineering when needed.",
        tools=TOOLS,
        messages=[{
            "role": "user",
            "content": f"New support ticket:\nSubject: {ticket['subject']}\nBody: {ticket['body']}\nCustomer tier: {ticket['tier']}"
        }]
    )

    for block in response.content:
        if block.type == "tool_use" and block.name == "create_jira_ticket":
            create_jira_ticket(**block.input)

# AWS Lambda handler
def lambda_handler(event, context):
    for record in event["Records"]:
        handle_event(record)

When to use it: Support triage, PR review, alerting enrichment, any workflow kicked off by an external system producing events.

Trade-offs: You need event infrastructure (queues, webhooks, event buses). Retries and dead-letter handling matter here… if your agent fails mid-processing, what happens to the event? Serverless platforms like Lambda handle a lot of this, but the timeout constraints (15 minutes max on Lambda) can bite you with complex agent loops.
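On the “what happens to the event” question: with SQS event sources, Lambda supports partial-batch failure reporting (the ReportBatchItemFailures setting), so a handler can return just the message IDs that failed and only those get redelivered, with the queue’s redrive policy eventually moving repeat offenders to a dead-letter queue. A sketch, with a stub standing in for the triage logic above:

```python
import json

def handle_event(record):
    """Stub for the triage logic shown above (illustrative)."""
    body = json.loads(record["body"])
    if body.get("poison"):
        raise ValueError("cannot process")

def lambda_handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            handle_event(record)
        except Exception:
            # Redeliver only this message; the rest of the batch is done.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Without this, a single failing message fails the whole batch and every event in it gets redelivered, which means your agent re-runs (and re-spends tokens) on work it already finished.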

3. Persistent Long-Running Agent (Daemon)

The agent runs continuously as a process, maintaining state in memory. It might listen on a socket, poll a queue, or manage a conversation over time. Unlike cron or event-driven agents, it doesn’t exit between tasks.

This is the model that tools like Letta (formerly MemGPT) use: a server process that maintains agent state, memory, and context across interactions. It’s also what you get when you run a chatbot backend that keeps conversation history in-process.

# daemon_agent.py — long-running process maintaining state
import asyncio
import anthropic
from collections import defaultdict

client = anthropic.Anthropic()

class ConversationAgent:
    def __init__(self):
        self.conversations: dict[str, list] = defaultdict(list)
        self.user_preferences: dict[str, dict] = {}

    def chat(self, user_id: str, message: str) -> str:
        self.conversations[user_id].append({"role": "user", "content": message})

        system = "You are a helpful assistant with memory of prior conversations."
        if prefs := self.user_preferences.get(user_id):
            system += f"\nUser preferences: {prefs}"

        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            system=system,
            messages=self.conversations[user_id][-20:]  # sliding window
        )

        reply = response.content[0].text
        self.conversations[user_id].append({"role": "assistant", "content": reply})
        return reply

agent = ConversationAgent()

# Expose via HTTP, websocket, etc.
from fastapi import FastAPI
app = FastAPI()

@app.post("/chat")
async def chat(user_id: str, message: str):
    return {"response": agent.chat(user_id, message)}

When to use it: Chatbots, interactive assistants, agents that need fast response times with maintained context, monitoring agents that watch streams of data.

Trade-offs: State lives in memory, so a process restart means state loss unless you’re checkpointing externally. Scaling horizontally requires sticky sessions or shared state. Resource consumption is constant whether the agent is busy or idle. Letta’s approach of checkpointing state to a database at each step is a good hybrid: you get the speed of in-process state with durability.
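That checkpoint-per-step hybrid needs nothing fancier than SQLite to demonstrate: persist every turn as it happens, and a restarted process reloads history instead of forgetting it. The schema and class here are my own illustration, not Letta’s actual design:

```python
import json
import sqlite3

class DurableConversations:
    """In-process conversation store that checkpoints each turn to SQLite."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS turns "
            "(user_id TEXT, seq INTEGER, message TEXT)"
        )

    def append(self, user_id: str, message: dict) -> None:
        seq = self.db.execute(
            "SELECT COUNT(*) FROM turns WHERE user_id = ?", (user_id,)
        ).fetchone()[0]
        self.db.execute(
            "INSERT INTO turns VALUES (?, ?, ?)",
            (user_id, seq, json.dumps(message)),
        )
        self.db.commit()  # the checkpoint: survives a process restart

    def history(self, user_id: str, limit: int = 20) -> list[dict]:
        rows = self.db.execute(
            "SELECT message FROM turns WHERE user_id = ? ORDER BY seq",
            (user_id,),
        ).fetchall()
        return [json.loads(r[0]) for r in rows][-limit:]
```

Swapping `ConversationAgent`’s in-memory dicts for something like this changes almost nothing in the chat path, but a crash now costs you latency instead of memory.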

4. Workflow-Orchestrated Agent (Pipeline)

The agent’s work is decomposed into durable, checkpointed steps managed by an external orchestrator. Each step is an activity that can be retried independently. If the process crashes, execution resumes from the last checkpoint.

This is the pattern I’ve been exploring with Temporal, which popularized the term durable execution. inference.sh has good documentation on applying it to agents specifically, and frameworks like LangGraph provide something similar with their persistence layer. You write the orchestration logic as a regular function, but each step that does real work (calling an LLM, invoking a tool, hitting an API) runs as a separate retryable activity. The orchestrator checkpoints after each step, so if the process dies, it picks back up where it left off.

// workflow-agent.ts — Temporal workflow with durable LLM calls
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';
import type { Customer, OnboardingResult } from './types'; // hypothetical shared types module

const { callLLM, searchKnowledgeBase, sendEmail, createTicket } =
  proxyActivities<typeof activities>({
    startToCloseTimeout: '60 seconds',
    retry: { maximumAttempts: 3 },
  });

export async function customerOnboardingAgent(customer: Customer): Promise<OnboardingResult> {
  // Step 1: Analyze customer profile (retryable, checkpointed)
  const analysis = await callLLM(
    `Analyze this customer and recommend an onboarding path: ${JSON.stringify(customer)}`
  );

  // Step 2: Search for relevant docs (retryable, checkpointed)
  const docs = await searchKnowledgeBase(analysis.recommendedTopics);

  // Step 3: Generate personalized welcome (retryable, checkpointed)
  const welcome = await callLLM(
    `Create a personalized onboarding email using these docs: ${JSON.stringify(docs)}`
  );

  // Step 4: Send and track (retryable, checkpointed)
  await sendEmail(customer.email, welcome);
  await createTicket({ type: 'onboarding', customerId: customer.id, status: 'started' });

  return { customerId: customer.id, path: analysis.recommendedPath };
}

When to use it: Multi-step agent workflows where failure mid-way is expensive (token costs, side effects already committed), long-running operations that span minutes to hours, anything that needs audit trails and observability.

Trade-offs: Requires orchestration infrastructure (Temporal, Restate, or similar). Adds latency from checkpointing. More code than a simple script. But you get retries, observability, and resumability for free, which is hard to beat when your agent is doing real work in production. Temporal’s blog on shippable AI systems covers five specific patterns that make this practical.

5. Agent-as-API (Service)

The agent is exposed as a synchronous or streaming HTTP endpoint. A client sends a request, the agent processes it (possibly with multiple LLM calls and tool uses), and returns a response. It’s just a service.

This is the most familiar pattern for anyone who’s built web APIs. The difference from a daemon is that each request is independent — there’s no shared in-process state between requests (or if there is, it’s fetched from a database). Cloud providers are converging on this with Azure AI Foundry Agent Service, Google Cloud Run for agents, and Amazon Bedrock AgentCore.

# agent_service.py — agent exposed as a streaming API
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import anthropic

app = FastAPI()
client = anthropic.Anthropic()

TOOLS = [{
    "name": "lookup_order",
    "description": "Look up order details by order ID",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"]
    }
}]

def run_agent(query: str, user_id: str):
    """Run an agentic loop, yielding text chunks."""
    messages = [{"role": "user", "content": query}]
    context = load_user_context(user_id)  # from DB, not memory

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            system=f"You are a customer service agent.\nContext: {context}",
            tools=TOOLS,
            messages=messages,
        )

        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    yield block.text
            break

        # Handle tool calls: echo the assistant turn once, then return
        # all tool results together in a single user message
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append(
                    {"type": "tool_result", "tool_use_id": block.id, "content": str(result)}
                )
        messages.append({"role": "user", "content": tool_results})

@app.post("/agent")
async def agent_endpoint(query: str, user_id: str):
    return StreamingResponse(run_agent(query, user_id), media_type="text/plain")

When to use it: Customer-facing agents, internal tools, anything that needs to fit into an existing request/response architecture. This is the default if you’re integrating agents into an existing service mesh.

Trade-offs: Bound by HTTP timeout constraints. Complex multi-step agents may exceed typical load balancer timeouts (30-60 seconds), so streaming or async patterns become necessary. No built-in durability — if the process dies mid-request, the work is lost. For longer operations, you often end up adding a job queue behind the API, at which point you’re really doing the workflow pattern with HTTP as the front door.
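The “job queue behind the API” shape usually looks like a 202 Accepted with a job ID the client polls. A stdlib-only sketch of that front door (in production the dict would be Redis or a database and the thread a real worker; all names here are illustrative):

```python
import threading
import uuid

JOBS: dict[str, dict] = {}

def submit(query: str, run_agent) -> str:
    """POST /agent -> return a job ID immediately, run the agent in the background."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "running", "result": None}

    def work():
        try:
            JOBS[job_id] = {"status": "done", "result": run_agent(query)}
        except Exception as exc:
            JOBS[job_id] = {"status": "failed", "result": str(exc)}

    threading.Thread(target=work, daemon=True).start()
    return job_id

def status(job_id: str) -> dict:
    """GET /agent/{job_id} -> current state, no matter how long the agent takes."""
    return JOBS[job_id]
```

The client-facing latency is now constant regardless of how many LLM round-trips the agent makes, which is exactly the point at which this pattern has quietly become the workflow pattern with HTTP as the front door.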

6. Self-Scheduling Agent (Adaptive)

The agent determines its own next execution time based on results. After completing a run, it evaluates what it found and schedules itself accordingly. Quiet period? Check again in an hour. Something interesting happening? Check again in five minutes.

This is a more sophisticated version of the cron pattern. Instead of a fixed interval, the agent is adaptive. I find this pattern compelling for monitoring and research use cases where the rate of change in the environment varies.

# self_scheduling_agent.py — determines its own next run time
import anthropic
import json

client = anthropic.Anthropic()

TOOLS = [{
    "name": "schedule_next_run",
    "description": "Schedule when this agent should run next",
    "input_schema": {
        "type": "object",
        "properties": {
            "delay_minutes": {"type": "integer", "description": "Minutes until next run"},
            "reason": {"type": "string", "description": "Why this interval was chosen"}
        },
        "required": ["delay_minutes", "reason"]
    }
}]

def run_monitoring_cycle():
    metrics = fetch_current_metrics()
    previous = load_previous_analysis()

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system="""You are a monitoring agent. Analyze metrics, report anomalies,
        and schedule your next check. Use shorter intervals when things look
        abnormal, longer intervals when stable.""",
        tools=TOOLS,
        messages=[{
            "role": "user",
            "content": f"Current metrics:\n{json.dumps(metrics)}\n\nPrevious analysis:\n{previous}"
        }]
    )

    next_delay = 60  # default: 1 hour
    for block in response.content:
        if hasattr(block, "text"):
            save_analysis(block.text)
        if block.type == "tool_use" and block.name == "schedule_next_run":
            next_delay = block.input["delay_minutes"]
            log(f"Next run in {next_delay}m: {block.input['reason']}")

    # Guardrails: never tighter than 5 minutes, never looser than a day
    next_delay = max(5, min(next_delay, 24 * 60))

    # Re-schedule ourselves
    schedule_job("run_monitoring_cycle", delay_minutes=next_delay)

if __name__ == "__main__":
    run_monitoring_cycle()

When to use it: Monitoring where the checking frequency should adapt to conditions, research agents that crawl data sources at varying rates, any situation where a fixed interval is either too frequent (wasteful) or too infrequent (miss events).

Trade-offs: You need a job scheduler that supports dynamic delays (Celery, Cloud Tasks, SQS delay queues, or a cron service with an API). The agent can get itself into trouble — scheduling too aggressively eats tokens and API quota, scheduling too conservatively misses events. Guardrails on min/max intervals are essential.

7. Multi-Agent Mesh (Distributed)

Independent agents communicating via events, shared state, or direct invocation. Each agent has its own domain, its own deployment, and its own lifecycle. They coordinate through a shared infrastructure layer.

This is the most complex pattern and the one most prone to over-engineering. Most multi-agent projects should probably just be a single agent with better tools. But when you genuinely have separate domains that need to collaborate — say, a security agent, a compliance agent, and a deployment agent that all weigh in on a release — the mesh pattern emerges naturally.

The interesting development here is that standards are forming around how agents discover and talk to each other. Google’s A2A (Agent2Agent) protocol is the one I’ve been paying closest attention to. Agents publish “Agent Cards” — JSON metadata describing their capabilities — and communicate over standard HTTP using JSON-RPC. It supports sync request/response, streaming via SSE, and async push notifications. Over 150 organizations have adopted it so far. Where Anthropic’s MCP connects agents to tools and resources, A2A handles the peer-to-peer communication between agents themselves. They’re complementary, and production multi-agent systems will likely use both.
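For a sense of what an Agent Card looks like, here’s one for a security reviewer like the agent sketched in this section. A2A serves this as JSON from a well-known URL on the agent’s host; the field shapes below follow my reading of the spec and should be checked against the current schema before relying on them:

```python
# Illustrative A2A Agent Card (as a Python dict; served as JSON in practice).
# The URL is hypothetical and the exact schema should be verified.
SECURITY_AGENT_CARD = {
    "name": "security-review-agent",
    "description": "Reviews proposed releases for security risks.",
    "url": "https://agents.internal.example.com/security",
    "version": "1.0.0",
    "capabilities": {
        "streaming": True,           # supports SSE responses
        "pushNotifications": False,  # no async webhook callbacks
    },
    "skills": [
        {
            "id": "review-release",
            "name": "Release security review",
            "description": "Analyze a release diff and return approve/block.",
            "tags": ["security", "release"],
        }
    ],
}
```

A peer agent fetches this card to discover what the security agent can do and how to talk to it, rather than having that knowledge baked in at deploy time.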

# multi_agent_mesh.py — agents communicating via event bus
import anthropic
import json

client = anthropic.Anthropic()

# Each agent subscribes to relevant events and publishes decisions

class SecurityAgent:
    def handle_event(self, event: dict):
        if event["type"] != "release.proposed":
            return

        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=512,
            system="You are a security review agent. Analyze changes for security risks.",
            messages=[{"role": "user", "content": json.dumps(event["payload"])}]
        )

        publish_event({
            "type": "review.security",
            "release_id": event["payload"]["release_id"],
            "decision": parse_decision(response.content[0].text),
            "analysis": response.content[0].text
        })

class ComplianceAgent:
    def handle_event(self, event: dict):
        if event["type"] != "release.proposed":
            return

        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=512,
            system="You are a compliance agent. Check changes against regulatory requirements.",
            messages=[{"role": "user", "content": json.dumps(event["payload"])}]
        )

        publish_event({
            "type": "review.compliance",
            "release_id": event["payload"]["release_id"],
            "decision": parse_decision(response.content[0].text),
            "analysis": response.content[0].text
        })

class ReleaseCoordinator:
    """Gathers reviews and makes the final call."""
    def __init__(self):
        self.reviews: dict[str, list] = {}

    def handle_event(self, event: dict):
        if not event["type"].startswith("review."):
            return

        release_id = event["release_id"]
        self.reviews.setdefault(release_id, []).append(event)

        if len(self.reviews[release_id]) >= 2:  # all reviewers reported
            all_approved = all(r["decision"] == "approved" for r in self.reviews[release_id])
            publish_event({
                "type": "release.approved" if all_approved else "release.blocked",
                "release_id": release_id,
                "reviews": self.reviews.pop(release_id)
            })

When to use it: When you have genuinely separate domains of expertise that need to collaborate, when different agents need different models or tool sets, when you want independent scaling and deployment per agent.

Trade-offs: Operational complexity goes through the roof. You now have N agents to deploy, monitor, and debug instead of one. Failure modes multiply — what if the security agent is down when a release is proposed? You need timeouts, fallbacks, and dead-letter handling for inter-agent communication. Recent research on multi-agent LLM systems reports failure rates of 41–86% on benchmark tasks, with coordination breakdown as the primary cause. Start with a single agent and split only when you have a clear reason.
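The “security agent is down” failure mode maps to a deadline on the coordinator: track when each release was proposed, and if the expected reviews haven’t all arrived by the deadline, fail closed. A sketch of that hardened coordinator (the injectable clock and `publish` callback are for testability; names are mine):

```python
import time

REVIEW_TIMEOUT_S = 600
EXPECTED_REVIEWERS = {"review.security", "review.compliance"}

class DeadlineCoordinator:
    """Gathers reviews, but blocks any release whose reviews never arrive."""

    def __init__(self, publish, now=time.time):
        self.publish = publish
        self.now = now
        self.pending: dict[str, dict] = {}

    def handle_event(self, event: dict):
        if event["type"] == "release.proposed":
            rid = event["payload"]["release_id"]
            self.pending[rid] = {"deadline": self.now() + REVIEW_TIMEOUT_S,
                                 "reviews": []}
        elif event["type"] in EXPECTED_REVIEWERS:
            entry = self.pending.get(event["release_id"])
            if entry:
                entry["reviews"].append(event)
                self._maybe_decide(event["release_id"])

    def check_deadlines(self):
        """Run periodically (cron or timer): fail closed on missing reviews."""
        for rid, entry in list(self.pending.items()):
            if self.now() > entry["deadline"]:
                self.publish({"type": "release.blocked", "release_id": rid,
                              "reason": "review timeout"})
                del self.pending[rid]

    def _maybe_decide(self, rid: str):
        entry = self.pending[rid]
        if {r["type"] for r in entry["reviews"]} == EXPECTED_REVIEWERS:
            ok = all(r["decision"] == "approved" for r in entry["reviews"])
            self.publish({"type": "release.approved" if ok else "release.blocked",
                          "release_id": rid})
            del self.pending[rid]
```

Checking the set of review types (rather than just counting to 2) also guards against one reviewer reporting twice while another never reports at all.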

Picking a Pattern

There’s no universal answer here. I keep coming back to a simple heuristic:

Pattern               | Best When                                 | Avoid When
----------------------|-------------------------------------------|------------------------------------------
Scheduled (Cron)      | Periodic checks, reports, summaries       | You need real-time response
Event-Driven          | Reacting to external triggers             | Events are infrequent and batch is fine
Persistent Daemon     | Fast response with maintained context     | State loss on restart is unacceptable
Workflow-Orchestrated | Multi-step, failure-prone operations      | The task is simple enough for a script
Agent-as-API          | Fits existing service architecture        | Operations regularly exceed HTTP timeouts
Self-Scheduling       | Variable-rate monitoring/research         | You can’t set good min/max guardrails
Multi-Agent Mesh      | Genuinely separate domains collaborating  | A single agent with good tools would work

In practice, most production systems I’ve seen combine two or three of these. An event-driven agent that kicks off a workflow-orchestrated pipeline is common. An agent-as-API that delegates to a self-scheduling background agent is another natural combination. The Anthropic guide has good advice here: start with the simplest pattern that could work, and add complexity only when you have evidence it’s needed.

What I’m Still Figuring Out

In my experience building traditional services, the hosting pattern almost always evolves. You start with a cron job that polls for work. Then someone needs faster response times, so you add an event subscription. Before long the cron is gone and you have a daemon consuming from a queue. This is a well-worn path — I’ve done it a dozen times and the migration is straightforward because the core logic (fetch data, process it, write results) stays the same. You’re just swapping the trigger.
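“Swapping the trigger” is easiest to see when the core logic takes data in and returns results, and each hosting pattern is just a thin entry point around it. A sketch, with the LLM call elided and helper names of my own invention:

```python
import json

def triage(incidents: list[dict]) -> dict:
    """Core agent logic: same function whether cron or a queue invokes it."""
    # (the Anthropic call from the earlier examples would live here)
    return {"count": len(incidents), "ids": [i["id"] for i in incidents]}

def main_cron(fetch_incidents):
    """Cron entry point: batch everything since the last run."""
    return triage(fetch_incidents())

def lambda_handler(event, context):
    """Queue entry point: one incident per message."""
    return [triage([json.loads(r["body"])]) for r in event["Records"]]
```

The migration cost lives almost entirely in the entry points, which is why this evolution is cheap for services; whether the prompt inside `triage` survives the batch-to-single-event shift intact is the open question for agents.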

I’m curious whether agent hosting follows the same trajectory. A few things I want to explore:

  • Cron to event-driven: This should translate directly. Your scheduled agent already reads state and acts on it — wiring it to a queue instead of a timer is mechanical. But does the agent logic change? If it was summarizing the last 6 hours of incidents, now it’s processing one incident at a time. The prompt and context window management look different.
  • Event-driven to workflow-orchestrated: This is where I think the real upgrade path lives. You start with a Lambda that handles a single event, then realize you need retries, checkpointing, and multi-step coordination. Wrapping that same logic in a Temporal workflow or similar orchestrator is the natural next step, and the agent code itself barely changes.
  • Single agent to multi-agent: This is the one I’m least sure about. With services, splitting a monolith into microservices has a known playbook (and known pitfalls). With agents, when does it actually make sense to split one agent into collaborating specialists? I suspect the answer is “later than you think” but I want to build some real examples to test that.

I’ll keep exploring this space and plan to write follow-up posts as I prototype these migration paths. The patterns above are a starting point for thinking about it, but the interesting work is in how they compose and evolve as your requirements change.

Further Reading