
Guide · Engineering · Updated May 2026

How to Build an AI Agent That Works in Production (2026)

Most AI agents people are building right now are not agents. They are workflows pretending to be agents — and the confusion costs teams weeks of unnecessary engineering. Anthropic's December 2024 research note Building Effective Agents draws the line clearly: a workflow is a predefined LLM-and-tool chain where each step is decided in advance. An agent is a system where the LLM dynamically decides what to do next until the task is done. Workflows are simpler, cheaper, more reliable, and easier to debug. They are the right answer 80 percent of the time.

This guide covers what you actually need to build a production AI agent: the four components every agent has, real TypeScript code for each, the build vs buy decision framework that most teams skip, honest cost math at production scale, and the five failure modes that will appear in every production deployment if you do not plan for them upfront.

Everything in this guide assumes you have already determined you need a real agent — one where the LLM must dynamically decide between multiple tools based on what it discovers at runtime. If you are not certain that is what you need, read the workflow vs agent section first. Most teams who think they need an agent actually need a well-structured workflow with one or two LLM calls, which is dramatically easier to build, maintain, and debug.

The guide uses a customer support triage agent as the running example throughout — it is concrete enough to illustrate real decisions but general enough that the patterns apply across most agent use cases. All code is TypeScript using the Anthropic SDK, with notes on where GPT-4o or other model APIs differ.

Before you read further: the brutal build-vs-buy decision is the next section. Most teams who think they need to build an agent should buy one instead. Search the Index first.

Build or buy: be honest with yourself

Engineering time is the most expensive resource in any organisation. A senior developer at $200K fully loaded costs about $4,000 per week. A simple agent will take 2-4 weeks to build to MVP and another 4-8 weeks to harden for production. That is 6-12 weeks, or $24K-48K of engineering cost, before maintenance, monitoring, and iteration. An off-the-shelf agent that solves 80% of your use case at $200/month costs $24K over ten years, so even the low end of that build estimate buys a decade of subscription.
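The arithmetic above is easy to rerun with your own numbers. The sketch below is illustrative only: the function names are made up for this example, and the inputs are the figures from the paragraph.

```typescript
// Hypothetical helpers; the numbers passed in below come from the paragraph above.
interface BuildEstimate {
  weeklyCostUsd: number // fully loaded engineer cost per week
  minWeeks: number      // optimistic MVP + hardening time
  maxWeeks: number      // pessimistic MVP + hardening time
}

// Low and high engineering cost for the build, before maintenance
function buildCostRange(e: BuildEstimate): [number, number] {
  return [e.weeklyCostUsd * e.minWeeks, e.weeklyCostUsd * e.maxWeeks]
}

// How many months of an off-the-shelf subscription the build budget would cover
function subscriptionMonthsCovered(buildCostUsd: number, monthlyFeeUsd: number): number {
  return buildCostUsd / monthlyFeeUsd
}

const [lowCost, highCost] = buildCostRange({ weeklyCostUsd: 4000, minWeeks: 6, maxWeeks: 12 })
console.log(lowCost, highCost)                        // 24000 48000
console.log(subscriptionMonthsCovered(lowCost, 200))  // 120 (ten years)
```

If the months covered come out in single digits, building starts to look reasonable. If they come out in decades, buy.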

Don't build if

✗ An existing tool covers 70% or more of your need.
✗ Your use case is generic (cold email, scheduling, support).
✗ You are doing it because "AI agents are hot", not because of a real problem.
✗ You don't have an engineer with production LLM experience on the team.
✗ You can't articulate what success looks like in one sentence.

Build if

✓ No existing agent solves your specific case after a real search.
✓ The integrations you need don't exist as off-the-shelf tools.
✓ The agent IS the product or a core competitive moat.
✓ You need data flows that off-the-shelf tools can't support.
✓ Your security requirements rule out third-party hosted agents.

Workflow or agent? The distinction that saves weeks

From Anthropic's research: workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents are systems where LLMs dynamically direct their own processes and tool usage. The implication is huge — if your problem can be solved with a fixed sequence of LLM calls and known tools, you do not need an agent. You need a workflow, which is dramatically simpler to build and maintain.

Workflow example

Process incoming support emails: extract sentiment, classify into one of 5 categories, look up the customer record, draft a templated reply.

Same 4 steps every time. Predictable. Easy to debug. Cheap to run. Don't build an agent for this.
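In code, a workflow like this is just a function that calls its steps in a fixed order. The sketch below is a minimal illustration: the step names and the WorkflowSteps interface are hypothetical, and the steps are injected so the example runs without any API (in practice the sentiment, classification and drafting steps would wrap LLM calls).

```typescript
// Hypothetical step signatures for the support-email workflow described above.
interface WorkflowSteps {
  extractSentiment(body: string): Promise<string>
  classify(body: string): Promise<string>
  lookupCustomer(sender: string): Promise<{ name: string }>
  draftReply(ctx: { sentiment: string; category: string; customerName: string }): Promise<string>
}

async function handleSupportEmail(
  sender: string,
  body: string,
  steps: WorkflowSteps
): Promise<string> {
  // The order is fixed in code: the model never decides what happens next.
  const sentiment = await steps.extractSentiment(body)
  const category = await steps.classify(body)
  const customer = await steps.lookupCustomer(sender)
  return steps.draftReply({ sentiment, category, customerName: customer.name })
}
```

Because the control flow lives in your code rather than in the model, a failure points at exactly one step.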

Agent example

Triage support tickets where the LLM decides: do I have enough info? Should I check the order system, the billing system, the shipping API, or escalate?

Variable paths. The LLM picks tools based on what it discovers. This needs a real agent.

We'll use the second example — a customer support triage agent — as the running case throughout this guide.

The 4 components every agent has

Strip away the framework hype and every agent is just these four pieces. Get any one wrong and the whole thing fails.

Reasoning engine

The LLM that decides what to do next. Claude, GPT, or Gemini.

Tools

Functions the LLM can call to take action — query a DB, send an email, hit an API.

Memory

What persists between calls — conversation history, prior tool results, long-term facts.

Reasoning loop

The orchestration: receive input, plan, call tool, evaluate, repeat — with validation and limits.

Decision 1: Pick your model

Four models from three providers matter for production agents in 2026. Pricing below is per million tokens (input / output) as of mid-2026; verify current rates before budgeting. The recommendation: start with the cheapest model that passes your evals. Most teams default to flagship models out of habit and burn 5x more than necessary.

Model	Pricing (per 1M tokens)	Best for
Claude Haiku 4.5	$1 / $5	Default starting point. Fast, cheap, strong tool use.
Claude Sonnet 4.5	$3 / $15	Complex reasoning, long-context tasks, careful judgment.
GPT-4o	$2.50 / $10	Broad ecosystem, vision, function calling maturity.
Gemini 2.5 Pro	$1.25 / $10	Massive context windows (1M+), multimodal.

Practical rule: build your eval set first, run it on the cheapest viable model, then upgrade only when you hit a wall. Most agent tasks do not need flagship reasoning.
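One way to make that rule mechanical is to sort candidates by price and pick the first one that clears your eval threshold. This is a sketch under assumptions: the pass rates are inputs from whatever eval harness you run, the sample pass rates below are made up for illustration, and ranking by the sum of input and output price is a crude proxy for real cost.

```typescript
interface ModelCandidate {
  name: string
  inputPricePerM: number  // USD per 1M input tokens
  outputPricePerM: number // USD per 1M output tokens
}

// passRates maps model name to the fraction of eval cases passed (0 to 1).
// It is supplied by your eval harness; nothing here computes it.
function cheapestPassingModel(
  candidates: ModelCandidate[],
  passRates: Record<string, number>,
  threshold: number
): ModelCandidate | undefined {
  return [...candidates]
    .sort((a, b) =>
      (a.inputPricePerM + a.outputPricePerM) - (b.inputPricePerM + b.outputPricePerM))
    .find(m => (passRates[m.name] ?? 0) >= threshold)
}

const candidates: ModelCandidate[] = [
  { name: "Claude Sonnet 4.5", inputPricePerM: 3, outputPricePerM: 15 },
  { name: "Claude Haiku 4.5", inputPricePerM: 1, outputPricePerM: 5 }
]
// Made-up pass rates, for illustration only
const examplePassRates = { "Claude Haiku 4.5": 0.86, "Claude Sonnet 4.5": 0.95 }

console.log(cheapestPassingModel(candidates, examplePassRates, 0.9)?.name) // Claude Sonnet 4.5
console.log(cheapestPassingModel(candidates, examplePassRates, 0.8)?.name) // Claude Haiku 4.5
```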

Decision 2: Design your tools

Tools are what turn an LLM into an agent. The model needs three things to use a tool reliably: a clear name, a description that explains when to use it (not just what it does), and a strict input schema. Here's a real tool definition for our customer support agent — using Anthropic's tool use format, but the structure is similar across providers:

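import Anthropic from "@anthropic-ai/sdk"

// Typed as Anthropic.Tool[] so input_schema.type is checked as the literal "object"
const tools: Anthropic.Tool[] = [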
  {
    name: "lookup_order",
    description: "Look up a customer order by email or order ID. Use this when the customer asks about delivery status, items in their order, or refund eligibility.",
    input_schema: {
      type: "object",
      properties: {
        email: { type: "string", description: "Customer email" },
        order_id: { type: "string", description: "Order ID like ORD-12345" }
      },
      required: []
    }
  },
  {
    name: "check_shipping_status",
    description: "Get real-time shipping status from the carrier. Use AFTER lookup_order has confirmed the order exists and shipped.",
    input_schema: {
      type: "object",
      properties: {
        tracking_number: { type: "string" }
      },
      required: ["tracking_number"]
    }
  },
  {
    name: "escalate_to_human",
    description: "Hand off to a human agent. Use when the issue involves refunds over $500, account security, or the customer explicitly asks for a person.",
    input_schema: {
      type: "object",
      properties: {
        reason: { type: "string", description: "Why escalation is needed" },
        priority: { type: "string", enum: ["low", "medium", "high"] }
      },
      required: ["reason", "priority"]
    }
  }
]

Three tool design rules that matter:

Descriptions explain when to use, not just what. The model picks tools based on the description. "Look up order by email" is weak. "Use this when the customer asks about delivery status" tells the LLM the trigger.
Fewer tools beat more tools. Past 10-15 tools, models start choosing wrong. If you need more, group related ones into a single tool with an action parameter.
Schemas should be strict and minimal. Loose schemas create parameter hallucinations. Mark fields required only when truly required. Use enums for fixed value sets.
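To illustrate the second rule, here is what grouping related operations behind an action parameter might look like. The tool name, actions and fields are hypothetical and are not part of the support agent defined above.

```typescript
// Hypothetical example: three order operations folded into one tool.
const manageOrderTool = {
  name: "manage_order",
  description:
    "Read or change an existing order. Use action 'get' when the customer asks about an order, " +
    "'cancel' when they ask to cancel before shipping, and 'update_address' when they ask to " +
    "change the delivery address.",
  input_schema: {
    type: "object",
    properties: {
      action: {
        type: "string",
        enum: ["get", "cancel", "update_address"],
        description: "Which operation to perform"
      },
      order_id: { type: "string", description: "Order ID like ORD-12345" },
      new_address: { type: "string", description: "Only for action 'update_address'" }
    },
    required: ["action", "order_id"]
  }
} as const

console.log(manageOrderTool.input_schema.properties.action.enum.length) // 3
```

The trade-off: the model now sees one tool instead of three, but your handler has to check that fields like new_address are present for the actions that need them.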

Decision 3: Memory

Memory has three tiers and most teams overbuild this. Use the simplest tier that meets your needs — vector DBs are not free.

Tier 1: In-context

When: Single conversation, fits in the context window.

How: Just pass message history with each call. No infrastructure. Works for most agents under 10 turns.

Tier 2: Session-persistent

When: Multi-session conversations or workflows that span hours/days.

How: Store messages array in Postgres or Redis keyed by session ID. Retrieve on next call. Cheap, fast, deterministic.
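A minimal sketch of what Tier 2 needs from storage. The SessionStore interface and class names are assumptions for illustration; the in-memory version stands in for a Postgres- or Redis-backed implementation so the example runs without a database.

```typescript
type StoredMessage = { role: "user" | "assistant"; content: unknown }

// The two operations a session-persistent agent needs from its store.
interface SessionStore {
  load(sessionId: string): Promise<StoredMessage[]>
  append(sessionId: string, ...messages: StoredMessage[]): Promise<void>
}

// In-memory stand-in; a real implementation would read and write a database row or key.
class InMemorySessionStore implements SessionStore {
  private sessions = new Map<string, StoredMessage[]>()

  async load(sessionId: string): Promise<StoredMessage[]> {
    // Return a copy so callers cannot mutate stored history by accident
    return [...(this.sessions.get(sessionId) ?? [])]
  }

  async append(sessionId: string, ...messages: StoredMessage[]): Promise<void> {
    const existing = this.sessions.get(sessionId) ?? []
    this.sessions.set(sessionId, [...existing, ...messages])
  }
}
```

On each request you would load the history for the session ID, run the agent with it, then append the new turns.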

Tier 3: Long-term semantic

When: Agent needs to recall facts across thousands of past interactions.

How: Vector DB (Supabase pgvector, Pinecone, Weaviate) with embedding search. Adds complexity and cost — only build this when Tier 2 demonstrably fails.

Decision 4: The reasoning loop

The loop is where most agents fail. Here's the pattern that actually works in production — note the iteration cap, the schema validation, and the explicit error handling. Skip any of these and your agent will eventually hit a state that costs you real money:

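import Anthropic from "@anthropic-ai/sdk"

// The client reads ANTHROPIC_API_KEY from the environment by default
const anthropic = new Anthropic()

// Assumes `tools` (defined above) and `executeToolSafely`, which validates the call
// against the tool schema, runs it, and returns the result as a string
async function runAgent(userMessage: string) {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: userMessage }]
  const MAX_ITERATIONS = 10
  let iterations = 0

  while (iterations < MAX_ITERATIONS) {
    const response = await anthropic.messages.create({
      model: "claude-haiku-4-5",
      max_tokens: 1024,
      tools,
      messages
    })

    // Agent finished: return the final message
    if (response.stop_reason === "end_turn") {
      return response
    }

    // Any other stop reason (e.g. max_tokens) would just repeat the same request, so fail loudly
    if (response.stop_reason !== "tool_use") {
      throw new Error(`Unexpected stop_reason: ${response.stop_reason}`)
    }

    // One response can contain several tool_use blocks; each needs its own tool_result
    const toolUses = response.content.filter(
      (c): c is Anthropic.ToolUseBlock => c.type === "tool_use"
    )
    if (toolUses.length === 0) throw new Error("Expected tool_use block")

    const toolResults: Anthropic.ToolResultBlockParam[] = []
    for (const toolUse of toolUses) {
      try {
        const result = await executeToolSafely(toolUse.name, toolUse.input)
        toolResults.push({ type: "tool_result", tool_use_id: toolUse.id, content: result })
      } catch (err) {
        // Report the failure to the model so it can self-correct on the next iteration
        toolResults.push({
          type: "tool_result",
          tool_use_id: toolUse.id,
          content: err instanceof Error ? err.message : String(err),
          is_error: true
        })
      }
    }

    // Append assistant turn + tool results, then loop
    messages.push({ role: "assistant", content: response.content })
    messages.push({ role: "user", content: toolResults })

    iterations++
  }

  throw new Error("Agent exceeded max iterations, possible loop")
}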

The 5 production failure modes

These will all happen to you. Plan for them on day one: retrofitting the fixes after launch is much harder.

1. Hallucinated tool calls

Symptom: Agent invents tool names or parameters that don't exist in your schema.

Fix: Validate every tool call against your schema before execution. On failure, return a structured error to the model so it can self-correct on the next iteration.
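A minimal sketch of that validation step, assuming flat schemas like the ones in this guide (string, number or boolean fields, optional enums). The function name and types are illustrative; in production you would more likely use a full JSON Schema validator such as Ajv than a hand-rolled check like this.

```typescript
type PropertySchema = { type: "string" | "number" | "boolean"; enum?: readonly string[] }

interface ToolSchema {
  name: string
  input_schema: {
    properties: Record<string, PropertySchema>
    required: readonly string[]
  }
}

// Returns human-readable problems; an empty array means the call is valid.
// The messages are meant to be sent back to the model as the tool result.
function validateToolCall(schemas: ToolSchema[], name: string, input: unknown): string[] {
  const tool = schemas.find(t => t.name === name)
  if (!tool) {
    return [`Unknown tool "${name}". Available tools: ${schemas.map(t => t.name).join(", ")}`]
  }
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    return ["Tool input must be a JSON object"]
  }
  const args = input as Record<string, unknown>
  const props = tool.input_schema.properties
  const errors: string[] = []

  for (const key of tool.input_schema.required) {
    if (!(key in args)) errors.push(`Missing required field "${key}"`)
  }
  for (const [key, value] of Object.entries(args)) {
    const prop = Object.prototype.hasOwnProperty.call(props, key) ? props[key] : undefined
    if (!prop) {
      errors.push(`Unexpected field "${key}"`)
    } else if (typeof value !== prop.type) {
      errors.push(`Field "${key}" must be ${prop.type}, got ${typeof value}`)
    } else if (prop.enum && !prop.enum.includes(value as string)) {
      errors.push(`Field "${key}" must be one of: ${prop.enum.join(", ")}`)
    }
  }
  return errors
}
```

Listing the available tool names in the unknown-tool error gives the model what it needs to correct itself on the next turn.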

2. Infinite loops

Symptom: Agent calls the same tool over and over, or oscillates between two tools, never reaching end_turn.

Fix: Hard iteration cap (10-15 is reasonable for most agents). Track tool call patterns and break loops if the same call repeats more than 3 times.
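The repeat check can be as small as a counter keyed on the tool name plus its arguments. A sketch, with a hypothetical class name; it also catches oscillation between two tools, because each of the two calls keeps repeating.

```typescript
// Counts identical (tool name + arguments) calls within one agent run.
class RepeatCallGuard {
  private counts = new Map<string, number>()
  private readonly maxRepeats: number

  constructor(maxRepeats: number = 3) {
    this.maxRepeats = maxRepeats
  }

  // Returns true once this exact call has been seen more than maxRepeats times.
  // Note: JSON.stringify is key-order sensitive, so {a, b} and {b, a} count separately.
  shouldBreak(toolName: string, input: unknown): boolean {
    const key = `${toolName}:${JSON.stringify(input)}`
    const seen = (this.counts.get(key) ?? 0) + 1
    this.counts.set(key, seen)
    return seen > this.maxRepeats
  }
}

const guard = new RepeatCallGuard()
for (let i = 1; i <= 4; i++) {
  console.log(i, guard.shouldBreak("lookup_order", { order_id: "ORD-12345" }))
}
// 1 false, 2 false, 3 false, 4 true
```

Create one guard per agent run and check it before executing each tool call, alongside the iteration cap rather than instead of it.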

3. Parameter type errors

Symptom: Agent passes a string where you expected a number, or omits a required field.

Fix: Strict JSON schema validation on every tool call. Coerce types where safe, error explicitly otherwise. Always include type constraints in your schema descriptions.
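"Coerce where safe" deserves a concrete definition. One conservative reading, sketched below with a hypothetical helper: accept real numbers and plain decimal strings, and return an explicit error for everything else so the model can retry.

```typescript
type CoerceResult = { ok: true; value: number } | { ok: false; error: string }

// Accepts numbers and plain decimal strings like "42" or "-3.5"; rejects everything else.
function coerceToNumber(field: string, raw: unknown): CoerceResult {
  if (typeof raw === "number" && Number.isFinite(raw)) {
    return { ok: true, value: raw }
  }
  if (typeof raw === "string" && /^-?\d+(\.\d+)?$/.test(raw.trim())) {
    return { ok: true, value: Number(raw.trim()) }
  }
  return {
    ok: false,
    error: `Field "${field}" must be a number, got ${JSON.stringify(raw)}`
  }
}

console.log(coerceToNumber("amount", "42"))    // { ok: true, value: 42 }
console.log(coerceToNumber("amount", "42abc")) // { ok: false, error: ... }
```

The regex deliberately rejects forms like "1e3" or "0x10": they parse as numbers in JavaScript, but guessing what the model meant is how silent data bugs start.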

4. Context window blow-out

Symptom: After 20+ turns the agent slows to a crawl and costs spike. Worst case: it errors out from exceeding the context limit.

Fix: Summarise old messages once history exceeds a threshold (e.g. > 50K tokens). Replace the oldest 20 messages with a 200-token summary. Lose detail, keep coherence.
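A sketch of the compaction step, under simplifying assumptions: messages are plain text, token counts are estimated at roughly 4 characters per token rather than measured with a real tokenizer, and the summary text is produced elsewhere (for example by a separate LLM call capped at around 200 tokens). A real implementation must also avoid cutting between a tool_use block and its tool_result.

```typescript
interface ChatMessage { role: "user" | "assistant"; content: string }

// Rough heuristic, not a tokenizer: about 4 characters per token.
function estimateTokens(messages: ChatMessage[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0)
  return Math.ceil(chars / 4)
}

function needsCompaction(messages: ChatMessage[], thresholdTokens: number = 50000): boolean {
  return estimateTokens(messages) > thresholdTokens
}

// Replaces the oldest `count` messages with one summary message.
// Returns the input unchanged when there is nothing older than `count` to keep.
function compactHistory(
  messages: ChatMessage[],
  summary: string,
  count: number = 20
): ChatMessage[] {
  if (messages.length <= count) return messages
  const summaryMessage: ChatMessage = {
    role: "user",
    content: `Summary of earlier conversation: ${summary}`
  }
  return [summaryMessage, ...messages.slice(count)]
}
```

In the loop, you would check needsCompaction before each model call and swap in the compacted history when it returns true.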

5. Cost runaway

Symptom: A single misbehaving conversation hits $50 in API costs before you notice.

Fix: Per-conversation cost cap that hard-stops the loop. Real-time cost tracking exposed to whoever's on call. Alerting at $X per session.
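A per-session budget can be a small object that you feed the token usage from every response. The class below is a hypothetical sketch: the Usage shape mirrors the input_tokens and output_tokens fields the Anthropic API reports on each response, and the prices are whatever your chosen model charges per million tokens.

```typescript
interface Usage { input_tokens: number; output_tokens: number }

class SessionBudget {
  private spentUsd = 0
  private readonly capUsd: number
  private readonly inputPricePerM: number
  private readonly outputPricePerM: number

  constructor(capUsd: number, inputPricePerM: number, outputPricePerM: number) {
    this.capUsd = capUsd
    this.inputPricePerM = inputPricePerM
    this.outputPricePerM = outputPricePerM
  }

  // Call after every LLM response. Throws once the session has spent more than the cap,
  // which stops the agent loop; otherwise returns the running total in USD.
  record(usage: Usage): number {
    this.spentUsd +=
      (usage.input_tokens * this.inputPricePerM +
        usage.output_tokens * this.outputPricePerM) / 1e6
    if (this.spentUsd > this.capUsd) {
      throw new Error(
        `Session cost cap exceeded: $${this.spentUsd.toFixed(4)} > $${this.capUsd}`
      )
    }
    return this.spentUsd
  }
}
```

The returned running total is also the number to export to your dashboard and alerting.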

Cost math: what you'll actually pay

Here's a worked calculation for our customer support triage agent at production scale. Assumptions: 1,000 tickets per day, an average of 3 tool calls per ticket counted as 3 LLM calls (the final answer adds one more call, which this estimate ignores), and ~4,000 input tokens (system prompt + tool definitions + conversation context) and ~500 output tokens per LLM call.

Per ticket: 3 LLM calls × (4,000 input + 500 output tokens)
= 12,000 input + 1,500 output tokens per ticket

1,000 tickets/day = 12M input + 1.5M output tokens/day
Monthly (30 days) = 360M input + 45M output tokens

Claude Haiku 4.5: (360 × $1) + (45 × $5) = $585/mo
GPT-4o: (360 × $2.50) + (45 × $10) = $1,350/mo
Claude Sonnet 4.5: (360 × $3) + (45 × $15) = $1,755/mo
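The same calculation as a function, so you can plug in your own volumes. The interface and function names are illustrative; the workload and prices are the ones used above.

```typescript
interface Workload {
  ticketsPerDay: number
  llmCallsPerTicket: number
  inputTokensPerCall: number
  outputTokensPerCall: number
  daysPerMonth: number
}

// Monthly API cost in USD, given prices per 1M input and output tokens.
function monthlyCostUsd(w: Workload, inputPricePerM: number, outputPricePerM: number): number {
  const callsPerMonth = w.ticketsPerDay * w.llmCallsPerTicket * w.daysPerMonth
  const inputMillions = (callsPerMonth * w.inputTokensPerCall) / 1e6
  const outputMillions = (callsPerMonth * w.outputTokensPerCall) / 1e6
  return inputMillions * inputPricePerM + outputMillions * outputPricePerM
}

const triage: Workload = {
  ticketsPerDay: 1000,
  llmCallsPerTicket: 3,
  inputTokensPerCall: 4000,
  outputTokensPerCall: 500,
  daysPerMonth: 30
}

console.log(monthlyCostUsd(triage, 1, 5))    // 585  (Claude Haiku 4.5)
console.log(monthlyCostUsd(triage, 2.5, 10)) // 1350 (GPT-4o)
console.log(monthlyCostUsd(triage, 3, 15))   // 1755 (Claude Sonnet 4.5)
```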

Two things this calculation hides: prompt caching can cut input costs 70-90% if you reuse a stable system prompt, and context growth on long conversations can double these numbers if you don't implement summarisation. Budget conservatively — your real cost will be 1.5-2x your napkin math the first time.

Pre-deploy checklist

Before you put this in front of users, every box below should be ticked. Skipping any of these is how teams end up with $5K surprise bills or angry customers.

☐ Hard iteration cap is set and tested.
☐ Every tool input is validated against its schema before execution.
☐ Per-session cost cap with hard-stop in place.
☐ Cost and latency dashboards live before launch, not after.
☐ Eval set of 50+ representative inputs runs on every code change.
☐ Human-in-the-loop checkpoints for any irreversible action (refunds, sends, deletions).
☐ Errors and tool failures are logged with full context for debugging.
☐ Context summarisation kicks in before hitting model limits.
☐ You've manually traced 10 representative runs end-to-end.
☐ Rollback plan exists if production behaviour diverges from eval.

When you're done iterating

Agents are never "finished." But there's a point where further tuning has diminishing returns and you should stop touching it. You're done iterating when: your eval set passes consistently above your target threshold, the failure modes you see in production are the same as the failure modes you see in eval (no surprises), per-conversation cost is stable and predictable, and the agent can run unattended for a week without you needing to check on it.

Past that point, your time is better spent building the next agent than tuning this one.

Frequently Asked Questions

What is the difference between an AI workflow and an AI agent?

A workflow is a predefined sequence where an LLM is called at fixed steps with predictable inputs and outputs. An agent is a system where an LLM dynamically decides which tools to call and in what order until it accomplishes a goal. Anthropic published this distinction in their December 2024 research note "Building Effective Agents." Most teams should build workflows first and only graduate to true agents when the workflow cannot handle the variability they actually encounter.

When should I build an AI agent vs use an existing one?

Build only when no existing agent solves your specific use case, the integrations you need do not exist as off-the-shelf tools, or you are building proprietary functionality that becomes a competitive moat. Buy when an existing agent covers 70% or more of your needs, your use case is generic (cold email, customer support, scheduling), or the engineering cost would exceed two years of subscriptions to an existing tool.

How much does it cost to run an AI agent in production?

Cost depends on three variables: model pricing, average tokens per call, and call volume. A customer support triage agent processing 1,000 tickets per day, with 3 LLM calls per ticket at 4,000 input tokens and 500 output tokens per call, would cost approximately $585 per month on Claude Haiku 4.5, around $1,350 per month on GPT-4o, and roughly $1,755 per month on Claude Sonnet 4.5. Most teams underestimate context window growth on multi-turn agents: costs can double in long conversations if you do not actively manage context.

What is the most common reason AI agents fail in production?

The most common failure mode is hallucinated tool calls — the agent invents tool names or parameters that do not exist in your schema. The fix is strict schema validation before execution, retry with corrective feedback, and a hard iteration cap to prevent infinite loops. Other common failures include context window blow-out on long conversations, parameter type errors, and unbounded cost from runaway loops.

When should I use multiple agents instead of one?

Use multiple agents when a single agent cannot handle all the tasks reliably within one context window, when tasks are genuinely parallelisable and can run simultaneously, or when different tasks require meaningfully different system prompts and tool sets that would conflict if combined. Most teams reach for multi-agent architecture too early. A well-built single agent with a clear scope and good tool definitions handles the majority of production use cases. Only add a second agent when you have hit a concrete limitation of the first, not in anticipation of limitations you have not encountered yet.

All agents listed are editorially reviewed by The AI Agent Index. See our editorial methodology. Pricing data verified mid-2026 — confirm current rates before budgeting.
