Chapter 16

Error Handling & Reliability

By Zavier Sanders · September 21, 2025

Production reliability patterns: retries, fallbacks, graceful degradation, and monitoring.

Prefer something you can ship today? Start with the Quickstart: Ship One Agent with Mastra, then come back here to deepen the concepts.

The Reliability Challenge

Agentic systems face reliability challenges that traditional applications don't.

The Core Problems

1. LLM Non-Determinism

// Same prompt, different outputs
const result1 = await agent.generate("Summarize this article");
const result2 = await agent.generate("Summarize this article");
// result1 !== result2 (even with temperature=0)
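
You can't eliminate this variance, but you can stop downstream code from depending on exact wording. A minimal sketch, assuming the agent is asked to return JSON and zod is available for validation (the response shape here is an assumption, not the chapter's API):

import { z } from 'zod';

const SummarySchema = z.object({
  summary: z.string().min(1),
  keyPoints: z.array(z.string()),
});

// Validate the model's output against a schema; a schema violation is treated
// like any other transient failure so the retry layer below can handle it.
export async function generateValidatedSummary(
  generate: (prompt: string) => Promise<string>,
  prompt: string
) {
  const raw = await generate(prompt);
  const parsed = SummarySchema.safeParse(JSON.parse(raw)); // non-JSON output throws, which is what we want
  if (!parsed.success) {
    throw new Error(`Model output failed validation: ${parsed.error.message}`);
  }
  return parsed.data;
}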

2. External API Failures

// Your agent depends on multiple services
// Google Calendar API -> Down (503)
// HubSpot CRM API     -> Rate limited (429)
// OpenAI API          -> Timeout
// Slack API           -> Success
// Result: Partial failure - what do you do?
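
One way to keep a single failure from sinking the whole run is to fan the calls out with Promise.allSettled and inspect each outcome. The fetch helpers below are hypothetical stand-ins for the services above:

// Hypothetical helpers standing in for the four services above
declare function fetchCalendarEvents(): Promise<unknown>;
declare function fetchCrmContacts(): Promise<unknown>;
declare function callOpenAI(): Promise<unknown>;
declare function postToSlack(): Promise<unknown>;

const settled = await Promise.allSettled([
  fetchCalendarEvents(),
  fetchCrmContacts(),
  callOpenAI(),
  postToSlack(),
]);

const failures = settled.filter(
  (r): r is PromiseRejectedResult => r.status === 'rejected'
);
console.warn(`${failures.length} of ${settled.length} upstream calls failed`);
// Decide: retry the failures, degrade gracefully, or surface them to the user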

3. Rate Limits and Quotas

// OpenAI: 10,000 TPM (tokens per minute)
// HubSpot: 100 requests per 10 seconds
// Google Calendar: 1,000,000 queries per day

// Your agent makes 50 API calls in a workflow
// One rate limit breaks the entire pipeline
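
Spending your quota deliberately on the client side is usually cheaper than recovering from 429s after the fact. A minimal sketch, assuming a fixed-window budget per provider (token-based quotas like TPM would need a variant that tracks tokens, not requests):

// lib/rate-limiter.ts (sketch)
export class RateLimiter {
  private timestamps: number[] = [];

  constructor(
    private maxRequests: number, // budget per window
    private windowMs: number
  ) {}

  async acquire(): Promise<void> {
    // Loop until a slot frees up inside the rolling window
    while (true) {
      const now = Date.now();
      this.timestamps = this.timestamps.filter((t) => now - t < this.windowMs);

      if (this.timestamps.length < this.maxRequests) {
        this.timestamps.push(now);
        return;
      }

      // Sleep until the oldest request ages out of the window
      const waitMs = this.windowMs - (now - this.timestamps[0]);
      await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
  }
}

// e.g. HubSpot's 100 requests per 10 seconds
const hubspotLimiter = new RateLimiter(100, 10_000);
await hubspotLimiter.acquire();
// ...then make the HubSpot call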

Why This Matters

Without proper error handling:

  • 🔴 Agent crashes leave users confused
  • 🔴 Partial data corruption (some tools succeed, others fail)
  • 🔴 Cascading failures across systems
  • 🔴 No visibility into what went wrong
  • 🔴 Manual intervention required for recovery

With proper error handling:

  • ✅ Graceful degradation
  • ✅ Automatic retries
  • ✅ Clear error messages
  • ✅ Partial success handling
  • ✅ Self-healing systems

Retry Strategies

Exponential Backoff

The gold standard for retrying failed requests.

// lib/retry.ts
export interface RetryOptions {
  maxRetries?: number;
  baseDelay?: number;
  maxDelay?: number;
  backoffMultiplier?: number;
  retryableErrors?: number[];
}

export class RetryStrategy {
  async withExponentialBackoff<T>(
    fn: () => Promise<T>,
    options: RetryOptions = {}
  ): Promise<T> {
    const {
      maxRetries = 3,
      baseDelay = 1000,
      maxDelay = 30000,
      backoffMultiplier = 2,
      retryableErrors = [429, 500, 502, 503, 504],
    } = options;

    let lastError: Error;

    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return await fn();
      } catch (error) {
        lastError = error as Error;

        // Don't retry on non-retryable errors
        if (!this.isRetryable(error, retryableErrors)) {
          throw error;
        }

        // Last attempt - throw
        if (attempt === maxRetries) {
          throw new Error(
            `Max retries (${maxRetries}) exceeded. Last error: ${lastError.message}`
          );
        }

        // Calculate delay with exponential backoff
        const delay = Math.min(
          baseDelay * Math.pow(backoffMultiplier, attempt),
          maxDelay
        );

        // Add jitter to prevent thundering herd
        const jitter = Math.random() * 0.3 * delay;
        const totalDelay = delay + jitter;

        console.log(
          `Retry attempt ${attempt + 1}/${maxRetries} after ${Math.round(totalDelay)}ms`
        );

        await this.sleep(totalDelay);
      }
    }

    throw lastError!;
  }

  private isRetryable(error: any, retryableCodes: number[]): boolean {
    // HTTP errors
    if (error.status && retryableCodes.includes(error.status)) {
      return true;
    }

    // Network errors
    if (error.code === 'ECONNRESET' || error.code === 'ETIMEDOUT') {
      return true;
    }

    // OpenAI-specific errors (quota exhaustion is permanent, so don't retry it)
    if (error.type === 'server_error') {
      return true;
    }

    return false;
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}

// Usage
const retry = new RetryStrategy();

const result = await retry.withExponentialBackoff(
  () => openai.chat.completions.create({ /* ... */ }),
  { maxRetries: 5 }
);

Circuit Breaker

Prevent cascading failures by "breaking" the circuit after repeated failures.

// lib/circuit-breaker.ts
export class CircuitBreaker {
  private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';
  private failureCount = 0;
  private lastFailureTime?: Date;
  private successCount = 0;

  constructor(
    private options: {
      failureThreshold: number; // Open after N failures
      resetTimeout: number; // Try again after N ms
      successThreshold: number; // Close after N successes in HALF_OPEN
    } = {
      failureThreshold: 5,
      resetTimeout: 60000, // 1 minute
      successThreshold: 2,
    }
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      // Check if we should try again
      if (this.shouldAttemptReset()) {
        this.state = 'HALF_OPEN';
        console.log('Circuit breaker: HALF_OPEN - attempting reset');
      } else {
        throw new Error(
          `Circuit breaker is OPEN. Too many failures. Retry after ${this.getRetryAfter()}ms`
        );
      }
    }

    try {
      const result = await fn();

      // Success handling
      if (this.state === 'HALF_OPEN') {
        this.successCount++;
        
        if (this.successCount >= this.options.successThreshold) {
          this.reset();
          console.log('Circuit breaker: CLOSED - service recovered');
        }
      } else {
        this.reset();
      }

      return result;
    } catch (error) {
      this.recordFailure();
      throw error;
    }
  }

  private recordFailure(): void {
    this.failureCount++;
    this.lastFailureTime = new Date();
    this.successCount = 0;

    if (this.failureCount >= this.options.failureThreshold) {
      this.state = 'OPEN';
      console.error(
        `Circuit breaker: OPEN - ${this.failureCount} consecutive failures`
      );
    }
  }

  private shouldAttemptReset(): boolean {
    if (!this.lastFailureTime) return false;
    
    const timeSinceFailure = Date.now() - this.lastFailureTime.getTime();
    return timeSinceFailure >= this.options.resetTimeout;
  }

  private getRetryAfter(): number {
    if (!this.lastFailureTime) return 0;
    
    const elapsed = Date.now() - this.lastFailureTime.getTime();
    return Math.max(0, this.options.resetTimeout - elapsed);
  }

  private reset(): void {
    this.state = 'CLOSED';
    this.failureCount = 0;
    this.successCount = 0;
    this.lastFailureTime = undefined;
  }

  getState() {
    return {
      state: this.state,
      failureCount: this.failureCount,
      successCount: this.successCount,
    };
  }
}

// Usage
const openAICircuit = new CircuitBreaker();

const result = await openAICircuit.execute(async () => {
  return await openai.chat.completions.create({ /* ... */ });
});
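
In practice you would keep one breaker per upstream dependency, so that a HubSpot outage doesn't block OpenAI calls. A small registry sketch (the HubSpot helper is hypothetical):

// One breaker per upstream dependency, so failures stay isolated (sketch)
const breakers = new Map<string, CircuitBreaker>();

function breakerFor(service: string): CircuitBreaker {
  if (!breakers.has(service)) {
    breakers.set(service, new CircuitBreaker());
  }
  return breakers.get(service)!;
}

// Each service trips and recovers independently
const crmData = await breakerFor('hubspot').execute(() =>
  fetchHubSpotContacts() // hypothetical helper
);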

Combined: Retry + Circuit Breaker

// lib/resilient-call.ts
export class ResilientCall {
  private circuitBreaker: CircuitBreaker;
  private retryStrategy: RetryStrategy;

  constructor() {
    this.circuitBreaker = new CircuitBreaker();
    this.retryStrategy = new RetryStrategy();
  }

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    return this.circuitBreaker.execute(async () => {
      return this.retryStrategy.withExponentialBackoff(fn);
    });
  }
}

// Usage in agents
const resilient = new ResilientCall();

const result = await resilient.execute(() =>
  openai.chat.completions.create({ /* ... */ })
);

Fallback Patterns

Model Fallbacks

When the primary model fails, fall back to alternatives.

// lib/model-fallback.ts
export class ModelFallback {
  private models = [
    { provider: 'openai', model: 'gpt-4o', priority: 1 },
    { provider: 'openai', model: 'gpt-4o-mini', priority: 2 },
    { provider: 'anthropic', model: 'claude-3-5-sonnet', priority: 3 },
  ];

  async generate(prompt: string): Promise<string> {
    let lastError: Error;

    for (const config of this.models) {
      try {
        console.log(`Trying ${config.provider}/${config.model}...`);
        
        const result = await this.callModel(config, prompt);
        
        console.log(`✓ Success with ${config.provider}/${config.model}`);
        return result;
      } catch (error) {
        lastError = error as Error;
        console.warn(
          `✗ Failed with ${config.provider}/${config.model}: ${lastError.message}`
        );
        
        // Continue to next model
      }
    }

    throw new Error(
      `All models failed. Last error: ${lastError!.message}`
    );
  }

  private async callModel(
    config: { provider: string; model: string },
    prompt: string
  ): Promise<string> {
    switch (config.provider) {
      case 'openai':
        return this.callOpenAI(config.model, prompt);
      case 'anthropic':
        return this.callAnthropic(config.model, prompt);
      default:
        throw new Error(`Unknown provider: ${config.provider}`);
    }
  }

  private async callOpenAI(model: string, prompt: string): Promise<string> {
    const response = await openai.chat.completions.create({
      model,
      messages: [{ role: 'user', content: prompt }],
    });
    
    return response.choices[0].message.content || '';
  }

  private async callAnthropic(model: string, prompt: string): Promise<string> {
    // Placeholder - wire up the Anthropic SDK here. Throwing keeps the fallback
    // loop honest instead of "succeeding" with an empty string.
    throw new Error('Anthropic provider not implemented');
  }
}
}

Tool Fallbacks

When tool execution fails, try alternatives.

// lib/tool-fallback.ts
export class ToolFallback {
  async searchWeb(query: string): Promise<SearchResult[]> {
    const strategies = [
      { name: 'Tavily', fn: () => this.searchWithTavily(query) },
      { name: 'Brave', fn: () => this.searchWithBrave(query) },
      { name: 'DuckDuckGo', fn: () => this.searchWithDDG(query) },
    ];

    for (const strategy of strategies) {
      try {
        console.log(`Searching with ${strategy.name}...`);
        const results = await strategy.fn();
        
        if (results.length > 0) {
          return results;
        }
      } catch (error) {
        console.warn(`${strategy.name} search failed:`, error);
      }
    }

    // All failed - return empty results with warning
    console.error('All search providers failed');
    return [];
  }

  async sendNotification(message: string): Promise<void> {
    const channels = [
      { name: 'Slack', fn: () => this.sendSlack(message) },
      { name: 'Email', fn: () => this.sendEmail(message) },
      { name: 'Webhook', fn: () => this.sendWebhook(message) },
    ];

    let succeeded = false;

    for (const channel of channels) {
      try {
        await channel.fn();
        console.log(`✓ Sent via ${channel.name}`);
        succeeded = true;
        break; // Success - stop trying
      } catch (error) {
        console.warn(`${channel.name} failed:`, error);
      }
    }

    if (!succeeded) {
      throw new Error('Failed to send notification via any channel');
    }
  }
}
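
The search and notification helpers (searchWithTavily, sendSlack, and friends) are assumed to exist elsewhere in the project. One plausible shape for the shared result type, sketched as an assumption:

// Assumed search result shape for the web-search fallback above
interface SearchResult {
  title: string;
  url: string;
  snippet: string;
}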

Default Responses

Provide safe defaults when all else fails.

// lib/default-responses.ts
export class DefaultResponseHandler {
  async generateSafely(
    agent: Agent,
    prompt: string,
    options?: {
      defaultResponse?: string;
      timeout?: number;
    }
  ): Promise<string> {
    const timeout = options?.timeout || 30000;
    const defaultResponse =
      options?.defaultResponse ||
      "I'm having trouble processing your request right now. Please try again later.";

    try {
      // Race between generation and timeout
      const result = await Promise.race([
        agent.generate(prompt),
        this.timeoutPromise(timeout),
      ]);

      if (typeof result === 'string' && result.length > 0) {
        return result;
      }

      // Empty response - use default
      return defaultResponse;
    } catch (error) {
      console.error('Agent generation failed:', error);
      
      // Log error for monitoring
      this.logError(error, { prompt, agent: agent.name });
      
      return defaultResponse;
    }
  }

  private timeoutPromise(ms: number): Promise<never> {
    return new Promise((_, reject) =>
      setTimeout(() => reject(new Error('Timeout')), ms)
    );
  }

  private logError(error: any, context: any): void {
    // Send to error tracking service
    console.error('Error context:', context);
  }
}

Error Recovery

Graceful Degradation

Handle partial failures without breaking the entire workflow.

// lib/graceful-degradation.ts
export class MeetingBriefingWithFallback {
  async generateBriefing(meetingId: string): Promise<Briefing> {
    const results = await this.gatherContextWithFallbacks(meetingId);

    // Generate briefing with whatever data we got
    const briefing = await this.generateFromPartialData(results);

    // Add warnings about missing data
    briefing.warnings = this.generateWarnings(results);

    return briefing;
  }

  private async gatherContextWithFallbacks(
    meetingId: string
  ): Promise<PartialContext> {
    const results: PartialContext = {
      meeting: null,
      crm: null,
      support: null,
      slack: null,
      errors: [],
    };

    // Try to get meeting details
    try {
      results.meeting = await this.getMeetingDetails(meetingId);
    } catch (error) {
      results.errors.push({
        source: 'calendar',
        error: (error as Error).message,
      });
    }

    // Try to get CRM data
    try {
      results.crm = await this.getCRMData(meetingId);
    } catch (error) {
      results.errors.push({
        source: 'crm',
        error: (error as Error).message,
      });
    }

    // Try to get support tickets
    try {
      results.support = await this.getSupportTickets(meetingId);
    } catch (error) {
      results.errors.push({
        source: 'support',
        error: (error as Error).message,
      });
    }

    // Try to get Slack history
    try {
      results.slack = await this.getSlackHistory(meetingId);
    } catch (error) {
      results.errors.push({
        source: 'slack',
        error: (error as Error).message,
      });
    }

    return results;
  }

  private async generateFromPartialData(
    context: PartialContext
  ): Promise<Briefing> {
    const prompt = this.buildPromptFromPartial(context);
    
    const result = await agent.generate(prompt);

    return {
      summary: result.summary,
      keyPoints: result.keyPoints,
      dataCompleteness: this.calculateCompleteness(context),
    };
  }

  private generateWarnings(context: PartialContext): string[] {
    const warnings: string[] = [];

    if (!context.crm) {
      warnings.push('⚠️ CRM data unavailable - missing account context');
    }

    if (!context.support) {
      warnings.push('⚠️ Support tickets unavailable - may miss recent issues');
    }

    if (!context.slack) {
      warnings.push('⚠️ Slack history unavailable - missing recent conversations');
    }

    if (context.errors.length > 0) {
      warnings.push(
        `⚠️ ${context.errors.length} data sources failed to load`
      );
    }

    return warnings;
  }

  private calculateCompleteness(context: PartialContext): number {
    const sources = [
      context.meeting,
      context.crm,
      context.support,
      context.slack,
    ];
    
    const available = sources.filter(Boolean).length;
    return (available / sources.length) * 100;
  }
}
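
The briefing example references a few types that aren't defined in this chapter. One plausible shape, sketched purely as an assumption:

// Assumed shapes for the graceful-degradation example above
interface SourceError {
  source: 'calendar' | 'crm' | 'support' | 'slack';
  error: string;
}

interface PartialContext {
  meeting: unknown;
  crm: unknown;
  support: unknown;
  slack: unknown;
  errors: SourceError[];
}

interface Briefing {
  summary: string;
  keyPoints: string[];
  dataCompleteness: number; // 0-100: share of sources that loaded
  warnings?: string[];
}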

Partial Success Handling

// lib/partial-success.ts
export class BatchProcessor {
  async processBatch<T, R>(
    items: T[],
    processor: (item: T) => Promise<R>
  ): Promise<BatchResult<R>> {
    const results: R[] = [];
    const errors: BatchError[] = [];

    for (let i = 0; i < items.length; i++) {
      const item = items[i];
      
      try {
        const result = await processor(item);
        results.push(result);
      } catch (error) {
        errors.push({
          index: i,
          item,
          error: (error as Error).message,
        });
        
        // Continue processing despite error
        console.warn(`Failed to process item ${i}:`, error);
      }
    }

    return {
      results,
      errors,
      successCount: results.length,
      failureCount: errors.length,
      totalCount: items.length,
      successRate: (results.length / items.length) * 100,
    };
  }
}

// Usage
const processor = new BatchProcessor();

const result = await processor.processBatch(
  articles,
  async (article) => await summarizeArticle(article)
);

console.log(`Processed ${result.successCount}/${result.totalCount} articles`);

if (result.errors.length > 0) {
  console.warn(`${result.errors.length} failures:`, result.errors);
  
  // Retry failed items
  const retryResult = await processor.processBatch(
    result.errors.map(e => e.item),
    async (article) => await summarizeArticle(article)
  );
}
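
As above, the batch processor's result types aren't shown; a minimal sketch of what they might look like:

// Assumed result types for the batch processor above
interface BatchError {
  index: number;
  item: unknown;
  error: string;
}

interface BatchResult<R> {
  results: R[];
  errors: BatchError[];
  successCount: number;
  failureCount: number;
  totalCount: number;
  successRate: number; // percent of items that succeeded
}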

Monitoring & Alerting

Health Checks

// app/api/health/route.ts
import { NextResponse } from 'next/server';

export async function GET() {
  const checks = {
    openai: await checkOpenAI(),
    database: await checkDatabase(),
    slack: await checkSlack(),
    hubspot: await checkHubSpot(),
  };

  const allHealthy = Object.values(checks).every((c) => c.healthy);

  return NextResponse.json(
    {
      status: allHealthy ? 'healthy' : 'degraded',
      timestamp: new Date().toISOString(),
      checks,
    },
    { status: allHealthy ? 200 : 503 }
  );
}

async function checkOpenAI(): Promise<HealthCheck> {
  const start = Date.now();
  
  try {
    await openai.models.list();
    
    return {
      healthy: true,
      latency: Date.now() - start,
      message: 'OK',
    };
  } catch (error) {
    return {
      healthy: false,
      latency: Date.now() - start,
      message: (error as Error).message,
    };
  }
}

async function checkDatabase(): Promise<HealthCheck> {
  const start = Date.now();
  
  try {
    await db.$queryRaw`SELECT 1`;
    
    return {
      healthy: true,
      latency: Date.now() - start,
      message: 'OK',
    };
  } catch (error) {
    return {
      healthy: false,
      latency: Date.now() - start,
      message: (error as Error).message,
    };
  }
}
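
The route assumes module-level openai and db clients plus checkSlack and checkHubSpot probes that follow the same pattern. The shared result shape might look like this (an assumption, not shown in the chapter):

// Assumed result shape shared by every health probe
interface HealthCheck {
  healthy: boolean;
  latency: number; // ms
  message: string;
}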

Error Tracking

// lib/error-tracker.ts
import * as Sentry from '@sentry/node';

export class ErrorTracker {
  static init() {
    Sentry.init({
      dsn: process.env.SENTRY_DSN,
      environment: process.env.NODE_ENV,
      tracesSampleRate: 0.1,
    });
  }

  static captureAgentError(
    error: Error,
    context: {
      agent: string;
      prompt?: string;
      tools?: string[];
      duration?: number;
    }
  ) {
    Sentry.captureException(error, {
      tags: {
        type: 'agent_error',
        agent: context.agent,
      },
      contexts: {
        agent: {
          name: context.agent,
          tools: context.tools,
          duration: context.duration,
        },
      },
      extra: {
        prompt: context.prompt?.substring(0, 500), // First 500 chars
      },
    });
  }

  static captureToolError(
    error: Error,
    context: {
      tool: string;
      args: any;
      duration?: number;
    }
  ) {
    Sentry.captureException(error, {
      tags: {
        type: 'tool_error',
        tool: context.tool,
      },
      extra: {
        args: context.args,
        duration: context.duration,
      },
    });
  }

  static trackPerformance(
    operation: string,
    duration: number,
    metadata?: Record<string, any>
  ) {
    Sentry.metrics.distribution(
      `agent.${operation}.duration`,
      duration,
      {
        tags: metadata,
      }
    );
  }
}

// Usage in agents
try {
  const start = Date.now();
  const result = await agent.generate(prompt);
  
  ErrorTracker.trackPerformance(
    'generation',
    Date.now() - start,
    { agent: agent.name }
  );
  
  return result;
} catch (error) {
  ErrorTracker.captureAgentError(error as Error, {
    agent: agent.name,
    prompt,
  });
  
  throw error;
}

SLA Monitoring

// lib/sla-monitor.ts
export class SLAMonitor {
  private metrics: Map<string, Metric[]> = new Map();

  recordOperation(
    operation: string,
    duration: number,
    success: boolean
  ): void {
    if (!this.metrics.has(operation)) {
      this.metrics.set(operation, []);
    }

    this.metrics.get(operation)!.push({
      duration,
      success,
      timestamp: new Date(),
    });

    // Keep only last 1000 entries
    const metrics = this.metrics.get(operation)!;
    if (metrics.length > 1000) {
      metrics.shift();
    }
  }

  getStats(operation: string): SLAStats {
    const metrics = this.metrics.get(operation) || [];

    if (metrics.length === 0) {
      return {
        availability: 0,
        p50: 0,
        p95: 0,
        p99: 0,
        errorRate: 0,
        totalRequests: 0,
      };
    }

    const successes = metrics.filter((m) => m.success);
    const durations = metrics.map((m) => m.duration).sort((a, b) => a - b);

    return {
      availability: (successes.length / metrics.length) * 100,
      p50: this.percentile(durations, 50),
      p95: this.percentile(durations, 95),
      p99: this.percentile(durations, 99),
      errorRate: ((metrics.length - successes.length) / metrics.length) * 100,
      totalRequests: metrics.length,
    };
  }

  checkSLA(operation: string, sla: SLA): SLAStatus {
    const stats = this.getStats(operation);

    const violations: string[] = [];

    if (stats.availability < sla.minAvailability) {
      violations.push(
        `Availability: ${stats.availability.toFixed(2)}% < ${sla.minAvailability}%`
      );
    }

    if (stats.p95 > sla.maxP95Latency) {
      violations.push(
        `P95 latency: ${stats.p95}ms > ${sla.maxP95Latency}ms`
      );
    }

    if (stats.errorRate > sla.maxErrorRate) {
      violations.push(
        `Error rate: ${stats.errorRate.toFixed(2)}% > ${sla.maxErrorRate}%`
      );
    }

    return {
      healthy: violations.length === 0,
      violations,
      stats,
    };
  }

  private percentile(sorted: number[], p: number): number {
    const index = Math.ceil((sorted.length * p) / 100) - 1;
    return sorted[index] || 0;
  }
}

// Usage
const monitor = new SLAMonitor();

// Record operations
const start = Date.now();
try {
  await agent.generate(prompt);
  monitor.recordOperation('agent.generate', Date.now() - start, true);
} catch (error) {
  monitor.recordOperation('agent.generate', Date.now() - start, false);
}

// Check SLA
const status = monitor.checkSLA('agent.generate', {
  minAvailability: 99.9, // 99.9% uptime
  maxP95Latency: 5000, // 5s P95
  maxErrorRate: 1, // 1% error rate
});

if (!status.healthy) {
  // Alert team
  await slack.postMessage({
    channel: '#alerts',
    text: `SLA violation: ${status.violations.join(', ')}`,
  });
}
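
For completeness, one plausible set of supporting types for the monitor, again sketched as assumptions:

// Assumed supporting types for the SLA monitor above
interface Metric {
  duration: number; // ms
  success: boolean;
  timestamp: Date;
}

interface SLAStats {
  availability: number; // %
  p50: number;
  p95: number;
  p99: number;
  errorRate: number; // %
  totalRequests: number;
}

interface SLA {
  minAvailability: number; // %
  maxP95Latency: number; // ms
  maxErrorRate: number; // %
}

interface SLAStatus {
  healthy: boolean;
  violations: string[];
  stats: SLAStats;
}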

Production Checklist

Must-Haves

// Complete production-ready agent wrapper
export class ProductionAgent {
  private circuit: CircuitBreaker;
  private retry: RetryStrategy;
  private fallback: ModelFallback;
  private monitor: SLAMonitor;

  constructor() {
    this.circuit = new CircuitBreaker();
    this.retry = new RetryStrategy();
    this.fallback = new ModelFallback();
    this.monitor = new SLAMonitor();
  }

  async generate(prompt: string): Promise<string> {
    const start = Date.now();
    let success = false;

    try {
      // Circuit breaker + retry
      const result = await this.circuit.execute(async () => {
        return this.retry.withExponentialBackoff(async () => {
          // Try primary model with fallbacks
          return await this.fallback.generate(prompt);
        });
      });

      success = true;
      return result;
    } catch (error) {
      // Track error
      ErrorTracker.captureAgentError(error as Error, {
        agent: 'production-agent',
        prompt,
        duration: Date.now() - start,
      });

      // Return safe default
      return "I'm experiencing technical difficulties. Please try again.";
    } finally {
      // Record metrics
      this.monitor.recordOperation(
        'agent.generate',
        Date.now() - start,
        success
      );
    }
  }
}

Key Takeaways

  1. Retry with exponential backoff - Handle transient failures
  2. Circuit breakers - Prevent cascading failures
  3. Fallback chains - Multiple models, tools, channels
  4. Graceful degradation - Partial success is better than total failure
  5. Comprehensive monitoring - Health checks, error tracking, SLA monitoring

Production reliability is not optional for agentic systems!
