Chapter 13

Pattern 5: Vision-Based Agents

By Zavier Sanders · September 21, 2025

Give agents eyes with Moondream: build vision-powered agents for security, inventory, document processing, and quality control.

Prefer something you can ship today? Start with the Quickstart: Ship One Agent with Mastra, then come back here to deepen the concepts.

Pattern Overview

💡 Prerequisites: This chapter builds on Chapter 7: Triggers where vision triggers were introduced. You should understand the five trigger types before diving into this complete pattern implementation.

Vision-based agents extend traditional text-based AI systems with visual understanding capabilities. By integrating computer vision models like Moondream, agents can process images, detect objects, extract information from documents, and respond to visual events in real time.

This pattern is particularly powerful because it unlocks entirely new categories of automation that were previously impossible with text-only agents.

Why This Pattern Matters

Most agentic systems today are blind. They can read, write, and reason—but they can't see. This creates massive blind spots:

  • E-commerce platforms can't automatically verify product photos match descriptions
  • Security systems can't intelligently respond to visual threats
  • Healthcare apps can't monitor patients visually
  • Inventory systems can't track stock from camera feeds
  • Document processing requires manual data entry

Vision agents solve this. They give your agents eyes.

Architecture Pattern

┌─────────────────────────────────────────────────────────────┐
│                   VISION AGENT ARCHITECTURE                  │
│                                                              │
│  INPUT LAYER                                                 │
│  ┌────────────────────────────────────────────────────────┐ │
│  │  • Image Upload (User)                                 │ │
│  │  • Camera Feed (IoT/Security)                          │ │
│  │  • Screenshot (Monitoring)                             │ │
│  │  • Document Scan (Mobile)                              │ │
│  └────────────────────────────────────────────────────────┘ │
│                            ↓                                 │
│  VISION LAYER (Moondream)                                    │
│  ┌────────────────────────────────────────────────────────┐ │
│  │  SKILLS:                                               │ │
│  │  • Query: "What's in this image?"                      │ │
│  │  • Detect: Find objects + bounding boxes               │ │
│  │  • Point: Locate coordinates of elements               │ │
│  │  • Caption: Natural language descriptions              │ │
│  └────────────────────────────────────────────────────────┘ │
│                            ↓                                 │
│  AGENT LAYER (Mastra)                                        │
│  ┌────────────────────────────────────────────────────────┐ │
│  │  • Analyze vision results                              │ │
│  │  • Make decisions based on visual data                 │ │
│  │  • Trigger workflows                                   │ │
│  │  • Call additional tools                               │ │
│  └────────────────────────────────────────────────────────┘ │
│                            ↓                                 │
│  ACTION LAYER                                                │
│  ┌────────────────────────────────────────────────────────┐ │
│  │  • Send notifications                                  │ │
│  │  • Update databases                                    │ │
│  │  • Trigger other agents                                │ │
│  │  • Generate reports                                    │ │
│  └────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
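
The vision layer's four skills map directly onto client calls. A minimal sketch, assuming the `vision` client configured in Step 1 below (the point signature is an assumption; check the Moondream docs):

const answer = await vision.query(image, "What's in this image?"); // free-form Q&A
const boxes = await vision.detect(image, 'box');                   // objects + bounding boxes
const points = await vision.point(image, 'barcode');               // coordinates of elements
const caption = await vision.caption(image);                       // natural-language description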

When to Use This Pattern

Perfect for:

  • Real-time monitoring (security, quality control)
  • Document processing (receipts, IDs, forms)
  • Content moderation
  • Inventory management
  • Accessibility features
  • Visual search and discovery

Not ideal for:

  • Pure text-based workflows
  • High-frequency video processing (cost)
  • Real-time video streaming (latency)
  • Tasks requiring human-level visual judgment

Complete Example: Smart Inventory Agent

Let's build a production-ready vision agent that monitors warehouse shelves, detects low stock, and automatically triggers restock workflows.

Step 1: Setup

# Install dependencies
npm install moondream sharp @mastra/core @mastra/openai zod
// lib/vision.ts
import moondream from 'moondream';

export const vision = moondream.vl({
  apiKey: process.env.MOONDREAM_API_KEY!,
});

export type VisionQuery = {
  image: Buffer;
  question: string;
};

export type VisionDetection = {
  image: Buffer;
  object: string;
};
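
To sanity-check the setup, a quick smoke test (./shelf.jpg is a placeholder image; the response shape follows the chapter's usage of the client):

// scripts/vision-smoke-test.ts (run with npx tsx, or similar)
import { readFileSync } from 'fs';
import { vision } from '../lib/vision';

const image = readFileSync('./shelf.jpg'); // any local test image
const result = await vision.query(image, 'How many boxes are on this shelf?');
console.log(result.answer);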

Step 2: Create Vision Tool for Mastra

// mastra/tools/vision-tools.ts
import { createTool } from '@mastra/core';
import { z } from 'zod';
import { vision } from '@/lib/vision';

export const analyzeImageTool = createTool({
  id: 'analyze-image',
  description: 'Analyze an image and answer questions about its contents',
  inputSchema: z.object({
    imageUrl: z.string().describe('URL or base64 encoded image'),
    question: z.string().describe('Question to ask about the image'),
  }),
  outputSchema: z.object({
    answer: z.string(),
    requestId: z.string(),
  }),
  execute: async ({ input }) => {
    // Fetch image if URL, or decode if base64
    const imageBuffer = await fetchImageBuffer(input.imageUrl);
    
    const result = await vision.query(imageBuffer, input.question);
    
    return {
      answer: result.answer,
      requestId: result.request_id,
    };
  },
});

export const detectObjectsTool = createTool({
  id: 'detect-objects',
  description: 'Detect specific objects in an image and get their bounding boxes',
  inputSchema: z.object({
    imageUrl: z.string(),
    objectType: z.string().describe('Type of object to detect (e.g., "person", "box", "pallet")'),
  }),
  outputSchema: z.object({
    objects: z.array(z.object({
      x_min: z.number(),
      y_min: z.number(),
      x_max: z.number(),
      y_max: z.number(),
    })),
    count: z.number(),
  }),
  execute: async ({ input }) => {
    const imageBuffer = await fetchImageBuffer(input.imageUrl);
    
    const result = await vision.detect(imageBuffer, input.objectType);
    
    return {
      objects: result.objects || [],
      count: result.objects?.length || 0,
    };
  },
});

async function fetchImageBuffer(urlOrBase64: string): Promise<Buffer> {
  if (urlOrBase64.startsWith('data:')) {
    const base64Data = urlOrBase64.split(',')[1];
    return Buffer.from(base64Data, 'base64');
  }
  
  const response = await fetch(urlOrBase64);
  return Buffer.from(await response.arrayBuffer());
}

Step 3: Create Inventory Vision Agent

// mastra/agents/inventory-vision-agent.ts
import { Agent } from '@mastra/core';
import { openai } from '@mastra/openai';
import { analyzeImageTool, detectObjectsTool } from '../tools/vision-tools';

export const inventoryVisionAgent = new Agent({
  name: 'Inventory Vision Agent',
  instructions: `You are a warehouse inventory monitoring agent with computer vision capabilities.

Your responsibilities:
1. Analyze shelf images to identify products
2. Count items using object detection
3. Detect low stock situations (< 3 items)
4. Identify misplaced or damaged products
5. Trigger restock workflows when needed
6. Generate inventory reports

When analyzing images:
- Be thorough and accurate
- Use the detect-objects tool for precise counting
- Use the analyze-image tool for qualitative assessment
- Report exact counts and locations
- Flag any anomalies (damaged boxes, wrong placement, etc.)

Always provide actionable recommendations.`,
  
  model: {
    provider: openai,
    name: 'gpt-4o',
    toolChoice: 'auto',
  },
  
  tools: {
    analyzeImage: analyzeImageTool,
    detectObjects: detectObjectsTool,
  },
});
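
Because the image travels by URL through the tool input, you can exercise the agent directly. A hedged sketch (the URL is a placeholder; the model extracts it into the tool's imageUrl input):

// The model decides which vision tool to call based on the prompt
const res = await inventoryVisionAgent.generate(
  'How many boxes are on the shelf in https://example.com/shelf-42.jpg? Use object detection.'
);
console.log(res.text);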

Step 4: Vision Processing Service

// lib/services/inventory-vision-service.ts
import { vision } from '@/lib/vision';
import { inventoryVisionAgent } from '@/mastra/agents/inventory-vision-agent';
import { db } from '@/lib/db';
import sharp from 'sharp';

export type ShelfAnalysis = {
  shelfId: string;
  products: ProductDetection[];
  lowStock: string[];
  needsAttention: boolean;
  timestamp: Date;
};

export type ProductDetection = {
  name: string;
  count: number;
  confidence: number;
  locations: Array<{ x_min: number; y_min: number; x_max: number; y_max: number }>;
};

export class InventoryVisionService {
  /**
   * Process a single shelf image
   */
  async analyzeShelf(imageBuffer: Buffer, shelfId: string): Promise<ShelfAnalysis> {
    // Optimize image for vision processing
    const optimizedImage = await this.optimizeImage(imageBuffer);

    // Step 1: Identify all product types. Note: we query the vision model
    // directly here; the agent can't see the buffer unless it's passed
    // through a tool, so a plain agent prompt would have no image to analyze.
    const identification = await vision.query(
      optimizedImage,
      'Identify all product types visible on this warehouse shelf. Provide a comma-separated list of product names.'
    );

    const productNames = this.parseProductList(identification.answer);

    // Step 2: Detect and count each product
    const productDetections: ProductDetection[] = [];
    
    for (const productName of productNames) {
      const detectResult = await vision.detect(optimizedImage, productName);
      
      productDetections.push({
        name: productName,
        count: detectResult.objects?.length || 0,
        confidence: 0.9, // Moondream doesn't return confidence, using default
        locations: detectResult.objects || [],
      });
    }

    // Step 3: Identify low stock items
    const lowStock = productDetections
      .filter(p => p.count < 3)
      .map(p => p.name);

    // Step 4: Check for quality issues
    const qualityCheck = await vision.query(
      optimizedImage,
      'Are there any damaged boxes, items on the floor, or misplaced products in this image? Answer yes or no and explain.'
    );

    const needsAttention = 
      lowStock.length > 0 || 
      qualityCheck.answer.toLowerCase().includes('yes');

    // Step 5: If attention is needed, get agent recommendations
    if (needsAttention) {
      const recommendations = await inventoryVisionAgent.generate(`
        Shelf ${shelfId} analysis results:
        
        Products detected: ${productDetections.map(p => `${p.name} (${p.count})`).join(', ')}
        Low stock items: ${lowStock.join(', ') || 'None'}
        Quality issues: ${qualityCheck.answer}
        
        Provide actionable recommendations and determine if immediate action is needed.
      `);

      // Surface the recommendations, e.g. log them or attach them to the restock task
      console.log(`Shelf ${shelfId} recommendations:`, recommendations.text);
    }

    // Step 6: Store in database
    await this.saveAnalysis({
      shelfId,
      products: productDetections,
      lowStock,
      needsAttention,
      timestamp: new Date(),
    });

    return {
      shelfId,
      products: productDetections,
      lowStock,
      needsAttention,
      timestamp: new Date(),
    };
  }

  /**
   * Optimize image for vision processing
   */
  private async optimizeImage(buffer: Buffer): Promise<Buffer> {
    return sharp(buffer)
      .resize(1024, 1024, { fit: 'inside', withoutEnlargement: true })
      .jpeg({ quality: 85 })
      .toBuffer();
  }

  /**
   * Parse a comma-separated product list from the vision model's response
   */
  private parseProductList(text: string): string[] {
    // The model was asked for a comma-separated list, so split on commas
    // and strip whitespace and trailing punctuation
    return text
      .split(',')
      .map(p => p.trim().replace(/\.$/, ''))
      .filter(p => p.length > 0);
  }

  /**
   * Save analysis to database
   */
  private async saveAnalysis(analysis: ShelfAnalysis) {
    await db.shelfAnalysis.create({
      data: {
        shelfId: analysis.shelfId,
        products: JSON.stringify(analysis.products),
        lowStock: analysis.lowStock,
        needsAttention: analysis.needsAttention,
        analyzedAt: analysis.timestamp,
      },
    });

    // If low stock, create restock task
    if (analysis.lowStock.length > 0) {
      await db.restockTask.createMany({
        data: analysis.lowStock.map(product => ({
          shelfId: analysis.shelfId,
          productName: product,
          priority: 'high',
          status: 'pending',
          createdAt: new Date(),
        })),
      });
    }
  }

  /**
   * Process multiple shelves in parallel
   */
  async analyzeWarehouse(shelfImages: Array<{ shelfId: string; image: Buffer }>) {
    // Note: Promise.all fires every analysis at once; for large warehouses,
    // batch or queue these calls to respect vision API rate limits
    const results = await Promise.all(
      shelfImages.map(({ shelfId, image }) => 
        this.analyzeShelf(image, shelfId)
      )
    );

    const summary = {
      totalShelves: results.length,
      shelvesNeedingAttention: results.filter(r => r.needsAttention).length,
      totalLowStockItems: results.reduce((sum, r) => sum + r.lowStock.length, 0),
      criticalShelves: results
        .filter(r => r.lowStock.length >= 3)
        .map(r => r.shelfId),
    };

    return { results, summary };
  }
}

Step 5: API Routes

// app/api/inventory/analyze-shelf/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { InventoryVisionService } from '@/lib/services/inventory-vision-service';

const service = new InventoryVisionService();

export async function POST(request: NextRequest) {
  try {
    const formData = await request.formData();
    const shelfId = formData.get('shelfId') as string;
    const image = formData.get('image') as File;

    if (!shelfId || !image) {
      return NextResponse.json(
        { error: 'Missing shelfId or image' },
        { status: 400 }
      );
    }

    const imageBuffer = Buffer.from(await image.arrayBuffer());
    const analysis = await service.analyzeShelf(imageBuffer, shelfId);

    return NextResponse.json({
      success: true,
      analysis,
    });
  } catch (error) {
    console.error('Shelf analysis failed:', error);
    return NextResponse.json(
      { error: 'Analysis failed' },
      { status: 500 }
    );
  }
}
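
A hypothetical client-side call to this route, uploading a snapshot as multipart form data (the shelf ID and file source are placeholders):

// Anywhere in your frontend; `file` comes from an <input type="file">
const form = new FormData();
form.append('shelfId', 'shelf-a1');
form.append('image', file);

const res = await fetch('/api/inventory/analyze-shelf', {
  method: 'POST',
  body: form,
});
const { analysis } = await res.json();

For scheduled monitoring, a second route fetches camera snapshots on a timer: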
// app/api/cron/monitor-warehouse/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { InventoryVisionService } from '@/lib/services/inventory-vision-service';
import { fetchWarehouseCameraImages } from '@/lib/warehouse-integration';

export async function GET(request: NextRequest) {
  // Verify Vercel Cron auth
  const authHeader = request.headers.get('authorization');
  if (authHeader !== `Bearer ${process.env.CRON_SECRET}`) {
    return NextResponse.json({ error: 'Unauthorized' }, { status: 401 });
  }

  try {
    const service = new InventoryVisionService();
    
    // Fetch latest images from warehouse cameras
    const shelfImages = await fetchWarehouseCameraImages();
    
    // Analyze entire warehouse
    const { results, summary } = await service.analyzeWarehouse(shelfImages);

    console.log('Warehouse analysis complete:', summary);

    return NextResponse.json({
      success: true,
      summary,
      timestamp: new Date().toISOString(),
    });
  } catch (error) {
    console.error('Warehouse monitoring failed:', error);
    return NextResponse.json(
      { error: 'Monitoring failed' },
      { status: 500 }
    );
  }
}
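
The route imports fetchWarehouseCameraImages, which the chapter leaves to your camera setup. A minimal sketch, assuming each camera exposes a snapshot URL (the module and endpoints are assumptions):

// lib/warehouse-integration.ts (sketch)
const CAMERAS: Array<{ shelfId: string; snapshotUrl: string }> = [
  { shelfId: 'shelf-a1', snapshotUrl: 'http://cam-a1.warehouse.local/snapshot.jpg' },
  { shelfId: 'shelf-a2', snapshotUrl: 'http://cam-a2.warehouse.local/snapshot.jpg' },
];

export async function fetchWarehouseCameraImages() {
  return Promise.all(
    CAMERAS.map(async ({ shelfId, snapshotUrl }) => {
      const response = await fetch(snapshotUrl);
      return { shelfId, image: Buffer.from(await response.arrayBuffer()) };
    })
  );
}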

Step 6: Schedule Regular Monitoring

Wire the monitoring route to Vercel Cron. The schedule 0 */4 * * * runs it at the top of every fourth hour.

// vercel.json
{
  "crons": [
    {
      "path": "/api/cron/monitor-warehouse",
      "schedule": "0 */4 * * *"
    }
  ]
}

Advanced Pattern: Multi-Stage Vision Pipeline

For complex use cases, combine multiple vision skills in sequence.

// lib/services/product-verification-service.ts
import { vision } from '@/lib/vision';
import { productVerificationAgent } from '@/mastra/agents/product-verification';

export class ProductVerificationService {
  async verifyProduct(imageBuffer: Buffer, expectedProduct: {
    name: string;
    color: string;
    brand: string;
  }) {
    // Stage 1: Caption - General understanding
    const caption = await vision.caption(imageBuffer);
    
    // Stage 2: Query - Specific verification
    const brandCheck = await vision.query(
      imageBuffer,
      `Is this product from the brand "${expectedProduct.brand}"? Answer yes or no.`
    );
    
    const colorCheck = await vision.query(
      imageBuffer,
      `What is the primary color of this product?`
    );
    
    // Stage 3: Detect - Find product location
    const detection = await vision.detect(imageBuffer, expectedProduct.name);
    
    // Stage 4: Agent analysis
    const verification = await productVerificationAgent.generate(`
      Product verification requested:
      Expected: ${JSON.stringify(expectedProduct)}
      
      Vision analysis:
      - General description: ${caption.caption}
      - Brand match: ${brandCheck.answer}
      - Detected color: ${colorCheck.answer}
      - Product detected: ${detection.objects?.length || 0} instances
      
      Verify if this product matches expectations. Provide:
      1. PASS/FAIL verdict
      2. Confidence level (high/medium/low)
      3. Specific issues found (if any)
      4. Recommended action
    `);

    return {
      verdict: verification.text.includes('PASS') ? 'PASS' : 'FAIL',
      caption: caption.caption,
      brandMatch: brandCheck.answer.toLowerCase().includes('yes'),
      detectedColor: colorCheck.answer,
      productCount: detection.objects?.length || 0,
      agentAnalysis: verification.text,
    };
  }
}
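
The productVerificationAgent isn't defined in this chapter; a plausible sketch, mirroring the inventory agent from Step 3 (the instructions are an assumption; adapt them to your catalog):

// mastra/agents/product-verification.ts (sketch)
import { Agent } from '@mastra/core';
import { openai } from '@mastra/openai';

export const productVerificationAgent = new Agent({
  name: 'Product Verification Agent',
  instructions: `You verify that product photos match their listings.
Given vision analysis results, return a PASS/FAIL verdict, a confidence
level (high/medium/low), specific issues found, and a recommended action.`,
  model: { provider: openai, name: 'gpt-4o' },
});

The securityAgent and documentAgent used in the variations below follow the same shape.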

Pattern Variations

Variation 1: Real-Time Security Monitoring

// lib/services/security-vision-service.ts
import { vision } from '@/lib/vision';
import { securityAgent } from '@/mastra/agents/security';

export class SecurityVisionService {
  // Last scene description per camera (not used in this minimal example)
  private sceneHistory: Map<string, string> = new Map();

  async monitorSecurityFeed(
    cameraId: string,
    imageBuffer: Buffer
  ): Promise<SecurityAlert | null> {
    // Detect people
    const peopleDetection = await vision.detect(imageBuffer, 'person');
    const peopleCount = peopleDetection.objects?.length || 0;

    if (peopleCount > 0) {
      // Get detailed analysis
      const analysis = await vision.query(
        imageBuffer,
        `Describe the ${peopleCount} person(s) in this image: What are they doing? Are they wearing any identifiable clothing or carrying anything?`
      );

      // Check against authorized persons database
      const alert = await securityAgent.generate(`
        SECURITY ALERT - Camera ${cameraId}
        Time: ${new Date().toISOString()}
        People detected: ${peopleCount}
        Analysis: ${analysis.answer}
        
        Assess threat level (LOW/MEDIUM/HIGH/CRITICAL) and recommend action.
      `);

      if (alert.text.includes('HIGH') || alert.text.includes('CRITICAL')) {
        return {
          cameraId,
          threatLevel: alert.text.includes('CRITICAL') ? 'CRITICAL' : 'HIGH',
          peopleCount,
          description: analysis.answer,
          recommendation: alert.text,
          timestamp: new Date(),
        };
      }
    }

    return null;
  }
}

type SecurityAlert = {
  cameraId: string;
  threatLevel: 'LOW' | 'MEDIUM' | 'HIGH' | 'CRITICAL';
  peopleCount: number;
  description: string;
  recommendation: string;
  timestamp: Date;
};
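
Wiring this into a feed is a polling loop. A sketch, assuming a camera snapshot URL and a notifier of your choice (both hypothetical):

const security = new SecurityVisionService();

setInterval(async () => {
  const snapshot = await fetch('http://cam-lobby.local/snapshot.jpg'); // hypothetical camera
  const alert = await security.monitorSecurityFeed(
    'lobby',
    Buffer.from(await snapshot.arrayBuffer())
  );
  if (alert) {
    await notifySecurityTeam(alert); // hypothetical notifier (Slack, PagerDuty, etc.)
  }
}, 30_000); // poll every 30 seconds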

Variation 2: Document Intelligence Agent

// lib/services/document-intelligence-service.ts
import { vision } from '@/lib/vision';
import { documentAgent } from '@/mastra/agents/document';

export class DocumentIntelligenceService {
  async processDocument(imageBuffer: Buffer, documentType: 'receipt' | 'invoice' | 'id' | 'form') {
    // Get document type-specific fields
    const fields = this.getFieldsForDocumentType(documentType);
    
    // Extract all fields in parallel
    const extractions = await Promise.all(
      fields.map(async (field) => ({
        field,
        value: (await vision.query(imageBuffer, field.question)).answer,
      }))
    );

    // Let agent structure and validate the data
    const structuredData = await documentAgent.generate(`
      Document type: ${documentType}
      
      Extracted fields:
      ${extractions.map(e => `- ${e.field.name}: ${e.value}`).join('\n')}
      
      1. Structure this data as JSON
      2. Validate all fields are present and reasonable
      3. Flag any issues or missing data
      4. Calculate confidence score (0-100)
    `);

    return {
      documentType,
      rawExtractions: extractions,
      structuredData: structuredData.text,
      timestamp: new Date(),
    };
  }

  private getFieldsForDocumentType(type: 'receipt' | 'invoice' | 'id' | 'form') {
    const fieldMaps: Record<string, Array<{ name: string; question: string }>> = {
      receipt: [
        { name: 'merchant', question: 'What is the merchant name?' },
        { name: 'total', question: 'What is the total amount?' },
        { name: 'date', question: 'What is the date?' },
        { name: 'items', question: 'List all items purchased' },
      ],
      invoice: [
        { name: 'invoiceNumber', question: 'What is the invoice number?' },
        { name: 'vendor', question: 'Who is the vendor?' },
        { name: 'amount', question: 'What is the total amount due?' },
        { name: 'dueDate', question: 'What is the payment due date?' },
      ],
      id: [
        { name: 'name', question: 'What is the name on this ID?' },
        { name: 'idNumber', question: 'What is the ID number?' },
        { name: 'dateOfBirth', question: 'What is the date of birth?' },
        { name: 'expiryDate', question: 'What is the expiry date?' },
      ],
      // 'form' is accepted by processDocument but has no field map yet;
      // add questions for the forms you expect
    };

    return fieldMaps[type] ?? [];
  }
}
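
Using the service on a scanned receipt (the file path is a placeholder):

import { readFileSync } from 'fs';

const docs = new DocumentIntelligenceService();
const result = await docs.processDocument(readFileSync('./receipt.jpg'), 'receipt');
console.log(result.structuredData); // agent-structured JSON plus validation notes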

Production Considerations

1. Cost Management

Vision API calls can add up quickly. Implement smart caching:

import { Redis } from '@upstash/redis';

const redis = Redis.fromEnv();

async function cachedVisionQuery(
  imageHash: string,
  query: string,
  fn: () => Promise<any>
) {
  const cacheKey = `vision:${imageHash}:${query}`;
  const cached = await redis.get(cacheKey);
  
  if (cached) {
    // @upstash/redis deserializes JSON values automatically
    return cached;
  }

  const result = await fn();
  
  // Cache for 24 hours
  await redis.setex(cacheKey, 86400, JSON.stringify(result));
  
  return result;
}
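
The imageHash parameter is left to the caller; a SHA-256 of the image bytes is a safe choice:

import { createHash } from 'crypto';

const imageHash = createHash('sha256').update(imageBuffer).digest('hex');
const question = 'How many boxes are visible?';

const answer = await cachedVisionQuery(imageHash, question, () =>
  vision.query(imageBuffer, question)
);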

2. Image Preprocessing

Always optimize images before sending to vision API:

async function preprocessImage(buffer: Buffer): Promise<Buffer> {
  return sharp(buffer)
    .resize(1024, 1024, { fit: 'inside' })
    .rotate() // Auto-rotate based on EXIF
    .normalize() // Improve contrast
    .jpeg({ quality: 85 })
    .toBuffer();
}

3. Error Handling

Vision APIs can fail—handle gracefully:

async function robustVisionQuery(
  imageBuffer: Buffer,
  query: string,
  maxRetries = 3
): Promise<string> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      const result = await vision.query(imageBuffer, query);
      return result.answer;
    } catch (error) {
      if (i === maxRetries - 1) {
        console.error('Vision query failed after retries:', error);
        return 'Unable to analyze image';
      }
      
      // Exponential backoff
      await new Promise(resolve => setTimeout(resolve, 1000 * Math.pow(2, i)));
    }
  }
  
  return 'Unable to analyze image';
}

4. Monitoring & Observability

Track vision agent performance:

import { logger } from '@/lib/logger';

async function monitoredVisionAnalysis(
  imageBuffer: Buffer,
  context: Record<string, any>
) {
  const start = Date.now();
  
  try {
    const result = await vision.query(imageBuffer, context.query);
    
    logger.info('Vision analysis completed', {
      duration: Date.now() - start,
      imageSize: imageBuffer.length,
      context,
      success: true,
    });
    
    return result;
  } catch (error) {
    logger.error('Vision analysis failed', {
      duration: Date.now() - start,
      imageSize: imageBuffer.length,
      context,
      error,
    });
    
    throw error;
  }
}

Key Takeaways

  1. Vision unlocks new agent capabilities - Security, inventory, document processing, quality control
  2. Moondream is agent-friendly - Lightweight, fast, affordable, TypeScript-native
  3. Multi-stage pipelines - Combine caption, query, detect for robust analysis
  4. Production requires optimization - Cache results, preprocess images, handle failures
  5. Vision + Other triggers - Combine with scheduled, event-based triggers for powerful workflows
  6. Vision is a tool - Integrate vision capabilities into your agent's tool set for maximum flexibility

What's Next

You now understand how to build vision-powered agents. Continue building your expertise:

  • Chapter 14: Prompting Mastery - Learn how to craft effective prompts that work with vision results
  • Chapter 15: Tool Design & Function Calling - Design vision tools that integrate seamlessly with your agent architecture
  • Chapter 20: Advanced Agentic UI Patterns with Cedar - Build sophisticated UIs for vision agents with image upload, preview, and annotation
  • Chapter 21: Multi-Agent Systems - Coordinate vision agents with other specialized agents
  • Chapter 24: Scaling to Production - High-volume image processing infrastructure and cost optimization
