Pattern Overview
💡 Prerequisites: This chapter builds on Chapter 7: Triggers, where vision triggers were introduced. You should understand the five trigger types before diving into this complete pattern implementation.
Vision-based agents extend traditional text-based AI systems with visual understanding capabilities. By integrating computer vision models like Moondream, agents can process images, detect objects, extract information from documents, and respond to visual events in real time.
This pattern is particularly powerful because it unlocks entirely new categories of automation that were previously impossible with text-only agents.
Why This Pattern Matters
Most agentic systems today are blind. They can read, write, and reason—but they can't see. This creates massive blind spots:
- E-commerce platforms can't automatically verify product photos match descriptions
- Security systems can't intelligently respond to visual threats
- Healthcare apps can't monitor patients visually
- Inventory systems can't track stock from camera feeds
- Document processing requires manual data entry
Vision agents solve this. They give your agents eyes.
Architecture Pattern
┌────────────────────────────────────────────────────────┐
│               VISION AGENT ARCHITECTURE                │
│                                                        │
│  INPUT LAYER                                           │
│  ┌──────────────────────────────────────────────────┐  │
│  │ • Image Upload (User)                            │  │
│  │ • Camera Feed (IoT/Security)                     │  │
│  │ • Screenshot (Monitoring)                        │  │
│  │ • Document Scan (Mobile)                         │  │
│  └──────────────────────────────────────────────────┘  │
│                            ↓                           │
│  VISION LAYER (Moondream)                              │
│  ┌──────────────────────────────────────────────────┐  │
│  │ SKILLS:                                          │  │
│  │ • Query: "What's in this image?"                 │  │
│  │ • Detect: Find objects + bounding boxes          │  │
│  │ • Point: Locate coordinates of elements          │  │
│  │ • Caption: Natural language descriptions         │  │
│  └──────────────────────────────────────────────────┘  │
│                            ↓                           │
│  AGENT LAYER (Mastra)                                  │
│  ┌──────────────────────────────────────────────────┐  │
│  │ • Analyze vision results                         │  │
│  │ • Make decisions based on visual data            │  │
│  │ • Trigger workflows                              │  │
│  │ • Call additional tools                          │  │
│  └──────────────────────────────────────────────────┘  │
│                            ↓                           │
│  ACTION LAYER                                          │
│  ┌──────────────────────────────────────────────────┐  │
│  │ • Send notifications                             │  │
│  │ • Update databases                               │  │
│  │ • Trigger other agents                           │  │
│  │ • Generate reports                               │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘
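Before looking at when to use the pattern, here is how the four layers line up in code. This is a minimal sketch that uses the vision client and agent built in the steps later in this chapter; handleShelfImage and notifyOps are illustrative names, not library APIs.

// Illustrative end-to-end flow for the four layers above.
// handleShelfImage and notifyOps are placeholder names, not library APIs;
// the vision client and inventoryVisionAgent are defined later in this chapter.
import { vision } from '@/lib/vision';
import { inventoryVisionAgent } from '@/mastra/agents/inventory-vision-agent';

async function handleShelfImage(imageBuffer: Buffer) {
  // VISION LAYER: turn pixels into structured facts
  const caption = await vision.caption(imageBuffer);
  const boxes = await vision.detect(imageBuffer, 'box');

  // AGENT LAYER: reason over those facts
  const decision = await inventoryVisionAgent.generate(
    `Shelf description: ${caption.caption}. Boxes detected: ${boxes.objects?.length || 0}. Should we restock?`
  );

  // ACTION LAYER: act on the decision
  if (decision.text.toLowerCase().includes('restock')) {
    await notifyOps(decision.text);
  }
}

// Placeholder action; wire this to email, Slack, or a task queue in practice
async function notifyOps(message: string) {
  console.log('[ops notification]', message);
}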
When to Use This Pattern
✅ Perfect for:
- Real-time monitoring (security, quality control)
- Document processing (receipts, IDs, forms)
- Content moderation
- Inventory management
- Accessibility features
- Visual search and discovery
❌ Not ideal for:
- Pure text-based workflows
- High-frequency video processing (cost)
- Real-time video streaming (latency)
- Tasks requiring human-level visual judgment
Complete Example: Smart Inventory Agent
Let's build a production-ready vision agent that monitors warehouse shelves, detects low stock, and automatically triggers restock workflows.
Step 1: Setup
# Install dependencies
npm install moondream sharp mastra

// lib/vision.ts
import moondream from 'moondream';

export const vision = moondream.vl({
  apiKey: process.env.MOONDREAM_API_KEY!,
});

export type VisionQuery = {
  image: Buffer;
  question: string;
};

export type VisionDetection = {
  image: Buffer;
  object: string;
};
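The client reads its key from the environment. A minimal .env.local entry (the variable name matches what lib/vision.ts expects; the value is a placeholder):

# .env.local
MOONDREAM_API_KEY=your-moondream-api-key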
Step 2: Create Vision Tool for Mastra
// mastra/tools/vision-tools.ts
import { createTool } from '@mastra/core';
import { z } from 'zod';
import { vision } from '@/lib/vision';

export const analyzeImageTool = createTool({
  id: 'analyze-image',
  description: 'Analyze an image and answer questions about its contents',
  inputSchema: z.object({
    imageUrl: z.string().describe('URL or base64 encoded image'),
    question: z.string().describe('Question to ask about the image'),
  }),
  outputSchema: z.object({
    answer: z.string(),
    requestId: z.string(),
  }),
  execute: async ({ input }) => {
    // Fetch image if URL, or decode if base64
    const imageBuffer = await fetchImageBuffer(input.imageUrl);
    const result = await vision.query(imageBuffer, input.question);

    return {
      answer: result.answer,
      requestId: result.request_id,
    };
  },
});

export const detectObjectsTool = createTool({
  id: 'detect-objects',
  description: 'Detect specific objects in an image and get their bounding boxes',
  inputSchema: z.object({
    imageUrl: z.string(),
    objectType: z.string().describe('Type of object to detect (e.g., "person", "box", "pallet")'),
  }),
  outputSchema: z.object({
    objects: z.array(z.object({
      x_min: z.number(),
      y_min: z.number(),
      x_max: z.number(),
      y_max: z.number(),
    })),
    count: z.number(),
  }),
  execute: async ({ input }) => {
    const imageBuffer = await fetchImageBuffer(input.imageUrl);
    const result = await vision.detect(imageBuffer, input.objectType);

    return {
      objects: result.objects || [],
      count: result.objects?.length || 0,
    };
  },
});

async function fetchImageBuffer(urlOrBase64: string): Promise<Buffer> {
  if (urlOrBase64.startsWith('data:')) {
    const base64Data = urlOrBase64.split(',')[1];
    return Buffer.from(base64Data, 'base64');
  }

  const response = await fetch(urlOrBase64);
  return Buffer.from(await response.arrayBuffer());
}
Step 3: Create Inventory Vision Agent
// mastra/agents/inventory-vision-agent.ts
import { Agent } from '@mastra/core';
import { openai } from '@mastra/openai';
import { analyzeImageTool, detectObjectsTool } from '../tools/vision-tools';

export const inventoryVisionAgent = new Agent({
  name: 'Inventory Vision Agent',
  instructions: `You are a warehouse inventory monitoring agent with computer vision capabilities.

Your responsibilities:
1. Analyze shelf images to identify products
2. Count items using object detection
3. Detect low stock situations (< 3 items)
4. Identify misplaced or damaged products
5. Trigger restock workflows when needed
6. Generate inventory reports

When analyzing images:
- Be thorough and accurate
- Use the detect-objects tool for precise counting
- Use the analyze-image tool for qualitative assessment
- Report exact counts and locations
- Flag any anomalies (damaged boxes, wrong placement, etc.)

Always provide actionable recommendations.`,
  model: {
    provider: openai,
    name: 'gpt-4o',
    toolChoice: 'auto',
  },
  tools: {
    analyzeImage: analyzeImageTool,
    detectObjects: detectObjectsTool,
  },
});
Step 4: Vision Processing Service
// lib/services/inventory-vision-service.ts
import { vision } from '@/lib/vision';
import { inventoryVisionAgent } from '@/mastra/agents/inventory-vision-agent';
import { db } from '@/lib/db';
import sharp from 'sharp';

export type ShelfAnalysis = {
  shelfId: string;
  products: ProductDetection[];
  lowStock: string[];
  needsAttention: boolean;
  timestamp: Date;
};

export type ProductDetection = {
  name: string;
  count: number;
  confidence: number;
  locations: Array<{ x_min: number; y_min: number; x_max: number; y_max: number }>;
};

export class InventoryVisionService {
  /**
   * Process a single shelf image
   */
  async analyzeShelf(imageBuffer: Buffer, shelfId: string): Promise<ShelfAnalysis> {
    // Optimize image for vision processing
    const optimizedImage = await this.optimizeImage(imageBuffer);

    // Step 1: Identify all products on the shelf.
    // Query the vision model directly so the model actually sees the image;
    // the agent reasons over the structured results in Step 5.
    const identificationResult = await vision.query(
      optimizedImage,
      'Identify all product types visible on this warehouse shelf. Provide a comma-separated list of product names.'
    );

    const productNames = this.parseProductList(identificationResult.answer);

    // Step 2: Detect and count each product
    const productDetections: ProductDetection[] = [];

    for (const productName of productNames) {
      const detectResult = await vision.detect(optimizedImage, productName);

      productDetections.push({
        name: productName,
        count: detectResult.objects?.length || 0,
        confidence: 0.9, // Moondream doesn't return confidence, using default
        locations: detectResult.objects || [],
      });
    }

    // Step 3: Identify low stock items
    const lowStock = productDetections
      .filter(p => p.count < 3)
      .map(p => p.name);

    // Step 4: Check for quality issues
    const qualityCheck = await vision.query(
      optimizedImage,
      'Are there any damaged boxes, items on the floor, or misplaced products in this image? Answer yes or no and explain.'
    );

    const needsAttention =
      lowStock.length > 0 ||
      qualityCheck.answer.toLowerCase().includes('yes');

    // Step 5: If attention needed, get agent recommendations
    if (needsAttention) {
      // The agent's recommendations could be routed to a notification or ticketing workflow here
      await inventoryVisionAgent.generate(`
        Shelf ${shelfId} analysis results:

        Products detected: ${productDetections.map(p => `${p.name} (${p.count})`).join(', ')}
        Low stock items: ${lowStock.join(', ') || 'None'}
        Quality issues: ${qualityCheck.answer}

        Provide actionable recommendations and determine if immediate action is needed.
      `);
    }

    // Step 6: Store in database
    await this.saveAnalysis({
      shelfId,
      products: productDetections,
      lowStock,
      needsAttention,
      timestamp: new Date(),
    });

    return {
      shelfId,
      products: productDetections,
      lowStock,
      needsAttention,
      timestamp: new Date(),
    };
  }

  /**
   * Optimize image for vision processing
   */
  private async optimizeImage(buffer: Buffer): Promise<Buffer> {
    return sharp(buffer)
      .resize(1024, 1024, { fit: 'inside', withoutEnlargement: true })
      .jpeg({ quality: 85 })
      .toBuffer();
  }

  /**
   * Parse a comma-separated product list from a model response
   */
  private parseProductList(text: string): string[] {
    // Extract product names from the response
    const match = text.match(/\b([a-zA-Z0-9\s,]+)\b/);
    if (!match) return [];

    return match[1]
      .split(',')
      .map(p => p.trim())
      .filter(p => p.length > 0);
  }

  /**
   * Save analysis to database
   */
  private async saveAnalysis(analysis: ShelfAnalysis) {
    await db.shelfAnalysis.create({
      data: {
        shelfId: analysis.shelfId,
        products: JSON.stringify(analysis.products),
        lowStock: analysis.lowStock,
        needsAttention: analysis.needsAttention,
        analyzedAt: analysis.timestamp,
      },
    });

    // If low stock, create restock task
    if (analysis.lowStock.length > 0) {
      await db.restockTask.createMany({
        data: analysis.lowStock.map(product => ({
          shelfId: analysis.shelfId,
          productName: product,
          priority: 'high',
          status: 'pending',
          createdAt: new Date(),
        })),
      });
    }
  }

  /**
   * Process multiple shelves in parallel
   */
  async analyzeWarehouse(shelfImages: Array<{ shelfId: string; image: Buffer }>) {
    const results = await Promise.all(
      shelfImages.map(({ shelfId, image }) =>
        this.analyzeShelf(image, shelfId)
      )
    );

    const summary = {
      totalShelves: results.length,
      shelvesNeedingAttention: results.filter(r => r.needsAttention).length,
      totalLowStockItems: results.reduce((sum, r) => sum + r.lowStock.length, 0),
      criticalShelves: results
        .filter(r => r.lowStock.length >= 3)
        .map(r => r.shelfId),
    };

    return { results, summary };
  }
}
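saveAnalysis assumes a db client (imported from @/lib/db) that exposes shelfAnalysis and restockTask models; a Prisma client with matching models would fit. The chapter does not define that schema, but the shape the calls above rely on looks roughly like this, with field types inferred from the code:

// lib/db.ts (assumed shape only; generate the real client from your ORM schema)
export interface InventoryDb {
  shelfAnalysis: {
    create(args: {
      data: {
        shelfId: string;
        products: string;        // JSON-stringified ProductDetection[]
        lowStock: string[];
        needsAttention: boolean;
        analyzedAt: Date;
      };
    }): Promise<unknown>;
  };
  restockTask: {
    createMany(args: {
      data: Array<{
        shelfId: string;
        productName: string;
        priority: string;
        status: string;
        createdAt: Date;
      }>;
    }): Promise<unknown>;
  };
}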
Step 5: API Routes
// app/api/inventory/analyze-shelf/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { InventoryVisionService } from '@/lib/services/inventory-vision-service';

const service = new InventoryVisionService();

export async function POST(request: NextRequest) {
  try {
    const formData = await request.formData();
    const shelfId = formData.get('shelfId') as string;
    const image = formData.get('image') as File;

    if (!shelfId || !image) {
      return NextResponse.json(
        { error: 'Missing shelfId or image' },
        { status: 400 }
      );
    }

    const imageBuffer = Buffer.from(await image.arrayBuffer());
    const analysis = await service.analyzeShelf(imageBuffer, shelfId);

    return NextResponse.json({
      success: true,
      analysis,
    });
  } catch (error) {
    console.error('Shelf analysis failed:', error);
    return NextResponse.json(
      { error: 'Analysis failed' },
      { status: 500 }
    );
  }
}
// app/api/cron/monitor-warehouse/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { InventoryVisionService } from '@/lib/services/inventory-vision-service';
import { fetchWarehouseCameraImages } from '@/lib/warehouse-integration';

export async function GET(request: NextRequest) {
  // Verify Vercel Cron auth
  const authHeader = request.headers.get('authorization');
  if (authHeader !== `Bearer ${process.env.CRON_SECRET}`) {
    return NextResponse.json({ error: 'Unauthorized' }, { status: 401 });
  }

  try {
    const service = new InventoryVisionService();

    // Fetch latest images from warehouse cameras
    const shelfImages = await fetchWarehouseCameraImages();

    // Analyze entire warehouse
    const { results, summary } = await service.analyzeWarehouse(shelfImages);

    console.log('Warehouse analysis complete:', summary);

    return NextResponse.json({
      success: true,
      summary,
      timestamp: new Date().toISOString(),
    });
  } catch (error) {
    console.error('Warehouse monitoring failed:', error);
    return NextResponse.json(
      { error: 'Monitoring failed' },
      { status: 500 }
    );
  }
}
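The route imports fetchWarehouseCameraImages from a warehouse integration module the chapter does not show. A minimal sketch, assuming each shelf camera exposes a snapshot URL; the SHELF_CAMERAS map and its URLs are placeholders for your own camera registry:

// lib/warehouse-integration.ts (sketch; camera IDs and URLs are placeholders)
const SHELF_CAMERAS: Record<string, string> = {
  'shelf-a1': 'https://cameras.example.internal/shelf-a1/snapshot.jpg',
  'shelf-a2': 'https://cameras.example.internal/shelf-a2/snapshot.jpg',
};

export async function fetchWarehouseCameraImages(): Promise<
  Array<{ shelfId: string; image: Buffer }>
> {
  return Promise.all(
    Object.entries(SHELF_CAMERAS).map(async ([shelfId, url]) => {
      // Pull the latest snapshot for each shelf camera
      const response = await fetch(url);
      return { shelfId, image: Buffer.from(await response.arrayBuffer()) };
    })
  );
}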
Step 6: Schedule Regular Monitoring
// vercel.json
{
  "crons": [
    {
      "path": "/api/cron/monitor-warehouse",
      "schedule": "0 */4 * * *"
    }
  ]
}
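The "0 */4 * * *" expression runs the monitoring route at minute 0 of every fourth hour. You can tighten it for fast-moving stock, but remember that each run makes several vision calls per shelf, so shorter intervals increase cost roughly in proportion.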
Advanced Pattern: Multi-Stage Vision Pipeline
For complex use cases, combine multiple vision skills in sequence.
// lib/services/product-verification-service.ts
import { vision } from '@/lib/vision';
import { productVerificationAgent } from '@/mastra/agents/product-verification';

export class ProductVerificationService {
  async verifyProduct(imageBuffer: Buffer, expectedProduct: {
    name: string;
    color: string;
    brand: string;
  }) {
    // Stage 1: Caption - General understanding
    const caption = await vision.caption(imageBuffer);

    // Stage 2: Query - Specific verification
    const brandCheck = await vision.query(
      imageBuffer,
      `Is this product from the brand "${expectedProduct.brand}"? Answer yes or no.`
    );

    const colorCheck = await vision.query(
      imageBuffer,
      `What is the primary color of this product?`
    );

    // Stage 3: Detect - Find product location
    const detection = await vision.detect(imageBuffer, expectedProduct.name);

    // Stage 4: Agent analysis
    const verification = await productVerificationAgent.generate(`
      Product verification requested:

      Expected: ${JSON.stringify(expectedProduct)}

      Vision analysis:
      - General description: ${caption.caption}
      - Brand match: ${brandCheck.answer}
      - Detected color: ${colorCheck.answer}
      - Product detected: ${detection.objects?.length || 0} instances

      Verify if this product matches expectations. Provide:
      1. PASS/FAIL verdict
      2. Confidence level (high/medium/low)
      3. Specific issues found (if any)
      4. Recommended action
    `);

    return {
      verdict: verification.text.includes('PASS') ? 'PASS' : 'FAIL',
      caption: caption.caption,
      brandMatch: brandCheck.answer.toLowerCase().includes('yes'),
      detectedColor: colorCheck.answer,
      productCount: detection.objects?.length || 0,
      agentAnalysis: verification.text,
    };
  }
}
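The service imports productVerificationAgent, which is assumed to exist alongside the inventory agent from Step 3. A minimal definition sketch mirroring that agent (instructions abbreviated, model config copied from Step 3):

// mastra/agents/product-verification.ts (sketch mirroring the Step 3 agent)
import { Agent } from '@mastra/core';
import { openai } from '@mastra/openai';

export const productVerificationAgent = new Agent({
  name: 'Product Verification Agent',
  instructions: `You verify that product photos match their expected listing.
Given vision analysis results, return a PASS/FAIL verdict, a confidence level
(high/medium/low), specific issues found, and a recommended action.`,
  model: {
    provider: openai,
    name: 'gpt-4o',
  },
});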
Pattern Variations
Variation 1: Real-Time Security Monitoring
// lib/services/security-vision-service.ts
import { vision } from '@/lib/vision';
import { securityAgent } from '@/mastra/agents/security';

export class SecurityVisionService {
  // Reserved for comparing the current frame against previous scene descriptions (not used in this snippet)
  private sceneHistory: Map<string, string> = new Map();

  async monitorSecurityFeed(
    cameraId: string,
    imageBuffer: Buffer
  ): Promise<SecurityAlert | null> {
    // Detect people
    const peopleDetection = await vision.detect(imageBuffer, 'person');
    const peopleCount = peopleDetection.objects?.length || 0;

    if (peopleCount > 0) {
      // Get detailed analysis
      const analysis = await vision.query(
        imageBuffer,
        `Describe the ${peopleCount} person(s) in this image: What are they doing? Are they wearing any identifiable clothing or carrying anything?`
      );

      // Check against authorized persons database
      const alert = await securityAgent.generate(`
        SECURITY ALERT - Camera ${cameraId}
        Time: ${new Date().toISOString()}
        People detected: ${peopleCount}

        Analysis: ${analysis.answer}

        Assess threat level (LOW/MEDIUM/HIGH/CRITICAL) and recommend action.
      `);

      if (alert.text.includes('HIGH') || alert.text.includes('CRITICAL')) {
        return {
          cameraId,
          threatLevel: alert.text.includes('CRITICAL') ? 'CRITICAL' : 'HIGH',
          peopleCount,
          description: analysis.answer,
          recommendation: alert.text,
          timestamp: new Date(),
        };
      }
    }

    return null;
  }
}

type SecurityAlert = {
  cameraId: string;
  threatLevel: 'LOW' | 'MEDIUM' | 'HIGH' | 'CRITICAL';
  peopleCount: number;
  description: string;
  recommendation: string;
  timestamp: Date;
};
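Feeding this service works the same way as the inventory route in Step 5. A sketch of a camera webhook, assuming the camera POSTs a JPEG frame and a camera ID; the route path and field names are illustrative:

// app/api/security/camera-event/route.ts (illustrative)
import { NextRequest, NextResponse } from 'next/server';
import { SecurityVisionService } from '@/lib/services/security-vision-service';

const service = new SecurityVisionService();

export async function POST(request: NextRequest) {
  const formData = await request.formData();
  const cameraId = formData.get('cameraId') as string;
  const frame = formData.get('frame') as File;

  const alert = await service.monitorSecurityFeed(
    cameraId,
    Buffer.from(await frame.arrayBuffer())
  );

  // Only HIGH/CRITICAL assessments produce an alert; everything else is a no-op
  return NextResponse.json({ alert });
}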
Variation 2: Document Intelligence Agent
// lib/services/document-intelligence-service.ts
import { vision } from '@/lib/vision';
import { documentAgent } from '@/mastra/agents/document';

export class DocumentIntelligenceService {
  async processDocument(imageBuffer: Buffer, documentType: 'receipt' | 'invoice' | 'id' | 'form') {
    // Get document type-specific fields
    const fields = this.getFieldsForDocumentType(documentType);

    // Extract all fields in parallel
    const extractions = await Promise.all(
      fields.map(async (field) => ({
        field,
        value: (await vision.query(imageBuffer, field.question)).answer,
      }))
    );

    // Let agent structure and validate the data
    const structuredData = await documentAgent.generate(`
      Document type: ${documentType}

      Extracted fields:
      ${extractions.map(e => `- ${e.field.name}: ${e.value}`).join('\n')}

      1. Structure this data as JSON
      2. Validate all fields are present and reasonable
      3. Flag any issues or missing data
      4. Calculate confidence score (0-100)
    `);

    return {
      documentType,
      rawExtractions: extractions,
      structuredData: structuredData.text,
      timestamp: new Date(),
    };
  }

  private getFieldsForDocumentType(type: string) {
    const fieldMaps: Record<string, Array<{ name: string; question: string }>> = {
      receipt: [
        { name: 'merchant', question: 'What is the merchant name?' },
        { name: 'total', question: 'What is the total amount?' },
        { name: 'date', question: 'What is the date?' },
        { name: 'items', question: 'List all items purchased' },
      ],
      invoice: [
        { name: 'invoiceNumber', question: 'What is the invoice number?' },
        { name: 'vendor', question: 'Who is the vendor?' },
        { name: 'amount', question: 'What is the total amount due?' },
        { name: 'dueDate', question: 'What is the payment due date?' },
      ],
      id: [
        { name: 'name', question: 'What is the name on this ID?' },
        { name: 'idNumber', question: 'What is the ID number?' },
        { name: 'dateOfBirth', question: 'What is the date of birth?' },
        { name: 'expiryDate', question: 'What is the expiry date?' },
      ],
    };

    // 'form' and any unknown types fall back to an empty field list; add their fields as needed
    return fieldMaps[type] || [];
  }
}
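Once you have the scan as a buffer, calling the service is straightforward. A usage sketch (the file path is illustrative); note that structuredData comes back as agent-generated JSON text, so parse and validate it before trusting it downstream:

import { readFile } from 'node:fs/promises';
import { DocumentIntelligenceService } from '@/lib/services/document-intelligence-service';

const docs = new DocumentIntelligenceService();

async function demo() {
  // Process a scanned receipt image from disk
  const receipt = await docs.processDocument(
    await readFile('./scans/receipt-001.jpg'),
    'receipt'
  );
  console.log(receipt.structuredData);
}

demo().catch(console.error);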
Production Considerations
1. Cost Management
Vision API calls can add up quickly. Implement smart caching:
import { Redis } from '@upstash/redis';

const redis = Redis.fromEnv();

async function cachedVisionQuery(
  imageHash: string,
  query: string,
  fn: () => Promise<any>
) {
  const cacheKey = `vision:${imageHash}:${query}`;

  const cached = await redis.get(cacheKey);
  if (cached) {
    return cached;
  }

  const result = await fn();

  // Cache for 24 hours
  await redis.setex(cacheKey, 86400, JSON.stringify(result));

  return result;
}
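cachedVisionQuery expects a stable imageHash; a content hash of the image bytes works well, since identical images then share a cache entry. A small wrapper reusing the vision client from Step 1:

import { createHash } from 'node:crypto';
import { vision } from '@/lib/vision';

async function queryWithCache(imageBuffer: Buffer, question: string) {
  // Hash the image bytes so identical images hit the same cache entry
  const imageHash = createHash('sha256').update(imageBuffer).digest('hex');

  return cachedVisionQuery(imageHash, question, () =>
    vision.query(imageBuffer, question)
  );
}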
2. Image Preprocessing
Always optimize images before sending to vision API:
import sharp from 'sharp';

async function preprocessImage(buffer: Buffer): Promise<Buffer> {
  return sharp(buffer)
    .resize(1024, 1024, { fit: 'inside' })
    .rotate()     // Auto-rotate based on EXIF
    .normalize()  // Improve contrast
    .jpeg({ quality: 85 })
    .toBuffer();
}
3. Error Handling
Vision APIs can fail—handle gracefully:
import { vision } from '@/lib/vision';

async function robustVisionQuery(
  imageBuffer: Buffer,
  query: string,
  maxRetries = 3
): Promise<string> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      const result = await vision.query(imageBuffer, query);
      return result.answer;
    } catch (error) {
      if (i === maxRetries - 1) {
        console.error('Vision query failed after retries:', error);
        return 'Unable to analyze image';
      }

      // Exponential backoff
      await new Promise(resolve => setTimeout(resolve, 1000 * Math.pow(2, i)));
    }
  }

  return 'Unable to analyze image';
}
4. Monitoring & Observability
Track vision agent performance:
import { logger } from '@/lib/logger';
import { vision } from '@/lib/vision';

async function monitoredVisionAnalysis(
  imageBuffer: Buffer,
  context: Record<string, any>
) {
  const start = Date.now();

  try {
    const result = await vision.query(imageBuffer, context.query);

    logger.info('Vision analysis completed', {
      duration: Date.now() - start,
      imageSize: imageBuffer.length,
      context,
      success: true,
    });

    return result;
  } catch (error) {
    logger.error('Vision analysis failed', {
      duration: Date.now() - start,
      imageSize: imageBuffer.length,
      context,
      error,
    });

    throw error;
  }
}
Key Takeaways
- Vision unlocks new agent capabilities - Security, inventory, document processing, quality control
- Moondream is agent-friendly - Lightweight, fast, affordable, TypeScript-native
- Multi-stage pipelines - Combine caption, query, detect for robust analysis
- Production requires optimization - Cache results, preprocess images, handle failures
- Vision + other triggers - Combine vision with scheduled and event-based triggers for more powerful workflows
- Vision is a tool - Integrate vision capabilities into your agent's tool set for maximum flexibility
What's Next
You now understand how to build vision-powered agents. Continue building your expertise:
- Chapter 14: Prompting Mastery - Learn how to craft effective prompts that work with vision results
- Chapter 15: Tool Design & Function Calling - Design vision tools that integrate seamlessly with your agent architecture
- Chapter 20: Advanced Agentic UI Patterns with Cedar - Build sophisticated UIs for vision agents with image upload, preview, and annotation
- Chapter 21: Multi-Agent Systems - Coordinate vision agents with other specialized agents
- Chapter 24: Scaling to Production - High-volume image processing infrastructure and cost optimization