Reference Architecture: E2B + Mastra + CopilotKit Coding Agent

8/1/2025

This is a production‑oriented reference for building a TypeScript coding agent with a clean separation of concerns:

  • UI and user‑visible “actions” with CopilotKit
  • Agent orchestration and tools with Mastra
  • Secure, ephemeral execution with E2B

The goal: move from a cool demo to a durable service you can operate—observability, evals, rollbacks, and safety gates included.

High‑level architecture

The system has three boundaries that keep responsibilities crisp:

  • CopilotKit (frontend + server actions): presents a conversational UI, exposes a set of declarative actions (e.g., “open PR”, “apply codemod”, “run tests”), and streams agent thoughts/results.
  • Mastra (server): orchestrates the agent loop, tool selection, planning, and verification. It decides which action to call and when to ask for human approval.
  • E2B (sandbox): runs untrusted code and commands in short‑lived, isolated environments. The sandbox holds no long‑term credentials and enforces quotas.

Tools and capabilities

Model your tools as small, testable functions with strict inputs/outputs. Examples:

  • repo.readFiles(glob)
  • repo.applyDiff(patch)
  • ci.runTests(target)
  • git.openPR(title, body)
  • shell.run(cmd, options)

In Mastra, register tools with clear schemas and guardrails (allowed paths, max runtime, dry‑run modes). In CopilotKit, surface only the tools you want the user to trigger directly.

Minimal code skeletons

The following pseudo‑code shows how the pieces fit. Adjust for your project.

CopilotKit: expose actions from the UI

// app/actions/agent.ts
import { createAction } from "@copilotkit/server"; // conceptual import; match your CopilotKit version

export const requestRefactor = createAction({
  name: "refactor_and_test",
  parameters: {
    repo: "string",
    branch: "string",
    path: "string",
    goal: "string",
  },
  execute: async (params) => {
    // Forward to your Mastra agent endpoint and stream back tokens/events
    const res = await fetch("/api/agent/refactor", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(params),
    });
    return res.body; // stream the response body to the UI
  },
});
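
On the client, the same workflow can be surfaced with CopilotKit's useCopilotAction hook so users can trigger it directly from chat. A minimal sketch; the hook is real CopilotKit API, while the endpoint and handler wiring mirror the server action above and should be adjusted to your setup:

// app/components/RefactorAction.tsx
"use client";
import { useCopilotAction } from "@copilotkit/react-core";

export function RefactorAction() {
  // Must render inside your CopilotKit provider to register the action
  useCopilotAction({
    name: "refactor_and_test",
    description: "Refactor code toward a goal, then run the test suite",
    parameters: [
      { name: "repo", type: "string", required: true },
      { name: "branch", type: "string", required: true },
      { name: "path", type: "string", required: true },
      { name: "goal", type: "string", required: true },
    ],
    handler: async (args) => {
      const res = await fetch("/api/agent/refactor", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(args),
      });
      return await res.text(); // result is shown back in the chat
    },
  });
  return null;
}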

Mastra: define the agent and tools

// server/agent.ts
import { Agent } from "@mastra/agent-builder"; // example import path; the real Mastra package names differ
import { z } from "zod";

const Repo = {
  readFiles: async ({ repo, branch, glob }: { repo: string; branch: string; glob: string }) => { /* ... */ },
  applyDiff: async ({ repo, branch, patch }: { repo: string; branch: string; patch: string }) => { /* ... */ },
};

const CI = {
  runTests: async ({ repo, branch }: { repo: string; branch: string }) => { /* ... */ },
};

export const codingAgent = new Agent({
  name: "coding-agent",
  tools: {
    readFiles: {
      schema: z.object({ repo: z.string(), branch: z.string(), glob: z.string() }),
      call: Repo.readFiles,
      limits: { timeoutMs: 20_000 },
    },
    applyDiff: {
      schema: z.object({ repo: z.string(), branch: z.string(), patch: z.string() }),
      call: Repo.applyDiff,
      limits: { timeoutMs: 30_000, dryRun: true },
    },
    runTests: {
      schema: z.object({ repo: z.string(), branch: z.string() }),
      call: CI.runTests,
      limits: { timeoutMs: 120_000 },
    },
  },
  loop: {
    maxSteps: 6,
    verify: async (step, result) => {
      // Example checks: unit tests passed, diff only in allowed paths, PR title length, etc.
      return Boolean(result?.ok);
    },
  },
});
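
The /api/agent/refactor endpoint that the CopilotKit action posts to can be a thin route handler around this agent. A sketch in the same conceptual style, assuming the Agent exposes a stream() method that returns a readable stream of step events (the real Mastra API will differ):

// app/api/agent/refactor/route.ts
import { codingAgent } from "@/server/agent"; // path assumed

export async function POST(req: Request): Promise<Response> {
  const { repo, branch, path, goal } = await req.json();

  // Hypothetical: stream() emits plan/tool_call/result events as the loop runs
  const stream = await codingAgent.stream({
    input: `Refactor ${path} in ${repo}@${branch}: ${goal}`,
  });

  // Hand the stream straight back so the CopilotKit action can forward it to the UI
  return new Response(stream, {
    headers: { "Content-Type": "text/event-stream" },
  });
}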

E2B: ephemeral execution for untrusted work

Wrap any shell/git operations in an E2B sandbox. Never run them in your API process.

// server/sandbox.ts
import { createSandbox } from "e2b"; // conceptual import; the real E2B SDK names differ

export async function withSandbox<T>(fn: (s: any) => Promise<T>): Promise<T> {
  const s = await createSandbox({
    ttlSeconds: 600,       // hard cap on sandbox lifetime
    cpu: 2,
    memoryGb: 4,
    network: "restricted", // no arbitrary egress
  });
  try {
    return await fn(s);
  } finally {
    await s.destroy(); // always tear down, even when fn throws
  }
}

Inside the sandbox, scope credentials narrowly (e.g., a short‑lived GitHub token restricted to one repo/branch) and mount a temporary workspace.
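
Concretely, a checkout step might mint a read-only, single-repo token, pass it as an environment variable for one command, and let it die with the sandbox. A sketch: mintScopedToken is a hypothetical helper, and commands.run is assumed on the sandbox handle (the real E2B SDK exposes a similar call; check its docs):

// server/checkout.ts
import { withSandbox } from "./sandbox";
import { mintScopedToken } from "./auth"; // hypothetical: short-lived token scoped to one repo

export async function cloneAndTest(repo: string, branch: string): Promise<string> {
  return withSandbox(async (s) => {
    const token = await mintScopedToken({ repo, ttlSeconds: 300 }); // read-only, expires fast
    // The token exists only in this command's environment, never in the image or on disk
    await s.commands.run(
      `git clone --depth 1 --branch ${branch} https://x-access-token:$GH_TOKEN@github.com/${repo}.git /workspace`,
      { envs: { GH_TOKEN: token } }
    );
    const result = await s.commands.run("cd /workspace && npm test");
    return result.stdout;
  });
}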

Security and safety

  • Scope every token. Prefer GitHub fine‑grained PATs, expiring OIDC tokens, or bot accounts with least privilege.
  • Use dry‑run modes first: generate diffs, don’t apply them; run tests without publishing artifacts.
  • Add allow‑lists: limit file globs, commands, and outbound hosts (see the path check sketched after this list).
  • Enforce budgets: max steps, max tool invocations, wall‑clock timeouts.
  • Keep a human‑in‑the‑loop for actions that mutate state (opening PRs, merging, tagging releases).
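
The allow-list bullet is cheap to make concrete: before applyDiff runs, check every path the diff touches against explicit prefixes and fail closed. A minimal sketch (the prefixes and blocked paths are illustrative):

// server/guards.ts
const ALLOWED_PREFIXES = ["src/", "test/", "docs/"];
const BLOCKED_PATHS = [".env", "package-lock.json", ".github/workflows/"];

export function assertDiffAllowed(touchedPaths: string[]): void {
  for (const p of touchedPaths) {
    const allowed = ALLOWED_PREFIXES.some((prefix) => p.startsWith(prefix));
    const blocked = BLOCKED_PATHS.some((b) => p === b || p.startsWith(b));
    if (!allowed || blocked) {
      // Fail closed: reject the whole diff rather than silently dropping files
      throw new Error(`Diff touches disallowed path: ${p}`);
    }
  }
}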

Observability and logs

  • Emit structured events per step: plan, tool_call, result, check, ask_approval, complete (see the event shape sketched after this list).
  • Attach correlation IDs across CopilotKit, Mastra, and E2B runs.
  • Persist transcripts alongside artifacts (diffs, test logs, PR URL).
  • Surface a tiny run dashboard so engineers can replay/inspect sessions.
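
One shared event shape keeps those bullets honest across all three layers. A sketch:

// server/events.ts
export type AgentEvent = {
  runId: string; // correlation ID threaded through CopilotKit, Mastra, and E2B
  step: number;
  kind: "plan" | "tool_call" | "result" | "check" | "ask_approval" | "complete";
  payload: unknown; // diffs, test logs, PR URL, etc.
  ts: string; // ISO timestamp
};

export function emit(event: AgentEvent): void {
  // Structured stdout logs are enough to start; swap in your event bus later
  console.log(JSON.stringify(event));
}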

Evals that matter

Focus on end‑to‑end outcomes:

  • “Given prompt X on repo Y, does the agent produce a PR that passes CI?”
  • Track pass‑rate, mean tokens, mean runtime, and the most common failure causes.
  • Keep a golden set of prompts and repos; run nightly and on changes to tools/prompts.
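
A nightly harness over the golden set can stay small. A sketch, where runCase is a hypothetical driver that runs one prompt end to end and reports whether the resulting PR passed CI:

// evals/golden.ts
type GoldenCase = { prompt: string; repo: string; branch: string };

export async function runGoldenSet(
  cases: GoldenCase[],
  runCase: (c: GoldenCase) => Promise<boolean> // hypothetical: true if the PR passed CI
): Promise<number> {
  let passed = 0;
  const failures: string[] = [];
  for (const c of cases) {
    if (await runCase(c)) passed++;
    else failures.push(c.prompt);
  }
  const passRate = passed / cases.length;
  console.log(`pass-rate ${(passRate * 100).toFixed(1)}% (${passed}/${cases.length})`);
  if (failures.length) console.log("failing prompts:", failures);
  return passRate;
}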

Deployment approach

  • Run the Mastra agent and CopilotKit server actions on your core API infra.
  • Keep E2B sandboxes in a separate project/network with restricted egress.
  • Store no secrets inside the sandbox image; inject short‑lived tokens at start.
  • Blue‑green prompts/configs: version them, roll back quickly.
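
Versioned prompts can be as simple as a registry plus an active pointer; a rollback is flipping the pointer back. A sketch (the version keys and env var are illustrative):

// server/promptConfig.ts
type PromptConfig = { version: string; systemPrompt: string; maxSteps: number };

const CONFIGS: Record<string, PromptConfig> = {
  "2025-07-28": { version: "2025-07-28", systemPrompt: "...", maxSteps: 6 },
  "2025-08-01": { version: "2025-08-01", systemPrompt: "...", maxSteps: 8 },
};

// Flipping this (e.g., via env var at deploy time) is the whole rollback story
const ACTIVE = process.env.PROMPT_VERSION ?? "2025-08-01";

export function activeConfig(): PromptConfig {
  return CONFIGS[ACTIVE] ?? CONFIGS["2025-07-28"];
}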

Failure modes and fallbacks

  • Tools flake: add retries with jitter (sketched after this list); degrade gracefully to “draft‑only” mode.
  • Mis‑scoped diffs: fail closed; require approval; add stricter path filters.
  • Test timeouts: shard or reduce scope; surface partial results and ask for help.
  • Model variance: use Toolformer‑style hints and exemplars; add self‑checks.
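
For the retry bullet, exponential backoff with full jitter covers most tool flakes. A sketch:

// server/retry.ts
export async function withRetry<T>(
  fn: () => Promise<T>,
  { attempts = 3, baseMs = 500 }: { attempts?: number; baseMs?: number } = {}
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      const delay = Math.random() * baseMs * 2 ** i; // full jitter
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw lastErr;
}

Wrap individual tool calls, e.g. withRetry(() => CI.runTests({ repo, branch })), rather than whole agent steps, so a retry never replays a mutation.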

Checklist to go from demo to durable

  • Define 1–2 workflows with crisp success criteria (e.g., “open PR that passes CI”).
  • Implement tools with schemas, timeouts, and dry‑runs.
  • Add transcripts, IDs, and a simple run viewer.
  • Create a golden eval set; track weekly pass‑rate and rework.
  • Start with approval gates; expand autonomy per‑workflow as trust grows.

When you can show “minutes saved per PR” with stable rework, you’ve crossed the demo chasm.