ai-dev
agents

Beyond Autocomplete: Agent Loops for Real Work

7/24/2025

Autocomplete is table stakes. What actually moves the needle is an agent loop that plans, calls your tools, verifies results, and summarizes the outcome. In this guide, we’ll cover what a practical loop looks like, how to ship your first workflow safely, and how to measure impact once it’s live.

Why autocomplete only gets you so far

Autocomplete accelerates typing. But most real work is not typing—it’s navigating systems, reading docs and dashboards, calling APIs, and checking that outputs are correct. An agent loop can orchestrate these steps end to end, so humans spend time on direction and judgment rather than repetitive glue work.

Where autocomplete shines:

  • Writing boilerplate and common patterns.
  • Inline refactors and small edits.
  • Quick reminders of APIs and syntax.

Where it breaks down:

  • Multi-step tasks that cross code, data, and external tools.
  • Work that requires verification and retries.
  • “Do the thing” outcomes (create a PR, run a job, update a dashboard) instead of just producing text.

What an agent loop actually does

A minimal, production-friendly agent loop is small, observable, and opinionated:

  • Plan: Produce a short plan with 2–6 steps. Keep it explicit and bounded.
  • Ground: Pull the right context (files, tickets, runbooks, metrics) before acting.
  • Act: Call a tool for each step: read files, run tests, query an API, open a pull request, send a message.
  • Check: Verify outputs with assertions or heuristics; if a check fails, revise or ask for help.
  • Summarize: Create a durable artifact: a PR description, a run summary, an incident note.

Don’t overcomplicate this. You don’t need autonomous research agents or deep search trees to ship value. You need a small loop with the correct tools and guardrails.

Pseudocode for a tiny loop

type Step = {
  id: string;
  goal: string;
  tool: string;
  args: Record<string, unknown>;
  check?: string; // e.g., "tests pass", "HTTP 2xx", "diff contains expected change"
};

// A tool takes plain JSON args and returns whatever the loop should record.
type ToolFn = (args: Record<string, unknown>) => Promise<unknown>;

// Everything the loop planned, did, and checked, in order.
type TranscriptEntry =
  | { type: 'plan'; step: Step }
  | { type: 'result'; stepId: string; result: unknown }
  | { type: 'check'; stepId: string; ok: boolean };

// `verify` is whatever checker you supply: an exact predicate where you can
// write one, an advisory LLM self-check where you can't.
async function runAgentLoop(
  plan: Step[],
  tools: Record<string, ToolFn>,
  verify: (check: string, result: unknown) => Promise<boolean>,
  budget = 6,
) {
  const transcript: TranscriptEntry[] = [];
  for (let i = 0; i < Math.min(plan.length, budget); i++) {
    const step = plan[i];
    transcript.push({ type: 'plan', step });
    const tool = tools[step.tool];
    if (!tool) throw new Error(`Unknown tool: ${step.tool}`);
    const result = await tool(step.args);
    transcript.push({ type: 'result', stepId: step.id, result });
    if (step.check) {
      const ok = await verify(step.check, result);
      transcript.push({ type: 'check', stepId: step.id, ok });
      if (!ok) return { status: 'needs_human', transcript };
    }
  }
  return { status: 'done', transcript };
}

The crucial details live outside the loop: tool quality, verification, limits, and observability.

Choosing your first workflow

Start where you have clear success criteria and deterministic tools. Good candidates:

  • Update a single source of truth: “Read service A’s version number and update the changelog + docs + dashboard.”
  • Create a PR from a template change: “Apply this lint rule or config change across N files; run tests; open a PR with a diff and summary.”
  • Run an internal runbook: “Restart service X; check metrics; post a summary to Slack.”
  • Generate structured reporting: “Query the analytics API and produce a weekly table with deltas and notes.”

Avoid first: anything that requires broad search, subjective decisions, or unbounded exploration.
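
To make the second candidate concrete, here is what its plan might look like in the Step shape from the loop above. The tool names, lint rule, and branch are hypothetical; map them to whatever your tool layer actually exposes.

// A hypothetical plan for "apply a lint rule across N files, run tests, open a PR".
// Tool names and args are illustrative, not a prescribed API.
const lintRulePlan: Step[] = [
  {
    id: 'apply-rule',
    goal: 'Apply the new lint rule fix across src/**',
    tool: 'apply_codemod',
    args: { rule: 'no-floating-promises', include: 'src/**/*.ts' },
    check: 'diff only touches files under src/',
  },
  {
    id: 'run-tests',
    goal: 'Confirm the change does not break the suite',
    tool: 'run_tests',
    args: { command: 'npm test' },
    check: 'tests pass',
  },
  {
    id: 'open-pr',
    goal: 'Open a PR with the diff and a summary',
    tool: 'open_pr',
    args: { branch: 'chore/no-floating-promises', draft: true },
    check: 'HTTP 2xx',
  },
];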

Safety-first rollout

Think in autonomy levels and graduate over time:

  • Level 0 — Draft only: The agent never executes; it only proposes a plan, commands, and expected diffs. Humans run them.
  • Level 1 — Constrained execution: The agent can call read-only tools and dry-runs. Humans approve side effects.
  • Level 2 — Supervised writes: The agent can open PRs or trigger jobs behind an approval gate. Every change has a review URL.
  • Level 3 — Trusted tasks: The agent runs specific workflows end-to-end with audits and alerts.

Design your tools for safety:

  • Prefer idempotent operations and dry-run modes.
  • Require explicit scopes (e.g., only modify files matching a glob, only operate in a sandbox branch).
  • Attach every action to a trace/span so you can reconstruct what happened.
  • Enforce budgets: step count, token/latency, and a hard wall-clock timeout.
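
Much of this belongs in a thin wrapper around each tool rather than in the prompt. Below is a minimal sketch that assumes write tools take a path argument and reuses the ToolFn shape from the loop above; the option names are illustrative, not a standard API.

// A minimal sketch of a guarded tool wrapper. The point is that scope, dry-run,
// and timeouts are enforced by code, not by instructions in the prompt.
type GuardOpts = { allowedPrefix: string; dryRun: boolean; timeoutMs: number };

function withGuardrails(tool: ToolFn, opts: GuardOpts): ToolFn {
  return async (args) => {
    // Scope: refuse writes outside the sandbox path.
    const path = typeof args.path === 'string' ? args.path : '';
    if (!path.startsWith(opts.allowedPrefix)) {
      throw new Error(`Path out of scope: ${path}`);
    }
    // Dry-run: report the intended call instead of performing it.
    if (opts.dryRun) {
      return { dryRun: true, wouldCall: args };
    }
    // Budget: a hard wall-clock timeout on the real call.
    return Promise.race([
      tool(args),
      new Promise((_, reject) =>
        setTimeout(() => reject(new Error('tool call timed out')), opts.timeoutMs),
      ),
    ]);
  };
}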

Verification is non-negotiable

Verification can be as simple as:

  • Did unit/integration tests pass?
  • Does the HTTP response have the expected shape and status?
  • Does the diff only touch allowed paths and include expected strings?
  • Do metrics move in the right direction after a job runs?

Add LLM-based self-checks for things you can’t write exact predicates for, but keep them advisory—fail closed when in doubt.
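
The exact predicates are worth writing wherever you can. As one example, the diff check from the list above might look like this minimal sketch, assuming the change is available as a unified diff string.

// A minimal sketch of the diff predicate: allowed paths plus required strings.
// Assumes a unified diff string; adapt the parsing to your VCS tooling.
function checkDiff(
  diff: string,
  allowedPathPrefixes: string[],
  expectedStrings: string[],
): boolean {
  // Every changed file must live under an allowed prefix.
  const changedFiles = diff
    .split('\n')
    .filter((line) => line.startsWith('+++ b/'))
    .map((line) => line.slice('+++ b/'.length));
  const pathsOk = changedFiles.every((file) =>
    allowedPathPrefixes.some((prefix) => file.startsWith(prefix)),
  );
  // Every expected string must appear somewhere in the diff.
  const contentOk = expectedStrings.every((s) => diff.includes(s));
  return pathsOk && contentOk;
}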

Observability and evals

  • Capture a transcript: prompts, tool calls, outputs, decisions, and diffs.
  • Emit structured logs per step (JSON), not just free text.
  • Build a tiny eval harness: given an input, does the agent produce the expected artifact (e.g., a PR that passes CI)?
  • Track failure modes: missing context, flaky tools, unclear instructions.

Over time, you’ll replace the fragile steps: stabilize prompts, tighten tools, and add checks where you formerly relied on “vibes.”
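
A tiny eval harness doesn’t need a framework: a handful of cases that run the loop with canned tools and assert on the outcome goes a long way. Here is a minimal sketch that reuses the types and runAgentLoop from the loop above; the case shape is an assumption.

// A minimal sketch of an eval harness: each case runs the loop and asserts on
// the final outcome. Case names and expectations are whatever matters to you.
type EvalCase = {
  name: string;
  plan: Step[];
  expect: (outcome: { status: string; transcript: TranscriptEntry[] }) => boolean;
};

async function runEvals(
  cases: EvalCase[],
  tools: Record<string, ToolFn>,
  verify: (check: string, result: unknown) => Promise<boolean>,
) {
  const results: { name: string; pass: boolean }[] = [];
  for (const c of cases) {
    const outcome = await runAgentLoop(c.plan, tools, verify);
    const pass = c.expect(outcome);
    // One structured log line per case makes regressions easy to spot over time.
    console.log(JSON.stringify({ eval: c.name, pass, status: outcome.status }));
    results.push({ name: c.name, pass });
  }
  return results;
}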

How to measure impact

  • Lead time: time from request to artifact (PR, job, report).
  • Rework rate: percent of runs needing human correction.
  • Adoption: number of weekly active users and runs per user.
  • Unit economics: average tokens, runtime, and side-effect cost per successful run.

If you can show “N minutes saved per run” with a stable rework rate, you have a durable win.
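
If you store a small record per run alongside the transcript, these numbers roll up with a few lines of code. A minimal sketch; the RunRecord shape is an assumption about what your store holds.

// A minimal sketch of lead time and rework rate computed from stored runs.
type RunRecord = {
  requestedAt: number;      // ms since epoch, when the request came in
  artifactAt: number;       // when the PR/job/report landed
  neededCorrection: boolean; // did a human have to fix the result?
};

function summarizeRuns(runs: RunRecord[]) {
  const leadTimesMin = runs.map((r) => (r.artifactAt - r.requestedAt) / 60_000);
  const avgLeadTimeMin =
    leadTimesMin.reduce((sum, t) => sum + t, 0) / Math.max(runs.length, 1);
  const reworkRate =
    runs.filter((r) => r.neededCorrection).length / Math.max(runs.length, 1);
  return { avgLeadTimeMin, reworkRate };
}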

A simple rollout plan you can copy

  1. Pick one workflow with deterministic tools and a clear end artifact.
  2. Implement the smallest loop possible with budgets and a transcript.
  3. Run in Level 0 (draft only) for a week; collect feedback.
  4. Move to Level 1 with dry-runs and approval prompts.
  5. Graduate to Level 2 for just that workflow with audit trails.
  6. Add a second workflow only after the first is boring.

Closing thoughts

Agent loops are not magic—they’re disciplined automation with language in the middle. Keep the loop small, the tools sharp, and the verification explicit. Ship one workflow, measure it, then add the next.

See also:

  • RAG pipelines that supply better context to your loop: /blog/rag-for-your-blog-open-source
  • A full coding agent reference architecture: /blog/build-a-coding-agent-ts