Computer Use Agents: How They Work and What's Hard

CallMissed

Anthropic introduced Computer Use in late 2024 as the first production-grade API where an LLM could drive a screen — see pixels, move a mouse, type. Eighteen months in, it's no longer a research demo. Production teams are running it for QA automation, internal tooling, RPA-style workflows, and customer-onboarding handholding. It is also still the most operationally fragile agent surface most teams will touch. Here's how it actually works and what breaks.

The core loop

Computer Use is not magic. It's a tight loop:

  • Take a screenshot of the screen the agent is allowed to see
  • Send the image plus the user's goal plus prior actions to the model
  • Model returns one of: screenshot, mouse_move(x,y), left_click, type("hello"), key("Return"), scroll, wait, etc.
  • Your harness executes the action
  • Repeat

    Everything client-side: per the Anthropic docs, screenshots, mouse actions, keyboard input, and files are captured and stored in your environment, not Anthropic's. Anthropic processes the images and action requests but does not retain them after the response.
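
The loop described above can be sketched in a few lines. This is a minimal illustration, not Anthropic's SDK: `Action`, `run_loop`, and the three callables (`take_screenshot`, `ask_model`, `execute`) are hypothetical names, and the model call is stubbed out so the control flow is visible on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    """One action chosen by the model, e.g. Action("left_click", {"x": 10, "y": 20})."""
    name: str
    args: dict = field(default_factory=dict)


def run_loop(
    goal: str,
    take_screenshot: Callable[[], bytes],
    ask_model: Callable[[str, bytes, list], Optional[Action]],
    execute: Callable[[Action], None],
    max_steps: int = 30,
) -> list:
    """Screenshot -> model -> action -> execute, until the model stops or the budget runs out."""
    history: list = []
    for _ in range(max_steps):
        image = take_screenshot()                  # capture what the agent may see
        action = ask_model(goal, image, history)   # image + goal + prior actions
        if action is None:                         # assumption: None means "task finished"
            break
        execute(action)                            # the harness, not the model, acts
        history.append(action)
    return history


# Stand-ins for a real screen, model, and input driver (all hypothetical).
scripted = iter([
    Action("left_click", {"x": 10, "y": 20}),
    Action("type", {"text": "hello"}),
    None,
])
done = run_loop(
    "demo goal",
    take_screenshot=lambda: b"fake-png-bytes",
    ask_model=lambda goal, image, history: next(scripted),
    execute=lambda action: None,
)
print([a.name for a in done])  # ['left_click', 'type']
```

In a real harness, `ask_model` would wrap the API call and `execute` would drive the sandboxed display; keeping them as injected callables makes the loop testable without either.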

    The model versions matter. The computer_20251124 tool is supported on Claude Opus 4.5, Sonnet 4.6, Opus 4.6, and Opus 4.7, with capabilities like region zoom for fine-grained text reads.

    Why it works at all

    Two design choices give this a fighting chance:

  • General computer skills, not per-app integrations. Rather than a Salesforce tool and a Gmail tool and a Linear tool, the model uses the same screen-and-mouse interface a human would. New apps work without integration work.
  • Coordinated screenshots between actions. Each step gets a fresh image so the model can see the consequence of the previous action. This is the loop's most expensive feature — and the source of most of its limitations.

    What's hard in production

    Latency

    Each round-trip is screenshot capture + image upload + model inference + action dispatch. At 4–8 seconds per step, a 20-step task takes 80–160 seconds, or roughly 1.3–2.7 minutes. For human-in-the-loop work that's fine; for unattended automation it bottlenecks throughput. [Inference] Most production deployments mix Computer Use with deterministic tools: Playwright for the predictable parts, Computer Use for the unpredictable parts.
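
The latency arithmetic is simple enough to check directly. The helper below is a hypothetical sketch (`task_duration_seconds` is not part of any library); it only multiplies the per-step range by the step count.

```python
from typing import Tuple


def task_duration_seconds(steps: int, min_step_s: float, max_step_s: float) -> Tuple[float, float]:
    """Return the (best case, worst case) wall-clock time for a task, in seconds."""
    return steps * min_step_s, steps * max_step_s


low, high = task_duration_seconds(steps=20, min_step_s=4.0, max_step_s=8.0)
print(low, high)  # 80.0 160.0
```

That is about 1.3 to 2.7 minutes for 20 steps, and the total grows linearly with every extra step the agent takes.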

    Error recovery

    Humans recover from "wait, that wasn't the right window" almost instantly. Models often loop: click the wrong button, click it again, then try harder. Useful patterns:

  • A hard step budget (max_steps=30) per task with explicit failure
  • A "stuck detector" that watches for repeated screenshots and short-circuits
  • An explicit give_up tool the model can call when it doesn't know how to proceed

    Without these, a small misdetection becomes a 100-step thrash that costs more than the task.
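
The three guards above can be combined in one small wrapper. This is a sketch under stated assumptions: `StuckDetector` and `run_with_guards` are hypothetical names, actions are plain strings, `"done"` is an invented sentinel, and "repeated screenshots" is approximated by an exact hash match on consecutive frames.

```python
import hashlib
from typing import Callable, Optional


class StuckDetector:
    """Flags a run once the same screenshot appears max_repeats times in a row."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self._last_digest: Optional[str] = None
        self._repeats = 0

    def observe(self, screenshot: bytes) -> bool:
        digest = hashlib.sha256(screenshot).hexdigest()
        if digest == self._last_digest:
            self._repeats += 1
        else:
            self._last_digest = digest
            self._repeats = 1
        return self._repeats >= self.max_repeats


def run_with_guards(
    take_screenshot: Callable[[], bytes],
    next_action: Callable[[bytes], str],
    execute: Callable[[str], None],
    max_steps: int = 30,
    max_repeats: int = 3,
) -> str:
    """Run the loop and return how it ended: done, gave_up, stuck, or budget_exhausted."""
    detector = StuckDetector(max_repeats)
    for _ in range(max_steps):                 # hard step budget
        image = take_screenshot()
        if detector.observe(image):            # screen has not changed: short-circuit
            return "stuck"
        action = next_action(image)
        if action == "give_up":                # the model's explicit exit
            return "gave_up"
        if action == "done":                   # hypothetical completion sentinel
            return "done"
        execute(action)
    return "budget_exhausted"                  # explicit failure, not a silent stop


# A model that keeps clicking while the screen never changes.
outcome = run_with_guards(lambda: b"same-frame", lambda image: "left_click", lambda a: None)
print(outcome)  # stuck
```

Exact hashing is the simplest possible comparison and will miss loops where a clock or blinking cursor changes a few pixels; a tolerance-based image comparison would be the natural next step.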

    Sandboxing

    The model is going to click things. You do not want it clicking real things. Almost every production deployment runs Computer Use inside:

  • A disposable VM or container
  • A dedicated user account with no access to the host or other users' data
  • Network egress filtered to the target apps

    The convenience of "let it use the real desktop" is rarely worth the blast radius. [Inference]
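
For the egress filter, the decision itself is a plain allowlist check. The sketch below shows only that decision; the hostnames are placeholders, `egress_allowed` is a hypothetical helper, and actual enforcement would live in a proxy or firewall in front of the sandbox rather than in agent code.

```python
from urllib.parse import urlsplit

# Placeholder hosts: replace with the apps the agent is meant to reach.
ALLOWED_HOSTS = frozenset({"staging.example.com", "login.example.com"})


def egress_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Allow only HTTPS requests to an exact-match host on the allowlist."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    return host in allowed_hosts


print(egress_allowed("https://staging.example.com/signup"))  # True
print(egress_allowed("https://evil.example.net/"))           # False
print(egress_allowed("http://staging.example.com/"))         # False
```

Exact-match hosts are deliberately strict: a suffix match such as "ends with example.com" would also admit lookalike domains.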

    Anti-bot defenses

    Sites with serious bot protection (Cloudflare Turnstile, hCaptcha, device attestation in banking apps) can detect and block automated browsers and synthetic mouse motion. Computer Use is harder to detect than headless Chrome because it drives a real browser, but it is not invisible. Plan for the case where the target app blocks you, and have a human-handoff path.

    Vision precision

    The model occasionally misjudges coordinates, clicking a few pixels off the target element. The 2025 enhanced computer tool added a zoom action so the model can crop and reread a region at full resolution before clicking, which materially improves precision on dense UIs.
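
Any crop-and-enlarge scheme needs a way to turn a point in the enlarged image back into a screen coordinate. The function below is generic arithmetic for that mapping, not a description of how the zoom action is implemented; `zoom_to_screen` and its parameter layout are assumptions made for illustration.

```python
from typing import Tuple


def zoom_to_screen(
    px: float,
    py: float,
    region: Tuple[int, int, int, int],
    zoom_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map a point in a zoomed image back to full-screen pixel coordinates.

    region    -- (left, top, width, height) of the cropped area, in screen pixels
    zoom_size -- (width, height) of the enlarged image the point was read from
    """
    left, top, width, height = region
    zoom_w, zoom_h = zoom_size
    x = left + px * width / zoom_w    # undo the horizontal scale, then the crop offset
    y = top + py * height / zoom_h    # same for the vertical axis
    return round(x), round(y)


# A 200x100 region at (400, 300), enlarged 4x to 800x400.
print(zoom_to_screen(400, 200, region=(400, 300, 200, 100), zoom_size=(800, 400)))  # (500, 350)
```

The same scaling applies if a harness downsizes full screenshots before sending them: coordinates chosen on the smaller image have to be scaled back up before the click is dispatched.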

    Where it shines

  • Onboarding flows. Walk a user through a multi-step setup, narrating actions, while taking screenshots they can review.
  • QA automation for visual regressions. "Click through the signup flow on staging and tell me if anything looks broken."
  • Long-tail RPA. Apps with no API where you'd otherwise pay for a pixel-bot vendor.
  • Knowledge-worker copilots that span apps — moving data between Notion, Excel, and a CRM that has poor integrations.

    Where it doesn't

  • High-throughput data extraction. Use Playwright + a vision model on demand, not a screenshot-per-step loop.
  • Latency-sensitive customer flows. A user waiting 3 minutes for the agent to finish is a worse experience than a slightly less smart deterministic flow.
  • Anything regulated. Banking, healthcare, anything subject to audit — the loop's nondeterminism makes audit trails painful.

    A pragmatic stack

    Most 2026 production stacks layer:

  • Playwright (or equivalent) for the 70–80% of steps that are predictable selectors
  • An LLM with vision for the 15–20% that need understanding of dynamic UI
  • Computer Use for the long-tail 5–10% that need full mouse-and-keyboard improvisation

    Treating Computer Use as the only tool and refusing to use it because Playwright "should be enough" are both mistakes. It's a layer in a stack, not a strategy.
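
One way to picture the layering is a fallback chain: try the cheapest layer first and escalate only when it cannot handle the step. The sketch below assumes each layer is a function that returns a result or `None`; `run_step` and the three handlers are hypothetical stand-ins, not real Playwright or Computer Use calls.

```python
from typing import Callable, Optional, Sequence, Tuple

Handler = Callable[[str], Optional[str]]


def run_step(step: str, layers: Sequence[Tuple[str, Handler]]) -> Tuple[str, str]:
    """Try each layer in order; return (layer_name, result) from the first that succeeds."""
    for name, handler in layers:
        result = handler(step)
        if result is not None:
            return name, result
    raise RuntimeError(f"no layer could handle step: {step!r}")


# Stand-in handlers: each returns None when the step is outside what it can do.
def by_selector(step: str) -> Optional[str]:
    return "clicked #submit" if step == "submit form" else None


def by_vision(step: str) -> Optional[str]:
    return "found banner" if step == "dismiss banner" else None


def by_computer_use(step: str) -> Optional[str]:
    return f"improvised: {step}"


layers = [
    ("playwright", by_selector),        # predictable selectors
    ("vision", by_vision),              # dynamic UI that needs understanding
    ("computer_use", by_computer_use),  # long-tail improvisation
]
print(run_step("submit form", layers))        # ('playwright', 'clicked #submit')
print(run_step("drag chart legend", layers))  # ('computer_use', 'improvised: drag chart legend')
```

Ordering the layers from cheapest to most expensive keeps the slow screenshot loop for the steps that nothing else can handle.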

    Frequently Asked Questions

    Is Computer Use generally available or still beta?
    Computer Use has been on the Anthropic API since late 2024. It is production-available with model-specific tool versions; the most current variant is computer_20251124, supported on Opus 4.5+ and Sonnet 4.6+. [Inference]
    Can I run Computer Use on a user's own machine?
    You can, but it's risky. Most production deployments run it in a sandbox VM or container with limited network and storage access, then expose results back to the user. Letting the model click on a real desktop with real credentials is a large blast radius.
    How does Computer Use compare to Playwright + a vision model?
    Playwright is faster and cheaper when the UI has stable selectors. Computer Use is better when the UI is dynamic, has no integration surface, or spans multiple apps. Many teams use both.
