Browser Automation with AI: Playwright + LLMs in Production

CallMissed

Browser automation went from "Selenium scripts that break every Tuesday" to "an LLM clicking around" faster than most software categories move. By April 2026 the field has consolidated into a small set of production-grade stacks (Playwright + LLM, Stagehand, Browser-Use, Anthropic Computer Use, and OpenAI CUA), each with a different opinion on selectors vs screenshots. Here's how to actually pick.

The two control models

Selector-driven. The classic Playwright / Puppeteer model: find an element by CSS / XPath / accessibility attributes, click it, fill it, navigate. Fast, deterministic, cheap. Breaks when the UI changes.

Vision-driven. Take a screenshot and give it to a model with vision; the model returns a click target or a typed instruction. Robust to UI changes, but slower and more expensive per step.

The 2026 production answer isn't either-or. It's a layered stack: selectors for predictable steps, vision for the dynamic ones, full Computer Use for the truly improvisational tail.

The contenders

Playwright

Playwright is the deterministic browser-automation framework — Chromium, Firefox, WebKit, one clean API. Microsoft-maintained, fast, mature. Token-efficient when used directly by agents like Claude Code or GitHub Copilot — selectors are tiny strings compared to screenshots.

Best for: predictable UIs, large-scale scraping where deterministic latency matters, end-to-end testing.
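To make the selector-driven style concrete, here is a minimal sketch using calls from Playwright's Python sync API (`goto`, `get_by_label`, `get_by_role`, `locator`). The login route, the field labels, the button name, and the `#account-heading` id are all hypothetical placeholders, not taken from any real site.

```python
def sign_in(page, base_url: str, email: str, password: str) -> str:
    """Drive a hypothetical login form with selectors only; return the heading text."""
    page.goto(f"{base_url}/login")
    # Label- and role-based locators usually survive markup changes better than raw CSS.
    page.get_by_label("Email").fill(email)
    page.get_by_label("Password").fill(password)
    page.get_by_role("button", name="Sign in").click()
    # CSS selector for an element assumed to carry a stable id.
    return page.locator("#account-heading").inner_text()


def run_with_playwright(base_url: str, email: str, password: str) -> str:
    """Launch headless Chromium and run sign_in against a real page."""
    # Imported lazily so the sketch can be read and unit-tested without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            return sign_in(browser.new_page(), base_url, email, password)
        finally:
            browser.close()
```

Because `sign_in` only depends on the `page` object it is handed, the same function can be exercised against a stub in unit tests and against a real browser in production.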

Browser-Use

Browser-Use is an open-source Python library that turns any LLM into a full browser agent. The LLM decides what to click, what to type, when to scroll, and when the task is complete. It has passed 50,000 GitHub stars, making it one of the fastest-growing OSS AI projects of 2025–2026.

Best for: tasks that require understanding rather than rote clicking — filling out variable forms, navigating sites that change layout often.

Stagehand

Stagehand layers AI on top of Playwright. You write Playwright code where the deterministic parts work, and act() / extract() / observe() calls hand off to an LLM for the dynamic parts.

Best for: teams that already have Playwright suites and want to upgrade specific brittle steps to AI without rewriting everything.

Anthropic Computer Use / OpenAI CUA

Full vision-driven control of a browser (or full desktop). Slowest and most expensive per step; best when no API exists and the UI is genuinely unpredictable.

Best for: long-tail RPA, multi-app workflows, situations where Playwright + LLM-helper isn't enough.

The hybrid stack

The pattern most production teams use in 2026:

  • Playwright handles 70–80% of steps — the predictable selectors, navigation, form fields with stable IDs
  • An LLM with vision handles 15–20% — dynamic content, popups, cookie banners, A/B-tested layouts
  • Computer Use handles the long tail 5–10% — multi-app, true improvisation, "the form moved and I have to recover"

This split gives speed (selectors are fast), reliability (deterministic for the bulk), and flexibility (vision when needed).
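One way to picture the split above is a small router that assigns each step to the cheapest tier able to handle it. This is a sketch only: the `Step` fields (`has_stable_selector`, `in_browser`) and the `Tier` names are hypothetical, and a real system would decide from richer signals.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SELECTOR = "selector"          # plain Playwright
    VISION = "vision"              # LLM looking at a screenshot
    COMPUTER_USE = "computer_use"  # full mouse-and-keyboard control


@dataclass(frozen=True)
class Step:
    description: str
    has_stable_selector: bool  # e.g. a stable id or accessible label exists
    in_browser: bool = True    # False for multi-app or desktop steps


def choose_tier(step: Step) -> Tier:
    """Pick the cheapest tier that can plausibly handle the step."""
    if not step.in_browser:
        return Tier.COMPUTER_USE
    if step.has_stable_selector:
        return Tier.SELECTOR
    return Tier.VISION


def tier_mix(steps: list[Step]) -> dict[Tier, float]:
    """Fraction of steps routed to each tier."""
    if not steps:
        return {tier: 0.0 for tier in Tier}
    counts = Counter(choose_tier(s) for s in steps)
    return {tier: counts[tier] / len(steps) for tier in Tier}
```

Running `tier_mix` over a recorded task is a quick way to check whether a workflow actually lands near the 70–80% selector share described above.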

Selectors vs screenshots: tradeoffs

| Dimension | Selectors | Screenshots |
|---|---|---|
| Latency | Sub-100 ms per action | 2–8 s per action |
| Cost | Negligible | $0.01–$0.05 per step |
| Reliability under UI change | Brittle | Robust |
| Token consumption | Tiny | Large (vision tokens) |
| Debugging | Clear (selector string) | Harder (which pixels?) |

A useful heuristic: every step you can do with a selector, do with a selector. Reach for vision when the selector strategy fails three times or when the element you need has no stable identifier.
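The heuristic can be sketched as a small wrapper: try the selector path up to three times, then hand the step to vision. The function and parameter names here are hypothetical, and the two actions are passed in as plain callables so the sketch stays independent of any particular library.

```python
from typing import Callable, Optional

MAX_SELECTOR_ATTEMPTS = 3  # the "fails three times" threshold from the heuristic


def act_with_fallback(
    selector_action: Optional[Callable[[], None]],
    vision_action: Callable[[], None],
    max_attempts: int = MAX_SELECTOR_ATTEMPTS,
    retry_on: tuple[type[Exception], ...] = (TimeoutError, LookupError),
) -> str:
    """Run a step selector-first; return which path succeeded ("selector" or "vision")."""
    # No stable identifier: pass selector_action=None to go straight to vision.
    if selector_action is not None:
        for _ in range(max_attempts):
            try:
                selector_action()
                return "selector"
            except retry_on:
                # Expected selector failure; try again until the attempts run out.
                continue
    vision_action()
    return "vision"
```

With Playwright, `retry_on` would typically include the `TimeoutError` class exported by `playwright.sync_api`, which is separate from Python's built-in `TimeoutError`. Errors outside `retry_on` propagate immediately rather than being masked by the fallback.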

Anti-bot and the real-world ceiling

A piece most marketing material glosses over: serious anti-bot defenses (Cloudflare Turnstile, hCaptcha, banking app device-attestation) detect and block automated browsers, regardless of whether you're driving them with Playwright or vision. Computer Use is harder to detect than headless Chrome, but not invisible.

Practical guidance:

  • Use real browser fingerprints (run actual Chromium, not Puppeteer's stripped build)
  • Slow down inhuman-fast actions; add jitter on mouse moves and typing
  • For sites with serious protection, plan for a human-handoff path. Some sites are not automatable.
  • Solve captchas through legitimate services where allowed; do not pretend to be a human in violation of Terms of Service. Legal and ethical concerns are real.
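The human-handoff path from the list above might look like the following sketch: when a page looks like a challenge, stop automating and raise a typed error that a queue or on-call human can pick up. The marker strings and every name here are hypothetical; real detection depends on the site and on the challenge vendor.

```python
from typing import Callable

# Hypothetical text markers, for illustration only.
CHALLENGE_MARKERS = ("verify you are human", "captcha")


class HumanHandoffRequired(Exception):
    """Raised when automation should stop and a person should take over."""

    def __init__(self, url: str, reason: str) -> None:
        super().__init__(f"{reason}: {url}")
        self.url = url
        self.reason = reason


def looks_like_challenge(page_text: str) -> bool:
    """Crude check: does the visible page text mention a known challenge marker?"""
    lowered = page_text.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def run_or_hand_off(url: str, page_text: str, automated_step: Callable[[], str]) -> str:
    """Run the automated step unless the page looks like an anti-bot challenge."""
    if looks_like_challenge(page_text):
        raise HumanHandoffRequired(url, "challenge page detected")
    return automated_step()
```

Raising instead of retrying keeps the agent from hammering a protected site and gives the caller one place to route the task to a person.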

Production checklist

If you're shipping browser automation in 2026:

  • [ ] Headless mode for batch, headed for interactive debugging
  • [ ] Persistent browser profiles for sessions (cookies, login state)
  • [ ] Per-step timeout (typically 5–15s) with typed errors
  • [ ] Screenshot capture on every error for post-mortem
  • [ ] Step-count budget per task (most tasks under 30 steps)
  • [ ] Sandboxed environment — don't run on the host that has your credentials
  • [ ] Trace export — Playwright tracing or OpenTelemetry spans for the agent layer
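Three of the checklist items (per-step timeout with typed errors, capture on error, and a step-count budget) fit naturally into one small runner. The sketch below is library-agnostic and all names in it are hypothetical; each step receives the timeout and is expected to pass it on to whatever browser call it makes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


class StepTimeout(Exception):
    """A single step exceeded its timeout."""


class StepBudgetExceeded(Exception):
    """The task needed more steps than its budget allows."""


@dataclass(frozen=True)
class Step:
    name: str
    run: Callable[[float], None]  # receives the per-step timeout in milliseconds


def run_task(
    steps: Iterable[Step],
    max_steps: int = 30,
    step_timeout_ms: float = 10_000,
    on_error: Optional[Callable[[Step, Exception], None]] = None,
) -> int:
    """Run steps in order; return how many completed."""
    completed = 0
    for step in steps:
        if completed >= max_steps:
            raise StepBudgetExceeded(f"budget of {max_steps} steps used up before {step.name!r}")
        try:
            step.run(step_timeout_ms)
        except TimeoutError as exc:
            if on_error:
                on_error(step, exc)  # e.g. save a screenshot for the post-mortem
            raise StepTimeout(f"{step.name!r} exceeded {step_timeout_ms:.0f} ms") from exc
        except Exception as exc:
            if on_error:
                on_error(step, exc)
            raise
        completed += 1
    return completed
```

In a Playwright run, `on_error` is a reasonable place to call `page.screenshot(path=...)`, and the `except` clause would catch the `TimeoutError` class from `playwright.sync_api` rather than the built-in one. Because `steps` is any iterable, a generator driven by an agent loop works as well as a fixed list.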

Cost back-of-envelope

As a rough estimate, a vision-driven step costs $0.01–$0.05 per action, so a 50-step task costs $0.50–$2.50. A pure-selector Playwright run costs cents in compute and zero in model fees. Hybrid stacks land somewhere in between, depending on the selector hit rate. If your unit economics depend on browser automation being cheap, lean as hard on selectors as you can.
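The same arithmetic as a small function, assuming the $0.01–$0.05 per-step range above and treating every step that misses a selector as one vision step. The function name and the `selector_hit_rate` parameter are hypothetical, and compute cost for selector steps is ignored.

```python
def vision_cost_range(
    total_steps: int,
    selector_hit_rate: float,
    low_per_step: float = 0.01,
    high_per_step: float = 0.05,
) -> tuple[float, float]:
    """Estimated (low, high) model cost in dollars for one task."""
    if not 0.0 <= selector_hit_rate <= 1.0:
        raise ValueError("selector_hit_rate must be between 0 and 1")
    # Only steps the selectors cannot handle incur model fees.
    vision_steps = total_steps * (1.0 - selector_hit_rate)
    return (vision_steps * low_per_step, vision_steps * high_per_step)
```

With a hit rate of 0 this reproduces the $0.50–$2.50 figure for a 50-step task. At an 80% hit rate only 10 steps go to vision, so the same task lands at roughly $0.10–$0.50.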

Frequently Asked Questions

Should I use Browser-Use or Playwright for production?
Both, layered. Playwright for the predictable steps where selectors are stable; Browser-Use (or Stagehand or vision-driven Playwright) for the parts that need understanding. Pure Browser-Use is slower and more expensive than necessary for tasks where selectors work.

Can browser agents bypass captchas and bot detection?
Not reliably, and you should not try in violation of Terms of Service. Use legitimate captcha-solving services where allowed; for sites with serious protection, plan a human-in-the-loop step. Legal exposure is real.

How does Anthropic Computer Use compare to Playwright + a vision model?
Computer Use is full mouse-and-keyboard improvisation: slow, expensive, robust. Playwright + a vision model is faster and cheaper when the page has any selectable structure. Use Computer Use as the fallback when Playwright + vision can't recover.
