Browser Automation with AI: Playwright + LLMs in Production
Browser automation went from "Selenium scripts that break every Tuesday" to "an LLM clicking around" faster than most software categories move. By April 2026 the field has consolidated into a small set of production-grade stacks (Playwright + LLM, Stagehand, Browser-Use, Anthropic Computer Use, and OpenAI's CUA), each with a different opinion on selectors vs screenshots. Here's how to actually pick.
The two control models
Selector-driven. The classic Playwright / Puppeteer model: find an element by CSS / XPath / accessibility attributes, click it, fill it, navigate. Fast, deterministic, cheap. Breaks when the UI changes.
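As a minimal sketch of the selector-driven model, the snippet below uses Playwright's Python sync API. The URL, field labels, button name, and credentials are hypothetical placeholders, not taken from any real site.

```python
def log_in(page, username: str, password: str) -> None:
    """Drive a login form purely through selectors (no model calls).

    `page` is expected to be a Playwright `Page`. The URL, field labels,
    and button name below are hypothetical placeholders.
    """
    page.goto("https://example.com/login")
    # Accessibility-based locators tend to survive markup changes
    # better than long CSS or XPath chains.
    page.get_by_label("Username").fill(username)
    page.get_by_label("Password").fill(password)
    page.get_by_role("button", name="Sign in").click()


def main() -> None:
    # Imported lazily so log_in() can also be exercised with a stub page.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        log_in(page, "demo-user", "demo-password")
        browser.close()
```

Every action here is a short string lookup followed by a direct browser command, which is where the speed and low cost come from. It is also why a renamed label or restructured form breaks the run.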
Vision-driven. Take a screenshot and give it to a vision-capable model, which returns a click target or a typing instruction. Robust to UI changes, but slower and more expensive per step.
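The vision-driven loop can be sketched roughly as follows. The `VisionAction` schema and the `decide` callable are assumptions made for illustration; real vendors each define their own action formats. Only `page.screenshot()`, `page.mouse.click()`, and `page.keyboard.type()` are actual Playwright calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VisionAction:
    """One model decision. This schema is an assumption, not a vendor format."""
    kind: str       # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


# Hypothetical signature: (goal, PNG bytes) -> next action.
DecideFn = Callable[[str, bytes], VisionAction]


def run_vision_loop(page, goal: str, decide: DecideFn, max_steps: int = 20) -> int:
    """Screenshot, ask the model, act, repeat. Returns the number of actions taken."""
    for step in range(max_steps):
        screenshot = page.screenshot()      # Playwright returns PNG bytes
        action = decide(goal, screenshot)   # one model call per step
        if action.kind == "done":
            return step
        if action.kind == "click":
            page.mouse.click(action.x, action.y)
        elif action.kind == "type":
            page.keyboard.type(action.text)
        else:
            raise ValueError(f"unknown action kind: {action.kind!r}")
    raise TimeoutError(f"goal not reached within {max_steps} steps")
```

Each iteration pays for one screenshot upload and one model round trip, which is the source of the per-step latency and cost. The `max_steps` cap is a simple guard against a model that never reports completion.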
The 2026 production answer isn't either-or. It's a layered stack: selectors for predictable steps, vision for the dynamic ones, full Computer Use for the truly improvisational tail.
The contenders
Playwright
Playwright is the deterministic browser-automation framework — Chromium, Firefox, WebKit, one clean API. Microsoft-maintained, fast, mature. Token-efficient when used directly by agents like Claude Code or GitHub Copilot — selectors are tiny strings compared to screenshots.
Best for: predictable UIs, large-scale scraping where deterministic latency matters, end-to-end testing.
Browser-Use
Browser-Use is an open-source Python library that turns any LLM into a full browser agent. The LLM decides what to click, what to type, when to scroll, and when the task is complete. It has passed 50,000 GitHub stars, making it one of the fastest-growing OSS AI projects of 2025–2026.
Best for: tasks that require understanding rather than rote clicking — filling out variable forms, navigating sites that change layout often.
Stagehand
Stagehand layers AI on top of Playwright. You write Playwright code where the deterministic parts work, and act() / extract() / observe() calls hand off to an LLM for the dynamic parts.
Best for: teams that already have Playwright suites and want to upgrade specific brittle steps to AI without rewriting everything.
Anthropic Computer Use / OpenAI CUA
Full vision-driven control of a browser (or full desktop). Slowest and most expensive per step; best when no API exists and the UI is genuinely unpredictable.
Best for: long-tail RPA, multi-app workflows, situations where Playwright + LLM-helper isn't enough.
The hybrid stack
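The pattern most production teams use in 2026 is a three-tier split: selectors for predictable steps, vision for the dynamic ones, and full Computer Use for the truly improvisational tail.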
This split gives speed (selectors are fast), reliability (deterministic for the bulk), and flexibility (vision when needed).
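One way to picture the layered stack is as a plan whose steps are each tagged with a tier and dispatched to a matching handler. The `Tier`, `Step`, and `run_plan` names below are hypothetical and serve only to illustrate the routing idea. In practice the handlers would wrap Playwright, an LLM helper, and a Computer Use style agent respectively.

```python
from enum import Enum
from typing import Callable, Dict, List, NamedTuple


class Tier(Enum):
    SELECTOR = "selector"          # deterministic Playwright call
    VISION = "vision"              # LLM-assisted step for dynamic UI
    COMPUTER_USE = "computer_use"  # fully agentic fallback


class Step(NamedTuple):
    name: str
    tier: Tier
    instruction: str  # a selector for SELECTOR, natural language otherwise


Handler = Callable[[str], None]


def run_plan(steps: List[Step], handlers: Dict[Tier, Handler]) -> Dict[Tier, int]:
    """Dispatch each step to its tier's handler and count usage per tier."""
    usage = {tier: 0 for tier in Tier}
    for step in steps:
        handlers[step.tier](step.instruction)
        usage[step.tier] += 1
    return usage
```

Tracking usage per tier makes it easy to see how much of a workflow is actually running on the cheap deterministic path.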
Selectors vs screenshots: tradeoffs
| Dimension | Selectors | Screenshots |
|---|---|---|
| Latency | Sub-100ms per action | 2–8s per action |
| Cost | Negligible | $0.01–$0.05 per step |
| Reliability under UI change | Brittle | Robust |
| Token consumption | Tiny | Large (vision tokens) |
| Debugging | Clear (selector string) | Harder (which pixels?) |
A useful heuristic: every step you can do with a selector, do with a selector. Reach for vision when the selector strategy fails three times or when the element you need has no stable identifier.
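The heuristic could be expressed as a small wrapper like the one below. The function name and the callables it takes are illustrative assumptions; the only numbers taken from the text are the three selector attempts.

```python
from typing import Callable, Optional


def act_with_fallback(
    selector_action: Optional[Callable[[], None]],
    vision_action: Callable[[], None],
    max_selector_attempts: int = 3,
) -> str:
    """Try the cheap selector path first; escalate to vision after repeated failures.

    Pass selector_action=None when the element has no stable identifier.
    Returns "selector" or "vision" so callers can track the selector hit rate.
    """
    if selector_action is not None:
        for _ in range(max_selector_attempts):
            try:
                selector_action()
                return "selector"
            except Exception:
                # In real code, catch the specific error type (for Playwright,
                # typically its TimeoutError) rather than every exception.
                continue
    vision_action()
    return "vision"
```

Logging the returned tier per step gives a running picture of which selectors are decaying and need maintenance.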
Anti-bot and the real-world ceiling
A piece most marketing material glosses over: serious anti-bot defenses (Cloudflare Turnstile, hCaptcha, banking app device-attestation) detect and block automated browsers, regardless of whether you're driving them with Playwright or vision. Computer Use is harder to detect than headless Chrome, but not invisible.
Practical guidance:
Production checklist
If you're shipping browser automation in 2026:
Cost back-of-envelope
As a rough estimate, a vision-driven step costs $0.01–$0.05, so a 50-step task costs $0.50–$2.50 in model fees. A pure-selector Playwright run costs cents in compute and nothing in model fees. Hybrid stacks land somewhere in between, depending on the selector hit rate. If your unit economics depend on browser automation being cheap, lean as hard on selectors as you can.
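The same arithmetic can be written as a small estimator. The function name and the linear model (selector steps cost nothing in model fees, vision steps cost a flat per-step rate) are simplifying assumptions; the default rates are the per-step figures quoted above.

```python
from typing import Tuple


def estimate_task_cost(
    steps: int,
    selector_hit_rate: float,
    vision_cost_low: float = 0.01,
    vision_cost_high: float = 0.05,
) -> Tuple[float, float]:
    """Return a (low, high) model-fee estimate in USD for one task.

    Selector steps are treated as free in model fees; only the steps that
    fall through to vision are charged.
    """
    if not 0.0 <= selector_hit_rate <= 1.0:
        raise ValueError("selector_hit_rate must be between 0 and 1")
    vision_steps = steps * (1.0 - selector_hit_rate)
    return (vision_steps * vision_cost_low, vision_steps * vision_cost_high)


if __name__ == "__main__":
    for rate in (0.0, 0.8, 1.0):
        low, high = estimate_task_cost(50, rate)
        print(f"hit rate {rate:.0%}: ${low:.2f} to ${high:.2f} per task")
```

With 50 steps and no selector coverage this reproduces the $0.50–$2.50 range; at an 80% selector hit rate only 10 steps reach the model, for roughly $0.10–$0.50 per task.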

