Agents that drive the computer: patterns that work
Updated: 2026-05-15
When Anthropic released computer use in October 2024, many teams tried it for an afternoon, were surprised to see Claude move the mouse and type in a spreadsheet, and then shelved it as a tech curiosity. A year and a half later, with computer use stabilized in Claude 4.5, browser-use established as the standard browser automation library, and OpenAI Operator and Gemini Control covering the same space, agents that drive the computer have become real tools. They are not for everything, but they fit a slice of cases where they replace brittle RPA macros or scrapers that break every week.
Key takeaways
- A well-prompted agent on a known interface succeeds on 70% to 90% of runs; unexpected modals and redesigns drop that rate.
- A ten-minute computer-use flow costs between fifty cents and two euros depending on model and resolution.
- The most expensive anti-pattern: using agents for critical irreversible decisions without human supervision.
- The minimum viable architecture includes a verification layer and an evidence-persistence layer.
- The tipping point for preferring a custom API integration usually sits at stable tasks that run daily or more often.
What has changed since 2024
The material change from the first versions is that models now understand graphical interfaces with enough precision for multi-minute tasks without intervention. Computer use in Claude 4.5 resolves ten-to-fifteen-step flows with reasonable success rates when the interface is standard. Browser-use has matured into a production library with handlers for common failures, session persistence across steps, and structured DOM capture the model can query without repainting the whole screen.
Reliability has improved a lot but isn’t perfect. As soon as unexpected modal alerts, redesigns, or visually dense screens appear, the success rate drops. This forces you to design flows with clear checkpoints, screenshots saved as evidence, and retries that carry explicit context on what failed before.
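The checkpoint-and-retry idea can be sketched in a few lines. This is a minimal illustration, not any library's API: `StepResult`, `CheckpointLog`, and `run_with_retries` are hypothetical names, and `step` stands in for whatever call actually drives the agent.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StepResult:
    ok: bool
    error: Optional[str] = None         # what went wrong, in plain language
    screenshot: Optional[bytes] = None  # evidence captured at the checkpoint


@dataclass
class CheckpointLog:
    evidence: List[StepResult] = field(default_factory=list)


def run_with_retries(
    step: Callable[[str], StepResult],
    base_prompt: str,
    log: CheckpointLog,
    max_attempts: int = 3,
) -> StepResult:
    """Run one checkpointed step, feeding each failure into the next prompt."""
    prompt = base_prompt
    result = StepResult(ok=False, error="not attempted")
    for attempt in range(1, max_attempts + 1):
        result = step(prompt)
        log.evidence.append(result)  # keep every attempt, not just the last
        if result.ok:
            return result
        # Explicit context: tell the agent what failed before it tries again.
        prompt = (
            f"{base_prompt}\n\nAttempt {attempt} failed: {result.error}. "
            "Do not repeat the same action; re-check the screen first."
        )
    return result
```

Keeping failed attempts in the log is deliberate: the screenshots from the attempts that went wrong are usually the ones you need when explaining a failure afterwards.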
Patterns that survive in production
The human-interface scraper is the first pattern with consistent value. It fits legacy enterprise apps without an API, SaaS panels with poor exports, and proprietary systems where the only way to extract data is click-and-copy. An agent with browser-use walks the flow daily and drops the data into a CSV file or a database. Compared with a Selenium scraper with brittle selectors, the agent is more expensive per run but survives minor redesigns better.
Low-volume administrative task automation is the second pattern: filling forms in supplier portals, uploading files to platforms with changing interfaces, booking resources in legacy internal systems. Where an RPA macro needs maintenance every two months, an agent absorbs small variations and keeps working. The limit is volume: at a hundred daily tasks, agent cost skyrockets.
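To see why volume is the limit, here is a rough calculation. It assumes the per-run range quoted in the takeaways (fifty cents to two euros for a ten-minute flow) also applies to these tasks and that the flow runs 30 days a month; both are assumptions, and `monthly_cost_range` is a hypothetical helper.

```python
from typing import Tuple


def monthly_cost_range(
    runs_per_day: int,
    cost_low_eur: float = 0.50,   # assumed lower bound per run
    cost_high_eur: float = 2.00,  # assumed upper bound per run
    days: int = 30,
) -> Tuple[float, float]:
    """Return (low, high) monthly spend in euros for a given daily volume."""
    return (
        runs_per_day * cost_low_eur * days,
        runs_per_day * cost_high_eur * days,
    )


for runs in (1, 10, 100):
    low, high = monthly_cost_range(runs)
    print(f"{runs:>3} runs/day: {low:,.0f} to {high:,.0f} EUR/month")
```

Under these assumptions, one run a day costs 15 to 60 euros a month, while a hundred runs a day costs 1,500 to 6,000 euros a month.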
The QA exploratory-testing assistant is the third pattern. Instead of writing end-to-end tests that break on every DOM change, you give an agent a functional goal and it walks the app verifying that the flow completes. It doesn’t replace stable automated tests, but it covers areas where the team hasn’t gotten around to writing tests or where the interface changes too often.
Anti-patterns you pay dearly for
Three anti-patterns are well enough documented in 2026 that you can avoid them outright:
- High-volume or low-latency tasks. If you need to process thousands of operations in minutes, the agent is too slow and expensive. Build the proper API integration.
- Critical decisions without supervision. Approving payments, changing production config, any irreversible action. Agents succeed almost always, but not 100% of the time, and the 1% of failures on irreversible decisions wrecks the business case. See also enterprise agent governance.
- Pretending the agent replaces human judgment. An agent fills structured forms well; it doesn’t evaluate whether a contract has problematic clauses. Confusing click automation with replacing human reasoning leads to expensive deployments that get abandoned.
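One way to keep irreversible actions under supervision is a gate between what the agent proposes and what gets executed. The sketch below is illustrative only: `ProposedAction`, `execute_with_gate`, and the `ask_human` callback are hypothetical names, and how an action gets classified as irreversible is left to the deployment.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedAction:
    description: str
    irreversible: bool  # e.g. approving a payment, changing production config


def execute_with_gate(
    action: ProposedAction,
    perform: Callable[[ProposedAction], None],
    ask_human: Callable[[ProposedAction], bool],
) -> str:
    """Perform reversible actions directly; route irreversible ones to a human."""
    if action.irreversible and not ask_human(action):
        return "blocked"
    perform(action)
    return "executed"
```

The gate sits outside the agent on purpose: the approval requirement does not depend on the model remembering an instruction in its prompt.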
Typical 2026 architecture
A reasonable computer-use agent deployment has several pieces:
- An orchestration layer that fires the flow on schedule or event.
- An agent layer with packaged context.
- A verification layer that checks the result makes sense.
- An evidence-persistence layer with screenshots and reasoning traces.
Teams that skip verification and evidence layers discover quickly that when something goes wrong they can’t explain why or reproduce the failure.
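The four layers can be wired together roughly as follows. This is a sketch under stated assumptions: `RunRecord`, `run_flow`, and `fake_agent` are hypothetical names, the agent is a stand-in that returns an output plus a reasoning trace, and screenshots are left out to keep the example self-contained.

```python
import json
import tempfile
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Callable, List, Tuple


@dataclass
class RunRecord:
    task: str
    output: str
    trace: List[str] = field(default_factory=list)  # reasoning trace per step
    verified: bool = False
    started_at: float = field(default_factory=time.time)


def run_flow(
    task: str,
    agent: Callable[[str], Tuple[str, List[str]]],
    verify: Callable[[str], bool],
    evidence_dir: Path,
) -> RunRecord:
    output, trace = agent(task)                            # agent layer
    record = RunRecord(task=task, output=output, trace=trace)
    record.verified = verify(output)                       # verification layer
    evidence_dir.mkdir(parents=True, exist_ok=True)
    path = evidence_dir / f"run-{int(record.started_at * 1000)}.json"
    path.write_text(json.dumps(asdict(record), indent=2))  # evidence layer
    return record


# Orchestration layer: in practice a scheduler or event handler calls run_flow.
# The agent and verifier below are stand-ins so the sketch runs on its own.
def fake_agent(task: str) -> Tuple[str, List[str]]:
    return "report.csv", ["opened panel", "clicked Export", "saved file"]


with tempfile.TemporaryDirectory() as tmp:
    record = run_flow(
        "Download monthly report",
        fake_agent,
        verify=lambda out: out.endswith(".csv"),
        evidence_dir=Path(tmp) / "evidence",
    )
    saved = [p.name for p in (Path(tmp) / "evidence").glob("run-*.json")]
```

The record is written whether or not verification passes, so a failed run leaves the same evidence behind as a successful one.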
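# Typical browser-use 2026 pattern
# Sketch only: constructor and callback arguments vary across browser-use versions.
# log_step and verify_and_persist are your own helpers, not part of the library.
import asyncio
from browser_use import Agent, ChatAnthropic

async def main():
    agent = Agent(
        task="Download monthly report from supplier panel",
        llm=ChatAnthropic(model="claude-sonnet-4-5"),
        register_new_step_callback=log_step,  # per-step logging hook
    )
    history = await agent.run(max_steps=15)  # run() is a coroutine
    verify_and_persist(history)  # check the result, store screenshots and traces

asyncio.run(main())

In production you add retries on failed steps, channel notifications when the agent asks for human help, and a maximum budget per run. This layered architecture mirrors what is documented in lessons from agents in production in 2025.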
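The budget cap and the human-help notification can be sketched independently of any agent library. Everything here is hypothetical: `RunBudget`, `run_steps`, and the convention that each step reports its own cost and whether it needs a human are illustration choices, not browser-use features.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when a run spends more than its cap."""


@dataclass
class RunBudget:
    max_eur: float
    spent_eur: float = 0.0

    def charge(self, eur: float) -> None:
        self.spent_eur += eur
        if self.spent_eur > self.max_eur:
            raise BudgetExceeded(
                f"spent {self.spent_eur:.2f} EUR, cap is {self.max_eur:.2f} EUR"
            )


# A step returns (cost in EUR, whether the agent asked for human help).
Step = Callable[[], Tuple[float, bool]]


def run_steps(
    steps: Iterable[Step],
    budget: RunBudget,
    notify: Callable[[str], None],
) -> str:
    """Run steps in order, stopping on budget overrun or a request for help."""
    for index, step in enumerate(steps, start=1):
        cost, needs_human = step()
        try:
            budget.charge(cost)
        except BudgetExceeded as exc:
            notify(f"Run stopped at step {index}: {exc}")
            return "budget_exceeded"
        if needs_human:
            notify(f"Step {index} needs a human decision")
            return "waiting_for_human"
    return "completed"
```

In a real deployment `notify` would post to whatever channel the team watches, and the cost per step would come from the token usage the model API reports.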
Real cost and when it pays off
The arithmetic is fairly simple. If a human spends two hours weekly on a repetitive interface task, annual cost is around four thousand euros at mid-range Spanish salaries. If the agent costs twenty euros a month in tokens and three hours of initial development, payoff is in months, not years. If the task only takes half an hour weekly, automating it probably isn’t worth it unless it’s prone to expensive errors.
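The same arithmetic as a small function, so you can plug in your own numbers. The hourly cost is derived from the figures above (four thousand euros a year for two hours a week, about 38 euros an hour); the developer rate of 60 euros an hour is purely an assumption for illustration, and `payback_months` is a hypothetical helper.

```python
from typing import Optional


def payback_months(
    hours_per_week: float,
    hourly_cost_eur: float,
    agent_eur_per_month: float,
    dev_hours: float,
    dev_hourly_eur: float,
) -> Optional[float]:
    """Months until the build cost is recovered, or None if it never is."""
    human_monthly = hours_per_week * 52 / 12 * hourly_cost_eur
    monthly_saving = human_monthly - agent_eur_per_month
    if monthly_saving <= 0:
        return None  # the agent costs more than the human time it replaces
    return dev_hours * dev_hourly_eur / monthly_saving


HOURLY_COST = 4000 / (2 * 52)  # implied by the article's annual figure

two_hours = payback_months(2, HOURLY_COST, 20, dev_hours=3, dev_hourly_eur=60)
half_hour = payback_months(0.5, HOURLY_COST, 20, dev_hours=3, dev_hourly_eur=60)
print(f"2 h/week:   {two_hours:.1f} months")
print(f"0.5 h/week: {half_hour:.1f} months")
```

The sketch ignores ongoing maintenance and the time spent reviewing failed runs, both of which lengthen the real payback, most noticeably for the half-hour task.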
The tipping point where a custom API integration beats the agent usually sits at stable tasks that run daily or more often. If the system has an API, even a private one, and the team can spend a week building against it, the integration is cheaper to operate and more reliable.
My reading
In 2026, agents that drive the computer have found their pragmatic niche: low-to-medium volume tasks on systems without APIs, exploratory test automation, and brittle scraper replacement.
The decision to adopt an agent looks like any tool decision: measure current cost of the problem, cost of building a proper solution, and cost of maintaining it. What makes no sense is either ignoring the tool out of 2024-hype prejudice or deploying it everywhere because you can. The middle path is where the real value sits.