/* BeakVision — computer use API and screen understanding for AI agents */

Computer use vision
for your agents.

< Turn screenshots into grounded UI elements, exact coordinates,
and next-step actions for browser, desktop, and mobile agents. >

browser_agents Teams building computer-use and browser automation agents that act from screenshots.
mobile_automation Mobile QA, testing, and agent products that need reliable tap targets and next actions.
qa_and_rpa Automation stacks that need screenshot-to-coordinate fallbacks when DOM hooks are missing.
assistive_software Accessibility and operator tooling that needs fast UI grounding on real interfaces.
button input toggle dropdown nav_item search_bar
button·link·input·toggle·dropdown·nav_item·search_bar·checkbox·radio·tab·menu_item·slider·icon·image·text·other
// How_it_works

Screenshot in.
Coordinates and UI grounding out.

One API call transforms any screen into a machine-readable map for computer use, screen understanding, and screenshot-based automation.

parse.ts
const result = await fetch('/v1/parse', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer your_api_key',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    image: base64Screenshot,
    goal: "tap the submit button",
    mode: "mobile",
  }),
});

// → action.point: { x: 344, y: 192 }
// → action.thought: "Found submit button..."
step_01

Send a screenshot

Base64-encode any screen — desktop, mobile, web app — and POST to a single endpoint. No SDKs, no setup.
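As a minimal sketch of this step in Node.js, base64-encoding is a single Buffer call; the file path is illustrative:

```typescript
import { readFileSync } from "node:fs";

// Read a screenshot file and base64-encode it for the `image` field.
// Node Buffers encode straight to base64; no extra dependencies needed.
export function toBase64Image(path: string): string {
  return readFileSync(path).toString("base64");
}

// The same encoding works for an in-memory buffer (e.g. from a capture tool).
export function bufferToBase64(buf: Buffer): string {
  return buf.toString("base64");
}
```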

step_02

Vision AI parses it

BeakVision identifies visible UI elements, reasons about the goal, and performs UI grounding for the exact target on screen.

step_03

Agent gets coordinates

You receive structured UI data and a precise action point your agent can execute immediately. No post-processing, no guesswork.
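To illustrate how an agent might consume the result, here is a hedged sketch of an action dispatcher. The `Action` shape mirrors the response schema on this page; the string output stands in for a real driver call (tap, sendKeys, swipe) in your own stack:

```typescript
// Action mirrors the Action Mode response fields described on this page.
type Point = { x: number; y: number };
type Action =
  | { type: "click"; point: Point }
  | { type: "type"; point: Point; text: string }
  | { type: "scroll"; point: Point };

// Map a parsed action to a driver command; the discriminated union makes
// the switch exhaustive, so unhandled action types fail at compile time.
export function describeAction(a: Action): string {
  switch (a.type) {
    case "click":
      return `click at (${a.point.x}, ${a.point.y})`;
    case "type":
      return `type "${a.text}" at (${a.point.x}, ${a.point.y})`;
    case "scroll":
      return `scroll at (${a.point.x}, ${a.point.y})`;
  }
}
```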

// Capabilities

Built for modern agent workflows {

  // Everything your agent needs for computer use, UI grounding, and screenshot automation.

}

// Response_schema

What your agent gets back.

Action Mode response

Mobile and computer-use modes return one next action with reasoning and exact coordinates. Ground mode returns a direct click target for a named on-screen element.

  • action.type (string): click · type · scroll · drag · hotkey …
  • action.point (object): { x, y } pixel coordinates to act on
  • action.thought (string): model reasoning, why this element
  • action.text (string?): text to type (type actions only)
  • meta.processing_time_ms (number): end-to-end latency
  • data.mode (string): the action mode used for this request
response.json
{
  "success": true,
  "data": {
    "action": {
      "type": "click",
      "point": { "x": 344, "y": 192 },
      "thought": "The submit button is visible at the bottom-right of the form. Tapping it will confirm the action."
    },
    "mode": "mobile"
  },
  "meta": { "processing_time_ms": 820, "element_count": 0 }
}
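For typed clients, the schema above can be sketched as TypeScript interfaces with a narrowing guard. Field names follow this page; treat the types as a sketch, not an official SDK:

```typescript
// Types mirroring the Action Mode response documented above.
export interface ActionPoint { x: number; y: number }

export interface ParsedAction {
  type: string;        // click · type · scroll · drag · hotkey …
  point: ActionPoint;  // pixel coordinates to act on
  thought: string;     // model reasoning
  text?: string;       // only present for type actions
}

export interface ParseResponse {
  success: boolean;
  data: { action: ParsedAction; mode: string };
  meta: { processing_time_ms: number; element_count: number };
}

// Runtime guard: checks the fields an agent must have before acting.
export function isParseResponse(v: unknown): v is ParseResponse {
  const r = v as ParseResponse;
  return (
    typeof r === "object" && r !== null &&
    typeof r.success === "boolean" &&
    typeof r.data?.action?.point?.x === "number" &&
    typeof r.data?.action?.point?.y === "number"
  );
}
```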
// Pricing

Pricing for teams shipping
agent products.

Choose a plan, get API access, and start turning screenshots into grounded UI actions in under a minute. Upgrade as your agent traffic grows.

// standard
Pro
$4.99 / month
billed $59.88 / year
  • Includes 2M input tokens each month
  • Includes 2M output tokens each month
  • Effective blended rate: ~$0.0012 / 1K tokens
  • API access
// standard
Metered
Usage-based
  • Flexible billing
  • Input tokens billed at $0.0001 / 1K tokens
  • Output tokens billed at $0.0002 / 1K tokens
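As a worked example of the metered rates above, a small estimator makes the arithmetic explicit (rates copied from this page; actual invoices come from your billing dashboard):

```typescript
// Metered per-1K-token rates from the pricing list above (USD).
const INPUT_RATE_PER_1K = 0.0001;
const OUTPUT_RATE_PER_1K = 0.0002;

// Estimate metered spend for a given token volume.
export function meteredCostUSD(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1000) * INPUT_RATE_PER_1K +
         (outputTokens / 1000) * OUTPUT_RATE_PER_1K;
}
```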
// Use_cases

Who BeakVision is intended for.

BeakVision is for developers and teams building AI agents that must understand a screen when the DOM, accessibility tree, or app internals are unavailable.

Browser and computer-use agents

Use BeakVision as the screen understanding layer for browser agents and desktop agents. Send a screenshot, describe the goal, and get the next action point to click, drag, scroll, or type.

QA and end-to-end automation

Use screenshot-to-coordinate workflows when selectors are brittle, delayed, or unavailable. This is especially useful for visual QA, regression environments, and RPA-like flows across third-party software.
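The selector-first, vision-fallback pattern can be sketched as below. `bySelector` and `byVision` are hypothetical stand-ins for your automation driver and a BeakVision call; only the fallback wiring is shown:

```typescript
// A click strategy resolves to a point, or null if it finds no target.
type ClickFn = () => Promise<{ x: number; y: number } | null>;

// Try the DOM selector first; fall back to screenshot-to-coordinate
// resolution when the selector throws or finds nothing.
export async function clickWithFallback(
  bySelector: ClickFn,
  byVision: ClickFn
): Promise<{ x: number; y: number; via: "selector" | "vision" }> {
  const s = await bySelector().catch(() => null); // brittle selectors may throw
  if (s) return { ...s, via: "selector" };
  const v = await byVision();
  if (!v) throw new Error("no target found by selector or vision");
  return { ...v, via: "vision" };
}
```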

Mobile automation and testing

For mobile apps, BeakVision helps agents identify the next tap target from a screenshot. That makes it useful for test harnesses, agentic demos, and goal-driven mobile automation.

Accessibility and operator tooling

Teams building assistive software, human-in-the-loop tools, or support overlays can use UI grounding to locate controls precisely on real screens without custom instrumentation.

// Playground — try_it_free()

Try BeakVision.

Upload a screenshot. Add a goal to get a precise action point, or leave it blank to map every element on screen.

playground.ts — sign in to run, no subscription required
Goal → next mobile action
Desktop: drag, hotkey, dbl-click
Name an element → find it
Drop an image here or click to select PNG, JPEG — max 4MB
Reasoning
// API_Reference

Simple to integrate.

One endpoint. Three modes. Authenticate with your API key.

mode:"mobile" — mobile task → Thought + next action with exact coordinates.

mode:"computer" — desktop task → adds drag, hotkey, left_double.

mode:"ground" — visible element name → direct click coordinates only. No task planning.
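The three modes share one request shape, differing only in what `goal` means. A minimal body builder, following the field names from the parse.ts example above (confirm exact parameters against the API reference):

```typescript
type Mode = "mobile" | "computer" | "ground";

// Build the JSON body for /v1/parse. In "ground" mode, `goal` names a
// visible element rather than describing a task.
export function buildParseBody(image: string, mode: Mode, goal: string): string {
  return JSON.stringify({ image, goal, mode });
}
```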

// FAQ

Questions teams ask before they ship.

These are the common questions behind searches for computer use APIs, UI grounding, and screenshot-based agent automation.

What makes BeakVision different from a generic vision model?

BeakVision is tuned around actionability. Instead of only describing a screenshot, it returns structured UI information, grounded coordinates, and next-step actions that an agent can execute.

Is this a computer use API?

Yes. mode:"computer" is designed for desktop and browser agent workflows where the agent must infer what to do next on a screen and where to do it.

Does it support UI grounding?

Yes. mode:"ground" lets you name a visible element and receive exact grounded coordinates for that target, which is useful for direct clicks and assistive overlays.

Who should use it?

BeakVision is intended for AI agent developers, automation teams, QA engineers, and product builders who need screenshot understanding when direct UI hooks are unreliable or unavailable.

Ready to ship
with BeakVision?

Choose a plan in Polar, then use your API key from the dashboard to call the parsing API.

View Plans →
Open Dashboard
Try Free Playground
/* Community */

Build with us.

Join the BeakVision community for product updates, support, and shared agent workflows.

join_community()