/* BeakVision — GUI vision API and screen understanding for AI agents */

GUI vision
for your agents.

< Turn screenshots into grounded UI elements, exact coordinates, and next-step actions for browser, desktop, and mobile agents. >

button input toggle dropdown nav_item search_bar
button·link·input·toggle·dropdown·nav_item·search_bar·checkbox·radio·tab·menu_item·slider·icon·image·text·other
// How_it_works

Screenshot in.
Coordinates and UI grounding out.

One API call transforms any screen into a machine-readable map for GUI vision, screen understanding, and screenshot-based automation.

parse.ts
const res = await fetch('/v1/parse', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer your_api_key',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    image: base64Screenshot,
    goal: "tap the submit button",
    mode: "mobile",
  }),
});
const { data } = await res.json();

// → data.action.point: { x: 344, y: 192 }
// → data.action.thought: "Found submit button..."
step_01

Send a screenshot

Base64-encode any screen — desktop, mobile, web app — and POST to a single endpoint. No SDKs, no setup.
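The encoding step can be sketched in a few lines of Node (18+, with built-in `fetch` and `Buffer`). `buildParseBody` is a hypothetical helper; the endpoint path and field names follow the parse.ts example above, and the host and API key are placeholders.

```typescript
// Sketch: build the /v1/parse JSON body from raw screenshot bytes.
import { Buffer } from 'node:buffer';

function buildParseBody(
  png: Uint8Array,
  goal: string,
  mode: 'mobile' | 'computer' | 'ground',
): string {
  const image = Buffer.from(png).toString('base64');
  return JSON.stringify({ image, goal, mode });
}

// Usage (uncomment to run against the API):
// import { readFileSync } from 'node:fs';
// const res = await fetch('/v1/parse', {
//   method: 'POST',
//   headers: { Authorization: 'Bearer your_api_key', 'Content-Type': 'application/json' },
//   body: buildParseBody(readFileSync('screenshot.png'), 'tap the submit button', 'mobile'),
// });
```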

step_02

Vision AI parses it

BeakVision identifies visible UI elements, reasons about the goal, and performs UI grounding for the exact target on screen.

step_03

Agent gets coordinates

You receive structured UI data and a precise action point your agent can execute immediately. No post-processing, no guesswork.
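The handoff in step 03 can be sketched as a small dispatcher. The `Action` shape mirrors the Action Mode fields documented on this page; `Driver` is a hypothetical adapter you implement against your own stack (Playwright, ADB, a desktop driver).

```typescript
// Sketch: dispatch a returned action to your automation layer.
interface Point { x: number; y: number }

interface Action {
  type: 'click' | 'type' | 'scroll' | 'drag' | 'hotkey';
  point?: Point;
  text?: string;
  thought?: string;
}

// Hypothetical adapter — implement for your automation stack.
interface Driver {
  click(p: Point): Promise<void>;
  type(text: string): Promise<void>;
  scroll(p: Point): Promise<void>;
}

async function execute(driver: Driver, action: Action): Promise<void> {
  switch (action.type) {
    case 'click':
      if (!action.point) throw new Error('click without point');
      await driver.click(action.point);
      break;
    case 'type':
      await driver.type(action.text ?? '');
      break;
    case 'scroll':
      if (!action.point) throw new Error('scroll without point');
      await driver.scroll(action.point);
      break;
    default:
      throw new Error(`unhandled action type: ${action.type}`);
  }
}
```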

// Capabilities

Built for modern agent workflows {

  // Everything your agent needs for GUI vision, UI grounding, and screenshot automation.

}

// Response_schema

What your agent gets back.

Action Mode response

Mobile and GUI vision modes return one next action with reasoning and exact coordinates. Grounding returns a direct click target for a named on-screen element.

  • action.type — string — click · type · scroll · drag · hotkey …
  • action.point — object — { x, y } pixel coordinates to act on
  • action.thought — string — Model reasoning: why this element
  • action.text — string? — Text to type (type actions only)
  • meta.processing_time_ms — number — End-to-end latency
  • data.mode — string — The action mode used for this request
response.json
{
  "success": true,
  "data": {
    "action": {
      "type": "click",
      "point": { "x": 344, "y": 192 },
      "thought": "The submit button is visible at the bottom-right of the form. Tapping it will confirm the action."
    },
    "mode": "mobile"
  },
  "meta": { "processing_time_ms": 820, "element_count": 0 }
}
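For TypeScript agents, the fields above can be captured in a hand-written type. This is a sketch derived from the schema on this page, not an official SDK type.

```typescript
// Sketch of the Action Mode response shape (hand-written, unofficial).
interface ParseResponse {
  success: boolean;
  data: {
    action: {
      type: 'click' | 'type' | 'scroll' | 'drag' | 'hotkey';
      point: { x: number; y: number };
      thought: string;
      text?: string; // present on type actions only
    };
    mode: 'mobile' | 'computer' | 'ground';
  };
  meta: {
    processing_time_ms: number;
    element_count: number;
  };
}
```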
// Pricing

Pricing for teams shipping
agent products.

Choose a plan, get API access, and start turning screenshots into grounded UI actions in under a minute. Upgrade as your agent traffic grows.

// standard
Pro
$4.99 / month
billed $59.88 / year
  • Includes 2M input tokens each month
  • Includes 2M output tokens each month
  • Effective blended rate: ≈ $0.0012 / 1K tokens ($4.99 across 4M included tokens)
  • API access
// standard
Metered
Usage-based
  • API access
  • Flexible billing
  • Input tokens billed at $0.0001 / 1K tokens
  • Output tokens billed at $0.0002 / 1K tokens
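As a back-of-envelope check on the metered rates above (the per-request token counts used below are illustrative assumptions, since screenshots vary in size):

```typescript
// Metered cost sketch using the listed per-1K rates; assumes simple
// linear billing with no minimums or discounts.
const INPUT_USD_PER_1K = 0.0001;
const OUTPUT_USD_PER_1K = 0.0002;

function meteredCostUSD(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1000) * INPUT_USD_PER_1K
       + (outputTokens / 1000) * OUTPUT_USD_PER_1K;
}

// e.g. a request with 1,500 input tokens and 100 output tokens
// (illustrative counts) costs 0.00015 + 0.00002 = 0.00017 USD.
```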
// Use_cases

Who BeakVision is intended for.

BeakVision is for developers and teams building AI agents that must understand a screen when the DOM, accessibility tree, or app internals are unavailable.

browser_agents.ts

Browser and GUI vision agents

Use BeakVision as the screen understanding layer for browser agents and desktop agents. Send a screenshot, describe the goal, and get the next action point to click, drag, scroll, or type.
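That screenshot → parse → act loop can be sketched as below. `capture` and `act` are hypothetical hooks into your browser stack; the request shape follows the parse.ts example on this page, and the fixed step budget is an assumption (each API call returns one next action).

```typescript
// Screenshot → parse → act loop (sketch).
type AgentAction = { type: string; point?: { x: number; y: number }; text?: string; thought?: string };

async function runAgent(
  goal: string,
  capture: () => Promise<string>,          // your hook: returns a base64 screenshot
  act: (a: AgentAction) => Promise<void>,  // your hook: clicks/types via your driver
  maxSteps = 10,                           // step budget — assumption, tune per task
): Promise<AgentAction[]> {
  const history: AgentAction[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const res = await fetch('/v1/parse', {
      method: 'POST',
      headers: { Authorization: 'Bearer your_api_key', 'Content-Type': 'application/json' },
      body: JSON.stringify({ image: await capture(), goal, mode: 'computer' }),
    });
    const { data } = await res.json();
    history.push(data.action);
    await act(data.action);
  }
  return history;
}
```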

qa_automation.ts

QA and end-to-end automation

Use screenshot-to-coordinate workflows when selectors are brittle, delayed, or unavailable. This is especially useful for visual QA, regression environments, and RPA-like flows across third-party software.

mobile_testing.ts

Mobile automation and testing

For mobile apps, BeakVision helps agents identify the next tap target from a screenshot. That makes it useful for test harnesses, agentic demos, and goal-driven mobile automation.
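For an Android test harness, the returned point can be fed straight to adb's `input tap`. A sketch, assuming the screenshot was captured at the device's native resolution so coordinates map 1:1:

```typescript
// Turn an action point into `adb shell input tap X Y` (sketch).
import { execFile } from 'node:child_process';

function tapArgs(x: number, y: number): string[] {
  // adb expects integer pixel coordinates
  return ['shell', 'input', 'tap', String(Math.round(x)), String(Math.round(y))];
}

function tap(x: number, y: number): void {
  execFile('adb', tapArgs(x, y), (err) => {
    if (err) throw err;
  });
}

// Usage: tap(action.point.x, action.point.y);
```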

assistive_tools.ts

Accessibility and operator tooling

Teams building assistive software, human-in-the-loop tools, or support overlays can use UI grounding to locate controls precisely on real screens without custom instrumentation.

// Playground — try_it_free()

Try BeakVision.

Upload a screenshot. Add a goal to get a precise action point, or leave it blank to map every element on screen.

playground.ts — sign in to run, no subscription required
Goal → next mobile action
Desktop: drag, hotkey, dbl-click
Name an element → find it
Drop an image here or click to select — PNG or JPEG, max 4MB
Reasoning
// API_Reference

Simple to integrate.

One endpoint. Three modes. Authenticate with your API key.

mode:"mobile" — mobile task → Thought + next action with exact coordinates.

mode:"computer" — GUI vision task → adds drag, hotkey, left_double.

mode:"ground" — visible element name → direct click coordinates only. No task planning.
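All three modes hit the same endpoint and differ only in the `mode` field. A sketch (`parse` is a hypothetical wrapper; for `mode:"ground"` this assumes the element name travels in the same `goal` field used by the examples above):

```typescript
// One wrapper, three modes (sketch).
async function parse(body: Record<string, unknown>): Promise<any> {
  const res = await fetch('/v1/parse', {
    method: 'POST',
    headers: { Authorization: 'Bearer your_api_key', 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  return res.json();
}

// Mobile task → thought + next action:
//   await parse({ image, goal: 'tap the submit button', mode: 'mobile' });
// Desktop/browser task → adds drag, hotkey, left_double:
//   await parse({ image, goal: 'open the settings menu', mode: 'computer' });
// Named element → direct click coordinates only:
//   await parse({ image, goal: 'search bar', mode: 'ground' });
```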

// FAQ

Questions teams ask before they ship.

These are the common questions behind searches for GUI vision APIs, UI grounding, and screenshot-based agent automation.

What makes BeakVision different from a generic vision model?

BeakVision is tuned around actionability. Instead of only describing a screenshot, it returns structured UI information, grounded coordinates, and next-step actions that an agent can execute.

Is this a GUI vision API?

Yes. mode:"computer" is designed for desktop and browser agent workflows where the agent must infer what to do next on a screen and where to do it.

Does it support UI grounding?

Yes. mode:"ground" lets you name a visible element and receive exact grounded coordinates for that target, which is useful for direct clicks and assistive overlays.

Who should use it?

BeakVision is intended for AI agent developers, automation teams, QA engineers, and product builders who need screenshot understanding when direct UI hooks are unreliable or unavailable.

Ready to ship
with BeakVision?

Choose a plan in Polar, then use your API key from the dashboard to call the parsing API.

View Plans → Open Dashboard Try Free Playground
/* Community */

Build with us.

Join the BeakVision community for product updates, support, and shared agent workflows.

join_community()