Turn screenshots into grounded UI elements, exact coordinates, and next-step actions for browser, desktop, and mobile agents.
One API call transforms any screen into a machine-readable map for computer use, screen understanding, and screenshot-based automation.
Base64-encode any screen — desktop, mobile, web app — and POST to a single endpoint. No SDKs, no setup.
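A minimal sketch of that call in Python. The endpoint URL, auth header, and request field names here are assumptions for illustration; check your dashboard docs for the real ones.

```python
import base64
import requests  # plain HTTP client; no SDK needed

API_URL = "https://api.beakvision.example/v1/parse"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"                             # from your dashboard

# Base64-encode any screenshot: desktop, mobile, or web app.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},   # assumed auth scheme
    json={
        "image": image_b64,
        "mode": "computer",                # or "mobile" / "ground"
        "goal": "Open the settings menu",  # what the agent is trying to do
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```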
BeakVision identifies visible UI elements, reasons about the goal, and performs UI grounding for the exact target on screen.
You receive structured UI data and a precise action point your agent can execute immediately. No post-processing, no guesswork.
Mobile and computer-use modes return one next action, with a thought field explaining the reasoning and exact coordinates to execute it. Grounding skips task planning: name a visible on-screen element and you get a direct click target, coordinates only.
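Illustrative response shapes for the two behaviors. The exact field names are assumptions, not the documented schema.

```python
# mode:"mobile" / mode:"computer": one next action with reasoning (illustrative)
action_response = {
    "thought": "The Save button sits in the top-right toolbar.",
    "action": "left_click",
    "point": {"x": 1184, "y": 52},
}

# mode:"ground": click target for a named element, coordinates only (illustrative)
ground_response = {
    "element": "Save button",
    "point": {"x": 1184, "y": 52},
}
```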
Choose a plan, get API access, and start turning screenshots into grounded UI actions in under a minute. Upgrade as your agent traffic grows.
BeakVision is for developers and teams building AI agents that must understand a screen when the DOM, accessibility tree, or app internals are unavailable.
Use BeakVision as the screen understanding layer for browser agents and desktop agents. Send a screenshot, describe the goal, and get the next action point to click, drag, scroll, or type.
Use screenshot-to-coordinate workflows when selectors are brittle, delayed, or unavailable. This is especially useful for visual QA, regression environments, and RPA-like flows across third-party software.
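One way that fallback can look with Playwright, sketched under the same hypothetical endpoint and field names as above; beak_ground is a made-up helper, not a shipped function.

```python
import base64
import requests

API_URL = "https://api.beakvision.example/v1/parse"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

def beak_ground(png_bytes: bytes, element: str) -> tuple[int, int]:
    """Hypothetical wrapper: name a visible element, get click coordinates."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "image": base64.b64encode(png_bytes).decode("utf-8"),
            "mode": "ground",
            "element": element,  # assumed field name
        },
        timeout=30,
    )
    resp.raise_for_status()
    point = resp.json()["point"]  # illustrative response field
    return point["x"], point["y"]

def click_submit(page):  # page: a Playwright Page
    try:
        page.click("button#submit", timeout=2000)  # fast path: DOM selector
    except Exception:
        shot = page.screenshot()                   # selector brittle or missing
        x, y = beak_ground(shot, "Submit button")  # visual fallback
        page.mouse.click(x, y)
```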
For mobile apps, BeakVision helps agents identify the next tap target from a screenshot. That makes it useful for test harnesses, agentic demos, and goal-driven mobile automation.
Teams building assistive software, human-in-the-loop tools, or support overlays can use UI grounding to locate controls precisely on real screens without custom instrumentation.
Upload a screenshot. Add a goal to get a precise action point, or leave it blank to map every element on screen.
One endpoint. Three modes. Authenticate with your API key.
mode:"mobile" — mobile task → Thought + next action with exact coordinates.
mode:"computer" — desktop task → adds drag, hotkey, left_double.
mode:"ground" — visible element name → direct click coordinates only. No task planning.
These are the common questions behind searches for computer-use APIs, UI grounding, and screenshot-based agent automation.
BeakVision is tuned around actionability. Instead of only describing a screenshot, it returns structured UI information, grounded coordinates, and next-step actions that an agent can execute.
Yes. mode:"computer" is designed for desktop and browser agent workflows where the agent must infer what to do next on a screen and where to do it.
Yes. mode:"ground" lets you name a visible element and receive exact grounded coordinates for that target, which is useful for direct clicks and assistive overlays.
BeakVision is intended for AI agent developers, automation teams, QA engineers, and product builders who need screenshot understanding when direct UI hooks are unreliable or unavailable.
Choose a plan in Polar, then use your API key from the dashboard to call the parsing API.
Join the BeakVision community for product updates, support, and shared agent workflows.