Earlier this year, we announced that we're bringing computer use capabilities to developers via the Gemini API. Today, we're releasing the Gemini 2.5 Computer Use model, our new specialized model built on Gemini 2.5 Pro's visual understanding and reasoning capabilities that powers agents capable of interacting with user interfaces (UIs). It outperforms leading alternatives on multiple web and mobile control benchmarks, all with lower latency. Developers can access these capabilities via the Gemini API in Google AI Studio and Vertex AI.
While AI models can interface with software through structured APIs, many digital tasks still require direct interaction with graphical user interfaces, for example, filling and submitting forms. To complete these tasks, agents must navigate web pages and applications just as humans do: by clicking, typing and scrolling. The ability to natively fill out forms, manipulate interactive elements like dropdowns and filters, and operate behind logins is a crucial next step in building powerful, general-purpose agents.
How it works
The model's core capabilities are exposed through the new `computer_use` tool in the Gemini API, and the tool should be operated within a loop. Inputs to the tool are the user request, a screenshot of the environment, and a history of recent actions. The input can also specify whether to exclude functions from the full list of supported UI actions, or specify additional custom functions to include.
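The loop described above can be sketched in Python. This is a minimal illustration of the request → screenshot → action cycle, not the actual Gemini API surface: the helper names (`take_screenshot`, `execute_action`, `call_computer_use_model`) and the action dictionary shape are hypothetical placeholders you would wire up to the real API and your browser or device environment.

```python
# Hypothetical sketch of the computer-use agent loop. All callables passed
# in here are placeholders, not real Gemini API functions.

def run_agent_loop(user_request, take_screenshot, execute_action,
                   call_computer_use_model, max_steps=20):
    """Drive the computer_use tool in a loop.

    Each turn sends the user request, the current screenshot, and the
    history of recent actions to the model; the returned UI action
    (click, type, scroll, ...) is executed in the environment, and the
    loop repeats until the model signals that the task is done.
    """
    history = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        action = call_computer_use_model(
            request=user_request,
            screenshot=screenshot,
            history=history[-5:],  # only recent actions, as described above
        )
        if action["type"] == "done":
            return action.get("result")
        execute_action(action)  # perform the UI action in the environment
        history.append(action)
    raise RuntimeError("agent did not finish within max_steps")
```

In a real integration, `execute_action` would dispatch on the action type to a browser-automation or device-control layer, and the result of each step would be reflected in the next screenshot the model sees.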








