Introducing the Gemini 2.5 Computer Use model

Earlier this year, we mentioned that we’re bringing computer use capabilities to developers via the Gemini API. Today, we’re releasing the Gemini 2.5 Computer Use model, our new specialized model built on Gemini 2.5 Pro’s visual understanding and reasoning capabilities that powers agents capable of interacting with user interfaces (UIs). It outperforms leading alternatives on multiple web and mobile control benchmarks, all with lower latency. Developers can access these capabilities via the Gemini API in Google AI Studio and Vertex AI.

While AI models can interface with software through structured APIs, many digital tasks still require direct interaction with graphical user interfaces, for example, filling and submitting forms. To complete these tasks, agents must navigate web pages and applications just as humans do: by clicking, typing and scrolling. The ability to natively fill out forms, manipulate interactive elements like dropdowns and filters, and operate behind logins is a crucial next step in building powerful, general-purpose agents.

How it works

The model’s core capabilities are exposed through the new `computer_use` tool in the Gemini API and should be operated within a loop. Inputs to the tool are the user request, a screenshot of the environment, and a history of recent actions. The input can also specify whether to exclude functions from the full list of supported UI actions or specify additional custom functions to include.
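
To make the shape of that loop concrete, here is a minimal Python sketch. It does not call the Gemini API: `run_agent`, `Request`, `Action`, `propose_action`, `take_screenshot` and `execute` are all hypothetical names invented for illustration, and the model is replaced by a scripted stub. The stopping rule (the model returning no action) and the `max_steps` cap are also assumptions, since the description above only says that the tool runs in a loop and lists its inputs.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Sequence


@dataclass
class Action:
    """A UI action proposed by the model (hypothetical shape)."""
    name: str
    args: dict = field(default_factory=dict)


@dataclass
class Request:
    """The inputs listed above for one turn of the loop."""
    user_request: str
    screenshot: bytes
    history: list
    excluded_functions: list
    custom_functions: list


def run_agent(
    user_request: str,
    propose_action: Callable[[Request], Optional[Action]],
    take_screenshot: Callable[[], bytes],
    execute: Callable[[Action], None],
    excluded_functions: Sequence[str] = (),
    custom_functions: Sequence[str] = (),
    max_steps: int = 10,
) -> list:
    """Run the request -> action -> execute loop and return the action history."""
    history = []
    for _ in range(max_steps):  # assumption: cap the loop so it always ends
        request = Request(
            user_request=user_request,
            screenshot=take_screenshot(),  # fresh view of the environment
            history=list(history),         # copy, so the model sees a snapshot
            excluded_functions=list(excluded_functions),
            custom_functions=list(custom_functions),
        )
        action = propose_action(request)
        if action is None:  # assumption: no action means the task is finished
            break
        execute(action)
        history.append(action)
    return history


def demo():
    """Drive the loop with a scripted stand-in for the model."""
    # Placeholder action names, not the tool's real function list.
    scripted = [
        Action("click", {"x": 10, "y": 20}),
        Action("type", {"text": "hello"}),
    ]
    executed = []

    def fake_model(request: Request) -> Optional[Action]:
        step = len(request.history)
        return scripted[step] if step < len(scripted) else None

    history = run_agent(
        "Fill in the form",
        fake_model,
        take_screenshot=lambda: b"fake-png-bytes",
        execute=executed.append,
    )
    return history, executed
```

Calling `demo()` runs two scripted steps and then stops when the stub returns `None`. In a real integration, `propose_action` would wrap a call to the `computer_use` tool and `execute` would drive a browser or device; both are left abstract here because their actual interfaces are not described in this section.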