It was the dawn of the AI age when, all of a sudden, new capabilities emerged from computer code: the same 0s and 1s that were once used to make drawings with Turtle graphics (if you are old enough to have used it). Now we can control phones just by talking to them!
The open-source tool can be found here.
Closed-source models are expensive, restrictive, and not privacy-friendly, so I decided to build on local models while still letting people use well-known closed-source models if they prefer them.
In the video above, I asked the tool to search for bus stops at a certain location, and it successfully found them. You can run the tool from the command line as well as through the web interface created with Gradio.
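To illustrate how a single tool can expose both a command-line entry point and a Gradio web interface, here is a minimal sketch. The function and flag names (`run_instruction`, `--web`) are my own assumptions for illustration, not the project's actual API:

```python
import argparse

def run_instruction(instruction: str) -> str:
    """Placeholder for the real pipeline (hypothetical name)."""
    return f"executed: {instruction}"

def main(argv=None):
    # Hypothetical entry point: one flag switches between CLI and web UI.
    parser = argparse.ArgumentParser(description="phone automation tool (sketch)")
    parser.add_argument("instruction", nargs="?", help="natural-language command")
    parser.add_argument("--web", action="store_true", help="launch the Gradio UI")
    args = parser.parse_args(argv)

    if args.web:
        # Imported lazily so the plain CLI path works without Gradio installed.
        import gradio as gr
        gr.Interface(fn=run_instruction, inputs="text", outputs="text").launch()
    else:
        print(run_instruction(args.instruction))
```

You would invoke it as `python tool.py "find bus stops"` for the CLI, or `python tool.py --web` for the browser interface.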
Above, I asked the tool to start a 3+2 chess game on lichess. It successfully opened the lichess app and then clicked on the 3+2 game.
The architecture is divided into three main modules:

- Planner: creates a plan of action
- Finder: finds the UI bounds of elements
- Executor: scrolls, clicks, navigates, etc.
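The three modules above can be sketched as a simple pipeline. The class and method names below are assumptions for illustration (the real Planner and Finder would call language/vision models rather than return stubs):

```python
from dataclasses import dataclass

@dataclass
class Bounds:
    # Screen-space rectangle of a UI element.
    x: int
    y: int
    width: int
    height: int

class Planner:
    """Turns a natural-language goal into an ordered list of steps."""
    def plan(self, goal: str) -> list[str]:
        # A real Planner would call an LLM; this stub returns fixed steps.
        return [f"open app for: {goal}", f"tap element for: {goal}"]

class Finder:
    """Locates the on-screen bounds of a UI element described in text."""
    def locate(self, description: str) -> Bounds:
        # A real Finder would query a vision-language model on a screenshot.
        return Bounds(x=100, y=200, width=80, height=40)

class Executor:
    """Performs gestures (tap, scroll, navigate) at given coordinates."""
    def tap(self, b: Bounds) -> str:
        cx, cy = b.x + b.width // 2, b.y + b.height // 2
        return f"tap({cx},{cy})"

def run(goal: str) -> list[str]:
    planner, finder, executor = Planner(), Finder(), Executor()
    actions = []
    for step in planner.plan(goal):
        bounds = finder.locate(step)
        actions.append(executor.tap(bounds))
    return actions
```

Each step flows one way: the Planner decides *what* to do, the Finder decides *where* it is, and the Executor decides *how* to touch the screen.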
There is flexibility to use either a local model (Molmo via mlx-vlm) or a closed-source model for either the Planner or the Finder. So far, the recommendations are as follows:
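This per-module choice could be wired up with a small selection helper. The sketch below is an assumption about how such a configuration might look; the model names are placeholders, not the project's real settings:

```python
def choose_model(role: str, prefer_local: bool = True) -> str:
    """Pick a backend per module (Planner or Finder). Names are illustrative."""
    local = {"planner": "molmo via mlx-vlm", "finder": "molmo via mlx-vlm"}
    closed = {"planner": "closed-source LLM", "finder": "closed-source VLM"}
    table = local if prefer_local else closed
    if role not in table:
        raise ValueError(f"unknown role: {role}")
    return table[role]
```

The point of the indirection is that swapping a local model for a closed-source one (or back) changes a single setting, not the pipeline.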
You can use this to create walkthrough overlays over any app.
You could also automate tasks such as filtering matches on Tinder, auto-swiping in the app based on a feature you tell it to look for.
For now, only the Finder uses structured output. Soon, the Planner will also be driven by an open-source model; I am waiting for one of the open-source models to implement function/tool calling.
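To make the idea of structured output concrete, here is a sketch of validating a Finder-style JSON reply into element bounds. The schema (keys `x`, `y`, `width`, `height`) is an assumption for illustration, not the tool's actual format:

```python
import json

def parse_finder_output(raw: str) -> dict:
    """Parse a model's JSON reply into element bounds, rejecting bad output."""
    data = json.loads(raw)
    required = {"x", "y", "width", "height"}
    missing = required - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not all(isinstance(data[k], int) and data[k] >= 0 for k in required):
        raise ValueError("bounds must be non-negative integers")
    return {k: data[k] for k in required}

# Example reply a vision model might produce (hypothetical values).
reply = '{"x": 12, "y": 340, "width": 96, "height": 48}'
bounds = parse_finder_output(reply)
```

Constraining the model to a schema like this is what lets the Executor trust the coordinates it receives, which is exactly why function/tool calling matters for moving the Planner to an open-source model too.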