I Built a Tool for Mobile and Computer Operator Using Local and Remote LLMs
by @mkagenius

February 4th, 2025

Too Long; Didn't Read

The open-source tool can be found [here](https://github.com/BandarLabs/clickclickclick). If you are a developer, drop a star!

It was the dawn of the AI age when, all of a sudden, new capabilities emerged from computer code, the same 0s and 1s that were once used to create drawings with Turtle, if you are old enough to have used it. Now, we can control phones just by talking to them!

Click3

The open-source tool can be found [here](https://github.com/BandarLabs/clickclickclick). If you are a developer, drop a star!

Claude Computer Use and OpenAI Operator

Both are expensive, restrictive, and not privacy-friendly, so I decided to build on local models while still letting people use the well-known closed-source models if they prefer them.

Demos

In the first demo video, I asked the tool to search for bus stops at a certain location, and it found them. You can run the tool from the command line or through the web interface built with Gradio.

In the second demo, I asked the tool to start a 3+2 chess game on lichess. It opened the lichess app and then clicked on the 3+2 game.

Architecture

The architecture is divided into three main modules: Planner, Finder, and Executor.

Planner: Creates the plan of action.

Finder: Finds the bounds of UI elements.

Executor: Scrolls, clicks, navigates, etc.
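
The way the three modules hand work to each other can be sketched roughly as below. This is a minimal illustration and not the project's actual code: the class names, method signatures, and the hard-coded plan and bounds are all hypothetical stand-ins for what would really be LLM calls and device commands.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One planned action, e.g. click an element or type some text."""
    action: str        # "click" or "type" in this sketch
    target: str = ""   # description of the UI element to act on, if any
    text: str = ""     # text to type, if any


@dataclass
class Bounds:
    """Bounding box of a UI element in screen pixels."""
    left: int
    top: int
    right: int
    bottom: int

    def center(self) -> tuple[int, int]:
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


class Planner:
    """Hypothetical planner: a real one would ask an LLM for the steps."""
    def plan(self, task: str) -> list[Step]:
        return [Step("click", target="search box"), Step("type", text=task)]


class Finder:
    """Hypothetical finder: a real one would ask a vision model for bounds."""
    def find(self, target: str, screenshot: bytes) -> Bounds:
        return Bounds(left=100, top=200, right=300, bottom=260)


class Executor:
    """Hypothetical executor: records actions instead of touching a device."""
    def __init__(self) -> None:
        self.log: list[str] = []

    def click(self, x: int, y: int) -> None:
        self.log.append(f"click({x}, {y})")

    def type_text(self, text: str) -> None:
        self.log.append(f"type({text!r})")


def run(task: str, planner: Planner, finder: Finder, executor: Executor) -> list[str]:
    """Plan the task, locate each click target, and execute the actions."""
    screenshot = b""  # placeholder; a real run would capture the screen
    for step in planner.plan(task):
        if step.action == "click":
            x, y = finder.find(step.target, screenshot).center()
            executor.click(x, y)
        elif step.action == "type":
            executor.type_text(step.text)
    return executor.log


if __name__ == "__main__":
    print(run("bus stops near me", Planner(), Finder(), Executor()))
```

Keeping the Finder behind a single `find` call is what would make it possible to swap a local model for a remote one without touching the Planner or the Executor.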


Either the Planner or the Finder can use a local model (Molmo via mlx-vlm) or a closed-source model. So far, the recommendations are as follows:

Use Cases

You can use this to create walkthrough overlays over any app.


You can also automate tasks such as filtering your matches on Tinder, with the tool auto-swiping in the app based on a feature you tell it to look for.

Conclusion

For now, only the Finder uses structured output. Soon, the Planner will also be driven by an open-source model; I am waiting for one of the open-source models to implement function/tool calling.
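
To illustrate why structured output matters for the Finder, here is a minimal sketch that checks a model reply against a fixed shape before anything gets clicked. The JSON keys (`x1`, `y1`, `x2`, `y2`) and the helper `parse_bounds` are assumptions made for this example, not the project's actual schema.

```python
import json


def parse_bounds(reply: str) -> tuple[int, int, int, int]:
    """Parse a model reply expected to be a JSON object holding element bounds.

    The keys x1, y1, x2, y2 are a hypothetical schema for this sketch.
    Raises ValueError if the reply does not match it.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    try:
        x1, y1, x2, y2 = (int(data[key]) for key in ("x1", "y1", "x2", "y2"))
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"reply is missing or has bad bounds: {exc}") from exc
    if x1 >= x2 or y1 >= y2:
        raise ValueError("bounds must have positive width and height")
    return x1, y1, x2, y2
```

A reply such as `{"x1": 10, "y1": 20, "x2": 110, "y2": 60}` parses into a tuple of four integers, while free-form text or an incomplete object is rejected instead of being turned into a stray click.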


If you are a developer, drop a star!

Tool - https://github.com/BandarLabs/clickclickclick