
SeeAct using GPT-4V: An Efficient Web Navigator

SeeAct is a web assistant that efficiently navigates the browser by understanding instructions and acting on them. Researchers from The Ohio State University presented this agent, which is built on the advanced multimodal language model GPT-4V.

SeeAct is an adaptable, flexible assistant that can perform a variety of tasks by drawing on the capabilities of Large Language Models (LLMs). The model understands both the text and the images on a website.

SeeAct

It performs actions online according to the instructions by understanding the web interface. GPT-4V is a multimodal model that can go through web pages and understand what is happening in the visuals, which makes it well suited to handling a range of tasks on the internet.

In web-assistant tasks, “grounding” is the most challenging and tricky part. Grounding means figuring out the exact area or element on the website that a given instruction should interact with. In simple words, it is the step that turns instructions into actions on a website: clicking the right button or link, or typing in the right place, while following a set of online instructions.
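
To illustrate the idea, here is a toy Python sketch of grounding: matching a natural-language action to one of a page's candidate elements. The element ids and texts are hypothetical, and the string matching is only a stand-in; SeeAct itself asks GPT-4V to make this choice.

```python
# Toy illustration of grounding: pick the page element an instruction
# refers to. Element ids/texts are hypothetical; SeeAct uses GPT-4V for
# this decision rather than simple string matching.

candidate_elements = [
    {"id": "btn-search", "text": "Search"},
    {"id": "link-flight-status", "text": "Flight Status"},
    {"id": "input-flight-number", "text": "Flight number"},
]

def ground(action_description, candidates):
    """Return the first candidate whose visible text appears in the action."""
    for element in candidates:
        if element["text"].lower() in action_description.lower():
            return element
    return None

print(ground('Click on the "Flight Status" link', candidate_elements))
# -> {'id': 'link-flight-status', 'text': 'Flight Status'}
```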

[Image: SeeAct example]

The image above shows how SeeAct works. The model was given the task “Compare iPhone 15 Pro Max with iPhone 13 Pro Max” and performed multiple navigation steps until the desired output was shown.

SeeAct with GPT-4V

With the capabilities of GPT-4V, SeeAct showed extraordinary performance as a web assistant, managing to successfully complete about half of the tasks given across numerous websites. By comparison, the task-completion rates of GPT-4 and FLAN-T5 were 20% and 18% respectively, far below SeeAct's. These numbers show how powerful GPT-4V is for handling tasks on the internet.

GPT-4V has two essential capabilities as a generalist web agent:

  1. Action Generation 
  2. Element Grounding

Let's understand this with the example “Rent a truck with the lowest rate” on a car-rental website.

  1. Action Generation gives step-by-step instructions for the task. GPT-4V generates a clear instruction for each step, like “Click on the ‘Find Your Truck’ button” or “Fill in your details and click ‘Search.’”
  2. Element Grounding figures out exactly which part of the website each instruction should interact with. It identifies the target and says, in effect, “The ‘Find Your Truck’ button is the one you need to click on.”

In short, GPT-4V gives step-by-step instructions for tasks on websites, and it also understands and pinpoints the specific parts of the website those instructions refer to, as sketched below.
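
To make these two stages concrete, here is a minimal Python sketch of one agent step, assuming a hypothetical `ask_gpt4v(prompt, screenshot_path)` helper that sends text plus a screenshot to GPT-4V (one possible implementation appears in the next section). The prompts are illustrative, not the authors' actual ones.

```python
# Sketch of one SeeAct-style step: generate an action, then ground it.
# ask_gpt4v() is a hypothetical helper; prompts are illustrative only.

def ask_gpt4v(prompt, screenshot_path):
    raise NotImplementedError("stand-in for a GPT-4V call; see the sketch below")

def generate_action(task, screenshot_path):
    # Stage 1: action generation -- describe the next step in plain language.
    prompt = f"Task: {task}\nDescribe the single next action to take on this page."
    return ask_gpt4v(prompt, screenshot_path)

def ground_action(action, elements, screenshot_path):
    # Stage 2: element grounding -- choose which candidate element the action targets.
    listing = "\n".join(f"{i}: {text}" for i, text in enumerate(elements))
    prompt = (f"Action: {action}\nCandidate elements:\n{listing}\n"
              "Answer with the number of the element to interact with.")
    return ask_gpt4v(prompt, screenshot_path)
```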

The best grounding technique for SeeAct uses both the text and the images on the website. This strategy is about 30% better than methods that focus on the visuals only. It still doesn't perform perfectly, but it is far better than relying on images alone.
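
As a sketch of how text and a screenshot can be sent to GPT-4V together, here is one possible implementation of the `ask_gpt4v` helper assumed above, using the OpenAI Python client. The model name and prompt handling are assumptions, not details taken from the paper.

```python
# Hedged sketch: one way to send a text prompt plus a page screenshot to
# GPT-4V. The model name is an assumption; the paper does not prescribe this.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4v(prompt, screenshot_path):
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed GPT-4V-capable model name
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```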

It is observed that in-context learning excels on new, unseen websites, whereas supervised fine-tuning has a slight advantage on websites the model has already been trained on. Clearly, each method has its own strengths, depending on whether the website is familiar or completely new to the model.

In SeeAct, it is observed that online evaluation, that is, testing the model while it is live on the internet, gives a better idea of how well it performs, because there are usually many valid ways to finish a task on a website. Offline evaluation on stored data does not capture this flexibility and understates performance. A new online evaluation tool was created using Playwright to test things on live websites.
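
As a minimal sketch of what such an online harness might look like (the authors used Playwright, but the loop and selector below are illustrative assumptions, not their actual tool):

```python
# Minimal live-website evaluation loop with Playwright (sync API).
# The site, selector, and single hard-coded step are illustrative only.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.aa.com")              # site from the example below
    page.screenshot(path="step_0.png")           # screenshot shown to the agent
    # In a real harness: loop { agent proposes action -> execute -> re-screenshot }
    page.get_by_text("Flight status").click()    # one grounded action, assumed selector
    page.screenshot(path="step_1.png")
    browser.close()
```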

Here is a visual example of SeeAct, showing an input task, the sequence of actions, and the desired output.

Given Task: Search for the flight status of AA 3942 leaving on Dec 29.

Action 1: Go to the website: www.aa.com/homePage.do

[Screenshot after Action 1]

Action 2: Navigate to the “Flight Status” section

[Screenshot after Action 2]

Action 3: Switch the search mode to “Flight Number”

[Screenshot after Action 3]

Action 4: Input the flight number “AA 3942”

[Screenshot after Action 4]

Action 5: Click on the date dropdown menu to select “Friday, December 29”

[Screenshot after Action 5]

Action 6: Click on the “Search” button to execute the search.

[Screenshot: final result after Action 6]

The image above is the last snapshot of the example, showing the output and desired result of the whole process: a successful execution of the given task. Even so, there remains a success gap of roughly 20-25%, which needs further improvement.
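
For replay or offline comparison, a trace like the one above could be stored as structured actions. The schema below is an illustrative assumption, not a format defined by the authors.

```python
# The AA.com example above, written as a structured action trace.
# The op/target/value schema is an illustrative assumption.
trace = [
    {"op": "goto",   "target": "www.aa.com/homePage.do"},
    {"op": "click",  "target": "Flight Status section"},
    {"op": "select", "target": "search mode", "value": "Flight Number"},
    {"op": "type",   "target": "flight number field", "value": "AA 3942"},
    {"op": "select", "target": "date dropdown", "value": "Friday, December 29"},
    {"op": "click",  "target": "Search button"},
]
```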

In my opinion, web agents that work from images and screenshots produce better results than web assistants completing tasks on a live website. Models using images and screenshots are a useful and effective innovation for everyday life, especially when a GUI contains very tiny details; such a model helps pull important details out of an image or screenshot.

Wrap Up!

SeeAct is a smart web assistant that understands both the text and the visuals on a website, and it managed to successfully complete about half of the given tasks on live websites. CogAgent performs similar tasks, but the difference is that it works from screenshots of GUIs (websites or smartphones). Even using the best strategy, SeeAct still falls 20-25% short of what would be considered perfect understanding, so further work could be done on the model for a better understanding of websites.
