Agentic Usage with Mobileadapt
This guide walks you through using an LLM to automate interaction with a mobile device.

Prerequisites

Before getting started, make sure you have the following:
- Python installed on your system.
- The mobileadapt package installed.
- An OpenAI API key.
- The app you want to interact with (e.g., the Flexify app) installed on the device.
Install the Necessary Packages

Ensure you have the following Python packages installed, as they are required for the script to run:

poetry add openai loguru Pillow
Set Up Your OpenAI API Key

Replace <your_openai_api_key> in the script with your actual OpenAI API key. This key is required to authenticate your requests to the OpenAI API.

Code Explanation

Imports and Setup
import asyncio
import base64
import json
from typing import Any, Dict
from openai import OpenAI
from mobileadapt import mobileadapt
Explanation:
- asyncio: This module provides support for asynchronous programming. It is used to run tasks concurrently in the script.
- base64: This module allows encoding and decoding of binary data to and from Base64, which is used for transmitting data as text, especially in web contexts.
- json: This module allows for working with JSON data, which is a common format for API communication.
- typing: This module provides support for type hints, helping to specify the expected types for variables and function arguments.
- OpenAI: This is the library from OpenAI that allows interaction with their API.
- mobileadapt: This is the library to interact with mobile devices programmatically.
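To make the base64 step concrete, here is a minimal, self-contained sketch of how raw image bytes become the data URL passed to the API later in this guide (the screenshot bytes here are placeholders, not real JPEG data):

```python
import base64

# Placeholder bytes standing in for the raw JPEG data of a screenshot.
screenshot_bytes = b"\xff\xd8\xff\xe0fake-jpeg-data"

# Encode to Base64 text and build the data URL, mirroring the llm_call() code.
encoded = base64.b64encode(screenshot_bytes).decode("utf-8")
data_url = f"data:image/jpeg;base64,{encoded}"

# The encoding is lossless: decoding recovers the original bytes.
assert base64.b64decode(encoded) == screenshot_bytes
```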
The llm_call Function
This function is a core part of the script. It uses OpenAI's language model to decide what action to take based on the app's current state. It constructs a prompt using the current UI state and a screenshot to inform the model about what it needs to do. Let's explain this function further:
Function Definition and Initialization
def llm_call(html_state: str, image: bytes, nlp_task: str):
    client = OpenAI()
- html_state: str: A string representation of the current HTML state or layout of the app.
- image: bytes: A screenshot or relevant image encoded as bytes.
- nlp_task: str: A natural language description of the task the model needs to perform.
- client = OpenAI(): Initializes a client instance to interact with the OpenAI API.
Function Call Structure
function_call_instruction_guided_replay = {
    "name": "run_step",
    "description": "Based on the current step and the current state, return the next action to take",
    "parameters": {
        "type": "object",
        "properties": {
            "reasoning": {"type": "string", "description": "The reasoning for the action to be performed"},
            "action_type": {"type": "string", "description": "The type of action", "enum": ["tap", "swipe", "input"]},
            "action_id": {"type": "integer", "description": "The ID of the action to be performed"},
            "value": {"type": "string", "description": "Input value or text"},
            "direction": {"type": "string", "description": "Swipe direction", "enum": ["up", "down", "left", "right"]},
        },
        "required": ["action_type", "action_id", "reasoning"],
    },
}
This dictionary defines how the OpenAI model should structure its response. It specifies that the model needs to output the reasoning behind its decision, the type of action (e.g., tap, swipe, input), and details like action ID, input values, and swipe directions.
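For illustration, an action satisfying this schema might look like the dictionary below. The sanity check is a hand-rolled sketch mirroring the schema's constraints, not something the OpenAI library performs for you:

```python
# A hypothetical action the model could return for tapping UI element 3.
action = {
    "reasoning": "The button labelled 'Add a new task' matches the request.",
    "action_type": "tap",
    "action_id": 3,
}

# Mirror the schema: the required fields must be present, and action_type
# must be one of the allowed enum values.
required_fields = ["action_type", "action_id", "reasoning"]
assert all(field in action for field in required_fields)
assert action["action_type"] in ["tap", "swipe", "input"]
```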
Calling the OpenAI API
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "You are an AI assistant that helps with mobile app testing."},
        {"role": "user", "content": [
            {"type": "text", "text": f"Given the following task: {nlp_task} And the current state of the app: HTML: {html_state}"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64.b64encode(image).decode('utf-8')}"}},
        ]},
    ],
    functions=[function_call_instruction_guided_replay],
    function_call={"name": "run_step"},
)
- model="gpt-4o-2024-08-06": Specifies the version of the GPT-4 model used.
- messages: Context messages for the model, including system-level instructions and user-provided task details.
- functions: The function structure to guide the model's output.
- function_call: The specific function to be executed, here named run_step.
Extracting the Response
return json.loads(response.choices[0].message.function_call.arguments)
This line extracts the model's response, specifically the arguments of the function call, and returns them as a JSON object.
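As a quick illustration with made-up values, the arguments field is a JSON string that json.loads turns into a plain dictionary the rest of the script can work with:

```python
import json

# What response.choices[0].message.function_call.arguments might look like;
# the values here are invented for illustration.
raw_arguments = '{"reasoning": "Tap the target button", "action_type": "tap", "action_id": 7}'

# Parse the JSON string into a dictionary of action fields.
action = json.loads(raw_arguments)
assert action["action_type"] == "tap"
assert action["action_id"] == 7
```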
The main() Function
This is the central function where the interaction with the Android device is orchestrated.
Starting and Interacting with the Device
async def main():
    android_device = mobileadapt(platform="android")
    await android_device.start_device()
- mobileadapt(platform="android"): Creates an instance of mobileadapt configured for an Android device.
- await android_device.start_device(): Asynchronously starts the interaction with the connected Android device.
Getting and Interpreting the Device State
encoded_ui, screenshot, ui = await android_device.get_state()
This line retrieves the current state of the device, including the UI layout and a screenshot, which are used as inputs for the llm_call function.
Navigating to the Target App
await android_device.navigate("com.presley.flexify")
- navigate("com.presley.flexify"): Directs the Android device to open the app identified by the package name com.presley.flexify.
Generating Markers and Making Calls
set_of_mark = android_device.generate_set_of_mark(ui, screenshot)
action_grounded = llm_call(
html_state=encoded_ui,
image=set_of_mark,
nlp_task="Press the button with the text 'Add a new task'",
)
await android_device.perform_action(action_grounded)
- generate_set_of_mark(ui, screenshot): This method creates markers that help identify UI elements based on the screenshot.
- llm_call(): Calls the function defined earlier to get the next action to perform.
- perform_action(action_grounded): Executes the action returned by llm_call on the Android device.
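mobileadapt's perform_action() handles this routing internally; purely to illustrate which fields of the action dictionary each action type relies on, here is a hypothetical dispatcher (dispatch_action is not part of the library):

```python
from typing import Any, Dict

def dispatch_action(action: Dict[str, Any]) -> str:
    """Hypothetical dispatcher showing which fields each action type uses.

    The real routing is done by mobileadapt's perform_action(); this sketch
    only returns a description of what would be executed.
    """
    if action["action_type"] == "tap":
        # Tap actions only need the target element's ID.
        return f"tap element {action['action_id']}"
    if action["action_type"] == "input":
        # Input actions additionally need the text to type.
        return f"type '{action['value']}' into element {action['action_id']}"
    if action["action_type"] == "swipe":
        # Swipe actions additionally need a direction.
        return f"swipe {action['direction']} on element {action['action_id']}"
    raise ValueError(f"unknown action_type: {action['action_type']}")

print(dispatch_action({"action_type": "tap", "action_id": 3, "reasoning": "…"}))
# → tap element 3
```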
Stopping the Device
await android_device.stop_device()
Stops the interaction and connection with the Android device.
Entry Point of the Script
The following lines ensure that the script runs the main() function when executed.
if __name__ == "__main__":
    asyncio.run(main())
- if __name__ == "__main__": This is a standard Python construct to check if the script is run directly.
- asyncio.run(main()): Runs the main() function within the asyncio event loop, facilitating asynchronous execution.