Agentic Usage with Mobileadapt
This guide walks you through using an LLM to automate interaction with a mobile device.

Prerequisites

Before getting started, make sure you have the following:
- Python installed on your system.
- The mobileadapt package installed.
- An OpenAI API key.
- The app you want to interact with (e.g., the Flexify app) installed on the device.
Install the Necessary Packages

Ensure you have the following Python packages installed, as they are required for the script to run:

poetry add openai loguru Pillow
Set Up Your OpenAI API Key

Replace <your_openai_api_key> in the script with your actual OpenAI API key. This key is required to authenticate your requests to the OpenAI API.

Code Explanation

Imports and Setup
import asyncio
import base64
import json
from typing import Any, Dict
from openai import OpenAI
from mobileadapt import mobileadapt
Explanation:
- asyncio: This module provides support for asynchronous programming. It is used to run tasks concurrently in the script.
- base64: This module allows encoding and decoding of binary data to and from Base64, which is used for transmitting data as text, especially in web contexts.
- json: This module allows for working with JSON data, which is a common format for API communication.
- typing: This module provides support for type hints, helping to specify the expected types for variables and function arguments.
- OpenAI: This is the library from OpenAI that allows interaction with their API.
- mobileadapt: This is the library to interact with mobile devices programmatically.
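To make the base64 step concrete, here is a minimal, self-contained sketch of how raw image bytes become the data URL passed to the API later in this guide (the screenshot bytes here are placeholders, not real JPEG data):

```python
import base64

# Placeholder bytes standing in for the raw JPEG data of a screenshot.
screenshot_bytes = b"\xff\xd8\xff\xe0fake-jpeg-data"

# Encode to Base64 text and build the data URL, mirroring the llm_call() code.
encoded = base64.b64encode(screenshot_bytes).decode("utf-8")
data_url = f"data:image/jpeg;base64,{encoded}"

# The encoding is lossless: decoding recovers the original bytes.
assert base64.b64decode(encoded) == screenshot_bytes
```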
The llm_call Function
This function is a core part of the script. It uses OpenAI's language model to decide what action to take based on the app's current state. It constructs a prompt using the current UI state and a screenshot to inform the model about what it needs to do. Let's explain this function further:
Function Definition and Initialization
def llm_call(html_state: str, image: bytes, nlp_task: str):
    client = OpenAI()
- html_state: str: A string representation of the current HTML state or layout of the app.
- image: bytes: A screenshot or relevant image encoded as bytes.
- nlp_task: str: A natural language description of the task the model needs to perform.
- client = OpenAI(): Initializes a client instance to interact with the OpenAI API.
Function Call Structure
function_call_instruction_guided_replay = {
    "name": "run_step",
    "description": "Based on the current step and the current state, return the next action to take",
    "parameters": {
        "type": "object",
        "properties": {
            "reasoning": {"type": "string", "description": "The reasoning for the action to be performed"},
            "action_type": {"type": "string", "description": "The type of action", "enum": ["tap", "swipe", "input"]},
            "action_id": {"type": "integer", "description": "The ID of the action to be performed"},
            "value": {"type": "string", "description": "Input value or text"},
            "direction": {"type": "string", "description": "Swipe direction", "enum": ["up", "down", "left", "right"]},
        },
        "required": ["action_type", "action_id", "reasoning"],
    },
}
This dictionary defines how the OpenAI model should structure its response. It specifies that the model needs to output the reasoning behind its decision, the type of action (e.g., tap, swipe, input), and details like action ID, input values, and swipe directions.
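For illustration, an action satisfying this schema might look like the dictionary below. The sanity check is a hand-rolled sketch mirroring the schema's constraints, not something the OpenAI library performs for you:

```python
# A hypothetical action the model could return for tapping UI element 3.
action = {
    "reasoning": "The button labelled 'Add a new task' matches the request.",
    "action_type": "tap",
    "action_id": 3,
}

# Mirror the schema: the required fields must be present, and action_type
# must be one of the allowed enum values.
required_fields = ["action_type", "action_id", "reasoning"]
assert all(field in action for field in required_fields)
assert action["action_type"] in ["tap", "swipe", "input"]
```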
Calling the OpenAI API
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "You are an AI assistant that helps with mobile app testing."},
        {"role": "user", "content": [
            {"type": "text", "text": f"Given the following task: {nlp_task} And the current state of the app: HTML: {html_state}"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64.b64encode(image).decode('utf-8')}"}},
        ]},
    ],
    functions=[function_call_instruction_guided_replay],
    function_call={"name": "run_step"},
)
- model="gpt-4o-2024-08-06": Specifies the version of the GPT-4 model used.
- messages: Context messages for the model, including system-level instructions and user-provided task details.
- functions: The function structure to guide the model's output.
- function_call: The specific function to be executed, here named run_step.
Extracting the Response
return json.loads(response.choices[0].message.function_call.arguments)
This line extracts the model's response, specifically the arguments of the function call, and returns them as a JSON object.
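As a quick illustration with made-up values, the arguments field is a JSON string that json.loads turns into a plain dictionary the rest of the script can work with:

```python
import json

# What response.choices[0].message.function_call.arguments might look like;
# the values here are invented for illustration.
raw_arguments = '{"reasoning": "Tap the target button", "action_type": "tap", "action_id": 7}'

# Parse the JSON string into a dictionary of action fields.
action = json.loads(raw_arguments)
assert action["action_type"] == "tap"
assert action["action_id"] == 7
```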
The main() Function
This is the central function where the interaction with the Android device is orchestrated.
Starting and Interacting with the Device
async def main():
    android_device = mobileadapt(platform="android")
    await android_device.start_device()
- mobileadapt(platform="android"): Creates an instance of mobileadapt configured for an Android device.
- await android_device.start_device(): Asynchronously starts the interaction with the connected Android device.
Getting and Interpreting the Device State
encoded_ui, screenshot, ui = await android_device.get_state()
This line retrieves the current state of the device, including the UI layout and a screenshot, which are used as inputs for the llm_call function.
Navigating to the Target App
await android_device.navigate("com.presley.flexify")
- navigate("com.presley.flexify"): Directs the Android device to open the app identified by the package name com.presley.flexify.
Generating Markers and Making Calls
set_of_mark = android_device.generate_set_of_mark(ui, screenshot)
action_grounded = llm_call(
html_state=encoded_ui,
image=set_of_mark,
nlp_task="Press the button with the text 'Add a new task'",
)
await android_device.perform_action(action_grounded)
- generate_set_of_mark(ui, screenshot): This method creates markers that help identify UI elements based on the screenshot.
- llm_call(): Calls the function defined earlier to get the next action to perform.
- perform_action(action_grounded): Executes the action returned by llm_call on the Android device.
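mobileadapt's perform_action() handles this routing internally; purely to illustrate which fields of the action dictionary each action type relies on, here is a hypothetical dispatcher (dispatch_action is not part of the library):

```python
from typing import Any, Dict

def dispatch_action(action: Dict[str, Any]) -> str:
    """Hypothetical dispatcher showing which fields each action type uses.

    The real routing is done by mobileadapt's perform_action(); this sketch
    only returns a description of what would be executed.
    """
    if action["action_type"] == "tap":
        # Tap actions only need the target element's ID.
        return f"tap element {action['action_id']}"
    if action["action_type"] == "input":
        # Input actions additionally need the text to type.
        return f"type '{action['value']}' into element {action['action_id']}"
    if action["action_type"] == "swipe":
        # Swipe actions additionally need a direction.
        return f"swipe {action['direction']} on element {action['action_id']}"
    raise ValueError(f"unknown action_type: {action['action_type']}")

print(dispatch_action({"action_type": "tap", "action_id": 3, "reasoning": "…"}))
# → tap element 3
```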
Stopping the Device
await android_device.stop_device()
Stops the interaction and connection with the Android device.
Entry Point of the Script
The following lines ensure that the script runs the main() function when executed.
if __name__ == "__main__":
    asyncio.run(main())
- if __name__ == "__main__": This is a standard Python construct to check if the script is run directly.
- asyncio.run(main()): Runs the main() function within the asyncio event loop, facilitating asynchronous execution.