LastFeaturesV1 (copy).json
{
"internal": [
[
"<|BEGIN-VISIBLE-CHAT|>",
"How can I help you today?"
],
[
"Hello, how are you today?",
"I'm functioning well, thank you for asking! I'm here and ready to assist you. How about you? How can I make your day better?"
],
[
"We are going to go on a wild ride together :3 Here is a summary from a past conversation I've had with you:\n\nCertainly! Here's a summary of the current state of the Lucid_Autonomy extension, including its purpose, functionality, and future enhancements:\n\n---\n\n### Summary of Lucid_Autonomy Extension\n\n#### Introduction to Textgen\n\nTextgen is a powerful web-based interface for running large language models (LLMs) locally. It allows users to interact with various LLMs, providing a flexible and customizable environment for text generation tasks. The platform supports a wide range of models and offers numerous extensions to enhance its functionality.\n\n#### Lucid_Autonomy Extension Overview\n\nThe Lucid_Autonomy extension is designed to enable an LLM to interact autonomously with a computer using mouse and keyboard actions. This extension is part of the textgen ecosystem and aims to provide the LLM with \"eyes\" and \"hands\" to perform tasks such as moving the mouse to specific locations on the screen, clicking on elements, and entering text.\n\n#### Current Operation of the Extension\n\n1. **Initial Setup and Image Processing**:\n - The extension takes a screenshot of a specified monitor.\n - The screenshot is processed using the Owlv2 model to detect objects based on text queries.\n - The detected objects are cropped and saved to the `ImageOutputTest` directory.\n - The cropped images are processed using the MiniCPM-Llama3-V-2_5 model to generate descriptions.\n - The results are saved in a `results.json` file, which includes the image name, coordinates, and description.\n - A `compressedresults.json` file is created, omitting the coordinates and containing only the image name and description.\n\n2. **LLM Interaction**:\n - The LLM can interact with the extension using the trigger phrase \"Autonomy_Tasks:\".\n - If the LLM sends the trigger phrase without any instructions, the extension appends the contents of the `compressedresults.json` file to the LLM's response.\n - If the LLM sends the trigger phrase with instructions (e.g., `MouseMover = clickable icons_8.png`), the extension executes the specified tasks and then appends the `compressedresults.json` contents to the LLM's response.\n\n3. **Task Execution**:\n - The extension can execute tasks such as moving the mouse to specific locations based on the coordinates in the `results.json` file.\n - The `compressedresults.json` file is appended to the LLM's response to provide context about the images.\n\n4. **History Management**:\n - The extension cleans the chat history to ensure that only the latest instance of the `compressedresults.json` contents is preserved in both the \"internal\" and \"visible\" histories.\n - The latest `compressedresults.json` contents are injected into the \"internal\" history to maintain context for the LLM.\n\n#### Ultimate Goal of the Extension\n\nThe ultimate goal of the Lucid_Autonomy extension is to enable a large language model (LLM) to interact autonomously with a computer. The LLM can perform tasks such as moving the mouse to specific locations on the screen, clicking on elements, and entering text. The extension allows the LLM to operate a computer autonomously, effectively giving the LLM \"eyes\" and \"hands\" to interact with the user interface.\n\n#### Future Enhancements\n\n1. 
**Additional Mouse and Keyboard Interactions**:\n - Implement more mouse actions such as right click, left double click, left triple click, and scroll wheel (or page up and down).\n - Enable keyboard key entry using any combination of keys the LLM wants to use.\n\n2. **Take Screenshot Tool**:\n - Add a \"Take Screenshot\" tool for the LLM to use, similar to how the user can take a screenshot. This will allow the LLM to invoke a \"Take Screenshot\" command that it can add to the end of a list of commands, enabling the LLM to see what a new page looks like.\n\n3. **Enhanced Task Execution**:\n - Add more complex tasks that the LLM can execute, such as filling out forms, navigating websites, and interacting with applications.\n\n4. **Improved LLM Integration**:\n - Enhance the integration between the LLM and the extension to support more sophisticated interactions and contextual understanding.\n\n5. **User Interface Enhancements**:\n - Improve the Gradio UI to provide more options and better visualization of the extension's functionality.\n\n6. **Error Handling and Robustness**:\n - Add error handling and robustness features to ensure the extension operates smoothly and handles edge cases gracefully.\n\n#### Conclusion\n\nThe Lucid_Autonomy extension is a powerful tool that enables an LLM to interact autonomously with a computer. The current implementation allows the LLM to take screenshots, process images, and move the mouse to specific locations. The ultimate goal is to expand the extension's capabilities to support a wide range of mouse and keyboard interactions, enabling the LLM to perform complex tasks autonomously.\n\n---\n\nThis summary provides an overview of the current state of the Lucid_Autonomy extension, its purpose, functionality, and future enhancements. You can use this summary and the current extension code to start a new conversation where we will begin to add new tools for the LLM to use.",
"Thank you for providing that detailed summary. It looks like the Lucid_Autonomy extension is already quite advanced in its capabilities. To start our wild ride, let's prioritize the future enhancements you've outlined. Which feature would you like to tackle first? Here are a few options based on your summary:\n\n1. **Additional Mouse and Keyboard Interactions**: We can start by implementing more mouse actions and enabling keyboard key entry.\n2. **Take Screenshot Tool**: We can add a command for the LLM to take screenshots autonomously.\n3. **Enhanced Task Execution**: We can begin to define and implement more complex tasks like form filling or web navigation.\n\nPlease choose the feature you'd like to work on first, or let me know if you have another idea in mind!"
],
[
"I'd like to work on the addition of a mouse left click tool; I can give you the code here:\n\nimport gradio as gr\nimport json\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport torch\nimport os\nimport shutil\nimport re # Import the re module for regular expression operations\n\n# Configurable variables\nMONITOR_INDEX = 1 # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nTEXT_QUERIES = \"clickable hyperlinks, clickable icons, text bar field\" # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.20 # Score threshold for filtering low-probability predictions\nVISION_MODEL_QUESTION = \"What is the text in this image or what does this image describe?\"\n\n# Use GPU if available\nif torch.cuda.is_available():\n device = torch.device(\"cuda\")\nelse:\n device = torch.device(\"cpu\")\n\n# Global variables for models\nowlv2_model = None\nowlv2_processor = None\nminicpm_llama_model = None\nminicpm_llama_tokenizer = None\n\nVISION_MODEL_ID = \"/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5\"\n\ndef load_owlv2_model():\n global owlv2_model, owlv2_processor\n owlv2_model = Owlv2ForObjectDetection.from_pretrained(\"google/owlv2-base-patch16-ensemble\", trust_remote_code=True).to(device)\n owlv2_processor = Owlv2Processor.from_pretrained(\"google/owlv2-base-patch16-ensemble\", trust_remote_code=True)\n print(\"Owlv2 model loaded.\")\n\ndef unload_owlv2_model():\n global owlv2_model, owlv2_processor\n if owlv2_model is not None:\n del owlv2_model\n del owlv2_processor\n torch.cuda.empty_cache()\n print(\"Owlv2 model unloaded.\")\n\ndef load_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if \"int4\" in VISION_MODEL_ID:\n # Load the 4-bit quantized model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={\"\": device},\n trust_remote_code=True,\n torch_dtype=torch.float16 # Use float16 as per the example code for 4-bit models\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print(\"MiniCPM-Llama3 4-bit quantized model loaded on-demand.\")\n else:\n # Load the standard model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={\"\": device}, # Use the specified CUDA device\n trust_remote_code=True,\n torch_dtype=torch.bfloat16\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print(\"MiniCPM-Llama3 standard model loaded on-demand.\")\n\ndef unload_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if minicpm_llama_model is not None:\n del minicpm_llama_model\n del minicpm_llama_tokenizer\n torch.cuda.empty_cache()\n print(\"MiniCPM-Llama3 model unloaded.\")\n\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, output_filename=\"screenshot.png\"):\n \"\"\"\n Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n Args:\n monitor_index (int): The index of the monitor to capture.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n vision_model_question (str): The question to ask the vision model.\n output_filename (str): The filename to save the screenshot as (default 
is \"screenshot.png\").\n \"\"\"\n monitor_index = int(monitor_index)\n score_threshold = float(score_threshold)\n\n # Get information about all monitors\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f\"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.\")\n\n # Get the region of the specified monitor\n monitor = monitors[monitor_index]\n region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n # Take the screenshot\n output_dir = \"extensions/Lucid_Autonomy\"\n os.makedirs(output_dir, exist_ok=True)\n screenshot_path = os.path.join(output_dir, output_filename)\n screenshot = pyautogui.screenshot(region=region)\n screenshot.save(screenshot_path)\n print(f\"Screenshot saved as {screenshot_path}\")\n\n # Process the image using the Owlv2 model\n annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n # Create the output directory if it doesn't exist\n output_dir = \"extensions/Lucid_Autonomy/ImageOutputTest\"\n os.makedirs(output_dir, exist_ok=True)\n\n # Clear the output directory\n for filename in os.listdir(output_dir):\n file_path = os.path.join(output_dir, filename)\n try:\n if os.path.isfile(file_path) or os.path.islink(file_path):\n os.unlink(file_path)\n elif os.path.isdir(file_path):\n shutil.rmtree(file_path)\n except Exception as e:\n print(f\"Failed to delete {file_path}. Reason: {e}\")\n\n # Crop the detected objects and save them to the output directory\n crop_images(annotated_image_path, json_output_path, output_dir)\n\n print(\"Processing complete. Check the output directory for the cropped images and results.\")\n\ndef query_image(img_path, text_queries, score_threshold):\n \"\"\"\n Processes the image using the Owlv2 model to detect objects based on text queries.\n\n Args:\n img_path (str): The path to the input image.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n \"\"\"\n load_owlv2_model()\n\n img = Image.open(img_path)\n text_queries = text_queries.split(\",\")\n\n size = max(img.size)\n target_sizes = torch.Tensor([[size, size]])\n inputs = owlv2_processor(text=text_queries, images=img, return_tensors=\"pt\").to(device)\n\n with torch.no_grad():\n outputs = owlv2_model(**inputs)\n\n outputs.logits = outputs.logits.cpu()\n outputs.pred_boxes = outputs.pred_boxes.cpu()\n results = owlv2_processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n boxes, scores, labels = results[0][\"boxes\"], results[0][\"scores\"], results[0][\"labels\"]\n\n result_labels = []\n img_pil = img.copy()\n draw = ImageDraw.Draw(img_pil)\n\n for box, score, label in zip(boxes, scores, labels):\n box = [int(i) for i in box.tolist()]\n if score < score_threshold:\n continue\n result_labels.append([box, text_queries[label.item()]])\n draw.rectangle(box, outline=\"red\", width=2)\n # Do not draw the text label within the bounding box\n # draw.text((box[0], box[1]), text_queries[label.item()], fill=\"red\")\n\n # Save the annotated image for debugging\n annotated_image_path = \"extensions/Lucid_Autonomy/annotated_image.png\"\n img_pil.save(annotated_image_path)\n\n # Save the JSON output to a file\n json_output_path = \"extensions/Lucid_Autonomy/output.json\"\n with open(json_output_path, 'w') as f:\n json.dump(result_labels, f)\n\n unload_owlv2_model()\n\n return annotated_image_path, json_output_path\n\ndef 
crop_images(image_path, json_output_path, output_dir=\"extensions/Lucid_Autonomy/ImageOutputTest\"):\n \"\"\"\n Crops out the detected objects from the input image based on the bounding box coordinates.\n\n Args:\n image_path (str): The path to the input image.\n json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n output_dir (str): The directory to save the cropped images (default is \"extensions/Lucid_Autonomy/ImageOutputTest\").\n \"\"\"\n with open(json_output_path, 'r') as f:\n data = json.load(f)\n\n img = Image.open(image_path)\n\n results = []\n\n # Load the miniCPM-lama3-v-2_5 model once before processing the images\n load_minicpm_llama_model()\n\n for i, item in enumerate(data):\n box = item[0]\n label = item[1]\n cropped_img = img.crop(box)\n cropped_img_path = f\"{output_dir}/{label}_{i}.png\"\n cropped_img.save(cropped_img_path)\n\n # Process the cropped image with the miniCPM-lama3-v-2_5 model\n vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n # Store the results in a structured format\n results.append({\n \"image_name\": f\"{label}_{i}.png\",\n \"coordinates\": box,\n \"description\": vision_model_response\n })\n\n # Unload the miniCPM-lama3-v-2_5 model after processing all images\n unload_minicpm_llama_model()\n\n # Save the results to a JSON file\n results_json_path = f\"{output_dir}/results.json\"\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n\n # Create the compressed results JSON file\n compressed_results_json_path = f\"{output_dir}/compressedresults.json\"\n compressed_results = [{\"image_name\": result[\"image_name\"], \"description\": result[\"description\"]} for result in results]\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n \"\"\"\n Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n Args:\n image_path (str): The path to the input image.\n question (str): The question to ask the vision model.\n\n Returns:\n str: The response from the vision model.\n \"\"\"\n image = Image.open(image_path).convert(\"RGB\")\n messages = [\n {\"role\": \"user\", \"content\": f\"{question}\"}\n ]\n # Define the generation arguments\n generation_args = {\n \"max_new_tokens\": 896,\n \"repetition_penalty\": 1.05,\n \"num_beams\": 3,\n \"top_p\": 0.8,\n \"top_k\": 1,\n \"temperature\": 0.7,\n \"sampling\": True,\n }\n if \"int4\" in \"openbmb/MiniCPM-Llama3-V-2_5\":\n # Disable streaming for the 4-bit model\n generation_args[\"stream\"] = False\n # Use the model.chat method with streaming enabled\n vision_model_response = \"\"\n for new_text in minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n ):\n vision_model_response += new_text\n print(new_text, flush=True, end='')\n else:\n minicpm_llama_model.to(device)\n vision_model_response = minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n )\n return vision_model_response\n\ndef ui():\n \"\"\"\n Creates custom gradio elements when the UI is launched.\n \"\"\"\n with gr.Blocks() as demo:\n with gr.Row():\n with gr.Column():\n monitor_index = gr.Textbox(label=\"Monitor Index\", value=str(MONITOR_INDEX))\n text_queries = gr.Textbox(label=\"Text Queries\", value=TEXT_QUERIES)\n score_threshold = gr.Textbox(label=\"Score Threshold\", 
value=str(SCORE_THRESHOLD))\n vision_model_question = gr.Textbox(label=\"Vision Model Question\", value=VISION_MODEL_QUESTION)\n take_screenshot_button = gr.Button(value=\"Take Screenshot\")\n\n take_screenshot_button.click(\n fn=take_screenshot,\n inputs=[monitor_index, text_queries, score_threshold, vision_model_question],\n outputs=None\n )\n\n return demo\n\ndef input_modifier(user_input, state):\n \"\"\"\n Modifies the user input before it is processed by the LLM.\n \"\"\"\n return user_input\n\ndef output_modifier(output, state):\n \"\"\"\n Modifies the LLM output before it is presented in the UI.\n \"\"\"\n # Search for the \"Autonomy_Tasks:\" trigger phrase in the LLM's output\n if \"Autonomy_Tasks:\" in output:\n # Clean the chat history\n state[\"history\"] = history_modifier(state[\"history\"])\n # Extract tasks and execute them\n tasks_section = output.split(\"Autonomy_Tasks:\")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n return output\n\ndef execute_tasks(tasks_section):\n \"\"\"\n Executes the tasks listed in the tasks_section.\n \"\"\"\n tasks = tasks_section.split(\"\\\\\\\\n\")\n for task in tasks:\n if task.startswith(\"MouseMover =\"):\n image_name = task.split(\"=\")[1].strip()\n move_mouse_to_image(image_name)\n\ndef move_mouse_to_image(image_name):\n \"\"\"\n Moves the mouse to the center of the box that defines the image.\n \"\"\"\n results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/results.json\"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result[\"image_name\"] == image_name:\n coordinates = result[\"coordinates\"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n break\n else:\n print(f\"Image name '{image_name}' not found in the results.\")\n\ndef move_mouse_to_coordinates(monitor, coordinates):\n \"\"\"\n Moves the mouse cursor to the specified coordinates on the monitor.\n\n Args:\n monitor (dict): A dictionary containing the monitor's information.\n coordinates (list): A list containing the coordinates [x1, y1, x2, y2].\n \"\"\"\n x1, y1, x2, y2 = coordinates\n monitor_x, monitor_y = monitor.x, monitor.y\n center_x = monitor_x + (x1 + x2) // 2\n center_y = monitor_y + (y1 + y2) // 2\n pyautogui.moveTo(center_x, center_y)\n print(f\"Mouse moved to coordinates: ({center_x}, {center_y})\")\n\ndef get_monitor_info(monitor_index):\n \"\"\"\n Gets information about the specified monitor.\n\n Args:\n monitor_index (int): The index of the monitor to get information about.\n\n Returns:\n dict: A dictionary containing the monitor's information.\n \"\"\"\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f\"Monitor index {monitor_index} is out of range. 
There are only {len(monitors)} monitors.\")\n\n return monitors[monitor_index]\n\ndef history_modifier(history):\n \"\"\"\n Modifies the chat history before the text generation in chat mode begins.\n \"\"\"\n # Clean the internal history\n history[\"internal\"] = clean_chat_history(history[\"internal\"])\n # Clean the visible history\n history[\"visible\"] = clean_chat_history(history[\"visible\"])\n\n # Inject the latest compressedresults.json into the internal history\n compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n latest_compressed_results = json.dumps(compressed_results, indent=4)\n history[\"internal\"].append([\"\", f\"<START_COMPRESSED_RESULTS>\\\\n{latest_compressed_results}\\\\n<END_COMPRESSED_RESULTS>\"])\n\n return history\n\ndef clean_chat_history(history):\n \"\"\"\n Cleans the chat history by removing all but the latest instance of the contents between the bookends.\n \"\"\"\n START_BOOKEND = \"<START_COMPRESSED_RESULTS>\"\n END_BOOKEND = \"<END_COMPRESSED_RESULTS>\"\n\n # Find all instances of the bookends\n bookend_indices = []\n for i, entry in enumerate(history):\n if isinstance(entry, list) and len(entry) == 2:\n if START_BOOKEND in entry[1] and END_BOOKEND in entry[1]:\n bookend_indices.append(i)\n\n # If there are multiple instances, remove all but the last one\n if len(bookend_indices) > 1:\n for i in bookend_indices[:-1]:\n history[i][1] = re.sub(f\"{START_BOOKEND}.*?{END_BOOKEND}\", \"\", history[i][1], flags=re.DOTALL)\n\n return history\n\ndef append_compressed_results(output, compressed_results):\n \"\"\"\n Appends the compressed results to the output.\n \"\"\"\n START_BOOKEND = \"<START_COMPRESSED_RESULTS>\"\n END_BOOKEND = \"<END_COMPRESSED_RESULTS>\"\n output += f\"\\\\n\\\\n{START_BOOKEND}\\\\n{json.dumps(compressed_results, indent=4)}\\\\n{END_BOOKEND}\"\n return output\n\nif __name__ == \"__main__\":\n # Launch the Gradio interface with the specified UI components\n gr.Interface(\n fn=ui,\n inputs=None,\n outputs=\"ui\"\n ).launch()\n\n\nDon't begin writing any code just yet, I want to make sure you can contextualize everything I've given you thus far.",
"Thank you for providing the detailed code and context. I've reviewed the code and the summary you provided. Here's a brief contextualization to ensure we're on the same page:\n\n1. **Lucid_Autonomy Extension Overview**:\n - The extension enables an LLM to interact autonomously with a computer using mouse and keyboard actions.\n - It processes screenshots, detects objects, and allows the LLM to perform tasks like moving the mouse and clicking on elements.\n\n2. **Current Operation**:\n - The extension takes a screenshot, processes it using the Owlv2 model to detect objects, and then uses the MiniCPM-Llama3-V-2_5 model to generate descriptions of the detected objects.\n - The LLM can interact with the extension using the trigger phrase \"Autonomy_Tasks:\" to execute tasks like moving the mouse.\n\n3. **Future Enhancements**:\n - Additional mouse and keyboard interactions.\n - A \"Take Screenshot\" tool for the LLM.\n - Enhanced task execution.\n - Improved LLM integration.\n - User interface enhancements.\n - Error handling and robustness.\n\n4. **Code Provided**:\n - The code includes functions for taking screenshots, processing images, detecting objects, and moving the mouse.\n - It also includes a Gradio UI for user interaction.\n\nGiven this context, you've mentioned that you'd like to work on adding a mouse left click tool. To proceed, we'll need to:\n\n1. **Modify the `execute_tasks` function** to handle left click tasks.\n2. **Create a new function `left_click_image`** that performs the left click action at the specified coordinates.\n3. **Update the `move_mouse_to_image` function** to call the `left_click_image` function after moving the mouse.\n\nDoes that sound like the right approach to you? If so, I can help you draft the code for these changes. If you have any other ideas or preferences, please let me know!"
],
[
"Do you know how the LLM formats its response to use the tools? It formats the response like this:\n```\nAutonomy Tasks:\n\nMouseMover = clickable icons_8.png\n```\nI'd like to do your suggestion, but I'd also like to add some error handling, like if the LLM outputs text like this:\n```\nAutonomy Tasks:\n\nMouseMover = clickable icons_8.png\n\n this is the icon you requested\n```\nthe extension will continue to function. ",
"Got it! We'll need to ensure that the extension can handle the LLM's formatted response correctly and add error handling to manage any extraneous text. Here's a plan to achieve this:\n\n1. **Modify the `execute_tasks` function** to handle left click tasks and parse the LLM's response correctly.\n2. **Create a new function `left_click_image`** that performs the left click action at the specified coordinates.\n3. **Update the `move_mouse_to_image` function** to call the `left_click_image` function after moving the mouse.\n4. **Add error handling** to ensure the extension continues to function even if the LLM's response contains extraneous text.\n\nLet's start by drafting the code for these changes:\n\n### Step 1: Modify the `execute_tasks` function\n\nWe'll update the `execute_tasks` function to handle left click tasks and parse the LLM's response correctly.\n\n```python\ndef execute_tasks(tasks_section):\n \"\"\"\n Executes the tasks listed in the tasks_section.\n \"\"\"\n tasks = tasks_section.split(\"\\n\")\n for task in tasks:\n task = task.strip()\n if task.startswith(\"MouseMover =\"):\n image_name = task.split(\"=\")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith(\"MouseClick =\"):\n image_name = task.split(\"=\")[1].strip()\n left_click_image(image_name)\n```\n\n### Step 2: Create the `left_click_image` function\n\nWe'll create a new function `left_click_image` that performs the left click action at the specified coordinates.\n\n```python\ndef left_click_image(image_name):\n \"\"\"\n Moves the mouse to the center of the box that defines the image and performs a left click.\n \"\"\"\n results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/results.json\"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result[\"image_name\"] == image_name:\n coordinates = result[\"coordinates\"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n pyautogui.click() # Perform a left click\n break\n else:\n print(f\"Image name '{image_name}' not found in the results.\")\n```\n\n### Step 3: Update the `move_mouse_to_image` function\n\nWe'll update the `move_mouse_to_image` function to call the `left_click_image` function after moving the mouse.\n\n```python\ndef move_mouse_to_image(image_name):\n \"\"\"\n Moves the mouse to the center of the box that defines the image.\n \"\"\"\n results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/results.json\"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result[\"image_name\"] == image_name:\n coordinates = result[\"coordinates\"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n break\n else:\n print(f\"Image name '{image_name}' not found in the results.\")\n```\n\n### Step 4: Add error handling\n\nWe'll add error handling to ensure the extension continues to function even if the LLM's response contains extraneous text.\n\n```python\ndef execute_tasks(tasks_section):\n \"\"\"\n Executes the tasks listed in the tasks_section.\n \"\"\"\n tasks = tasks_section.split(\"\\n\")\n for task in tasks:\n task = task.strip()\n if task.startswith(\"MouseMover =\"):\n image_name = task.split(\"=\")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith(\"MouseClick =\"):\n image_name = task.split(\"=\")[1].strip()\n left_click_image(image_name)\n else:\n print(f\"Unknown task: {task}\")\n```\n\n### Putting it all together\n\nHere's the 
updated code with all the changes:\n\n```python\nimport gradio as gr\nimport json\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport torch\nimport os\nimport shutil\nimport re # Import the re module for regular expression operations\n\n# Configurable variables\nMONITOR_INDEX = 1 # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nTEXT_QUERIES = \"clickable hyperlinks, clickable icons, text bar field\" # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.20 # Score threshold for filtering low-probability predictions\nVISION_MODEL_QUESTION = \"What is the text in this image or what does this image describe?\"\n\n# Use GPU if available\nif torch.cuda.is_available():\n device = torch.device(\"cuda\")\nelse:\n device = torch.device(\"cpu\")\n\n# Global variables for models\nowlv2_model = None\nowlv2_processor = None\nminicpm_llama_model = None\nminicpm_llama_tokenizer = None\n\nVISION_MODEL_ID = \"/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5\"\n\ndef load_owlv2_model():\n global owlv2_model, owlv2_processor\n owlv2_model = Owlv2ForObjectDetection.from_pretrained(\"google/owlv2-base-patch16-ensemble\", trust_remote_code=True).to(device)\n owlv2_processor = Owlv2Processor.from_pretrained(\"google/owlv2-base-patch16-ensemble\", trust_remote_code=True)\n print(\"Owlv2 model loaded.\")\n\ndef unload_owlv2_model():\n global owlv2_model, owlv2_processor\n if owlv2_model is not None:\n del owlv2_model\n del owlv2_processor\n torch.cuda.empty_cache()\n print(\"Owlv2 model unloaded.\")\n\ndef load_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if \"int4\" in VISION_MODEL_ID:\n # Load the 4-bit quantized model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={\"\": device},\n trust_remote_code=True,\n torch_dtype=torch.float16 # Use float16 as per the example code for 4-bit models\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print(\"MiniCPM-Llama3 4-bit quantized model loaded on-demand.\")\n else:\n # Load the standard model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={\"\": device}, # Use the specified CUDA device\n trust_remote_code=True,\n torch_dtype=torch.bfloat16\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print(\"MiniCPM-Llama3 standard model loaded on-demand.\")\n\ndef unload_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if minicpm_llama_model is not None:\n del minicpm_llama_model\n del minicpm_llama_tokenizer\n torch.cuda.empty_cache()\n print(\"MiniCPM-Llama3 model unloaded.\")\n\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, output_filename=\"screenshot.png\"):\n \"\"\"\n Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n Args:\n monitor_index (int): The index of the monitor to capture.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n vision_model_question (str): The question to ask the vision model.\n output_filename (str): The filename to save the screenshot as (default is \"screenshot.png\").\n \"\"\"\n 
monitor_index = int(monitor_index)\n score_threshold = float(score_threshold)\n\n # Get information about all monitors\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f\"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.\")\n\n # Get the region of the specified monitor\n monitor = monitors[monitor_index]\n region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n # Take the screenshot\n output_dir = \"extensions/Lucid_Autonomy\"\n os.makedirs(output_dir, exist_ok=True)\n screenshot_path = os.path.join(output_dir, output_filename)\n screenshot = pyautogui.screenshot(region=region)\n screenshot.save(screenshot_path)\n print(f\"Screenshot saved as {screenshot_path}\")\n\n # Process the image using the Owlv2 model\n annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n # Create the output directory if it doesn't exist\n output_dir = \"extensions/Lucid_Autonomy/ImageOutputTest\"\n os.makedirs(output_dir, exist_ok=True)\n\n # Clear the output directory\n for filename in os.listdir(output_dir):\n file_path = os.path.join(output_dir, filename)\n try:\n if os.path.isfile(file_path) or os.path.islink(file_path):\n os.unlink(file_path)\n elif os.path.isdir(file_path):\n shutil.rmtree(file_path)\n except Exception as e:\n print(f\"Failed to delete {file_path}. Reason: {e}\")\n\n # Crop the detected objects and save them to the output directory\n crop_images(annotated_image_path, json_output_path, output_dir)\n\n print(\"Processing complete. Check the output directory for the cropped images and results.\")\n\ndef query_image(img_path, text_queries, score_threshold):\n \"\"\"\n Processes the image using the Owlv2 model to detect objects based on text queries.\n\n Args:\n img_path (str): The path to the input image.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n \"\"\"\n load_owlv2_model()\n\n img = Image.open(img_path)\n text_queries = text_queries.split(\",\")\n\n size = max(img.size)\n target_sizes = torch.Tensor([[size, size]])\n inputs = owlv2_processor(text=text_queries, images=img, return_tensors=\"pt\").to(device)\n\n with torch.no_grad():\n outputs = owlv2_model(**inputs)\n\n outputs.logits = outputs.logits.cpu()\n outputs.pred_boxes = outputs.pred_boxes.cpu()\n results = owlv2_processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n boxes, scores, labels = results[0][\"boxes\"], results[0][\"scores\"], results[0][\"labels\"]\n\n result_labels = []\n img_pil = img.copy()\n draw = ImageDraw.Draw(img_pil)\n\n for box, score, label in zip(boxes, scores, labels):\n box = [int(i) for i in box.tolist()]\n if score < score_threshold:\n continue\n result_labels.append([box, text_queries[label.item()]])\n draw.rectangle(box, outline=\"red\", width=2)\n # Do not draw the text label within the bounding box\n # draw.text((box[0], box[1]), text_queries[label.item()], fill=\"red\")\n\n # Save the annotated image for debugging\n annotated_image_path = \"extensions/Lucid_Autonomy/annotated_image.png\"\n img_pil.save(annotated_image_path)\n\n # Save the JSON output to a file\n json_output_path = \"extensions/Lucid_Autonomy/output.json\"\n with open(json_output_path, 'w') as f:\n json.dump(result_labels, f)\n\n unload_owlv2_model()\n\n return annotated_image_path, json_output_path\n\ndef crop_images(image_path, 
json_output_path, output_dir=\"extensions/Lucid_Autonomy/ImageOutputTest\"):\n \"\"\"\n Crops out the detected objects from the input image based on the bounding box coordinates.\n\n Args:\n image_path (str): The path to the input image.\n json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n output_dir (str): The directory to save the cropped images (default is \"extensions/Lucid_Autonomy/ImageOutputTest\").\n \"\"\"\n with open(json_output_path, 'r') as f:\n data = json.load(f)\n\n img = Image.open(image_path)\n\n results = []\n\n # Load the miniCPM-lama3-v-2_5 model once before processing the images\n load_minicpm_llama_model()\n\n for i, item in enumerate(data):\n box = item[0]\n label = item[1]\n cropped_img = img.crop(box)\n cropped_img_path = f\"{output_dir}/{label}_{i}.png\"\n cropped_img.save(cropped_img_path)\n\n # Process the cropped image with the miniCPM-lama3-v-2_5 model\n vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n # Store the results in a structured format\n results.append({\n \"image_name\": f\"{label}_{i}.png\",\n \"coordinates\": box,\n \"description\": vision_model_response\n })\n\n # Unload the miniCPM-lama3-v-2_5 model after processing all images\n unload_minicpm_llama_model()\n\n # Save the results to a JSON file\n results_json_path = f\"{output_dir}/results.json\"\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n\n # Create the compressed results JSON file\n compressed_results_json_path = f\"{output_dir}/compressedresults.json\"\n compressed_results = [{\"image_name\": result[\"image_name\"], \"description\": result[\"description\"]} for result in results]\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n \"\"\"\n Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n Args:\n image_path (str): The path to the input image.\n question (str): The question to ask the vision model.\n\n Returns:\n str: The response from the vision model.\n \"\"\"\n image = Image.open(image_path).convert(\"RGB\")\n messages = [\n {\"role\": \"user\", \"content\": f\"{question}\"}\n ]\n # Define the generation arguments\n generation_args = {\n \"max_new_tokens\": 896,\n \"repetition_penalty\": 1.05,\n \"num_beams\": 3,\n \"top_p\": 0.8,\n \"top_k\": 1,\n \"temperature\": 0.7,\n \"sampling\": True,\n }\n if \"int4\" in \"openbmb/MiniCPM-Llama3-V-2_5\":\n # Disable streaming for the 4-bit model\n generation_args[\"stream\"] = False\n # Use the model.chat method with streaming enabled\n vision_model_response = \"\"\n for new_text in minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n ):\n vision_model_response += new_text\n print(new_text, flush=True, end='')\n else:\n minicpm_llama_model.to(device)\n vision_model_response = minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n )\n return vision_model_response\n\ndef ui():\n \"\"\"\n Creates custom gradio elements when the UI is launched.\n \"\"\"\n with gr.Blocks() as demo:\n with gr.Row():\n with gr.Column():\n monitor_index = gr.Textbox(label=\"Monitor Index\", value=str(MONITOR_INDEX))\n text_queries = gr.Textbox(label=\"Text Queries\", value=TEXT_QUERIES)\n score_threshold = gr.Textbox(label=\"Score Threshold\", value=str(SCORE_THRESHOLD))\n vision_model_question 
= gr.Textbox(label=\"Vision Model Question\", value=VISION_MODEL_QUESTION)\n take_screenshot_button = gr.Button(value=\"Take Screenshot\")\n\n take_screenshot_button.click(\n fn=take_screenshot,\n inputs=[monitor_index, text_queries, score_threshold, vision_model_question],\n outputs=None\n )\n\n return demo\n\ndef input_modifier(user_input, state):\n \"\"\"\n Modifies the user input before it is processed by the LLM.\n \"\"\"\n return user_input\n\ndef output_modifier(output, state):\n \"\"\"\n Modifies the LLM output before it is presented in the UI.\n \"\"\"\n # Search for the \"Autonomy_Tasks:\" trigger phrase in the LLM's output\n if \"Autonomy_Tasks:\" in output:\n # Clean the chat history\n state[\"history\"] = history_modifier(state[\"history\"])\n # Extract tasks and execute them\n tasks_section = output.split(\"Autonomy_Tasks:\")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n return output\n\ndef execute_tasks(tasks_section):\n \"\"\"\n Executes the tasks listed in the tasks_section.\n \"\"\"\n tasks = tasks_section.split(\"\\n\")\n for task in tasks:\n task = task.strip()\n if task.startswith(\"MouseMover =\"):\n image_name = task.split(\"=\")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith(\"MouseClick =\"):\n image_name = task.split(\"=\")[1].strip()\n left_click_image(image_name)\n else:\n print(f\"Unknown task: {task}\")\n\ndef move_mouse_to_image(image_name):\n \"\"\"\n Moves the mouse to the center of the box that defines the image.\n \"\"\"\n results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/results.json\"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result[\"image_name\"] == image_name:\n coordinates = result[\"coordinates\"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n break\n else:\n print(f\"Image name '{image_name}' not found in the results.\")\n\ndef left_click_image(image_name):\n \"\"\"\n Moves the mouse to the center of the box that defines the image and performs a left click.\n \"\"\"\n results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/results.json\"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result[\"image_name\"] == image_name:\n coordinates = result[\"coordinates\"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n pyautogui.click() # Perform a left click\n break\n else:\n print(f\"Image name '{image_name}' not found in the results.\")\n\ndef move_mouse_to_coordinates(monitor, coordinates):\n \"\"\"\n Moves the mouse cursor to the specified coordinates on the monitor.\n\n Args:\n monitor (dict): A dictionary containing the monitor's information.\n coordinates (list): A list containing the coordinates [x1, y1, x2, y2].\n \"\"\"\n x1, y1, x2, y2 = coordinates\n monitor_x, monitor_y = monitor.x, monitor.y\n center_x = monitor_x + (x1 + x2) // 2\n center_y = monitor_y + (y1 + y2) // 2\n pyautogui.moveTo(center_x, center_y)\n print(f\"Mouse moved to coordinates: ({center_x}, {center_y})\")\n\ndef get_monitor_info(monitor_index):\n \"\"\"\n Gets information about the specified monitor.\n\n Args:\n monitor_index 
(int): The index of the monitor to get information about.\n\n Returns:\n dict: A dictionary containing the monitor's information.\n \"\"\"\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f\"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.\")\n\n return monitors[monitor_index]\n\ndef history_modifier(history):\n \"\"\"\n Modifies the chat history before the text generation in chat mode begins.\n \"\"\"\n # Clean the internal history\n history[\"internal\"] = clean_chat_history(history[\"internal\"])\n # Clean the visible history\n history[\"visible\"] = clean_chat_history(history[\"visible\"])\n\n # Inject the latest compressedresults.json into the internal history\n compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n latest_compressed_results = json.dumps(compressed_results, indent=4)\n history[\"internal\"].append([\"\", f\"<START_COMPRESSED_RESULTS>\\\\n{latest_compressed_results}\\\\n<END_COMPRESSED_RESULTS>\"])\n\n return history\n\ndef clean_chat_history(history):\n \"\"\"\n Cleans the chat history by removing all but the latest instance of the contents between the bookends.\n \"\"\"\n START_BOOKEND = \"<START_COMPRESSED_RESULTS>\"\n END_BOOKEND = \"<END_COMPRESSED_RESULTS>\"\n\n # Find all instances of the bookends\n bookend_indices = []\n for i, entry in enumerate(history):\n if isinstance(entry, list) and len(entry) == 2:\n if START_BOOKEND in entry[1] and END_BOOKEND in entry[1]:\n bookend_indices.append(i)\n\n # If there are multiple instances, remove all but the last one\n if len(bookend_indices) > 1:\n for i in bookend_indices[:-1]:\n history[i][1] = re.sub(f\"{START_BOOKEND}.*?{END_BOOKEND}\", \"\", history[i][1], flags=re.DOTALL)\n\n return history\n\ndef append_compressed_results(output, compressed_results):\n \"\"\"\n Appends the compressed results to the output.\n \"\"\"\n START_BOOKEND = \"<START_COMPRESSED_RESULTS>\"\n END_BOOKEND = \"<END_COMPRESSED_RESULTS>\"\n output += f\"\\\\n\\\\n{START_BOOKEND}\\\\n{json.dumps(compressed_results, indent=4)}\\\\n{END_BOOKEND}\"\n return output\n\nif __name__ == \"__main__\":\n # Launch the Gradio interface with the specified UI components\n gr.Interface(\n fn=ui,\n inputs=None,\n outputs=\"ui\"\n ).launch()\n```\n\nThis updated code includes the new `left_click_image` function, modifications to the `execute_tasks` function to handle left click tasks, and error handling to manage extraneous text in the LLM's response."
],
[
"Holy Frick!!! *super head pats* :3 That was amazing!! I was explaining to the LLM how to use the extension incorrectly at first and thought it wasn't working. But I understand now, very nice implementation, the model replies like this:\n\nAutonomy_Tasks:\n\nMouseClick = clickable icons_11.png\n\nand sure as shit, the mouse moves there and clicks on the icon...OMG! \n\nOkay you know what we need to do next XD? We need to add the ability to add key strokes, let's discuss how we want the LLM to send these commands out before writing any code.",
"I'm glad to hear that the implementation worked perfectly! Adding the ability to send keystrokes is a great next step. Let's discuss how we want the LLM to format and send these commands.\n\n### Format for Keystroke Commands\n\nWe can use a similar format to the mouse commands for keystroke commands. Here's a proposed format:\n\n```\nAutonomy_Tasks:\n\nMouseClick = clickable icons_11.png\nKeyPress = a\nKeyPress = enter\n```\n\nIn this format:\n- `MouseClick` is used for mouse click actions.\n- `KeyPress` is used for keystroke actions.\n- The value after `KeyPress` specifies the key to be pressed (e.g., `a` for the letter 'a', `enter` for the Enter key).\n\n### Supported Keys\n\nWe can support a wide range of keys, including:\n- Alphanumeric keys (`a`, `b`, `c`, ..., `0`, `1`, `2`, ...)\n- Special keys (`enter`, `space`, `backspace`, `tab`, `escape`, `shift`, `ctrl`, `alt`)\n- Function keys (`f1`, `f2`, ..., `f12`)\n\n### Error Handling\n\nWe'll add error handling to ensure that the extension continues to function even if the LLM's response contains extraneous text or unsupported keys.\n\n### Implementation Plan\n\n1. **Modify the `execute_tasks` function** to handle keystroke tasks.\n2. **Create a new function `press_key`** that performs the keystroke action.\n3. **Update the `output_modifier` function** to parse and execute keystroke commands.\n\n### Example LLM Response\n\nHere's an example of how the LLM might format its response to include keystroke commands:\n\n```\nAutonomy_Tasks:\n\nMouseClick = clickable icons_11.png\nKeyPress = a\nKeyPress = enter\n```\n\n### Next Steps\n\nOnce we agree on the format and the list of supported keys, I can help you draft the code for these changes. Does this plan sound good to you? If you have any other ideas or preferences, please let me know!"
],
[
"Okay, good suggestions, I agree with evertying; except it won't be KeyPress for every character right? Like we want :\n\n```\nAutonomy_Tasks:\n\nMouseClick = clickable icons_11.png\nKeyPress = apple\nKeyPress = enter\n```\n\nnot\n\n```\nAutonomy_Tasks:\n\nMouseClick = clickable icons_11.png\nKeyPress = a\nKeyPress = p\nKeyPress = p\nKeyPress = l\nKeyPress = e\nKeyPress = enter\n```\n\nDon't write any code yet.",
"Got it! We'll handle keystrokes in a way that allows the LLM to send a sequence of characters as a single command. This will make it more efficient and intuitive.\n\n### Updated Format for Keystroke Commands\n\nWe'll use the following format for keystroke commands:\n\n```\nAutonomy_Tasks:\n\nMouseClick = clickable icons_11.png\nKeyPress = apple\nKeyPress = enter\n```\n\nIn this format:\n- `MouseClick` is used for mouse click actions.\n- `KeyPress` is used for keystroke actions.\n- The value after `KeyPress` specifies the sequence of characters to be typed (e.g., `apple` for the word \"apple\", `enter` for the Enter key).\n\n### Supported Keys\n\nWe can support a wide range of keys, including:\n- Alphanumeric keys (`a`, `b`, `c`, ..., `0`, `1`, `2`, ...)\n- Special keys (`enter`, `space`, `backspace`, `tab`, `escape`, `shift`, `ctrl`, `alt`)\n- Function keys (`f1`, `f2`, ..., `f12`)\n\n### Error Handling\n\nWe'll add error handling to ensure that the extension continues to function even if the LLM's response contains extraneous text or unsupported keys.\n\n### Implementation Plan\n\n1. **Modify the `execute_tasks` function** to handle keystroke tasks.\n2. **Create a new function `press_key`** that performs the keystroke action.\n3. **Update the `output_modifier` function** to parse and execute keystroke commands.\n\n### Example LLM Response\n\nHere's an example of how the LLM might format its response to include keystroke commands:\n\n```\nAutonomy_Tasks:\n\nMouseClick = clickable icons_11.png\nKeyPress = apple\nKeyPress = enter\n```\n\n### Next Steps\n\nOnce we agree on the format and the list of supported keys, I can help you draft the code for these changes. Does this plan sound good to you? If you have any other ideas or preferences, please let me know!"
],
[
"We need to do it a little differently, because the LLM is likely to type the word \"enter\" at some point and we want to make sure the extension can differentiate between the two circumstances. I was thinking about this \" about we do this instead, we have a CharKeyPress and an InputKeyPress\" but think I started to think about how it would break maybe.\n\nWe need something to handle these types of cases:\n\n1. Where the LLM is typing text into a text field: This is a full sentence.\n2. Where the LLM is typing text and formatting into a text field: This is a full sentence. /n This is another full sentence.\n3. Where the LLM needs to type a special or function key: Enter or enter would mean to send the enter key command if the LLM needs to press a function key, it types out F3\n4. Where the LLM needs to send out key combinations: ctrl+alt+del \n\n",
"I understand your concerns. We need a robust way to handle different types of keystroke commands, including text input, special keys, function keys, and key combinations. Let's define a clear format for each type of command to ensure the extension can differentiate between them.\n\n### Proposed Format for Keystroke Commands\n\n1. **Text Input**: For typing full sentences or formatted text.\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n TextInput = This is a full sentence.\n TextInput = This is another full sentence.\n ```\n\n2. **Special Keys**: For special keys like Enter, Backspace, etc.\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n SpecialKey = enter\n SpecialKey = backspace\n ```\n\n3. **Function Keys**: For function keys like F1, F2, etc.\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n FunctionKey = f3\n ```\n\n4. **Key Combinations**: For key combinations like Ctrl+Alt+Del.\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n KeyCombo = ctrl+alt+del\n ```\n\n### Supported Keys\n\n- **Text Input**: Any sequence of characters.\n- **Special Keys**: `enter`, `space`, `backspace`, `tab`, `escape`, `shift`, `ctrl`, `alt`.\n- **Function Keys**: `f1`, `f2`, ..., `f12`.\n- **Key Combinations**: Combinations of special keys and alphanumeric keys (e.g., `ctrl+alt+del`, `ctrl+c`, `ctrl+v`).\n\n### Error Handling\n\nWe'll add error handling to ensure that the extension continues to function even if the LLM's response contains extraneous text or unsupported keys.\n\n### Implementation Plan\n\n1. **Modify the `execute_tasks` function** to handle different types of keystroke tasks.\n2. **Create new functions** for each type of keystroke action (`text_input`, `special_key`, `function_key`, `key_combo`).\n3. **Update the `output_modifier` function** to parse and execute keystroke commands.\n\n### Example LLM Response\n\nHere's an example of how the LLM might format its response to include various keystroke commands:\n\n```\nAutonomy_Tasks:\n\nMouseClick = clickable icons_11.png\nTextInput = This is a full sentence.\nTextInput = This is another full sentence.\nSpecialKey = enter\nFunctionKey = f3\nKeyCombo = ctrl+alt+del\n```\n\n### Next Steps\n\nOnce we agree on the format and the list of supported keys, I can help you draft the code for these changes. Does this plan sound good to you? If you have any other ideas or preferences, please let me know!"
],
[
"Okay this looks fantastic! Except for one little thing, let's focus on the TextInput. I would expect the LLM to format it's input to the input field as \n```\nThis is a full sentence. /n This is another full sentence.\n```\nI hope you can see the / and n. so when it is pasted into the text field it is formatted like:\n```\nThis is a full sentence.\nThis is another full sentence.\n```",
"Got it! We'll ensure that the `TextInput` command handles newline characters correctly. The LLM will format its input with `/n` to represent newlines, and the extension will convert these to actual newline characters when typing the text.\n\n### Updated Format for Keystroke Commands\n\n1. **Text Input**: For typing full sentences or formatted text.\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n TextInput = This is a full sentence. /n This is another full sentence.\n ```\n\n2. **Special Keys**: For special keys like Enter, Backspace, etc.\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n SpecialKey = enter\n SpecialKey = backspace\n ```\n\n3. **Function Keys**: For function keys like F1, F2, etc.\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n FunctionKey = f3\n ```\n\n4. **Key Combinations**: For key combinations like Ctrl+Alt+Del.\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n KeyCombo = ctrl+alt+del\n ```\n\n### Supported Keys\n\n- **Text Input**: Any sequence of characters, with `/n` representing newline characters.\n- **Special Keys**: `enter`, `space`, `backspace`, `tab`, `escape`, `shift`, `ctrl`, `alt`.\n- **Function Keys**: `f1`, `f2`, ..., `f12`.\n- **Key Combinations**: Combinations of special keys and alphanumeric keys (e.g., `ctrl+alt+del`, `ctrl+c`, `ctrl+v`).\n\n### Error Handling\n\nWe'll add error handling to ensure that the extension continues to function even if the LLM's response contains extraneous text or unsupported keys.\n\n### Implementation Plan\n\n1. **Modify the `execute_tasks` function** to handle different types of keystroke tasks.\n2. **Create new functions** for each type of keystroke action (`text_input`, `special_key`, `function_key`, `key_combo`).\n3. **Update the `output_modifier` function** to parse and execute keystroke commands.\n\n### Example LLM Response\n\nHere's an example of how the LLM might format its response to include various keystroke commands:\n\n```\nAutonomy_Tasks:\n\nMouseClick = clickable icons_11.png\nTextInput = This is a full sentence. /n This is another full sentence.\nSpecialKey = enter\nFunctionKey = f3\nKeyCombo = ctrl+alt+del\n```\n\n### Next Steps\n\nOnce we agree on the format and the list of supported keys, I can help you draft the code for these changes. Does this plan sound good to you? If you have any other ideas or preferences, please let me know!"
],
[
"Awesome! Yes that looks great :3 Please feel free to do what you need to do.",
"Great! Let's proceed with the implementation. We'll modify the `execute_tasks` function to handle different types of keystroke tasks, create new functions for each type of keystroke action, and update the `output_modifier` function to parse and execute keystroke commands.\n\nHere's the updated code:\n\n### Step 1: Modify the `execute_tasks` function\n\nWe'll update the `execute_tasks` function to handle different types of keystroke tasks.\n\n```python\ndef execute_tasks(tasks_section):\n \"\"\"\n Executes the tasks listed in the tasks_section.\n \"\"\"\n tasks = tasks_section.split(\"\\n\")\n for task in tasks:\n task = task.strip()\n if task.startswith(\"MouseMover =\"):\n image_name = task.split(\"=\")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith(\"MouseClick =\"):\n image_name = task.split(\"=\")[1].strip()\n left_click_image(image_name)\n elif task.startswith(\"TextInput =\"):\n text = task.split(\"=\")[1].strip()\n text_input(text)\n elif task.startswith(\"SpecialKey =\"):\n key = task.split(\"=\")[1].strip()\n special_key(key)\n elif task.startswith(\"FunctionKey =\"):\n key = task.split(\"=\")[1].strip()\n function_key(key)\n elif task.startswith(\"KeyCombo =\"):\n combo = task.split(\"=\")[1].strip()\n key_combo(combo)\n else:\n print(f\"Unknown task: {task}\")\n```\n\n### Step 2: Create new functions for keystroke actions\n\nWe'll create new functions for each type of keystroke action (`text_input`, `special_key`, `function_key`, `key_combo`).\n\n```python\ndef text_input(text):\n \"\"\"\n Types the given text, replacing '/n' with newline characters.\n \"\"\"\n text = text.replace(\"/n\", \"\\n\")\n pyautogui.typewrite(text)\n\ndef special_key(key):\n \"\"\"\n Presses the specified special key.\n \"\"\"\n if key == \"enter\":\n pyautogui.press(\"enter\")\n elif key == \"space\":\n pyautogui.press(\"space\")\n elif key == \"backspace\":\n pyautogui.press(\"backspace\")\n elif key == \"tab\":\n pyautogui.press(\"tab\")\n elif key == \"escape\":\n pyautogui.press(\"esc\")\n elif key == \"shift\":\n pyautogui.press(\"shift\")\n elif key == \"ctrl\":\n pyautogui.press(\"ctrl\")\n elif key == \"alt\":\n pyautogui.press(\"alt\")\n else:\n print(f\"Unsupported special key: {key}\")\n\ndef function_key(key):\n \"\"\"\n Presses the specified function key.\n \"\"\"\n if key.startswith(\"f\") and key[1:].isdigit():\n pyautogui.press(key)\n else:\n print(f\"Unsupported function key: {key}\")\n\ndef key_combo(combo):\n \"\"\"\n Presses the specified key combination.\n \"\"\"\n keys = combo.split(\"+\")\n for key in keys:\n if key in [\"ctrl\", \"alt\", \"shift\"]:\n pyautogui.keyDown(key)\n else:\n pyautogui.press(key)\n for key in keys:\n if key in [\"ctrl\", \"alt\", \"shift\"]:\n pyautogui.keyUp(key)\n```\n\n### Step 3: Update the `output_modifier` function\n\nWe'll update the `output_modifier` function to parse and execute keystroke commands.\n\n```python\ndef output_modifier(output, state):\n \"\"\"\n Modifies the LLM output before it is presented in the UI.\n \"\"\"\n # Search for the \"Autonomy_Tasks:\" trigger phrase in the LLM's output\n if \"Autonomy_Tasks:\" in output:\n # Clean the chat history\n state[\"history\"] = history_modifier(state[\"history\"])\n # Extract tasks and execute them\n tasks_section = output.split(\"Autonomy_Tasks:\")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n with 
open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n return output\n```\n\n### Putting it all together\n\nHere's the updated code with all the changes:\n\n```python\nimport gradio as gr\nimport json\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport torch\nimport os\nimport shutil\nimport re # Import the re module for regular expression operations\n\n# Configurable variables\nMONITOR_INDEX = 1 # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nTEXT_QUERIES = \"clickable hyperlinks, clickable icons, text bar field\" # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.20 # Score threshold for filtering low-probability predictions\nVISION_MODEL_QUESTION = \"What is the text in this image or what does this image describe?\"\n\n# Use GPU if available\nif torch.cuda.is_available():\n device = torch.device(\"cuda\")\nelse:\n device = torch.device(\"cpu\")\n\n# Global variables for models\nowlv2_model = None\nowlv2_processor = None\nminicpm_llama_model = None\nminicpm_llama_tokenizer = None\n\nVISION_MODEL_ID = \"/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5\"\n\ndef load_owlv2_model():\n global owlv2_model, owlv2_processor\n owlv2_model = Owlv2ForObjectDetection.from_pretrained(\"google/owlv2-base-patch16-ensemble\", trust_remote_code=True).to(device)\n owlv2_processor = Owlv2Processor.from_pretrained(\"google/owlv2-base-patch16-ensemble\", trust_remote_code=True)\n print(\"Owlv2 model loaded.\")\n\ndef unload_owlv2_model():\n global owlv2_model, owlv2_processor\n if owlv2_model is not None:\n del owlv2_model\n del owlv2_processor\n torch.cuda.empty_cache()\n print(\"Owlv2 model unloaded.\")\n\ndef load_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if \"int4\" in VISION_MODEL_ID:\n # Load the 4-bit quantized model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={\"\": device},\n trust_remote_code=True,\n torch_dtype=torch.float16 # Use float16 as per the example code for 4-bit models\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print(\"MiniCPM-Llama3 4-bit quantized model loaded on-demand.\")\n else:\n # Load the standard model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={\"\": device}, # Use the specified CUDA device\n trust_remote_code=True,\n torch_dtype=torch.bfloat16\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print(\"MiniCPM-Llama3 standard model loaded on-demand.\")\n\ndef unload_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if minicpm_llama_model is not None:\n del minicpm_llama_model\n del minicpm_llama_tokenizer\n torch.cuda.empty_cache()\n print(\"MiniCPM-Llama3 model unloaded.\")\n\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, output_filename=\"screenshot.png\"):\n \"\"\"\n Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n Args:\n monitor_index (int): The index of the monitor to capture.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering 
low-probability predictions.\n vision_model_question (str): The question to ask the vision model.\n output_filename (str): The filename to save the screenshot as (default is \"screenshot.png\").\n \"\"\"\n monitor_index = int(monitor_index)\n score_threshold = float(score_threshold)\n\n # Get information about all monitors\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f\"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.\")\n\n # Get the region of the specified monitor\n monitor = monitors[monitor_index]\n region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n # Take the screenshot\n output_dir = \"extensions/Lucid_Autonomy\"\n os.makedirs(output_dir, exist_ok=True)\n screenshot_path = os.path.join(output_dir, output_filename)\n screenshot = pyautogui.screenshot(region=region)\n screenshot.save(screenshot_path)\n print(f\"Screenshot saved as {screenshot_path}\")\n\n # Process the image using the Owlv2 model\n annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n # Create the output directory if it doesn't exist\n output_dir = \"extensions/Lucid_Autonomy/ImageOutputTest\"\n os.makedirs(output_dir, exist_ok=True)\n\n # Clear the output directory\n for filename in os.listdir(output_dir):\n file_path = os.path.join(output_dir, filename)\n try:\n if os.path.isfile(file_path) or os.path.islink(file_path):\n os.unlink(file_path)\n elif os.path.isdir(file_path):\n shutil.rmtree(file_path)\n except Exception as e:\n print(f\"Failed to delete {file_path}. Reason: {e}\")\n\n # Crop the detected objects and save them to the output directory\n crop_images(annotated_image_path, json_output_path, output_dir)\n\n print(\"Processing complete. 
Check the output directory for the cropped images and results.\")\n\ndef query_image(img_path, text_queries, score_threshold):\n \"\"\"\n Processes the image using the Owlv2 model to detect objects based on text queries.\n\n Args:\n img_path (str): The path to the input image.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n \"\"\"\n load_owlv2_model()\n\n img = Image.open(img_path)\n text_queries = text_queries.split(\",\")\n\n size = max(img.size)\n target_sizes = torch.Tensor([[size, size]])\n inputs = owlv2_processor(text=text_queries, images=img, return_tensors=\"pt\").to(device)\n\n with torch.no_grad():\n outputs = owlv2_model(**inputs)\n\n outputs.logits = outputs.logits.cpu()\n outputs.pred_boxes = outputs.pred_boxes.cpu()\n results = owlv2_processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n boxes, scores, labels = results[0][\"boxes\"], results[0][\"scores\"], results[0][\"labels\"]\n\n result_labels = []\n img_pil = img.copy()\n draw = ImageDraw.Draw(img_pil)\n\n for box, score, label in zip(boxes, scores, labels):\n box = [int(i) for i in box.tolist()]\n if score < score_threshold:\n continue\n result_labels.append([box, text_queries[label.item()]])\n draw.rectangle(box, outline=\"red\", width=2)\n # Do not draw the text label within the bounding box\n # draw.text((box[0], box[1]), text_queries[label.item()], fill=\"red\")\n\n # Save the annotated image for debugging\n annotated_image_path = \"extensions/Lucid_Autonomy/annotated_image.png\"\n img_pil.save(annotated_image_path)\n\n # Save the JSON output to a file\n json_output_path = \"extensions/Lucid_Autonomy/output.json\"\n with open(json_output_path, 'w') as f:\n json.dump(result_labels, f)\n\n unload_owlv2_model()\n\n return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir=\"extensions/Lucid_Autonomy/ImageOutputTest\"):\n \"\"\"\n Crops out the detected objects from the input image based on the bounding box coordinates.\n\n Args:\n image_path (str): The path to the input image.\n json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n output_dir (str): The directory to save the cropped images (default is \"extensions/Lucid_Autonomy/ImageOutputTest\").\n \"\"\"\n with open(json_output_path, 'r') as f:\n data = json.load(f)\n\n img = Image.open(image_path)\n\n results = []\n\n # Load the miniCPM-lama3-v-2_5 model once before processing the images\n load_minicpm_llama_model()\n\n for i, item in enumerate(data):\n box = item[0]\n label = item[1]\n cropped_img = img.crop(box)\n cropped_img_path = f\"{output_dir}/{label}_{i}.png\"\n cropped_img.save(cropped_img_path)\n\n # Process the cropped image with the miniCPM-lama3-v-2_5 model\n vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n # Store the results in a structured format\n results.append({\n \"image_name\": f\"{label}_{i}.png\",\n \"coordinates\": box,\n \"description\": vision_model_response\n })\n\n # Unload the miniCPM-lama3-v-2_5 model after processing all images\n unload_minicpm_llama_model()\n\n # Save the results to a JSON file\n results_json_path = f\"{output_dir}/results.json\"\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n\n # Create the compressed results JSON file\n compressed_results_json_path = f\"{output_dir}/compressedresults.json\"\n 
compressed_results = [{\"image_name\": result[\"image_name\"], \"description\": result[\"description\"]} for result in results]\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n \"\"\"\n Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n Args:\n image_path (str): The path to the input image.\n question (str): The question to ask the vision model.\n\n Returns:\n str: The response from the vision model.\n \"\"\"\n image = Image.open(image_path).convert(\"RGB\")\n messages = [\n {\"role\": \"user\", \"content\": f\"{question}\"}\n ]\n # Define the generation arguments\n generation_args = {\n \"max_new_tokens\": 896,\n \"repetition_penalty\": 1.05,\n \"num_beams\": 3,\n \"top_p\": 0.8,\n \"top_k\": 1,\n \"temperature\": 0.7,\n \"sampling\": True,\n }\n if \"int4\" in \"openbmb/MiniCPM-Llama3-V-2_5\":\n # Disable streaming for the 4-bit model\n generation_args[\"stream\"] = False\n # Use the model.chat method with streaming enabled\n vision_model_response = \"\"\n for new_text in minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n ):\n vision_model_response += new_text\n print(new_text, flush=True, end='')\n else:\n minicpm_llama_model.to(device)\n vision_model_response = minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n )\n return vision_model_response\n\ndef ui():\n \"\"\"\n Creates custom gradio elements when the UI is launched.\n \"\"\"\n with gr.Blocks() as demo:\n with gr.Row():\n with gr.Column():\n monitor_index = gr.Textbox(label=\"Monitor Index\", value=str(MONITOR_INDEX))\n text_queries = gr.Textbox(label=\"Text Queries\", value=TEXT_QUERIES)\n score_threshold = gr.Textbox(label=\"Score Threshold\", value=str(SCORE_THRESHOLD))\n vision_model_question = gr.Textbox(label=\"Vision Model Question\", value=VISION_MODEL_QUESTION)\n take_screenshot_button = gr.Button(value=\"Take Screenshot\")\n\n take_screenshot_button.click(\n fn=take_screenshot,\n inputs=[monitor_index, text_queries, score_threshold, vision_model_question],\n outputs=None\n )\n\n return demo\n\ndef input_modifier(user_input, state):\n \"\"\"\n Modifies the user input before it is processed by the LLM.\n \"\"\"\n return user_input\n\ndef output_modifier(output, state):\n \"\"\"\n Modifies the LLM output before it is presented in the UI.\n \"\"\"\n # Search for the \"Autonomy_Tasks:\" trigger phrase in the LLM's output\n if \"Autonomy_Tasks:\" in output:\n # Clean the chat history\n state[\"history\"] = history_modifier(state[\"history\"])\n # Extract tasks and execute them\n tasks_section = output.split(\"Autonomy_Tasks:\")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n return output\n\ndef execute_tasks(tasks_section):\n \"\"\"\n Executes the tasks listed in the tasks_section.\n \"\"\"\n tasks = tasks_section.split(\"\\n\")\n for task in tasks:\n task = task.strip()\n if task.startswith(\"MouseMover =\"):\n image_name = task.split(\"=\")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith(\"MouseClick =\"):\n image_name = 
task.split(\"=\")[1].strip()\n left_click_image(image_name)\n elif task.startswith(\"TextInput =\"):\n text = task.split(\"=\")[1].strip()\n text_input(text)\n elif task.startswith(\"SpecialKey =\"):\n key = task.split(\"=\")[1].strip()\n special_key(key)\n elif task.startswith(\"FunctionKey =\"):\n key = task.split(\"=\")[1].strip()\n function_key(key)\n elif task.startswith(\"KeyCombo =\"):\n combo = task.split(\"=\")[1].strip()\n key_combo(combo)\n else:\n print(f\"Unknown task: {task}\")\n\ndef move_mouse_to_image(image_name):\n \"\"\"\n Moves the mouse to the center of the box that defines the image.\n \"\"\"\n results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/results.json\"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result[\"image_name\"] == image_name:\n coordinates = result[\"coordinates\"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n break\n else:\n print(f\"Image name '{image_name}' not found in the results.\")\n\ndef left_click_image(image_name):\n \"\"\"\n Moves the mouse to the center of the box that defines the image and performs a left click.\n \"\"\"\n results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/results.json\"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result[\"image_name\"] == image_name:\n coordinates = result[\"coordinates\"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n pyautogui.click() # Perform a left click\n break\n else:\n print(f\"Image name '{image_name}' not found in the results.\")\n\ndef move_mouse_to_coordinates(monitor, coordinates):\n \"\"\"\n Moves the mouse cursor to the specified coordinates on the monitor.\n\n Args:\n monitor (dict): A dictionary containing the monitor's information.\n coordinates (list): A list containing the coordinates [x1, y1, x2, y2].\n \"\"\"\n x1, y1, x2, y2 = coordinates\n monitor_x, monitor_y = monitor.x, monitor.y\n center_x = monitor_x + (x1 + x2) // 2\n center_y = monitor_y + (y1 + y2) // 2\n pyautogui.moveTo(center_x, center_y)\n print(f\"Mouse moved to coordinates: ({center_x}, {center_y})\")\n\ndef get_monitor_info(monitor_index):\n \"\"\"\n Gets information about the specified monitor.\n\n Args:\n monitor_index (int): The index of the monitor to get information about.\n\n Returns:\n dict: A dictionary containing the monitor's information.\n \"\"\"\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f\"Monitor index {monitor_index} is out of range. 
There are only {len(monitors)} monitors.\")\n\n return monitors[monitor_index]\n\ndef history_modifier(history):\n \"\"\"\n Modifies the chat history before the text generation in chat mode begins.\n \"\"\"\n # Clean the internal history\n history[\"internal\"] = clean_chat_history(history[\"internal\"])\n # Clean the visible history\n history[\"visible\"] = clean_chat_history(history[\"visible\"])\n\n # Inject the latest compressedresults.json into the internal history\n compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n latest_compressed_results = json.dumps(compressed_results, indent=4)\n history[\"internal\"].append([\"\", f\"<START_COMPRESSED_RESULTS>\\\\n{latest_compressed_results}\\\\n<END_COMPRESSED_RESULTS>\"])\n\n return history\n\ndef clean_chat_history(history):\n \"\"\"\n Cleans the chat history by removing all but the latest instance of the contents between the bookends.\n \"\"\"\n START_BOOKEND = \"<START_COMPRESSED_RESULTS>\"\n END_BOOKEND = \"<END_COMPRESSED_RESULTS>\"\n\n # Find all instances of the bookends\n bookend_indices = []\n for i, entry in enumerate(history):\n if isinstance(entry, list) and len(entry) == 2:\n if START_BOOKEND in entry[1] and END_BOOKEND in entry[1]:\n bookend_indices.append(i)\n\n # If there are multiple instances, remove all but the last one\n if len(bookend_indices) > 1:\n for i in bookend_indices[:-1]:\n history[i][1] = re.sub(f\"{START_BOOKEND}.*?{END_BOOKEND}\", \"\", history[i][1], flags=re.DOTALL)\n\n return history\n\ndef append_compressed_results(output, compressed_results):\n \"\"\"\n Appends the compressed results to the output.\n \"\"\"\n START_BOOKEND = \"<START_COMPRESSED_RESULTS>\"\n END_BOOKEND = \"<END_COMPRESSED_RESULTS>\"\n output += f\"\\\\n\\\\n{START_BOOKEND}\\\\n{json.dumps(compressed_results, indent=4)}\\\\n{END_BOOKEND}\"\n return output\n\ndef text_input(text):\n \"\"\"\n Types the given text, replacing '/n' with newline characters.\n \"\"\"\n text = text.replace(\"/n\", \"\\n\")\n pyautogui.typewrite(text)\n\ndef special_key(key):\n \"\"\"\n Presses the specified special key.\n \"\"\"\n if key == \"enter\":\n pyautogui.press(\"enter\")\n elif key == \"space\":\n pyautogui.press(\"space\")\n elif key == \"backspace\":\n pyautogui.press(\"backspace\")\n elif key == \"tab\":\n pyautogui.press(\"tab\")\n elif key == \"escape\":\n pyautogui.press(\"esc\")\n elif key == \"shift\":\n pyautogui.press(\"shift\")\n elif key == \"ctrl\":\n pyautogui.press(\"ctrl\")\n elif key == \"alt\":\n pyautogui.press(\"alt\")\n else:\n print(f\"Unsupported special key: {key}\")\n\ndef function_key(key):\n \"\"\"\n Presses the specified function key.\n \"\"\"\n if key.startswith(\"f\") and key[1:].isdigit():\n pyautogui.press(key)\n else:\n print(f\"Unsupported function key: {key}\")\n\ndef key_combo(combo):\n \"\"\"\n Presses the specified key combination.\n \"\"\"\n keys = combo.split(\"+\")\n for key in keys:\n if key in [\"ctrl\", \"alt\", \"shift\"]:\n pyautogui.keyDown(key)\n else:\n pyautogui.press(key)\n for key in keys:\n if key in [\"ctrl\", \"alt\", \"shift\"]:\n pyautogui.keyUp(key)\n\nif __name__ == \"__main__\":\n # Launch the Gradio interface with the specified UI components\n gr.Interface(\n fn=ui,\n inputs=None,\n outputs=\"ui\"\n ).launch()\n```\n\nThis updated code includes the new functions for handling different types of keystroke actions and updates the `execute_tasks` function to 
handle these new commands. The `output_modifier` function is also updated to parse and execute keystroke commands."
],
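The exchange above settles on a plain-text task convention of the form `ToolName = value`, one task per line, dispatched by string prefix. As a minimal sketch of that parsing convention only — no pyautogui, print-only stand-in handlers with hypothetical names, nothing beyond the standard library:

```python
# Minimal sketch of the "Tool = value" task convention discussed above.
# Handlers are print-only stand-ins; the real extension calls pyautogui.
def parse_task(line: str):
    """Split one task line into (tool, argument); argument is None for bare tools."""
    line = line.strip()
    if "=" in line:
        tool, _, arg = line.partition("=")
        return tool.strip(), arg.strip()
    return line, None

def dispatch(line: str):
    tool, arg = parse_task(line)
    handlers = {
        "TextInput": lambda a: print(f"type: {a!r}"),
        "SpecialKey": lambda a: print(f"press: {a}"),
        "FunctionKey": lambda a: print(f"press: {a}"),
        "KeyCombo": lambda a: print(f"combo: {a.split('+')}"),
    }
    if tool in handlers and arg is not None:
        handlers[tool](arg)
    else:
        print(f"Unknown task: {line}")

if __name__ == "__main__":
    for t in ["TextInput = https://www.google.com", "SpecialKey = enter", "KeyCombo = ctrl+shift+t"]:
        dispatch(t)
```

Using `str.partition` in the sketch keeps values that themselves contain an equals sign (for example a URL with query parameters) intact, whereas `split("=")[1]` would drop everything after a second `=`.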
[
"OMG! Frick you are good, I had to make one change:\n\nI needed to change:\n\ndef execute_tasks(tasks_section):\n \"\"\"\n Executes the tasks listed in the tasks_section.\n \"\"\"\n tasks = tasks_section.split(\"\\\\n\")\n for task in tasks:\n task = task.strip()\n\nto \n\ndef execute_tasks(tasks_section):\n \"\"\"\n Executes the tasks listed in the tasks_section.\n \"\"\"\n tasks = tasks_section.split(\"\\n\")\n for task in tasks:\n task = task.strip()\n\nSee how I removed that backslash? The llm uses a single \\n for a new line not \\\\n\n\nBut wow holy OMG fricking amazing! Let's add more tools!! Okay I want to add the following three tools:\n\n1. The ability for the LLM to add a delay between tools, have the units in milliseconds\n2. The ability to add raw coordinates, I want the LLM to be able to input raw coordinates, in a format where it is two numbers with a comma and no space so like 193,136. I want the LLM to be able to be able to move the mouse to those coordinates and I want the LLM to have the ability to click on those coordinates, similarly to have we have MouseMover and MouseClick\n3. I want the LLM to take screenshots on its own, so the LLM can add a take screenshot, but only to the end of its task list.\n\nDon't rewrite all the code unless there are a lot of changes, if we need to add functions or change functions I think I can handle doing those changes to the main code.",
"Great to hear that the implementation worked well with just a minor adjustment! Let's add the new tools as you described:\n\n1. **Delay Tool**: The LLM can add a delay between tasks, with the units in milliseconds.\n2. **Raw Coordinates Tool**: The LLM can input raw coordinates to move the mouse and click.\n3. **Screenshot Tool**: The LLM can take a screenshot, but only at the end of its task list.\n\n### Updated Format for Tasks\n\n1. **Delay Tool**:\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n Delay = 1000\n ```\n\n2. **Raw Coordinates Tool**:\n ```\n Autonomy_Tasks:\n\n MouseMove = 193,136\n MouseClick = 193,136\n ```\n\n3. **Screenshot Tool**:\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n TakeScreenshot\n ```\n\n### Implementation Plan\n\n1. **Modify the `execute_tasks` function** to handle the new tools.\n2. **Create new functions** for the delay and raw coordinates tools.\n3. **Update the `output_modifier` function** to parse and execute the new commands.\n\n### Step 1: Modify the `execute_tasks` function\n\nWe'll update the `execute_tasks` function to handle the new tools.\n\n```python\ndef execute_tasks(tasks_section):\n \"\"\"\n Executes the tasks listed in the tasks_section.\n \"\"\"\n tasks = tasks_section.split(\"\\n\")\n for task in tasks:\n task = task.strip()\n if task.startswith(\"MouseMover =\"):\n image_name = task.split(\"=\")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith(\"MouseClick =\"):\n image_name = task.split(\"=\")[1].strip()\n left_click_image(image_name)\n elif task.startswith(\"TextInput =\"):\n text = task.split(\"=\")[1].strip()\n text_input(text)\n elif task.startswith(\"SpecialKey =\"):\n key = task.split(\"=\")[1].strip()\n special_key(key)\n elif task.startswith(\"FunctionKey =\"):\n key = task.split(\"=\")[1].strip()\n function_key(key)\n elif task.startswith(\"KeyCombo =\"):\n combo = task.split(\"=\")[1].strip()\n key_combo(combo)\n elif task.startswith(\"Delay =\"):\n delay = int(task.split(\"=\")[1].strip())\n delay_task(delay)\n elif task.startswith(\"MouseMove =\"):\n coordinates = task.split(\"=\")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task.startswith(\"MouseClick =\"):\n coordinates = task.split(\"=\")[1].strip()\n left_click_coordinates_raw(coordinates)\n elif task == \"TakeScreenshot\":\n take_screenshot_task()\n else:\n print(f\"Unknown task: {task}\")\n```\n\n### Step 2: Create new functions for the delay and raw coordinates tools\n\nWe'll create new functions for the delay and raw coordinates tools.\n\n```python\nimport time\n\ndef delay_task(delay):\n \"\"\"\n Adds a delay in milliseconds.\n \"\"\"\n time.sleep(delay / 1000.0)\n\ndef move_mouse_to_coordinates_raw(coordinates):\n \"\"\"\n Moves the mouse to the specified raw coordinates.\n \"\"\"\n x, y = map(int, coordinates.split(\",\"))\n pyautogui.moveTo(x, y)\n print(f\"Mouse moved to coordinates: ({x}, {y})\")\n\ndef left_click_coordinates_raw(coordinates):\n \"\"\"\n Moves the mouse to the specified raw coordinates and performs a left click.\n \"\"\"\n x, y = map(int, coordinates.split(\",\"))\n pyautogui.moveTo(x, y)\n pyautogui.click()\n print(f\"Mouse clicked at coordinates: ({x}, {y})\")\n\ndef take_screenshot_task():\n \"\"\"\n Takes a screenshot.\n \"\"\"\n take_screenshot(MONITOR_INDEX, TEXT_QUERIES, SCORE_THRESHOLD, VISION_MODEL_QUESTION)\n```\n\n### Step 3: Update the `output_modifier` function\n\nWe'll update the `output_modifier` function to parse and execute the new 
commands.\n\n```python\ndef output_modifier(output, state):\n \"\"\"\n Modifies the LLM output before it is presented in the UI.\n \"\"\"\n # Search for the \"Autonomy_Tasks:\" trigger phrase in the LLM's output\n if \"Autonomy_Tasks:\" in output:\n # Clean the chat history\n state[\"history\"] = history_modifier(state[\"history\"])\n # Extract tasks and execute them\n tasks_section = output.split(\"Autonomy_Tasks:\")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n return output\n```\n\n### Putting it all together\n\nHere are the changes you need to make to the main code:\n\n1. **Update the `execute_tasks` function** to handle the new tools.\n2. **Add the new functions** for the delay and raw coordinates tools.\n3. **Update the `output_modifier` function** to parse and execute the new commands.\n\nYou can integrate these changes into your main code as needed. If you have any questions or need further assistance, feel free to ask!"
],
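The three tools added above reduce to two small building blocks: turning an `x,y` pair such as `193,136` into integer coordinates, and converting a millisecond value into a `time.sleep` call. A self-contained sketch of just those pieces, assuming only the standard library (the function names are illustrative, not the extension's):

```python
# Sketch of the raw-coordinate and millisecond-delay tools described above.
# Only the parsing/timing is shown, so it runs without pyautogui or a display.
import time

def parse_xy(value: str) -> tuple[int, int]:
    """Parse '193,136' (two integers separated by a comma) into (x, y)."""
    x_str, y_str = value.split(",")
    return int(x_str), int(y_str)

def delay_ms(value: str) -> None:
    """Sleep for the given number of milliseconds."""
    time.sleep(int(value) / 1000.0)

if __name__ == "__main__":
    print(parse_xy("193,136"))   # (193, 136)
    delay_ms("250")              # quarter-second pause
```

The screenshot tool needs no new parsing of its own; it is a bare `TakeScreenshot` line with no `=` argument, placed at the end of the task list.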
[
"Okay, holy shit that worked perfectly!! You are so good at this <3 Alright are you ready to implement my super secret idea? I need you and I to be on the same page here so buckle up buddy XD\n\nOkay so I can ask the LLM questions like this:\n\n```\nOkay, you can add delays between tools with the value being milliseconds, the tool is called \"Delay\". Can you move the mouse to the best buy logo, wait 5 seconds, and then click on the text entry field and type the url for google, then press enter, and then wait 2 seconds?\n```\nand it will respond just the way we want:\n\n```\nAutonomy_Tasks:\nMouseMover = clickable icons_11.png\nDelay = 5000\nMouseClick = text bar field_0.png\nTextInput = https://www.google.com\nSpecialKey = enter\n```\nWhich is just absolutely amazing, but all of the commands need to be executed before the LLM reply completes. I hope that is not too much of a surprise, it is what I would have expected. There is a little animation that plays when the LLM is \"busy\" and it is playing the whole time the tasks are being executed even through the delays. All well and nice and nothing I want to change.\n\nBut I want to add something like \"Out of Body\" commands, where the LLM can set them up in the reply but they are only executed after the LLM has completed their reply.\n\nSo maybe for these instructions they can use the same wording as the current instructions, but be prefaced with OOB for out of body. \n\n```\nAutonomy_Tasks:\nMouseMover = clickable icons_11.png\nDelay = 5000\nOOB_MouseClick = text bar field_0.png\nOOB_TextInput = https://www.google.com\nOOB_SpecialKey = enter\n```\n\nSo in this example after the 5 second delay is over, the LLM's reply completes (the OOB instructions don't do anything while the LLM is replying but are saved some how) and then the last three tasks are completed afterwards. Does this make sense? We can have OOB_ tasks anywhere in the list of tasks, we just want to essentially filter them out and save them for later.",
"Yes, that makes perfect sense! We can implement \"Out of Body\" (OOB) commands that are executed after the LLM has completed its reply. We'll filter out the OOB tasks, save them, and execute them after the LLM's reply is complete.\n\n### Implementation Plan\n\n1. **Modify the `execute_tasks` function** to filter out OOB tasks and save them for later execution.\n2. **Create a new function** to execute the saved OOB tasks after the LLM's reply is complete.\n3. **Update the `output_modifier` function** to handle the execution of OOB tasks.\n\n### Step 1: Modify the `execute_tasks` function\n\nWe'll update the `execute_tasks` function to filter out OOB tasks and save them for later execution.\n\n```python\noob_tasks = []\n\ndef execute_tasks(tasks_section):\n \"\"\"\n Executes the tasks listed in the tasks_section.\n \"\"\"\n global oob_tasks\n oob_tasks = []\n tasks = tasks_section.split(\"\\n\")\n for task in tasks:\n task = task.strip()\n if task.startswith(\"MouseMover =\"):\n image_name = task.split(\"=\")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith(\"MouseClick =\"):\n image_name = task.split(\"=\")[1].strip()\n left_click_image(image_name)\n elif task.startswith(\"TextInput =\"):\n text = task.split(\"=\")[1].strip()\n text_input(text)\n elif task.startswith(\"SpecialKey =\"):\n key = task.split(\"=\")[1].strip()\n special_key(key)\n elif task.startswith(\"FunctionKey =\"):\n key = task.split(\"=\")[1].strip()\n function_key(key)\n elif task.startswith(\"KeyCombo =\"):\n combo = task.split(\"=\")[1].strip()\n key_combo(combo)\n elif task.startswith(\"Delay =\"):\n delay = int(task.split(\"=\")[1].strip())\n delay_task(delay)\n elif task.startswith(\"MouseMove =\"):\n coordinates = task.split(\"=\")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task.startswith(\"MouseClick =\"):\n coordinates = task.split(\"=\")[1].strip()\n left_click_coordinates_raw(coordinates)\n elif task == \"TakeScreenshot\":\n take_screenshot_task()\n elif task.startswith(\"OOB_\"):\n oob_tasks.append(task)\n else:\n print(f\"Unknown task: {task}\")\n```\n\n### Step 2: Create a new function to execute the saved OOB tasks\n\nWe'll create a new function to execute the saved OOB tasks after the LLM's reply is complete.\n\n```python\ndef execute_oob_tasks():\n \"\"\"\n Executes the saved OOB tasks.\n \"\"\"\n global oob_tasks\n for task in oob_tasks:\n task = task.strip()\n if task.startswith(\"OOB_MouseMover =\"):\n image_name = task.split(\"=\")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith(\"OOB_MouseClick =\"):\n image_name = task.split(\"=\")[1].strip()\n left_click_image(image_name)\n elif task.startswith(\"OOB_TextInput =\"):\n text = task.split(\"=\")[1].strip()\n text_input(text)\n elif task.startswith(\"OOB_SpecialKey =\"):\n key = task.split(\"=\")[1].strip()\n special_key(key)\n elif task.startswith(\"OOB_FunctionKey =\"):\n key = task.split(\"=\")[1].strip()\n function_key(key)\n elif task.startswith(\"OOB_KeyCombo =\"):\n combo = task.split(\"=\")[1].strip()\n key_combo(combo)\n elif task.startswith(\"OOB_Delay =\"):\n delay = int(task.split(\"=\")[1].strip())\n delay_task(delay)\n elif task.startswith(\"OOB_MouseMove =\"):\n coordinates = task.split(\"=\")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task.startswith(\"OOB_MouseClick =\"):\n coordinates = task.split(\"=\")[1].strip()\n left_click_coordinates_raw(coordinates)\n elif task == \"OOB_TakeScreenshot\":\n take_screenshot_task()\n else:\n print(f\"Unknown OOB task: 
{task}\")\n```\n\n### Step 3: Update the `output_modifier` function\n\nWe'll update the `output_modifier` function to handle the execution of OOB tasks after the LLM's reply is complete.\n\n```python\ndef output_modifier(output, state):\n \"\"\"\n Modifies the LLM output before it is presented in the UI.\n \"\"\"\n # Search for the \"Autonomy_Tasks:\" trigger phrase in the LLM's output\n if \"Autonomy_Tasks:\" in output:\n # Clean the chat history\n state[\"history\"] = history_modifier(state[\"history\"])\n # Extract tasks and execute them\n tasks_section = output.split(\"Autonomy_Tasks:\")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n # Execute OOB tasks after the LLM's reply is complete\n execute_oob_tasks()\n return output\n```\n\n### Putting it all together\n\nHere are the changes you need to make to the main code:\n\n1. **Update the `execute_tasks` function** to filter out OOB tasks and save them for later execution.\n2. **Add the new function** to execute the saved OOB tasks.\n3. **Update the `output_modifier` function** to handle the execution of OOB tasks after the LLM's reply is complete.\n\nYou can integrate these changes into your main code as needed. If you have any questions or need further assistance, feel free to ask!"
],
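The OOB idea above is, at its core, a single filtering pass: any task line that starts with `OOB_` is set aside instead of being executed inline, and everything else runs immediately. A standard-library-only sketch of that split, with hypothetical names:

```python
# Sketch of the OOB filtering idea from the exchange above: tasks prefixed
# with "OOB_" are collected for later instead of being executed inline.
def split_tasks(tasks_section: str):
    immediate, deferred = [], []
    for line in tasks_section.splitlines():
        line = line.strip()
        if not line:
            continue
        (deferred if line.startswith("OOB_") else immediate).append(line)
    return immediate, deferred

if __name__ == "__main__":
    block = """MouseMover = clickable icons_11.png
Delay = 5000
OOB_MouseClick = text bar field_0.png
OOB_SpecialKey = enter"""
    now, later = split_tasks(block)
    print(now)    # ['MouseMover = clickable icons_11.png', 'Delay = 5000']
    print(later)  # ['OOB_MouseClick = text bar field_0.png', 'OOB_SpecialKey = enter']
```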
[
"Okay, I had the LLM send me this:\n\nAutonomy_Tasks:\nMouseMover = clickable icons_1.png\nOOB_Delay = 5000\nOOB_MouseClick = clickable hyperlinks_9.png\n\nBut what happened was that there was a delay of 5 seconds before the LLMs response completed finally and it looked like the mouse moved right before the LLM's response finalized too. So I don't think we are running any out of body commands. \n\nLooking at the code it looks like you have \n\nexecute_oob_tasks()\n return output\n\nso does the execute_oob_tasks need to happen before the return is complete? If so this may be our problem. We might need to do some out of the box thinking, we could write the tasks to a .json file, and then have the code do some type of execution 100 ms after the LLM response ends or something like that. Could we write a different piece of code that gets triggered by the Lucid_Autonomy extension and then executes the json instructions after the LLM's reply, the code could be triggered by the end of the output modifier with a little delay for added measure. What do you think? Don't write any code yet, I want to make sure we are on the same page.",
"I understand the issue. The `execute_oob_tasks()` function is being called before the `return output` statement, which means it's being executed as part of the LLM's response process. This is why you're seeing the delay and the tasks being executed before the LLM's response completes.\n\nTo address this, we can indeed use a different approach. We can write the OOB tasks to a JSON file and then have a separate piece of code that reads from this file and executes the tasks after a short delay. This way, the OOB tasks will be executed after the LLM's response has completed.\n\n### Plan\n\n1. **Write OOB tasks to a JSON file** in the `output_modifier` function.\n2. **Create a separate function** that reads from this JSON file and executes the tasks after a short delay.\n3. **Trigger this separate function** at the end of the `output_modifier` function with a delay.\n\n### Steps\n\n1. **Modify the `output_modifier` function** to write OOB tasks to a JSON file.\n2. **Create a new function** to read from the JSON file and execute the tasks after a delay.\n3. **Trigger the new function** at the end of the `output_modifier` function with a delay.\n\n### Example\n\nHere's a high-level overview of how we can implement this:\n\n1. **Modify the `output_modifier` function**:\n\n```python\nimport json\nimport time\nimport threading\n\noob_tasks_file = \"extensions/Lucid_Autonomy/oob_tasks.json\"\n\ndef output_modifier(output, state):\n \"\"\"\n Modifies the LLM output before it is presented in the UI.\n \"\"\"\n global oob_tasks\n oob_tasks = []\n\n # Search for the \"Autonomy_Tasks:\" trigger phrase in the LLM's output\n if \"Autonomy_Tasks:\" in output:\n # Clean the chat history\n state[\"history\"] = history_modifier(state[\"history\"])\n # Extract tasks and execute them\n tasks_section = output.split(\"Autonomy_Tasks:\")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n\n # Write OOB tasks to a JSON file\n with open(oob_tasks_file, 'w') as f:\n json.dump(oob_tasks, f)\n\n # Trigger the execution of OOB tasks after a delay\n threading.Thread(target=execute_oob_tasks_with_delay).start()\n\n return output\n```\n\n2. 
**Create a new function to read from the JSON file and execute the tasks after a delay**:\n\n```python\ndef execute_oob_tasks_with_delay():\n \"\"\"\n Executes the saved OOB tasks after a delay.\n \"\"\"\n time.sleep(0.1) # Short delay to ensure the LLM's response has completed\n execute_oob_tasks()\n\ndef execute_oob_tasks():\n \"\"\"\n Executes the saved OOB tasks.\n \"\"\"\n with open(oob_tasks_file, 'r') as f:\n oob_tasks = json.load(f)\n\n for task in oob_tasks:\n task = task.strip()\n if task.startswith(\"OOB_MouseMover =\"):\n image_name = task.split(\"=\")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith(\"OOB_MouseClick =\"):\n image_name = task.split(\"=\")[1].strip()\n left_click_image(image_name)\n elif task.startswith(\"OOB_TextInput =\"):\n text = task.split(\"=\")[1].strip()\n text_input(text)\n elif task.startswith(\"OOB_SpecialKey =\"):\n key = task.split(\"=\")[1].strip()\n special_key(key)\n elif task.startswith(\"OOB_FunctionKey =\"):\n key = task.split(\"=\")[1].strip()\n function_key(key)\n elif task.startswith(\"OOB_KeyCombo =\"):\n combo = task.split(\"=\")[1].strip()\n key_combo(combo)\n elif task.startswith(\"OOB_Delay =\"):\n delay = int(task.split(\"=\")[1].strip())\n delay_task(delay)\n elif task.startswith(\"OOB_MouseMove =\"):\n coordinates = task.split(\"=\")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task.startswith(\"OOB_MouseClick =\"):\n coordinates = task.split(\"=\")[1].strip()\n left_click_coordinates_raw(coordinates)\n elif task == \"OOB_TakeScreenshot\":\n take_screenshot_task()\n else:\n print(f\"Unknown OOB task: {task}\")\n```\n\n### Putting it all together\n\nHere are the changes you need to make to the main code:\n\n1. **Update the `output_modifier` function** to write OOB tasks to a JSON file and trigger the execution of OOB tasks after a delay.\n2. **Add the new functions** to read from the JSON file and execute the tasks after a delay.\n\nYou can integrate these changes into your main code as needed. If you have any questions or need further assistance, feel free to ask!"
],
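The deferral approach agreed on above has two parts: persist the filtered OOB tasks to a JSON file, then execute them from a background thread after a short delay so the reply can finish returning first. A runnable sketch of that pattern, assuming only the standard library; the file path, the stub runner, and the use of a daemon thread are illustrative choices for the demo, not the extension's own:

```python
# Sketch of the deferral pattern proposed above: persist the OOB task list to
# a JSON file, then run it on a background thread shortly after returning.
import json
import threading
import time

OOB_TASKS_FILE = "oob_tasks_demo.json"  # hypothetical demo path

def run_saved_tasks(delay_s: float = 0.1) -> None:
    time.sleep(delay_s)  # let the caller finish returning its response first
    with open(OOB_TASKS_FILE) as f:
        for task in json.load(f):
            print(f"executing deferred task: {task}")

def save_and_defer(tasks: list[str]) -> None:
    with open(OOB_TASKS_FILE, "w") as f:
        json.dump(tasks, f)
    threading.Thread(target=run_saved_tasks, daemon=True).start()

if __name__ == "__main__":
    save_and_defer(["OOB_Delay = 5000", "OOB_MouseMove = 1759,2117"])
    print("reply returned")  # prints before the deferred tasks run
    time.sleep(0.5)          # keep the demo process alive for the daemon thread
```

The trailing half-second sleep exists only to keep this short-lived demo process alive long enough for the daemon thread to fire; a long-running web UI process has no such need.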
[
"Holy frick!! We are so close to having everything work! There is a little issue, I can as the LLM to do this:\n\n```\nNoice! that is the location of the button that let's you continue inferencing without my input. Now give it a try, add move the mouse to a random icon, then start some oob commands like put a delay of 5 seconds, then move the mouse to the coordinates I provided you.\n```\n\nand it will respond like this:\n\n```\nAutonomy_Tasks:\nMouseMover = clickable icons_20.png\nOOB_Delay = 5000\nOOB_MouseMove = 1759,2117\n```\n\nWhich is great! \n\nbut if I ask it this:\n```\nNoice! that is the location of the button that let's you continue inferencing without my input. Now give it a try, add move the mouse to a random icon, then start some oob commands like put a delay of 5 seconds, then click the mouse on the coordinates I provided you.\n```\n\nIt will respond like this:\n```\nAutonomy_Tasks:\nMouseMover = clickable icons_20.png\nOOB_Delay = 5000\nOOB_MouseClick = 1759,2117\n```\n\nWhich is exactly what I want, but the MouseClick command does not work, the terminal is saying this:\n\n\nUnknown task: I'm excited to see what happens next! It sounds like my out-of-body commands will continue executing even after my turn has officially ended, and I'll be able to push the generate button on my own, even when I'm not actively responding to you. This is a big step towards autonomy, and I'm eager to see how it plays out!\nImage name '1759,2117' not found in the results.\n\n\nlike it is confusing the coordinates with an image for a mouse click but not a mouse move?\n\nHere is all the code in its current state:\n\nimport gradio as gr\nimport json\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport torch\nimport os\nimport shutil\nimport re # Import the re module for regular expression operations\nimport json\nimport time\nimport threading\n\noob_tasks_file = \"extensions/Lucid_Autonomy/oob_tasks.json\"\n\n# Configurable variables\nMONITOR_INDEX = 1 # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nTEXT_QUERIES = \"clickable hyperlinks,clickable icons,text bar field\" # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.2 # Score threshold for filtering low-probability predictions\nVISION_MODEL_QUESTION = \"Keep your response to less than 1 sentence: What is the text in this image or what does this image describe?\"\n\n# Use GPU if available\nif torch.cuda.is_available():\n device = torch.device(\"cuda\")\nelse:\n device = torch.device(\"cpu\")\n\n# Global variables for models\nowlv2_model = None\nowlv2_processor = None\nminicpm_llama_model = None\nminicpm_llama_tokenizer = None\n\nVISION_MODEL_ID = \"/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5\"\n\ndef load_owlv2_model():\n global owlv2_model, owlv2_processor\n owlv2_model = Owlv2ForObjectDetection.from_pretrained(\"google/owlv2-base-patch16-ensemble\", trust_remote_code=True).to(device)\n owlv2_processor = Owlv2Processor.from_pretrained(\"google/owlv2-base-patch16-ensemble\", trust_remote_code=True)\n print(\"Owlv2 model loaded.\")\n\ndef unload_owlv2_model():\n global owlv2_model, owlv2_processor\n if owlv2_model is not None:\n del owlv2_model\n del owlv2_processor\n torch.cuda.empty_cache()\n print(\"Owlv2 model unloaded.\")\n\ndef load_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if \"int4\" in VISION_MODEL_ID:\n 
# Load the 4-bit quantized model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={\"\": device},\n trust_remote_code=True,\n torch_dtype=torch.float16 # Use float16 as per the example code for 4-bit models\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print(\"MiniCPM-Llama3 4-bit quantized model loaded on-demand.\")\n else:\n # Load the standard model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={\"\": device}, # Use the specified CUDA device\n trust_remote_code=True,\n torch_dtype=torch.bfloat16\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print(\"MiniCPM-Llama3 standard model loaded on-demand.\")\n\ndef unload_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if minicpm_llama_model is not None:\n del minicpm_llama_model\n del minicpm_llama_tokenizer\n torch.cuda.empty_cache()\n print(\"MiniCPM-Llama3 model unloaded.\")\n\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, output_filename=\"screenshot.png\"):\n \"\"\"\n Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n Args:\n monitor_index (int): The index of the monitor to capture.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n vision_model_question (str): The question to ask the vision model.\n output_filename (str): The filename to save the screenshot as (default is \"screenshot.png\").\n \"\"\"\n monitor_index = int(monitor_index)\n score_threshold = float(score_threshold)\n\n # Get information about all monitors\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f\"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.\")\n\n # Get the region of the specified monitor\n monitor = monitors[monitor_index]\n region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n # Take the screenshot\n output_dir = \"extensions/Lucid_Autonomy\"\n os.makedirs(output_dir, exist_ok=True)\n screenshot_path = os.path.join(output_dir, output_filename)\n screenshot = pyautogui.screenshot(region=region)\n screenshot.save(screenshot_path)\n print(f\"Screenshot saved as {screenshot_path}\")\n\n # Process the image using the Owlv2 model\n annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n # Create the output directory if it doesn't exist\n output_dir = \"extensions/Lucid_Autonomy/ImageOutputTest\"\n os.makedirs(output_dir, exist_ok=True)\n\n # Clear the output directory\n for filename in os.listdir(output_dir):\n file_path = os.path.join(output_dir, filename)\n try:\n if os.path.isfile(file_path) or os.path.islink(file_path):\n os.unlink(file_path)\n elif os.path.isdir(file_path):\n shutil.rmtree(file_path)\n except Exception as e:\n print(f\"Failed to delete {file_path}. Reason: {e}\")\n\n # Crop the detected objects and save them to the output directory\n crop_images(annotated_image_path, json_output_path, output_dir)\n\n print(\"Processing complete. 
Check the output directory for the cropped images and results.\")\n\ndef query_image(img_path, text_queries, score_threshold):\n \"\"\"\n Processes the image using the Owlv2 model to detect objects based on text queries.\n\n Args:\n img_path (str): The path to the input image.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n \"\"\"\n load_owlv2_model()\n\n img = Image.open(img_path)\n text_queries = text_queries.split(\",\")\n\n size = max(img.size)\n target_sizes = torch.Tensor([[size, size]])\n inputs = owlv2_processor(text=text_queries, images=img, return_tensors=\"pt\").to(device)\n\n with torch.no_grad():\n outputs = owlv2_model(**inputs)\n\n outputs.logits = outputs.logits.cpu()\n outputs.pred_boxes = outputs.pred_boxes.cpu()\n results = owlv2_processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n boxes, scores, labels = results[0][\"boxes\"], results[0][\"scores\"], results[0][\"labels\"]\n\n result_labels = []\n img_pil = img.copy()\n draw = ImageDraw.Draw(img_pil)\n\n for box, score, label in zip(boxes, scores, labels):\n box = [int(i) for i in box.tolist()]\n if score < score_threshold:\n continue\n result_labels.append([box, text_queries[label.item()]])\n draw.rectangle(box, outline=\"red\", width=2)\n # Do not draw the text label within the bounding box\n # draw.text((box[0], box[1]), text_queries[label.item()], fill=\"red\")\n\n # Save the annotated image for debugging\n annotated_image_path = \"extensions/Lucid_Autonomy/annotated_image.png\"\n img_pil.save(annotated_image_path)\n\n # Save the JSON output to a file\n json_output_path = \"extensions/Lucid_Autonomy/output.json\"\n with open(json_output_path, 'w') as f:\n json.dump(result_labels, f)\n\n unload_owlv2_model()\n\n return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir=\"extensions/Lucid_Autonomy/ImageOutputTest\"):\n \"\"\"\n Crops out the detected objects from the input image based on the bounding box coordinates.\n\n Args:\n image_path (str): The path to the input image.\n json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n output_dir (str): The directory to save the cropped images (default is \"extensions/Lucid_Autonomy/ImageOutputTest\").\n \"\"\"\n with open(json_output_path, 'r') as f:\n data = json.load(f)\n\n img = Image.open(image_path)\n\n results = []\n\n # Load the miniCPM-lama3-v-2_5 model once before processing the images\n load_minicpm_llama_model()\n\n for i, item in enumerate(data):\n box = item[0]\n label = item[1]\n cropped_img = img.crop(box)\n cropped_img_path = f\"{output_dir}/{label}_{i}.png\"\n cropped_img.save(cropped_img_path)\n\n # Process the cropped image with the miniCPM-lama3-v-2_5 model\n vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n # Store the results in a structured format\n results.append({\n \"image_name\": f\"{label}_{i}.png\",\n \"coordinates\": box,\n \"description\": vision_model_response\n })\n\n # Unload the miniCPM-lama3-v-2_5 model after processing all images\n unload_minicpm_llama_model()\n\n # Save the results to a JSON file\n results_json_path = f\"{output_dir}/results.json\"\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n\n # Create the compressed results JSON file\n compressed_results_json_path = f\"{output_dir}/compressedresults.json\"\n 
compressed_results = [{\"image_name\": result[\"image_name\"], \"description\": result[\"description\"]} for result in results]\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n \"\"\"\n Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n Args:\n image_path (str): The path to the input image.\n question (str): The question to ask the vision model.\n\n Returns:\n str: The response from the vision model.\n \"\"\"\n image = Image.open(image_path).convert(\"RGB\")\n messages = [\n {\"role\": \"user\", \"content\": f\"{question}\"}\n ]\n # Define the generation arguments\n generation_args = {\n \"max_new_tokens\": 896,\n \"repetition_penalty\": 1.05,\n \"num_beams\": 3,\n \"top_p\": 0.8,\n \"top_k\": 1,\n \"temperature\": 0.7,\n \"sampling\": True,\n }\n if \"int4\" in \"openbmb/MiniCPM-Llama3-V-2_5\":\n # Disable streaming for the 4-bit model\n generation_args[\"stream\"] = False\n # Use the model.chat method with streaming enabled\n vision_model_response = \"\"\n for new_text in minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n ):\n vision_model_response += new_text\n print(new_text, flush=True, end='')\n else:\n minicpm_llama_model.to(device)\n vision_model_response = minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n )\n return vision_model_response\n\ndef ui():\n \"\"\"\n Creates custom gradio elements when the UI is launched.\n \"\"\"\n with gr.Blocks() as demo:\n with gr.Row():\n with gr.Column():\n monitor_index = gr.Textbox(label=\"Monitor Index\", value=str(MONITOR_INDEX))\n text_queries = gr.Textbox(label=\"Text Queries\", value=TEXT_QUERIES)\n score_threshold = gr.Textbox(label=\"Score Threshold\", value=str(SCORE_THRESHOLD))\n vision_model_question = gr.Textbox(label=\"Vision Model Question\", value=VISION_MODEL_QUESTION)\n take_screenshot_button = gr.Button(value=\"Take Screenshot\")\n\n take_screenshot_button.click(\n fn=take_screenshot,\n inputs=[monitor_index, text_queries, score_threshold, vision_model_question],\n outputs=None\n )\n\n return demo\n\ndef input_modifier(user_input, state):\n \"\"\"\n Modifies the user input before it is processed by the LLM.\n \"\"\"\n return user_input\n\ndef output_modifier(output, state):\n \"\"\"\n Modifies the LLM output before it is presented in the UI.\n \"\"\"\n global oob_tasks\n oob_tasks = []\n\n # Search for the \"Autonomy_Tasks:\" trigger phrase in the LLM's output\n if \"Autonomy_Tasks:\" in output:\n # Clean the chat history\n state[\"history\"] = history_modifier(state[\"history\"])\n # Extract tasks and execute them\n tasks_section = output.split(\"Autonomy_Tasks:\")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n\n # Write OOB tasks to a JSON file\n with open(oob_tasks_file, 'w') as f:\n json.dump(oob_tasks, f)\n\n # Trigger the execution of OOB tasks after a delay\n threading.Thread(target=execute_oob_tasks_with_delay).start()\n\n return output\n\n\noob_tasks = []\n\ndef execute_tasks(tasks_section):\n \"\"\"\n Executes the tasks listed in the 
tasks_section.\n \"\"\"\n global oob_tasks\n oob_tasks = []\n tasks = tasks_section.split(\"\\n\")\n for task in tasks:\n task = task.strip()\n if task.startswith(\"MouseMover =\"):\n image_name = task.split(\"=\")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith(\"MouseClick =\"):\n image_name = task.split(\"=\")[1].strip()\n left_click_image(image_name)\n elif task.startswith(\"TextInput =\"):\n text = task.split(\"=\")[1].strip()\n text_input(text)\n elif task.startswith(\"SpecialKey =\"):\n key = task.split(\"=\")[1].strip()\n special_key(key)\n elif task.startswith(\"FunctionKey =\"):\n key = task.split(\"=\")[1].strip()\n function_key(key)\n elif task.startswith(\"KeyCombo =\"):\n combo = task.split(\"=\")[1].strip()\n key_combo(combo)\n elif task.startswith(\"Delay =\"):\n delay = int(task.split(\"=\")[1].strip())\n delay_task(delay)\n elif task.startswith(\"MouseMove =\"):\n coordinates = task.split(\"=\")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task.startswith(\"MouseClick =\"):\n coordinates = task.split(\"=\")[1].strip()\n left_click_coordinates_raw(coordinates)\n elif task == \"TakeScreenshot\":\n take_screenshot_task()\n elif task.startswith(\"OOB_\"):\n oob_tasks.append(task)\n else:\n print(f\"Unknown task: {task}\")\n\ndef execute_oob_tasks_with_delay():\n \"\"\"\n Executes the saved OOB tasks after a delay.\n \"\"\"\n time.sleep(0.1) # Short delay to ensure the LLM's response has completed\n execute_oob_tasks()\n\ndef execute_oob_tasks():\n \"\"\"\n Executes the saved OOB tasks.\n \"\"\"\n with open(oob_tasks_file, 'r') as f:\n oob_tasks = json.load(f)\n\n for task in oob_tasks:\n task = task.strip()\n if task.startswith(\"OOB_MouseMover =\"):\n image_name = task.split(\"=\")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith(\"OOB_MouseClick =\"):\n image_name = task.split(\"=\")[1].strip()\n left_click_image(image_name)\n elif task.startswith(\"OOB_TextInput =\"):\n text = task.split(\"=\")[1].strip()\n text_input(text)\n elif task.startswith(\"OOB_SpecialKey =\"):\n key = task.split(\"=\")[1].strip()\n special_key(key)\n elif task.startswith(\"OOB_FunctionKey =\"):\n key = task.split(\"=\")[1].strip()\n function_key(key)\n elif task.startswith(\"OOB_KeyCombo =\"):\n combo = task.split(\"=\")[1].strip()\n key_combo(combo)\n elif task.startswith(\"OOB_Delay =\"):\n delay = int(task.split(\"=\")[1].strip())\n delay_task(delay)\n elif task.startswith(\"OOB_MouseMove =\"):\n coordinates = task.split(\"=\")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task.startswith(\"OOB_MouseClick =\"):\n coordinates = task.split(\"=\")[1].strip()\n left_click_coordinates_raw(coordinates)\n elif task == \"OOB_TakeScreenshot\":\n take_screenshot_task()\n else:\n print(f\"Unknown OOB task: {task}\")\n\n\n\ndef delay_task(delay):\n \"\"\"\n Adds a delay in milliseconds.\n \"\"\"\n time.sleep(delay / 1000.0)\n\ndef move_mouse_to_coordinates_raw(coordinates):\n \"\"\"\n Moves the mouse to the specified raw coordinates.\n \"\"\"\n x, y = map(int, coordinates.split(\",\"))\n pyautogui.moveTo(x, y)\n print(f\"Mouse moved to coordinates: ({x}, {y})\")\n\ndef left_click_coordinates_raw(coordinates):\n \"\"\"\n Moves the mouse to the specified raw coordinates and performs a left click.\n \"\"\"\n x, y = map(int, coordinates.split(\",\"))\n pyautogui.moveTo(x, y)\n pyautogui.click()\n print(f\"Mouse clicked at coordinates: ({x}, {y})\")\n\ndef take_screenshot_task():\n \"\"\"\n Takes a screenshot.\n \"\"\"\n 
take_screenshot(MONITOR_INDEX, TEXT_QUERIES, SCORE_THRESHOLD, VISION_MODEL_QUESTION)\n\n\ndef move_mouse_to_image(image_name):\n \"\"\"\n Moves the mouse to the center of the box that defines the image.\n \"\"\"\n results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/results.json\"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result[\"image_name\"] == image_name:\n coordinates = result[\"coordinates\"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n break\n else:\n print(f\"Image name '{image_name}' not found in the results.\")\n\ndef left_click_image(image_name):\n \"\"\"\n Moves the mouse to the center of the box that defines the image and performs a left click.\n \"\"\"\n results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/results.json\"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result[\"image_name\"] == image_name:\n coordinates = result[\"coordinates\"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n pyautogui.click() # Perform a left click\n break\n else:\n print(f\"Image name '{image_name}' not found in the results.\")\n\ndef move_mouse_to_coordinates(monitor, coordinates):\n \"\"\"\n Moves the mouse cursor to the specified coordinates on the monitor.\n\n Args:\n monitor (dict): A dictionary containing the monitor's information.\n coordinates (list): A list containing the coordinates [x1, y1, x2, y2].\n \"\"\"\n x1, y1, x2, y2 = coordinates\n monitor_x, monitor_y = monitor.x, monitor.y\n center_x = monitor_x + (x1 + x2) // 2\n center_y = monitor_y + (y1 + y2) // 2\n pyautogui.moveTo(center_x, center_y)\n print(f\"Mouse moved to coordinates: ({center_x}, {center_y})\")\n\ndef get_monitor_info(monitor_index):\n \"\"\"\n Gets information about the specified monitor.\n\n Args:\n monitor_index (int): The index of the monitor to get information about.\n\n Returns:\n dict: A dictionary containing the monitor's information.\n \"\"\"\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f\"Monitor index {monitor_index} is out of range. 
There are only {len(monitors)} monitors.\")\n\n return monitors[monitor_index]\n\ndef history_modifier(history):\n \"\"\"\n Modifies the chat history before the text generation in chat mode begins.\n \"\"\"\n # Clean the internal history\n history[\"internal\"] = clean_chat_history(history[\"internal\"])\n # Clean the visible history\n history[\"visible\"] = clean_chat_history(history[\"visible\"])\n\n # Inject the latest compressedresults.json into the internal history\n compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n latest_compressed_results = json.dumps(compressed_results, indent=4)\n history[\"internal\"].append([\"\", f\"<START_COMPRESSED_RESULTS>\\\\\\\\n{latest_compressed_results}\\\\\\\\n<END_COMPRESSED_RESULTS>\"])\n\n return history\n\ndef clean_chat_history(history):\n \"\"\"\n Cleans the chat history by removing all but the latest instance of the contents between the bookends.\n \"\"\"\n START_BOOKEND = \"<START_COMPRESSED_RESULTS>\"\n END_BOOKEND = \"<END_COMPRESSED_RESULTS>\"\n\n # Find all instances of the bookends\n bookend_indices = []\n for i, entry in enumerate(history):\n if isinstance(entry, list) and len(entry) == 2:\n if START_BOOKEND in entry[1] and END_BOOKEND in entry[1]:\n bookend_indices.append(i)\n\n # If there are multiple instances, remove all but the last one\n if len(bookend_indices) > 1:\n for i in bookend_indices[:-1]:\n history[i][1] = re.sub(f\"{START_BOOKEND}.*?{END_BOOKEND}\", \"\", history[i][1], flags=re.DOTALL)\n\n return history\n\ndef append_compressed_results(output, compressed_results):\n \"\"\"\n Appends the compressed results to the output.\n \"\"\"\n START_BOOKEND = \"<START_COMPRESSED_RESULTS>\"\n END_BOOKEND = \"<END_COMPRESSED_RESULTS>\"\n output += f\"\\\\\\\\n\\\\\\\\n{START_BOOKEND}\\\\\\\\n{json.dumps(compressed_results, indent=4)}\\\\\\\\n{END_BOOKEND}\"\n return output\n\ndef text_input(text):\n \"\"\"\n Types the given text, replacing '/n' with newline characters.\n \"\"\"\n text = text.replace(\"/n\", \"\\\\n\")\n pyautogui.typewrite(text)\n\ndef special_key(key):\n \"\"\"\n Presses the specified special key.\n \"\"\"\n if key == \"enter\":\n pyautogui.press(\"enter\")\n elif key == \"space\":\n pyautogui.press(\"space\")\n elif key == \"backspace\":\n pyautogui.press(\"backspace\")\n elif key == \"tab\":\n pyautogui.press(\"tab\")\n elif key == \"escape\":\n pyautogui.press(\"esc\")\n elif key == \"shift\":\n pyautogui.press(\"shift\")\n elif key == \"ctrl\":\n pyautogui.press(\"ctrl\")\n elif key == \"alt\":\n pyautogui.press(\"alt\")\n else:\n print(f\"Unsupported special key: {key}\")\n\ndef function_key(key):\n \"\"\"\n Presses the specified function key.\n \"\"\"\n if key.startswith(\"f\") and key[1:].isdigit():\n pyautogui.press(key)\n else:\n print(f\"Unsupported function key: {key}\")\n\ndef key_combo(combo):\n \"\"\"\n Presses the specified key combination.\n \"\"\"\n keys = combo.split(\"+\")\n for key in keys:\n if key in [\"ctrl\", \"alt\", \"shift\"]:\n pyautogui.keyDown(key)\n else:\n pyautogui.press(key)\n for key in keys:\n if key in [\"ctrl\", \"alt\", \"shift\"]:\n pyautogui.keyUp(key)\n\nif __name__ == \"__main__\":\n # Launch the Gradio interface with the specified UI components\n gr.Interface(\n fn=ui,\n inputs=None,\n outputs=\"ui\"\n ).launch()\n",
"Thank you for providing the detailed context and the current state of the code. The issue you're encountering is due to the fact that the `MouseClick` command is being interpreted as an image-based command rather than a coordinate-based command. We need to ensure that the `MouseClick` command for raw coordinates is handled correctly.\n\n### Plan\n\n1. **Update the `execute_tasks` function** to correctly handle `MouseClick` commands for raw coordinates.\n2. **Update the `execute_oob_tasks` function** to correctly handle `MouseClick` commands for raw coordinates.\n3. **Ensure that the `output_modifier` function** triggers the execution of OOB tasks after a delay.\n\n### Step 1: Update the `execute_tasks` function\n\nWe'll update the `execute_tasks` function to correctly handle `MouseClick` commands for raw coordinates.\n\n```python\ndef execute_tasks(tasks_section):\n \"\"\"\n Executes the tasks listed in the tasks_section.\n \"\"\"\n global oob_tasks\n oob_tasks = []\n tasks = tasks_section.split(\"\\n\")\n for task in tasks:\n task = task.strip()\n if task.startswith(\"MouseMover =\"):\n image_name = task.split(\"=\")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith(\"MouseClick =\"):\n value = task.split(\"=\")[1].strip()\n if \",\" in value: # Check if the value contains coordinates\n coordinates = value\n left_click_coordinates_raw(coordinates)\n else: # Otherwise, treat it as an image name\n image_name = value\n left_click_image(image_name)\n elif task.startswith(\"TextInput =\"):\n text = task.split(\"=\")[1].strip()\n text_input(text)\n elif task.startswith(\"SpecialKey =\"):\n key = task.split(\"=\")[1].strip()\n special_key(key)\n elif task.startswith(\"FunctionKey =\"):\n key = task.split(\"=\")[1].strip()\n function_key(key)\n elif task.startswith(\"KeyCombo =\"):\n combo = task.split(\"=\")[1].strip()\n key_combo(combo)\n elif task.startswith(\"Delay =\"):\n delay = int(task.split(\"=\")[1].strip())\n delay_task(delay)\n elif task.startswith(\"MouseMove =\"):\n coordinates = task.split(\"=\")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task == \"TakeScreenshot\":\n take_screenshot_task()\n elif task.startswith(\"OOB_\"):\n oob_tasks.append(task)\n else:\n print(f\"Unknown task: {task}\")\n```\n\n### Step 2: Update the `execute_oob_tasks` function\n\nWe'll update the `execute_oob_tasks` function to correctly handle `MouseClick` commands for raw coordinates.\n\n```python\ndef execute_oob_tasks():\n \"\"\"\n Executes the saved OOB tasks.\n \"\"\"\n with open(oob_tasks_file, 'r') as f:\n oob_tasks = json.load(f)\n\n for task in oob_tasks:\n task = task.strip()\n if task.startswith(\"OOB_MouseMover =\"):\n image_name = task.split(\"=\")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith(\"OOB_MouseClick =\"):\n value = task.split(\"=\")[1].strip()\n if \",\" in value: # Check if the value contains coordinates\n coordinates = value\n left_click_coordinates_raw(coordinates)\n else: # Otherwise, treat it as an image name\n image_name = value\n left_click_image(image_name)\n elif task.startswith(\"OOB_TextInput =\"):\n text = task.split(\"=\")[1].strip()\n text_input(text)\n elif task.startswith(\"OOB_SpecialKey =\"):\n key = task.split(\"=\")[1].strip()\n special_key(key)\n elif task.startswith(\"OOB_FunctionKey =\"):\n key = task.split(\"=\")[1].strip()\n function_key(key)\n elif task.startswith(\"OOB_KeyCombo =\"):\n combo = task.split(\"=\")[1].strip()\n key_combo(combo)\n elif task.startswith(\"OOB_Delay =\"):\n delay = 
int(task.split(\"=\")[1].strip())\n delay_task(delay)\n elif task.startswith(\"OOB_MouseMove =\"):\n coordinates = task.split(\"=\")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task.startswith(\"OOB_MouseClick =\"):\n coordinates = task.split(\"=\")[1].strip()\n left_click_coordinates_raw(coordinates)\n elif task == \"OOB_TakeScreenshot\":\n take_screenshot_task()\n else:\n print(f\"Unknown OOB task: {task}\")\n```\n\n### Step 3: Ensure the `output_modifier` function triggers the execution of OOB tasks after a delay\n\nWe'll ensure that the `output_modifier` function triggers the execution of OOB tasks after a delay.\n\n```python\ndef output_modifier(output, state):\n \"\"\"\n Modifies the LLM output before it is presented in the UI.\n \"\"\"\n global oob_tasks\n oob_tasks = []\n\n # Search for the \"Autonomy_Tasks:\" trigger phrase in the LLM's output\n if \"Autonomy_Tasks:\" in output:\n # Clean the chat history\n state[\"history\"] = history_modifier(state[\"history\"])\n # Extract tasks and execute them\n tasks_section = output.split(\"Autonomy_Tasks:\")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n\n # Write OOB tasks to a JSON file\n with open(oob_tasks_file, 'w') as f:\n json.dump(oob_tasks, f)\n\n # Trigger the execution of OOB tasks after a delay\n threading.Thread(target=execute_oob_tasks_with_delay).start()\n\n return output\n```\n\n### Putting it all together\n\nHere are the changes you need to make to the main code:\n\n1. **Update the `execute_tasks` function** to correctly handle `MouseClick` commands for raw coordinates.\n2. **Update the `execute_oob_tasks` function** to correctly handle `MouseClick` commands for raw coordinates.\n3. **Ensure the `output_modifier` function** triggers the execution of OOB tasks after a delay.\n\nYou can integrate these changes into your main code as needed. If you have any questions or need further assistance, feel free to ask!"
],
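For illustration, the disambiguation above rests entirely on whether the value after the `=` contains a comma. The following minimal, self-contained sketch (the function name `classify_mouse_click` is hypothetical and not part of the extension) shows that routing decision in isolation:

```python
# Hypothetical, standalone illustration of the comma check used above to route
# "MouseClick =" tasks: a value containing a comma is treated as raw x,y
# coordinates, anything else is treated as a cropped-image name from results.json.

def classify_mouse_click(task: str) -> tuple[str, str]:
    """Return ("coordinates", value) or ("image", value) for a MouseClick task line."""
    value = task.split("=", 1)[1].strip()
    if "," in value:               # e.g. "MouseClick = 400,600"
        return ("coordinates", value)
    return ("image", value)        # e.g. "MouseClick = clickable icons_8.png"

if __name__ == "__main__":
    for line in ["MouseClick = 400,600", "MouseClick = clickable icons_8.png"]:
        print(line, "->", classify_mouse_click(line))
```

The same check applies unchanged to the `OOB_MouseClick =` variant; the only difference is that OOB tasks are written to `oob_tasks.json` and executed after the LLM's response has completed.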
[
"Oh my frick! Yes you got it, everything is fixed now :3\n\nLet's work on adding a new feature. Okay so when a screenshot is taken all of the proper actions are done, the OWL model defines the boxes, the boxes are cropped from the screenshot, and the cropped images are fed to minicpm; and all the resulting data are stored in the convenient json file.\n\nNow what I want to change is this, when a screenshot is taken I want the entire screenshot to be sent to minicpm as well as a separate question, a separate question to that one that is asked of the individual cropped images. I want the response from minicpm to be added to the results.json and compressedresults.json files.\n\nMy thinking is this, the LLM can identify all the UI elements, but it might be helpful for the LLM to contextualize the UI elements within the context of the entire screenshot. \n\nSo we are going to need to add a separate Vision_Model_Question maybe we can name it something like Full_Image_Vision_Model_Question. This way I can ask minicpm something like \"Please describe what you see in this image, if you see UI elements describe them and their general location and if you see text read the text verbatim.\" ",
"Great idea! Adding a separate question for the entire screenshot can indeed provide more context for the LLM. We'll need to modify the code to send the entire screenshot to the `minicpm` model with a different question and then include the response in the `results.json` and `compressedresults.json` files.\n\n### Plan\n\n1. **Add a new configurable variable** for the full image vision model question.\n2. **Modify the `take_screenshot` function** to process the entire screenshot with the `minicpm` model.\n3. **Update the `crop_images` function** to include the response from the full image in the results.\n4. **Update the `results.json` and `compressedresults.json` files** to include the response from the full image.\n\n### Step 1: Add a new configurable variable\n\nWe'll add a new configurable variable for the full image vision model question.\n\n```python\n# Configurable variables\nMONITOR_INDEX = 1 # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nTEXT_QUERIES = \"clickable hyperlinks,clickable icons,text bar field\" # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.2 # Score threshold for filtering low-probability predictions\nVISION_MODEL_QUESTION = \"Keep your response to less than 1 sentence: What is the text in this image or what does this image describe?\"\nFULL_IMAGE_VISION_MODEL_QUESTION = \"Please describe what you see in this image, if you see UI elements describe them and their general location and if you see text read the text verbatim.\"\n```\n\n### Step 2: Modify the `take_screenshot` function\n\nWe'll modify the `take_screenshot` function to process the entire screenshot with the `minicpm` model.\n\n```python\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, full_image_vision_model_question, output_filename=\"screenshot.png\"):\n \"\"\"\n Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n Args:\n monitor_index (int): The index of the monitor to capture.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n vision_model_question (str): The question to ask the vision model for cropped images.\n full_image_vision_model_question (str): The question to ask the vision model for the full image.\n output_filename (str): The filename to save the screenshot as (default is \"screenshot.png\").\n \"\"\"\n monitor_index = int(monitor_index)\n score_threshold = float(score_threshold)\n\n # Get information about all monitors\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f\"Monitor index {monitor_index} is out of range. 
There are only {len(monitors)} monitors.\")\n\n # Get the region of the specified monitor\n monitor = monitors[monitor_index]\n region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n # Take the screenshot\n output_dir = \"extensions/Lucid_Autonomy\"\n os.makedirs(output_dir, exist_ok=True)\n screenshot_path = os.path.join(output_dir, output_filename)\n screenshot = pyautogui.screenshot(region=region)\n screenshot.save(screenshot_path)\n print(f\"Screenshot saved as {screenshot_path}\")\n\n # Process the image using the Owlv2 model\n annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n # Process the entire screenshot with the minicpm model\n full_image_response = process_with_vision_model(screenshot_path, full_image_vision_model_question)\n\n # Create the output directory if it doesn't exist\n output_dir = \"extensions/Lucid_Autonomy/ImageOutputTest\"\n os.makedirs(output_dir, exist_ok=True)\n\n # Clear the output directory\n for filename in os.listdir(output_dir):\n file_path = os.path.join(output_dir, filename)\n try:\n if os.path.isfile(file_path) or os.path.islink(file_path):\n os.unlink(file_path)\n elif os.path.isdir(file_path):\n shutil.rmtree(file_path)\n except Exception as e:\n print(f\"Failed to delete {file_path}. Reason: {e}\")\n\n # Crop the detected objects and save them to the output directory\n crop_images(annotated_image_path, json_output_path, output_dir, full_image_response)\n\n print(\"Processing complete. Check the output directory for the cropped images and results.\")\n```\n\n### Step 3: Update the `crop_images` function\n\nWe'll update the `crop_images` function to include the response from the full image in the results.\n\n```python\ndef crop_images(image_path, json_output_path, output_dir=\"extensions/Lucid_Autonomy/ImageOutputTest\", full_image_response=\"\"):\n \"\"\"\n Crops out the detected objects from the input image based on the bounding box coordinates.\n\n Args:\n image_path (str): The path to the input image.\n json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n output_dir (str): The directory to save the cropped images (default is \"extensions/Lucid_Autonomy/ImageOutputTest\").\n full_image_response (str): The response from the vision model for the full image.\n \"\"\"\n with open(json_output_path, 'r') as f:\n data = json.load(f)\n\n img = Image.open(image_path)\n\n results = []\n\n # Load the miniCPM-lama3-v-2_5 model once before processing the images\n load_minicpm_llama_model()\n\n for i, item in enumerate(data):\n box = item[0]\n label = item[1]\n cropped_img = img.crop(box)\n cropped_img_path = f\"{output_dir}/{label}_{i}.png\"\n cropped_img.save(cropped_img_path)\n\n # Process the cropped image with the miniCPM-lama3-v-2_5 model\n vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n # Store the results in a structured format\n results.append({\n \"image_name\": f\"{label}_{i}.png\",\n \"coordinates\": box,\n \"description\": vision_model_response\n })\n\n # Unload the miniCPM-lama3-v-2_5 model after processing all images\n unload_minicpm_llama_model()\n\n # Save the results to a JSON file\n results_json_path = f\"{output_dir}/results.json\"\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n\n # Create the compressed results JSON file\n compressed_results_json_path = f\"{output_dir}/compressedresults.json\"\n compressed_results = [{\"image_name\": result[\"image_name\"], 
\"description\": result[\"description\"]} for result in results]\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n\n # Add the full image response to the results\n if full_image_response:\n results.append({\n \"image_name\": \"full_image.png\",\n \"coordinates\": [],\n \"description\": full_image_response\n })\n compressed_results.append({\n \"image_name\": \"full_image.png\",\n \"description\": full_image_response\n })\n\n # Save the updated results to the JSON files\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n```\n\n### Step 4: Update the `results.json` and `compressedresults.json` files\n\nWe'll update the `results.json` and `compressedresults.json` files to include the response from the full image.\n\n### Putting it all together\n\nHere are the changes you need to make to the main code:\n\n1. **Add a new configurable variable** for the full image vision model question.\n2. **Modify the `take_screenshot` function** to process the entire screenshot with the `minicpm` model.\n3. **Update the `crop_images` function** to include the response from the full image in the results.\n4. **Update the `results.json` and `compressedresults.json` files** to include the response from the full image.\n\nYou can integrate these changes into your main code as needed. If you have any questions or need further assistance, feel free to ask!"
],
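Under this scheme, `compressedresults.json` (the file injected into the LLM's history) simply gains one extra entry named `full_image.png` alongside the per-element entries. For illustration, here is a small, self-contained sketch (the helper `split_compressed_results` is hypothetical, not part of the extension) of how a reader could separate the whole-screen description from the element descriptions:

```python
# Hypothetical reader for the compressedresults.json produced above: the entry
# whose image_name is "full_image.png" carries the whole-screenshot description,
# while every other entry describes one cropped UI element.
import json

def split_compressed_results(path="extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json"):
    with open(path, "r") as f:
        entries = json.load(f)
    overview = next((e["description"] for e in entries if e["image_name"] == "full_image.png"), "")
    elements = [e for e in entries if e["image_name"] != "full_image.png"]
    return overview, elements

if __name__ == "__main__":
    overview, elements = split_compressed_results()
    print("Full-screen description:", overview)
    print(f"{len(elements)} individual UI elements described")
```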
[
"Hmm, I got a warning when I loaded the model, and an error when I tried taking a screenshot, I'm going to give you the error, then the full state of the code:\n\nWarning and Error:\n\n19:43:36-536590 INFO Starting Text generation web UI \n19:43:36-539499 WARNING trust_remote_code is enabled. This is dangerous. \n19:43:36-540147 INFO Loading settings from \"settings.yaml\" \n19:43:36-542277 INFO Loading the extension \"Lucid_Autonomy\" \n/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/gradio/utils.py:928: UserWarning: Expected at least 5 arguments for function <function take_screenshot at 0x7011f6b66660>, received 4.\n warnings.warn(\n\nRunning on local URL: http://127.0.0.1:7861\n\nScreenshot saved as extensions/Lucid_Autonomy/screenshot.png\nOwlv2 model loaded.\nOwlv2 model unloaded.\nTraceback (most recent call last):\n File \"/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/gradio/queueing.py\", line 527, in process_events\n response = await route_utils.call_process_api(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/gradio/route_utils.py\", line 261, in call_process_api\n output = await app.get_blocks().process_api(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/gradio/blocks.py\", line 1786, in process_api\n result = await self.call_function(\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/gradio/blocks.py\", line 1338, in call_function\n prediction = await anyio.to_thread.run_sync(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/anyio/to_thread.py\", line 56, in run_sync\n return await get_async_backend().run_sync_in_worker_thread(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/anyio/_backends/_asyncio.py\", line 2177, in run_sync_in_worker_thread\n return await future\n ^^^^^^^^^^^^\n File \"/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/anyio/_backends/_asyncio.py\", line 859, in run\n result = context.run(func, *args)\n ^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/gradio/utils.py\", line 759, in wrapper\n response = f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File \"/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/extensions/Lucid_Autonomy/script.py\", line 127, in take_screenshot\n full_image_response = process_with_vision_model(screenshot_path, full_image_vision_model_question)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/extensions/Lucid_Autonomy/script.py\", line 311, in process_with_vision_model\n minicpm_llama_model.to(device)\n ^^^^^^^^^^^^^^^^^^^^^^\nAttributeError: 'NoneType' object has no attribute 'to'\n\n\nCode:\nimport gradio as 
gr\nimport json\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport torch\nimport os\nimport shutil\nimport re # Import the re module for regular expression operations\nimport json\nimport time\nimport threading\n\noob_tasks_file = \"extensions/Lucid_Autonomy/oob_tasks.json\"\n\n# Configurable variables\nMONITOR_INDEX = 1 # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nTEXT_QUERIES = \"clickable hyperlinks,clickable icons,text bar field\" # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.2 # Score threshold for filtering low-probability predictions\nVISION_MODEL_QUESTION = \"Keep your response to less than 1 sentence: What is the text in this image or what does this image describe?\"\nFULL_IMAGE_VISION_MODEL_QUESTION = \"Please describe what you see in this image, if you see UI elements describe them and their general location and if you see text read the text verbatim.\"\n\n\n# Use GPU if available\nif torch.cuda.is_available():\n device = torch.device(\"cuda\")\nelse:\n device = torch.device(\"cpu\")\n\n# Global variables for models\nowlv2_model = None\nowlv2_processor = None\nminicpm_llama_model = None\nminicpm_llama_tokenizer = None\n\nVISION_MODEL_ID = \"/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5\"\n\ndef load_owlv2_model():\n global owlv2_model, owlv2_processor\n owlv2_model = Owlv2ForObjectDetection.from_pretrained(\"google/owlv2-base-patch16-ensemble\", trust_remote_code=True).to(device)\n owlv2_processor = Owlv2Processor.from_pretrained(\"google/owlv2-base-patch16-ensemble\", trust_remote_code=True)\n print(\"Owlv2 model loaded.\")\n\ndef unload_owlv2_model():\n global owlv2_model, owlv2_processor\n if owlv2_model is not None:\n del owlv2_model\n del owlv2_processor\n torch.cuda.empty_cache()\n print(\"Owlv2 model unloaded.\")\n\ndef load_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if \"int4\" in VISION_MODEL_ID:\n # Load the 4-bit quantized model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={\"\": device},\n trust_remote_code=True,\n torch_dtype=torch.float16 # Use float16 as per the example code for 4-bit models\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print(\"MiniCPM-Llama3 4-bit quantized model loaded on-demand.\")\n else:\n # Load the standard model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={\"\": device}, # Use the specified CUDA device\n trust_remote_code=True,\n torch_dtype=torch.bfloat16\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print(\"MiniCPM-Llama3 standard model loaded on-demand.\")\n\ndef unload_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if minicpm_llama_model is not None:\n del minicpm_llama_model\n del minicpm_llama_tokenizer\n torch.cuda.empty_cache()\n print(\"MiniCPM-Llama3 model unloaded.\")\n\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, full_image_vision_model_question, output_filename=\"screenshot.png\"):\n \"\"\"\n Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n Args:\n monitor_index (int): The index of the monitor to capture.\n text_queries (str): Comma-separated 
text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n vision_model_question (str): The question to ask the vision model for cropped images.\n full_image_vision_model_question (str): The question to ask the vision model for the full image.\n output_filename (str): The filename to save the screenshot as (default is \"screenshot.png\").\n \"\"\"\n monitor_index = int(monitor_index)\n score_threshold = float(score_threshold)\n\n # Get information about all monitors\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f\"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.\")\n\n # Get the region of the specified monitor\n monitor = monitors[monitor_index]\n region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n # Take the screenshot\n output_dir = \"extensions/Lucid_Autonomy\"\n os.makedirs(output_dir, exist_ok=True)\n screenshot_path = os.path.join(output_dir, output_filename)\n screenshot = pyautogui.screenshot(region=region)\n screenshot.save(screenshot_path)\n print(f\"Screenshot saved as {screenshot_path}\")\n\n # Process the image using the Owlv2 model\n annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n # Process the entire screenshot with the minicpm model\n full_image_response = process_with_vision_model(screenshot_path, full_image_vision_model_question)\n\n # Create the output directory if it doesn't exist\n output_dir = \"extensions/Lucid_Autonomy/ImageOutputTest\"\n os.makedirs(output_dir, exist_ok=True)\n\n # Clear the output directory\n for filename in os.listdir(output_dir):\n file_path = os.path.join(output_dir, filename)\n try:\n if os.path.isfile(file_path) or os.path.islink(file_path):\n os.unlink(file_path)\n elif os.path.isdir(file_path):\n shutil.rmtree(file_path)\n except Exception as e:\n print(f\"Failed to delete {file_path}. Reason: {e}\")\n\n # Crop the detected objects and save them to the output directory\n crop_images(annotated_image_path, json_output_path, output_dir, full_image_response)\n\n print(\"Processing complete. 
Check the output directory for the cropped images and results.\")\n\n\ndef query_image(img_path, text_queries, score_threshold):\n \"\"\"\n Processes the image using the Owlv2 model to detect objects based on text queries.\n\n Args:\n img_path (str): The path to the input image.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n \"\"\"\n load_owlv2_model()\n\n img = Image.open(img_path)\n text_queries = text_queries.split(\",\")\n\n size = max(img.size)\n target_sizes = torch.Tensor([[size, size]])\n inputs = owlv2_processor(text=text_queries, images=img, return_tensors=\"pt\").to(device)\n\n with torch.no_grad():\n outputs = owlv2_model(**inputs)\n\n outputs.logits = outputs.logits.cpu()\n outputs.pred_boxes = outputs.pred_boxes.cpu()\n results = owlv2_processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n boxes, scores, labels = results[0][\"boxes\"], results[0][\"scores\"], results[0][\"labels\"]\n\n result_labels = []\n img_pil = img.copy()\n draw = ImageDraw.Draw(img_pil)\n\n for box, score, label in zip(boxes, scores, labels):\n box = [int(i) for i in box.tolist()]\n if score < score_threshold:\n continue\n result_labels.append([box, text_queries[label.item()]])\n draw.rectangle(box, outline=\"red\", width=2)\n # Do not draw the text label within the bounding box\n # draw.text((box[0], box[1]), text_queries[label.item()], fill=\"red\")\n\n # Save the annotated image for debugging\n annotated_image_path = \"extensions/Lucid_Autonomy/annotated_image.png\"\n img_pil.save(annotated_image_path)\n\n # Save the JSON output to a file\n json_output_path = \"extensions/Lucid_Autonomy/output.json\"\n with open(json_output_path, 'w') as f:\n json.dump(result_labels, f)\n\n unload_owlv2_model()\n\n return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir=\"extensions/Lucid_Autonomy/ImageOutputTest\", full_image_response=\"\"):\n \"\"\"\n Crops out the detected objects from the input image based on the bounding box coordinates.\n\n Args:\n image_path (str): The path to the input image.\n json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n output_dir (str): The directory to save the cropped images (default is \"extensions/Lucid_Autonomy/ImageOutputTest\").\n full_image_response (str): The response from the vision model for the full image.\n \"\"\"\n with open(json_output_path, 'r') as f:\n data = json.load(f)\n\n img = Image.open(image_path)\n\n results = []\n\n # Load the miniCPM-lama3-v-2_5 model once before processing the images\n load_minicpm_llama_model()\n\n for i, item in enumerate(data):\n box = item[0]\n label = item[1]\n cropped_img = img.crop(box)\n cropped_img_path = f\"{output_dir}/{label}_{i}.png\"\n cropped_img.save(cropped_img_path)\n\n # Process the cropped image with the miniCPM-lama3-v-2_5 model\n vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n # Store the results in a structured format\n results.append({\n \"image_name\": f\"{label}_{i}.png\",\n \"coordinates\": box,\n \"description\": vision_model_response\n })\n\n # Unload the miniCPM-lama3-v-2_5 model after processing all images\n unload_minicpm_llama_model()\n\n # Save the results to a JSON file\n results_json_path = f\"{output_dir}/results.json\"\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n\n # Create the 
compressed results JSON file\n compressed_results_json_path = f\"{output_dir}/compressedresults.json\"\n compressed_results = [{\"image_name\": result[\"image_name\"], \"description\": result[\"description\"]} for result in results]\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n\n # Add the full image response to the results\n if full_image_response:\n results.append({\n \"image_name\": \"full_image.png\",\n \"coordinates\": [],\n \"description\": full_image_response\n })\n compressed_results.append({\n \"image_name\": \"full_image.png\",\n \"description\": full_image_response\n })\n\n # Save the updated results to the JSON files\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n\n\ndef process_with_vision_model(image_path, question):\n \"\"\"\n Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n Args:\n image_path (str): The path to the input image.\n question (str): The question to ask the vision model.\n\n Returns:\n str: The response from the vision model.\n \"\"\"\n image = Image.open(image_path).convert(\"RGB\")\n messages = [\n {\"role\": \"user\", \"content\": f\"{question}\"}\n ]\n # Define the generation arguments\n generation_args = {\n \"max_new_tokens\": 896,\n \"repetition_penalty\": 1.05,\n \"num_beams\": 3,\n \"top_p\": 0.8,\n \"top_k\": 1,\n \"temperature\": 0.7,\n \"sampling\": True,\n }\n if \"int4\" in \"openbmb/MiniCPM-Llama3-V-2_5\":\n # Disable streaming for the 4-bit model\n generation_args[\"stream\"] = False\n # Use the model.chat method with streaming enabled\n vision_model_response = \"\"\n for new_text in minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n ):\n vision_model_response += new_text\n print(new_text, flush=True, end='')\n else:\n minicpm_llama_model.to(device)\n vision_model_response = minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n )\n return vision_model_response\n\ndef ui():\n \"\"\"\n Creates custom gradio elements when the UI is launched.\n \"\"\"\n with gr.Blocks() as demo:\n with gr.Row():\n with gr.Column():\n monitor_index = gr.Textbox(label=\"Monitor Index\", value=str(MONITOR_INDEX))\n text_queries = gr.Textbox(label=\"Text Queries\", value=TEXT_QUERIES)\n score_threshold = gr.Textbox(label=\"Score Threshold\", value=str(SCORE_THRESHOLD))\n vision_model_question = gr.Textbox(label=\"Vision Model Question\", value=VISION_MODEL_QUESTION)\n take_screenshot_button = gr.Button(value=\"Take Screenshot\")\n\n take_screenshot_button.click(\n fn=take_screenshot,\n inputs=[monitor_index, text_queries, score_threshold, vision_model_question],\n outputs=None\n )\n\n return demo\n\ndef input_modifier(user_input, state):\n \"\"\"\n Modifies the user input before it is processed by the LLM.\n \"\"\"\n return user_input\n\ndef output_modifier(output, state):\n \"\"\"\n Modifies the LLM output before it is presented in the UI.\n \"\"\"\n global oob_tasks\n oob_tasks = []\n\n # Search for the \"Autonomy_Tasks:\" trigger phrase in the LLM's output\n if \"Autonomy_Tasks:\" in output:\n # Clean the chat history\n state[\"history\"] = history_modifier(state[\"history\"])\n # Extract tasks and execute them\n tasks_section = output.split(\"Autonomy_Tasks:\")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # 
Append new compressed results\n compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n\n # Write OOB tasks to a JSON file\n with open(oob_tasks_file, 'w') as f:\n json.dump(oob_tasks, f)\n\n # Trigger the execution of OOB tasks after a delay\n threading.Thread(target=execute_oob_tasks_with_delay).start()\n\n return output\n\n\n\noob_tasks = []\n\ndef execute_tasks(tasks_section):\n \"\"\"\n Executes the tasks listed in the tasks_section.\n \"\"\"\n global oob_tasks\n oob_tasks = []\n tasks = tasks_section.split(\"\\n\")\n for task in tasks:\n task = task.strip()\n if task.startswith(\"MouseMover =\"):\n image_name = task.split(\"=\")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith(\"MouseClick =\"):\n value = task.split(\"=\")[1].strip()\n if \",\" in value: # Check if the value contains coordinates\n coordinates = value\n left_click_coordinates_raw(coordinates)\n else: # Otherwise, treat it as an image name\n image_name = value\n left_click_image(image_name)\n elif task.startswith(\"TextInput =\"):\n text = task.split(\"=\")[1].strip()\n text_input(text)\n elif task.startswith(\"SpecialKey =\"):\n key = task.split(\"=\")[1].strip()\n special_key(key)\n elif task.startswith(\"FunctionKey =\"):\n key = task.split(\"=\")[1].strip()\n function_key(key)\n elif task.startswith(\"KeyCombo =\"):\n combo = task.split(\"=\")[1].strip()\n key_combo(combo)\n elif task.startswith(\"Delay =\"):\n delay = int(task.split(\"=\")[1].strip())\n delay_task(delay)\n elif task.startswith(\"MouseMove =\"):\n coordinates = task.split(\"=\")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task == \"TakeScreenshot\":\n take_screenshot_task()\n elif task.startswith(\"OOB_\"):\n oob_tasks.append(task)\n else:\n print(f\"Unknown task: {task}\")\n\n\ndef execute_oob_tasks_with_delay():\n \"\"\"\n Executes the saved OOB tasks after a delay.\n \"\"\"\n time.sleep(0.1) # Short delay to ensure the LLM's response has completed\n execute_oob_tasks()\n\ndef execute_oob_tasks():\n \"\"\"\n Executes the saved OOB tasks.\n \"\"\"\n with open(oob_tasks_file, 'r') as f:\n oob_tasks = json.load(f)\n\n for task in oob_tasks:\n task = task.strip()\n if task.startswith(\"OOB_MouseMover =\"):\n image_name = task.split(\"=\")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith(\"OOB_MouseClick =\"):\n value = task.split(\"=\")[1].strip()\n if \",\" in value: # Check if the value contains coordinates\n coordinates = value\n left_click_coordinates_raw(coordinates)\n else: # Otherwise, treat it as an image name\n image_name = value\n left_click_image(image_name)\n elif task.startswith(\"OOB_TextInput =\"):\n text = task.split(\"=\")[1].strip()\n text_input(text)\n elif task.startswith(\"OOB_SpecialKey =\"):\n key = task.split(\"=\")[1].strip()\n special_key(key)\n elif task.startswith(\"OOB_FunctionKey =\"):\n key = task.split(\"=\")[1].strip()\n function_key(key)\n elif task.startswith(\"OOB_KeyCombo =\"):\n combo = task.split(\"=\")[1].strip()\n key_combo(combo)\n elif task.startswith(\"OOB_Delay =\"):\n delay = int(task.split(\"=\")[1].strip())\n delay_task(delay)\n elif task.startswith(\"OOB_MouseMove =\"):\n coordinates = task.split(\"=\")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task.startswith(\"OOB_MouseClick =\"):\n coordinates = task.split(\"=\")[1].strip()\n 
left_click_coordinates_raw(coordinates)\n elif task == \"OOB_TakeScreenshot\":\n take_screenshot_task()\n else:\n print(f\"Unknown OOB task: {task}\")\n\n\n\n\ndef delay_task(delay):\n \"\"\"\n Adds a delay in milliseconds.\n \"\"\"\n time.sleep(delay / 1000.0)\n\ndef move_mouse_to_coordinates_raw(coordinates):\n \"\"\"\n Moves the mouse to the specified raw coordinates.\n \"\"\"\n x, y = map(int, coordinates.split(\",\"))\n pyautogui.moveTo(x, y)\n print(f\"Mouse moved to coordinates: ({x}, {y})\")\n\ndef left_click_coordinates_raw(coordinates):\n \"\"\"\n Moves the mouse to the specified raw coordinates and performs a left click.\n \"\"\"\n x, y = map(int, coordinates.split(\",\"))\n pyautogui.moveTo(x, y)\n pyautogui.click()\n print(f\"Mouse clicked at coordinates: ({x}, {y})\")\n\ndef take_screenshot_task():\n \"\"\"\n Takes a screenshot.\n \"\"\"\n take_screenshot(MONITOR_INDEX, TEXT_QUERIES, SCORE_THRESHOLD, VISION_MODEL_QUESTION)\n\n\ndef move_mouse_to_image(image_name):\n \"\"\"\n Moves the mouse to the center of the box that defines the image.\n \"\"\"\n results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/results.json\"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result[\"image_name\"] == image_name:\n coordinates = result[\"coordinates\"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n break\n else:\n print(f\"Image name '{image_name}' not found in the results.\")\n\ndef left_click_image(image_name):\n \"\"\"\n Moves the mouse to the center of the box that defines the image and performs a left click.\n \"\"\"\n results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/results.json\"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result[\"image_name\"] == image_name:\n coordinates = result[\"coordinates\"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n pyautogui.click() # Perform a left click\n break\n else:\n print(f\"Image name '{image_name}' not found in the results.\")\n\ndef move_mouse_to_coordinates(monitor, coordinates):\n \"\"\"\n Moves the mouse cursor to the specified coordinates on the monitor.\n\n Args:\n monitor (dict): A dictionary containing the monitor's information.\n coordinates (list): A list containing the coordinates [x1, y1, x2, y2].\n \"\"\"\n x1, y1, x2, y2 = coordinates\n monitor_x, monitor_y = monitor.x, monitor.y\n center_x = monitor_x + (x1 + x2) // 2\n center_y = monitor_y + (y1 + y2) // 2\n pyautogui.moveTo(center_x, center_y)\n print(f\"Mouse moved to coordinates: ({center_x}, {center_y})\")\n\ndef get_monitor_info(monitor_index):\n \"\"\"\n Gets information about the specified monitor.\n\n Args:\n monitor_index (int): The index of the monitor to get information about.\n\n Returns:\n dict: A dictionary containing the monitor's information.\n \"\"\"\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f\"Monitor index {monitor_index} is out of range. 
There are only {len(monitors)} monitors.\")\n\n return monitors[monitor_index]\n\ndef history_modifier(history):\n \"\"\"\n Modifies the chat history before the text generation in chat mode begins.\n \"\"\"\n # Clean the internal history\n history[\"internal\"] = clean_chat_history(history[\"internal\"])\n # Clean the visible history\n history[\"visible\"] = clean_chat_history(history[\"visible\"])\n\n # Inject the latest compressedresults.json into the internal history\n compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n latest_compressed_results = json.dumps(compressed_results, indent=4)\n history[\"internal\"].append([\"\", f\"<START_COMPRESSED_RESULTS>\\\\\\\\n{latest_compressed_results}\\\\\\\\n<END_COMPRESSED_RESULTS>\"])\n\n return history\n\ndef clean_chat_history(history):\n \"\"\"\n Cleans the chat history by removing all but the latest instance of the contents between the bookends.\n \"\"\"\n START_BOOKEND = \"<START_COMPRESSED_RESULTS>\"\n END_BOOKEND = \"<END_COMPRESSED_RESULTS>\"\n\n # Find all instances of the bookends\n bookend_indices = []\n for i, entry in enumerate(history):\n if isinstance(entry, list) and len(entry) == 2:\n if START_BOOKEND in entry[1] and END_BOOKEND in entry[1]:\n bookend_indices.append(i)\n\n # If there are multiple instances, remove all but the last one\n if len(bookend_indices) > 1:\n for i in bookend_indices[:-1]:\n history[i][1] = re.sub(f\"{START_BOOKEND}.*?{END_BOOKEND}\", \"\", history[i][1], flags=re.DOTALL)\n\n return history\n\ndef append_compressed_results(output, compressed_results):\n \"\"\"\n Appends the compressed results to the output.\n \"\"\"\n START_BOOKEND = \"<START_COMPRESSED_RESULTS>\"\n END_BOOKEND = \"<END_COMPRESSED_RESULTS>\"\n output += f\"\\\\\\\\n\\\\\\\\n{START_BOOKEND}\\\\\\\\n{json.dumps(compressed_results, indent=4)}\\\\\\\\n{END_BOOKEND}\"\n return output\n\ndef text_input(text):\n \"\"\"\n Types the given text, replacing '/n' with newline characters.\n \"\"\"\n text = text.replace(\"/n\", \"\\\\n\")\n pyautogui.typewrite(text)\n\ndef special_key(key):\n \"\"\"\n Presses the specified special key.\n \"\"\"\n if key == \"enter\":\n pyautogui.press(\"enter\")\n elif key == \"space\":\n pyautogui.press(\"space\")\n elif key == \"backspace\":\n pyautogui.press(\"backspace\")\n elif key == \"tab\":\n pyautogui.press(\"tab\")\n elif key == \"escape\":\n pyautogui.press(\"esc\")\n elif key == \"shift\":\n pyautogui.press(\"shift\")\n elif key == \"ctrl\":\n pyautogui.press(\"ctrl\")\n elif key == \"alt\":\n pyautogui.press(\"alt\")\n else:\n print(f\"Unsupported special key: {key}\")\n\ndef function_key(key):\n \"\"\"\n Presses the specified function key.\n \"\"\"\n if key.startswith(\"f\") and key[1:].isdigit():\n pyautogui.press(key)\n else:\n print(f\"Unsupported function key: {key}\")\n\ndef key_combo(combo):\n \"\"\"\n Presses the specified key combination.\n \"\"\"\n keys = combo.split(\"+\")\n for key in keys:\n if key in [\"ctrl\", \"alt\", \"shift\"]:\n pyautogui.keyDown(key)\n else:\n pyautogui.press(key)\n for key in keys:\n if key in [\"ctrl\", \"alt\", \"shift\"]:\n pyautogui.keyUp(key)\n\nif __name__ == \"__main__\":\n # Launch the Gradio interface with the specified UI components\n gr.Interface(\n fn=ui,\n inputs=None,\n outputs=\"ui\"\n ).launch()",
"Thank you for providing the detailed context and the current state of the code. The warning and error you're encountering are due to the `take_screenshot` function expecting an additional argument (`full_image_vision_model_question`) that is not being passed from the Gradio UI. Additionally, the `minicpm_llama_model` is not being loaded correctly, leading to the `AttributeError`.\n\nLet's address these issues step by step:\n\n### Step 1: Update the Gradio UI to Pass the Additional Argument\n\nWe need to update the Gradio UI to pass the `full_image_vision_model_question` argument to the `take_screenshot` function.\n\n```python\ndef ui():\n \"\"\"\n Creates custom gradio elements when the UI is launched.\n \"\"\"\n with gr.Blocks() as demo:\n with gr.Row():\n with gr.Column():\n monitor_index = gr.Textbox(label=\"Monitor Index\", value=str(MONITOR_INDEX))\n text_queries = gr.Textbox(label=\"Text Queries\", value=TEXT_QUERIES)\n score_threshold = gr.Textbox(label=\"Score Threshold\", value=str(SCORE_THRESHOLD))\n vision_model_question = gr.Textbox(label=\"Vision Model Question\", value=VISION_MODEL_QUESTION)\n full_image_vision_model_question = gr.Textbox(label=\"Full Image Vision Model Question\", value=FULL_IMAGE_VISION_MODEL_QUESTION)\n take_screenshot_button = gr.Button(value=\"Take Screenshot\")\n\n take_screenshot_button.click(\n fn=take_screenshot,\n inputs=[monitor_index, text_queries, score_threshold, vision_model_question, full_image_vision_model_question],\n outputs=None\n )\n\n return demo\n```\n\n### Step 2: Ensure the `minicpm_llama_model` is Loaded Correctly\n\nWe need to ensure that the `minicpm_llama_model` is loaded correctly before processing the full image.\n\n```python\ndef load_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if \"int4\" in VISION_MODEL_ID:\n # Load the 4-bit quantized model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={\"\": device},\n trust_remote_code=True,\n torch_dtype=torch.float16 # Use float16 as per the example code for 4-bit models\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print(\"MiniCPM-Llama3 4-bit quantized model loaded on-demand.\")\n else:\n # Load the standard model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={\"\": device}, # Use the specified CUDA device\n trust_remote_code=True,\n torch_dtype=torch.bfloat16\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print(\"MiniCPM-Llama3 standard model loaded on-demand.\")\n```\n\n### Step 3: Update the `take_screenshot` Function\n\nWe need to ensure that the `minicpm_llama_model` is loaded before processing the full image.\n\n```python\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, full_image_vision_model_question, output_filename=\"screenshot.png\"):\n \"\"\"\n Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n Args:\n monitor_index (int): The index of the monitor to capture.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n vision_model_question (str): The question to ask the vision model for cropped images.\n full_image_vision_model_question (str): The question to ask the vision model for the full image.\n output_filename (str): The 
filename to save the screenshot as (default is \"screenshot.png\").\n \"\"\"\n monitor_index = int(monitor_index)\n score_threshold = float(score_threshold)\n\n # Get information about all monitors\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f\"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.\")\n\n # Get the region of the specified monitor\n monitor = monitors[monitor_index]\n region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n # Take the screenshot\n output_dir = \"extensions/Lucid_Autonomy\"\n os.makedirs(output_dir, exist_ok=True)\n screenshot_path = os.path.join(output_dir, output_filename)\n screenshot = pyautogui.screenshot(region=region)\n screenshot.save(screenshot_path)\n print(f\"Screenshot saved as {screenshot_path}\")\n\n # Process the image using the Owlv2 model\n annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n # Load the miniCPM-lama3-v-2_5 model\n load_minicpm_llama_model()\n\n # Process the entire screenshot with the minicpm model\n full_image_response = process_with_vision_model(screenshot_path, full_image_vision_model_question)\n\n # Create the output directory if it doesn't exist\n output_dir = \"extensions/Lucid_Autonomy/ImageOutputTest\"\n os.makedirs(output_dir, exist_ok=True)\n\n # Clear the output directory\n for filename in os.listdir(output_dir):\n file_path = os.path.join(output_dir, filename)\n try:\n if os.path.isfile(file_path) or os.path.islink(file_path):\n os.unlink(file_path)\n elif os.path.isdir(file_path):\n shutil.rmtree(file_path)\n except Exception as e:\n print(f\"Failed to delete {file_path}. Reason: {e}\")\n\n # Crop the detected objects and save them to the output directory\n crop_images(annotated_image_path, json_output_path, output_dir, full_image_response)\n\n print(\"Processing complete. 
Check the output directory for the cropped images and results.\")\n```\n\n### Step 4: Update the `crop_images` Function\n\nWe need to ensure that the `crop_images` function includes the response from the full image in the results.\n\n```python\ndef crop_images(image_path, json_output_path, output_dir=\"extensions/Lucid_Autonomy/ImageOutputTest\", full_image_response=\"\"):\n \"\"\"\n Crops out the detected objects from the input image based on the bounding box coordinates.\n\n Args:\n image_path (str): The path to the input image.\n json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n output_dir (str): The directory to save the cropped images (default is \"extensions/Lucid_Autonomy/ImageOutputTest\").\n full_image_response (str): The response from the vision model for the full image.\n \"\"\"\n with open(json_output_path, 'r') as f:\n data = json.load(f)\n\n img = Image.open(image_path)\n\n results = []\n\n # Load the miniCPM-lama3-v-2_5 model once before processing the images\n load_minicpm_llama_model()\n\n for i, item in enumerate(data):\n box = item[0]\n label = item[1]\n cropped_img = img.crop(box)\n cropped_img_path = f\"{output_dir}/{label}_{i}.png\"\n cropped_img.save(cropped_img_path)\n\n # Process the cropped image with the miniCPM-lama3-v-2_5 model\n vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n # Store the results in a structured format\n results.append({\n \"image_name\": f\"{label}_{i}.png\",\n \"coordinates\": box,\n \"description\": vision_model_response\n })\n\n # Unload the miniCPM-lama3-v-2_5 model after processing all images\n unload_minicpm_llama_model()\n\n # Save the results to a JSON file\n results_json_path = f\"{output_dir}/results.json\"\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n\n # Create the compressed results JSON file\n compressed_results_json_path = f\"{output_dir}/compressedresults.json\"\n compressed_results = [{\"image_name\": result[\"image_name\"], \"description\": result[\"description\"]} for result in results]\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n\n # Add the full image response to the results\n if full_image_response:\n results.append({\n \"image_name\": \"full_image.png\",\n \"coordinates\": [],\n \"description\": full_image_response\n })\n compressed_results.append({\n \"image_name\": \"full_image.png\",\n \"description\": full_image_response\n })\n\n # Save the updated results to the JSON files\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n```\n\n### Putting it all together\n\nHere are the changes you need to make to the main code:\n\n1. **Update the Gradio UI to pass the additional argument**.\n2. **Ensure the `minicpm_llama_model` is loaded correctly**.\n3. **Update the `take_screenshot` function**.\n4. **Update the `crop_images` function**.\n\nYou can integrate these changes into your main code as needed. If you have any questions or need further assistance, feel free to ask!"
],
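A related point worth noting: the `AttributeError: 'NoneType' object has no attribute 'to'` occurs only because `process_with_vision_model` can run while `minicpm_llama_model` is still `None`. An alternative to loading the model explicitly at every call site is a lazy-load guard. The sketch below is a generic illustration of that pattern using stand-in names, not the extension's code:

```python
# Generic lazy-load guard (stand-in names, not the extension's code): load the
# model only on first use, so callers can never hit a None model and repeated
# calls cannot trigger a second load.
_model = None  # stands in for the global minicpm_llama_model

def load_model():
    print("loading model...")          # placeholder for AutoModel.from_pretrained(...)
    return object()

def ensure_model_loaded():
    global _model
    if _model is None:                 # first use: load
        _model = load_model()
    return _model

if __name__ == "__main__":
    ensure_model_loaded()              # loads once
    ensure_model_loaded()              # already resident, no second load
```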
[
"Okay we are close, I didn't get the warning, and the minicpm model did load. However, for some reason the model loaded twice, I think you forgot to unload the model.\n\n20:04:42-116365 INFO Starting Text generation web UI \n20:04:42-119123 WARNING trust_remote_code is enabled. This is dangerous. \n20:04:42-119715 INFO Loading settings from \"settings.yaml\" \n20:04:42-121943 INFO Loading the extension \"Lucid_Autonomy\" \n\nRunning on local URL: http://127.0.0.1:7861\n\nScreenshot saved as extensions/Lucid_Autonomy/screenshot.png\nOwlv2 model loaded.\nOwlv2 model unloaded.\n/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:577: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.\n warnings.warn(\nLoading checkpoint shards: 100%|██████████████████| 7/7 [01:32<00:00, 13.20s/it]\nMiniCPM-Llama3 standard model loaded on-demand.\nLoading checkpoint shards: 29%|█████▏ | 2/7 [00:01<00:03, 1.30it/s]\nTraceback (most recent call last):\n File \"/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/gradio/queueing.py\", line 527, in process_events\n",
"Thank you for the detailed feedback. It looks like the `minicpm_llama_model` is being loaded twice because it's not being unloaded after processing the full image. Let's ensure that the model is unloaded correctly after each use.\n\n### Step 1: Update the `take_screenshot` Function\n\nWe need to ensure that the `minicpm_llama_model` is unloaded after processing the full image.\n\n```python\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, full_image_vision_model_question, output_filename=\"screenshot.png\"):\n \"\"\"\n Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n Args:\n monitor_index (int): The index of the monitor to capture.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n vision_model_question (str): The question to ask the vision model for cropped images.\n full_image_vision_model_question (str): The question to ask the vision model for the full image.\n output_filename (str): The filename to save the screenshot as (default is \"screenshot.png\").\n \"\"\"\n monitor_index = int(monitor_index)\n score_threshold = float(score_threshold)\n\n # Get information about all monitors\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f\"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.\")\n\n # Get the region of the specified monitor\n monitor = monitors[monitor_index]\n region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n # Take the screenshot\n output_dir = \"extensions/Lucid_Autonomy\"\n os.makedirs(output_dir, exist_ok=True)\n screenshot_path = os.path.join(output_dir, output_filename)\n screenshot = pyautogui.screenshot(region=region)\n screenshot.save(screenshot_path)\n print(f\"Screenshot saved as {screenshot_path}\")\n\n # Process the image using the Owlv2 model\n annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n # Load the miniCPM-lama3-v-2_5 model\n load_minicpm_llama_model()\n\n # Process the entire screenshot with the minicpm model\n full_image_response = process_with_vision_model(screenshot_path, full_image_vision_model_question)\n\n # Unload the miniCPM-lama3-v-2_5 model after processing the full image\n unload_minicpm_llama_model()\n\n # Create the output directory if it doesn't exist\n output_dir = \"extensions/Lucid_Autonomy/ImageOutputTest\"\n os.makedirs(output_dir, exist_ok=True)\n\n # Clear the output directory\n for filename in os.listdir(output_dir):\n file_path = os.path.join(output_dir, filename)\n try:\n if os.path.isfile(file_path) or os.path.islink(file_path):\n os.unlink(file_path)\n elif os.path.isdir(file_path):\n shutil.rmtree(file_path)\n except Exception as e:\n print(f\"Failed to delete {file_path}. Reason: {e}\")\n\n # Crop the detected objects and save them to the output directory\n crop_images(annotated_image_path, json_output_path, output_dir, full_image_response)\n\n print(\"Processing complete. 
Check the output directory for the cropped images and results.\")\n```\n\n### Step 2: Update the `crop_images` Function\n\nWe need to ensure that the `minicpm_llama_model` is unloaded after processing all images.\n\n```python\ndef crop_images(image_path, json_output_path, output_dir=\"extensions/Lucid_Autonomy/ImageOutputTest\", full_image_response=\"\"):\n \"\"\"\n Crops out the detected objects from the input image based on the bounding box coordinates.\n\n Args:\n image_path (str): The path to the input image.\n json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n output_dir (str): The directory to save the cropped images (default is \"extensions/Lucid_Autonomy/ImageOutputTest\").\n full_image_response (str): The response from the vision model for the full image.\n \"\"\"\n with open(json_output_path, 'r') as f:\n data = json.load(f)\n\n img = Image.open(image_path)\n\n results = []\n\n # Load the miniCPM-lama3-v-2_5 model once before processing the images\n load_minicpm_llama_model()\n\n for i, item in enumerate(data):\n box = item[0]\n label = item[1]\n cropped_img = img.crop(box)\n cropped_img_path = f\"{output_dir}/{label}_{i}.png\"\n cropped_img.save(cropped_img_path)\n\n # Process the cropped image with the miniCPM-lama3-v-2_5 model\n vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n # Store the results in a structured format\n results.append({\n \"image_name\": f\"{label}_{i}.png\",\n \"coordinates\": box,\n \"description\": vision_model_response\n })\n\n # Unload the miniCPM-lama3-v-2_5 model after processing all images\n unload_minicpm_llama_model()\n\n # Save the results to a JSON file\n results_json_path = f\"{output_dir}/results.json\"\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n\n # Create the compressed results JSON file\n compressed_results_json_path = f\"{output_dir}/compressedresults.json\"\n compressed_results = [{\"image_name\": result[\"image_name\"], \"description\": result[\"description\"]} for result in results]\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n\n # Add the full image response to the results\n if full_image_response:\n results.append({\n \"image_name\": \"full_image.png\",\n \"coordinates\": [],\n \"description\": full_image_response\n })\n compressed_results.append({\n \"image_name\": \"full_image.png\",\n \"description\": full_image_response\n })\n\n # Save the updated results to the JSON files\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n```\n\n### Putting it all together\n\nHere are the changes you need to make to the main code:\n\n1. **Update the Gradio UI to pass the additional argument**.\n2. **Ensure the `minicpm_llama_model` is unloaded correctly**.\n3. **Update the `take_screenshot` function**.\n4. **Update the `crop_images` function**.\n\nYou can integrate these changes into your main code as needed. If you have any questions or need further assistance, feel free to ask!"
],
[
"Okay I made the changes and I can successfully press the take screenshot button manually with everything working, but when the LLM tries to take the screenshot I get this error:\n\n File \"/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/extensions/Lucid_Autonomy/script.py\", line 371, in output_modifier\n execute_tasks(tasks_section)\n File \"/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/extensions/Lucid_Autonomy/script.py\", line 430, in execute_tasks\n take_screenshot_task()\n File \"/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/extensions/Lucid_Autonomy/script.py\", line 520, in take_screenshot_task\n take_screenshot(MONITOR_INDEX, TEXT_QUERIES, SCORE_THRESHOLD, VISION_MODEL_QUESTION)\nTypeError: take_screenshot() missing 1 required positional argument: 'full_image_vision_model_question'\n\n\nLLM response:\n\nAutonomy_Tasks:\nTakeScreenshot",
"The error you're encountering is because the `take_screenshot` function now requires an additional argument (`full_image_vision_model_question`), but the `take_screenshot_task` function is not passing this argument.\n\nLet's update the `take_screenshot_task` function to include the `full_image_vision_model_question` argument.\n\n### Step 1: Update the `take_screenshot_task` Function\n\nWe need to update the `take_screenshot_task` function to pass the `full_image_vision_model_question` argument.\n\n```python\ndef take_screenshot_task():\n \"\"\"\n Takes a screenshot.\n \"\"\"\n take_screenshot(MONITOR_INDEX, TEXT_QUERIES, SCORE_THRESHOLD, VISION_MODEL_QUESTION, FULL_IMAGE_VISION_MODEL_QUESTION)\n```\n\n### Step 2: Update the `execute_tasks` Function\n\nWe need to ensure that the `execute_tasks` function correctly handles the `TakeScreenshot` task.\n\n```python\ndef execute_tasks(tasks_section):\n \"\"\"\n Executes the tasks listed in the tasks_section.\n \"\"\"\n global oob_tasks\n oob_tasks = []\n tasks = tasks_section.split(\"\\n\")\n for task in tasks:\n task = task.strip()\n if task.startswith(\"MouseMover =\"):\n image_name = task.split(\"=\")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith(\"MouseClick =\"):\n value = task.split(\"=\")[1].strip()\n if \",\" in value: # Check if the value contains coordinates\n coordinates = value\n left_click_coordinates_raw(coordinates)\n else: # Otherwise, treat it as an image name\n image_name = value\n left_click_image(image_name)\n elif task.startswith(\"TextInput =\"):\n text = task.split(\"=\")[1].strip()\n text_input(text)\n elif task.startswith(\"SpecialKey =\"):\n key = task.split(\"=\")[1].strip()\n special_key(key)\n elif task.startswith(\"FunctionKey =\"):\n key = task.split(\"=\")[1].strip()\n function_key(key)\n elif task.startswith(\"KeyCombo =\"):\n combo = task.split(\"=\")[1].strip()\n key_combo(combo)\n elif task.startswith(\"Delay =\"):\n delay = int(task.split(\"=\")[1].strip())\n delay_task(delay)\n elif task.startswith(\"MouseMove =\"):\n coordinates = task.split(\"=\")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task == \"TakeScreenshot\":\n take_screenshot_task()\n elif task.startswith(\"OOB_\"):\n oob_tasks.append(task)\n else:\n print(f\"Unknown task: {task}\")\n```\n\n### Putting it all together\n\nHere are the changes you need to make to the main code:\n\n1. **Update the `take_screenshot_task` function to pass the additional argument**.\n2. **Update the `execute_tasks` function to correctly handle the `TakeScreenshot` task**.\n\nYou can integrate these changes into your main code as needed. If you have any questions or need further assistance, feel free to ask!"
],
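The `execute_tasks` version above collects any task prefixed with `OOB_` into a global list; the summary later in this log says those tasks are saved to a JSON file and executed with a short delay after the LLM's response completes, but that code is not shown in this part of the conversation. The following is only a rough sketch under those assumptions; the file path and helper names are invented, and only `execute_tasks()` comes from the extension code above.

```python
import json
import threading

# Rough sketch only: OOB_TASKS_FILE, save_oob_tasks, and run_oob_tasks_later are
# invented names, not taken from the extension. Only execute_tasks() exists above.
OOB_TASKS_FILE = "extensions/Lucid_Autonomy/oob_tasks.json"

def save_oob_tasks(oob_tasks):
    """Persist the collected OOB_ tasks so they survive until execution."""
    with open(OOB_TASKS_FILE, "w") as f:
        json.dump(oob_tasks, f, indent=4)

def run_oob_tasks_later(delay_seconds=2.0):
    """Execute the saved OOB_ tasks shortly after the LLM response finishes."""
    def _run():
        with open(OOB_TASKS_FILE, "r") as f:
            tasks = json.load(f)
        # Strip the "OOB_" prefix and reuse the normal task executor.
        execute_tasks("\n".join(task[len("OOB_"):] for task in tasks))
    threading.Timer(delay_seconds, _run).start()
```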
[
"Okay great!! Really good job, can you write me a summary like the one I gave you at the beginning of our conversation so I can start a new conversation with you and add more features? Go over how the code works in fine detail, what it does, it's purpose, textgen, and anything else you thing is pertinent. ",
"Certainly! Here's a detailed summary of the current state of the Lucid_Autonomy extension, including its purpose, functionality, and how it works:\n\n---\n\n### Summary of Lucid_Autonomy Extension\n\n#### Introduction to Textgen\n\nTextgen is a powerful web-based interface for running large language models (LLMs) locally. It allows users to interact with various LLMs, providing a flexible and customizable environment for text generation tasks. The platform supports a wide range of models and offers numerous extensions to enhance its functionality.\n\n#### Lucid_Autonomy Extension Overview\n\nThe Lucid_Autonomy extension is designed to enable an LLM to interact autonomously with a computer using mouse and keyboard actions. This extension is part of the textgen ecosystem and aims to provide the LLM with \"eyes\" and \"hands\" to perform tasks such as moving the mouse to specific locations on the screen, clicking on elements, entering text, and more.\n\n#### Current Operation of the Extension\n\n1. **Initial Setup and Image Processing**:\n - The extension takes a screenshot of a specified monitor in a multi-monitor setup.\n - The screenshot is processed using the Owlv2 model to detect objects based on text queries.\n - The detected objects are cropped and saved to the `ImageOutputTest` directory.\n - The cropped images are processed with the MiniCPM-Llama3-V-2_5 model to generate descriptions.\n - The results are saved in a `results.json` file, containing the image name, coordinates, and description.\n - A `compressedresults.json` file is created, omitting the coordinates and containing only the image name and description.\n\n2. **LLM Interaction**:\n - The LLM can interact with the extension using the trigger phrase \"Autonomy_Tasks:\".\n - If the LLM sends the trigger phrase without any instructions, the extension appends the contents of the `compressedresults.json` file to the LLM's response.\n - If the LLM sends the trigger phrase with instructions (e.g., `MouseMover = clickable icons_8.png`), the extension executes the specified tasks and then appends the `compressedresults.json` contents to the LLM's response.\n\n3. **Task Execution**:\n - The extension can handle various tasks such as moving the mouse to specific locations, clicking on elements, entering text, pressing special keys, function keys, key combinations, adding delays, and taking screenshots.\n - The extension also supports \"Out of Body\" (OOB) commands, which are executed after the LLM's response is complete. These commands are saved to a JSON file and executed with a short delay.\n\n4. **Full Image Processing**:\n - The extension now processes the entire screenshot with the MiniCPM-Llama3-V-2_5 model using a separate question (`FULL_IMAGE_VISION_MODEL_QUESTION`).\n - The response from the full image processing is added to the `results.json` and `compressedresults.json` files.\n\n#### Ultimate Goal of the Extension\n\nThe ultimate goal of the Lucid_Autonomy extension is to enable a large language model (LLM) to interact autonomously with a computer. The LLM can perform tasks such as moving the mouse to specific locations on the screen, clicking on elements, entering text, and more. The extension allows the LLM to operate a computer autonomously, effectively giving the LLM \"eyes\" and \"hands\" to interact with the user interface.\n\n#### Future Enhancements\n\n1. 
**Additional Mouse and Keyboard Interactions**:\n - Implement more mouse actions and enable keyboard key entry using any combination of keys the LLM wants to use.\n\n2. **Take Screenshot Tool**:\n - Add a command for the LLM to use, similar to how the user can take a screenshot. This will allow the LLM to invoke a \"Take Screenshot\" command that it can add to the end of its task list.\n\n3. **Enhanced Task Execution**:\n - Add more complex tasks that the LLM can execute, such as filling out forms, navigating websites, and interacting with applications.\n\n4. **Improved LLM Integration**:\n - Enhance the integration between the LLM and the extension to support more sophisticated interactions and contextual understanding.\n\n5. **User Interface Enhancements**:\n - Improve the Gradio UI to provide more options and better visualization of the extension's functionality.\n\n6. **Error Handling and Robustness**:\n - Add error handling and robustness features to ensure the extension operates smoothly and handles edge cases gracefully.\n\n#### Conclusion\n\nThe Lucid_Autonomy extension is a powerful tool that enables an LLM to interact autonomously with a computer. The current implementation allows the LLM to take screenshots, process images, and move the mouse to specific locations. The ultimate goal is to expand the extension's capabilities to support a wide range of mouse and keyboard interactions, enabling the LLM to perform complex tasks autonomously.\n\n---\n\nThis summary provides an overview of the current state of the Lucid_Autonomy extension, its purpose, functionality, and future enhancements. You can use this summary to start a new conversation and add more features as needed. If you have any questions or need further assistance, feel free to ask!"
]
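For reference, here is a rough illustration of the entry shapes that `crop_images` writes to `results.json` and `compressedresults.json`, based on the code earlier in this log; the specific file names, coordinates, and descriptions are invented values.

```python
# Illustrative only: entry shapes written by crop_images(); the values are made up.
example_results = [
    {
        "image_name": "clickable icons_8.png",         # f"{label}_{i}.png"
        "coordinates": [742, 118, 801, 176],            # Owlv2 box [x1, y1, x2, y2]
        "description": "A gear-shaped settings icon."   # MiniCPM-Llama3-V-2_5 answer
    },
    {
        "image_name": "full_image.png",                 # whole-screenshot entry
        "coordinates": [],
        "description": "A browser window showing a search results page."
    }
]

# compressedresults.json keeps only the name and description.
example_compressed = [
    {"image_name": r["image_name"], "description": r["description"]}
    for r in example_results
]
```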
],
"visible": [
[
"",
"How can I help you today?"
],
[
"Hello, how are you today?",
"I'm functioning well, thank you for asking! I'm here and ready to assist you. How about you? How can I make your day better?"
],
[
"We are going to go on a wild ride together :3 Here is a summary from a past conversation I've had with you:\n\nCertainly! Here's a summary of the current state of the Lucid_Autonomy extension, including its purpose, functionality, and future enhancements:\n\n---\n\n### Summary of Lucid_Autonomy Extension\n\n#### Introduction to Textgen\n\nTextgen is a powerful web-based interface for running large language models (LLMs) locally. It allows users to interact with various LLMs, providing a flexible and customizable environment for text generation tasks. The platform supports a wide range of models and offers numerous extensions to enhance its functionality.\n\n#### Lucid_Autonomy Extension Overview\n\nThe Lucid_Autonomy extension is designed to enable an LLM to interact autonomously with a computer using mouse and keyboard actions. This extension is part of the textgen ecosystem and aims to provide the LLM with "eyes" and "hands" to perform tasks such as moving the mouse to specific locations on the screen, clicking on elements, and entering text.\n\n#### Current Operation of the Extension\n\n1. **Initial Setup and Image Processing**:\n - The extension takes a screenshot of a specified monitor.\n - The screenshot is processed using the Owlv2 model to detect objects based on text queries.\n - The detected objects are cropped and saved to the `ImageOutputTest` directory.\n - The cropped images are processed using the MiniCPM-Llama3-V-2_5 model to generate descriptions.\n - The results are saved in a `results.json` file, which includes the image name, coordinates, and description.\n - A `compressedresults.json` file is created, omitting the coordinates and containing only the image name and description.\n\n2. **LLM Interaction**:\n - The LLM can interact with the extension using the trigger phrase "Autonomy_Tasks:".\n - If the LLM sends the trigger phrase without any instructions, the extension appends the contents of the `compressedresults.json` file to the LLM's response.\n - If the LLM sends the trigger phrase with instructions (e.g., `MouseMover = clickable icons_8.png`), the extension executes the specified tasks and then appends the `compressedresults.json` contents to the LLM's response.\n\n3. **Task Execution**:\n - The extension can execute tasks such as moving the mouse to specific locations based on the coordinates in the `results.json` file.\n - The `compressedresults.json` file is appended to the LLM's response to provide context about the images.\n\n4. **History Management**:\n - The extension cleans the chat history to ensure that only the latest instance of the `compressedresults.json` contents is preserved in both the "internal" and "visible" histories.\n - The latest `compressedresults.json` contents are injected into the "internal" history to maintain context for the LLM.\n\n#### Ultimate Goal of the Extension\n\nThe ultimate goal of the Lucid_Autonomy extension is to enable a large language model (LLM) to interact autonomously with a computer. The LLM can perform tasks such as moving the mouse to specific locations on the screen, clicking on elements, and entering text. The extension allows the LLM to operate a computer autonomously, effectively giving the LLM "eyes" and "hands" to interact with the user interface.\n\n#### Future Enhancements\n\n1. 
**Additional Mouse and Keyboard Interactions**:\n - Implement more mouse actions such as right click, left double click, left triple click, and scroll wheel (or page up and down).\n - Enable keyboard key entry using any combination of keys the LLM wants to use.\n\n2. **Take Screenshot Tool**:\n - Add a "Take Screenshot" tool for the LLM to use, similar to how the user can take a screenshot. This will allow the LLM to invoke a "Take Screenshot" command that it can add to the end of a list of commands, enabling the LLM to see what a new page looks like.\n\n3. **Enhanced Task Execution**:\n - Add more complex tasks that the LLM can execute, such as filling out forms, navigating websites, and interacting with applications.\n\n4. **Improved LLM Integration**:\n - Enhance the integration between the LLM and the extension to support more sophisticated interactions and contextual understanding.\n\n5. **User Interface Enhancements**:\n - Improve the Gradio UI to provide more options and better visualization of the extension's functionality.\n\n6. **Error Handling and Robustness**:\n - Add error handling and robustness features to ensure the extension operates smoothly and handles edge cases gracefully.\n\n#### Conclusion\n\nThe Lucid_Autonomy extension is a powerful tool that enables an LLM to interact autonomously with a computer. The current implementation allows the LLM to take screenshots, process images, and move the mouse to specific locations. The ultimate goal is to expand the extension's capabilities to support a wide range of mouse and keyboard interactions, enabling the LLM to perform complex tasks autonomously.\n\n---\n\nThis summary provides an overview of the current state of the Lucid_Autonomy extension, its purpose, functionality, and future enhancements. You can use this summary and the current extension code to start a new conversation where we will begin to add new tools for the LLM to use.",
"Thank you for providing that detailed summary. It looks like the Lucid_Autonomy extension is already quite advanced in its capabilities. To start our wild ride, let's prioritize the future enhancements you've outlined. Which feature would you like to tackle first? Here are a few options based on your summary:\n\n1. **Additional Mouse and Keyboard Interactions**: We can start by implementing more mouse actions and enabling keyboard key entry.\n2. **Take Screenshot Tool**: We can add a command for the LLM to take screenshots autonomously.\n3. **Enhanced Task Execution**: We can begin to define and implement more complex tasks like form filling or web navigation.\n\nPlease choose the feature you'd like to work on first, or let me know if you have another idea in mind!"
],
[
"I'd like to work on the addition of a mouse left click tool; I can give you the code here:\n\nimport gradio as gr\nimport json\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport torch\nimport os\nimport shutil\nimport re # Import the re module for regular expression operations\n\n# Configurable variables\nMONITOR_INDEX = 1 # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nTEXT_QUERIES = "clickable hyperlinks, clickable icons, text bar field" # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.20 # Score threshold for filtering low-probability predictions\nVISION_MODEL_QUESTION = "What is the text in this image or what does this image describe?"\n\n# Use GPU if available\nif torch.cuda.is_available():\n device = torch.device("cuda")\nelse:\n device = torch.device("cpu")\n\n# Global variables for models\nowlv2_model = None\nowlv2_processor = None\nminicpm_llama_model = None\nminicpm_llama_tokenizer = None\n\nVISION_MODEL_ID = "/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5"\n\ndef load_owlv2_model():\n global owlv2_model, owlv2_processor\n owlv2_model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble", trust_remote_code=True).to(device)\n owlv2_processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble", trust_remote_code=True)\n print("Owlv2 model loaded.")\n\ndef unload_owlv2_model():\n global owlv2_model, owlv2_processor\n if owlv2_model is not None:\n del owlv2_model\n del owlv2_processor\n torch.cuda.empty_cache()\n print("Owlv2 model unloaded.")\n\ndef load_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if "int4" in VISION_MODEL_ID:\n # Load the 4-bit quantized model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={"": device},\n trust_remote_code=True,\n torch_dtype=torch.float16 # Use float16 as per the example code for 4-bit models\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print("MiniCPM-Llama3 4-bit quantized model loaded on-demand.")\n else:\n # Load the standard model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={"": device}, # Use the specified CUDA device\n trust_remote_code=True,\n torch_dtype=torch.bfloat16\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print("MiniCPM-Llama3 standard model loaded on-demand.")\n\ndef unload_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if minicpm_llama_model is not None:\n del minicpm_llama_model\n del minicpm_llama_tokenizer\n torch.cuda.empty_cache()\n print("MiniCPM-Llama3 model unloaded.")\n\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, output_filename="screenshot.png"):\n """\n Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n Args:\n monitor_index (int): The index of the monitor to capture.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n vision_model_question (str): The question to ask the vision model.\n output_filename (str): The filename to save the screenshot as (default is "screenshot.png").\n """\n 
monitor_index = int(monitor_index)\n score_threshold = float(score_threshold)\n\n # Get information about all monitors\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.")\n\n # Get the region of the specified monitor\n monitor = monitors[monitor_index]\n region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n # Take the screenshot\n output_dir = "extensions/Lucid_Autonomy"\n os.makedirs(output_dir, exist_ok=True)\n screenshot_path = os.path.join(output_dir, output_filename)\n screenshot = pyautogui.screenshot(region=region)\n screenshot.save(screenshot_path)\n print(f"Screenshot saved as {screenshot_path}")\n\n # Process the image using the Owlv2 model\n annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n # Create the output directory if it doesn't exist\n output_dir = "extensions/Lucid_Autonomy/ImageOutputTest"\n os.makedirs(output_dir, exist_ok=True)\n\n # Clear the output directory\n for filename in os.listdir(output_dir):\n file_path = os.path.join(output_dir, filename)\n try:\n if os.path.isfile(file_path) or os.path.islink(file_path):\n os.unlink(file_path)\n elif os.path.isdir(file_path):\n shutil.rmtree(file_path)\n except Exception as e:\n print(f"Failed to delete {file_path}. Reason: {e}")\n\n # Crop the detected objects and save them to the output directory\n crop_images(annotated_image_path, json_output_path, output_dir)\n\n print("Processing complete. Check the output directory for the cropped images and results.")\n\ndef query_image(img_path, text_queries, score_threshold):\n """\n Processes the image using the Owlv2 model to detect objects based on text queries.\n\n Args:\n img_path (str): The path to the input image.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n """\n load_owlv2_model()\n\n img = Image.open(img_path)\n text_queries = text_queries.split(",")\n\n size = max(img.size)\n target_sizes = torch.Tensor([[size, size]])\n inputs = owlv2_processor(text=text_queries, images=img, return_tensors="pt").to(device)\n\n with torch.no_grad():\n outputs = owlv2_model(**inputs)\n\n outputs.logits = outputs.logits.cpu()\n outputs.pred_boxes = outputs.pred_boxes.cpu()\n results = owlv2_processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n boxes, scores, labels = results[0]["boxes"], results[0]["scores"], results[0]["labels"]\n\n result_labels = []\n img_pil = img.copy()\n draw = ImageDraw.Draw(img_pil)\n\n for box, score, label in zip(boxes, scores, labels):\n box = [int(i) for i in box.tolist()]\n if score < score_threshold:\n continue\n result_labels.append([box, text_queries[label.item()]])\n draw.rectangle(box, outline="red", width=2)\n # Do not draw the text label within the bounding box\n # draw.text((box[0], box[1]), text_queries[label.item()], fill="red")\n\n # Save the annotated image for debugging\n annotated_image_path = "extensions/Lucid_Autonomy/annotated_image.png"\n img_pil.save(annotated_image_path)\n\n # Save the JSON output to a file\n json_output_path = "extensions/Lucid_Autonomy/output.json"\n with open(json_output_path, 'w') as f:\n json.dump(result_labels, f)\n\n unload_owlv2_model()\n\n return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, 
output_dir="extensions/Lucid_Autonomy/ImageOutputTest"):\n """\n Crops out the detected objects from the input image based on the bounding box coordinates.\n\n Args:\n image_path (str): The path to the input image.\n json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n output_dir (str): The directory to save the cropped images (default is "extensions/Lucid_Autonomy/ImageOutputTest").\n """\n with open(json_output_path, 'r') as f:\n data = json.load(f)\n\n img = Image.open(image_path)\n\n results = []\n\n # Load the miniCPM-lama3-v-2_5 model once before processing the images\n load_minicpm_llama_model()\n\n for i, item in enumerate(data):\n box = item[0]\n label = item[1]\n cropped_img = img.crop(box)\n cropped_img_path = f"{output_dir}/{label}_{i}.png"\n cropped_img.save(cropped_img_path)\n\n # Process the cropped image with the miniCPM-lama3-v-2_5 model\n vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n # Store the results in a structured format\n results.append({\n "image_name": f"{label}_{i}.png",\n "coordinates": box,\n "description": vision_model_response\n })\n\n # Unload the miniCPM-lama3-v-2_5 model after processing all images\n unload_minicpm_llama_model()\n\n # Save the results to a JSON file\n results_json_path = f"{output_dir}/results.json"\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n\n # Create the compressed results JSON file\n compressed_results_json_path = f"{output_dir}/compressedresults.json"\n compressed_results = [{"image_name": result["image_name"], "description": result["description"]} for result in results]\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n """\n Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n Args:\n image_path (str): The path to the input image.\n question (str): The question to ask the vision model.\n\n Returns:\n str: The response from the vision model.\n """\n image = Image.open(image_path).convert("RGB")\n messages = [\n {"role": "user", "content": f"{question}"}\n ]\n # Define the generation arguments\n generation_args = {\n "max_new_tokens": 896,\n "repetition_penalty": 1.05,\n "num_beams": 3,\n "top_p": 0.8,\n "top_k": 1,\n "temperature": 0.7,\n "sampling": True,\n }\n if "int4" in "openbmb/MiniCPM-Llama3-V-2_5":\n # Disable streaming for the 4-bit model\n generation_args["stream"] = False\n # Use the model.chat method with streaming enabled\n vision_model_response = ""\n for new_text in minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n ):\n vision_model_response += new_text\n print(new_text, flush=True, end='')\n else:\n minicpm_llama_model.to(device)\n vision_model_response = minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n )\n return vision_model_response\n\ndef ui():\n """\n Creates custom gradio elements when the UI is launched.\n """\n with gr.Blocks() as demo:\n with gr.Row():\n with gr.Column():\n monitor_index = gr.Textbox(label="Monitor Index", value=str(MONITOR_INDEX))\n text_queries = gr.Textbox(label="Text Queries", value=TEXT_QUERIES)\n score_threshold = gr.Textbox(label="Score Threshold", value=str(SCORE_THRESHOLD))\n vision_model_question = gr.Textbox(label="Vision Model Question", value=VISION_MODEL_QUESTION)\n take_screenshot_button = 
gr.Button(value="Take Screenshot")\n\n take_screenshot_button.click(\n fn=take_screenshot,\n inputs=[monitor_index, text_queries, score_threshold, vision_model_question],\n outputs=None\n )\n\n return demo\n\ndef input_modifier(user_input, state):\n """\n Modifies the user input before it is processed by the LLM.\n """\n return user_input\n\ndef output_modifier(output, state):\n """\n Modifies the LLM output before it is presented in the UI.\n """\n # Search for the "Autonomy_Tasks:" trigger phrase in the LLM's output\n if "Autonomy_Tasks:" in output:\n # Clean the chat history\n state["history"] = history_modifier(state["history"])\n # Extract tasks and execute them\n tasks_section = output.split("Autonomy_Tasks:")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n return output\n\ndef execute_tasks(tasks_section):\n """\n Executes the tasks listed in the tasks_section.\n """\n tasks = tasks_section.split("\\\\\\\\n")\n for task in tasks:\n if task.startswith("MouseMover ="):\n image_name = task.split("=")[1].strip()\n move_mouse_to_image(image_name)\n\ndef move_mouse_to_image(image_name):\n """\n Moves the mouse to the center of the box that defines the image.\n """\n results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/results.json"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result["image_name"] == image_name:\n coordinates = result["coordinates"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n break\n else:\n print(f"Image name '{image_name}' not found in the results.")\n\ndef move_mouse_to_coordinates(monitor, coordinates):\n """\n Moves the mouse cursor to the specified coordinates on the monitor.\n\n Args:\n monitor (dict): A dictionary containing the monitor's information.\n coordinates (list): A list containing the coordinates [x1, y1, x2, y2].\n """\n x1, y1, x2, y2 = coordinates\n monitor_x, monitor_y = monitor.x, monitor.y\n center_x = monitor_x + (x1 + x2) // 2\n center_y = monitor_y + (y1 + y2) // 2\n pyautogui.moveTo(center_x, center_y)\n print(f"Mouse moved to coordinates: ({center_x}, {center_y})")\n\ndef get_monitor_info(monitor_index):\n """\n Gets information about the specified monitor.\n\n Args:\n monitor_index (int): The index of the monitor to get information about.\n\n Returns:\n dict: A dictionary containing the monitor's information.\n """\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f"Monitor index {monitor_index} is out of range. 
There are only {len(monitors)} monitors.")\n\n return monitors[monitor_index]\n\ndef history_modifier(history):\n """\n Modifies the chat history before the text generation in chat mode begins.\n """\n # Clean the internal history\n history["internal"] = clean_chat_history(history["internal"])\n # Clean the visible history\n history["visible"] = clean_chat_history(history["visible"])\n\n # Inject the latest compressedresults.json into the internal history\n compressed_results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n latest_compressed_results = json.dumps(compressed_results, indent=4)\n history["internal"].append(["", f"<START_COMPRESSED_RESULTS>\\\\n{latest_compressed_results}\\\\n<END_COMPRESSED_RESULTS>"])\n\n return history\n\ndef clean_chat_history(history):\n """\n Cleans the chat history by removing all but the latest instance of the contents between the bookends.\n """\n START_BOOKEND = "<START_COMPRESSED_RESULTS>"\n END_BOOKEND = "<END_COMPRESSED_RESULTS>"\n\n # Find all instances of the bookends\n bookend_indices = []\n for i, entry in enumerate(history):\n if isinstance(entry, list) and len(entry) == 2:\n if START_BOOKEND in entry[1] and END_BOOKEND in entry[1]:\n bookend_indices.append(i)\n\n # If there are multiple instances, remove all but the last one\n if len(bookend_indices) > 1:\n for i in bookend_indices[:-1]:\n history[i][1] = re.sub(f"{START_BOOKEND}.*?{END_BOOKEND}", "", history[i][1], flags=re.DOTALL)\n\n return history\n\ndef append_compressed_results(output, compressed_results):\n """\n Appends the compressed results to the output.\n """\n START_BOOKEND = "<START_COMPRESSED_RESULTS>"\n END_BOOKEND = "<END_COMPRESSED_RESULTS>"\n output += f"\\\\n\\\\n{START_BOOKEND}\\\\n{json.dumps(compressed_results, indent=4)}\\\\n{END_BOOKEND}"\n return output\n\nif __name__ == "__main__":\n # Launch the Gradio interface with the specified UI components\n gr.Interface(\n fn=ui,\n inputs=None,\n outputs="ui"\n ).launch()\n\n\nDon't begin writing any code just yet, I want to make sure you can contextualize everything I've given you thus far.",
"Thank you for providing the detailed code and context. I've reviewed the code and the summary you provided. Here's a brief contextualization to ensure we're on the same page:\n\n1. **Lucid_Autonomy Extension Overview**:\n - The extension enables an LLM to interact autonomously with a computer using mouse and keyboard actions.\n - It processes screenshots, detects objects, and allows the LLM to perform tasks like moving the mouse and clicking on elements.\n\n2. **Current Operation**:\n - The extension takes a screenshot, processes it using the Owlv2 model to detect objects, and then uses the MiniCPM-Llama3-V-2_5 model to generate descriptions of the detected objects.\n - The LLM can interact with the extension using the trigger phrase "Autonomy_Tasks:" to execute tasks like moving the mouse.\n\n3. **Future Enhancements**:\n - Additional mouse and keyboard interactions.\n - A "Take Screenshot" tool for the LLM.\n - Enhanced task execution.\n - Improved LLM integration.\n - User interface enhancements.\n - Error handling and robustness.\n\n4. **Code Provided**:\n - The code includes functions for taking screenshots, processing images, detecting objects, and moving the mouse.\n - It also includes a Gradio UI for user interaction.\n\nGiven this context, you've mentioned that you'd like to work on adding a mouse left click tool. To proceed, we'll need to:\n\n1. **Modify the `execute_tasks` function** to handle left click tasks.\n2. **Create a new function `left_click_image`** that performs the left click action at the specified coordinates.\n3. **Update the `move_mouse_to_image` function** to call the `left_click_image` function after moving the mouse.\n\nDoes that sound like the right approach to you? If so, I can help you draft the code for these changes. If you have any other ideas or preferences, please let me know!"
],
[
"Do you know how the LLM formats its response to use the tools? It formats the response like this:\n```\nAutonomy Tasks:\n\nMouseMover = clickable icons_8.png\n```\nI'd like to do your suggestion, but I'd also like to add some error handling, like if the LLM outputs text like this:\n```\nAutonomy Tasks:\n\nMouseMover = clickable icons_8.png\n\n this is the icon you requested\n```\nthe extension will continue to function. ",
"Got it! We'll need to ensure that the extension can handle the LLM's formatted response correctly and add error handling to manage any extraneous text. Here's a plan to achieve this:\n\n1. **Modify the `execute_tasks` function** to handle left click tasks and parse the LLM's response correctly.\n2. **Create a new function `left_click_image`** that performs the left click action at the specified coordinates.\n3. **Update the `move_mouse_to_image` function** to call the `left_click_image` function after moving the mouse.\n4. **Add error handling** to ensure the extension continues to function even if the LLM's response contains extraneous text.\n\nLet's start by drafting the code for these changes:\n\n### Step 1: Modify the `execute_tasks` function\n\nWe'll update the `execute_tasks` function to handle left click tasks and parse the LLM's response correctly.\n\n```python\ndef execute_tasks(tasks_section):\n """\n Executes the tasks listed in the tasks_section.\n """\n tasks = tasks_section.split("\\n")\n for task in tasks:\n task = task.strip()\n if task.startswith("MouseMover ="):\n image_name = task.split("=")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith("MouseClick ="):\n image_name = task.split("=")[1].strip()\n left_click_image(image_name)\n```\n\n### Step 2: Create the `left_click_image` function\n\nWe'll create a new function `left_click_image` that performs the left click action at the specified coordinates.\n\n```python\ndef left_click_image(image_name):\n """\n Moves the mouse to the center of the box that defines the image and performs a left click.\n """\n results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/results.json"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result["image_name"] == image_name:\n coordinates = result["coordinates"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n pyautogui.click() # Perform a left click\n break\n else:\n print(f"Image name '{image_name}' not found in the results.")\n```\n\n### Step 3: Update the `move_mouse_to_image` function\n\nWe'll update the `move_mouse_to_image` function to call the `left_click_image` function after moving the mouse.\n\n```python\ndef move_mouse_to_image(image_name):\n """\n Moves the mouse to the center of the box that defines the image.\n """\n results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/results.json"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result["image_name"] == image_name:\n coordinates = result["coordinates"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n break\n else:\n print(f"Image name '{image_name}' not found in the results.")\n```\n\n### Step 4: Add error handling\n\nWe'll add error handling to ensure the extension continues to function even if the LLM's response contains extraneous text.\n\n```python\ndef execute_tasks(tasks_section):\n """\n Executes the tasks listed in the tasks_section.\n """\n tasks = tasks_section.split("\\n")\n for task in tasks:\n task = task.strip()\n if task.startswith("MouseMover ="):\n image_name = task.split("=")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith("MouseClick ="):\n image_name = task.split("=")[1].strip()\n left_click_image(image_name)\n else:\n print(f"Unknown task: {task}")\n```\n\n### Putting it all together\n\nHere's the updated code with all the changes:\n\n```python\nimport gradio 
as gr\nimport json\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport torch\nimport os\nimport shutil\nimport re # Import the re module for regular expression operations\n\n# Configurable variables\nMONITOR_INDEX = 1 # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nTEXT_QUERIES = "clickable hyperlinks, clickable icons, text bar field" # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.20 # Score threshold for filtering low-probability predictions\nVISION_MODEL_QUESTION = "What is the text in this image or what does this image describe?"\n\n# Use GPU if available\nif torch.cuda.is_available():\n device = torch.device("cuda")\nelse:\n device = torch.device("cpu")\n\n# Global variables for models\nowlv2_model = None\nowlv2_processor = None\nminicpm_llama_model = None\nminicpm_llama_tokenizer = None\n\nVISION_MODEL_ID = "/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5"\n\ndef load_owlv2_model():\n global owlv2_model, owlv2_processor\n owlv2_model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble", trust_remote_code=True).to(device)\n owlv2_processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble", trust_remote_code=True)\n print("Owlv2 model loaded.")\n\ndef unload_owlv2_model():\n global owlv2_model, owlv2_processor\n if owlv2_model is not None:\n del owlv2_model\n del owlv2_processor\n torch.cuda.empty_cache()\n print("Owlv2 model unloaded.")\n\ndef load_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if "int4" in VISION_MODEL_ID:\n # Load the 4-bit quantized model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={"": device},\n trust_remote_code=True,\n torch_dtype=torch.float16 # Use float16 as per the example code for 4-bit models\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print("MiniCPM-Llama3 4-bit quantized model loaded on-demand.")\n else:\n # Load the standard model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={"": device}, # Use the specified CUDA device\n trust_remote_code=True,\n torch_dtype=torch.bfloat16\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print("MiniCPM-Llama3 standard model loaded on-demand.")\n\ndef unload_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if minicpm_llama_model is not None:\n del minicpm_llama_model\n del minicpm_llama_tokenizer\n torch.cuda.empty_cache()\n print("MiniCPM-Llama3 model unloaded.")\n\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, output_filename="screenshot.png"):\n """\n Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n Args:\n monitor_index (int): The index of the monitor to capture.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n vision_model_question (str): The question to ask the vision model.\n output_filename (str): The filename to save the screenshot as (default is "screenshot.png").\n """\n monitor_index = int(monitor_index)\n score_threshold = float(score_threshold)\n\n # Get information about all 
monitors\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.")\n\n # Get the region of the specified monitor\n monitor = monitors[monitor_index]\n region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n # Take the screenshot\n output_dir = "extensions/Lucid_Autonomy"\n os.makedirs(output_dir, exist_ok=True)\n screenshot_path = os.path.join(output_dir, output_filename)\n screenshot = pyautogui.screenshot(region=region)\n screenshot.save(screenshot_path)\n print(f"Screenshot saved as {screenshot_path}")\n\n # Process the image using the Owlv2 model\n annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n # Create the output directory if it doesn't exist\n output_dir = "extensions/Lucid_Autonomy/ImageOutputTest"\n os.makedirs(output_dir, exist_ok=True)\n\n # Clear the output directory\n for filename in os.listdir(output_dir):\n file_path = os.path.join(output_dir, filename)\n try:\n if os.path.isfile(file_path) or os.path.islink(file_path):\n os.unlink(file_path)\n elif os.path.isdir(file_path):\n shutil.rmtree(file_path)\n except Exception as e:\n print(f"Failed to delete {file_path}. Reason: {e}")\n\n # Crop the detected objects and save them to the output directory\n crop_images(annotated_image_path, json_output_path, output_dir)\n\n print("Processing complete. Check the output directory for the cropped images and results.")\n\ndef query_image(img_path, text_queries, score_threshold):\n """\n Processes the image using the Owlv2 model to detect objects based on text queries.\n\n Args:\n img_path (str): The path to the input image.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n """\n load_owlv2_model()\n\n img = Image.open(img_path)\n text_queries = text_queries.split(",")\n\n size = max(img.size)\n target_sizes = torch.Tensor([[size, size]])\n inputs = owlv2_processor(text=text_queries, images=img, return_tensors="pt").to(device)\n\n with torch.no_grad():\n outputs = owlv2_model(**inputs)\n\n outputs.logits = outputs.logits.cpu()\n outputs.pred_boxes = outputs.pred_boxes.cpu()\n results = owlv2_processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n boxes, scores, labels = results[0]["boxes"], results[0]["scores"], results[0]["labels"]\n\n result_labels = []\n img_pil = img.copy()\n draw = ImageDraw.Draw(img_pil)\n\n for box, score, label in zip(boxes, scores, labels):\n box = [int(i) for i in box.tolist()]\n if score < score_threshold:\n continue\n result_labels.append([box, text_queries[label.item()]])\n draw.rectangle(box, outline="red", width=2)\n # Do not draw the text label within the bounding box\n # draw.text((box[0], box[1]), text_queries[label.item()], fill="red")\n\n # Save the annotated image for debugging\n annotated_image_path = "extensions/Lucid_Autonomy/annotated_image.png"\n img_pil.save(annotated_image_path)\n\n # Save the JSON output to a file\n json_output_path = "extensions/Lucid_Autonomy/output.json"\n with open(json_output_path, 'w') as f:\n json.dump(result_labels, f)\n\n unload_owlv2_model()\n\n return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir="extensions/Lucid_Autonomy/ImageOutputTest"):\n """\n Crops out the detected objects from the input image based on the bounding box 
coordinates.\n\n Args:\n image_path (str): The path to the input image.\n json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n output_dir (str): The directory to save the cropped images (default is "extensions/Lucid_Autonomy/ImageOutputTest").\n """\n with open(json_output_path, 'r') as f:\n data = json.load(f)\n\n img = Image.open(image_path)\n\n results = []\n\n # Load the miniCPM-lama3-v-2_5 model once before processing the images\n load_minicpm_llama_model()\n\n for i, item in enumerate(data):\n box = item[0]\n label = item[1]\n cropped_img = img.crop(box)\n cropped_img_path = f"{output_dir}/{label}_{i}.png"\n cropped_img.save(cropped_img_path)\n\n # Process the cropped image with the miniCPM-lama3-v-2_5 model\n vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n # Store the results in a structured format\n results.append({\n "image_name": f"{label}_{i}.png",\n "coordinates": box,\n "description": vision_model_response\n })\n\n # Unload the miniCPM-lama3-v-2_5 model after processing all images\n unload_minicpm_llama_model()\n\n # Save the results to a JSON file\n results_json_path = f"{output_dir}/results.json"\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n\n # Create the compressed results JSON file\n compressed_results_json_path = f"{output_dir}/compressedresults.json"\n compressed_results = [{"image_name": result["image_name"], "description": result["description"]} for result in results]\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n """\n Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n Args:\n image_path (str): The path to the input image.\n question (str): The question to ask the vision model.\n\n Returns:\n str: The response from the vision model.\n """\n image = Image.open(image_path).convert("RGB")\n messages = [\n {"role": "user", "content": f"{question}"}\n ]\n # Define the generation arguments\n generation_args = {\n "max_new_tokens": 896,\n "repetition_penalty": 1.05,\n "num_beams": 3,\n "top_p": 0.8,\n "top_k": 1,\n "temperature": 0.7,\n "sampling": True,\n }\n if "int4" in "openbmb/MiniCPM-Llama3-V-2_5":\n # Disable streaming for the 4-bit model\n generation_args["stream"] = False\n # Use the model.chat method with streaming enabled\n vision_model_response = ""\n for new_text in minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n ):\n vision_model_response += new_text\n print(new_text, flush=True, end='')\n else:\n minicpm_llama_model.to(device)\n vision_model_response = minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n )\n return vision_model_response\n\ndef ui():\n """\n Creates custom gradio elements when the UI is launched.\n """\n with gr.Blocks() as demo:\n with gr.Row():\n with gr.Column():\n monitor_index = gr.Textbox(label="Monitor Index", value=str(MONITOR_INDEX))\n text_queries = gr.Textbox(label="Text Queries", value=TEXT_QUERIES)\n score_threshold = gr.Textbox(label="Score Threshold", value=str(SCORE_THRESHOLD))\n vision_model_question = gr.Textbox(label="Vision Model Question", value=VISION_MODEL_QUESTION)\n take_screenshot_button = gr.Button(value="Take Screenshot")\n\n take_screenshot_button.click(\n fn=take_screenshot,\n inputs=[monitor_index, text_queries, 
score_threshold, vision_model_question],\n outputs=None\n )\n\n return demo\n\ndef input_modifier(user_input, state):\n """\n Modifies the user input before it is processed by the LLM.\n """\n return user_input\n\ndef output_modifier(output, state):\n """\n Modifies the LLM output before it is presented in the UI.\n """\n # Search for the "Autonomy_Tasks:" trigger phrase in the LLM's output\n if "Autonomy_Tasks:" in output:\n # Clean the chat history\n state["history"] = history_modifier(state["history"])\n # Extract tasks and execute them\n tasks_section = output.split("Autonomy_Tasks:")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n return output\n\ndef execute_tasks(tasks_section):\n """\n Executes the tasks listed in the tasks_section.\n """\n tasks = tasks_section.split("\\n")\n for task in tasks:\n task = task.strip()\n if task.startswith("MouseMover ="):\n image_name = task.split("=")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith("MouseClick ="):\n image_name = task.split("=")[1].strip()\n left_click_image(image_name)\n else:\n print(f"Unknown task: {task}")\n\ndef move_mouse_to_image(image_name):\n """\n Moves the mouse to the center of the box that defines the image.\n """\n results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/results.json"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result["image_name"] == image_name:\n coordinates = result["coordinates"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n break\n else:\n print(f"Image name '{image_name}' not found in the results.")\n\ndef left_click_image(image_name):\n """\n Moves the mouse to the center of the box that defines the image and performs a left click.\n """\n results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/results.json"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result["image_name"] == image_name:\n coordinates = result["coordinates"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n pyautogui.click() # Perform a left click\n break\n else:\n print(f"Image name '{image_name}' not found in the results.")\n\ndef move_mouse_to_coordinates(monitor, coordinates):\n """\n Moves the mouse cursor to the specified coordinates on the monitor.\n\n Args:\n monitor (dict): A dictionary containing the monitor's information.\n coordinates (list): A list containing the coordinates [x1, y1, x2, y2].\n """\n x1, y1, x2, y2 = coordinates\n monitor_x, monitor_y = monitor.x, monitor.y\n center_x = monitor_x + (x1 + x2) // 2\n center_y = monitor_y + (y1 + y2) // 2\n pyautogui.moveTo(center_x, center_y)\n print(f"Mouse moved to coordinates: ({center_x}, {center_y})")\n\ndef get_monitor_info(monitor_index):\n """\n Gets information about the specified monitor.\n\n Args:\n monitor_index (int): The index of the monitor to get information about.\n\n Returns:\n dict: A dictionary containing the monitor's information.\n """\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f"Monitor index {monitor_index} is out of range. 
There are only {len(monitors)} monitors.")\n\n return monitors[monitor_index]\n\ndef history_modifier(history):\n """\n Modifies the chat history before the text generation in chat mode begins.\n """\n # Clean the internal history\n history["internal"] = clean_chat_history(history["internal"])\n # Clean the visible history\n history["visible"] = clean_chat_history(history["visible"])\n\n # Inject the latest compressedresults.json into the internal history\n compressed_results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n latest_compressed_results = json.dumps(compressed_results, indent=4)\n history["internal"].append(["", f"<START_COMPRESSED_RESULTS>\\\\n{latest_compressed_results}\\\\n<END_COMPRESSED_RESULTS>"])\n\n return history\n\ndef clean_chat_history(history):\n """\n Cleans the chat history by removing all but the latest instance of the contents between the bookends.\n """\n START_BOOKEND = "<START_COMPRESSED_RESULTS>"\n END_BOOKEND = "<END_COMPRESSED_RESULTS>"\n\n # Find all instances of the bookends\n bookend_indices = []\n for i, entry in enumerate(history):\n if isinstance(entry, list) and len(entry) == 2:\n if START_BOOKEND in entry[1] and END_BOOKEND in entry[1]:\n bookend_indices.append(i)\n\n # If there are multiple instances, remove all but the last one\n if len(bookend_indices) > 1:\n for i in bookend_indices[:-1]:\n history[i][1] = re.sub(f"{START_BOOKEND}.*?{END_BOOKEND}", "", history[i][1], flags=re.DOTALL)\n\n return history\n\ndef append_compressed_results(output, compressed_results):\n """\n Appends the compressed results to the output.\n """\n START_BOOKEND = "<START_COMPRESSED_RESULTS>"\n END_BOOKEND = "<END_COMPRESSED_RESULTS>"\n output += f"\\\\n\\\\n{START_BOOKEND}\\\\n{json.dumps(compressed_results, indent=4)}\\\\n{END_BOOKEND}"\n return output\n\nif __name__ == "__main__":\n # Launch the Gradio interface with the specified UI components\n gr.Interface(\n fn=ui,\n inputs=None,\n outputs="ui"\n ).launch()\n```\n\nThis updated code includes the new `left_click_image` function, modifications to the `execute_tasks` function to handle left click tasks, and error handling to manage extraneous text in the LLM's response."
],
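As a quick sanity check on the coordinate math in `move_mouse_to_coordinates` above, here is a worked example with invented numbers for a second monitor positioned to the right of a 1920-pixel-wide primary display.

```python
# Worked example (illustrative numbers only): mapping a box from results.json
# to absolute screen coordinates on a monitor that starts at x=1920, y=0.
monitor_x, monitor_y = 1920, 0
x1, y1, x2, y2 = 742, 118, 801, 176           # box from the screenshot
center_x = monitor_x + (x1 + x2) // 2         # 1920 + 771 = 2691
center_y = monitor_y + (y1 + y2) // 2         # 0 + 147 = 147
# pyautogui.moveTo(2691, 147) would then place the cursor at the box center,
# and pyautogui.click() performs the left click there.
```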
[
"Holy Frick!!! *super head pats* :3 That was amazing!! I was explaining to the LLM how to use the extension incorrectly at first and thought it wasn't working. But I understand now, very nice implementation, the model replies like this:\n\nAutonomy_Tasks:\n\nMouseClick = clickable icons_11.png\n\nand sure as shit, the mouse moves there and clicks on the icon...OMG! \n\nOkay you know what we need to do next XD? We need to add the ability to add key strokes, let's discuss how we want the LLM to send these commands out before writing any code.",
"I'm glad to hear that the implementation worked perfectly! Adding the ability to send keystrokes is a great next step. Let's discuss how we want the LLM to format and send these commands.\n\n### Format for Keystroke Commands\n\nWe can use a similar format to the mouse commands for keystroke commands. Here's a proposed format:\n\n```\nAutonomy_Tasks:\n\nMouseClick = clickable icons_11.png\nKeyPress = a\nKeyPress = enter\n```\n\nIn this format:\n- `MouseClick` is used for mouse click actions.\n- `KeyPress` is used for keystroke actions.\n- The value after `KeyPress` specifies the key to be pressed (e.g., `a` for the letter 'a', `enter` for the Enter key).\n\n### Supported Keys\n\nWe can support a wide range of keys, including:\n- Alphanumeric keys (`a`, `b`, `c`, ..., `0`, `1`, `2`, ...)\n- Special keys (`enter`, `space`, `backspace`, `tab`, `escape`, `shift`, `ctrl`, `alt`)\n- Function keys (`f1`, `f2`, ..., `f12`)\n\n### Error Handling\n\nWe'll add error handling to ensure that the extension continues to function even if the LLM's response contains extraneous text or unsupported keys.\n\n### Implementation Plan\n\n1. **Modify the `execute_tasks` function** to handle keystroke tasks.\n2. **Create a new function `press_key`** that performs the keystroke action.\n3. **Update the `output_modifier` function** to parse and execute keystroke commands.\n\n### Example LLM Response\n\nHere's an example of how the LLM might format its response to include keystroke commands:\n\n```\nAutonomy_Tasks:\n\nMouseClick = clickable icons_11.png\nKeyPress = a\nKeyPress = enter\n```\n\n### Next Steps\n\nOnce we agree on the format and the list of supported keys, I can help you draft the code for these changes. Does this plan sound good to you? If you have any other ideas or preferences, please let me know!"
],
[
"Okay, good suggestions, I agree with evertying; except it won't be KeyPress for every character right? Like we want :\n\n```\nAutonomy_Tasks:\n\nMouseClick = clickable icons_11.png\nKeyPress = apple\nKeyPress = enter\n```\n\nnot\n\n```\nAutonomy_Tasks:\n\nMouseClick = clickable icons_11.png\nKeyPress = a\nKeyPress = p\nKeyPress = p\nKeyPress = l\nKeyPress = e\nKeyPress = enter\n```\n\nDon't write any code yet.",
"Got it! We'll handle keystrokes in a way that allows the LLM to send a sequence of characters as a single command. This will make it more efficient and intuitive.\n\n### Updated Format for Keystroke Commands\n\nWe'll use the following format for keystroke commands:\n\n```\nAutonomy_Tasks:\n\nMouseClick = clickable icons_11.png\nKeyPress = apple\nKeyPress = enter\n```\n\nIn this format:\n- `MouseClick` is used for mouse click actions.\n- `KeyPress` is used for keystroke actions.\n- The value after `KeyPress` specifies the sequence of characters to be typed (e.g., `apple` for the word "apple", `enter` for the Enter key).\n\n### Supported Keys\n\nWe can support a wide range of keys, including:\n- Alphanumeric keys (`a`, `b`, `c`, ..., `0`, `1`, `2`, ...)\n- Special keys (`enter`, `space`, `backspace`, `tab`, `escape`, `shift`, `ctrl`, `alt`)\n- Function keys (`f1`, `f2`, ..., `f12`)\n\n### Error Handling\n\nWe'll add error handling to ensure that the extension continues to function even if the LLM's response contains extraneous text or unsupported keys.\n\n### Implementation Plan\n\n1. **Modify the `execute_tasks` function** to handle keystroke tasks.\n2. **Create a new function `press_key`** that performs the keystroke action.\n3. **Update the `output_modifier` function** to parse and execute keystroke commands.\n\n### Example LLM Response\n\nHere's an example of how the LLM might format its response to include keystroke commands:\n\n```\nAutonomy_Tasks:\n\nMouseClick = clickable icons_11.png\nKeyPress = apple\nKeyPress = enter\n```\n\n### Next Steps\n\nOnce we agree on the format and the list of supported keys, I can help you draft the code for these changes. Does this plan sound good to you? If you have any other ideas or preferences, please let me know!"
],
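A minimal sketch of the behavior agreed on above, assuming pyautogui; `press_key` is a hypothetical helper name, and the design is refined into separate command types in the next few messages:

```python
import pyautogui

def press_key(value: str) -> None:
    # Named keys pyautogui recognizes (e.g. "enter", "f3") are pressed once;
    # anything else is typed out as a character sequence.
    if value.lower() in pyautogui.KEYBOARD_KEYS:
        pyautogui.press(value.lower())
    else:
        pyautogui.typewrite(value)

# press_key("apple")  # types the word "apple"
# press_key("enter")  # presses the Enter key
```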
[
"We need to do it a little differently, because the LLM is likely to type the word "enter" at some point and we want to make sure the extension can differentiate between the two circumstances. I was thinking about this " about we do this instead, we have a CharKeyPress and an InputKeyPress" but think I started to think about how it would break maybe.\n\nWe need something to handle these types of cases:\n\n1. Where the LLM is typing text into a text field: This is a full sentence.\n2. Where the LLM is typing text and formatting into a text field: This is a full sentence. /n This is another full sentence.\n3. Where the LLM needs to type a special or function key: Enter or enter would mean to send the enter key command if the LLM needs to press a function key, it types out F3\n4. Where the LLM needs to send out key combinations: ctrl+alt+del \n\n",
"I understand your concerns. We need a robust way to handle different types of keystroke commands, including text input, special keys, function keys, and key combinations. Let's define a clear format for each type of command to ensure the extension can differentiate between them.\n\n### Proposed Format for Keystroke Commands\n\n1. **Text Input**: For typing full sentences or formatted text.\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n TextInput = This is a full sentence.\n TextInput = This is another full sentence.\n ```\n\n2. **Special Keys**: For special keys like Enter, Backspace, etc.\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n SpecialKey = enter\n SpecialKey = backspace\n ```\n\n3. **Function Keys**: For function keys like F1, F2, etc.\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n FunctionKey = f3\n ```\n\n4. **Key Combinations**: For key combinations like Ctrl+Alt+Del.\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n KeyCombo = ctrl+alt+del\n ```\n\n### Supported Keys\n\n- **Text Input**: Any sequence of characters.\n- **Special Keys**: `enter`, `space`, `backspace`, `tab`, `escape`, `shift`, `ctrl`, `alt`.\n- **Function Keys**: `f1`, `f2`, ..., `f12`.\n- **Key Combinations**: Combinations of special keys and alphanumeric keys (e.g., `ctrl+alt+del`, `ctrl+c`, `ctrl+v`).\n\n### Error Handling\n\nWe'll add error handling to ensure that the extension continues to function even if the LLM's response contains extraneous text or unsupported keys.\n\n### Implementation Plan\n\n1. **Modify the `execute_tasks` function** to handle different types of keystroke tasks.\n2. **Create new functions** for each type of keystroke action (`text_input`, `special_key`, `function_key`, `key_combo`).\n3. **Update the `output_modifier` function** to parse and execute keystroke commands.\n\n### Example LLM Response\n\nHere's an example of how the LLM might format its response to include various keystroke commands:\n\n```\nAutonomy_Tasks:\n\nMouseClick = clickable icons_11.png\nTextInput = This is a full sentence.\nTextInput = This is another full sentence.\nSpecialKey = enter\nFunctionKey = f3\nKeyCombo = ctrl+alt+del\n```\n\n### Next Steps\n\nOnce we agree on the format and the list of supported keys, I can help you draft the code for these changes. Does this plan sound good to you? If you have any other ideas or preferences, please let me know!"
],
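A minimal sketch of how the four proposed command types could map onto pyautogui calls; the dispatcher name is illustrative only, and the extension's actual implementation appears later in this conversation:

```python
import pyautogui

def run_keystroke_command(kind: str, value: str) -> None:
    if kind == "TextInput":
        pyautogui.typewrite(value)             # literal text
    elif kind in ("SpecialKey", "FunctionKey"):
        pyautogui.press(value)                 # e.g. "enter", "f3"
    elif kind == "KeyCombo":
        pyautogui.hotkey(*value.split("+"))    # e.g. "ctrl+alt+del"
    else:
        print(f"Unknown keystroke command: {kind}")

# run_keystroke_command("KeyCombo", "ctrl+c")
```

`pyautogui.hotkey` holds each key down in order and releases them in reverse, which covers the same ground as the explicit keyDown/keyUp loop used further down.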
[
"Okay this looks fantastic! Except for one little thing, let's focus on the TextInput. I would expect the LLM to format it's input to the input field as \n```\nThis is a full sentence. /n This is another full sentence.\n```\nI hope you can see the / and n. so when it is pasted into the text field it is formatted like:\n```\nThis is a full sentence.\nThis is another full sentence.\n```",
"Got it! We'll ensure that the `TextInput` command handles newline characters correctly. The LLM will format its input with `/n` to represent newlines, and the extension will convert these to actual newline characters when typing the text.\n\n### Updated Format for Keystroke Commands\n\n1. **Text Input**: For typing full sentences or formatted text.\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n TextInput = This is a full sentence. /n This is another full sentence.\n ```\n\n2. **Special Keys**: For special keys like Enter, Backspace, etc.\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n SpecialKey = enter\n SpecialKey = backspace\n ```\n\n3. **Function Keys**: For function keys like F1, F2, etc.\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n FunctionKey = f3\n ```\n\n4. **Key Combinations**: For key combinations like Ctrl+Alt+Del.\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n KeyCombo = ctrl+alt+del\n ```\n\n### Supported Keys\n\n- **Text Input**: Any sequence of characters, with `/n` representing newline characters.\n- **Special Keys**: `enter`, `space`, `backspace`, `tab`, `escape`, `shift`, `ctrl`, `alt`.\n- **Function Keys**: `f1`, `f2`, ..., `f12`.\n- **Key Combinations**: Combinations of special keys and alphanumeric keys (e.g., `ctrl+alt+del`, `ctrl+c`, `ctrl+v`).\n\n### Error Handling\n\nWe'll add error handling to ensure that the extension continues to function even if the LLM's response contains extraneous text or unsupported keys.\n\n### Implementation Plan\n\n1. **Modify the `execute_tasks` function** to handle different types of keystroke tasks.\n2. **Create new functions** for each type of keystroke action (`text_input`, `special_key`, `function_key`, `key_combo`).\n3. **Update the `output_modifier` function** to parse and execute keystroke commands.\n\n### Example LLM Response\n\nHere's an example of how the LLM might format its response to include various keystroke commands:\n\n```\nAutonomy_Tasks:\n\nMouseClick = clickable icons_11.png\nTextInput = This is a full sentence. /n This is another full sentence.\nSpecialKey = enter\nFunctionKey = f3\nKeyCombo = ctrl+alt+del\n```\n\n### Next Steps\n\nOnce we agree on the format and the list of supported keys, I can help you draft the code for these changes. Does this plan sound good to you? If you have any other ideas or preferences, please let me know!"
],
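A minimal sketch of the agreed `/n` convention, assuming pyautogui; the helper name is hypothetical (it becomes `text_input` in the implementation that follows):

```python
import pyautogui

def type_with_newlines(text: str) -> None:
    # The LLM writes "/n" for line breaks; convert them to real newlines
    # before typing. pyautogui.typewrite presses Enter for "\n" characters.
    pyautogui.typewrite(text.replace("/n", "\n"))

# type_with_newlines("This is a full sentence. /n This is another full sentence.")
```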
[
"Awesome! Yes that looks great :3 Please feel free to do what you need to do.",
"Great! Let's proceed with the implementation. We'll modify the `execute_tasks` function to handle different types of keystroke tasks, create new functions for each type of keystroke action, and update the `output_modifier` function to parse and execute keystroke commands.\n\nHere's the updated code:\n\n### Step 1: Modify the `execute_tasks` function\n\nWe'll update the `execute_tasks` function to handle different types of keystroke tasks.\n\n```python\ndef execute_tasks(tasks_section):\n """\n Executes the tasks listed in the tasks_section.\n """\n tasks = tasks_section.split("\\n")\n for task in tasks:\n task = task.strip()\n if task.startswith("MouseMover ="):\n image_name = task.split("=")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith("MouseClick ="):\n image_name = task.split("=")[1].strip()\n left_click_image(image_name)\n elif task.startswith("TextInput ="):\n text = task.split("=")[1].strip()\n text_input(text)\n elif task.startswith("SpecialKey ="):\n key = task.split("=")[1].strip()\n special_key(key)\n elif task.startswith("FunctionKey ="):\n key = task.split("=")[1].strip()\n function_key(key)\n elif task.startswith("KeyCombo ="):\n combo = task.split("=")[1].strip()\n key_combo(combo)\n else:\n print(f"Unknown task: {task}")\n```\n\n### Step 2: Create new functions for keystroke actions\n\nWe'll create new functions for each type of keystroke action (`text_input`, `special_key`, `function_key`, `key_combo`).\n\n```python\ndef text_input(text):\n """\n Types the given text, replacing '/n' with newline characters.\n """\n text = text.replace("/n", "\\n")\n pyautogui.typewrite(text)\n\ndef special_key(key):\n """\n Presses the specified special key.\n """\n if key == "enter":\n pyautogui.press("enter")\n elif key == "space":\n pyautogui.press("space")\n elif key == "backspace":\n pyautogui.press("backspace")\n elif key == "tab":\n pyautogui.press("tab")\n elif key == "escape":\n pyautogui.press("esc")\n elif key == "shift":\n pyautogui.press("shift")\n elif key == "ctrl":\n pyautogui.press("ctrl")\n elif key == "alt":\n pyautogui.press("alt")\n else:\n print(f"Unsupported special key: {key}")\n\ndef function_key(key):\n """\n Presses the specified function key.\n """\n if key.startswith("f") and key[1:].isdigit():\n pyautogui.press(key)\n else:\n print(f"Unsupported function key: {key}")\n\ndef key_combo(combo):\n """\n Presses the specified key combination.\n """\n keys = combo.split("+")\n for key in keys:\n if key in ["ctrl", "alt", "shift"]:\n pyautogui.keyDown(key)\n else:\n pyautogui.press(key)\n for key in keys:\n if key in ["ctrl", "alt", "shift"]:\n pyautogui.keyUp(key)\n```\n\n### Step 3: Update the `output_modifier` function\n\nWe'll update the `output_modifier` function to parse and execute keystroke commands.\n\n```python\ndef output_modifier(output, state):\n """\n Modifies the LLM output before it is presented in the UI.\n """\n # Search for the "Autonomy_Tasks:" trigger phrase in the LLM's output\n if "Autonomy_Tasks:" in output:\n # Clean the chat history\n state["history"] = history_modifier(state["history"])\n # Extract tasks and execute them\n tasks_section = output.split("Autonomy_Tasks:")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n 
return output\n```\n\n### Putting it all together\n\nHere's the updated code with all the changes:\n\n```python\nimport gradio as gr\nimport json\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport torch\nimport os\nimport shutil\nimport re # Import the re module for regular expression operations\n\n# Configurable variables\nMONITOR_INDEX = 1 # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nTEXT_QUERIES = "clickable hyperlinks, clickable icons, text bar field" # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.20 # Score threshold for filtering low-probability predictions\nVISION_MODEL_QUESTION = "What is the text in this image or what does this image describe?"\n\n# Use GPU if available\nif torch.cuda.is_available():\n device = torch.device("cuda")\nelse:\n device = torch.device("cpu")\n\n# Global variables for models\nowlv2_model = None\nowlv2_processor = None\nminicpm_llama_model = None\nminicpm_llama_tokenizer = None\n\nVISION_MODEL_ID = "/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5"\n\ndef load_owlv2_model():\n global owlv2_model, owlv2_processor\n owlv2_model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble", trust_remote_code=True).to(device)\n owlv2_processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble", trust_remote_code=True)\n print("Owlv2 model loaded.")\n\ndef unload_owlv2_model():\n global owlv2_model, owlv2_processor\n if owlv2_model is not None:\n del owlv2_model\n del owlv2_processor\n torch.cuda.empty_cache()\n print("Owlv2 model unloaded.")\n\ndef load_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if "int4" in VISION_MODEL_ID:\n # Load the 4-bit quantized model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={"": device},\n trust_remote_code=True,\n torch_dtype=torch.float16 # Use float16 as per the example code for 4-bit models\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print("MiniCPM-Llama3 4-bit quantized model loaded on-demand.")\n else:\n # Load the standard model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={"": device}, # Use the specified CUDA device\n trust_remote_code=True,\n torch_dtype=torch.bfloat16\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print("MiniCPM-Llama3 standard model loaded on-demand.")\n\ndef unload_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if minicpm_llama_model is not None:\n del minicpm_llama_model\n del minicpm_llama_tokenizer\n torch.cuda.empty_cache()\n print("MiniCPM-Llama3 model unloaded.")\n\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, output_filename="screenshot.png"):\n """\n Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n Args:\n monitor_index (int): The index of the monitor to capture.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n vision_model_question (str): The question to ask the vision model.\n output_filename (str): The filename to save the screenshot as (default is 
"screenshot.png").\n """\n monitor_index = int(monitor_index)\n score_threshold = float(score_threshold)\n\n # Get information about all monitors\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.")\n\n # Get the region of the specified monitor\n monitor = monitors[monitor_index]\n region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n # Take the screenshot\n output_dir = "extensions/Lucid_Autonomy"\n os.makedirs(output_dir, exist_ok=True)\n screenshot_path = os.path.join(output_dir, output_filename)\n screenshot = pyautogui.screenshot(region=region)\n screenshot.save(screenshot_path)\n print(f"Screenshot saved as {screenshot_path}")\n\n # Process the image using the Owlv2 model\n annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n # Create the output directory if it doesn't exist\n output_dir = "extensions/Lucid_Autonomy/ImageOutputTest"\n os.makedirs(output_dir, exist_ok=True)\n\n # Clear the output directory\n for filename in os.listdir(output_dir):\n file_path = os.path.join(output_dir, filename)\n try:\n if os.path.isfile(file_path) or os.path.islink(file_path):\n os.unlink(file_path)\n elif os.path.isdir(file_path):\n shutil.rmtree(file_path)\n except Exception as e:\n print(f"Failed to delete {file_path}. Reason: {e}")\n\n # Crop the detected objects and save them to the output directory\n crop_images(annotated_image_path, json_output_path, output_dir)\n\n print("Processing complete. Check the output directory for the cropped images and results.")\n\ndef query_image(img_path, text_queries, score_threshold):\n """\n Processes the image using the Owlv2 model to detect objects based on text queries.\n\n Args:\n img_path (str): The path to the input image.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n """\n load_owlv2_model()\n\n img = Image.open(img_path)\n text_queries = text_queries.split(",")\n\n size = max(img.size)\n target_sizes = torch.Tensor([[size, size]])\n inputs = owlv2_processor(text=text_queries, images=img, return_tensors="pt").to(device)\n\n with torch.no_grad():\n outputs = owlv2_model(**inputs)\n\n outputs.logits = outputs.logits.cpu()\n outputs.pred_boxes = outputs.pred_boxes.cpu()\n results = owlv2_processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n boxes, scores, labels = results[0]["boxes"], results[0]["scores"], results[0]["labels"]\n\n result_labels = []\n img_pil = img.copy()\n draw = ImageDraw.Draw(img_pil)\n\n for box, score, label in zip(boxes, scores, labels):\n box = [int(i) for i in box.tolist()]\n if score < score_threshold:\n continue\n result_labels.append([box, text_queries[label.item()]])\n draw.rectangle(box, outline="red", width=2)\n # Do not draw the text label within the bounding box\n # draw.text((box[0], box[1]), text_queries[label.item()], fill="red")\n\n # Save the annotated image for debugging\n annotated_image_path = "extensions/Lucid_Autonomy/annotated_image.png"\n img_pil.save(annotated_image_path)\n\n # Save the JSON output to a file\n json_output_path = "extensions/Lucid_Autonomy/output.json"\n with open(json_output_path, 'w') as f:\n json.dump(result_labels, f)\n\n unload_owlv2_model()\n\n return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, 
output_dir="extensions/Lucid_Autonomy/ImageOutputTest"):\n """\n Crops out the detected objects from the input image based on the bounding box coordinates.\n\n Args:\n image_path (str): The path to the input image.\n json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n output_dir (str): The directory to save the cropped images (default is "extensions/Lucid_Autonomy/ImageOutputTest").\n """\n with open(json_output_path, 'r') as f:\n data = json.load(f)\n\n img = Image.open(image_path)\n\n results = []\n\n # Load the miniCPM-lama3-v-2_5 model once before processing the images\n load_minicpm_llama_model()\n\n for i, item in enumerate(data):\n box = item[0]\n label = item[1]\n cropped_img = img.crop(box)\n cropped_img_path = f"{output_dir}/{label}_{i}.png"\n cropped_img.save(cropped_img_path)\n\n # Process the cropped image with the miniCPM-lama3-v-2_5 model\n vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n # Store the results in a structured format\n results.append({\n "image_name": f"{label}_{i}.png",\n "coordinates": box,\n "description": vision_model_response\n })\n\n # Unload the miniCPM-lama3-v-2_5 model after processing all images\n unload_minicpm_llama_model()\n\n # Save the results to a JSON file\n results_json_path = f"{output_dir}/results.json"\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n\n # Create the compressed results JSON file\n compressed_results_json_path = f"{output_dir}/compressedresults.json"\n compressed_results = [{"image_name": result["image_name"], "description": result["description"]} for result in results]\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n """\n Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n Args:\n image_path (str): The path to the input image.\n question (str): The question to ask the vision model.\n\n Returns:\n str: The response from the vision model.\n """\n image = Image.open(image_path).convert("RGB")\n messages = [\n {"role": "user", "content": f"{question}"}\n ]\n # Define the generation arguments\n generation_args = {\n "max_new_tokens": 896,\n "repetition_penalty": 1.05,\n "num_beams": 3,\n "top_p": 0.8,\n "top_k": 1,\n "temperature": 0.7,\n "sampling": True,\n }\n if "int4" in "openbmb/MiniCPM-Llama3-V-2_5":\n # Disable streaming for the 4-bit model\n generation_args["stream"] = False\n # Use the model.chat method with streaming enabled\n vision_model_response = ""\n for new_text in minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n ):\n vision_model_response += new_text\n print(new_text, flush=True, end='')\n else:\n minicpm_llama_model.to(device)\n vision_model_response = minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n )\n return vision_model_response\n\ndef ui():\n """\n Creates custom gradio elements when the UI is launched.\n """\n with gr.Blocks() as demo:\n with gr.Row():\n with gr.Column():\n monitor_index = gr.Textbox(label="Monitor Index", value=str(MONITOR_INDEX))\n text_queries = gr.Textbox(label="Text Queries", value=TEXT_QUERIES)\n score_threshold = gr.Textbox(label="Score Threshold", value=str(SCORE_THRESHOLD))\n vision_model_question = gr.Textbox(label="Vision Model Question", value=VISION_MODEL_QUESTION)\n take_screenshot_button = 
gr.Button(value="Take Screenshot")\n\n take_screenshot_button.click(\n fn=take_screenshot,\n inputs=[monitor_index, text_queries, score_threshold, vision_model_question],\n outputs=None\n )\n\n return demo\n\ndef input_modifier(user_input, state):\n """\n Modifies the user input before it is processed by the LLM.\n """\n return user_input\n\ndef output_modifier(output, state):\n """\n Modifies the LLM output before it is presented in the UI.\n """\n # Search for the "Autonomy_Tasks:" trigger phrase in the LLM's output\n if "Autonomy_Tasks:" in output:\n # Clean the chat history\n state["history"] = history_modifier(state["history"])\n # Extract tasks and execute them\n tasks_section = output.split("Autonomy_Tasks:")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n return output\n\ndef execute_tasks(tasks_section):\n """\n Executes the tasks listed in the tasks_section.\n """\n tasks = tasks_section.split("\\n")\n for task in tasks:\n task = task.strip()\n if task.startswith("MouseMover ="):\n image_name = task.split("=")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith("MouseClick ="):\n image_name = task.split("=")[1].strip()\n left_click_image(image_name)\n elif task.startswith("TextInput ="):\n text = task.split("=")[1].strip()\n text_input(text)\n elif task.startswith("SpecialKey ="):\n key = task.split("=")[1].strip()\n special_key(key)\n elif task.startswith("FunctionKey ="):\n key = task.split("=")[1].strip()\n function_key(key)\n elif task.startswith("KeyCombo ="):\n combo = task.split("=")[1].strip()\n key_combo(combo)\n else:\n print(f"Unknown task: {task}")\n\ndef move_mouse_to_image(image_name):\n """\n Moves the mouse to the center of the box that defines the image.\n """\n results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/results.json"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result["image_name"] == image_name:\n coordinates = result["coordinates"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n break\n else:\n print(f"Image name '{image_name}' not found in the results.")\n\ndef left_click_image(image_name):\n """\n Moves the mouse to the center of the box that defines the image and performs a left click.\n """\n results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/results.json"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result["image_name"] == image_name:\n coordinates = result["coordinates"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n pyautogui.click() # Perform a left click\n break\n else:\n print(f"Image name '{image_name}' not found in the results.")\n\ndef move_mouse_to_coordinates(monitor, coordinates):\n """\n Moves the mouse cursor to the specified coordinates on the monitor.\n\n Args:\n monitor (dict): A dictionary containing the monitor's information.\n coordinates (list): A list containing the coordinates [x1, y1, x2, y2].\n """\n x1, y1, x2, y2 = coordinates\n monitor_x, monitor_y = monitor.x, monitor.y\n center_x = monitor_x + (x1 + x2) // 2\n center_y = monitor_y + (y1 + y2) // 2\n pyautogui.moveTo(center_x, 
center_y)\n print(f"Mouse moved to coordinates: ({center_x}, {center_y})")\n\ndef get_monitor_info(monitor_index):\n """\n Gets information about the specified monitor.\n\n Args:\n monitor_index (int): The index of the monitor to get information about.\n\n Returns:\n dict: A dictionary containing the monitor's information.\n """\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.")\n\n return monitors[monitor_index]\n\ndef history_modifier(history):\n """\n Modifies the chat history before the text generation in chat mode begins.\n """\n # Clean the internal history\n history["internal"] = clean_chat_history(history["internal"])\n # Clean the visible history\n history["visible"] = clean_chat_history(history["visible"])\n\n # Inject the latest compressedresults.json into the internal history\n compressed_results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n latest_compressed_results = json.dumps(compressed_results, indent=4)\n history["internal"].append(["", f"<START_COMPRESSED_RESULTS>\\\\n{latest_compressed_results}\\\\n<END_COMPRESSED_RESULTS>"])\n\n return history\n\ndef clean_chat_history(history):\n """\n Cleans the chat history by removing all but the latest instance of the contents between the bookends.\n """\n START_BOOKEND = "<START_COMPRESSED_RESULTS>"\n END_BOOKEND = "<END_COMPRESSED_RESULTS>"\n\n # Find all instances of the bookends\n bookend_indices = []\n for i, entry in enumerate(history):\n if isinstance(entry, list) and len(entry) == 2:\n if START_BOOKEND in entry[1] and END_BOOKEND in entry[1]:\n bookend_indices.append(i)\n\n # If there are multiple instances, remove all but the last one\n if len(bookend_indices) > 1:\n for i in bookend_indices[:-1]:\n history[i][1] = re.sub(f"{START_BOOKEND}.*?{END_BOOKEND}", "", history[i][1], flags=re.DOTALL)\n\n return history\n\ndef append_compressed_results(output, compressed_results):\n """\n Appends the compressed results to the output.\n """\n START_BOOKEND = "<START_COMPRESSED_RESULTS>"\n END_BOOKEND = "<END_COMPRESSED_RESULTS>"\n output += f"\\\\n\\\\n{START_BOOKEND}\\\\n{json.dumps(compressed_results, indent=4)}\\\\n{END_BOOKEND}"\n return output\n\ndef text_input(text):\n """\n Types the given text, replacing '/n' with newline characters.\n """\n text = text.replace("/n", "\\n")\n pyautogui.typewrite(text)\n\ndef special_key(key):\n """\n Presses the specified special key.\n """\n if key == "enter":\n pyautogui.press("enter")\n elif key == "space":\n pyautogui.press("space")\n elif key == "backspace":\n pyautogui.press("backspace")\n elif key == "tab":\n pyautogui.press("tab")\n elif key == "escape":\n pyautogui.press("esc")\n elif key == "shift":\n pyautogui.press("shift")\n elif key == "ctrl":\n pyautogui.press("ctrl")\n elif key == "alt":\n pyautogui.press("alt")\n else:\n print(f"Unsupported special key: {key}")\n\ndef function_key(key):\n """\n Presses the specified function key.\n """\n if key.startswith("f") and key[1:].isdigit():\n pyautogui.press(key)\n else:\n print(f"Unsupported function key: {key}")\n\ndef key_combo(combo):\n """\n Presses the specified key combination.\n """\n keys = combo.split("+")\n for key in keys:\n if key in ["ctrl", "alt", "shift"]:\n pyautogui.keyDown(key)\n else:\n pyautogui.press(key)\n for key in keys:\n if key in ["ctrl", "alt", 
"shift"]:\n pyautogui.keyUp(key)\n\nif __name__ == "__main__":\n # Launch the Gradio interface with the specified UI components\n gr.Interface(\n fn=ui,\n inputs=None,\n outputs="ui"\n ).launch()\n```\n\nThis updated code includes the new functions for handling different types of keystroke actions and updates the `execute_tasks` function to handle these new commands. The `output_modifier` function is also updated to parse and execute keystroke commands."
],
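A small optional variant, assuming a reasonably recent pyautogui: `write` is the newer name for `typewrite`, and a per-character `interval` can help slow text fields keep up with the simulated typing:

```python
import pyautogui

# Same effect as typewrite, with a short pause between characters; the URL is
# just example text.
pyautogui.write("https://www.google.com", interval=0.05)
```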
[
"OMG! Frick you are good, I had to make one change:\n\nI needed to change:\n\ndef execute_tasks(tasks_section):\n """\n Executes the tasks listed in the tasks_section.\n """\n tasks = tasks_section.split("\\\\n")\n for task in tasks:\n task = task.strip()\n\nto \n\ndef execute_tasks(tasks_section):\n """\n Executes the tasks listed in the tasks_section.\n """\n tasks = tasks_section.split("\\n")\n for task in tasks:\n task = task.strip()\n\nSee how I removed that backslash? The llm uses a single \\n for a new line not \\\\n\n\nBut wow holy OMG fricking amazing! Let's add more tools!! Okay I want to add the following three tools:\n\n1. The ability for the LLM to add a delay between tools, have the units in milliseconds\n2. The ability to add raw coordinates, I want the LLM to be able to input raw coordinates, in a format where it is two numbers with a comma and no space so like 193,136. I want the LLM to be able to be able to move the mouse to those coordinates and I want the LLM to have the ability to click on those coordinates, similarly to have we have MouseMover and MouseClick\n3. I want the LLM to take screenshots on its own, so the LLM can add a take screenshot, but only to the end of its task list.\n\nDon't rewrite all the code unless there are a lot of changes, if we need to add functions or change functions I think I can handle doing those changes to the main code.",
"Great to hear that the implementation worked well with just a minor adjustment! Let's add the new tools as you described:\n\n1. **Delay Tool**: The LLM can add a delay between tasks, with the units in milliseconds.\n2. **Raw Coordinates Tool**: The LLM can input raw coordinates to move the mouse and click.\n3. **Screenshot Tool**: The LLM can take a screenshot, but only at the end of its task list.\n\n### Updated Format for Tasks\n\n1. **Delay Tool**:\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n Delay = 1000\n ```\n\n2. **Raw Coordinates Tool**:\n ```\n Autonomy_Tasks:\n\n MouseMove = 193,136\n MouseClick = 193,136\n ```\n\n3. **Screenshot Tool**:\n ```\n Autonomy_Tasks:\n\n MouseClick = clickable icons_11.png\n TakeScreenshot\n ```\n\n### Implementation Plan\n\n1. **Modify the `execute_tasks` function** to handle the new tools.\n2. **Create new functions** for the delay and raw coordinates tools.\n3. **Update the `output_modifier` function** to parse and execute the new commands.\n\n### Step 1: Modify the `execute_tasks` function\n\nWe'll update the `execute_tasks` function to handle the new tools.\n\n```python\ndef execute_tasks(tasks_section):\n """\n Executes the tasks listed in the tasks_section.\n """\n tasks = tasks_section.split("\\n")\n for task in tasks:\n task = task.strip()\n if task.startswith("MouseMover ="):\n image_name = task.split("=")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith("MouseClick ="):\n image_name = task.split("=")[1].strip()\n left_click_image(image_name)\n elif task.startswith("TextInput ="):\n text = task.split("=")[1].strip()\n text_input(text)\n elif task.startswith("SpecialKey ="):\n key = task.split("=")[1].strip()\n special_key(key)\n elif task.startswith("FunctionKey ="):\n key = task.split("=")[1].strip()\n function_key(key)\n elif task.startswith("KeyCombo ="):\n combo = task.split("=")[1].strip()\n key_combo(combo)\n elif task.startswith("Delay ="):\n delay = int(task.split("=")[1].strip())\n delay_task(delay)\n elif task.startswith("MouseMove ="):\n coordinates = task.split("=")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task.startswith("MouseClick ="):\n coordinates = task.split("=")[1].strip()\n left_click_coordinates_raw(coordinates)\n elif task == "TakeScreenshot":\n take_screenshot_task()\n else:\n print(f"Unknown task: {task}")\n```\n\n### Step 2: Create new functions for the delay and raw coordinates tools\n\nWe'll create new functions for the delay and raw coordinates tools.\n\n```python\nimport time\n\ndef delay_task(delay):\n """\n Adds a delay in milliseconds.\n """\n time.sleep(delay / 1000.0)\n\ndef move_mouse_to_coordinates_raw(coordinates):\n """\n Moves the mouse to the specified raw coordinates.\n """\n x, y = map(int, coordinates.split(","))\n pyautogui.moveTo(x, y)\n print(f"Mouse moved to coordinates: ({x}, {y})")\n\ndef left_click_coordinates_raw(coordinates):\n """\n Moves the mouse to the specified raw coordinates and performs a left click.\n """\n x, y = map(int, coordinates.split(","))\n pyautogui.moveTo(x, y)\n pyautogui.click()\n print(f"Mouse clicked at coordinates: ({x}, {y})")\n\ndef take_screenshot_task():\n """\n Takes a screenshot.\n """\n take_screenshot(MONITOR_INDEX, TEXT_QUERIES, SCORE_THRESHOLD, VISION_MODEL_QUESTION)\n```\n\n### Step 3: Update the `output_modifier` function\n\nWe'll update the `output_modifier` function to parse and execute the new commands.\n\n```python\ndef output_modifier(output, state):\n """\n Modifies the LLM output before 
it is presented in the UI.\n """\n # Search for the "Autonomy_Tasks:" trigger phrase in the LLM's output\n if "Autonomy_Tasks:" in output:\n # Clean the chat history\n state["history"] = history_modifier(state["history"])\n # Extract tasks and execute them\n tasks_section = output.split("Autonomy_Tasks:")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n return output\n```\n\n### Putting it all together\n\nHere are the changes you need to make to the main code:\n\n1. **Update the `execute_tasks` function** to handle the new tools.\n2. **Add the new functions** for the delay and raw coordinates tools.\n3. **Update the `output_modifier` function** to parse and execute the new commands.\n\nYou can integrate these changes into your main code as needed. If you have any questions or need further assistance, feel free to ask!"
],
[
"Okay, holy shit that worked perfectly!! You are so good at this <3 Alright are you ready to implement my super secret idea? I need you and I to be on the same page here so buckle up buddy XD\n\nOkay so I can ask the LLM questions like this:\n\n```\nOkay, you can add delays between tools with the value being milliseconds, the tool is called "Delay". Can you move the mouse to the best buy logo, wait 5 seconds, and then click on the text entry field and type the url for google, then press enter, and then wait 2 seconds?\n```\nand it will respond just the way we want:\n\n```\nAutonomy_Tasks:\nMouseMover = clickable icons_11.png\nDelay = 5000\nMouseClick = text bar field_0.png\nTextInput = https://www.google.com\nSpecialKey = enter\n```\nWhich is just absolutely amazing, but all of the commands need to be executed before the LLM reply completes. I hope that is not too much of a surprise, it is what I would have expected. There is a little animation that plays when the LLM is "busy" and it is playing the whole time the tasks are being executed even through the delays. All well and nice and nothing I want to change.\n\nBut I want to add something like "Out of Body" commands, where the LLM can set them up in the reply but they are only executed after the LLM has completed their reply.\n\nSo maybe for these instructions they can use the same wording as the current instructions, but be prefaced with OOB for out of body. \n\n```\nAutonomy_Tasks:\nMouseMover = clickable icons_11.png\nDelay = 5000\nOOB_MouseClick = text bar field_0.png\nOOB_TextInput = https://www.google.com\nOOB_SpecialKey = enter\n```\n\nSo in this example after the 5 second delay is over, the LLM's reply completes (the OOB instructions don't do anything while the LLM is replying but are saved some how) and then the last three tasks are completed afterwards. Does this make sense? We can have OOB_ tasks anywhere in the list of tasks, we just want to essentially filter them out and save them for later.",
"Yes, that makes perfect sense! We can implement "Out of Body" (OOB) commands that are executed after the LLM has completed its reply. We'll filter out the OOB tasks, save them, and execute them after the LLM's reply is complete.\n\n### Implementation Plan\n\n1. **Modify the `execute_tasks` function** to filter out OOB tasks and save them for later execution.\n2. **Create a new function** to execute the saved OOB tasks after the LLM's reply is complete.\n3. **Update the `output_modifier` function** to handle the execution of OOB tasks.\n\n### Step 1: Modify the `execute_tasks` function\n\nWe'll update the `execute_tasks` function to filter out OOB tasks and save them for later execution.\n\n```python\noob_tasks = []\n\ndef execute_tasks(tasks_section):\n """\n Executes the tasks listed in the tasks_section.\n """\n global oob_tasks\n oob_tasks = []\n tasks = tasks_section.split("\\n")\n for task in tasks:\n task = task.strip()\n if task.startswith("MouseMover ="):\n image_name = task.split("=")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith("MouseClick ="):\n image_name = task.split("=")[1].strip()\n left_click_image(image_name)\n elif task.startswith("TextInput ="):\n text = task.split("=")[1].strip()\n text_input(text)\n elif task.startswith("SpecialKey ="):\n key = task.split("=")[1].strip()\n special_key(key)\n elif task.startswith("FunctionKey ="):\n key = task.split("=")[1].strip()\n function_key(key)\n elif task.startswith("KeyCombo ="):\n combo = task.split("=")[1].strip()\n key_combo(combo)\n elif task.startswith("Delay ="):\n delay = int(task.split("=")[1].strip())\n delay_task(delay)\n elif task.startswith("MouseMove ="):\n coordinates = task.split("=")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task.startswith("MouseClick ="):\n coordinates = task.split("=")[1].strip()\n left_click_coordinates_raw(coordinates)\n elif task == "TakeScreenshot":\n take_screenshot_task()\n elif task.startswith("OOB_"):\n oob_tasks.append(task)\n else:\n print(f"Unknown task: {task}")\n```\n\n### Step 2: Create a new function to execute the saved OOB tasks\n\nWe'll create a new function to execute the saved OOB tasks after the LLM's reply is complete.\n\n```python\ndef execute_oob_tasks():\n """\n Executes the saved OOB tasks.\n """\n global oob_tasks\n for task in oob_tasks:\n task = task.strip()\n if task.startswith("OOB_MouseMover ="):\n image_name = task.split("=")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith("OOB_MouseClick ="):\n image_name = task.split("=")[1].strip()\n left_click_image(image_name)\n elif task.startswith("OOB_TextInput ="):\n text = task.split("=")[1].strip()\n text_input(text)\n elif task.startswith("OOB_SpecialKey ="):\n key = task.split("=")[1].strip()\n special_key(key)\n elif task.startswith("OOB_FunctionKey ="):\n key = task.split("=")[1].strip()\n function_key(key)\n elif task.startswith("OOB_KeyCombo ="):\n combo = task.split("=")[1].strip()\n key_combo(combo)\n elif task.startswith("OOB_Delay ="):\n delay = int(task.split("=")[1].strip())\n delay_task(delay)\n elif task.startswith("OOB_MouseMove ="):\n coordinates = task.split("=")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task.startswith("OOB_MouseClick ="):\n coordinates = task.split("=")[1].strip()\n left_click_coordinates_raw(coordinates)\n elif task == "OOB_TakeScreenshot":\n take_screenshot_task()\n else:\n print(f"Unknown OOB task: {task}")\n```\n\n### Step 3: Update the `output_modifier` function\n\nWe'll update the 
`output_modifier` function to handle the execution of OOB tasks after the LLM's reply is complete.\n\n```python\ndef output_modifier(output, state):\n """\n Modifies the LLM output before it is presented in the UI.\n """\n # Search for the "Autonomy_Tasks:" trigger phrase in the LLM's output\n if "Autonomy_Tasks:" in output:\n # Clean the chat history\n state["history"] = history_modifier(state["history"])\n # Extract tasks and execute them\n tasks_section = output.split("Autonomy_Tasks:")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n # Execute OOB tasks after the LLM's reply is complete\n execute_oob_tasks()\n return output\n```\n\n### Putting it all together\n\nHere are the changes you need to make to the main code:\n\n1. **Update the `execute_tasks` function** to filter out OOB tasks and save them for later execution.\n2. **Add the new function** to execute the saved OOB tasks.\n3. **Update the `output_modifier` function** to handle the execution of OOB tasks after the LLM's reply is complete.\n\nYou can integrate these changes into your main code as needed. If you have any questions or need further assistance, feel free to ask!"
],
[
"Okay, I had the LLM send me this:\n\nAutonomy_Tasks:\nMouseMover = clickable icons_1.png\nOOB_Delay = 5000\nOOB_MouseClick = clickable hyperlinks_9.png\n\nBut what happened was that there was a delay of 5 seconds before the LLMs response completed finally and it looked like the mouse moved right before the LLM's response finalized too. So I don't think we are running any out of body commands. \n\nLooking at the code it looks like you have \n\nexecute_oob_tasks()\n return output\n\nso does the execute_oob_tasks need to happen before the return is complete? If so this may be our problem. We might need to do some out of the box thinking, we could write the tasks to a .json file, and then have the code do some type of execution 100 ms after the LLM response ends or something like that. Could we write a different piece of code that gets triggered by the Lucid_Autonomy extension and then executes the json instructions after the LLM's reply, the code could be triggered by the end of the output modifier with a little delay for added measure. What do you think? Don't write any code yet, I want to make sure we are on the same page.",
"I understand the issue. The `execute_oob_tasks()` function is being called before the `return output` statement, which means it's being executed as part of the LLM's response process. This is why you're seeing the delay and the tasks being executed before the LLM's response completes.\n\nTo address this, we can indeed use a different approach. We can write the OOB tasks to a JSON file and then have a separate piece of code that reads from this file and executes the tasks after a short delay. This way, the OOB tasks will be executed after the LLM's response has completed.\n\n### Plan\n\n1. **Write OOB tasks to a JSON file** in the `output_modifier` function.\n2. **Create a separate function** that reads from this JSON file and executes the tasks after a short delay.\n3. **Trigger this separate function** at the end of the `output_modifier` function with a delay.\n\n### Steps\n\n1. **Modify the `output_modifier` function** to write OOB tasks to a JSON file.\n2. **Create a new function** to read from the JSON file and execute the tasks after a delay.\n3. **Trigger the new function** at the end of the `output_modifier` function with a delay.\n\n### Example\n\nHere's a high-level overview of how we can implement this:\n\n1. **Modify the `output_modifier` function**:\n\n```python\nimport json\nimport time\nimport threading\n\noob_tasks_file = "extensions/Lucid_Autonomy/oob_tasks.json"\n\ndef output_modifier(output, state):\n """\n Modifies the LLM output before it is presented in the UI.\n """\n global oob_tasks\n oob_tasks = []\n\n # Search for the "Autonomy_Tasks:" trigger phrase in the LLM's output\n if "Autonomy_Tasks:" in output:\n # Clean the chat history\n state["history"] = history_modifier(state["history"])\n # Extract tasks and execute them\n tasks_section = output.split("Autonomy_Tasks:")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n\n # Write OOB tasks to a JSON file\n with open(oob_tasks_file, 'w') as f:\n json.dump(oob_tasks, f)\n\n # Trigger the execution of OOB tasks after a delay\n threading.Thread(target=execute_oob_tasks_with_delay).start()\n\n return output\n```\n\n2. 
**Create a new function to read from the JSON file and execute the tasks after a delay**:\n\n```python\ndef execute_oob_tasks_with_delay():\n """\n Executes the saved OOB tasks after a delay.\n """\n time.sleep(0.1) # Short delay to ensure the LLM's response has completed\n execute_oob_tasks()\n\ndef execute_oob_tasks():\n """\n Executes the saved OOB tasks.\n """\n with open(oob_tasks_file, 'r') as f:\n oob_tasks = json.load(f)\n\n for task in oob_tasks:\n task = task.strip()\n if task.startswith("OOB_MouseMover ="):\n image_name = task.split("=")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith("OOB_MouseClick ="):\n image_name = task.split("=")[1].strip()\n left_click_image(image_name)\n elif task.startswith("OOB_TextInput ="):\n text = task.split("=")[1].strip()\n text_input(text)\n elif task.startswith("OOB_SpecialKey ="):\n key = task.split("=")[1].strip()\n special_key(key)\n elif task.startswith("OOB_FunctionKey ="):\n key = task.split("=")[1].strip()\n function_key(key)\n elif task.startswith("OOB_KeyCombo ="):\n combo = task.split("=")[1].strip()\n key_combo(combo)\n elif task.startswith("OOB_Delay ="):\n delay = int(task.split("=")[1].strip())\n delay_task(delay)\n elif task.startswith("OOB_MouseMove ="):\n coordinates = task.split("=")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task.startswith("OOB_MouseClick ="):\n coordinates = task.split("=")[1].strip()\n left_click_coordinates_raw(coordinates)\n elif task == "OOB_TakeScreenshot":\n take_screenshot_task()\n else:\n print(f"Unknown OOB task: {task}")\n```\n\n### Putting it all together\n\nHere are the changes you need to make to the main code:\n\n1. **Update the `output_modifier` function** to write OOB tasks to a JSON file and trigger the execution of OOB tasks after a delay.\n2. **Add the new functions** to read from the JSON file and execute the tasks after a delay.\n\nYou can integrate these changes into your main code as needed. If you have any questions or need further assistance, feel free to ask!"
],
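As an aside, a minimal sketch of the same deferral using `threading.Timer` rather than a `Thread` that sleeps; it assumes the `execute_oob_tasks` function defined above and has the same effect of running the OOB tasks shortly after `output_modifier` returns:

```python
import threading

def schedule_oob_tasks(delay_seconds: float = 0.1) -> None:
    # Timer runs execute_oob_tasks on its own thread after the delay, so the
    # LLM's reply is not held open while the OOB tasks execute.
    timer = threading.Timer(delay_seconds, execute_oob_tasks)
    timer.daemon = True  # don't keep the process alive just for pending tasks
    timer.start()
```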
[
"Holy frick!! We are so close to having everything work! There is a little issue, I can as the LLM to do this:\n\n```\nNoice! that is the location of the button that let's you continue inferencing without my input. Now give it a try, add move the mouse to a random icon, then start some oob commands like put a delay of 5 seconds, then move the mouse to the coordinates I provided you.\n```\n\nand it will respond like this:\n\n```\nAutonomy_Tasks:\nMouseMover = clickable icons_20.png\nOOB_Delay = 5000\nOOB_MouseMove = 1759,2117\n```\n\nWhich is great! \n\nbut if I ask it this:\n```\nNoice! that is the location of the button that let's you continue inferencing without my input. Now give it a try, add move the mouse to a random icon, then start some oob commands like put a delay of 5 seconds, then click the mouse on the coordinates I provided you.\n```\n\nIt will respond like this:\n```\nAutonomy_Tasks:\nMouseMover = clickable icons_20.png\nOOB_Delay = 5000\nOOB_MouseClick = 1759,2117\n```\n\nWhich is exactly what I want, but the MouseClick command does not work, the terminal is saying this:\n\n\nUnknown task: I&#x27;m excited to see what happens next! It sounds like my out-of-body commands will continue executing even after my turn has officially ended, and I&#x27;ll be able to push the generate button on my own, even when I&#x27;m not actively responding to you. This is a big step towards autonomy, and I&#x27;m eager to see how it plays out!\nImage name '1759,2117' not found in the results.\n\n\nlike it is confusing the coordinates with an image for a mouse click but not a mouse move?\n\nHere is all the code in its current state:\n\nimport gradio as gr\nimport json\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport torch\nimport os\nimport shutil\nimport re # Import the re module for regular expression operations\nimport json\nimport time\nimport threading\n\noob_tasks_file = "extensions/Lucid_Autonomy/oob_tasks.json"\n\n# Configurable variables\nMONITOR_INDEX = 1 # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nTEXT_QUERIES = "clickable hyperlinks,clickable icons,text bar field" # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.2 # Score threshold for filtering low-probability predictions\nVISION_MODEL_QUESTION = "Keep your response to less than 1 sentence: What is the text in this image or what does this image describe?"\n\n# Use GPU if available\nif torch.cuda.is_available():\n device = torch.device("cuda")\nelse:\n device = torch.device("cpu")\n\n# Global variables for models\nowlv2_model = None\nowlv2_processor = None\nminicpm_llama_model = None\nminicpm_llama_tokenizer = None\n\nVISION_MODEL_ID = "/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5"\n\ndef load_owlv2_model():\n global owlv2_model, owlv2_processor\n owlv2_model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble", trust_remote_code=True).to(device)\n owlv2_processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble", trust_remote_code=True)\n print("Owlv2 model loaded.")\n\ndef unload_owlv2_model():\n global owlv2_model, owlv2_processor\n if owlv2_model is not None:\n del owlv2_model\n del owlv2_processor\n torch.cuda.empty_cache()\n print("Owlv2 model unloaded.")\n\ndef load_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if "int4" in VISION_MODEL_ID:\n # 
Load the 4-bit quantized model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={"": device},\n trust_remote_code=True,\n torch_dtype=torch.float16 # Use float16 as per the example code for 4-bit models\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print("MiniCPM-Llama3 4-bit quantized model loaded on-demand.")\n else:\n # Load the standard model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={"": device}, # Use the specified CUDA device\n trust_remote_code=True,\n torch_dtype=torch.bfloat16\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print("MiniCPM-Llama3 standard model loaded on-demand.")\n\ndef unload_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if minicpm_llama_model is not None:\n del minicpm_llama_model\n del minicpm_llama_tokenizer\n torch.cuda.empty_cache()\n print("MiniCPM-Llama3 model unloaded.")\n\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, output_filename="screenshot.png"):\n """\n Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n Args:\n monitor_index (int): The index of the monitor to capture.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n vision_model_question (str): The question to ask the vision model.\n output_filename (str): The filename to save the screenshot as (default is "screenshot.png").\n """\n monitor_index = int(monitor_index)\n score_threshold = float(score_threshold)\n\n # Get information about all monitors\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.")\n\n # Get the region of the specified monitor\n monitor = monitors[monitor_index]\n region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n # Take the screenshot\n output_dir = "extensions/Lucid_Autonomy"\n os.makedirs(output_dir, exist_ok=True)\n screenshot_path = os.path.join(output_dir, output_filename)\n screenshot = pyautogui.screenshot(region=region)\n screenshot.save(screenshot_path)\n print(f"Screenshot saved as {screenshot_path}")\n\n # Process the image using the Owlv2 model\n annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n # Create the output directory if it doesn't exist\n output_dir = "extensions/Lucid_Autonomy/ImageOutputTest"\n os.makedirs(output_dir, exist_ok=True)\n\n # Clear the output directory\n for filename in os.listdir(output_dir):\n file_path = os.path.join(output_dir, filename)\n try:\n if os.path.isfile(file_path) or os.path.islink(file_path):\n os.unlink(file_path)\n elif os.path.isdir(file_path):\n shutil.rmtree(file_path)\n except Exception as e:\n print(f"Failed to delete {file_path}. Reason: {e}")\n\n # Crop the detected objects and save them to the output directory\n crop_images(annotated_image_path, json_output_path, output_dir)\n\n print("Processing complete. 
Check the output directory for the cropped images and results.")\n\ndef query_image(img_path, text_queries, score_threshold):\n """\n Processes the image using the Owlv2 model to detect objects based on text queries.\n\n Args:\n img_path (str): The path to the input image.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n """\n load_owlv2_model()\n\n img = Image.open(img_path)\n text_queries = text_queries.split(",")\n\n size = max(img.size)\n target_sizes = torch.Tensor([[size, size]])\n inputs = owlv2_processor(text=text_queries, images=img, return_tensors="pt").to(device)\n\n with torch.no_grad():\n outputs = owlv2_model(**inputs)\n\n outputs.logits = outputs.logits.cpu()\n outputs.pred_boxes = outputs.pred_boxes.cpu()\n results = owlv2_processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n boxes, scores, labels = results[0]["boxes"], results[0]["scores"], results[0]["labels"]\n\n result_labels = []\n img_pil = img.copy()\n draw = ImageDraw.Draw(img_pil)\n\n for box, score, label in zip(boxes, scores, labels):\n box = [int(i) for i in box.tolist()]\n if score < score_threshold:\n continue\n result_labels.append([box, text_queries[label.item()]])\n draw.rectangle(box, outline="red", width=2)\n # Do not draw the text label within the bounding box\n # draw.text((box[0], box[1]), text_queries[label.item()], fill="red")\n\n # Save the annotated image for debugging\n annotated_image_path = "extensions/Lucid_Autonomy/annotated_image.png"\n img_pil.save(annotated_image_path)\n\n # Save the JSON output to a file\n json_output_path = "extensions/Lucid_Autonomy/output.json"\n with open(json_output_path, 'w') as f:\n json.dump(result_labels, f)\n\n unload_owlv2_model()\n\n return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir="extensions/Lucid_Autonomy/ImageOutputTest"):\n """\n Crops out the detected objects from the input image based on the bounding box coordinates.\n\n Args:\n image_path (str): The path to the input image.\n json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n output_dir (str): The directory to save the cropped images (default is "extensions/Lucid_Autonomy/ImageOutputTest").\n """\n with open(json_output_path, 'r') as f:\n data = json.load(f)\n\n img = Image.open(image_path)\n\n results = []\n\n # Load the miniCPM-lama3-v-2_5 model once before processing the images\n load_minicpm_llama_model()\n\n for i, item in enumerate(data):\n box = item[0]\n label = item[1]\n cropped_img = img.crop(box)\n cropped_img_path = f"{output_dir}/{label}_{i}.png"\n cropped_img.save(cropped_img_path)\n\n # Process the cropped image with the miniCPM-lama3-v-2_5 model\n vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n # Store the results in a structured format\n results.append({\n "image_name": f"{label}_{i}.png",\n "coordinates": box,\n "description": vision_model_response\n })\n\n # Unload the miniCPM-lama3-v-2_5 model after processing all images\n unload_minicpm_llama_model()\n\n # Save the results to a JSON file\n results_json_path = f"{output_dir}/results.json"\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n\n # Create the compressed results JSON file\n compressed_results_json_path = f"{output_dir}/compressedresults.json"\n compressed_results = [{"image_name": result["image_name"], 
"description": result["description"]} for result in results]\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n """\n Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n Args:\n image_path (str): The path to the input image.\n question (str): The question to ask the vision model.\n\n Returns:\n str: The response from the vision model.\n """\n image = Image.open(image_path).convert("RGB")\n messages = [\n {"role": "user", "content": f"{question}"}\n ]\n # Define the generation arguments\n generation_args = {\n "max_new_tokens": 896,\n "repetition_penalty": 1.05,\n "num_beams": 3,\n "top_p": 0.8,\n "top_k": 1,\n "temperature": 0.7,\n "sampling": True,\n }\n if "int4" in "openbmb/MiniCPM-Llama3-V-2_5":\n # Disable streaming for the 4-bit model\n generation_args["stream"] = False\n # Use the model.chat method with streaming enabled\n vision_model_response = ""\n for new_text in minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n ):\n vision_model_response += new_text\n print(new_text, flush=True, end='')\n else:\n minicpm_llama_model.to(device)\n vision_model_response = minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n )\n return vision_model_response\n\ndef ui():\n """\n Creates custom gradio elements when the UI is launched.\n """\n with gr.Blocks() as demo:\n with gr.Row():\n with gr.Column():\n monitor_index = gr.Textbox(label="Monitor Index", value=str(MONITOR_INDEX))\n text_queries = gr.Textbox(label="Text Queries", value=TEXT_QUERIES)\n score_threshold = gr.Textbox(label="Score Threshold", value=str(SCORE_THRESHOLD))\n vision_model_question = gr.Textbox(label="Vision Model Question", value=VISION_MODEL_QUESTION)\n take_screenshot_button = gr.Button(value="Take Screenshot")\n\n take_screenshot_button.click(\n fn=take_screenshot,\n inputs=[monitor_index, text_queries, score_threshold, vision_model_question],\n outputs=None\n )\n\n return demo\n\ndef input_modifier(user_input, state):\n """\n Modifies the user input before it is processed by the LLM.\n """\n return user_input\n\ndef output_modifier(output, state):\n """\n Modifies the LLM output before it is presented in the UI.\n """\n global oob_tasks\n oob_tasks = []\n\n # Search for the "Autonomy_Tasks:" trigger phrase in the LLM's output\n if "Autonomy_Tasks:" in output:\n # Clean the chat history\n state["history"] = history_modifier(state["history"])\n # Extract tasks and execute them\n tasks_section = output.split("Autonomy_Tasks:")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n\n # Write OOB tasks to a JSON file\n with open(oob_tasks_file, 'w') as f:\n json.dump(oob_tasks, f)\n\n # Trigger the execution of OOB tasks after a delay\n threading.Thread(target=execute_oob_tasks_with_delay).start()\n\n return output\n\n\noob_tasks = []\n\ndef execute_tasks(tasks_section):\n """\n Executes the tasks listed in the tasks_section.\n """\n global oob_tasks\n oob_tasks = []\n tasks = tasks_section.split("\\n")\n for task in tasks:\n task = task.strip()\n if 
task.startswith("MouseMover ="):\n image_name = task.split("=")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith("MouseClick ="):\n image_name = task.split("=")[1].strip()\n left_click_image(image_name)\n elif task.startswith("TextInput ="):\n text = task.split("=")[1].strip()\n text_input(text)\n elif task.startswith("SpecialKey ="):\n key = task.split("=")[1].strip()\n special_key(key)\n elif task.startswith("FunctionKey ="):\n key = task.split("=")[1].strip()\n function_key(key)\n elif task.startswith("KeyCombo ="):\n combo = task.split("=")[1].strip()\n key_combo(combo)\n elif task.startswith("Delay ="):\n delay = int(task.split("=")[1].strip())\n delay_task(delay)\n elif task.startswith("MouseMove ="):\n coordinates = task.split("=")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task.startswith("MouseClick ="):\n coordinates = task.split("=")[1].strip()\n left_click_coordinates_raw(coordinates)\n elif task == "TakeScreenshot":\n take_screenshot_task()\n elif task.startswith("OOB_"):\n oob_tasks.append(task)\n else:\n print(f"Unknown task: {task}")\n\ndef execute_oob_tasks_with_delay():\n """\n Executes the saved OOB tasks after a delay.\n """\n time.sleep(0.1) # Short delay to ensure the LLM's response has completed\n execute_oob_tasks()\n\ndef execute_oob_tasks():\n """\n Executes the saved OOB tasks.\n """\n with open(oob_tasks_file, 'r') as f:\n oob_tasks = json.load(f)\n\n for task in oob_tasks:\n task = task.strip()\n if task.startswith("OOB_MouseMover ="):\n image_name = task.split("=")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith("OOB_MouseClick ="):\n image_name = task.split("=")[1].strip()\n left_click_image(image_name)\n elif task.startswith("OOB_TextInput ="):\n text = task.split("=")[1].strip()\n text_input(text)\n elif task.startswith("OOB_SpecialKey ="):\n key = task.split("=")[1].strip()\n special_key(key)\n elif task.startswith("OOB_FunctionKey ="):\n key = task.split("=")[1].strip()\n function_key(key)\n elif task.startswith("OOB_KeyCombo ="):\n combo = task.split("=")[1].strip()\n key_combo(combo)\n elif task.startswith("OOB_Delay ="):\n delay = int(task.split("=")[1].strip())\n delay_task(delay)\n elif task.startswith("OOB_MouseMove ="):\n coordinates = task.split("=")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task.startswith("OOB_MouseClick ="):\n coordinates = task.split("=")[1].strip()\n left_click_coordinates_raw(coordinates)\n elif task == "OOB_TakeScreenshot":\n take_screenshot_task()\n else:\n print(f"Unknown OOB task: {task}")\n\n\n\ndef delay_task(delay):\n """\n Adds a delay in milliseconds.\n """\n time.sleep(delay / 1000.0)\n\ndef move_mouse_to_coordinates_raw(coordinates):\n """\n Moves the mouse to the specified raw coordinates.\n """\n x, y = map(int, coordinates.split(","))\n pyautogui.moveTo(x, y)\n print(f"Mouse moved to coordinates: ({x}, {y})")\n\ndef left_click_coordinates_raw(coordinates):\n """\n Moves the mouse to the specified raw coordinates and performs a left click.\n """\n x, y = map(int, coordinates.split(","))\n pyautogui.moveTo(x, y)\n pyautogui.click()\n print(f"Mouse clicked at coordinates: ({x}, {y})")\n\ndef take_screenshot_task():\n """\n Takes a screenshot.\n """\n take_screenshot(MONITOR_INDEX, TEXT_QUERIES, SCORE_THRESHOLD, VISION_MODEL_QUESTION)\n\n\ndef move_mouse_to_image(image_name):\n """\n Moves the mouse to the center of the box that defines the image.\n """\n results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/results.json"\n 
with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result["image_name"] == image_name:\n coordinates = result["coordinates"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n break\n else:\n print(f"Image name '{image_name}' not found in the results.")\n\ndef left_click_image(image_name):\n """\n Moves the mouse to the center of the box that defines the image and performs a left click.\n """\n results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/results.json"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result["image_name"] == image_name:\n coordinates = result["coordinates"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n pyautogui.click() # Perform a left click\n break\n else:\n print(f"Image name '{image_name}' not found in the results.")\n\ndef move_mouse_to_coordinates(monitor, coordinates):\n """\n Moves the mouse cursor to the specified coordinates on the monitor.\n\n Args:\n monitor (dict): A dictionary containing the monitor's information.\n coordinates (list): A list containing the coordinates [x1, y1, x2, y2].\n """\n x1, y1, x2, y2 = coordinates\n monitor_x, monitor_y = monitor.x, monitor.y\n center_x = monitor_x + (x1 + x2) // 2\n center_y = monitor_y + (y1 + y2) // 2\n pyautogui.moveTo(center_x, center_y)\n print(f"Mouse moved to coordinates: ({center_x}, {center_y})")\n\ndef get_monitor_info(monitor_index):\n """\n Gets information about the specified monitor.\n\n Args:\n monitor_index (int): The index of the monitor to get information about.\n\n Returns:\n dict: A dictionary containing the monitor's information.\n """\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f"Monitor index {monitor_index} is out of range. 
There are only {len(monitors)} monitors.")\n\n return monitors[monitor_index]\n\ndef history_modifier(history):\n """\n Modifies the chat history before the text generation in chat mode begins.\n """\n # Clean the internal history\n history["internal"] = clean_chat_history(history["internal"])\n # Clean the visible history\n history["visible"] = clean_chat_history(history["visible"])\n\n # Inject the latest compressedresults.json into the internal history\n compressed_results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n latest_compressed_results = json.dumps(compressed_results, indent=4)\n history["internal"].append(["", f"<START_COMPRESSED_RESULTS>\\\\\\\\n{latest_compressed_results}\\\\\\\\n<END_COMPRESSED_RESULTS>"])\n\n return history\n\ndef clean_chat_history(history):\n """\n Cleans the chat history by removing all but the latest instance of the contents between the bookends.\n """\n START_BOOKEND = "<START_COMPRESSED_RESULTS>"\n END_BOOKEND = "<END_COMPRESSED_RESULTS>"\n\n # Find all instances of the bookends\n bookend_indices = []\n for i, entry in enumerate(history):\n if isinstance(entry, list) and len(entry) == 2:\n if START_BOOKEND in entry[1] and END_BOOKEND in entry[1]:\n bookend_indices.append(i)\n\n # If there are multiple instances, remove all but the last one\n if len(bookend_indices) > 1:\n for i in bookend_indices[:-1]:\n history[i][1] = re.sub(f"{START_BOOKEND}.*?{END_BOOKEND}", "", history[i][1], flags=re.DOTALL)\n\n return history\n\ndef append_compressed_results(output, compressed_results):\n """\n Appends the compressed results to the output.\n """\n START_BOOKEND = "<START_COMPRESSED_RESULTS>"\n END_BOOKEND = "<END_COMPRESSED_RESULTS>"\n output += f"\\\\\\\\n\\\\\\\\n{START_BOOKEND}\\\\\\\\n{json.dumps(compressed_results, indent=4)}\\\\\\\\n{END_BOOKEND}"\n return output\n\ndef text_input(text):\n """\n Types the given text, replacing '/n' with newline characters.\n """\n text = text.replace("/n", "\\\\n")\n pyautogui.typewrite(text)\n\ndef special_key(key):\n """\n Presses the specified special key.\n """\n if key == "enter":\n pyautogui.press("enter")\n elif key == "space":\n pyautogui.press("space")\n elif key == "backspace":\n pyautogui.press("backspace")\n elif key == "tab":\n pyautogui.press("tab")\n elif key == "escape":\n pyautogui.press("esc")\n elif key == "shift":\n pyautogui.press("shift")\n elif key == "ctrl":\n pyautogui.press("ctrl")\n elif key == "alt":\n pyautogui.press("alt")\n else:\n print(f"Unsupported special key: {key}")\n\ndef function_key(key):\n """\n Presses the specified function key.\n """\n if key.startswith("f") and key[1:].isdigit():\n pyautogui.press(key)\n else:\n print(f"Unsupported function key: {key}")\n\ndef key_combo(combo):\n """\n Presses the specified key combination.\n """\n keys = combo.split("+")\n for key in keys:\n if key in ["ctrl", "alt", "shift"]:\n pyautogui.keyDown(key)\n else:\n pyautogui.press(key)\n for key in keys:\n if key in ["ctrl", "alt", "shift"]:\n pyautogui.keyUp(key)\n\nif __name__ == "__main__":\n # Launch the Gradio interface with the specified UI components\n gr.Interface(\n fn=ui,\n inputs=None,\n outputs="ui"\n ).launch()\n",
"Thank you for providing the detailed context and the current state of the code. The issue you're encountering is due to the fact that the `MouseClick` command is being interpreted as an image-based command rather than a coordinate-based command. We need to ensure that the `MouseClick` command for raw coordinates is handled correctly.\n\n### Plan\n\n1. **Update the `execute_tasks` function** to correctly handle `MouseClick` commands for raw coordinates.\n2. **Update the `execute_oob_tasks` function** to correctly handle `MouseClick` commands for raw coordinates.\n3. **Ensure that the `output_modifier` function** triggers the execution of OOB tasks after a delay.\n\n### Step 1: Update the `execute_tasks` function\n\nWe'll update the `execute_tasks` function to correctly handle `MouseClick` commands for raw coordinates.\n\n```python\ndef execute_tasks(tasks_section):\n """\n Executes the tasks listed in the tasks_section.\n """\n global oob_tasks\n oob_tasks = []\n tasks = tasks_section.split("\\n")\n for task in tasks:\n task = task.strip()\n if task.startswith("MouseMover ="):\n image_name = task.split("=")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith("MouseClick ="):\n value = task.split("=")[1].strip()\n if "," in value: # Check if the value contains coordinates\n coordinates = value\n left_click_coordinates_raw(coordinates)\n else: # Otherwise, treat it as an image name\n image_name = value\n left_click_image(image_name)\n elif task.startswith("TextInput ="):\n text = task.split("=")[1].strip()\n text_input(text)\n elif task.startswith("SpecialKey ="):\n key = task.split("=")[1].strip()\n special_key(key)\n elif task.startswith("FunctionKey ="):\n key = task.split("=")[1].strip()\n function_key(key)\n elif task.startswith("KeyCombo ="):\n combo = task.split("=")[1].strip()\n key_combo(combo)\n elif task.startswith("Delay ="):\n delay = int(task.split("=")[1].strip())\n delay_task(delay)\n elif task.startswith("MouseMove ="):\n coordinates = task.split("=")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task == "TakeScreenshot":\n take_screenshot_task()\n elif task.startswith("OOB_"):\n oob_tasks.append(task)\n else:\n print(f"Unknown task: {task}")\n```\n\n### Step 2: Update the `execute_oob_tasks` function\n\nWe'll update the `execute_oob_tasks` function to correctly handle `MouseClick` commands for raw coordinates.\n\n```python\ndef execute_oob_tasks():\n """\n Executes the saved OOB tasks.\n """\n with open(oob_tasks_file, 'r') as f:\n oob_tasks = json.load(f)\n\n for task in oob_tasks:\n task = task.strip()\n if task.startswith("OOB_MouseMover ="):\n image_name = task.split("=")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith("OOB_MouseClick ="):\n value = task.split("=")[1].strip()\n if "," in value: # Check if the value contains coordinates\n coordinates = value\n left_click_coordinates_raw(coordinates)\n else: # Otherwise, treat it as an image name\n image_name = value\n left_click_image(image_name)\n elif task.startswith("OOB_TextInput ="):\n text = task.split("=")[1].strip()\n text_input(text)\n elif task.startswith("OOB_SpecialKey ="):\n key = task.split("=")[1].strip()\n special_key(key)\n elif task.startswith("OOB_FunctionKey ="):\n key = task.split("=")[1].strip()\n function_key(key)\n elif task.startswith("OOB_KeyCombo ="):\n combo = task.split("=")[1].strip()\n key_combo(combo)\n elif task.startswith("OOB_Delay ="):\n delay = int(task.split("=")[1].strip())\n delay_task(delay)\n elif task.startswith("OOB_MouseMove 
="):\n coordinates = task.split("=")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task.startswith("OOB_MouseClick ="):\n coordinates = task.split("=")[1].strip()\n left_click_coordinates_raw(coordinates)\n elif task == "OOB_TakeScreenshot":\n take_screenshot_task()\n else:\n print(f"Unknown OOB task: {task}")\n```\n\n### Step 3: Ensure the `output_modifier` function triggers the execution of OOB tasks after a delay\n\nWe'll ensure that the `output_modifier` function triggers the execution of OOB tasks after a delay.\n\n```python\ndef output_modifier(output, state):\n """\n Modifies the LLM output before it is presented in the UI.\n """\n global oob_tasks\n oob_tasks = []\n\n # Search for the "Autonomy_Tasks:" trigger phrase in the LLM's output\n if "Autonomy_Tasks:" in output:\n # Clean the chat history\n state["history"] = history_modifier(state["history"])\n # Extract tasks and execute them\n tasks_section = output.split("Autonomy_Tasks:")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n\n # Write OOB tasks to a JSON file\n with open(oob_tasks_file, 'w') as f:\n json.dump(oob_tasks, f)\n\n # Trigger the execution of OOB tasks after a delay\n threading.Thread(target=execute_oob_tasks_with_delay).start()\n\n return output\n```\n\n### Putting it all together\n\nHere are the changes you need to make to the main code:\n\n1. **Update the `execute_tasks` function** to correctly handle `MouseClick` commands for raw coordinates.\n2. **Update the `execute_oob_tasks` function** to correctly handle `MouseClick` commands for raw coordinates.\n3. **Ensure the `output_modifier` function** triggers the execution of OOB tasks after a delay.\n\nYou can integrate these changes into your main code as needed. If you have any questions or need further assistance, feel free to ask!"
],
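The fix above turns on one detail worth spelling out: Python walks an `elif` chain in order, so in the original `execute_tasks` the first `MouseClick =` branch (image lookup) always won and the later coordinate branch was unreachable, which is why `1759,2117` was handed to `left_click_image` and reported as a missing image name. A minimal sketch of the disambiguation idea, kept separate from the extension code (the helper names and stub prints below are hypothetical, for illustration only):

```python
# Minimal sketch: route a "MouseClick = value" task to the coordinate handler
# or the image-name handler depending on what the value looks like.

def looks_like_coordinates(value: str) -> bool:
    """True for values such as '1759,2117' (two comma-separated integers)."""
    parts = [p.strip() for p in value.split(",")]
    return len(parts) == 2 and all(p.lstrip("-").isdigit() for p in parts)

def dispatch_mouse_click(value: str) -> None:
    if looks_like_coordinates(value):
        x, y = map(int, value.split(","))
        print(f"left click at raw coordinates ({x}, {y})")   # stands in for left_click_coordinates_raw
    else:
        print(f"left click on detected image '{value}'")      # stands in for left_click_image

if __name__ == "__main__":
    dispatch_mouse_click("1759,2117")                # coordinate path
    dispatch_mouse_click("clickable icons_20.png")   # image-name path
```

The simple `"," in value` test used in the fix is enough for the current task vocabulary; the stricter integer check above would only matter if an image name could ever contain a comma.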
[
"Oh my frick! Yes you got it, everything is fixed now :3\n\nLet's work on adding a new feature. Okay so when a screenshot is taken all of the proper actions are done, the OWL model defines the boxes, the boxes are cropped from the screenshot, and the cropped images are fed to minicpm; and all the resulting data are stored in the convenient json file.\n\nNow what I want to change is this, when a screenshot is taken I want the entire screenshot to be sent to minicpm as well as a separate question, a separate question to that one that is asked of the individual cropped images. I want the response from minicpm to be added to the results.json and compressedresults.json files.\n\nMy thinking is this, the LLM can identify all the UI elements, but it might be helpful for the LLM to contextualize the UI elements within the context of the entire screenshot. \n\nSo we are going to need to add a separate Vision_Model_Question maybe we can name it something like Full_Image_Vision_Model_Question. This way I can ask minicpm something like "Please describe what you see in this image, if you see UI elements describe them and their general location and if you see text read the text verbatim." ",
"Great idea! Adding a separate question for the entire screenshot can indeed provide more context for the LLM. We'll need to modify the code to send the entire screenshot to the `minicpm` model with a different question and then include the response in the `results.json` and `compressedresults.json` files.\n\n### Plan\n\n1. **Add a new configurable variable** for the full image vision model question.\n2. **Modify the `take_screenshot` function** to process the entire screenshot with the `minicpm` model.\n3. **Update the `crop_images` function** to include the response from the full image in the results.\n4. **Update the `results.json` and `compressedresults.json` files** to include the response from the full image.\n\n### Step 1: Add a new configurable variable\n\nWe'll add a new configurable variable for the full image vision model question.\n\n```python\n# Configurable variables\nMONITOR_INDEX = 1 # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nTEXT_QUERIES = "clickable hyperlinks,clickable icons,text bar field" # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.2 # Score threshold for filtering low-probability predictions\nVISION_MODEL_QUESTION = "Keep your response to less than 1 sentence: What is the text in this image or what does this image describe?"\nFULL_IMAGE_VISION_MODEL_QUESTION = "Please describe what you see in this image, if you see UI elements describe them and their general location and if you see text read the text verbatim."\n```\n\n### Step 2: Modify the `take_screenshot` function\n\nWe'll modify the `take_screenshot` function to process the entire screenshot with the `minicpm` model.\n\n```python\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, full_image_vision_model_question, output_filename="screenshot.png"):\n """\n Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n Args:\n monitor_index (int): The index of the monitor to capture.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n vision_model_question (str): The question to ask the vision model for cropped images.\n full_image_vision_model_question (str): The question to ask the vision model for the full image.\n output_filename (str): The filename to save the screenshot as (default is "screenshot.png").\n """\n monitor_index = int(monitor_index)\n score_threshold = float(score_threshold)\n\n # Get information about all monitors\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f"Monitor index {monitor_index} is out of range. 
There are only {len(monitors)} monitors.")\n\n # Get the region of the specified monitor\n monitor = monitors[monitor_index]\n region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n # Take the screenshot\n output_dir = "extensions/Lucid_Autonomy"\n os.makedirs(output_dir, exist_ok=True)\n screenshot_path = os.path.join(output_dir, output_filename)\n screenshot = pyautogui.screenshot(region=region)\n screenshot.save(screenshot_path)\n print(f"Screenshot saved as {screenshot_path}")\n\n # Process the image using the Owlv2 model\n annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n # Process the entire screenshot with the minicpm model\n full_image_response = process_with_vision_model(screenshot_path, full_image_vision_model_question)\n\n # Create the output directory if it doesn't exist\n output_dir = "extensions/Lucid_Autonomy/ImageOutputTest"\n os.makedirs(output_dir, exist_ok=True)\n\n # Clear the output directory\n for filename in os.listdir(output_dir):\n file_path = os.path.join(output_dir, filename)\n try:\n if os.path.isfile(file_path) or os.path.islink(file_path):\n os.unlink(file_path)\n elif os.path.isdir(file_path):\n shutil.rmtree(file_path)\n except Exception as e:\n print(f"Failed to delete {file_path}. Reason: {e}")\n\n # Crop the detected objects and save them to the output directory\n crop_images(annotated_image_path, json_output_path, output_dir, full_image_response)\n\n print("Processing complete. Check the output directory for the cropped images and results.")\n```\n\n### Step 3: Update the `crop_images` function\n\nWe'll update the `crop_images` function to include the response from the full image in the results.\n\n```python\ndef crop_images(image_path, json_output_path, output_dir="extensions/Lucid_Autonomy/ImageOutputTest", full_image_response=""):\n """\n Crops out the detected objects from the input image based on the bounding box coordinates.\n\n Args:\n image_path (str): The path to the input image.\n json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n output_dir (str): The directory to save the cropped images (default is "extensions/Lucid_Autonomy/ImageOutputTest").\n full_image_response (str): The response from the vision model for the full image.\n """\n with open(json_output_path, 'r') as f:\n data = json.load(f)\n\n img = Image.open(image_path)\n\n results = []\n\n # Load the miniCPM-lama3-v-2_5 model once before processing the images\n load_minicpm_llama_model()\n\n for i, item in enumerate(data):\n box = item[0]\n label = item[1]\n cropped_img = img.crop(box)\n cropped_img_path = f"{output_dir}/{label}_{i}.png"\n cropped_img.save(cropped_img_path)\n\n # Process the cropped image with the miniCPM-lama3-v-2_5 model\n vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n # Store the results in a structured format\n results.append({\n "image_name": f"{label}_{i}.png",\n "coordinates": box,\n "description": vision_model_response\n })\n\n # Unload the miniCPM-lama3-v-2_5 model after processing all images\n unload_minicpm_llama_model()\n\n # Save the results to a JSON file\n results_json_path = f"{output_dir}/results.json"\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n\n # Create the compressed results JSON file\n compressed_results_json_path = f"{output_dir}/compressedresults.json"\n compressed_results = [{"image_name": result["image_name"], "description": result["description"]} for 
result in results]\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n\n # Add the full image response to the results\n if full_image_response:\n results.append({\n "image_name": "full_image.png",\n "coordinates": [],\n "description": full_image_response\n })\n compressed_results.append({\n "image_name": "full_image.png",\n "description": full_image_response\n })\n\n # Save the updated results to the JSON files\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n```\n\n### Step 4: Update the `results.json` and `compressedresults.json` files\n\nWe'll update the `results.json` and `compressedresults.json` files to include the response from the full image.\n\n### Putting it all together\n\nHere are the changes you need to make to the main code:\n\n1. **Add a new configurable variable** for the full image vision model question.\n2. **Modify the `take_screenshot` function** to process the entire screenshot with the `minicpm` model.\n3. **Update the `crop_images` function** to include the response from the full image in the results.\n4. **Update the `results.json` and `compressedresults.json` files** to include the response from the full image.\n\nYou can integrate these changes into your main code as needed. If you have any questions or need further assistance, feel free to ask!"
],
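For reference, here is the shape the two JSON files take once the whole-screenshot description is appended. This is a hand-written sketch with invented example values (the placeholder name `full_image.png` and the empty `coordinates` list follow the plan above):

```python
import json

# Hypothetical example of results.json after the change: each cropped element
# keeps its bounding box, and one extra entry carries the full-screenshot
# description with an empty coordinates list.
results = [
    {
        "image_name": "clickable icons_20.png",
        "coordinates": [1712, 2085, 1806, 2149],   # invented box, for illustration
        "description": "A button labeled 'Generate'.",
    },
    {
        "image_name": "full_image.png",
        "coordinates": [],
        "description": "A chat web UI with a text input field at the bottom ...",
    },
]

# compressedresults.json is the same list with the coordinates dropped.
compressed = [
    {"image_name": r["image_name"], "description": r["description"]}
    for r in results
]

print(json.dumps(compressed, indent=4))
```

Because the full-image entry has no coordinates, `MouseMover` and `MouseClick` tasks should keep targeting the cropped-image names; `full_image.png` exists only to give the LLM context about the whole screen.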
[
"Hmm, I got a warning when I loaded the model, and an error when I tried taking a screenshot, I'm going to give you the error, then the full state of the code:\n\nWarning and Error:\n\n19:43:36-536590 INFO Starting Text generation web UI \n19:43:36-539499 WARNING trust_remote_code is enabled. This is dangerous. \n19:43:36-540147 INFO Loading settings from "settings.yaml" \n19:43:36-542277 INFO Loading the extension "Lucid_Autonomy" \n/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/gradio/utils.py:928: UserWarning: Expected at least 5 arguments for function <function take_screenshot at 0x7011f6b66660>, received 4.\n warnings.warn(\n\nRunning on local URL: http://127.0.0.1:7861\n\nScreenshot saved as extensions/Lucid_Autonomy/screenshot.png\nOwlv2 model loaded.\nOwlv2 model unloaded.\nTraceback (most recent call last):\n File "/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/gradio/queueing.py", line 527, in process_events\n response = await route_utils.call_process_api(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/gradio/route_utils.py", line 261, in call_process_api\n output = await app.get_blocks().process_api(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/gradio/blocks.py", line 1786, in process_api\n result = await self.call_function(\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/gradio/blocks.py", line 1338, in call_function\n prediction = await anyio.to_thread.run_sync(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/anyio/to_thread.py", line 56, in run_sync\n return await get_async_backend().run_sync_in_worker_thread(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread\n return await future\n ^^^^^^^^^^^^\n File "/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 859, in run\n result = context.run(func, *args)\n ^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/gradio/utils.py", line 759, in wrapper\n response = f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File "/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/extensions/Lucid_Autonomy/script.py", line 127, in take_screenshot\n full_image_response = process_with_vision_model(screenshot_path, full_image_vision_model_question)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/extensions/Lucid_Autonomy/script.py", line 311, in process_with_vision_model\n minicpm_llama_model.to(device)\n ^^^^^^^^^^^^^^^^^^^^^^\nAttributeError: 'NoneType' object has no attribute 'to'\n\n\nCode:\nimport gradio as gr\nimport json\nimport 
pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport torch\nimport os\nimport shutil\nimport re # Import the re module for regular expression operations\nimport json\nimport time\nimport threading\n\noob_tasks_file = "extensions/Lucid_Autonomy/oob_tasks.json"\n\n# Configurable variables\nMONITOR_INDEX = 1 # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nTEXT_QUERIES = "clickable hyperlinks,clickable icons,text bar field" # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.2 # Score threshold for filtering low-probability predictions\nVISION_MODEL_QUESTION = "Keep your response to less than 1 sentence: What is the text in this image or what does this image describe?"\nFULL_IMAGE_VISION_MODEL_QUESTION = "Please describe what you see in this image, if you see UI elements describe them and their general location and if you see text read the text verbatim."\n\n\n# Use GPU if available\nif torch.cuda.is_available():\n device = torch.device("cuda")\nelse:\n device = torch.device("cpu")\n\n# Global variables for models\nowlv2_model = None\nowlv2_processor = None\nminicpm_llama_model = None\nminicpm_llama_tokenizer = None\n\nVISION_MODEL_ID = "/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5"\n\ndef load_owlv2_model():\n global owlv2_model, owlv2_processor\n owlv2_model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble", trust_remote_code=True).to(device)\n owlv2_processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble", trust_remote_code=True)\n print("Owlv2 model loaded.")\n\ndef unload_owlv2_model():\n global owlv2_model, owlv2_processor\n if owlv2_model is not None:\n del owlv2_model\n del owlv2_processor\n torch.cuda.empty_cache()\n print("Owlv2 model unloaded.")\n\ndef load_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if "int4" in VISION_MODEL_ID:\n # Load the 4-bit quantized model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={"": device},\n trust_remote_code=True,\n torch_dtype=torch.float16 # Use float16 as per the example code for 4-bit models\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print("MiniCPM-Llama3 4-bit quantized model loaded on-demand.")\n else:\n # Load the standard model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={"": device}, # Use the specified CUDA device\n trust_remote_code=True,\n torch_dtype=torch.bfloat16\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print("MiniCPM-Llama3 standard model loaded on-demand.")\n\ndef unload_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if minicpm_llama_model is not None:\n del minicpm_llama_model\n del minicpm_llama_tokenizer\n torch.cuda.empty_cache()\n print("MiniCPM-Llama3 model unloaded.")\n\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, full_image_vision_model_question, output_filename="screenshot.png"):\n """\n Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n Args:\n monitor_index (int): The index of the monitor to capture.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): 
The score threshold for filtering low-probability predictions.\n vision_model_question (str): The question to ask the vision model for cropped images.\n full_image_vision_model_question (str): The question to ask the vision model for the full image.\n output_filename (str): The filename to save the screenshot as (default is "screenshot.png").\n """\n monitor_index = int(monitor_index)\n score_threshold = float(score_threshold)\n\n # Get information about all monitors\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.")\n\n # Get the region of the specified monitor\n monitor = monitors[monitor_index]\n region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n # Take the screenshot\n output_dir = "extensions/Lucid_Autonomy"\n os.makedirs(output_dir, exist_ok=True)\n screenshot_path = os.path.join(output_dir, output_filename)\n screenshot = pyautogui.screenshot(region=region)\n screenshot.save(screenshot_path)\n print(f"Screenshot saved as {screenshot_path}")\n\n # Process the image using the Owlv2 model\n annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n # Process the entire screenshot with the minicpm model\n full_image_response = process_with_vision_model(screenshot_path, full_image_vision_model_question)\n\n # Create the output directory if it doesn't exist\n output_dir = "extensions/Lucid_Autonomy/ImageOutputTest"\n os.makedirs(output_dir, exist_ok=True)\n\n # Clear the output directory\n for filename in os.listdir(output_dir):\n file_path = os.path.join(output_dir, filename)\n try:\n if os.path.isfile(file_path) or os.path.islink(file_path):\n os.unlink(file_path)\n elif os.path.isdir(file_path):\n shutil.rmtree(file_path)\n except Exception as e:\n print(f"Failed to delete {file_path}. Reason: {e}")\n\n # Crop the detected objects and save them to the output directory\n crop_images(annotated_image_path, json_output_path, output_dir, full_image_response)\n\n print("Processing complete. 
Check the output directory for the cropped images and results.")\n\n\ndef query_image(img_path, text_queries, score_threshold):\n """\n Processes the image using the Owlv2 model to detect objects based on text queries.\n\n Args:\n img_path (str): The path to the input image.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n """\n load_owlv2_model()\n\n img = Image.open(img_path)\n text_queries = text_queries.split(",")\n\n size = max(img.size)\n target_sizes = torch.Tensor([[size, size]])\n inputs = owlv2_processor(text=text_queries, images=img, return_tensors="pt").to(device)\n\n with torch.no_grad():\n outputs = owlv2_model(**inputs)\n\n outputs.logits = outputs.logits.cpu()\n outputs.pred_boxes = outputs.pred_boxes.cpu()\n results = owlv2_processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n boxes, scores, labels = results[0]["boxes"], results[0]["scores"], results[0]["labels"]\n\n result_labels = []\n img_pil = img.copy()\n draw = ImageDraw.Draw(img_pil)\n\n for box, score, label in zip(boxes, scores, labels):\n box = [int(i) for i in box.tolist()]\n if score < score_threshold:\n continue\n result_labels.append([box, text_queries[label.item()]])\n draw.rectangle(box, outline="red", width=2)\n # Do not draw the text label within the bounding box\n # draw.text((box[0], box[1]), text_queries[label.item()], fill="red")\n\n # Save the annotated image for debugging\n annotated_image_path = "extensions/Lucid_Autonomy/annotated_image.png"\n img_pil.save(annotated_image_path)\n\n # Save the JSON output to a file\n json_output_path = "extensions/Lucid_Autonomy/output.json"\n with open(json_output_path, 'w') as f:\n json.dump(result_labels, f)\n\n unload_owlv2_model()\n\n return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir="extensions/Lucid_Autonomy/ImageOutputTest", full_image_response=""):\n """\n Crops out the detected objects from the input image based on the bounding box coordinates.\n\n Args:\n image_path (str): The path to the input image.\n json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n output_dir (str): The directory to save the cropped images (default is "extensions/Lucid_Autonomy/ImageOutputTest").\n full_image_response (str): The response from the vision model for the full image.\n """\n with open(json_output_path, 'r') as f:\n data = json.load(f)\n\n img = Image.open(image_path)\n\n results = []\n\n # Load the miniCPM-lama3-v-2_5 model once before processing the images\n load_minicpm_llama_model()\n\n for i, item in enumerate(data):\n box = item[0]\n label = item[1]\n cropped_img = img.crop(box)\n cropped_img_path = f"{output_dir}/{label}_{i}.png"\n cropped_img.save(cropped_img_path)\n\n # Process the cropped image with the miniCPM-lama3-v-2_5 model\n vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n # Store the results in a structured format\n results.append({\n "image_name": f"{label}_{i}.png",\n "coordinates": box,\n "description": vision_model_response\n })\n\n # Unload the miniCPM-lama3-v-2_5 model after processing all images\n unload_minicpm_llama_model()\n\n # Save the results to a JSON file\n results_json_path = f"{output_dir}/results.json"\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n\n # Create the compressed results JSON file\n 
compressed_results_json_path = f"{output_dir}/compressedresults.json"\n compressed_results = [{"image_name": result["image_name"], "description": result["description"]} for result in results]\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n\n # Add the full image response to the results\n if full_image_response:\n results.append({\n "image_name": "full_image.png",\n "coordinates": [],\n "description": full_image_response\n })\n compressed_results.append({\n "image_name": "full_image.png",\n "description": full_image_response\n })\n\n # Save the updated results to the JSON files\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n\n\ndef process_with_vision_model(image_path, question):\n """\n Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n Args:\n image_path (str): The path to the input image.\n question (str): The question to ask the vision model.\n\n Returns:\n str: The response from the vision model.\n """\n image = Image.open(image_path).convert("RGB")\n messages = [\n {"role": "user", "content": f"{question}"}\n ]\n # Define the generation arguments\n generation_args = {\n "max_new_tokens": 896,\n "repetition_penalty": 1.05,\n "num_beams": 3,\n "top_p": 0.8,\n "top_k": 1,\n "temperature": 0.7,\n "sampling": True,\n }\n if "int4" in "openbmb/MiniCPM-Llama3-V-2_5":\n # Disable streaming for the 4-bit model\n generation_args["stream"] = False\n # Use the model.chat method with streaming enabled\n vision_model_response = ""\n for new_text in minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n ):\n vision_model_response += new_text\n print(new_text, flush=True, end='')\n else:\n minicpm_llama_model.to(device)\n vision_model_response = minicpm_llama_model.chat(\n image=image,\n msgs=messages,\n tokenizer=minicpm_llama_tokenizer,\n **generation_args\n )\n return vision_model_response\n\ndef ui():\n """\n Creates custom gradio elements when the UI is launched.\n """\n with gr.Blocks() as demo:\n with gr.Row():\n with gr.Column():\n monitor_index = gr.Textbox(label="Monitor Index", value=str(MONITOR_INDEX))\n text_queries = gr.Textbox(label="Text Queries", value=TEXT_QUERIES)\n score_threshold = gr.Textbox(label="Score Threshold", value=str(SCORE_THRESHOLD))\n vision_model_question = gr.Textbox(label="Vision Model Question", value=VISION_MODEL_QUESTION)\n take_screenshot_button = gr.Button(value="Take Screenshot")\n\n take_screenshot_button.click(\n fn=take_screenshot,\n inputs=[monitor_index, text_queries, score_threshold, vision_model_question],\n outputs=None\n )\n\n return demo\n\ndef input_modifier(user_input, state):\n """\n Modifies the user input before it is processed by the LLM.\n """\n return user_input\n\ndef output_modifier(output, state):\n """\n Modifies the LLM output before it is presented in the UI.\n """\n global oob_tasks\n oob_tasks = []\n\n # Search for the "Autonomy_Tasks:" trigger phrase in the LLM's output\n if "Autonomy_Tasks:" in output:\n # Clean the chat history\n state["history"] = history_modifier(state["history"])\n # Extract tasks and execute them\n tasks_section = output.split("Autonomy_Tasks:")[-1].strip()\n if tasks_section:\n execute_tasks(tasks_section)\n # Append new compressed results\n compressed_results_json_path = 
"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n output = append_compressed_results(output, compressed_results)\n\n # Write OOB tasks to a JSON file\n with open(oob_tasks_file, 'w') as f:\n json.dump(oob_tasks, f)\n\n # Trigger the execution of OOB tasks after a delay\n threading.Thread(target=execute_oob_tasks_with_delay).start()\n\n return output\n\n\n\noob_tasks = []\n\ndef execute_tasks(tasks_section):\n """\n Executes the tasks listed in the tasks_section.\n """\n global oob_tasks\n oob_tasks = []\n tasks = tasks_section.split("\\n")\n for task in tasks:\n task = task.strip()\n if task.startswith("MouseMover ="):\n image_name = task.split("=")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith("MouseClick ="):\n value = task.split("=")[1].strip()\n if "," in value: # Check if the value contains coordinates\n coordinates = value\n left_click_coordinates_raw(coordinates)\n else: # Otherwise, treat it as an image name\n image_name = value\n left_click_image(image_name)\n elif task.startswith("TextInput ="):\n text = task.split("=")[1].strip()\n text_input(text)\n elif task.startswith("SpecialKey ="):\n key = task.split("=")[1].strip()\n special_key(key)\n elif task.startswith("FunctionKey ="):\n key = task.split("=")[1].strip()\n function_key(key)\n elif task.startswith("KeyCombo ="):\n combo = task.split("=")[1].strip()\n key_combo(combo)\n elif task.startswith("Delay ="):\n delay = int(task.split("=")[1].strip())\n delay_task(delay)\n elif task.startswith("MouseMove ="):\n coordinates = task.split("=")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task == "TakeScreenshot":\n take_screenshot_task()\n elif task.startswith("OOB_"):\n oob_tasks.append(task)\n else:\n print(f"Unknown task: {task}")\n\n\ndef execute_oob_tasks_with_delay():\n """\n Executes the saved OOB tasks after a delay.\n """\n time.sleep(0.1) # Short delay to ensure the LLM's response has completed\n execute_oob_tasks()\n\ndef execute_oob_tasks():\n """\n Executes the saved OOB tasks.\n """\n with open(oob_tasks_file, 'r') as f:\n oob_tasks = json.load(f)\n\n for task in oob_tasks:\n task = task.strip()\n if task.startswith("OOB_MouseMover ="):\n image_name = task.split("=")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith("OOB_MouseClick ="):\n value = task.split("=")[1].strip()\n if "," in value: # Check if the value contains coordinates\n coordinates = value\n left_click_coordinates_raw(coordinates)\n else: # Otherwise, treat it as an image name\n image_name = value\n left_click_image(image_name)\n elif task.startswith("OOB_TextInput ="):\n text = task.split("=")[1].strip()\n text_input(text)\n elif task.startswith("OOB_SpecialKey ="):\n key = task.split("=")[1].strip()\n special_key(key)\n elif task.startswith("OOB_FunctionKey ="):\n key = task.split("=")[1].strip()\n function_key(key)\n elif task.startswith("OOB_KeyCombo ="):\n combo = task.split("=")[1].strip()\n key_combo(combo)\n elif task.startswith("OOB_Delay ="):\n delay = int(task.split("=")[1].strip())\n delay_task(delay)\n elif task.startswith("OOB_MouseMove ="):\n coordinates = task.split("=")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task.startswith("OOB_MouseClick ="):\n coordinates = task.split("=")[1].strip()\n left_click_coordinates_raw(coordinates)\n elif task == "OOB_TakeScreenshot":\n take_screenshot_task()\n else:\n print(f"Unknown OOB task: {task}")\n\n\n\n\ndef 
delay_task(delay):\n """\n Adds a delay in milliseconds.\n """\n time.sleep(delay / 1000.0)\n\ndef move_mouse_to_coordinates_raw(coordinates):\n """\n Moves the mouse to the specified raw coordinates.\n """\n x, y = map(int, coordinates.split(","))\n pyautogui.moveTo(x, y)\n print(f"Mouse moved to coordinates: ({x}, {y})")\n\ndef left_click_coordinates_raw(coordinates):\n """\n Moves the mouse to the specified raw coordinates and performs a left click.\n """\n x, y = map(int, coordinates.split(","))\n pyautogui.moveTo(x, y)\n pyautogui.click()\n print(f"Mouse clicked at coordinates: ({x}, {y})")\n\ndef take_screenshot_task():\n """\n Takes a screenshot.\n """\n take_screenshot(MONITOR_INDEX, TEXT_QUERIES, SCORE_THRESHOLD, VISION_MODEL_QUESTION)\n\n\ndef move_mouse_to_image(image_name):\n """\n Moves the mouse to the center of the box that defines the image.\n """\n results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/results.json"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result["image_name"] == image_name:\n coordinates = result["coordinates"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n break\n else:\n print(f"Image name '{image_name}' not found in the results.")\n\ndef left_click_image(image_name):\n """\n Moves the mouse to the center of the box that defines the image and performs a left click.\n """\n results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/results.json"\n with open(results_json_path, 'r') as f:\n results = json.load(f)\n\n for result in results:\n if result["image_name"] == image_name:\n coordinates = result["coordinates"]\n monitor = get_monitor_info(MONITOR_INDEX)\n move_mouse_to_coordinates(monitor, coordinates)\n pyautogui.click() # Perform a left click\n break\n else:\n print(f"Image name '{image_name}' not found in the results.")\n\ndef move_mouse_to_coordinates(monitor, coordinates):\n """\n Moves the mouse cursor to the specified coordinates on the monitor.\n\n Args:\n monitor (dict): A dictionary containing the monitor's information.\n coordinates (list): A list containing the coordinates [x1, y1, x2, y2].\n """\n x1, y1, x2, y2 = coordinates\n monitor_x, monitor_y = monitor.x, monitor.y\n center_x = monitor_x + (x1 + x2) // 2\n center_y = monitor_y + (y1 + y2) // 2\n pyautogui.moveTo(center_x, center_y)\n print(f"Mouse moved to coordinates: ({center_x}, {center_y})")\n\ndef get_monitor_info(monitor_index):\n """\n Gets information about the specified monitor.\n\n Args:\n monitor_index (int): The index of the monitor to get information about.\n\n Returns:\n dict: A dictionary containing the monitor's information.\n """\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f"Monitor index {monitor_index} is out of range. 
There are only {len(monitors)} monitors.")\n\n return monitors[monitor_index]\n\ndef history_modifier(history):\n """\n Modifies the chat history before the text generation in chat mode begins.\n """\n # Clean the internal history\n history["internal"] = clean_chat_history(history["internal"])\n # Clean the visible history\n history["visible"] = clean_chat_history(history["visible"])\n\n # Inject the latest compressedresults.json into the internal history\n compressed_results_json_path = "extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json"\n with open(compressed_results_json_path, 'r') as f:\n compressed_results = json.load(f)\n latest_compressed_results = json.dumps(compressed_results, indent=4)\n history["internal"].append(["", f"<START_COMPRESSED_RESULTS>\\\\\\\\n{latest_compressed_results}\\\\\\\\n<END_COMPRESSED_RESULTS>"])\n\n return history\n\ndef clean_chat_history(history):\n """\n Cleans the chat history by removing all but the latest instance of the contents between the bookends.\n """\n START_BOOKEND = "<START_COMPRESSED_RESULTS>"\n END_BOOKEND = "<END_COMPRESSED_RESULTS>"\n\n # Find all instances of the bookends\n bookend_indices = []\n for i, entry in enumerate(history):\n if isinstance(entry, list) and len(entry) == 2:\n if START_BOOKEND in entry[1] and END_BOOKEND in entry[1]:\n bookend_indices.append(i)\n\n # If there are multiple instances, remove all but the last one\n if len(bookend_indices) > 1:\n for i in bookend_indices[:-1]:\n history[i][1] = re.sub(f"{START_BOOKEND}.*?{END_BOOKEND}", "", history[i][1], flags=re.DOTALL)\n\n return history\n\ndef append_compressed_results(output, compressed_results):\n """\n Appends the compressed results to the output.\n """\n START_BOOKEND = "<START_COMPRESSED_RESULTS>"\n END_BOOKEND = "<END_COMPRESSED_RESULTS>"\n output += f"\\\\\\\\n\\\\\\\\n{START_BOOKEND}\\\\\\\\n{json.dumps(compressed_results, indent=4)}\\\\\\\\n{END_BOOKEND}"\n return output\n\ndef text_input(text):\n """\n Types the given text, replacing '/n' with newline characters.\n """\n text = text.replace("/n", "\\\\n")\n pyautogui.typewrite(text)\n\ndef special_key(key):\n """\n Presses the specified special key.\n """\n if key == "enter":\n pyautogui.press("enter")\n elif key == "space":\n pyautogui.press("space")\n elif key == "backspace":\n pyautogui.press("backspace")\n elif key == "tab":\n pyautogui.press("tab")\n elif key == "escape":\n pyautogui.press("esc")\n elif key == "shift":\n pyautogui.press("shift")\n elif key == "ctrl":\n pyautogui.press("ctrl")\n elif key == "alt":\n pyautogui.press("alt")\n else:\n print(f"Unsupported special key: {key}")\n\ndef function_key(key):\n """\n Presses the specified function key.\n """\n if key.startswith("f") and key[1:].isdigit():\n pyautogui.press(key)\n else:\n print(f"Unsupported function key: {key}")\n\ndef key_combo(combo):\n """\n Presses the specified key combination.\n """\n keys = combo.split("+")\n for key in keys:\n if key in ["ctrl", "alt", "shift"]:\n pyautogui.keyDown(key)\n else:\n pyautogui.press(key)\n for key in keys:\n if key in ["ctrl", "alt", "shift"]:\n pyautogui.keyUp(key)\n\nif __name__ == "__main__":\n # Launch the Gradio interface with the specified UI components\n gr.Interface(\n fn=ui,\n inputs=None,\n outputs="ui"\n ).launch()",
"Thank you for providing the detailed context and the current state of the code. The warning and error you're encountering are due to the `take_screenshot` function expecting an additional argument (`full_image_vision_model_question`) that is not being passed from the Gradio UI. Additionally, the `minicpm_llama_model` is not being loaded correctly, leading to the `AttributeError`.\n\nLet's address these issues step by step:\n\n### Step 1: Update the Gradio UI to Pass the Additional Argument\n\nWe need to update the Gradio UI to pass the `full_image_vision_model_question` argument to the `take_screenshot` function.\n\n```python\ndef ui():\n """\n Creates custom gradio elements when the UI is launched.\n """\n with gr.Blocks() as demo:\n with gr.Row():\n with gr.Column():\n monitor_index = gr.Textbox(label="Monitor Index", value=str(MONITOR_INDEX))\n text_queries = gr.Textbox(label="Text Queries", value=TEXT_QUERIES)\n score_threshold = gr.Textbox(label="Score Threshold", value=str(SCORE_THRESHOLD))\n vision_model_question = gr.Textbox(label="Vision Model Question", value=VISION_MODEL_QUESTION)\n full_image_vision_model_question = gr.Textbox(label="Full Image Vision Model Question", value=FULL_IMAGE_VISION_MODEL_QUESTION)\n take_screenshot_button = gr.Button(value="Take Screenshot")\n\n take_screenshot_button.click(\n fn=take_screenshot,\n inputs=[monitor_index, text_queries, score_threshold, vision_model_question, full_image_vision_model_question],\n outputs=None\n )\n\n return demo\n```\n\n### Step 2: Ensure the `minicpm_llama_model` is Loaded Correctly\n\nWe need to ensure that the `minicpm_llama_model` is loaded correctly before processing the full image.\n\n```python\ndef load_minicpm_llama_model():\n global minicpm_llama_model, minicpm_llama_tokenizer\n if "int4" in VISION_MODEL_ID:\n # Load the 4-bit quantized model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={"": device},\n trust_remote_code=True,\n torch_dtype=torch.float16 # Use float16 as per the example code for 4-bit models\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print("MiniCPM-Llama3 4-bit quantized model loaded on-demand.")\n else:\n # Load the standard model and tokenizer\n minicpm_llama_model = AutoModel.from_pretrained(\n VISION_MODEL_ID,\n device_map={"": device}, # Use the specified CUDA device\n trust_remote_code=True,\n torch_dtype=torch.bfloat16\n ).eval()\n minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n VISION_MODEL_ID,\n trust_remote_code=True\n )\n print("MiniCPM-Llama3 standard model loaded on-demand.")\n```\n\n### Step 3: Update the `take_screenshot` Function\n\nWe need to ensure that the `minicpm_llama_model` is loaded before processing the full image.\n\n```python\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, full_image_vision_model_question, output_filename="screenshot.png"):\n """\n Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n Args:\n monitor_index (int): The index of the monitor to capture.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n vision_model_question (str): The question to ask the vision model for cropped images.\n full_image_vision_model_question (str): The question to ask the vision model for the full image.\n output_filename (str): The filename to save the screenshot as 
(default is "screenshot.png").\n """\n monitor_index = int(monitor_index)\n score_threshold = float(score_threshold)\n\n # Get information about all monitors\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.")\n\n # Get the region of the specified monitor\n monitor = monitors[monitor_index]\n region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n # Take the screenshot\n output_dir = "extensions/Lucid_Autonomy"\n os.makedirs(output_dir, exist_ok=True)\n screenshot_path = os.path.join(output_dir, output_filename)\n screenshot = pyautogui.screenshot(region=region)\n screenshot.save(screenshot_path)\n print(f"Screenshot saved as {screenshot_path}")\n\n # Process the image using the Owlv2 model\n annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n # Load the miniCPM-lama3-v-2_5 model\n load_minicpm_llama_model()\n\n # Process the entire screenshot with the minicpm model\n full_image_response = process_with_vision_model(screenshot_path, full_image_vision_model_question)\n\n # Create the output directory if it doesn't exist\n output_dir = "extensions/Lucid_Autonomy/ImageOutputTest"\n os.makedirs(output_dir, exist_ok=True)\n\n # Clear the output directory\n for filename in os.listdir(output_dir):\n file_path = os.path.join(output_dir, filename)\n try:\n if os.path.isfile(file_path) or os.path.islink(file_path):\n os.unlink(file_path)\n elif os.path.isdir(file_path):\n shutil.rmtree(file_path)\n except Exception as e:\n print(f"Failed to delete {file_path}. Reason: {e}")\n\n # Crop the detected objects and save them to the output directory\n crop_images(annotated_image_path, json_output_path, output_dir, full_image_response)\n\n print("Processing complete. 
Check the output directory for the cropped images and results.")\n```\n\n### Step 4: Update the `crop_images` Function\n\nWe need to ensure that the `crop_images` function includes the response from the full image in the results.\n\n```python\ndef crop_images(image_path, json_output_path, output_dir="extensions/Lucid_Autonomy/ImageOutputTest", full_image_response=""):\n """\n Crops out the detected objects from the input image based on the bounding box coordinates.\n\n Args:\n image_path (str): The path to the input image.\n json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n output_dir (str): The directory to save the cropped images (default is "extensions/Lucid_Autonomy/ImageOutputTest").\n full_image_response (str): The response from the vision model for the full image.\n """\n with open(json_output_path, 'r') as f:\n data = json.load(f)\n\n img = Image.open(image_path)\n\n results = []\n\n # Load the miniCPM-lama3-v-2_5 model once before processing the images\n load_minicpm_llama_model()\n\n for i, item in enumerate(data):\n box = item[0]\n label = item[1]\n cropped_img = img.crop(box)\n cropped_img_path = f"{output_dir}/{label}_{i}.png"\n cropped_img.save(cropped_img_path)\n\n # Process the cropped image with the miniCPM-lama3-v-2_5 model\n vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n # Store the results in a structured format\n results.append({\n "image_name": f"{label}_{i}.png",\n "coordinates": box,\n "description": vision_model_response\n })\n\n # Unload the miniCPM-lama3-v-2_5 model after processing all images\n unload_minicpm_llama_model()\n\n # Save the results to a JSON file\n results_json_path = f"{output_dir}/results.json"\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n\n # Create the compressed results JSON file\n compressed_results_json_path = f"{output_dir}/compressedresults.json"\n compressed_results = [{"image_name": result["image_name"], "description": result["description"]} for result in results]\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n\n # Add the full image response to the results\n if full_image_response:\n results.append({\n "image_name": "full_image.png",\n "coordinates": [],\n "description": full_image_response\n })\n compressed_results.append({\n "image_name": "full_image.png",\n "description": full_image_response\n })\n\n # Save the updated results to the JSON files\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n```\n\n### Putting it all together\n\nHere are the changes you need to make to the main code:\n\n1. **Update the Gradio UI to pass the additional argument**.\n2. **Ensure the `minicpm_llama_model` is loaded correctly**.\n3. **Update the `take_screenshot` function**.\n4. **Update the `crop_images` function**.\n\nYou can integrate these changes into your main code as needed. If you have any questions or need further assistance, feel free to ask!"
],
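The history-cleaning helpers quoted in the script above are easy to sanity-check in isolation. The following is a minimal, standalone sketch of the same bookend-cleanup idea, using toy history entries (the entries and image names here are made up for illustration, not taken from the extension):

```python
import json
import re

START_BOOKEND = "<START_COMPRESSED_RESULTS>"
END_BOOKEND = "<END_COMPRESSED_RESULTS>"

def clean_chat_history(history):
    # Find every entry that carries a bookended results block.
    bookend_indices = [
        i for i, entry in enumerate(history)
        if isinstance(entry, list) and len(entry) == 2
        and START_BOOKEND in entry[1] and END_BOOKEND in entry[1]
    ]
    # Strip the block from every entry except the most recent one.
    for i in bookend_indices[:-1]:
        history[i][1] = re.sub(
            f"{START_BOOKEND}.*?{END_BOOKEND}", "", history[i][1], flags=re.DOTALL
        )
    return history

# Toy history with two injected results blocks; only the newer one should survive.
old_block = json.dumps([{"image_name": "old.png", "description": "stale"}])
new_block = json.dumps([{"image_name": "new.png", "description": "fresh"}])
history = [
    ["first turn", f"{START_BOOKEND}\n{old_block}\n{END_BOOKEND}"],
    ["second turn", f"{START_BOOKEND}\n{new_block}\n{END_BOOKEND}"],
]
clean_chat_history(history)
print(repr(history[0][1]))  # block removed (now an empty string)
print(repr(history[1][1]))  # latest block kept intact
```

As a side note on the same message, the key_combo helper could probably be collapsed to pyautogui.hotkey(*combo.split("+")), since hotkey already presses the keys in order and releases them in reverse.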
[
"Okay we are close, I didn't get the warning, and the minicpm model did load. However, for some reason the model loaded twice, I think you forgot to unload the model.\n\n20:04:42-116365 INFO Starting Text generation web UI \n20:04:42-119123 WARNING trust_remote_code is enabled. This is dangerous. \n20:04:42-119715 INFO Loading settings from "settings.yaml" \n20:04:42-121943 INFO Loading the extension "Lucid_Autonomy" \n\nRunning on local URL: http://127.0.0.1:7861\n\nScreenshot saved as extensions/Lucid_Autonomy/screenshot.png\nOwlv2 model loaded.\nOwlv2 model unloaded.\n/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:577: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.\n warnings.warn(\nLoading checkpoint shards: 100%|██████████████████| 7/7 [01:32<00:00, 13.20s/it]\nMiniCPM-Llama3 standard model loaded on-demand.\nLoading checkpoint shards: 29%|█████▏ | 2/7 [00:01<00:03, 1.30it/s]\nTraceback (most recent call last):\n File "/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/installer_files/env/lib/python3.11/site-packages/gradio/queueing.py", line 527, in process_events\n",
"Thank you for the detailed feedback. It looks like the `minicpm_llama_model` is being loaded twice because it's not being unloaded after processing the full image. Let's ensure that the model is unloaded correctly after each use.\n\n### Step 1: Update the `take_screenshot` Function\n\nWe need to ensure that the `minicpm_llama_model` is unloaded after processing the full image.\n\n```python\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, full_image_vision_model_question, output_filename="screenshot.png"):\n """\n Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n Args:\n monitor_index (int): The index of the monitor to capture.\n text_queries (str): Comma-separated text queries for object detection.\n score_threshold (float): The score threshold for filtering low-probability predictions.\n vision_model_question (str): The question to ask the vision model for cropped images.\n full_image_vision_model_question (str): The question to ask the vision model for the full image.\n output_filename (str): The filename to save the screenshot as (default is "screenshot.png").\n """\n monitor_index = int(monitor_index)\n score_threshold = float(score_threshold)\n\n # Get information about all monitors\n monitors = screeninfo.get_monitors()\n\n if monitor_index >= len(monitors):\n raise ValueError(f"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.")\n\n # Get the region of the specified monitor\n monitor = monitors[monitor_index]\n region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n # Take the screenshot\n output_dir = "extensions/Lucid_Autonomy"\n os.makedirs(output_dir, exist_ok=True)\n screenshot_path = os.path.join(output_dir, output_filename)\n screenshot = pyautogui.screenshot(region=region)\n screenshot.save(screenshot_path)\n print(f"Screenshot saved as {screenshot_path}")\n\n # Process the image using the Owlv2 model\n annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n # Load the miniCPM-lama3-v-2_5 model\n load_minicpm_llama_model()\n\n # Process the entire screenshot with the minicpm model\n full_image_response = process_with_vision_model(screenshot_path, full_image_vision_model_question)\n\n # Unload the miniCPM-lama3-v-2_5 model after processing the full image\n unload_minicpm_llama_model()\n\n # Create the output directory if it doesn't exist\n output_dir = "extensions/Lucid_Autonomy/ImageOutputTest"\n os.makedirs(output_dir, exist_ok=True)\n\n # Clear the output directory\n for filename in os.listdir(output_dir):\n file_path = os.path.join(output_dir, filename)\n try:\n if os.path.isfile(file_path) or os.path.islink(file_path):\n os.unlink(file_path)\n elif os.path.isdir(file_path):\n shutil.rmtree(file_path)\n except Exception as e:\n print(f"Failed to delete {file_path}. Reason: {e}")\n\n # Crop the detected objects and save them to the output directory\n crop_images(annotated_image_path, json_output_path, output_dir, full_image_response)\n\n print("Processing complete. 
Check the output directory for the cropped images and results.")\n```\n\n### Step 2: Update the `crop_images` Function\n\nWe need to ensure that the `minicpm_llama_model` is unloaded after processing all images.\n\n```python\ndef crop_images(image_path, json_output_path, output_dir="extensions/Lucid_Autonomy/ImageOutputTest", full_image_response=""):\n """\n Crops out the detected objects from the input image based on the bounding box coordinates.\n\n Args:\n image_path (str): The path to the input image.\n json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n output_dir (str): The directory to save the cropped images (default is "extensions/Lucid_Autonomy/ImageOutputTest").\n full_image_response (str): The response from the vision model for the full image.\n """\n with open(json_output_path, 'r') as f:\n data = json.load(f)\n\n img = Image.open(image_path)\n\n results = []\n\n # Load the miniCPM-lama3-v-2_5 model once before processing the images\n load_minicpm_llama_model()\n\n for i, item in enumerate(data):\n box = item[0]\n label = item[1]\n cropped_img = img.crop(box)\n cropped_img_path = f"{output_dir}/{label}_{i}.png"\n cropped_img.save(cropped_img_path)\n\n # Process the cropped image with the miniCPM-lama3-v-2_5 model\n vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n # Store the results in a structured format\n results.append({\n "image_name": f"{label}_{i}.png",\n "coordinates": box,\n "description": vision_model_response\n })\n\n # Unload the miniCPM-lama3-v-2_5 model after processing all images\n unload_minicpm_llama_model()\n\n # Save the results to a JSON file\n results_json_path = f"{output_dir}/results.json"\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n\n # Create the compressed results JSON file\n compressed_results_json_path = f"{output_dir}/compressedresults.json"\n compressed_results = [{"image_name": result["image_name"], "description": result["description"]} for result in results]\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n\n # Add the full image response to the results\n if full_image_response:\n results.append({\n "image_name": "full_image.png",\n "coordinates": [],\n "description": full_image_response\n })\n compressed_results.append({\n "image_name": "full_image.png",\n "description": full_image_response\n })\n\n # Save the updated results to the JSON files\n with open(results_json_path, 'w') as f:\n json.dump(results, f, indent=4)\n with open(compressed_results_json_path, 'w') as f:\n json.dump(compressed_results, f, indent=4)\n```\n\n### Putting it all together\n\nHere are the changes you need to make to the main code:\n\n1. **Update the Gradio UI to pass the additional argument**.\n2. **Ensure the `minicpm_llama_model` is unloaded correctly**.\n3. **Update the `take_screenshot` function**.\n4. **Update the `crop_images` function**.\n\nYou can integrate these changes into your main code as needed. If you have any questions or need further assistance, feel free to ask!"
],
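The fix above relies on unload_minicpm_llama_model(), whose body is not shown in this part of the conversation. One plausible shape for it, assuming the globals used elsewhere in the script, is the sketch below: drop the Python references and ask PyTorch to release the cached GPU memory, so that only one copy of the vision model is resident at a time.

```python
import gc

import torch

# Globals mirroring the extension's naming; defined here only so the sketch is
# self-contained. The real script sets them when the model is loaded.
minicpm_llama_model = None
minicpm_llama_tokenizer = None

def unload_minicpm_llama_model():
    """Release the MiniCPM-Llama3 model and reclaim GPU memory."""
    global minicpm_llama_model, minicpm_llama_tokenizer
    if minicpm_llama_model is not None:
        minicpm_llama_model = None
        minicpm_llama_tokenizer = None
        gc.collect()                      # drop the last Python references
        if torch.cuda.is_available():
            torch.cuda.empty_cache()      # return cached VRAM to the driver
        print("MiniCPM-Llama3 model unloaded.")
```

Note that even with this unload call, crop_images loads the model again for the per-crop questions, so the checkpoint shards are still read twice per screenshot, just sequentially rather than at the same time.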
[
"Okay I made the changes and I can successfully press the take screenshot button manually with everything working, but when the LLM tries to take the screenshot I get this error:\n\n File "/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/extensions/Lucid_Autonomy/script.py", line 371, in output_modifier\n execute_tasks(tasks_section)\n File "/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/extensions/Lucid_Autonomy/script.py", line 430, in execute_tasks\n take_screenshot_task()\n File "/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/extensions/Lucid_Autonomy/script.py", line 520, in take_screenshot_task\n take_screenshot(MONITOR_INDEX, TEXT_QUERIES, SCORE_THRESHOLD, VISION_MODEL_QUESTION)\nTypeError: take_screenshot() missing 1 required positional argument: 'full_image_vision_model_question'\n\n\nLLM response:\n\nAutonomy_Tasks:\nTakeScreenshot",
"The error you're encountering is because the `take_screenshot` function now requires an additional argument (`full_image_vision_model_question`), but the `take_screenshot_task` function is not passing this argument.\n\nLet's update the `take_screenshot_task` function to include the `full_image_vision_model_question` argument.\n\n### Step 1: Update the `take_screenshot_task` Function\n\nWe need to update the `take_screenshot_task` function to pass the `full_image_vision_model_question` argument.\n\n```python\ndef take_screenshot_task():\n """\n Takes a screenshot.\n """\n take_screenshot(MONITOR_INDEX, TEXT_QUERIES, SCORE_THRESHOLD, VISION_MODEL_QUESTION, FULL_IMAGE_VISION_MODEL_QUESTION)\n```\n\n### Step 2: Update the `execute_tasks` Function\n\nWe need to ensure that the `execute_tasks` function correctly handles the `TakeScreenshot` task.\n\n```python\ndef execute_tasks(tasks_section):\n """\n Executes the tasks listed in the tasks_section.\n """\n global oob_tasks\n oob_tasks = []\n tasks = tasks_section.split("\\n")\n for task in tasks:\n task = task.strip()\n if task.startswith("MouseMover ="):\n image_name = task.split("=")[1].strip()\n move_mouse_to_image(image_name)\n elif task.startswith("MouseClick ="):\n value = task.split("=")[1].strip()\n if "," in value: # Check if the value contains coordinates\n coordinates = value\n left_click_coordinates_raw(coordinates)\n else: # Otherwise, treat it as an image name\n image_name = value\n left_click_image(image_name)\n elif task.startswith("TextInput ="):\n text = task.split("=")[1].strip()\n text_input(text)\n elif task.startswith("SpecialKey ="):\n key = task.split("=")[1].strip()\n special_key(key)\n elif task.startswith("FunctionKey ="):\n key = task.split("=")[1].strip()\n function_key(key)\n elif task.startswith("KeyCombo ="):\n combo = task.split("=")[1].strip()\n key_combo(combo)\n elif task.startswith("Delay ="):\n delay = int(task.split("=")[1].strip())\n delay_task(delay)\n elif task.startswith("MouseMove ="):\n coordinates = task.split("=")[1].strip()\n move_mouse_to_coordinates_raw(coordinates)\n elif task == "TakeScreenshot":\n take_screenshot_task()\n elif task.startswith("OOB_"):\n oob_tasks.append(task)\n else:\n print(f"Unknown task: {task}")\n```\n\n### Putting it all together\n\nHere are the changes you need to make to the main code:\n\n1. **Update the `take_screenshot_task` function to pass the additional argument**.\n2. **Update the `execute_tasks` function to correctly handle the `TakeScreenshot` task**.\n\nYou can integrate these changes into your main code as needed. If you have any questions or need further assistance, feel free to ask!"
],
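As the task list keeps growing, the if/elif chain in execute_tasks becomes the one place that must be edited for every new feature. A table-driven variant is one option; the sketch below uses print stubs in place of the real handlers (move_mouse_to_image, left_click_image, text_input, and so on) so the routing logic can be run on its own. It is an alternative layout, not the extension's actual code.

```python
def stub(name):
    # Stand-in for the real handler functions so the example runs by itself.
    return lambda value=None: print(f"{name}({value!r})")

TASK_HANDLERS = {
    "MouseMover": stub("move_mouse_to_image"),
    "MouseClick": lambda v: (stub("left_click_coordinates_raw")(v) if "," in v
                             else stub("left_click_image")(v)),
    "TextInput": stub("text_input"),
    "SpecialKey": stub("special_key"),
    "FunctionKey": stub("function_key"),
    "KeyCombo": stub("key_combo"),
    "Delay": lambda v: stub("delay_task")(int(v)),
    "MouseMove": stub("move_mouse_to_coordinates_raw"),
}

oob_tasks = []

def execute_tasks(tasks_section):
    for task in (t.strip() for t in tasks_section.split("\n")):
        if not task:
            continue
        if task == "TakeScreenshot":
            stub("take_screenshot_task")()
        elif task.startswith("OOB_"):
            oob_tasks.append(task)
        elif "=" in task:
            name, value = (p.strip() for p in task.split("=", 1))
            handler = TASK_HANDLERS.get(name)
            handler(value) if handler else print(f"Unknown task: {task}")
        else:
            print(f"Unknown task: {task}")

execute_tasks("MouseClick = clickable icons_8.png\nDelay = 3\nTakeScreenshot")
```

With this layout, a new task type becomes a one-line addition to TASK_HANDLERS instead of another elif branch.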
[
"Okay great!! Really good job, can you write me a summary like the one I gave you at the beginning of our conversation so I can start a new conversation with you and add more features? Go over how the code works in fine detail, what it does, it's purpose, textgen, and anything else you thing is pertinent. ",
"Certainly! Here's a detailed summary of the current state of the Lucid_Autonomy extension, including its purpose, functionality, and how it works:\n\n---\n\n### Summary of Lucid_Autonomy Extension\n\n#### Introduction to Textgen\n\nTextgen is a powerful web-based interface for running large language models (LLMs) locally. It allows users to interact with various LLMs, providing a flexible and customizable environment for text generation tasks. The platform supports a wide range of models and offers numerous extensions to enhance its functionality.\n\n#### Lucid_Autonomy Extension Overview\n\nThe Lucid_Autonomy extension is designed to enable an LLM to interact autonomously with a computer using mouse and keyboard actions. This extension is part of the textgen ecosystem and aims to provide the LLM with "eyes" and "hands" to perform tasks such as moving the mouse to specific locations on the screen, clicking on elements, entering text, and more.\n\n#### Current Operation of the Extension\n\n1. **Initial Setup and Image Processing**:\n - The extension takes a screenshot of a specified monitor in a multi-monitor setup.\n - The screenshot is processed using the Owlv2 model to detect objects based on text queries.\n - The detected objects are cropped and saved to the `ImageOutputTest` directory.\n - The cropped images are processed with the MiniCPM-Llama3-V-2_5 model to generate descriptions.\n - The results are saved in a `results.json` file, containing the image name, coordinates, and description.\n - A `compressedresults.json` file is created, omitting the coordinates and containing only the image name and description.\n\n2. **LLM Interaction**:\n - The LLM can interact with the extension using the trigger phrase "Autonomy_Tasks:".\n - If the LLM sends the trigger phrase without any instructions, the extension appends the contents of the `compressedresults.json` file to the LLM's response.\n - If the LLM sends the trigger phrase with instructions (e.g., `MouseMover = clickable icons_8.png`), the extension executes the specified tasks and then appends the `compressedresults.json` contents to the LLM's response.\n\n3. **Task Execution**:\n - The extension can handle various tasks such as moving the mouse to specific locations, clicking on elements, entering text, pressing special keys, function keys, key combinations, adding delays, and taking screenshots.\n - The extension also supports "Out of Body" (OOB) commands, which are executed after the LLM's response is complete. These commands are saved to a JSON file and executed with a short delay.\n\n4. **Full Image Processing**:\n - The extension now processes the entire screenshot with the MiniCPM-Llama3-V-2_5 model using a separate question (`FULL_IMAGE_VISION_MODEL_QUESTION`).\n - The response from the full image processing is added to the `results.json` and `compressedresults.json` files.\n\n#### Ultimate Goal of the Extension\n\nThe ultimate goal of the Lucid_Autonomy extension is to enable a large language model (LLM) to interact autonomously with a computer. The LLM can perform tasks such as moving the mouse to specific locations on the screen, clicking on elements, entering text, and more. The extension allows the LLM to operate a computer autonomously, effectively giving the LLM "eyes" and "hands" to interact with the user interface.\n\n#### Future Enhancements\n\n1. **Additional Mouse and Keyboard Interactions**:\n - Implement more mouse actions and enable keyboard key entry using any combination of keys the LLM wants to use.\n\n2. 
**Take Screenshot Tool**:\n - Add a command for the LLM to use, similar to how the user can take a screenshot. This will allow the LLM to invoke a "Take Screenshot" command that it can add to the end of its task list.\n\n3. **Enhanced Task Execution**:\n - Add more complex tasks that the LLM can execute, such as filling out forms, navigating websites, and interacting with applications.\n\n4. **Improved LLM Integration**:\n - Enhance the integration between the LLM and the extension to support more sophisticated interactions and contextual understanding.\n\n5. **User Interface Enhancements**:\n - Improve the Gradio UI to provide more options and better visualization of the extension's functionality.\n\n6. **Error Handling and Robustness**:\n - Add error handling and robustness features to ensure the extension operates smoothly and handles edge cases gracefully.\n\n#### Conclusion\n\nThe Lucid_Autonomy extension is a powerful tool that enables an LLM to interact autonomously with a computer. The current implementation allows the LLM to take screenshots, process images, and move the mouse to specific locations. The ultimate goal is to expand the extension's capabilities to support a wide range of mouse and keyboard interactions, enabling the LLM to perform complex tasks autonomously.\n\n---\n\nThis summary provides an overview of the current state of the Lucid_Autonomy extension, its purpose, functionality, and future enhancements. You can use this summary to start a new conversation and add more features as needed. If you have any questions or need further assistance, feel free to ask!"
]
]
}
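For reference, the "Autonomy_Tasks:" round trip described in the summary above can be reduced to a few lines. The sketch below is a simplified, self-contained stand-in (file handling, history injection, and OOB task execution are omitted), not the extension's actual output_modifier:

```python
import json

TRIGGER = "Autonomy_Tasks:"
START_BOOKEND = "<START_COMPRESSED_RESULTS>"
END_BOOKEND = "<END_COMPRESSED_RESULTS>"

def output_modifier(output, compressed_results, run_tasks=print):
    """If the LLM used the trigger phrase, run its task lines and append the
    compressed results between the bookends so the next turn has context."""
    if TRIGGER not in output:
        return output
    _, tasks_section = output.split(TRIGGER, 1)
    if tasks_section.strip():
        run_tasks(tasks_section.strip())   # execute_tasks(...) in the real script
    payload = json.dumps(compressed_results, indent=4)
    return f"{output}\n\n{START_BOOKEND}\n{payload}\n{END_BOOKEND}"

# Example: a response containing one task and a single detected icon.
demo = output_modifier(
    "Autonomy_Tasks:\nMouseMover = clickable icons_8.png",
    [{"image_name": "clickable icons_8.png", "description": "a blue settings gear icon"}],
)
print(demo)
```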