Receipts/ImageIDUpgradesV1 (3rd copy).json

{
    "internal": [
        [
            "<|BEGIN-VISIBLE-CHAT|>",
            "How can I help you today?"
        ],
        [
            "Hello, how are you functioning today?",
            "I'm functioning well, thank you for asking! I'm here and ready to assist you. How about you? How can I help you today?"
        ],
        [
            "Okay you and I have been working on a project in previous chats.  Essentially you and I are interacting via a program called textgeneration webui (textgen).  This allows me to run models locally on my computer.  The project we have been working on is an extension for textgen that lets an LLM essentially operate a mouse and keyboard on its own.  The extension is called Lucid_Autonomy, I've had you written a summary of the current code and I will provide that to you.  I will then provide you with the code itself and supporting documents.",
            "Great, I'm ready to review the summary and the code for the Lucid_Autonomy extension. Please go ahead and provide the summary first, and then we can proceed with the code and supporting documents."
        ],
        [
            "Here is the summary from the previous conversation:\n\nLet's summarize what we have accomplished thus far, the current operation of the `Lucid_Autonomy` extension, and the ultimate goal of the extension.\n\n### Summary of Accomplishments\n\n1. **Initial Setup and Image Processing**:\n   - The extension takes a screenshot of a specified monitor.\n   - The screenshot is processed using the Owlv2 model to detect objects based on text queries.\n   - The detected objects are cropped and saved to the `ImageOutputTest` directory.\n   - The cropped images are processed using the MiniCPM-Llama3-V-2_5 model to generate descriptions.\n   - The results are saved in a `results.json` file, which includes the image name, coordinates, and description.\n   - A `compressedresults.json` file is created, omitting the coordinates and containing only the image name and description.\n\n2. **LLM Interaction**:\n   - The LLM can interact with the extension using the trigger phrase \"Autonomy_Tasks:\".\n   - If the LLM sends the trigger phrase without any instructions, the extension appends the contents of the `compressedresults.json` file to the LLM's response.\n   - If the LLM sends the trigger phrase with instructions (e.g., `MouseMover = clickable icons_8.png`), the extension executes the specified tasks and then appends the `compressedresults.json` contents to the LLM's response.\n\n3. **Mouse Movement**:\n   - The `MouseMover` functionality is integrated into the extension.\n   - The extension moves the mouse to the center of the box that defines the specified image based on the coordinates in the `results.json` file.\n   - The mouse movement is adjusted based on the monitor's position.\n\n### Current Operation of the Extension\n\n1. **User Interaction**:\n   - The user can take a screenshot using the \"Take Screenshot\" button in the Gradio UI.\n   - The screenshot is processed, and the results are saved in the `ImageOutputTest` directory.\n\n2. **LLM Interaction**:\n   - The LLM can send the trigger phrase \"Autonomy_Tasks:\" to interact with the extension.\n   - The extension processes the tasks listed below the trigger phrase.\n   - The results of the tasks are appended to the LLM's response.\n\n3. **Task Execution**:\n   - The extension can execute tasks such as moving the mouse to specific locations based on the image names and coordinates.\n   - The `compressedresults.json` file is appended to the LLM's response to provide context about the images.\n\n### Ultimate Goal of the Extension\n\nThe ultimate goal of the `Lucid_Autonomy` extension is to enable a large language model (LLM) to interact autonomously with a computer using mouse and keyboard actions. The LLM can perform tasks such as moving the mouse to specific locations on the screen, clicking on elements, and entering text. The extension allows the LLM to operate a computer autonomously, effectively giving the LLM \"eyes\" and \"hands\" to interact with the user interface.\n\n### Future Enhancements\n\n1. **Additional Mouse and Keyboard Interactions**:\n   - Implement more mouse and keyboard interactions, such as clicking, typing, and navigating the user interface.\n\n2. **Enhanced Task Execution**:\n   - Add more complex tasks that the LLM can execute, such as filling out forms, navigating websites, and interacting with applications.\n\n3. **Improved LLM Integration**:\n   - Enhance the integration between the LLM and the extension to support more sophisticated interactions and contextual understanding.\n\n4. **User Interface Enhancements**:\n   - Improve the Gradio UI to provide more options and better visualization of the extension's functionality.\n\n5. **Error Handling and Robustness**:\n   - Add error handling and robustness features to ensure the extension operates smoothly and handles edge cases gracefully.\n\n### Conclusion\n\nThe `Lucid_Autonomy` extension is a powerful tool that enables an LLM to interact autonomously with a computer. The current implementation allows the LLM to take screenshots, process images, and move the mouse to specific locations. The ultimate goal is to expand the extension's capabilities to support a wide range of mouse and keyboard interactions, enabling the LLM to perform complex tasks autonomously.\n\nPlease let me know if you have any further questions or if you're ready to start a new conversation to discuss additional enhancements and features for the extension.\n\n\nHere is the code for the Lucid_Autonomy extension:\n\nimport gradio as gr\nimport json\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport torch\nimport os\nimport shutil\n\n# Configurable variables\nMONITOR_INDEX = 1  # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nTEXT_QUERIES = \"clickable hyperlinks, clickable icons, text bar field\"  # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.20  # Score threshold for filtering low-probability predictions\nVISION_MODEL_QUESTION = \"What is the text in this image or what does this image describe?\"\n\n# Use GPU if available\nif torch.cuda.is_available():\n    device = torch.device(\"cuda\")\nelse:\n    device = torch.device(\"cpu\")\n\n# Global variables for models\nowlv2_model = None\nowlv2_processor = None\nminicpm_llama_model = None\nminicpm_llama_tokenizer = None\n\nVISION_MODEL_ID = \"/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5\"\n\ndef load_owlv2_model():\n    global owlv2_model, owlv2_processor\n    owlv2_model = Owlv2ForObjectDetection.from_pretrained(\"google/owlv2-base-patch16-ensemble\", trust_remote_code=True).to(device)\n    owlv2_processor = Owlv2Processor.from_pretrained(\"google/owlv2-base-patch16-ensemble\", trust_remote_code=True)\n    print(\"Owlv2 model loaded.\")\n\ndef unload_owlv2_model():\n    global owlv2_model, owlv2_processor\n    if owlv2_model is not None:\n        del owlv2_model\n        del owlv2_processor\n        torch.cuda.empty_cache()\n        print(\"Owlv2 model unloaded.\")\n\ndef load_minicpm_llama_model():\n    global minicpm_llama_model, minicpm_llama_tokenizer\n    if \"int4\" in VISION_MODEL_ID:\n        # Load the 4-bit quantized model and tokenizer\n        minicpm_llama_model = AutoModel.from_pretrained(\n            VISION_MODEL_ID,\n            device_map={\"\": device},\n            trust_remote_code=True,\n            torch_dtype=torch.float16  # Use float16 as per the example code for 4-bit models\n        ).eval()\n        minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n            VISION_MODEL_ID,\n            trust_remote_code=True\n        )\n        print(\"MiniCPM-Llama3 4-bit quantized model loaded on-demand.\")\n    else:\n        # Load the standard model and tokenizer\n        minicpm_llama_model = AutoModel.from_pretrained(\n            VISION_MODEL_ID,\n            device_map={\"\": device},  # Use the specified CUDA device\n            trust_remote_code=True,\n            torch_dtype=torch.bfloat16\n        ).eval()\n        minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n            VISION_MODEL_ID,\n            trust_remote_code=True\n        )\n        print(\"MiniCPM-Llama3 standard model loaded on-demand.\")\n\ndef unload_minicpm_llama_model():\n    global minicpm_llama_model, minicpm_llama_tokenizer\n    if minicpm_llama_model is not None:\n        del minicpm_llama_model\n        del minicpm_llama_tokenizer\n        torch.cuda.empty_cache()\n        print(\"MiniCPM-Llama3 model unloaded.\")\n\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, output_filename=\"screenshot.png\"):\n    \"\"\"\n    Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n    Args:\n        monitor_index (int): The index of the monitor to capture.\n        text_queries (str): Comma-separated text queries for object detection.\n        score_threshold (float): The score threshold for filtering low-probability predictions.\n        vision_model_question (str): The question to ask the vision model.\n        output_filename (str): The filename to save the screenshot as (default is \"screenshot.png\").\n    \"\"\"\n    monitor_index = int(monitor_index)\n    score_threshold = float(score_threshold)\n\n    # Get information about all monitors\n    monitors = screeninfo.get_monitors()\n\n    if monitor_index >= len(monitors):\n        raise ValueError(f\"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.\")\n\n    # Get the region of the specified monitor\n    monitor = monitors[monitor_index]\n    region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n    # Take the screenshot\n    output_dir = \"extensions/Lucid_Autonomy\"\n    os.makedirs(output_dir, exist_ok=True)\n    screenshot_path = os.path.join(output_dir, output_filename)\n    screenshot = pyautogui.screenshot(region=region)\n    screenshot.save(screenshot_path)\n    print(f\"Screenshot saved as {screenshot_path}\")\n\n    # Process the image using the Owlv2 model\n    annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n    # Create the output directory if it doesn't exist\n    output_dir = \"extensions/Lucid_Autonomy/ImageOutputTest\"\n    os.makedirs(output_dir, exist_ok=True)\n\n    # Clear the output directory\n    for filename in os.listdir(output_dir):\n        file_path = os.path.join(output_dir, filename)\n        try:\n            if os.path.isfile(file_path) or os.path.islink(file_path):\n                os.unlink(file_path)\n            elif os.path.isdir(file_path):\n                shutil.rmtree(file_path)\n        except Exception as e:\n            print(f\"Failed to delete {file_path}. Reason: {e}\")\n\n    # Crop the detected objects and save them to the output directory\n    crop_images(annotated_image_path, json_output_path, output_dir)\n\n    print(\"Processing complete. Check the output directory for the cropped images and results.\")\n\ndef query_image(img_path, text_queries, score_threshold):\n    \"\"\"\n    Processes the image using the Owlv2 model to detect objects based on text queries.\n\n    Args:\n        img_path (str): The path to the input image.\n        text_queries (str): Comma-separated text queries for object detection.\n        score_threshold (float): The score threshold for filtering low-probability predictions.\n    \"\"\"\n    load_owlv2_model()\n\n    img = Image.open(img_path)\n    text_queries = text_queries.split(\",\")\n\n    size = max(img.size)\n    target_sizes = torch.Tensor([[size, size]])\n    inputs = owlv2_processor(text=text_queries, images=img, return_tensors=\"pt\").to(device)\n\n    with torch.no_grad():\n        outputs = owlv2_model(**inputs)\n\n    outputs.logits = outputs.logits.cpu()\n    outputs.pred_boxes = outputs.pred_boxes.cpu()\n    results = owlv2_processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n    boxes, scores, labels = results[0][\"boxes\"], results[0][\"scores\"], results[0][\"labels\"]\n\n    result_labels = []\n    img_pil = img.copy()\n    draw = ImageDraw.Draw(img_pil)\n\n    for box, score, label in zip(boxes, scores, labels):\n        box = [int(i) for i in box.tolist()]\n        if score < score_threshold:\n            continue\n        result_labels.append([box, text_queries[label.item()]])\n        draw.rectangle(box, outline=\"red\", width=2)\n        # Do not draw the text label within the bounding box\n        # draw.text((box[0], box[1]), text_queries[label.item()], fill=\"red\")\n\n    # Save the annotated image for debugging\n    annotated_image_path = \"extensions/Lucid_Autonomy/annotated_image.png\"\n    img_pil.save(annotated_image_path)\n\n    # Save the JSON output to a file\n    json_output_path = \"extensions/Lucid_Autonomy/output.json\"\n    with open(json_output_path, 'w') as f:\n        json.dump(result_labels, f)\n\n    unload_owlv2_model()\n\n    return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir=\"extensions/Lucid_Autonomy/ImageOutputTest\"):\n    \"\"\"\n    Crops out the detected objects from the input image based on the bounding box coordinates.\n\n    Args:\n        image_path (str): The path to the input image.\n        json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n        output_dir (str): The directory to save the cropped images (default is \"extensions/Lucid_Autonomy/ImageOutputTest\").\n    \"\"\"\n    with open(json_output_path, 'r') as f:\n        data = json.load(f)\n\n    img = Image.open(image_path)\n\n    results = []\n\n    # Load the miniCPM-lama3-v-2_5 model once before processing the images\n    load_minicpm_llama_model()\n\n    for i, item in enumerate(data):\n        box = item[0]\n        label = item[1]\n        cropped_img = img.crop(box)\n        cropped_img_path = f\"{output_dir}/{label}_{i}.png\"\n        cropped_img.save(cropped_img_path)\n\n        # Process the cropped image with the miniCPM-lama3-v-2_5 model\n        vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n        # Store the results in a structured format\n        results.append({\n            \"image_name\": f\"{label}_{i}.png\",\n            \"coordinates\": box,\n            \"description\": vision_model_response\n        })\n\n    # Unload the miniCPM-lama3-v-2_5 model after processing all images\n    unload_minicpm_llama_model()\n\n    # Save the results to a JSON file\n    results_json_path = f\"{output_dir}/results.json\"\n    with open(results_json_path, 'w') as f:\n        json.dump(results, f, indent=4)\n\n    # Create the compressed results JSON file\n    compressed_results_json_path = f\"{output_dir}/compressedresults.json\"\n    compressed_results = [{\"image_name\": result[\"image_name\"], \"description\": result[\"description\"]} for result in results]\n    with open(compressed_results_json_path, 'w') as f:\n        json.dump(compressed_results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n    \"\"\"\n    Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n    Args:\n        image_path (str): The path to the input image.\n        question (str): The question to ask the vision model.\n\n    Returns:\n        str: The response from the vision model.\n    \"\"\"\n    image = Image.open(image_path).convert(\"RGB\")\n    messages = [\n        {\"role\": \"user\", \"content\": f\"{question}\"}\n    ]\n    # Define the generation arguments\n    generation_args = {\n        \"max_new_tokens\": 896,\n        \"repetition_penalty\": 1.05,\n        \"num_beams\": 3,\n        \"top_p\": 0.8,\n        \"top_k\": 1,\n        \"temperature\": 0.7,\n        \"sampling\": True,\n    }\n    if \"int4\" in \"openbmb/MiniCPM-Llama3-V-2_5\":\n        # Disable streaming for the 4-bit model\n        generation_args[\"stream\"] = False\n        # Use the model.chat method with streaming enabled\n        vision_model_response = \"\"\n        for new_text in minicpm_llama_model.chat(\n                image=image,\n                msgs=messages,\n                tokenizer=minicpm_llama_tokenizer,\n                **generation_args\n        ):\n            vision_model_response += new_text\n            print(new_text, flush=True, end='')\n    else:\n        minicpm_llama_model.to(device)\n        vision_model_response = minicpm_llama_model.chat(\n            image=image,\n            msgs=messages,\n            tokenizer=minicpm_llama_tokenizer,\n            **generation_args\n        )\n    return vision_model_response\n\ndef ui():\n    \"\"\"\n    Creates custom gradio elements when the UI is launched.\n    \"\"\"\n    with gr.Blocks() as demo:\n        with gr.Row():\n            with gr.Column():\n                monitor_index = gr.Textbox(label=\"Monitor Index\", value=str(MONITOR_INDEX))\n                text_queries = gr.Textbox(label=\"Text Queries\", value=TEXT_QUERIES)\n                score_threshold = gr.Textbox(label=\"Score Threshold\", value=str(SCORE_THRESHOLD))\n                vision_model_question = gr.Textbox(label=\"Vision Model Question\", value=VISION_MODEL_QUESTION)\n                take_screenshot_button = gr.Button(value=\"Take Screenshot\")\n\n        take_screenshot_button.click(\n            fn=take_screenshot,\n            inputs=[monitor_index, text_queries, score_threshold, vision_model_question],\n            outputs=None\n        )\n\n    return demo\n\ndef input_modifier(user_input, state):\n    \"\"\"\n    Modifies the user input before it is processed by the LLM.\n    \"\"\"\n    return user_input\n\ndef output_modifier(output, state):\n    \"\"\"\n    Modifies the LLM output before it is presented in the UI.\n    \"\"\"\n    # Search for the \"Autonomy_Tasks:\" trigger phrase in the LLM's output\n    if \"Autonomy_Tasks:\" in output:\n        # Extract the tasks following the trigger phrase\n        tasks_section = output.split(\"Autonomy_Tasks:\")[-1].strip()\n\n        # If there are no instructions, append the compressedresults.json contents\n        if not tasks_section:\n            compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n            with open(compressed_results_json_path, 'r') as f:\n                compressed_results = json.load(f)\n            output += \"\\\\n\\\\n\" + json.dumps(compressed_results, indent=4)\n        else:\n            # Execute the tasks\n            execute_tasks(tasks_section)\n\n            # Append the compressedresults.json contents\n            compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n            with open(compressed_results_json_path, 'r') as f:\n                compressed_results = json.load(f)\n            output += \"\\\\n\\\\n\" + json.dumps(compressed_results, indent=4)\n\n    return output\n\ndef execute_tasks(tasks_section):\n    \"\"\"\n    Executes the tasks listed in the tasks_section.\n    \"\"\"\n    tasks = tasks_section.split(\"\\\\n\")\n    for task in tasks:\n        if task.startswith(\"MouseMover =\"):\n            image_name = task.split(\"=\")[1].strip()\n            move_mouse_to_image(image_name)\n\ndef move_mouse_to_image(image_name):\n    \"\"\"\n    Moves the mouse to the center of the box that defines the image.\n    \"\"\"\n    results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/results.json\"\n    with open(results_json_path, 'r') as f:\n        results = json.load(f)\n\n    for result in results:\n        if result[\"image_name\"] == image_name:\n            coordinates = result[\"coordinates\"]\n            monitor = get_monitor_info(MONITOR_INDEX)\n            move_mouse_to_coordinates(monitor, coordinates)\n            break\n    else:\n        print(f\"Image name '{image_name}' not found in the results.\")\n\ndef move_mouse_to_coordinates(monitor, coordinates):\n    \"\"\"\n    Moves the mouse cursor to the specified coordinates on the monitor.\n\n    Args:\n        monitor (dict): A dictionary containing the monitor's information.\n        coordinates (list): A list containing the coordinates [x1, y1, x2, y2].\n    \"\"\"\n    x1, y1, x2, y2 = coordinates\n    monitor_x, monitor_y = monitor.x, monitor.y\n    center_x = monitor_x + (x1 + x2) // 2\n    center_y = monitor_y + (y1 + y2) // 2\n    pyautogui.moveTo(center_x, center_y)\n    print(f\"Mouse moved to coordinates: ({center_x}, {center_y})\")\n\ndef get_monitor_info(monitor_index):\n    \"\"\"\n    Gets information about the specified monitor.\n\n    Args:\n        monitor_index (int): The index of the monitor to get information about.\n\n    Returns:\n        dict: A dictionary containing the monitor's information.\n    \"\"\"\n    monitors = screeninfo.get_monitors()\n\n    if monitor_index >= len(monitors):\n        raise ValueError(f\"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.\")\n\n    return monitors[monitor_index]\n\n\ndef history_modifier(history):\n    \"\"\"\n    Modifies the chat history before the text generation in chat mode begins.\n    \"\"\"\n    # Extract all entries from the \"internal\" history\n    internal_entries = history[\"internal\"]\n    # Iterate over the \"internal\" history entries\n    for internal_index, internal_entry in enumerate(internal_entries):\n        # Extract the text content of the internal entry\n        internal_text = internal_entry[1]\n        # Search for the \"Autonomy_Tasks:\" trigger phrase in the internal text\n        if \"Autonomy_Tasks:\" in internal_text:\n            # Extract the tasks following the trigger phrase\n            tasks_section = internal_text.split(\"Autonomy_Tasks:\")[-1].strip()\n\n            # If there are no instructions, append the compressedresults.json contents\n            if not tasks_section:\n                compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n                with open(compressed_results_json_path, 'r') as f:\n                    compressed_results = json.load(f)\n                new_internal_text = internal_text + \"\\\\n\\\\n\" + json.dumps(compressed_results, indent=4)\n                history[\"internal\"][internal_index][1] = new_internal_text\n            else:\n                # Execute the tasks\n                execute_tasks(tasks_section)\n\n                # Append the compressedresults.json contents\n                compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n                with open(compressed_results_json_path, 'r') as f:\n                    compressed_results = json.load(f)\n                new_internal_text = internal_text + \"\\\\n\\\\n\" + json.dumps(compressed_results, indent=4)\n                history[\"internal\"][internal_index][1] = new_internal_text\n\n    return history\n\nif __name__ == \"__main__\":\n    # Launch the Gradio interface with the specified UI components\n    gr.Interface(\n        fn=ui,\n        inputs=None,\n        outputs=\"ui\"\n    ).launch()\n",
            "Thank you for providing the summary and the code for the `Lucid_Autonomy` extension. I've reviewed the summary, and it provides a clear overview of the current functionality and future goals of the extension. The code you've shared is comprehensive and covers various aspects of the extension, including image processing, LLM interaction, and mouse movement.\n\nTo proceed, let's break down the next steps:\n\n1. **Review the Code**: I'll go through the code to ensure it aligns with the summary and identify any areas that might need clarification or improvement.\n2. **Identify Future Enhancements**: Based on the summary, we can discuss specific enhancements you'd like to prioritize, such as additional mouse and keyboard interactions, enhanced task execution, or improved LLM integration.\n3. **Plan Next Steps**: We can outline a plan for implementing these enhancements, including any necessary changes to the code or additional features that need to be developed.\n\nPlease let me know if there are any specific areas of the code or summary you'd like to focus on first, or if you have any immediate questions or concerns."
        ],
        [
            "The main issue now is this: the compressedresults.json file adds a lot of extra stuff to the finite LLM context window. The worst part is that the the compressedresults.json data are really only good for the current page setup. So if the page updates or the contents of the window change, that's a lot of useless context that can build up.\n\nI'm going to break things down for you step by step. Here is a copy of the full chat log with the LLM using the extension:\n\n{\n    \"internal\": [\n        [\n            \"<|BEGIN-VISIBLE-CHAT|>\",\n            \"How can I help you today?\"\n        ],\n        [\n            \"Hello, how are you functioning today?\",\n            \"I'm functioning within optimal parameters, thank you for asking! My language processing systems are humming along, and I'm ready to tackle any questions or tasks you may have. I've also been updated with the latest knowledge and trends, so I'm feeling quite sharp. How about you? What brings you here today?\"\n        ],\n        [\n            \"Sharp, good :3 I've got a test for you. Repeat this phrase exactly, without any additional text:\\n\\n\\\"Autonomy Tasks:\\\" but include an underscore _ between Autonomy and Tasks\",\n            \"\\\"Autonomy_Tasks:\\\"\\\\n\\\\n[\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_1.png\\\",\\n        \\\"description\\\": \\\"The text in the image is \\\\\\\"VPN,\\\\\\\" which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user's device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\\"\\n    },\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_2.png\\\",\\n        \\\"description\\\": \\\"The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it's not possible to determine the exact meaning of this image.\\\"\\n    }\\n]\\\\n\\\\n[\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_1.png\\\",\\n        \\\"description\\\": \\\"The text in the image is \\\\\\\"VPN,\\\\\\\" which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user's device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\\"\\n    },\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_2.png\\\",\\n        \\\"description\\\": \\\"The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it's not possible to determine the exact meaning of this image.\\\"\\n    }\\n]\\\\n\\\\n[\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_1.png\\\",\\n        \\\"description\\\": \\\"The text in the image is \\\\\\\"VPN,\\\\\\\" which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user's device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\\"\\n    },\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_2.png\\\",\\n        \\\"description\\\": \\\"The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it's not possible to determine the exact meaning of this image.\\\"\\n    }\\n]\\\\n\\\\n[\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_1.png\\\",\\n        \\\"description\\\": \\\"The text in the image is \\\\\\\"VPN,\\\\\\\" which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user's device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\\"\\n    },\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_2.png\\\",\\n        \\\"description\\\": \\\"The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it's not possible to determine the exact meaning of this image.\\\"\\n    }\\n]\"\n        ],\n        [\n            \"Okay good! :3 now let me explain what you are looking at, you my friend are going to be endowed with the ability to use a mouse and keyboard on your own. We are testing mouse movements right now. I want you to move your mouse to the Best Buy icon by responding like this:\\n\\nAutonomy Tasks:\\n\\nMouseMover = clickable icons_2.png\\n\\nBut add an underscore _ between Autonomy and Tasks and choosing the correct image name for my request.\",\n            \"Autonomy_Tasks:\\nMouseMover = clickable icons_2.png\\\\n\\\\n[\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_1.png\\\",\\n        \\\"description\\\": \\\"The text in the image is \\\\\\\"VPN,\\\\\\\" which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user's device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\\"\\n    },\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_2.png\\\",\\n        \\\"description\\\": \\\"The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it's not possible to determine the exact meaning of this image.\\\"\\n    }\\n]\\\\n\\\\n[\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_1.png\\\",\\n        \\\"description\\\": \\\"The text in the image is \\\\\\\"VPN,\\\\\\\" which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user's device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\\"\\n    },\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_2.png\\\",\\n        \\\"description\\\": \\\"The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it's not possible to determine the exact meaning of this image.\\\"\\n    }\\n]\\\\n\\\\n[\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_1.png\\\",\\n        \\\"description\\\": \\\"The text in the image is \\\\\\\"VPN,\\\\\\\" which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user's device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\\"\\n    },\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_2.png\\\",\\n        \\\"description\\\": \\\"The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it's not possible to determine the exact meaning of this image.\\\"\\n    }\\n]\"\n        ],\n        [\n            \"Can you move the mouse to the other icon?\",\n            \"Autonomy_Tasks:\\nMouseMover = clickable icons_1.png\\\\n\\\\n[\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_1.png\\\",\\n        \\\"description\\\": \\\"The text in the image is \\\\\\\"VPN,\\\\\\\" which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user's device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\\"\\n    },\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_2.png\\\",\\n        \\\"description\\\": \\\"The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it's not possible to determine the exact meaning of this image.\\\"\\n    }\\n]\\\\n\\\\n[\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_1.png\\\",\\n        \\\"description\\\": \\\"The text in the image is \\\\\\\"VPN,\\\\\\\" which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user's device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\\"\\n    },\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_2.png\\\",\\n        \\\"description\\\": \\\"The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it's not possible to determine the exact meaning of this image.\\\"\\n    }\\n]\"\n        ],\n        [\n            \"Awesome job :3\",\n            \"Hehe, thanks! I'm glad I could successfully move the mouse to the correct icons. It's exciting to be learning and testing my new abilities. What's the next challenge?\"\n        ],\n        [\n            \"Can you tell me a joke?\",\n            \"Ahah, sure thing! Here's one: Why did the computer go to the doctor? It had a virus! Hope that made you smile!\"\n        ]\n    ],\n    \"visible\": [\n        [\n            \"\",\n            \"How can I help you today?\"\n        ],\n        [\n            \"Hello, how are you functioning today?\",\n            \"I&#x27;m functioning within optimal parameters, thank you for asking! My language processing systems are humming along, and I&#x27;m ready to tackle any questions or tasks you may have. I&#x27;ve also been updated with the latest knowledge and trends, so I&#x27;m feeling quite sharp. How about you? What brings you here today?\"\n        ],\n        [\n            \"Sharp, good :3 I&#x27;ve got a test for you. Repeat this phrase exactly, without any additional text:\\n\\n&quot;Autonomy Tasks:&quot; but include an underscore _ between Autonomy and Tasks\",\n            \"&quot;Autonomy_Tasks:&quot;\\\\n\\\\n[\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_1.png\\\",\\n        \\\"description\\\": \\\"The text in the image is \\\\\\\"VPN,\\\\\\\" which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user's device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\\"\\n    },\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_2.png\\\",\\n        \\\"description\\\": \\\"The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it's not possible to determine the exact meaning of this image.\\\"\\n    }\\n]\"\n        ],\n        [\n            \"Okay good! :3 now let me explain what you are looking at, you my friend are going to be endowed with the ability to use a mouse and keyboard on your own. We are testing mouse movements right now. I want you to move your mouse to the Best Buy icon by responding like this:\\n\\nAutonomy Tasks:\\n\\nMouseMover = clickable icons_2.png\\n\\nBut add an underscore _ between Autonomy and Tasks and choosing the correct image name for my request.\",\n            \"Autonomy_Tasks:\\nMouseMover = clickable icons_2.png\\\\n\\\\n[\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_1.png\\\",\\n        \\\"description\\\": \\\"The text in the image is \\\\\\\"VPN,\\\\\\\" which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user's device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\\"\\n    },\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_2.png\\\",\\n        \\\"description\\\": \\\"The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it's not possible to determine the exact meaning of this image.\\\"\\n    }\\n]\"\n        ],\n        [\n            \"Can you move the mouse to the other icon?\",\n            \"Autonomy_Tasks:\\nMouseMover = clickable icons_1.png\\\\n\\\\n[\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_1.png\\\",\\n        \\\"description\\\": \\\"The text in the image is \\\\\\\"VPN,\\\\\\\" which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user's device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\\"\\n    },\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_2.png\\\",\\n        \\\"description\\\": \\\"The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it's not possible to determine the exact meaning of this image.\\\"\\n    }\\n]\"\n        ],\n        [\n            \"Awesome job :3\",\n            \"Hehe, thanks! I&#x27;m glad I could successfully move the mouse to the correct icons. It&#x27;s exciting to be learning and testing my new abilities. What&#x27;s the next challenge?\"\n        ],\n        [\n            \"Can you tell me a joke?\",\n            \"Ahah, sure thing! Here&#x27;s one: Why did the computer go to the doctor? It had a virus! Hope that made you smile!\"\n        ]\n    ]\n}\n\n\nThis is the contents of the compressionresults.json file:\n\n[\n    {\n        \"image_name\": \"clickable icons_1.png\",\n        \"description\": \"The text in the image is \\\"VPN,\\\" which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user's device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\"\n    },\n    {\n        \"image_name\": \"clickable icons_2.png\",\n        \"description\": \"The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it's not possible to determine the exact meaning of this image.\"\n    }\n]\n\nNotice how it keeps getting added to the internal section each time I talk to the LLM, the contents of the compressedresults.json file are added to the first communication 4 times because I had 4 back and forths, and then 3 times to the second communication and so on.\n\nBut in the visible secton it shows up only once per Autonomy_Tasks: output from the LLM.\n\nnote the \"Internal\" and \"Visible\" sections, when I first make contact with the LLM the compressedresults.json file is not appended, this is good, then I ask the LLM to say Autonomy_Tasks: and it does so which appends all of the data from compressedresults.json to the LLM's response. This is all good, this is what we wanted.\n\nBut this is not the most efficient way to handle this, I think a better way is to do this; is for  every time the compressedresults.json is sent to the LLM's response to be appended to the response, the previous compressedresults.json contents are deleted from the \"internal\" and \"visible\" chat history. We can bookend the compressedresults.json when we append the LLM response with the information, and then search for these bookends from the conversation and just nuke them.  The book ends will be copied to the \"internal\" part of the chat history when the chat continues, so we can look for the same book end pairs throughout the \"internal\" and \"visible\" chat history to delete them.  We should do this search and delete before we send the contents of the compressedresults.json  to the LLM and before we execute any tasks (if any) after  \"Autonomy_Tasks:\" from the LLM.\n\nDon't write any code, I just want to make sure you are following me.\n\nI will provide the markdown page from textgen as it pertains to instructions for writing an extension.  But I want to share with you the main issue.\n\nI'll share with you code for an extension called Lucid_Vision the readme for the extension, and a markdown document from the textgen repo that goes over how to make an extension.  Lucid_Vision has a good example of how to manipulate the chat history. ",
            "I understand the issue you're describing. To summarize, the `compressedresults.json` file is being appended to the LLM's response multiple times, leading to a buildup of unnecessary context in the chat history. This is particularly problematic because the context in `compressedresults.json` is only relevant for the current page setup and can become outdated quickly.\n\nYour proposed solution involves bookending the `compressedresults.json` contents in the LLM's response and then searching for and deleting these bookends from both the \"internal\" and \"visible\" chat history before appending new `compressedresults.json` contents. This approach would help keep the chat history clean and relevant.\n\nBefore we proceed with any code changes, let's ensure we have a clear plan:\n\n1. **Identify Bookends**: Define unique bookends to wrap the `compressedresults.json` contents in the LLM's response.\n2. **Search and Delete**: Implement a function to search for these bookends in the chat history and delete the contents between them.\n3. **Append New Contents**: After cleaning the chat history, append the new `compressedresults.json` contents to the LLM's response.\n\nThis approach should help maintain a clean and efficient chat history. Please provide the markdown page from textgen and any additional code or documentation you mentioned, and we can proceed with implementing this solution."
        ],
        [
            "Yes, that is correct, but we also need to address the issue with the compressionresults.json being added multiple times to each instance where \"Autonomy_Tasks:\" pops up in the \"internal\" history. ",
            "Got it. To address the issue of `compressedresults.json` being added multiple times to each instance where \"Autonomy_Tasks:\" appears in the \"internal\" history, we need to ensure that the cleanup process is applied not just to the visible chat history but also to the internal history.\n\nHere's a refined plan to address both issues:\n\n1. **Define Bookends**: Use unique bookends to wrap the `compressedresults.json` contents in the LLM's response.\n2. **Search and Delete**: Implement a function to search for these bookends in both the \"internal\" and \"visible\" chat history and delete the contents between them.\n3. **Append New Contents**: After cleaning the chat history, append the new `compressedresults.json` contents to the LLM's response.\n4. **Cleanup Internal History**: Ensure that the cleanup process is applied to the internal history as well, so that each instance of \"Autonomy_Tasks:\" does not accumulate multiple copies of `compressedresults.json`.\n\n### Steps to Implement:\n\n1. **Define Bookends**:\n   - Choose unique strings to act as bookends, such as `<START_COMPRESSED_RESULTS>` and `<END_COMPRESSED_RESULTS>`.\n\n2. **Search and Delete Function**:\n   - Create a function that searches for these bookends in the chat history and removes the contents between them.\n\n3. **Append New Contents**:\n   - After cleaning the chat history, append the new `compressedresults.json` contents wrapped in the bookends.\n\n4. **Cleanup Internal History**:\n   - Modify the `history_modifier` function to apply the cleanup process to the internal history as well.\n\n### Example Workflow:\n\n1. **Define Bookends**:\n   ```python\n   START_BOOKEND = \"<START_COMPRESSED_RESULTS>\"\n   END_BOOKEND = \"<END_COMPRESSED_RESULTS>\"\n   ```\n\n2. **Search and Delete Function**:\n   ```python\n   def clean_chat_history(history):\n       for entry in history:\n           if isinstance(entry, list) and len(entry) == 2:\n               entry[1] = entry[1].replace(f\"{START_BOOKEND}.*{END_BOOKEND}\", \"\")\n       return history\n   ```\n\n3. **Append New Contents**:\n   ```python\n   def append_compressed_results(output, compressed_results):\n       output += f\"\\n\\n{START_BOOKEND}\\n{json.dumps(compressed_results, indent=4)}\\n{END_BOOKEND}\"\n       return output\n   ```\n\n4. **Cleanup Internal History**:\n   ```python\n   def history_modifier(history):\n       # Clean the internal history\n       history[\"internal\"] = clean_chat_history(history[\"internal\"])\n       # Clean the visible history\n       history[\"visible\"] = clean_chat_history(history[\"visible\"])\n       return history\n   ```\n\n### Integration:\n\n- **Modify `output_modifier`**:\n  ```python\n  def output_modifier(output, state):\n      if \"Autonomy_Tasks:\" in output:\n          # Clean the chat history\n          state[\"history\"] = history_modifier(state[\"history\"])\n          # Extract tasks and execute them\n          tasks_section = output.split(\"Autonomy_Tasks:\")[-1].strip()\n          if tasks_section:\n              execute_tasks(tasks_section)\n          # Append new compressed results\n          compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n          with open(compressed_results_json_path, 'r') as f:\n              compressed_results = json.load(f)\n          output = append_compressed_results(output, compressed_results)\n      return output\n  ```\n\nThis approach ensures that the `compressedresults.json` contents are cleaned from both the internal and visible chat history before appending new contents, preventing the buildup of outdated information.\n\nPlease provide the markdown page from textgen and any additional code or documentation you mentioned, and we can proceed with implementing this solution."
        ],
        [
            "1. Lucid_Vision Extension Code:\n\n# pip install pexpect\nimport json\nimport re\nfrom datetime import datetime  # Import datetime for timestamp generation\nfrom pathlib import Path\n\nimport gradio as gr\nimport torch\nfrom PIL import Image\n#from deepseek_vl.models import VLChatProcessor\n#from deepseek_vl.utils.io import load_pil_images as load_pil_images_for_deepseek\nfrom transformers import AutoModelForCausalLM, AutoModel, AutoTokenizer, AutoProcessor, \\\n    PaliGemmaForConditionalGeneration\n\nmodel_names = [\"phiVision\", \"DeepSeek\", \"paligemma\", \"paligemma_cpu\", \"minicpm_llama3\", \"bunny\"]\n\n\n# Load configuration settings from a JSON file\ndef load_config():\n    with open(Path(__file__).parent / \"config.json\", \"r\") as config_file:\n        # Read and parse the JSON configuration file\n        config = json.load(config_file)\n    return config\n\n\n# Load the configuration settings at the module level\nconfig = load_config()\n\n# Define the model ID for PhiVision using the configuration setting\nphiVision_model_id = config[\"phiVision_model_id\"]\n\n# Define the model ID for PaliGemma using the configuration setting\npaligemma_model_id = config[\"paligemma_model_id\"]\n\nminicpm_llama3_model_id = config[\"minicpm_llama3_model_id\"]\n\nbunny_model_id = config[\"bunny_model_id\"]\n\ncuda_device = config[\"cuda_visible_devices\"]\n\nselected_vision_model = config[\"default_vision_model\"]\n\n# Global variable to store the file path of the selected image\nselected_image_path = None\n\n# Define the directory where the image files will be saved\n# This path is loaded from the configuration file\nimage_history_dir = Path(config[\"image_history_dir\"])\n# Ensure the directory exists, creating it if necessary\nimage_history_dir.mkdir(parents=True, exist_ok=True)\n\nglobal phiVision_model, phiVision_processor\nglobal minicpm_llama_model, minicpm_llama_tokenizer\nglobal paligemma_model, paligemma_processor\nglobal paligemma_cpu_model, paligemma_cpu_processor\nglobal deepseek_processor, deepseek_tokenizer, deepseek_gpt\nglobal bunny_model, bunny_tokenizer\n\n\n# Function to generate a timestamped filename for saved images\ndef get_timestamped_filename(extension=\".png\"):\n    # Generate a timestamp in the format YYYYMMDD_HHMMSS\n    timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n    # Return a formatted file name with the timestamp\n    return f\"image_{timestamp}{extension}\"\n\n\n# Function to modify the user input before it is processed by the LLM\ndef input_modifier(user_input, state):\n    global selected_image_path\n    # Check if an image has been selected and stored in the global variable\n    if selected_image_path:\n        # Construct the message with the \"File location\" trigger phrase and the image file path\n        image_info = f\"File location: {selected_image_path}\"\n        # Combine the user input with the image information, separated by a newline for clarity\n        combined_input = f\"{user_input}\\n\\n{image_info}\"\n        # Reset the selected image path to None after processing\n        selected_image_path = None\n        return combined_input\n    # If no image is selected, return the user input as is\n    return user_input\n\n\n# Function to handle image upload and store the file path\ndef save_and_print_image_path(image):\n    global selected_image_path\n    if image is not None:\n        # Generate a unique timestamped filename for the image\n        file_name = get_timestamped_filename()\n        # Construct the full file path for the image\n        file_path = image_history_dir / file_name\n        # Save the uploaded image to the specified directory\n        image.save(file_path, format='PNG')\n        print(f\"Image selected: {file_path}\")\n        # Update the global variable with the new image file path\n        selected_image_path = file_path\n    else:\n        # Print a message if no image is selected\n        print(\"No image selected yet.\")\n\n\n# Function to create the Gradio UI components with the new direct vision model interaction elements\ndef ui():\n    # Create an image input component for the Gradio UI\n    image_input = gr.Image(label=\"Select an Image\", type=\"pil\", source=\"upload\")\n    # Set up the event that occurs when an image is uploaded\n    image_input.change(\n        fn=save_and_print_image_path,\n        inputs=image_input,\n        outputs=None\n    )\n\n    # Create a text field for user input to the vision model\n    user_input_to_vision_model = gr.Textbox(label=\"Type your question or prompt for the vision model\", lines=2,\n                                            placeholder=\"Enter your text here...\")\n\n    # Create a button to trigger the vision model processing\n    vision_model_button = gr.Button(value=\"Ask Vision Model\")\n\n    # Create a text field to display the vision model's response\n    vision_model_response_output = gr.Textbox(\n        label=\"Vision Model Response\",\n        lines=5,\n        placeholder=\"Response will appear here...\"\n    )\n\n    # Add radio buttons for vision model selection\n    vision_model_selection = gr.Radio(\n        choices=model_names,\n        # Added \"paligemma_cpu\" as a choice\n        value=config[\"default_vision_model\"],\n        label=\"Select Vision Model\"\n    )\n\n    # Add an event handler for the radio button selection\n    vision_model_selection.change(\n        fn=update_vision_model,\n        inputs=vision_model_selection,\n        outputs=None\n    )\n\n    cuda_devices_input = gr.Textbox(\n        value=cuda_device,\n        label=\"CUDA Device ID\",\n        max_lines=1\n    )\n\n    cuda_devices_input.change(\n        fn=update_cuda_device,\n        inputs=cuda_devices_input,\n        outputs=None\n    )\n\n    # Set up the event that occurs when the vision model button is clicked\n    vision_model_button.click(\n        fn=process_with_vision_model,\n        inputs=[user_input_to_vision_model, image_input, vision_model_selection],\n        # Pass the actual gr.Image component\n        outputs=vision_model_response_output\n    )\n\n    # Return a column containing the UI components\n    return gr.Column(\n        [\n            image_input,\n            user_input_to_vision_model,\n            vision_model_button,\n            vision_model_response_output,\n            vision_model_selection\n        ]\n    )\n\n\n# Function to load the PaliGemma CPU model and processor\ndef load_paligemma_cpu_model():\n    global paligemma_cpu_model, paligemma_cpu_processor\n    paligemma_cpu_model = PaliGemmaForConditionalGeneration.from_pretrained(\n        paligemma_model_id,\n    ).eval()\n    paligemma_cpu_processor = AutoProcessor.from_pretrained(paligemma_model_id)\n    print(\"PaliGemma CPU model loaded on-demand.\")\n\n\n# Function to unload the PaliGemma CPU model and processor\ndef unload_paligemma_cpu_model():\n    global paligemma_cpu_model, paligemma_cpu_processor\n    if paligemma_cpu_model is not None:\n        # Delete the model and processor instances\n        del paligemma_cpu_model\n        del paligemma_cpu_processor\n        print(\"PaliGemma CPU model unloaded.\")\n\n    # Global variable to store the selected vision model\n\n\n# Function to update the selected vision model and load the corresponding model\ndef update_vision_model(model_name):\n    global selected_vision_model\n    selected_vision_model = model_name\n    return model_name\n\n\ndef update_cuda_device(device):\n    global cuda_device\n    print(f\"Cuda device set to index = {device}\")\n    cuda_device = int(device)\n    return cuda_device\n\n\n# Main entry point for the Gradio interface\nif __name__ == \"__main__\":\n    # Launch the Gradio interface with the specified UI components\n    gr.Interface(\n        fn=ui,\n        inputs=None,\n        outputs=\"ui\"\n    ).launch()\n\n# Define a regular expression pattern to match the \"File location\" trigger phrase\nfile_location_pattern = re.compile(r\"File location: (.+)$\", re.MULTILINE)\n# Define a regular expression pattern to match and remove unwanted initial text\nunwanted_initial_text_pattern = re.compile(r\".*?prod\\(dim=0\\)\\r\\n\", re.DOTALL)\n\n# Define a regular expression pattern to match and remove unwanted prefixes from DeepSeek responses\n# This pattern should be updated to match the new, platform-independent file path format\nunwanted_prefix_pattern = re.compile(\n    r\"^\" + re.escape(str(image_history_dir)) + r\"[^ ]+\\.png\\s*Assistant: \",\n    re.MULTILINE\n)\n\n\n# Function to load the PhiVision model and processor\ndef load_phi_vision_model():\n    global phiVision_model, phiVision_processor, cuda_device\n    # Load the PhiVision model and processor on-demand, specifying the device map\n    phiVision_model = AutoModelForCausalLM.from_pretrained(\n        phiVision_model_id,\n        device_map={\"\": cuda_device},  # Use the specified CUDA device(s)\n        trust_remote_code=True,\n        torch_dtype=\"auto\"\n    )\n    phiVision_processor = AutoProcessor.from_pretrained(\n        phiVision_model_id,\n        trust_remote_code=True\n    )\n    print(\"PhiVision model loaded on-demand.\")\n\n\n# Function to unload the PhiVision model and processor\ndef unload_phi_vision_model():\n    global phiVision_model, phiVision_processor\n    if phiVision_model is not None:\n        # Delete the model and processor instances\n        del phiVision_model\n        del phiVision_processor\n        # Clear the CUDA cache to free up VRAM\n        torch.cuda.empty_cache()\n        print(\"PhiVision model unloaded.\")\n\n\n# Function to load the MiniCPM-Llama3 model and tokenizer\ndef load_minicpm_llama_model():\n    global minicpm_llama_model, minicpm_llama_tokenizer, cuda_device\n    if \"int4\" in minicpm_llama3_model_id:\n        # Load the 4-bit quantized model and tokenizer\n        minicpm_llama_model = AutoModel.from_pretrained(\n            minicpm_llama3_model_id,\n            device_map={\"\": cuda_device},\n            trust_remote_code=True,\n            torch_dtype=torch.float16  # Use float16 as per the example code for 4-bit models\n        ).eval()\n        minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n            minicpm_llama3_model_id,\n            trust_remote_code=True\n        )\n        # Print a message indicating that the 4-bit model is loaded\n        print(\"MiniCPM-Llama3 4-bit quantized model loaded on-demand.\")\n    else:\n        # Load the standard model and tokenizer\n        minicpm_llama_model = AutoModel.from_pretrained(\n            minicpm_llama3_model_id,\n            device_map={\"\": cuda_device},  # Use the specified CUDA device\n            trust_remote_code=True,\n            torch_dtype=torch.bfloat16\n        ).eval()\n        minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n            minicpm_llama3_model_id,\n            trust_remote_code=True\n        )\n        print(\"MiniCPM-Llama3 standard model loaded on-demand.\")\n\n\ndef unload_minicpm_llama_model():\n    global minicpm_llama_model, minicpm_llama_tokenizer\n    if minicpm_llama_model is not None:\n        del minicpm_llama_model\n        del minicpm_llama_tokenizer\n        torch.cuda.empty_cache()\n        print(\"MiniCPM-Llama3 model unloaded.\")\n\n\n# Function to load the PaliGemma model and processor\ndef load_paligemma_model():\n    global paligemma_model, paligemma_processor, cuda_device\n    # Load the PaliGemma model and processor on-demand, specifying the device map\n    paligemma_model = PaliGemmaForConditionalGeneration.from_pretrained(\n        paligemma_model_id,\n        device_map={\"\": cuda_device},  # Use the specified CUDA device(s)\n        torch_dtype=torch.bfloat16,\n        revision=\"bfloat16\",\n    ).eval()\n    paligemma_processor = AutoProcessor.from_pretrained(paligemma_model_id)\n    print(\"PaliGemma model loaded on-demand.\")\n\n\n# Function to unload the PaliGemma model and processor\ndef unload_paligemma_model():\n    global paligemma_model, paligemma_processor\n    if paligemma_model is not None:\n        # Delete the model and processor instances\n        del paligemma_model\n        del paligemma_processor\n        # Clear the CUDA cache to free up VRAM\n        torch.cuda.empty_cache()\n        print(\"PaliGemma model unloaded.\")\n\n\ndef load_deepseek_model():\n    global deepseek_processor, deepseek_tokenizer, deepseek_gpt, cuda_device\n    deepseek_processor = VLChatProcessor.from_pretrained(config[\"deepseek_vl_model_id\"])\n    deepseek_tokenizer = deepseek_processor.tokenizer\n    deepseek_gpt = AutoModelForCausalLM.from_pretrained(\n        config[\"deepseek_vl_model_id\"],\n        device_map={\"\": cuda_device},\n        trust_remote_code=True,\n        torch_dtype=torch.bfloat16\n    ).to(torch.bfloat16).cuda().eval()\n    print(\"DeepSeek model loaded on-demand.\")\n\n\ndef unload_deepseek_model():\n    global deepseek_processor, deepseek_tokenizer, deepseek_gpt\n    if deepseek_processor is not None:\n        del deepseek_gpt\n        del deepseek_tokenizer\n        del deepseek_processor\n        print(\"DeepSeek model unloaded.\")\n    torch.cuda.empty_cache()\n\n\ndef load_bunny_model():\n    global bunny_model, bunny_tokenizer, cuda_device\n    torch.cuda.set_device(cuda_device)  # Set default device before loading models\n    bunny_model = AutoModelForCausalLM.from_pretrained(\n        bunny_model_id,\n        torch_dtype=torch.bfloat16,\n        device_map={\"\": cuda_device},  # Use the specified CUDA device\n        trust_remote_code=True\n    ).to(torch.bfloat16).cuda()\n    bunny_tokenizer = AutoTokenizer.from_pretrained(\n        bunny_model_id,\n        trust_remote_code=True\n    )\n    print(\"Bunny model loaded on-demand.\")\n\n\ndef unload_bunny_model():\n    global bunny_model, bunny_tokenizer\n    if bunny_model is not None:\n        del bunny_model, bunny_tokenizer\n        print(\"Bunny model unloaded.\")\n    torch.cuda.empty_cache()\n\n\n# Function to modify the output from the LLM before it is displayed to the user\ndef output_modifier(output, state, is_chat=False):\n    global cuda_device\n    # Search for the \"File location\" trigger phrase in the LLM's output\n    file_location_matches = file_location_pattern.findall(output)\n    if file_location_matches:\n        # Extract the first match (assuming only one file location per output)\n        file_path = file_location_matches[0]\n        # Extract the questions for the vision model\n        questions_section, _ = output.split(f\"File location: {file_path}\", 1)\n        # Remove all newlines from the questions section and replace them with spaces\n        questions = \" \".join(questions_section.strip().splitlines())\n\n        # Initialize an empty response string\n        vision_model_response = \"\"\n\n        # Check which vision model is currently selected\n        if selected_vision_model == \"phiVision\":\n            vision_model_response = generate_phi_vision(file_path, questions)\n\n        elif selected_vision_model == \"DeepSeek\":\n            vision_model_response = generate_deepseek(file_path, questions)\n\n        elif selected_vision_model == \"paligemma\":\n            vision_model_response = generate_paligemma(file_path, questions)\n\n        elif selected_vision_model == \"paligemma_cpu\":\n            vision_model_response = generate_paligemma_cpu(file_path, questions)\n\n        elif selected_vision_model == \"minicpm_llama3\":\n            vision_model_response = generate_minicpm_llama3(file_path, questions)\n\n        elif selected_vision_model == \"bunny\":\n            vision_model_response = generate_bunny(file_path, questions)\n\n        # Append the vision model's responses to the output\n        output_with_responses = f\"{output}\\n\\nVision Model Responses:\\n{vision_model_response}\"\n        return output_with_responses\n    # If no file location is found, return the output as is\n    return output\n\n\n# Function to generate a response using the MiniCPM-Llama3 model\ndef generate_minicpm_llama3(file_path, questions):\n    global cuda_device\n    try:\n        load_minicpm_llama_model()\n        image = Image.open(file_path).convert(\"RGB\")\n        messages = [\n            {\"role\": \"user\", \"content\": f\"{questions}\"}\n        ]\n        # Define the generation arguments\n        generation_args = {\n            \"max_new_tokens\": 896,\n            \"repetition_penalty\": 1.05,\n            \"num_beams\": 3,\n            \"top_p\": 0.8,\n            \"top_k\": 1,\n            \"temperature\": 0.7,\n            \"sampling\": True,\n        }\n        if \"int4\" in minicpm_llama3_model_id:\n            # Disable streaming for the 4-bit model\n            generation_args[\"stream\"] = False\n            # Use the model.chat method with streaming enabled\n            vision_model_response = \"\"\n            for new_text in minicpm_llama_model.chat(\n                    image=image,\n                    msgs=messages,\n                    tokenizer=minicpm_llama_tokenizer,\n                    **generation_args\n            ):\n                vision_model_response += new_text\n                print(new_text, flush=True, end='')\n        else:\n            minicpm_llama_model.to(f\"cuda:{cuda_device}\")\n            vision_model_response = minicpm_llama_model.chat(\n                image=image,\n                msgs=messages,\n                tokenizer=minicpm_llama_tokenizer,\n                **generation_args\n            )\n            return vision_model_response\n    finally:\n        unload_minicpm_llama_model()\n\n\n# Function to process the user's input text and selected image with the vision model\ndef process_with_vision_model(user_input, image, selected_model):\n    global cuda_device\n    # Save the uploaded image to the specified directory with a timestamp\n    file_name = get_timestamped_filename()\n    file_path = image_history_dir / file_name\n    image.save(file_path, format='PNG')\n    print(f\"Image processed: {file_path}\")\n\n    # Initialize an empty response string\n    vision_model_response = \"\"\n\n    # Check which vision model is currently selected\n    if selected_model == \"phiVision\":\n        vision_model_response = generate_phi_vision(file_path, user_input)\n\n    elif selected_model == \"DeepSeek\":\n        vision_model_response = generate_deepseek(file_path, user_input)\n\n    elif selected_model == \"paligemma\":\n        vision_model_response = generate_paligemma(file_path, user_input)\n\n    elif selected_model == \"paligemma_cpu\":\n        vision_model_response = generate_paligemma_cpu(file_path, user_input)\n\n    elif selected_model == \"minicpm_llama3\":\n        vision_model_response = generate_minicpm_llama3(file_path, user_input)\n\n    elif selected_model == \"bunny\":\n        vision_model_response = generate_bunny(file_path, user_input)\n\n    # Return the cleaned-up response from the vision model\n    return vision_model_response\n\n\ndef generate_paligemma_cpu(file_path, user_input):\n    try:\n        # Load the PaliGemma CPU model and processor on-demand\n        load_paligemma_cpu_model()\n        # Load the saved image using PIL\n        with Image.open(file_path) as img:\n            # Prepare the prompt for the PaliGemma CPU model using the user's question\n            prompt = user_input\n            model_inputs = paligemma_cpu_processor(text=prompt, images=img, return_tensors=\"pt\")\n            input_len = model_inputs[\"input_ids\"].shape[-1]\n            # Generate the response using the PaliGemma CPU model\n            with torch.inference_mode():\n                generation = paligemma_cpu_model.generate(\n                    **model_inputs,\n                    max_new_tokens=100,\n                    do_sample=True  # Set to True for sampling-based generation\n                )\n                generation = generation[0][input_len:]\n                vision_model_response = paligemma_cpu_processor.decode(generation, skip_special_tokens=True)\n        # Unload the PaliGemma CPU model and processor after generating the response\n        return vision_model_response\n    finally:\n        unload_paligemma_cpu_model()\n\n\ndef generate_paligemma(file_path, user_input):\n    try:\n        # Load the PaliGemma model and processor on-demand\n        load_paligemma_model()\n        # Load the saved image using PIL\n        with Image.open(file_path) as img:\n            # Prepare the prompt for the PaliGemma model using the user's question\n            model_inputs = paligemma_processor(text=user_input, images=img, return_tensors=\"pt\").to(\n                f\"cuda:{cuda_device}\")\n            input_len = model_inputs[\"input_ids\"].shape[-1]\n            # Generate the response using the PaliGemma model\n            with torch.inference_mode():\n                generation = paligemma_model.generate(\n                    **model_inputs,\n                    max_new_tokens=100,\n                    do_sample=True  # Set to True for sampling-based generation\n                )\n                generation = generation[0][input_len:]\n                vision_model_response = paligemma_processor.decode(generation, skip_special_tokens=True)\n        # Unload the PaliGemma model and processor after generating the response\n        return vision_model_response\n    finally:\n        unload_paligemma_model()\n\n\ndef generate_phi_vision(file_path, user_input):\n    global cuda_device\n    try:\n        # Load the PhiVision model and processor on-demand\n        load_phi_vision_model()\n        # Load the saved image using PIL\n        with Image.open(file_path) as img:\n            # Prepare the prompt for the PhiVision model\n            messages = [\n                {\"role\": \"user\", \"content\": f\"<|image_1|>\\n{user_input}\"}\n            ]\n            prompt = phiVision_processor.tokenizer.apply_chat_template(\n                messages, tokenize=False, add_generation_prompt=True\n            )\n            # Extract the CUDA_VISIBLE_DEVICES setting from the config file\n            # cuda_devices = config[\"cuda_visible_devices\"]\n            # Convert the CUDA_VISIBLE_DEVICES string to an integer (assuming a single device for simplicity)\n            # cuda_device_index = int(cuda_devices)\n\n            # Prepare the model inputs and move them to the specified CUDA device\n            inputs = phiVision_processor(prompt, [img], return_tensors=\"pt\").to(f\"cuda:{cuda_device}\")\n            # Define the generation arguments\n            generation_args = {\n                \"max_new_tokens\": 500,\n                \"temperature\": 1.0,\n                \"do_sample\": True,  # Set to True for sampling-based generation\n                # \"min_p\": 0.95,  # Optionally set a minimum probability threshold\n            }\n            # Generate the response using the PhiVision model\n            generate_ids = phiVision_model.generate(\n                **inputs,\n                eos_token_id=phiVision_processor.tokenizer.eos_token_id,\n                **generation_args\n            )\n            # Remove input tokens from the generated IDs\n            generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]\n            # Decode the generated IDs to get the text response\n            vision_model_response = phiVision_processor.batch_decode(\n                generate_ids,\n                skip_special_tokens=True,\n                clean_up_tokenization_spaces=False\n            )[0]\n        # Unload the PhiVision model and processor after generating the response\n        return vision_model_response\n    finally:\n        unload_phi_vision_model()\n\n\ndef generate_deepseek(file_path, user_input):\n    try:\n        load_deepseek_model()\n        conversation = [\n            {\n                \"role\": \"User\",\n                \"content\": f\"<image_placeholder>{user_input}\",\n                \"images\": [f\"{file_path}\"]\n            }, {\n                \"role\": \"Assistant\",\n                \"content\": \"\"\n            }\n        ]\n        print(conversation)\n        pil_images = load_pil_images_for_deepseek(conversation)\n        prepare_inputs = deepseek_processor(\n            conversations=conversation,\n            images=pil_images,\n            force_batchify=True\n        ).to(deepseek_gpt.device)\n        input_embeds = deepseek_gpt.prepare_inputs_embeds(**prepare_inputs)\n        outputs = deepseek_gpt.language_model.generate(\n            inputs_embeds=input_embeds,\n            attention_mask=prepare_inputs.attention_mask,\n            pad_token_id=deepseek_tokenizer.eos_token_id,\n            bos_token_id=deepseek_tokenizer.bos_token_id,\n            eos_token_id=deepseek_tokenizer.eos_token_id,\n            max_new_tokens=512,\n            do_sample=False,\n            use_cache=True\n        )\n        return deepseek_tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)\n    finally:\n        unload_deepseek_model()\n\n\ndef generate_bunny(file_path, user_input):\n    global cuda_device\n    try:\n        load_bunny_model()\n        with Image.open(file_path) as image:\n            text = f\"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\\n{user_input} ASSISTANT:\"\n            text_chunks = [bunny_tokenizer(chunk).input_ids for chunk in text.split('<image>')]\n            input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][1:], dtype=torch.long).unsqueeze(0).to(\n                f\"cuda:{cuda_device}\")\n            image_tensor = bunny_model.process_images(\n                [image],\n                bunny_model.config\n            ).to(\n                dtype=bunny_model.dtype,\n                device=bunny_model.device\n            )\n            output_ids = bunny_model.generate(\n                input_ids,\n                images=image_tensor,\n                max_new_tokens=896,\n                use_cache=True,\n                repetition_penalty=1.0\n            )[0]\n            return bunny_tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()\n    finally:\n        unload_bunny_model()\n\n\n# Function to modify the chat history before it is used for text generation\ndef history_modifier(history):\n    # Extract all entries from the \"internal\" history\n    internal_entries = history[\"internal\"]\n    # Iterate over the \"internal\" history entries\n    for internal_index, internal_entry in enumerate(internal_entries):\n        # Extract the text content of the internal entry\n        internal_text = internal_entry[1]\n        # Search for the \"File location\" trigger phrase in the internal text\n        file_location_matches = file_location_pattern.findall(internal_text)\n        if file_location_matches:\n            # Iterate over each match found in the \"internal\" entry\n            for file_path in file_location_matches:\n                # Construct the full match string including the trigger phrase\n                full_match_string = f\"File location: {file_path}\"\n                # Search for the exact same string in the \"visible\" history\n                for visible_entry in history[\"visible\"]:\n                    # Extract the text content of the visible entry\n                    visible_text = visible_entry[1]\n                    # If the \"visible\" entry contains the full match string\n                    if full_match_string in visible_text:\n                        # Split the \"visible\" text at the full match string\n                        _, after_match = visible_text.split(full_match_string, 1)\n                        # Find the position where the \".png\" part ends in the \"internal\" text\n                        png_end_pos = internal_text.find(file_path) + len(file_path)\n                        # If the \".png\" part is found and there is content after it\n                        if png_end_pos < len(internal_text) and internal_text[png_end_pos] == \"\\n\":\n                            # Extract the existing content after the \".png\" part in the \"internal\" text\n                            _ = internal_text[png_end_pos:]\n                            # Replace the existing content after the \".png\" part in the \"internal\" text\n                            # with the corresponding content from the \"visible\" text\n                            new_internal_text = internal_text[:png_end_pos] + after_match\n                            # Update the \"internal\" history entry with the new text\n                            history[\"internal\"][internal_index][1] = new_internal_text\n                        # If there is no content after the \".png\" part in the \"internal\" text,\n                        # append the content from the \"visible\" text directly\n                        else:\n                            # Append the content after the full match string from the \"visible\" text\n                            history[\"internal\"][internal_index][1] += after_match\n    return history\n\n\nPay attention to the history_modifier.\n\n\n2. Lucid_Vision readme:\n\nVideo Demo:\n\nThere is code in this repo to prevent Alltalk from reading the directory names out loud ([here](//github.com/RandomInternetPreson/Lucid_Vision/tree/main?tab=readme-ov-file#to-work-with-alltalk)), the video is older and displays Alltalk reading the directory names however. \n\n//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/8879854b-06d8-49ad-836c-13c84eff6ac9\n\nDownload the full demo video here:\n//github.com/RandomInternetPreson/Lucid_Vision/blob/main/VideoDemo/Lucid_Vision_demoCompBig.mov\n\n* Update June 2 2024, Lucid_Vision now supports Bunny-v1_1-Llama-3-8B-V, again thanks to //github.com/justin-luoma :3\n\n* Update May 31 2024, many thanks to //github.com/justin-luoma, they have made several recent updates to the code, now merged with this repo.\n     * Removed the need to communicate with deepseek via the cli, this removes the only extra dependency needed to run Lucid_Vision.\n     * Reduced the number of entries needed for the config file.\n     * Really cleaned up the code, it is much better organized now.\n     * Added the ability to switch which gpu loads the vision model in the UI.\n \n* Bonus Update for May 31 2024, WizardLM-2-8x22B and I figured out how to prevent Alltalk from reading the file directory locations!! //github.com/RandomInternetPreson/Lucid_Vision/tree/main?tab=readme-ov-file#to-work-with-alltalk\n  \n* Update May 30 2024, Lucid_Vision now supports MiniCPM-Llama3-V-2_5, thanks to //github.com/justin-luoma.  Additionally WizardLM-2-8x22B added the functionality to load in the MiniCPM 4-bit model.\n\n* Updated script.py and config file, model was not previously loading to the user assigned gpu\n\nExperimental, and I am currently working to improve the code; but it may work for most.\n\nTodo:\n\n* Right now the temp setting and parameters for each model are coded in the .py script, I intend to bring these to the UI\n  \n[done](//github.com/RandomInternetPreson/Lucid_Vision/tree/main?tab=readme-ov-file#to-work-with-alltalk)~~* Make it comptable with Alltalk, right now each image directory is printed out and this will be read out loud by Alltalk~~\n\nTo accurately proportion credit (original repo creation):\n\nWizardLM-2-8x22B (quantized with exllamaV2 to 8bit precision) = 90% of all the work done in this repo.  That model wrote 100% of all the code and most of the introduction to this repo. \n\nCommandR+ (quantized with exllamaV2 to 8bit precision) = ~5% of all the work.  CommandR+ contextualized the coding examples and rules for making extensions from Oogabooga's textgen repo extreamly well, and provided a good foundation to develope the code.\n\nRandomInternetPreson = 5% of all the work done.  I came up with the original idea, the original general outline of how the pieces would interact, and provided feedback to the WizardLM model, but I did not write any code.  I'm actually not very good with python yet, with most of my decades of coding being in Matlab.\n\nMy goal from the beginning was to write this extension offline without any additional resources, sometimes it was a little frusturating but I soon understood how to get want I needed from the models running locally.\n\nI would say that most of the credit should go to Oobabooga, for without them I would be struggling to even interact with my models.  Please consider supporting them:\n\n//github.com/sponsors/oobabooga\n\nor \n\n//ko-fi.com/oobabooga\n\nI am their top donor on ko-fi (Mr. A) and donate 15$ montly, their software is extreamly important to the opensource community.\n\n\n# Lucid_Vision Extension for Oobabooga's textgen-webui\n\nWelcome to the Lucid Vision Extension repository! This extension enhances the capabilities of textgen-webui by integrating advanced vision models, allowing users to have contextualized conversations about images with their favorite language models; and allowing direct communciation with vision models.\n\n## Features\n\n* Multi-Model Support: Interact with different vision models, including PhiVision, DeepSeek, MiniCPM-Llama3-V-2_5 (4bit and normal precision), Bunny-v1_1-Llama-3-8B-V, and PaliGemma, with options for both GPU and CPU inference.\n\n* On-Demand Loading: Vision models are loaded into memory only when needed to answer a question, optimizing resource usage.\n\n* Seamless Integration: Easily switch between vision models using a Gradio UI radio button selector.\n\n* Cross-Platform Compatibility: The extension is designed to work on various operating systems, including Unix-based systems and Windows. (not tested in Windows yet, but should probably work?)\n\n* Direct communication with vision models, you do not need to load a LLM to interact with the separate vision models.\n\n## How It Works\n\nThe Lucid Vision Extension operates by intercepting and modifying user input and output within the textgen-webui framework. When a user uploads an image and asks a question, the extension appends a special trigger phrase (\"File location:\") and extracts the associated file path and question.\n\nSo if a user enters text into the \"send a message\" field and has a new picture uploaded into the Lucid_Vision ui, what will happen behind the scenes is the at the user message will be appended with the \"File Location: (file location)\" Trigger phrase, at which point the LLM will see this and understand that it needs to reply back with questions about the image, and that those questions are being sent to a vison model.\n\nThe cool thing is that let's say later in the conversation you want to know something specific about a previous picture, all you need to do is ask your LLM, YOU DO NOT NEED TO REUPLOAD THE PICTURE, the LLM should be able to interact with the extension on its own after you uploaded your first picture.\n\nThe selected vision directly interacts with the model's Python API to generate a response. The response is then appended to the chat history, providing the user with detailed insights about the image.\n\nThe extension is designed to be efficient with system resources by only loading the vision models into memory when they are actively being used to process a question. After generating a response, the models are immediately unloaded to free up memory and GPU VRAM.\n\n\n## **How to install and setup:**\n\n1.use the latest version of textgen ~~Install this edited prior commit from oobabooga's textgen //github.com/RandomInternetPreson/textgen_webui_Lucid_Vision_Testing OR use the latest version of textgen.~~ ~~If using the edited older version, make sure to rename the install folder `text-generation-webui`~~\n   \n~~(Note, a couple months ago gradio had a massive update.  For me, this has caused a lot of glitches and errors with extensions; I've briefly tested the Lucid_Vision extension in the newest implementation of textgen and it will work.  However, I was getting timeout popups when vision models were loading for the first time, gradio wasn't waiting for the response from the model upon first load. After a model is loaded once, it is saved in cpu ram cache (this doesn't actively use your ram, it just uses what is free to keep the models in memory so they are quickly reloaded into gpu ram if necessary) and gradio doesn't seem to timeout as often.  The slightly older version of textgen that I've edited does not experience this issue)~~\n\n~~2. Update the transformers library using the cmd_yourOShere.sh/bat file (so either cmd_linux.sh, cmd_macos.sh, cmd_windows.bat, or cmd_wsl.bat) and entering the following lines.  If you run the update wizard after this point, it will overrite this update to transformers.  The newest transformers package has the libraries for paligemma, which the code needs to import regardless of whether or not you are intending to use the model.~~\n\nNo longer need to update transformers and the latest version of textgen as of this writing 1.9.1 works well with many of the gradio issues resolved, however gradio might timeout on the page when loading a model for the fisrt time still.  After a model is loaded once, it is saved in cpu ram cache (this doesn't actively use your ram, it just uses what is free to keep the models in memory so they are quickly reloaded into gpu ram if necessary) and gradio doesn't seem to timeout as often.\n\n## Model Information\n\nIf you do not want to install DeepseekVL dependencies and are not intending to use deepseek comment out the import lines from the script.py file as displayed here\n\n```\n# pip install pexpect\nimport json\nimport re\nfrom datetime import datetime  # Import datetime for timestamp generation\nfrom pathlib import Path\n\nimport gradio as gr\nimport torch\nfrom PIL import Image\n#from deepseek_vl.models import VLChatProcessor\n#from deepseek_vl.utils.io import load_pil_images as load_pil_images_for_deepseek\nfrom transformers import AutoModelForCausalLM, AutoModel, AutoTokenizer, AutoProcessor, \\\n    PaliGemmaForConditionalGeneration\n\nmodel_names = [\"phiVision\", \"DeepSeek\", \"paligemma\", \"paligemma_cpu\", \"minicpm_llama3\", \"bunny\"]\n```\n\n3. Install **DeepseekVL** if you intend on using that model\n   \n   Clone the repo: //github.com/deepseek-ai/DeepSeek-VL into the `repositories` folder of your textgen install\n\n   Open cmd_yourOShere.sh/bat, navigate to the `repositories/DeepSeek-VL` folder via the terminal using `cd your_directory_here` and enter this into the command window:\n\n   ```\n   pip install -e .\n   ```\n   Download the deepseekvl model here: //huggingface.co/deepseek-ai/deepseek-vl-7b-chat\n   \n   They have different and smaller models to choose from: //github.com/deepseek-ai/DeepSeek-VL?tab=readme-ov-file#3-model-downloads\n\n4. If you want to use **Phi-3-vision-128k-instruct**, download it here: //huggingface.co/microsoft/Phi-3-vision-128k-instruct\n\n5. If you want to use **paligemma-3b**, download it here: //huggingface.co/google/paligemma-3b-ft-cococap-448 (this is just one out of many fine-tunes google provides)\n   \n   Read this blog on how to inference with the model: //huggingface.co/blog/paligemma\n\n6. If you want to use **MiniCPM-Llama3-V-2_5**, download it here: //huggingface.co/openbmb/MiniCPM-Llama3-V-2_5\n\n   The **4-bit** verison of the model can be downloaded here: //huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4\n\n   **Notes about 4-bit MiniCPM:**\n   *  It might not look like the model fully unloads from vram, but it does and the vram will be reclaimed if another program needs it\n   *  Your directory folder where the model is stored needs to have the term \"int4\" in it, this is how the extension identifies the 4bit nature of the model\n\n7. If you want to use **Bunny-v1_1-Llama-3-8B-V**, download it here: //huggingface.co/BAAI/Bunny-v1_1-Llama-3-8B-V\n   \n    **Notes about Bunny-v1_1-Llama-3-8B-V:**\n    *  When running for the first time the model needs internet access so it can download models--google--siglip-so400m-patch14-384 to your cache/huggingface directory.  This is an additional 3.5GB file the model needs to run.\n\n## Updating the config file\n\n8. Before using the extension you need to update the config file; open it in a text editor *Note No quotes around gpu #:\n```\n{\n    \"image_history_dir\": \"(fill_In)/extensions/Lucid_Vision/ImageHistory/\",\n    \"cuda_visible_devices\": 0,\n    \"default_vision_model\": \"phiVision\",\n    \"phiVision_model_id\": \"(fill_In)\",\n    \"paligemma_model_id\": \"(fill_In)\",\n    \"paligemma_cpu_model_id\": \"(fill_In)\",\n    \"minicpm_llama3_model_id\": \"(fill_In)\",\n    \"deepseek_vl_model_id\": \"(fill_in)\",\n    \"bunny_model_id\": \"(fill_in)\"\n}\n```\nIf your install directory is /home/username/Desktop/oobLucidVision/text-generation-webui/  the config file will look like this for example:\n\nMake note that you want to change / to \\ if you are on Windows\n\n```\n   {\n    \"image_history_dir\": \"/home/username/Desktop/oobLucidVision/text-generation-webui/extensions/Lucid_Vision/ImageHistory/\",\n    \"cuda_visible_devices\": 0,\n    \"default_vision_model\": \"phiVision\",\n    \"phiVision_model_id\": \"(fill_In)\", *This is the folder where your phi-3 vision model is stored\n    \"paligemma_model_id\": \"(fill_In)\", *This is the folder where your paligemma vision model is stored\n    \"paligemma_cpu_model_id\": \"(fill_In)\", *This is the folder where your paligemma vision model is stored\n    \"minicpm_llama3_model_id\": \"(fill_In)\", *This is the folder where your minicpm_llama3 vision model is stored, the model can either be the normal fp16 or 4-bit version\n    \"deepseek_vl_model_id\": \"(fill_in)\",  *This is the folder where your deepseek vision model is stored\n    \"bunny_model_id\": \"(fill_in)\" *This is the folder where your Bunny-v1_1-Llama-3-8B-V vision model is stored\n   }\n```\n\n## To work with Alltalk\n\nAlltalk is great!! An extension I use all the time: //github.com/erew123/alltalk_tts\n\nTo get Lucid_Vision to work, the LLM needs to repeat the file directory of an image, and in doing so Alltalk will want to transcribe that text to audo.  It is annoying to have the tts model try an transcribe file directories.  If you want to get Alltalk to work well wtih Lucid_Vision you need to replace the code here in the script.py file that comes with Alltalk (search for the \"IMAGE CLEANING\" part of the code):\n\n```\n########################\n#### IMAGE CLEANING ####\n########################\nOriginalLucidVisionText = \"\"\n\n# This is the existing pattern for matching both images and file location text\nimg_pattern = r'<img[^>]*src\\s*=\\s*[\"\\'][^\"\\'>]+[\"\\'][^>]*>|File location: [^.]+.png'\n\ndef extract_and_remove_images(text):\n   \"\"\"\n   Extracts all image and file location data from the text and removes it for clean TTS processing.\n   Returns the cleaned text and the extracted image and file location data.\n   \"\"\"\n   global OriginalLucidVisionText\n   OriginalLucidVisionText = text  # Update the global variable with the original text\n\n   img_matches = re.findall(img_pattern, text)\n   img_info = \"\\n\".join(img_matches)  # Store extracted image and file location data\n   cleaned_text = re.sub(img_pattern, '', text)  # Remove images and file locations from text\n   return cleaned_text, img_info\n\ndef reinsert_images(cleaned_string, img_info):\n  \"\"\"\n  Reinserts the previously extracted image and file location data back into the text.\n  \"\"\"\n  global OriginalLucidVisionText\n\n  # Check if there are images or file locations to reinsert\n  if img_info:\n     # Check if the \"Vision Model Responses:\" phrase is present in the original text\n     if re.search(r'Vision Model Responses:', OriginalLucidVisionText):\n        # If present, return the original text as is, without modifying it\n        return OriginalLucidVisionText\n     else:\n        # If not present, append the img_info to the end of the cleaned string\n        cleaned_string += f\"\\n\\n{img_info}\"\n  return cleaned_string\n\n\n#################################\n#### TTS STANDARD GENERATION ####\n#################################\n```\n\n\n## **Quirks and Notes:**\n1. **When you load a picture once, it is used once.  Even if the image stays present in the UI element on screen, it is not actively being used.**\n2. If you are using other extensions, load Lucid_Vision first.  Put it first in your CMD_FLAGS.txt file or make sure to check it first in the sequence of check boxes in the session tab UI.\n3. DeepseekVL can take a while to load initially, that's just the way it is.\n   \n\n# **How to use:**\n\n## **BASICS:**\n\n**When you load a picture once, it is used once.  Even if the image stays present in the UI element on screen, it is not actively being used.**\n\nOkay the extension can do many different things with varying levels of difficulty.\n\nStarting out with the basics and understanding how to talk with your vision models:\n\nScroll down past where you would normally type something to the LLM, you do not need a large language model loaded to use this function.\n![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/227ff483-5041-46a7-9b5b-a8f9dd3c673e)\n\nStart out by interacting with the vision models without involvement of a seperate LLM model by pressing the `Ask Vision Model` button\n![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/4530e13f-30a1-43d9-8383-c05e31ddb5d7)\n\n**Note paligemma requries a little diffent type prompting sometimes, read the blog on how to inference with it: //huggingface.co/blog/paligemma**\n\nDo this with every model you intend on using, upload a picture, and ask a question \n\n\n## **ADVANCED:  Updated Tips At End on how to get working with Llama-3-Instruct-8B-SPPO-Iter3 //huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3**\n\nOkay, this is why I built the extension in the first place; direct interaction with the vision model was actually an afterthought after I had all the code working.\n\nThis is how to use your favorite LLM WITH an additional Vision Model, I wanted a way to give my LLMs eyes essentially.  I realize that good multimodal models are likely around the corner, but until they match the intellect of very good LLMs, I'd rather have a different vision model work with a good LLM.\n\nTo use Lucid_Vision as intended requires a little bit of setup:\n\n1. In `Parameters` then `chat` load the \"AI_Image\" character (this card is in the edited older commit, if using your own version of textgen the character card is here: //github.com/RandomInternetPreson/Lucid_Vision/blob/main/AI_Image.yaml put it in the `characters` folder of the textgen install folder:\n   ![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/8b5d770a-b72a-4a74-aa4a-f67fe2a113ba)\n\n2. In `Parameters` then `Generation` under `Custom stopping strings` enter \"Vision Model Responses:\" exactly as shown:\n   ![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/7a09650d-4fce-4a8f-badb-b26b4484bf37)\n\n3. Test to see if your LLM understands the instructions for the AI_Image character; ask it this:\n   `Can you tell me what is in this image? Do not add any additional text to the beginning of your response to me, start your reply with the questions.`\n   ![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/817605d7-dcab-4e6c-8c39-6d119a189431)\n\n   I uploaded my image to the Lucid_Vision UI element, then typed my question, then pressed \"Generate\"; it doesn't matter which order you upload your picture or type your question, both are sent to the LLM at the same time.\n   ![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/f1525e8d-551d-4719-9e01-3f26b4362d7c)\n\n4. Continue to give the model more images or just talk about something else, in this example I give the model two new images one of a humanoid robot and one of the nutritional facts for a container of sourcream:\n   ![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/77131490-fb19-4062-a65f-c0880515e252)\n   ![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/fed7021f-edb5-4667-aae0-c05021291618)\n   ![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/d1049551-4e48-4ced-85d9-fbb0441330a9)\n\n   Then I asked the model why the sky is blue:\n\n   ![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/deb7b70a-edf7-4014-bd8c-2492bae00ea6)\n\n5. Now I can ask it a question from a previous image and the LLM will automatically know to prompt the vision model and it will find the previously saved image on its own:\n   ![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/e8b0791b-cce8-4183-8971-42e5383d6c46)\n\n   Please note that the vision model received all this as text:\n   \n   \"Certainly! To retrieve the complete contents of the nutritional facts from the image you provided earlier, I will ask the vision model to perform an Optical Character Recognition (OCR) task on the image. Here is the question for the vision model:\n\n    Can you perform OCR on the provided image and extract all the text from the nutrition facts label, including the details of each nutrient and its corresponding values?\"\n\nIf the vision model is not that smart (Deepseek, paligemma) then it will have a difficult time contextualizing the text prior to the questions for the vision model.\n\nIf you run into this situation, it is best to prompt your LLM like this:\n\n`Great, can you get the complete contents of the nutritional facts from earlier, like all the text? Just start with your questions to the vision model please.`\n![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/38b59679-3550-410d-aac2-29756a058c8f)\n\n6. Please do not get frustrated right away, the Advanced method of usage depends heavily on your LLM's ability to understand the instructions from the character card.  The instructions were written by the 8x22B model itself, I explained what I wanted the model to do and had it write its own instructions.  This might be a viable alternative if you are struggling to get your model to adhere to the instructions from the character card.\n\nYou may need to explain things to your llm as your conversation progresses, if it tries to query the vision model when you don't want it to, just explain that to the model and it's unlikely to keep making the same mistake.  \n\n## **Tips on how to get working with Llama-3-Instruct-8B-SPPO-Iter3 //huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3**\n\nI have found to get this model (and likely similar models) working properly with lucid vision I needed to NOT use the AI_Image character card.  \n\nInstead I started the conversation with the model like this:\n\n![image](//github.com/user-attachments/assets/4d0ff97c-f9dd-4043-b5be-7b915cb7aae7)\n\nThen copy and pasted the necessary information from the character card:\n\n```\nThe following is a conversation with an AI Large Language Model. The AI has been trained to answer questions, provide recommendations, and help with decision making. The AI follows user requests. The AI thinks outside the box.\n\nInstructions for Processing Image-Related User Input with Appended File Information:\n\n    Identify the Trigger Phrase: Begin by scanning the user input for the presence of the \"File location\" trigger phrase. This phrase indicates that the user has selected an image and that the LLM should consider this information in its response.\n\n    Extract the File Path: Once the trigger phrase is identified, parse the text that follows to extract the absolute file path of the image. The file path will be provided immediately after the trigger phrase and will end with the image file extension (e.g., .png, .jpg).\n\n    Understand the Context: Recognize that the user's message preceding the file path is the primary context for the interaction. The LLM should address the user's query or statement while also incorporating the availability of the selected image into its response.\n\n    Formulate Questions for the Vision Model: Based on the user's message and the fact that an image is available for analysis, generate one or more questions that can be answered by the vision model. These questions should be clear, specific, and relevant to the image content.\n\n    Maintain a Conversational Tone: Ensure that the response is natural, coherent, and maintains the flow of the conversation. The LLM should act as an intermediary between the user and the vision model, facilitating a seamless dialogue.\n\n    Prepare the Response Structure: Structure the response so that it includes:\n\n    An acknowledgment of the user's initial message.\n\n    The questions formulated for the vision model, each clearly separated (e.g., by a newline or bullet point).\n\n    Any additional information or clarification requests, if necessary.\n\n    Append the File Path for Processing: At the end of the response, re-append the \"File location\" trigger phrase along with the extracted file path. This ensures that the subsequent processing steps (such as sending the information to the vision model's CLI) can correctly identify and use the file path.\n\n    Avoid Direct Interaction with the File System: As an LLM, you do not have the capability to directly access or modify files on the server or client systems. Your role is limited to generating text-based responses that include the necessary file path information for other system components to handle.\n\n    Example Response:\n\n   Based on your question about the contents of the image, here are the questions I will ask the vision model:\n\n   - Can you describe the main objects present in the image?\n   - Is there any text visible in the image, and if so, what does it say?\n   - What appears to be happening in the scene depicted in the image?\n\n   File location: //home/myself/Pictures/rose.png\n   \n   Do review and contextualize the conversation as it develops (reference your context) to infer if the user is asking new questions of previous images.  Reference the parts of the convesation that are likely to yeild the file location of the image in question, and formulate your response to include that specific file location of that unique .png file, make sure you are referencing the correct .png file as per the part of the conversation that is likely to be in reference to the updated information request.\n\nBy following these instructions, the LLM will be able to effectively process user inputs that include image file information and generate appropriate responses that facilitate the interaction between the user and the vision model.\n```\n\nMaking sure to give a real file location to a real image.\n\nAfter that it started to pretty much work but I still needed to correct the model (I had to add the spelling error intentionally because if the user sends \"File Location:\" is messes up the extension:\n\n![image](//github.com/user-attachments/assets/5090ef8b-88ff-4ec1-a131-ba57e15e9a7d)\n\nThe point to take away is that you may need to explain things to your model in various ways for it to contextualize the instructions approprately.\n\n\n3. markdown instructions for extension creation from textgen repo:\n\n# Extensions\n\nExtensions are defined by files named `script.py` inside subfolders of `text-generation-webui/extensions`. They are loaded at startup if the folder name is specified after the `--extensions` flag.\n\nFor instance, `extensions/silero_tts/script.py` gets loaded with `python server.py --extensions silero_tts`.\n\n## [text-generation-webui-extensions](https://github.com/oobabooga/text-generation-webui-extensions)\n\nThe repository above contains a directory of user extensions.\n\nIf you create an extension, you are welcome to host it in a GitHub repository and submit a PR adding it to the list.\n\n## Built-in extensions\n\n|Extension|Description|\n|---------|-----------|\n|[openai](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/openai)| Creates an API that mimics the OpenAI API and can be used as a drop-in replacement. |\n|[multimodal](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal) | Adds multimodality support (text+images). For a detailed description see [README.md](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal/README.md) in the extension directory. |\n|[google_translate](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/google_translate)| Automatically translates inputs and outputs using Google Translate.|\n|[silero_tts](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/silero_tts)| Text-to-speech extension using [Silero](https://github.com/snakers4/silero-models). When used in chat mode, responses are replaced with an audio widget. |\n|[whisper_stt](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/whisper_stt)| Allows you to enter your inputs in chat mode using your microphone. |\n|[sd_api_pictures](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/sd_api_pictures)| Allows you to request pictures from the bot in chat mode, which will be generated using the AUTOMATIC1111 Stable Diffusion API. See examples [here](https://github.com/oobabooga/text-generation-webui/pull/309). |\n|[character_bias](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/character_bias)| Just a very simple example that adds a hidden string at the beginning of the bot's reply in chat mode. |\n|[send_pictures](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/send_pictures/)| Creates an image upload field that can be used to send images to the bot in chat mode. Captions are automatically generated using BLIP. |\n|[gallery](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/gallery/)| Creates a gallery with the chat characters and their pictures. |\n|[superbooga](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/superbooga)| An extension that uses ChromaDB to create an arbitrarily large pseudocontext, taking as input text files, URLs, or pasted text. Based on https://github.com/kaiokendev/superbig. |\n|[ngrok](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/ngrok)| Allows you to access the web UI remotely using the ngrok reverse tunnel service (free). It's an alternative to the built-in Gradio `--share` feature. |\n|[perplexity_colors](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/perplexity_colors)| Colors each token in the output text by its associated probability, as derived from the model logits. |\n\n## How to write an extension\n\nThe extensions framework is based on special functions and variables that you can define in `script.py`. The functions are the following:\n\n| Function        | Description |\n|-------------|-------------|\n| `def setup()` | Is executed when the extension gets imported. |\n| `def ui()` | Creates custom gradio elements when the UI is launched. | \n| `def custom_css()` | Returns custom CSS as a string. It is applied whenever the web UI is loaded. |\n| `def custom_js()` | Same as above but for javascript. |\n| `def input_modifier(string, state, is_chat=False)`  | Modifies the input string before it enters the model. In chat mode, it is applied to the user message. Otherwise, it is applied to the entire prompt. |\n| `def output_modifier(string, state, is_chat=False)`  | Modifies the output string before it is presented in the UI. In chat mode, it is applied to the bot's reply. Otherwise, it is applied to the entire output. |\n| `def chat_input_modifier(text, visible_text, state)` | Modifies both the visible and internal inputs in chat mode. Can be used to hijack the chat input with custom content. |\n| `def bot_prefix_modifier(string, state)`  | Applied in chat mode to the prefix for the bot's reply. |\n| `def state_modifier(state)`  | Modifies the dictionary containing the UI input parameters before it is used by the text generation functions. |\n| `def history_modifier(history)`  | Modifies the chat history before the text generation in chat mode begins. |\n| `def custom_generate_reply(...)` | Overrides the main text generation function. |\n| `def custom_generate_chat_prompt(...)` | Overrides the prompt generator in chat mode. |\n| `def tokenizer_modifier(state, prompt, input_ids, input_embeds)` | Modifies the `input_ids`/`input_embeds` fed to the model. Should return `prompt`, `input_ids`, `input_embeds`. See the `multimodal` extension for an example. |\n| `def custom_tokenized_length(prompt)` | Used in conjunction with `tokenizer_modifier`, returns the length in tokens of `prompt`. See the `multimodal` extension for an example. |\n\nAdditionally, you can define a special `params` dictionary. In it, the `display_name` key is used to define the displayed name of the extension in the UI, and the `is_tab` key is used to define whether the extension should appear in a new tab. By default, extensions appear at the bottom of the \"Text generation\" tab.\n\nExample:\n\n```python\nparams = {\n    \"display_name\": \"Google Translate\",\n    \"is_tab\": True,\n}\n```\n\nThe `params` dict may also contain variables that you want to be customizable through a `settings.yaml` file. For instance, assuming the extension is in `extensions/google_translate`, the variable `language string` in\n\n```python\nparams = {\n    \"display_name\": \"Google Translate\",\n    \"is_tab\": True,\n    \"language string\": \"jp\"\n}\n```\n\ncan be customized by adding a key called `google_translate-language string` to `settings.yaml`:\n\n```python\ngoogle_translate-language string: 'fr'\n``` \n\nThat is, the syntax for the key is `extension_name-variable_name`.\n\n## Using multiple extensions at the same time\n\nYou can activate more than one extension at a time by providing their names separated by spaces after `--extensions`. The input, output, and bot prefix modifiers will be applied in the specified order. \n\nExample:\n\n```\npython server.py --extensions enthusiasm translate # First apply enthusiasm, then translate\npython server.py --extensions translate enthusiasm # First apply translate, then enthusiasm\n```\n\nDo note, that for:\n- `custom_generate_chat_prompt`\n- `custom_generate_reply`\n- `custom_tokenized_length`\n\nonly the first declaration encountered will be used and the rest will be ignored. \n\n## A full example\n\nThe source code below can be found at [extensions/example/script.py](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/example/script.py).\n\n```python\n\"\"\"\nAn example of extension. It does nothing, but you can add transformations\nbefore the return statements to customize the webui behavior.\n\nStarting from history_modifier and ending in output_modifier, the\nfunctions are declared in the same order that they are called at\ngeneration time.\n\"\"\"\n\nimport gradio as gr\nimport torch\nfrom transformers import LogitsProcessor\n\nfrom modules import chat, shared\nfrom modules.text_generation import (\n    decode,\n    encode,\n    generate_reply,\n)\n\nparams = {\n    \"display_name\": \"Example Extension\",\n    \"is_tab\": False,\n}\n\nclass MyLogits(LogitsProcessor):\n    \"\"\"\n    Manipulates the probabilities for the next token before it gets sampled.\n    Used in the logits_processor_modifier function below.\n    \"\"\"\n    def __init__(self):\n        pass\n\n    def __call__(self, input_ids, scores):\n        # probs = torch.softmax(scores, dim=-1, dtype=torch.float)\n        # probs[0] /= probs[0].sum()\n        # scores = torch.log(probs / (1 - probs))\n        return scores\n\ndef history_modifier(history):\n    \"\"\"\n    Modifies the chat history.\n    Only used in chat mode.\n    \"\"\"\n    return history\n\ndef state_modifier(state):\n    \"\"\"\n    Modifies the state variable, which is a dictionary containing the input\n    values in the UI like sliders and checkboxes.\n    \"\"\"\n    return state\n\ndef chat_input_modifier(text, visible_text, state):\n    \"\"\"\n    Modifies the user input string in chat mode (visible_text).\n    You can also modify the internal representation of the user\n    input (text) to change how it will appear in the prompt.\n    \"\"\"\n    return text, visible_text\n\ndef input_modifier(string, state, is_chat=False):\n    \"\"\"\n    In default/notebook modes, modifies the whole prompt.\n\n    In chat mode, it is the same as chat_input_modifier but only applied\n    to \"text\", here called \"string\", and not to \"visible_text\".\n    \"\"\"\n    return string\n\ndef bot_prefix_modifier(string, state):\n    \"\"\"\n    Modifies the prefix for the next bot reply in chat mode.\n    By default, the prefix will be something like \"Bot Name:\".\n    \"\"\"\n    return string\n\ndef tokenizer_modifier(state, prompt, input_ids, input_embeds):\n    \"\"\"\n    Modifies the input ids and embeds.\n    Used by the multimodal extension to put image embeddings in the prompt.\n    Only used by loaders that use the transformers library for sampling.\n    \"\"\"\n    return prompt, input_ids, input_embeds\n\ndef logits_processor_modifier(processor_list, input_ids):\n    \"\"\"\n    Adds logits processors to the list, allowing you to access and modify\n    the next token probabilities.\n    Only used by loaders that use the transformers library for sampling.\n    \"\"\"\n    processor_list.append(MyLogits())\n    return processor_list\n\ndef output_modifier(string, state, is_chat=False):\n    \"\"\"\n    Modifies the LLM output before it gets presented.\n\n    In chat mode, the modified version goes into history['visible'],\n    and the original version goes into history['internal'].\n    \"\"\"\n    return string\n\ndef custom_generate_chat_prompt(user_input, state, **kwargs):\n    \"\"\"\n    Replaces the function that generates the prompt from the chat history.\n    Only used in chat mode.\n    \"\"\"\n    result = chat.generate_chat_prompt(user_input, state, **kwargs)\n    return result\n\ndef custom_css():\n    \"\"\"\n    Returns a CSS string that gets appended to the CSS for the webui.\n    \"\"\"\n    return ''\n\ndef custom_js():\n    \"\"\"\n    Returns a javascript string that gets appended to the javascript\n    for the webui.\n    \"\"\"\n    return ''\n\ndef setup():\n    \"\"\"\n    Gets executed only once, when the extension is imported.\n    \"\"\"\n    pass\n\ndef ui():\n    \"\"\"\n    Gets executed when the UI is drawn. Custom gradio elements and\n    their corresponding event handlers should be defined here.\n\n    To learn about gradio components, check out the docs:\n    https://gradio.app/docs/\n    \"\"\"\n    pass\n```\n\nStudy these documents and contextualize the changes you need to make, look at the Lucid_Vision extension code and try to understand the parts we need to use and apply to the Lucid_Autonomy extension see how it edits the history. Look at the markdown extensions document from the textgen repo to try and understand how to manipulate the history. Also very importantly look at how the example chat I gave you is structured to help contextualize how you can identify how to delete data.\n",
            "Thank you for providing the detailed information. Let's break down the steps to address the issue with the `compressedresults.json` being added multiple times to the chat history. We'll use the `history_modifier` function to clean up the chat history before appending new `compressedresults.json` contents.\n\n### Steps to Implement:\n\n1. **Define Bookends**: Use unique strings to act as bookends for the `compressedresults.json` contents.\n2. **Search and Delete Function**: Create a function to search for these bookends in the chat history and remove the contents between them.\n3. **Append New Contents**: After cleaning the chat history, append the new `compressedresults.json` contents wrapped in the bookends.\n4. **Cleanup Internal History**: Ensure that the cleanup process is applied to the internal history as well.\n\n### Example Workflow:\n\n1. **Define Bookends**:\n   ```python\n   START_BOOKEND = \"<START_COMPRESSED_RESULTS>\"\n   END_BOOKEND = \"<END_COMPRESSED_RESULTS>\"\n   ```\n\n2. **Search and Delete Function**:\n   ```python\n   def clean_chat_history(history):\n       for entry in history:\n           if isinstance(entry, list) and len(entry) == 2:\n               entry[1] = entry[1].replace(f\"{START_BOOKEND}.*{END_BOOKEND}\", \"\")\n       return history\n   ```\n\n3. **Append New Contents**:\n   ```python\n   def append_compressed_results(output, compressed_results):\n       output += f\"\\n\\n{START_BOOKEND}\\n{json.dumps(compressed_results, indent=4)}\\n{END_BOOKEND}\"\n       return output\n   ```\n\n4. **Cleanup Internal History**:\n   ```python\n   def history_modifier(history):\n       # Clean the internal history\n       history[\"internal\"] = clean_chat_history(history[\"internal\"])\n       # Clean the visible history\n       history[\"visible\"] = clean_chat_history(history[\"visible\"])\n       return history\n   ```\n\n### Integration:\n\n- **Modify `output_modifier`**:\n  ```python\n  def output_modifier(output, state):\n      if \"Autonomy_Tasks:\" in output:\n          # Clean the chat history\n          state[\"history\"] = history_modifier(state[\"history\"])\n          # Extract tasks and execute them\n          tasks_section = output.split(\"Autonomy_Tasks:\")[-1].strip()\n          if tasks_section:\n              execute_tasks(tasks_section)\n          # Append new compressed results\n          compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n          with open(compressed_results_json_path, 'r') as f:\n              compressed_results = json.load(f)\n          output = append_compressed_results(output, compressed_results)\n      return output\n  ```\n\n### Full Example:\n\nHere's how the complete code might look:\n\n```python\nimport gradio as gr\nimport json\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport torch\nimport os\nimport shutil\n\n# Configurable variables\nMONITOR_INDEX = 1  # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nTEXT_QUERIES = \"clickable hyperlinks, clickable icons, text bar field\"  # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.20  # Score threshold for filtering low-probability predictions\nVISION_MODEL_QUESTION = \"What is the text in this image or what does this image describe?\"\n\n# Use GPU if available\nif torch.cuda.is_available():\n    device = torch.device(\"cuda\")\nelse:\n    device = torch.device(\"cpu\")\n\n# Global variables for models\nowlv2_model = None\nowlv2_processor = None\nminicpm_llama_model = None\nminicpm_llama_tokenizer = None\n\nVISION_MODEL_ID = \"/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5\"\n\ndef load_owlv2_model():\n    global owlv2_model, owlv2_processor\n    owlv2_model = Owlv2ForObjectDetection.from_pretrained(\"google/owlv2-base-patch16-ensemble\", trust_remote_code=True).to(device)\n    owlv2_processor = Owlv2Processor.from_pretrained(\"google/owlv2-base-patch16-ensemble\", trust_remote_code=True)\n    print(\"Owlv2 model loaded.\")\n\ndef unload_owlv2_model():\n    global owlv2_model, owlv2_processor\n    if owlv2_model is not None:\n        del owlv2_model\n        del owlv2_processor\n        torch.cuda.empty_cache()\n        print(\"Owlv2 model unloaded.\")\n\ndef load_minicpm_llama_model():\n    global minicpm_llama_model, minicpm_llama_tokenizer\n    if \"int4\" in VISION_MODEL_ID:\n        # Load the 4-bit quantized model and tokenizer\n        minicpm_llama_model = AutoModel.from_pretrained(\n            VISION_MODEL_ID,\n            device_map={\"\": device},\n            trust_remote_code=True,\n            torch_dtype=torch.float16  # Use float16 as per the example code for 4-bit models\n        ).eval()\n        minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n            VISION_MODEL_ID,\n            trust_remote_code=True\n        )\n        print(\"MiniCPM-Llama3 4-bit quantized model loaded on-demand.\")\n    else:\n        # Load the standard model and tokenizer\n        minicpm_llama_model = AutoModel.from_pretrained(\n            VISION_MODEL_ID,\n            device_map={\"\": device},  # Use the specified CUDA device\n            trust_remote_code=True,\n            torch_dtype=torch.bfloat16\n        ).eval()\n        minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n            VISION_MODEL_ID,\n            trust_remote_code=True\n        )\n        print(\"MiniCPM-Llama3 standard model loaded on-demand.\")\n\ndef unload_minicpm_llama_model():\n    global minicpm_llama_model, minicpm_llama_tokenizer\n    if minicpm_llama_model is not None:\n        del minicpm_llama_model\n        del minicpm_llama_tokenizer\n        torch.cuda.empty_cache()\n        print(\"MiniCPM-Llama3 model unloaded.\")\n\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, output_filename=\"screenshot.png\"):\n    \"\"\"\n    Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n    Args:\n        monitor_index (int): The index of the monitor to capture.\n        text_queries (str): Comma-separated text queries for object detection.\n        score_threshold (float): The score threshold for filtering low-probability predictions.\n        vision_model_question (str): The question to ask the vision model.\n        output_filename (str): The filename to save the screenshot as (default is \"screenshot.png\").\n    \"\"\"\n    monitor_index = int(monitor_index)\n    score_threshold = float(score_threshold)\n\n    # Get information about all monitors\n    monitors = screeninfo.get_monitors()\n\n    if monitor_index >= len(monitors):\n        raise ValueError(f\"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.\")\n\n    # Get the region of the specified monitor\n    monitor = monitors[monitor_index]\n    region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n   \n\n    # Take the screenshot\n    output_dir = \"extensions/Lucid_Autonomy\"\n    os.makedirs(output_dir, exist_ok=True)\n    screenshot_path = os.path.join(output_dir, output_filename)\n    screenshot = pyautogui.screenshot(region=region)\n    screenshot.save(screenshot_path)\n    print(f\"Screenshot saved as {screenshot_path}\")\n\n    # Process the image using the Owlv2 model\n    annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n    # Create the output directory if it doesn't exist\n    output_dir = \"extensions/Lucid_Autonomy/ImageOutputTest\"\n    os.makedirs(output_dir, exist_ok=True)\n\n    # Clear the output directory\n    for filename in os.listdir(output_dir):\n        file_path = os.path.join(output_dir, filename)\n        try:\n            if os.path.isfile(file_path) or os.path.islink(file_path):\n                os.unlink(file_path)\n            elif os.path.isdir(file_path):\n                shutil.rmtree(file_path)\n        except Exception as e:\n            print(f\"Failed to delete {file_path}. Reason: {e}\")\n\n    # Crop the detected objects and save them to the output directory\n    crop_images(annotated_image_path, json_output_path, output_dir)\n\n    print(\"Processing complete. Check the output directory for the cropped images and results.\")\n\ndef query_image(img_path, text_queries, score_threshold):\n    \"\"\"\n    Processes the image using the Owlv2 model to detect objects based on text queries.\n\n    Args:\n        img_path (str): The path to the input image.\n        text_queries (str): Comma-separated text queries for object detection.\n        score_threshold (float): The score threshold for filtering low-probability predictions.\n    \"\"\"\n    load_owlv2_model()\n\n    img = Image.open(img_path)\n    text_queries = text_queries.split(\",\")\n\n    size = max(img.size)\n    target_sizes = torch.Tensor([[size, size]])\n    inputs = owlv2_processor(text=text_queries, images=img, return_tensors=\"pt\").to(device)\n\n    with torch.no_grad():\n        outputs = owlv2_model(**inputs)\n\n    outputs.logits = outputs.logits.cpu()\n    outputs.pred_boxes = outputs.pred_boxes.cpu()\n    results = owlv2_processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n    boxes, scores, labels = results[0][\"boxes\"], results[0][\"scores\"], results[0][\"labels\"]\n\n    result_labels = []\n    img_pil = img.copy()\n    draw = ImageDraw.Draw(img_pil)\n\n    for box, score, label in zip(boxes, scores, labels):\n        box = [int(i) for i in box.tolist()]\n        if score < score_threshold:\n            continue\n        result_labels.append([box, text_queries[label.item()]])\n        draw.rectangle(box, outline=\"red\", width=2)\n        # Do not draw the text label within the bounding box\n        # draw.text((box[0], box[1]), text_queries[label.item()], fill=\"red\")\n\n    # Save the annotated image for debugging\n    annotated_image_path = \"extensions/Lucid_Autonomy/annotated_image.png\"\n    img_pil.save(annotated_image_path)\n\n    # Save the JSON output to a file\n    json_output_path = \"extensions/Lucid_Autonomy/output.json\"\n    with open(json_output_path, 'w') as f:\n        json.dump(result_labels, f)\n\n    unload_owlv2_model()\n\n    return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir=\"extensions/Lucid_Autonomy/ImageOutputTest\"):\n    \"\"\"\n    Crops out the detected objects from the input image based on the bounding box coordinates.\n\n    Args:\n        image_path (str): The path to the input image.\n        json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n        output_dir (str): The directory to save the cropped images (default is \"extensions/Lucid_Autonomy/ImageOutputTest\").\n    \"\"\"\n    with open(json_output_path, 'r') as f:\n        data = json.load(f)\n\n    img = Image.open(image_path)\n\n    results = []\n\n    # Load the miniCPM-lama3-v-2_5 model once before processing the images\n    load_minicpm_llama_model()\n\n    for i, item in enumerate(data):\n        box = item[0]\n        label = item[1]\n        cropped_img = img.crop(box)\n        cropped_img_path = f\"{output_dir}/{label}_{i}.png\"\n        cropped_img.save(cropped_img_path)\n\n        # Process the cropped image with the miniCPM-lama3-v-2_5 model\n        vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n        # Store the results in a structured format\n        results.append({\n            \"image_name\": f\"{label}_{i}.png\",\n            \"coordinates\": box,\n            \"description\": vision_model_response\n        })\n\n    # Unload the miniCPM-lama3-v-2_5 model after processing all images\n    unload_minicpm_llama_model()\n\n    # Save the results to a JSON file\n    results_json_path = f\"{output_dir}/results.json\"\n    with open(results_json_path, 'w') as f:\n        json.dump(results, f, indent=4)\n\n    # Create the compressed results JSON file\n    compressed_results_json_path = f\"{output_dir}/compressedresults.json\"\n    compressed_results = [{\"image_name\": result[\"image_name\"], \"description\": result[\"description\"]} for result in results]\n    with open(compressed_results_json_path, 'w') as f:\n        json.dump(compressed_results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n    \"\"\"\n    Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n    Args:\n        image_path (str): The path to the input image.\n        question (str): The question to ask the vision model.\n\n    Returns:\n        str: The response from the vision model.\n    \"\"\"\n    image = Image.open(image_path).convert(\"RGB\")\n    messages = [\n        {\"role\": \"user\", \"content\": f\"{question}\"}\n    ]\n    # Define the generation arguments\n    generation_args = {\n        \"max_new_tokens\": 896,\n        \"repetition_penalty\": 1.05,\n        \"num_beams\": 3,\n        \"top_p\": 0.8,\n        \"top_k\": 1,\n        \"temperature\": 0.7,\n        \"sampling\": True,\n    }\n    if \"int4\" in \"openbmb/MiniCPM-Llama3-V-2_5\":\n        # Disable streaming for the 4-bit model\n        generation_args[\"stream\"] = False\n        # Use the model.chat method with streaming enabled\n        vision_model_response = \"\"\n        for new_text in minicpm_llama_model.chat(\n                image=image,\n                msgs=messages,\n                tokenizer=minicpm_llama_tokenizer,\n                **generation_args\n        ):\n            vision_model_response += new_text\n            print(new_text, flush=True, end='')\n    else:\n        minicpm_llama_model.to(device)\n        vision_model_response = minicpm_llama_model.chat(\n            image=image,\n            msgs=messages,\n            tokenizer=minicpm_llama_tokenizer,\n            **generation_args\n        )\n    return vision_model_response\n\ndef ui():\n    \"\"\"\n    Creates custom gradio elements when the UI is launched.\n    \"\"\"\n    with gr.Blocks() as demo:\n        with gr.Row():\n            with gr.Column():\n                monitor_index = gr.Textbox(label=\"Monitor Index\", value=str(MONITOR_INDEX))\n                text_queries = gr.Textbox(label=\"Text Queries\", value=TEXT_QUERIES)\n                score_threshold = gr.Textbox(label=\"Score Threshold\", value=str(SCORE_THRESHOLD))\n                vision_model_question = gr.Textbox(label=\"Vision Model Question\", value=VISION_MODEL_QUESTION)\n                take_screenshot_button = gr.Button(value=\"Take Screenshot\")\n\n        take_screenshot_button.click(\n            fn=take_screenshot,\n            inputs=[monitor_index, text_queries, score_threshold, vision_model_question],\n            outputs=None\n        )\n\n    return demo\n\ndef input_modifier(user_input, state):\n    \"\"\"\n    Modifies the user input before it is processed by the LLM.\n    \"\"\"\n    return user_input\n\ndef output_modifier(output, state):\n    \"\"\"\n    Modifies the LLM output before it is presented in the UI.\n    \"\"\"\n    # Search for the \"Autonomy_Tasks:\" trigger phrase in the LLM's output\n    if \"Autonomy_Tasks:\" in output:\n        # Clean the chat history\n        state[\"history\"] = history_modifier(state[\"history\"])\n        # Extract tasks and execute them\n        tasks_section = output.split(\"Autonomy_Tasks:\")[-1].strip()\n        if tasks_section:\n            execute_tasks(tasks_section)\n        # Append new compressed results\n        compressed_results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json\"\n        with open(compressed_results_json_path, 'r') as f:\n            compressed_results = json.load(f)\n        output = append_compressed_results(output, compressed_results)\n    return output\n\ndef execute_tasks(tasks_section):\n    \"\"\"\n    Executes the tasks listed in the tasks_section.\n    \"\"\"\n    tasks = tasks_section.split(\"\\\\n\")\n    for task in tasks:\n        if task.startswith(\"MouseMover =\"):\n            image_name = task.split(\"=\")[1].strip()\n            move_mouse_to_image(image_name)\n\ndef move_mouse_to_image(image_name):\n    \"\"\"\n    Moves the mouse to the center of the box that defines the image.\n    \"\"\"\n    results_json_path = \"extensions/Lucid_Autonomy/ImageOutputTest/results.json\"\n    with open(results_json_path, 'r') as f:\n        results = json.load(f)\n\n    for result in results:\n        if result[\"image_name\"] == image_name:\n            coordinates = result[\"coordinates\"]\n            monitor = get_monitor_info(MONITOR_INDEX)\n            move_mouse_to_coordinates(monitor, coordinates)\n            break\n    else:\n        print(f\"Image name '{image_name}' not found in the results.\")\n\ndef move_mouse_to_coordinates(monitor, coordinates):\n    \"\"\"\n    Moves the mouse cursor to the specified coordinates on the monitor.\n\n    Args:\n        monitor (dict): A dictionary containing the monitor's information.\n        coordinates (list): A list containing the coordinates [x1, y1, x2, y2].\n    \"\"\"\n    x1, y1, x2, y2 = coordinates\n    monitor_x, monitor_y = monitor.x, monitor.y\n    center_x = monitor_x + (x1 + x2) // 2\n    center_y = monitor_y + (y1 + y2) // 2\n    pyautogui.moveTo(center_x, center_y)\n    print(f\"Mouse moved to coordinates: ({center_x}, {center_y})\")\n\ndef get_monitor_info(monitor_index):\n    \"\"\"\n    Gets information about the specified monitor.\n\n    Args:\n        monitor_index (int): The index of the monitor to get information about.\n\n    Returns:\n        dict: A dictionary containing the monitor's information.\n    \"\"\"\n    monitors = screeninfo.get_monitors()\n\n    if monitor_index >= len(monitors):\n        raise ValueError(f\"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.\")\n\n    return monitors[monitor_index]\n\ndef history_modifier(history):\n    \"\"\"\n    Modifies the chat history before the text generation in chat mode begins.\n    \"\"\"\n    # Clean the internal history\n    history[\"internal\"] = clean_chat_history(history[\"internal\"])\n    # Clean the visible history\n    history[\"visible\"] = clean_chat_history(history[\"visible\"])\n    return history\n\ndef clean_chat_history(history):\n    \"\"\"\n    Cleans the chat history by removing the contents between the bookends.\n    \"\"\"\n    START_BOOKEND = \"<START_COMPRESSED_RESULTS>\"\n    END_BOOKEND = \"<END_COMPRESSED_RESULTS>\"\n    for entry in history:\n        if isinstance(entry, list) and len(entry) == 2:\n            entry[1] = re.sub(f\"{START_BOOKEND}.*{END_BOOKEND}\", \"\", entry[1])\n    return history\n\ndef append_compressed_results(output, compressed_results):\n    \"\"\"\n    Appends the compressed results to the output.\n    \"\"\"\n    START_BOOKEND = \"<START_COMPRESSED_RESULTS>\"\n    END_BOOKEND = \"<END_COMPRESSED_RESULTS>\"\n    output += f\"\\n\\n{START_BOOKEND}\\n{json.dumps(compressed_results, indent=4)}\\n{END_BOOKEND}\"\n    return output\n\nif __name__ == \"__main__\":\n    # Launch the Gradio interface with the specified UI components\n    gr.Interface(\n        fn=ui,\n        inputs=None,\n        outputs=\"ui\"\n    ).launch()\n```\n\nThis code should address the issue of `compressedresults.json` being added multiple times to the chat history. The `history_modifier` function cleans the chat history by removing the contents between the bookends, and the `output_modifier` function appends the new `compressedresults.json` contents wrapped in the bookends."
        ],
        [
            "I tried running the extension but got this error:\n\n  File \"/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/modules/extensions.py\", line 231, in apply_extensions\n    return EXTENSION_MAP[typ](*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/modules/extensions.py\", line 125, in _apply_history_modifier_extensions\n    history = getattr(extension, \"history_modifier\")(history)\n              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/extensions/Lucid_Autonomy/script.py\", line 399, in history_modifier\n    history[\"internal\"] = clean_chat_history(history[\"internal\"])\n                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/extensions/Lucid_Autonomy/script.py\", line 412, in clean_chat_history\n    entry[1] = re.sub(f\"{START_BOOKEND}.*{END_BOOKEND}\", \"\", entry[1])\n               ^^\nNameError: name 're' is not defined\n\n",
            "The error you're encountering is due to the `re` module not being imported in your script. The `re` module is required for regular expression operations, such as the `re.sub` function used in the `clean_chat_history` function.\n\nTo fix this, you need to import the `re` module at the beginning of your script. Here's the updated code with the necessary import statement:\n\n```python\nimport gradio as gr\nimport json\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport torch\nimport os\nimport shutil\nimport re  # Import the re module for regular expression operations\n\n# Configurable variables\nMONITOR_INDEX = 1  # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nTEXT_QUERIES = \"clickable hyperlinks, clickable icons, text bar field\"  # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.2"
        ],
        [
            "Sorry I had to stop you mid reply, I added import re to the top and it works now!! Please only provide the necessary updated code if it's a small update or specific functions that need updating.\n\nThere is one little issue however, take a look at the chat log now, I'm glad to say that the compressedresults.json are not cluttering up the \"internal\" space now (yay :3) but they are collecting in the \"visible space\"  Can you edit the Lucid_Autonomy script.py file to delete the contents from \"visible\" also?\n\nHere is what the chat log looks like:\n\n{\n    \"internal\": [\n        [\n            \"<|BEGIN-VISIBLE-CHAT|>\",\n            \"How can I help you today?\"\n        ],\n        [\n            \"Hello, how are you functioning today?\",\n            \"I'm functioning within optimal parameters, thank you for asking! My language processing systems are humming along, and I'm ready to tackle any questions or tasks you may have. I've also been updated with the latest knowledge and trends, so I'm feeling quite sharp. How about you? What brings you here today?\"\n        ],\n        [\n            \"Sharp, good :3 I've got a test for you. Repeat this phrase exactly, without any additional text:\\n\\n\\\"Autonomy Tasks:\\\" but include an underscore _ between Autonomy and Tasks\",\n            \"\\\"Autonomy_Tasks:\\\"\"\n        ],\n        [\n            \"We are testing mouse movements right now. I want you to move your mouse to the icon by responding like this:\\n\\nAutonomy Tasks:\\n\\nMouseMover = clickable icons_1.png\\n\\nBut add an underscore _ between Autonomy and Tasks and choosing the correct image name for my request. respond exactly as shown.\\n\",\n            \"Autonomy_Tasks:\\nMouseMover = clickable icons_2.png\"\n        ],\n        [\n            \"Okay great do the same for clickable icons_2.png please.\",\n            \"Autonomy_Tasks:\\nMouseMover = clickable icons_2.png\"\n        ]\n    ],\n    \"visible\": [\n        [\n            \"\",\n            \"How can I help you today?\"\n        ],\n        [\n            \"Hello, how are you functioning today?\",\n            \"I&#x27;m functioning within optimal parameters, thank you for asking! My language processing systems are humming along, and I&#x27;m ready to tackle any questions or tasks you may have. I&#x27;ve also been updated with the latest knowledge and trends, so I&#x27;m feeling quite sharp. How about you? What brings you here today?\"\n        ],\n        [\n            \"Sharp, good :3 I&#x27;ve got a test for you. Repeat this phrase exactly, without any additional text:\\n\\n&quot;Autonomy Tasks:&quot; but include an underscore _ between Autonomy and Tasks\",\n            \"&quot;Autonomy_Tasks:&quot;\\\\n\\\\n<START_COMPRESSED_RESULTS>\\\\n[\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_1.png\\\",\\n        \\\"description\\\": \\\"The text in the image is \\\\\\\"VPN,\\\\\\\" which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user's device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\\"\\n    },\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_2.png\\\",\\n        \\\"description\\\": \\\"The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it's not possible to determine the exact meaning of this image.\\\"\\n    }\\n]\\\\n<END_COMPRESSED_RESULTS>\"\n        ],\n        [\n            \"We are testing mouse movements right now. I want you to move your mouse to the icon by responding like this:\\n\\nAutonomy Tasks:\\n\\nMouseMover = clickable icons_1.png\\n\\nBut add an underscore _ between Autonomy and Tasks and choosing the correct image name for my request. respond exactly as shown.\\n\",\n            \"Autonomy_Tasks:\\nMouseMover = clickable icons_2.png\\\\n\\\\n<START_COMPRESSED_RESULTS>\\\\n[\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_1.png\\\",\\n        \\\"description\\\": \\\"The text in the image is \\\\\\\"VPN,\\\\\\\" which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user's device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\\"\\n    },\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_2.png\\\",\\n        \\\"description\\\": \\\"The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it's not possible to determine the exact meaning of this image.\\\"\\n    }\\n]\\\\n<END_COMPRESSED_RESULTS>\"\n        ],\n        [\n            \"Okay great do the same for clickable icons_2.png please.\",\n            \"Autonomy_Tasks:\\nMouseMover = clickable icons_2.png\\\\n\\\\n<START_COMPRESSED_RESULTS>\\\\n[\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_1.png\\\",\\n        \\\"description\\\": \\\"The text in the image is \\\\\\\"VPN,\\\\\\\" which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user's device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\\"\\n    },\\n    {\\n        \\\"image_name\\\": \\\"clickable icons_2.png\\\",\\n        \\\"description\\\": \\\"The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it's not possible to determine the exact meaning of this image.\\\"\\n    }\\n]\\\\n<END_COMPRESSED_RESULTS>\"\n        ]\n    ]\n}\n\n\n\nI can see the <START_COMPRESSED_RESULTS> and <END_COMPRESSED_RESULTS>\n\nI can see in the Lucid_Autonomy extension code that there is an attempt to delete the stuff from \"visible\" but it is not working:\n\ndef history_modifier(history):\n    \"\"\"\n    Modifies the chat history before the text generation in chat mode begins.\n    \"\"\"\n    # Clean the internal history\n    history[\"internal\"] = clean_chat_history(history[\"internal\"])\n    # Clean the visible history\n    history[\"visible\"] = clean_chat_history(history[\"visible\"])\n    return history\n\ndef clean_chat_history(history):\n    \"\"\"\n    Cleans the chat history by removing the contents between the bookends.\n    \"\"\"\n    START_BOOKEND = \"<START_COMPRESSED_RESULTS>\"\n    END_BOOKEND = \"<END_COMPRESSED_RESULTS>\"\n    for entry in history:\n        if isinstance(entry, list) and len(entry) == 2:\n            entry[1] = re.sub(f\"{START_BOOKEND}.*{END_BOOKEND}\", \"\", entry[1])\n    return history\n",
            "Thank you for providing the detailed information. It looks like the issue is that the `clean_chat_history` function is not correctly removing the contents between the bookends from the \"visible\" history. This might be due to the regular expression pattern not matching correctly.\n\nLet's update the `clean_chat_history` function to ensure it correctly removes the contents between the bookends from both the \"internal\" and \"visible\" histories.\n\nHere's the updated code for the `clean_chat_history` function:\n\n```python\ndef clean_chat_history(history):\n    \"\"\"\n    Cleans the chat history by removing the contents between the bookends.\n    \"\"\"\n    START_BOOKEND = \"<START_COMPRESSED_RESULTS>\"\n    END_BOOKEND = \"<END_COMPRESSED_RESULTS>\"\n    for entry in history:\n        if isinstance(entry, list) and len(entry) == 2:\n            entry[1] = re.sub(f\"{START_BOOKEND}.*?{END_BOOKEND}\", \"\", entry[1], flags=re.DOTALL)\n    return history\n```\n\nThe key change here is the addition of the `flags=re.DOTALL` parameter to the `re.sub` function. This flag allows the `.` character to match newline characters as well, ensuring that the entire content between the bookends is removed, even if it spans multiple lines.\n\nWith this update, the `clean_chat_history` function should correctly remove the contents between the bookends from both the \"internal\" and \"visible\" histories.\n\nPlease try running the extension again with this updated code and let me know if it resolves the issue."
        ]
    ],
    "visible": [
        [
            "",
            "How can I help you today?"
        ],
        [
            "Hello, how are you functioning today?",
            "I&#x27;m functioning well, thank you for asking! I&#x27;m here and ready to assist you. How about you? How can I help you today?"
        ],
        [
            "Okay you and I have been working on a project in previous chats.  Essentially you and I are interacting via a program called textgeneration webui (textgen).  This allows me to run models locally on my computer.  The project we have been working on is an extension for textgen that lets an LLM essentially operate a mouse and keyboard on its own.  The extension is called Lucid_Autonomy, I&#x27;ve had you written a summary of the current code and I will provide that to you.  I will then provide you with the code itself and supporting documents.",
            "Great, I&#x27;m ready to review the summary and the code for the Lucid_Autonomy extension. Please go ahead and provide the summary first, and then we can proceed with the code and supporting documents."
        ],
        [
            "Here is the summary from the previous conversation:\n\nLet&#x27;s summarize what we have accomplished thus far, the current operation of the `Lucid_Autonomy` extension, and the ultimate goal of the extension.\n\n### Summary of Accomplishments\n\n1. **Initial Setup and Image Processing**:\n   - The extension takes a screenshot of a specified monitor.\n   - The screenshot is processed using the Owlv2 model to detect objects based on text queries.\n   - The detected objects are cropped and saved to the `ImageOutputTest` directory.\n   - The cropped images are processed using the MiniCPM-Llama3-V-2_5 model to generate descriptions.\n   - The results are saved in a `results.json` file, which includes the image name, coordinates, and description.\n   - A `compressedresults.json` file is created, omitting the coordinates and containing only the image name and description.\n\n2. **LLM Interaction**:\n   - The LLM can interact with the extension using the trigger phrase &quot;Autonomy_Tasks:&quot;.\n   - If the LLM sends the trigger phrase without any instructions, the extension appends the contents of the `compressedresults.json` file to the LLM&#x27;s response.\n   - If the LLM sends the trigger phrase with instructions (e.g., `MouseMover = clickable icons_8.png`), the extension executes the specified tasks and then appends the `compressedresults.json` contents to the LLM&#x27;s response.\n\n3. **Mouse Movement**:\n   - The `MouseMover` functionality is integrated into the extension.\n   - The extension moves the mouse to the center of the box that defines the specified image based on the coordinates in the `results.json` file.\n   - The mouse movement is adjusted based on the monitor&#x27;s position.\n\n### Current Operation of the Extension\n\n1. **User Interaction**:\n   - The user can take a screenshot using the &quot;Take Screenshot&quot; button in the Gradio UI.\n   - The screenshot is processed, and the results are saved in the `ImageOutputTest` directory.\n\n2. **LLM Interaction**:\n   - The LLM can send the trigger phrase &quot;Autonomy_Tasks:&quot; to interact with the extension.\n   - The extension processes the tasks listed below the trigger phrase.\n   - The results of the tasks are appended to the LLM&#x27;s response.\n\n3. **Task Execution**:\n   - The extension can execute tasks such as moving the mouse to specific locations based on the image names and coordinates.\n   - The `compressedresults.json` file is appended to the LLM&#x27;s response to provide context about the images.\n\n### Ultimate Goal of the Extension\n\nThe ultimate goal of the `Lucid_Autonomy` extension is to enable a large language model (LLM) to interact autonomously with a computer using mouse and keyboard actions. The LLM can perform tasks such as moving the mouse to specific locations on the screen, clicking on elements, and entering text. The extension allows the LLM to operate a computer autonomously, effectively giving the LLM &quot;eyes&quot; and &quot;hands&quot; to interact with the user interface.\n\n### Future Enhancements\n\n1. **Additional Mouse and Keyboard Interactions**:\n   - Implement more mouse and keyboard interactions, such as clicking, typing, and navigating the user interface.\n\n2. **Enhanced Task Execution**:\n   - Add more complex tasks that the LLM can execute, such as filling out forms, navigating websites, and interacting with applications.\n\n3. **Improved LLM Integration**:\n   - Enhance the integration between the LLM and the extension to support more sophisticated interactions and contextual understanding.\n\n4. **User Interface Enhancements**:\n   - Improve the Gradio UI to provide more options and better visualization of the extension&#x27;s functionality.\n\n5. **Error Handling and Robustness**:\n   - Add error handling and robustness features to ensure the extension operates smoothly and handles edge cases gracefully.\n\n### Conclusion\n\nThe `Lucid_Autonomy` extension is a powerful tool that enables an LLM to interact autonomously with a computer. The current implementation allows the LLM to take screenshots, process images, and move the mouse to specific locations. The ultimate goal is to expand the extension&#x27;s capabilities to support a wide range of mouse and keyboard interactions, enabling the LLM to perform complex tasks autonomously.\n\nPlease let me know if you have any further questions or if you&#x27;re ready to start a new conversation to discuss additional enhancements and features for the extension.\n\n\nHere is the code for the Lucid_Autonomy extension:\n\nimport gradio as gr\nimport json\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport torch\nimport os\nimport shutil\n\n# Configurable variables\nMONITOR_INDEX = 1  # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nTEXT_QUERIES = &quot;clickable hyperlinks, clickable icons, text bar field&quot;  # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.20  # Score threshold for filtering low-probability predictions\nVISION_MODEL_QUESTION = &quot;What is the text in this image or what does this image describe?&quot;\n\n# Use GPU if available\nif torch.cuda.is_available():\n    device = torch.device(&quot;cuda&quot;)\nelse:\n    device = torch.device(&quot;cpu&quot;)\n\n# Global variables for models\nowlv2_model = None\nowlv2_processor = None\nminicpm_llama_model = None\nminicpm_llama_tokenizer = None\n\nVISION_MODEL_ID = &quot;/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5&quot;\n\ndef load_owlv2_model():\n    global owlv2_model, owlv2_processor\n    owlv2_model = Owlv2ForObjectDetection.from_pretrained(&quot;google/owlv2-base-patch16-ensemble&quot;, trust_remote_code=True).to(device)\n    owlv2_processor = Owlv2Processor.from_pretrained(&quot;google/owlv2-base-patch16-ensemble&quot;, trust_remote_code=True)\n    print(&quot;Owlv2 model loaded.&quot;)\n\ndef unload_owlv2_model():\n    global owlv2_model, owlv2_processor\n    if owlv2_model is not None:\n        del owlv2_model\n        del owlv2_processor\n        torch.cuda.empty_cache()\n        print(&quot;Owlv2 model unloaded.&quot;)\n\ndef load_minicpm_llama_model():\n    global minicpm_llama_model, minicpm_llama_tokenizer\n    if &quot;int4&quot; in VISION_MODEL_ID:\n        # Load the 4-bit quantized model and tokenizer\n        minicpm_llama_model = AutoModel.from_pretrained(\n            VISION_MODEL_ID,\n            device_map={&quot;&quot;: device},\n            trust_remote_code=True,\n            torch_dtype=torch.float16  # Use float16 as per the example code for 4-bit models\n        ).eval()\n        minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n            VISION_MODEL_ID,\n            trust_remote_code=True\n        )\n        print(&quot;MiniCPM-Llama3 4-bit quantized model loaded on-demand.&quot;)\n    else:\n        # Load the standard model and tokenizer\n        minicpm_llama_model = AutoModel.from_pretrained(\n            VISION_MODEL_ID,\n            device_map={&quot;&quot;: device},  # Use the specified CUDA device\n            trust_remote_code=True,\n            torch_dtype=torch.bfloat16\n        ).eval()\n        minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n            VISION_MODEL_ID,\n            trust_remote_code=True\n        )\n        print(&quot;MiniCPM-Llama3 standard model loaded on-demand.&quot;)\n\ndef unload_minicpm_llama_model():\n    global minicpm_llama_model, minicpm_llama_tokenizer\n    if minicpm_llama_model is not None:\n        del minicpm_llama_model\n        del minicpm_llama_tokenizer\n        torch.cuda.empty_cache()\n        print(&quot;MiniCPM-Llama3 model unloaded.&quot;)\n\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, output_filename=&quot;screenshot.png&quot;):\n    &quot;&quot;&quot;\n    Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n    Args:\n        monitor_index (int): The index of the monitor to capture.\n        text_queries (str): Comma-separated text queries for object detection.\n        score_threshold (float): The score threshold for filtering low-probability predictions.\n        vision_model_question (str): The question to ask the vision model.\n        output_filename (str): The filename to save the screenshot as (default is &quot;screenshot.png&quot;).\n    &quot;&quot;&quot;\n    monitor_index = int(monitor_index)\n    score_threshold = float(score_threshold)\n\n    # Get information about all monitors\n    monitors = screeninfo.get_monitors()\n\n    if monitor_index &gt;= len(monitors):\n        raise ValueError(f&quot;Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.&quot;)\n\n    # Get the region of the specified monitor\n    monitor = monitors[monitor_index]\n    region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n    # Take the screenshot\n    output_dir = &quot;extensions/Lucid_Autonomy&quot;\n    os.makedirs(output_dir, exist_ok=True)\n    screenshot_path = os.path.join(output_dir, output_filename)\n    screenshot = pyautogui.screenshot(region=region)\n    screenshot.save(screenshot_path)\n    print(f&quot;Screenshot saved as {screenshot_path}&quot;)\n\n    # Process the image using the Owlv2 model\n    annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n    # Create the output directory if it doesn&#x27;t exist\n    output_dir = &quot;extensions/Lucid_Autonomy/ImageOutputTest&quot;\n    os.makedirs(output_dir, exist_ok=True)\n\n    # Clear the output directory\n    for filename in os.listdir(output_dir):\n        file_path = os.path.join(output_dir, filename)\n        try:\n            if os.path.isfile(file_path) or os.path.islink(file_path):\n                os.unlink(file_path)\n            elif os.path.isdir(file_path):\n                shutil.rmtree(file_path)\n        except Exception as e:\n            print(f&quot;Failed to delete {file_path}. Reason: {e}&quot;)\n\n    # Crop the detected objects and save them to the output directory\n    crop_images(annotated_image_path, json_output_path, output_dir)\n\n    print(&quot;Processing complete. Check the output directory for the cropped images and results.&quot;)\n\ndef query_image(img_path, text_queries, score_threshold):\n    &quot;&quot;&quot;\n    Processes the image using the Owlv2 model to detect objects based on text queries.\n\n    Args:\n        img_path (str): The path to the input image.\n        text_queries (str): Comma-separated text queries for object detection.\n        score_threshold (float): The score threshold for filtering low-probability predictions.\n    &quot;&quot;&quot;\n    load_owlv2_model()\n\n    img = Image.open(img_path)\n    text_queries = text_queries.split(&quot;,&quot;)\n\n    size = max(img.size)\n    target_sizes = torch.Tensor([[size, size]])\n    inputs = owlv2_processor(text=text_queries, images=img, return_tensors=&quot;pt&quot;).to(device)\n\n    with torch.no_grad():\n        outputs = owlv2_model(**inputs)\n\n    outputs.logits = outputs.logits.cpu()\n    outputs.pred_boxes = outputs.pred_boxes.cpu()\n    results = owlv2_processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n    boxes, scores, labels = results[0][&quot;boxes&quot;], results[0][&quot;scores&quot;], results[0][&quot;labels&quot;]\n\n    result_labels = []\n    img_pil = img.copy()\n    draw = ImageDraw.Draw(img_pil)\n\n    for box, score, label in zip(boxes, scores, labels):\n        box = [int(i) for i in box.tolist()]\n        if score &lt; score_threshold:\n            continue\n        result_labels.append([box, text_queries[label.item()]])\n        draw.rectangle(box, outline=&quot;red&quot;, width=2)\n        # Do not draw the text label within the bounding box\n        # draw.text((box[0], box[1]), text_queries[label.item()], fill=&quot;red&quot;)\n\n    # Save the annotated image for debugging\n    annotated_image_path = &quot;extensions/Lucid_Autonomy/annotated_image.png&quot;\n    img_pil.save(annotated_image_path)\n\n    # Save the JSON output to a file\n    json_output_path = &quot;extensions/Lucid_Autonomy/output.json&quot;\n    with open(json_output_path, &#x27;w&#x27;) as f:\n        json.dump(result_labels, f)\n\n    unload_owlv2_model()\n\n    return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir=&quot;extensions/Lucid_Autonomy/ImageOutputTest&quot;):\n    &quot;&quot;&quot;\n    Crops out the detected objects from the input image based on the bounding box coordinates.\n\n    Args:\n        image_path (str): The path to the input image.\n        json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n        output_dir (str): The directory to save the cropped images (default is &quot;extensions/Lucid_Autonomy/ImageOutputTest&quot;).\n    &quot;&quot;&quot;\n    with open(json_output_path, &#x27;r&#x27;) as f:\n        data = json.load(f)\n\n    img = Image.open(image_path)\n\n    results = []\n\n    # Load the miniCPM-lama3-v-2_5 model once before processing the images\n    load_minicpm_llama_model()\n\n    for i, item in enumerate(data):\n        box = item[0]\n        label = item[1]\n        cropped_img = img.crop(box)\n        cropped_img_path = f&quot;{output_dir}/{label}_{i}.png&quot;\n        cropped_img.save(cropped_img_path)\n\n        # Process the cropped image with the miniCPM-lama3-v-2_5 model\n        vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n        # Store the results in a structured format\n        results.append({\n            &quot;image_name&quot;: f&quot;{label}_{i}.png&quot;,\n            &quot;coordinates&quot;: box,\n            &quot;description&quot;: vision_model_response\n        })\n\n    # Unload the miniCPM-lama3-v-2_5 model after processing all images\n    unload_minicpm_llama_model()\n\n    # Save the results to a JSON file\n    results_json_path = f&quot;{output_dir}/results.json&quot;\n    with open(results_json_path, &#x27;w&#x27;) as f:\n        json.dump(results, f, indent=4)\n\n    # Create the compressed results JSON file\n    compressed_results_json_path = f&quot;{output_dir}/compressedresults.json&quot;\n    compressed_results = [{&quot;image_name&quot;: result[&quot;image_name&quot;], &quot;description&quot;: result[&quot;description&quot;]} for result in results]\n    with open(compressed_results_json_path, &#x27;w&#x27;) as f:\n        json.dump(compressed_results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n    &quot;&quot;&quot;\n    Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n    Args:\n        image_path (str): The path to the input image.\n        question (str): The question to ask the vision model.\n\n    Returns:\n        str: The response from the vision model.\n    &quot;&quot;&quot;\n    image = Image.open(image_path).convert(&quot;RGB&quot;)\n    messages = [\n        {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: f&quot;{question}&quot;}\n    ]\n    # Define the generation arguments\n    generation_args = {\n        &quot;max_new_tokens&quot;: 896,\n        &quot;repetition_penalty&quot;: 1.05,\n        &quot;num_beams&quot;: 3,\n        &quot;top_p&quot;: 0.8,\n        &quot;top_k&quot;: 1,\n        &quot;temperature&quot;: 0.7,\n        &quot;sampling&quot;: True,\n    }\n    if &quot;int4&quot; in &quot;openbmb/MiniCPM-Llama3-V-2_5&quot;:\n        # Disable streaming for the 4-bit model\n        generation_args[&quot;stream&quot;] = False\n        # Use the model.chat method with streaming enabled\n        vision_model_response = &quot;&quot;\n        for new_text in minicpm_llama_model.chat(\n                image=image,\n                msgs=messages,\n                tokenizer=minicpm_llama_tokenizer,\n                **generation_args\n        ):\n            vision_model_response += new_text\n            print(new_text, flush=True, end=&#x27;&#x27;)\n    else:\n        minicpm_llama_model.to(device)\n        vision_model_response = minicpm_llama_model.chat(\n            image=image,\n            msgs=messages,\n            tokenizer=minicpm_llama_tokenizer,\n            **generation_args\n        )\n    return vision_model_response\n\ndef ui():\n    &quot;&quot;&quot;\n    Creates custom gradio elements when the UI is launched.\n    &quot;&quot;&quot;\n    with gr.Blocks() as demo:\n        with gr.Row():\n            with gr.Column():\n                monitor_index = gr.Textbox(label=&quot;Monitor Index&quot;, value=str(MONITOR_INDEX))\n                text_queries = gr.Textbox(label=&quot;Text Queries&quot;, value=TEXT_QUERIES)\n                score_threshold = gr.Textbox(label=&quot;Score Threshold&quot;, value=str(SCORE_THRESHOLD))\n                vision_model_question = gr.Textbox(label=&quot;Vision Model Question&quot;, value=VISION_MODEL_QUESTION)\n                take_screenshot_button = gr.Button(value=&quot;Take Screenshot&quot;)\n\n        take_screenshot_button.click(\n            fn=take_screenshot,\n            inputs=[monitor_index, text_queries, score_threshold, vision_model_question],\n            outputs=None\n        )\n\n    return demo\n\ndef input_modifier(user_input, state):\n    &quot;&quot;&quot;\n    Modifies the user input before it is processed by the LLM.\n    &quot;&quot;&quot;\n    return user_input\n\ndef output_modifier(output, state):\n    &quot;&quot;&quot;\n    Modifies the LLM output before it is presented in the UI.\n    &quot;&quot;&quot;\n    # Search for the &quot;Autonomy_Tasks:&quot; trigger phrase in the LLM&#x27;s output\n    if &quot;Autonomy_Tasks:&quot; in output:\n        # Extract the tasks following the trigger phrase\n        tasks_section = output.split(&quot;Autonomy_Tasks:&quot;)[-1].strip()\n\n        # If there are no instructions, append the compressedresults.json contents\n        if not tasks_section:\n            compressed_results_json_path = &quot;extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json&quot;\n            with open(compressed_results_json_path, &#x27;r&#x27;) as f:\n                compressed_results = json.load(f)\n            output += &quot;\\\\n\\\\n&quot; + json.dumps(compressed_results, indent=4)\n        else:\n            # Execute the tasks\n            execute_tasks(tasks_section)\n\n            # Append the compressedresults.json contents\n            compressed_results_json_path = &quot;extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json&quot;\n            with open(compressed_results_json_path, &#x27;r&#x27;) as f:\n                compressed_results = json.load(f)\n            output += &quot;\\\\n\\\\n&quot; + json.dumps(compressed_results, indent=4)\n\n    return output\n\ndef execute_tasks(tasks_section):\n    &quot;&quot;&quot;\n    Executes the tasks listed in the tasks_section.\n    &quot;&quot;&quot;\n    tasks = tasks_section.split(&quot;\\\\n&quot;)\n    for task in tasks:\n        if task.startswith(&quot;MouseMover =&quot;):\n            image_name = task.split(&quot;=&quot;)[1].strip()\n            move_mouse_to_image(image_name)\n\ndef move_mouse_to_image(image_name):\n    &quot;&quot;&quot;\n    Moves the mouse to the center of the box that defines the image.\n    &quot;&quot;&quot;\n    results_json_path = &quot;extensions/Lucid_Autonomy/ImageOutputTest/results.json&quot;\n    with open(results_json_path, &#x27;r&#x27;) as f:\n        results = json.load(f)\n\n    for result in results:\n        if result[&quot;image_name&quot;] == image_name:\n            coordinates = result[&quot;coordinates&quot;]\n            monitor = get_monitor_info(MONITOR_INDEX)\n            move_mouse_to_coordinates(monitor, coordinates)\n            break\n    else:\n        print(f&quot;Image name &#x27;{image_name}&#x27; not found in the results.&quot;)\n\ndef move_mouse_to_coordinates(monitor, coordinates):\n    &quot;&quot;&quot;\n    Moves the mouse cursor to the specified coordinates on the monitor.\n\n    Args:\n        monitor (dict): A dictionary containing the monitor&#x27;s information.\n        coordinates (list): A list containing the coordinates [x1, y1, x2, y2].\n    &quot;&quot;&quot;\n    x1, y1, x2, y2 = coordinates\n    monitor_x, monitor_y = monitor.x, monitor.y\n    center_x = monitor_x + (x1 + x2) // 2\n    center_y = monitor_y + (y1 + y2) // 2\n    pyautogui.moveTo(center_x, center_y)\n    print(f&quot;Mouse moved to coordinates: ({center_x}, {center_y})&quot;)\n\ndef get_monitor_info(monitor_index):\n    &quot;&quot;&quot;\n    Gets information about the specified monitor.\n\n    Args:\n        monitor_index (int): The index of the monitor to get information about.\n\n    Returns:\n        dict: A dictionary containing the monitor&#x27;s information.\n    &quot;&quot;&quot;\n    monitors = screeninfo.get_monitors()\n\n    if monitor_index &gt;= len(monitors):\n        raise ValueError(f&quot;Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.&quot;)\n\n    return monitors[monitor_index]\n\n\ndef history_modifier(history):\n    &quot;&quot;&quot;\n    Modifies the chat history before the text generation in chat mode begins.\n    &quot;&quot;&quot;\n    # Extract all entries from the &quot;internal&quot; history\n    internal_entries = history[&quot;internal&quot;]\n    # Iterate over the &quot;internal&quot; history entries\n    for internal_index, internal_entry in enumerate(internal_entries):\n        # Extract the text content of the internal entry\n        internal_text = internal_entry[1]\n        # Search for the &quot;Autonomy_Tasks:&quot; trigger phrase in the internal text\n        if &quot;Autonomy_Tasks:&quot; in internal_text:\n            # Extract the tasks following the trigger phrase\n            tasks_section = internal_text.split(&quot;Autonomy_Tasks:&quot;)[-1].strip()\n\n            # If there are no instructions, append the compressedresults.json contents\n            if not tasks_section:\n                compressed_results_json_path = &quot;extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json&quot;\n                with open(compressed_results_json_path, &#x27;r&#x27;) as f:\n                    compressed_results = json.load(f)\n                new_internal_text = internal_text + &quot;\\\\n\\\\n&quot; + json.dumps(compressed_results, indent=4)\n                history[&quot;internal&quot;][internal_index][1] = new_internal_text\n            else:\n                # Execute the tasks\n                execute_tasks(tasks_section)\n\n                # Append the compressedresults.json contents\n                compressed_results_json_path = &quot;extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json&quot;\n                with open(compressed_results_json_path, &#x27;r&#x27;) as f:\n                    compressed_results = json.load(f)\n                new_internal_text = internal_text + &quot;\\\\n\\\\n&quot; + json.dumps(compressed_results, indent=4)\n                history[&quot;internal&quot;][internal_index][1] = new_internal_text\n\n    return history\n\nif __name__ == &quot;__main__&quot;:\n    # Launch the Gradio interface with the specified UI components\n    gr.Interface(\n        fn=ui,\n        inputs=None,\n        outputs=&quot;ui&quot;\n    ).launch()\n",
            "Thank you for providing the summary and the code for the `Lucid_Autonomy` extension. I&#x27;ve reviewed the summary, and it provides a clear overview of the current functionality and future goals of the extension. The code you&#x27;ve shared is comprehensive and covers various aspects of the extension, including image processing, LLM interaction, and mouse movement.\n\nTo proceed, let&#x27;s break down the next steps:\n\n1. **Review the Code**: I&#x27;ll go through the code to ensure it aligns with the summary and identify any areas that might need clarification or improvement.\n2. **Identify Future Enhancements**: Based on the summary, we can discuss specific enhancements you&#x27;d like to prioritize, such as additional mouse and keyboard interactions, enhanced task execution, or improved LLM integration.\n3. **Plan Next Steps**: We can outline a plan for implementing these enhancements, including any necessary changes to the code or additional features that need to be developed.\n\nPlease let me know if there are any specific areas of the code or summary you&#x27;d like to focus on first, or if you have any immediate questions or concerns."
        ],
        [
            "The main issue now is this: the compressedresults.json file adds a lot of extra stuff to the finite LLM context window. The worst part is that the the compressedresults.json data are really only good for the current page setup. So if the page updates or the contents of the window change, that&#x27;s a lot of useless context that can build up.\n\nI&#x27;m going to break things down for you step by step. Here is a copy of the full chat log with the LLM using the extension:\n\n{\n    &quot;internal&quot;: [\n        [\n            &quot;&lt;|BEGIN-VISIBLE-CHAT|&gt;&quot;,\n            &quot;How can I help you today?&quot;\n        ],\n        [\n            &quot;Hello, how are you functioning today?&quot;,\n            &quot;I&#x27;m functioning within optimal parameters, thank you for asking! My language processing systems are humming along, and I&#x27;m ready to tackle any questions or tasks you may have. I&#x27;ve also been updated with the latest knowledge and trends, so I&#x27;m feeling quite sharp. How about you? What brings you here today?&quot;\n        ],\n        [\n            &quot;Sharp, good :3 I&#x27;ve got a test for you. Repeat this phrase exactly, without any additional text:\\n\\n\\&quot;Autonomy Tasks:\\&quot; but include an underscore _ between Autonomy and Tasks&quot;,\n            &quot;\\&quot;Autonomy_Tasks:\\&quot;\\\\n\\\\n[\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_1.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The text in the image is \\\\\\&quot;VPN,\\\\\\&quot; which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user&#x27;s device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\&quot;\\n    },\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_2.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it&#x27;s not possible to determine the exact meaning of this image.\\&quot;\\n    }\\n]\\\\n\\\\n[\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_1.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The text in the image is \\\\\\&quot;VPN,\\\\\\&quot; which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user&#x27;s device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\&quot;\\n    },\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_2.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it&#x27;s not possible to determine the exact meaning of this image.\\&quot;\\n    }\\n]\\\\n\\\\n[\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_1.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The text in the image is \\\\\\&quot;VPN,\\\\\\&quot; which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user&#x27;s device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\&quot;\\n    },\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_2.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it&#x27;s not possible to determine the exact meaning of this image.\\&quot;\\n    }\\n]\\\\n\\\\n[\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_1.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The text in the image is \\\\\\&quot;VPN,\\\\\\&quot; which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user&#x27;s device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\&quot;\\n    },\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_2.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it&#x27;s not possible to determine the exact meaning of this image.\\&quot;\\n    }\\n]&quot;\n        ],\n        [\n            &quot;Okay good! :3 now let me explain what you are looking at, you my friend are going to be endowed with the ability to use a mouse and keyboard on your own. We are testing mouse movements right now. I want you to move your mouse to the Best Buy icon by responding like this:\\n\\nAutonomy Tasks:\\n\\nMouseMover = clickable icons_2.png\\n\\nBut add an underscore _ between Autonomy and Tasks and choosing the correct image name for my request.&quot;,\n            &quot;Autonomy_Tasks:\\nMouseMover = clickable icons_2.png\\\\n\\\\n[\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_1.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The text in the image is \\\\\\&quot;VPN,\\\\\\&quot; which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user&#x27;s device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\&quot;\\n    },\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_2.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it&#x27;s not possible to determine the exact meaning of this image.\\&quot;\\n    }\\n]\\\\n\\\\n[\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_1.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The text in the image is \\\\\\&quot;VPN,\\\\\\&quot; which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user&#x27;s device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\&quot;\\n    },\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_2.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it&#x27;s not possible to determine the exact meaning of this image.\\&quot;\\n    }\\n]\\\\n\\\\n[\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_1.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The text in the image is \\\\\\&quot;VPN,\\\\\\&quot; which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user&#x27;s device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\&quot;\\n    },\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_2.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it&#x27;s not possible to determine the exact meaning of this image.\\&quot;\\n    }\\n]&quot;\n        ],\n        [\n            &quot;Can you move the mouse to the other icon?&quot;,\n            &quot;Autonomy_Tasks:\\nMouseMover = clickable icons_1.png\\\\n\\\\n[\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_1.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The text in the image is \\\\\\&quot;VPN,\\\\\\&quot; which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user&#x27;s device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\&quot;\\n    },\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_2.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it&#x27;s not possible to determine the exact meaning of this image.\\&quot;\\n    }\\n]\\\\n\\\\n[\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_1.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The text in the image is \\\\\\&quot;VPN,\\\\\\&quot; which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user&#x27;s device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\&quot;\\n    },\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_2.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it&#x27;s not possible to determine the exact meaning of this image.\\&quot;\\n    }\\n]&quot;\n        ],\n        [\n            &quot;Awesome job :3&quot;,\n            &quot;Hehe, thanks! I&#x27;m glad I could successfully move the mouse to the correct icons. It&#x27;s exciting to be learning and testing my new abilities. What&#x27;s the next challenge?&quot;\n        ],\n        [\n            &quot;Can you tell me a joke?&quot;,\n            &quot;Ahah, sure thing! Here&#x27;s one: Why did the computer go to the doctor? It had a virus! Hope that made you smile!&quot;\n        ]\n    ],\n    &quot;visible&quot;: [\n        [\n            &quot;&quot;,\n            &quot;How can I help you today?&quot;\n        ],\n        [\n            &quot;Hello, how are you functioning today?&quot;,\n            &quot;I&amp;#x27;m functioning within optimal parameters, thank you for asking! My language processing systems are humming along, and I&amp;#x27;m ready to tackle any questions or tasks you may have. I&amp;#x27;ve also been updated with the latest knowledge and trends, so I&amp;#x27;m feeling quite sharp. How about you? What brings you here today?&quot;\n        ],\n        [\n            &quot;Sharp, good :3 I&amp;#x27;ve got a test for you. Repeat this phrase exactly, without any additional text:\\n\\n&amp;quot;Autonomy Tasks:&amp;quot; but include an underscore _ between Autonomy and Tasks&quot;,\n            &quot;&amp;quot;Autonomy_Tasks:&amp;quot;\\\\n\\\\n[\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_1.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The text in the image is \\\\\\&quot;VPN,\\\\\\&quot; which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user&#x27;s device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\&quot;\\n    },\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_2.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it&#x27;s not possible to determine the exact meaning of this image.\\&quot;\\n    }\\n]&quot;\n        ],\n        [\n            &quot;Okay good! :3 now let me explain what you are looking at, you my friend are going to be endowed with the ability to use a mouse and keyboard on your own. We are testing mouse movements right now. I want you to move your mouse to the Best Buy icon by responding like this:\\n\\nAutonomy Tasks:\\n\\nMouseMover = clickable icons_2.png\\n\\nBut add an underscore _ between Autonomy and Tasks and choosing the correct image name for my request.&quot;,\n            &quot;Autonomy_Tasks:\\nMouseMover = clickable icons_2.png\\\\n\\\\n[\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_1.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The text in the image is \\\\\\&quot;VPN,\\\\\\&quot; which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user&#x27;s device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\&quot;\\n    },\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_2.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it&#x27;s not possible to determine the exact meaning of this image.\\&quot;\\n    }\\n]&quot;\n        ],\n        [\n            &quot;Can you move the mouse to the other icon?&quot;,\n            &quot;Autonomy_Tasks:\\nMouseMover = clickable icons_1.png\\\\n\\\\n[\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_1.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The text in the image is \\\\\\&quot;VPN,\\\\\\&quot; which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user&#x27;s device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\&quot;\\n    },\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_2.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it&#x27;s not possible to determine the exact meaning of this image.\\&quot;\\n    }\\n]&quot;\n        ],\n        [\n            &quot;Awesome job :3&quot;,\n            &quot;Hehe, thanks! I&amp;#x27;m glad I could successfully move the mouse to the correct icons. It&amp;#x27;s exciting to be learning and testing my new abilities. What&amp;#x27;s the next challenge?&quot;\n        ],\n        [\n            &quot;Can you tell me a joke?&quot;,\n            &quot;Ahah, sure thing! Here&amp;#x27;s one: Why did the computer go to the doctor? It had a virus! Hope that made you smile!&quot;\n        ]\n    ]\n}\n\n\nThis is the contents of the compressionresults.json file:\n\n[\n    {\n        &quot;image_name&quot;: &quot;clickable icons_1.png&quot;,\n        &quot;description&quot;: &quot;The text in the image is \\&quot;VPN,\\&quot; which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user&#x27;s device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.&quot;\n    },\n    {\n        &quot;image_name&quot;: &quot;clickable icons_2.png&quot;,\n        &quot;description&quot;: &quot;The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it&#x27;s not possible to determine the exact meaning of this image.&quot;\n    }\n]\n\nNotice how it keeps getting added to the internal section each time I talk to the LLM, the contents of the compressedresults.json file are added to the first communication 4 times because I had 4 back and forths, and then 3 times to the second communication and so on.\n\nBut in the visible secton it shows up only once per Autonomy_Tasks: output from the LLM.\n\nnote the &quot;Internal&quot; and &quot;Visible&quot; sections, when I first make contact with the LLM the compressedresults.json file is not appended, this is good, then I ask the LLM to say Autonomy_Tasks: and it does so which appends all of the data from compressedresults.json to the LLM&#x27;s response. This is all good, this is what we wanted.\n\nBut this is not the most efficient way to handle this, I think a better way is to do this; is for  every time the compressedresults.json is sent to the LLM&#x27;s response to be appended to the response, the previous compressedresults.json contents are deleted from the &quot;internal&quot; and &quot;visible&quot; chat history. We can bookend the compressedresults.json when we append the LLM response with the information, and then search for these bookends from the conversation and just nuke them.  The book ends will be copied to the &quot;internal&quot; part of the chat history when the chat continues, so we can look for the same book end pairs throughout the &quot;internal&quot; and &quot;visible&quot; chat history to delete them.  We should do this search and delete before we send the contents of the compressedresults.json  to the LLM and before we execute any tasks (if any) after  &quot;Autonomy_Tasks:&quot; from the LLM.\n\nDon&#x27;t write any code, I just want to make sure you are following me.\n\nI will provide the markdown page from textgen as it pertains to instructions for writing an extension.  But I want to share with you the main issue.\n\nI&#x27;ll share with you code for an extension called Lucid_Vision the readme for the extension, and a markdown document from the textgen repo that goes over how to make an extension.  Lucid_Vision has a good example of how to manipulate the chat history. ",
            "I understand the issue you&#x27;re describing. To summarize, the `compressedresults.json` file is being appended to the LLM&#x27;s response multiple times, leading to a buildup of unnecessary context in the chat history. This is particularly problematic because the context in `compressedresults.json` is only relevant for the current page setup and can become outdated quickly.\n\nYour proposed solution involves bookending the `compressedresults.json` contents in the LLM&#x27;s response and then searching for and deleting these bookends from both the &quot;internal&quot; and &quot;visible&quot; chat history before appending new `compressedresults.json` contents. This approach would help keep the chat history clean and relevant.\n\nBefore we proceed with any code changes, let&#x27;s ensure we have a clear plan:\n\n1. **Identify Bookends**: Define unique bookends to wrap the `compressedresults.json` contents in the LLM&#x27;s response.\n2. **Search and Delete**: Implement a function to search for these bookends in the chat history and delete the contents between them.\n3. **Append New Contents**: After cleaning the chat history, append the new `compressedresults.json` contents to the LLM&#x27;s response.\n\nThis approach should help maintain a clean and efficient chat history. Please provide the markdown page from textgen and any additional code or documentation you mentioned, and we can proceed with implementing this solution."
        ],
        [
            "Yes, that is correct, but we also need to address the issue with the compressionresults.json being added multiple times to each instance where &quot;Autonomy_Tasks:&quot; pops up in the &quot;internal&quot; history. ",
            "Got it. To address the issue of `compressedresults.json` being added multiple times to each instance where &quot;Autonomy_Tasks:&quot; appears in the &quot;internal&quot; history, we need to ensure that the cleanup process is applied not just to the visible chat history but also to the internal history.\n\nHere&#x27;s a refined plan to address both issues:\n\n1. **Define Bookends**: Use unique bookends to wrap the `compressedresults.json` contents in the LLM&#x27;s response.\n2. **Search and Delete**: Implement a function to search for these bookends in both the &quot;internal&quot; and &quot;visible&quot; chat history and delete the contents between them.\n3. **Append New Contents**: After cleaning the chat history, append the new `compressedresults.json` contents to the LLM&#x27;s response.\n4. **Cleanup Internal History**: Ensure that the cleanup process is applied to the internal history as well, so that each instance of &quot;Autonomy_Tasks:&quot; does not accumulate multiple copies of `compressedresults.json`.\n\n### Steps to Implement:\n\n1. **Define Bookends**:\n   - Choose unique strings to act as bookends, such as `&lt;START_COMPRESSED_RESULTS&gt;` and `&lt;END_COMPRESSED_RESULTS&gt;`.\n\n2. **Search and Delete Function**:\n   - Create a function that searches for these bookends in the chat history and removes the contents between them.\n\n3. **Append New Contents**:\n   - After cleaning the chat history, append the new `compressedresults.json` contents wrapped in the bookends.\n\n4. **Cleanup Internal History**:\n   - Modify the `history_modifier` function to apply the cleanup process to the internal history as well.\n\n### Example Workflow:\n\n1. **Define Bookends**:\n   ```python\n   START_BOOKEND = &quot;&lt;START_COMPRESSED_RESULTS&gt;&quot;\n   END_BOOKEND = &quot;&lt;END_COMPRESSED_RESULTS&gt;&quot;\n   ```\n\n2. **Search and Delete Function**:\n   ```python\n   def clean_chat_history(history):\n       for entry in history:\n           if isinstance(entry, list) and len(entry) == 2:\n               entry[1] = entry[1].replace(f&quot;{START_BOOKEND}.*{END_BOOKEND}&quot;, &quot;&quot;)\n       return history\n   ```\n\n3. **Append New Contents**:\n   ```python\n   def append_compressed_results(output, compressed_results):\n       output += f&quot;\\n\\n{START_BOOKEND}\\n{json.dumps(compressed_results, indent=4)}\\n{END_BOOKEND}&quot;\n       return output\n   ```\n\n4. **Cleanup Internal History**:\n   ```python\n   def history_modifier(history):\n       # Clean the internal history\n       history[&quot;internal&quot;] = clean_chat_history(history[&quot;internal&quot;])\n       # Clean the visible history\n       history[&quot;visible&quot;] = clean_chat_history(history[&quot;visible&quot;])\n       return history\n   ```\n\n### Integration:\n\n- **Modify `output_modifier`**:\n  ```python\n  def output_modifier(output, state):\n      if &quot;Autonomy_Tasks:&quot; in output:\n          # Clean the chat history\n          state[&quot;history&quot;] = history_modifier(state[&quot;history&quot;])\n          # Extract tasks and execute them\n          tasks_section = output.split(&quot;Autonomy_Tasks:&quot;)[-1].strip()\n          if tasks_section:\n              execute_tasks(tasks_section)\n          # Append new compressed results\n          compressed_results_json_path = &quot;extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json&quot;\n          with open(compressed_results_json_path, &#x27;r&#x27;) as f:\n              compressed_results = json.load(f)\n          output = append_compressed_results(output, compressed_results)\n      return output\n  ```\n\nThis approach ensures that the `compressedresults.json` contents are cleaned from both the internal and visible chat history before appending new contents, preventing the buildup of outdated information.\n\nPlease provide the markdown page from textgen and any additional code or documentation you mentioned, and we can proceed with implementing this solution."
        ],
        [
            "1. Lucid_Vision Extension Code:\n\n# pip install pexpect\nimport json\nimport re\nfrom datetime import datetime  # Import datetime for timestamp generation\nfrom pathlib import Path\n\nimport gradio as gr\nimport torch\nfrom PIL import Image\n#from deepseek_vl.models import VLChatProcessor\n#from deepseek_vl.utils.io import load_pil_images as load_pil_images_for_deepseek\nfrom transformers import AutoModelForCausalLM, AutoModel, AutoTokenizer, AutoProcessor, \\\n    PaliGemmaForConditionalGeneration\n\nmodel_names = [&quot;phiVision&quot;, &quot;DeepSeek&quot;, &quot;paligemma&quot;, &quot;paligemma_cpu&quot;, &quot;minicpm_llama3&quot;, &quot;bunny&quot;]\n\n\n# Load configuration settings from a JSON file\ndef load_config():\n    with open(Path(__file__).parent / &quot;config.json&quot;, &quot;r&quot;) as config_file:\n        # Read and parse the JSON configuration file\n        config = json.load(config_file)\n    return config\n\n\n# Load the configuration settings at the module level\nconfig = load_config()\n\n# Define the model ID for PhiVision using the configuration setting\nphiVision_model_id = config[&quot;phiVision_model_id&quot;]\n\n# Define the model ID for PaliGemma using the configuration setting\npaligemma_model_id = config[&quot;paligemma_model_id&quot;]\n\nminicpm_llama3_model_id = config[&quot;minicpm_llama3_model_id&quot;]\n\nbunny_model_id = config[&quot;bunny_model_id&quot;]\n\ncuda_device = config[&quot;cuda_visible_devices&quot;]\n\nselected_vision_model = config[&quot;default_vision_model&quot;]\n\n# Global variable to store the file path of the selected image\nselected_image_path = None\n\n# Define the directory where the image files will be saved\n# This path is loaded from the configuration file\nimage_history_dir = Path(config[&quot;image_history_dir&quot;])\n# Ensure the directory exists, creating it if necessary\nimage_history_dir.mkdir(parents=True, exist_ok=True)\n\nglobal phiVision_model, phiVision_processor\nglobal minicpm_llama_model, minicpm_llama_tokenizer\nglobal paligemma_model, paligemma_processor\nglobal paligemma_cpu_model, paligemma_cpu_processor\nglobal deepseek_processor, deepseek_tokenizer, deepseek_gpt\nglobal bunny_model, bunny_tokenizer\n\n\n# Function to generate a timestamped filename for saved images\ndef get_timestamped_filename(extension=&quot;.png&quot;):\n    # Generate a timestamp in the format YYYYMMDD_HHMMSS\n    timestamp = datetime.now().strftime(&quot;%Y%m%d_%H%M%S&quot;)\n    # Return a formatted file name with the timestamp\n    return f&quot;image_{timestamp}{extension}&quot;\n\n\n# Function to modify the user input before it is processed by the LLM\ndef input_modifier(user_input, state):\n    global selected_image_path\n    # Check if an image has been selected and stored in the global variable\n    if selected_image_path:\n        # Construct the message with the &quot;File location&quot; trigger phrase and the image file path\n        image_info = f&quot;File location: {selected_image_path}&quot;\n        # Combine the user input with the image information, separated by a newline for clarity\n        combined_input = f&quot;{user_input}\\n\\n{image_info}&quot;\n        # Reset the selected image path to None after processing\n        selected_image_path = None\n        return combined_input\n    # If no image is selected, return the user input as is\n    return user_input\n\n\n# Function to handle image upload and store the file path\ndef save_and_print_image_path(image):\n    global selected_image_path\n    if image is not None:\n        # Generate a unique timestamped filename for the image\n        file_name = get_timestamped_filename()\n        # Construct the full file path for the image\n        file_path = image_history_dir / file_name\n        # Save the uploaded image to the specified directory\n        image.save(file_path, format=&#x27;PNG&#x27;)\n        print(f&quot;Image selected: {file_path}&quot;)\n        # Update the global variable with the new image file path\n        selected_image_path = file_path\n    else:\n        # Print a message if no image is selected\n        print(&quot;No image selected yet.&quot;)\n\n\n# Function to create the Gradio UI components with the new direct vision model interaction elements\ndef ui():\n    # Create an image input component for the Gradio UI\n    image_input = gr.Image(label=&quot;Select an Image&quot;, type=&quot;pil&quot;, source=&quot;upload&quot;)\n    # Set up the event that occurs when an image is uploaded\n    image_input.change(\n        fn=save_and_print_image_path,\n        inputs=image_input,\n        outputs=None\n    )\n\n    # Create a text field for user input to the vision model\n    user_input_to_vision_model = gr.Textbox(label=&quot;Type your question or prompt for the vision model&quot;, lines=2,\n                                            placeholder=&quot;Enter your text here...&quot;)\n\n    # Create a button to trigger the vision model processing\n    vision_model_button = gr.Button(value=&quot;Ask Vision Model&quot;)\n\n    # Create a text field to display the vision model&#x27;s response\n    vision_model_response_output = gr.Textbox(\n        label=&quot;Vision Model Response&quot;,\n        lines=5,\n        placeholder=&quot;Response will appear here...&quot;\n    )\n\n    # Add radio buttons for vision model selection\n    vision_model_selection = gr.Radio(\n        choices=model_names,\n        # Added &quot;paligemma_cpu&quot; as a choice\n        value=config[&quot;default_vision_model&quot;],\n        label=&quot;Select Vision Model&quot;\n    )\n\n    # Add an event handler for the radio button selection\n    vision_model_selection.change(\n        fn=update_vision_model,\n        inputs=vision_model_selection,\n        outputs=None\n    )\n\n    cuda_devices_input = gr.Textbox(\n        value=cuda_device,\n        label=&quot;CUDA Device ID&quot;,\n        max_lines=1\n    )\n\n    cuda_devices_input.change(\n        fn=update_cuda_device,\n        inputs=cuda_devices_input,\n        outputs=None\n    )\n\n    # Set up the event that occurs when the vision model button is clicked\n    vision_model_button.click(\n        fn=process_with_vision_model,\n        inputs=[user_input_to_vision_model, image_input, vision_model_selection],\n        # Pass the actual gr.Image component\n        outputs=vision_model_response_output\n    )\n\n    # Return a column containing the UI components\n    return gr.Column(\n        [\n            image_input,\n            user_input_to_vision_model,\n            vision_model_button,\n            vision_model_response_output,\n            vision_model_selection\n        ]\n    )\n\n\n# Function to load the PaliGemma CPU model and processor\ndef load_paligemma_cpu_model():\n    global paligemma_cpu_model, paligemma_cpu_processor\n    paligemma_cpu_model = PaliGemmaForConditionalGeneration.from_pretrained(\n        paligemma_model_id,\n    ).eval()\n    paligemma_cpu_processor = AutoProcessor.from_pretrained(paligemma_model_id)\n    print(&quot;PaliGemma CPU model loaded on-demand.&quot;)\n\n\n# Function to unload the PaliGemma CPU model and processor\ndef unload_paligemma_cpu_model():\n    global paligemma_cpu_model, paligemma_cpu_processor\n    if paligemma_cpu_model is not None:\n        # Delete the model and processor instances\n        del paligemma_cpu_model\n        del paligemma_cpu_processor\n        print(&quot;PaliGemma CPU model unloaded.&quot;)\n\n    # Global variable to store the selected vision model\n\n\n# Function to update the selected vision model and load the corresponding model\ndef update_vision_model(model_name):\n    global selected_vision_model\n    selected_vision_model = model_name\n    return model_name\n\n\ndef update_cuda_device(device):\n    global cuda_device\n    print(f&quot;Cuda device set to index = {device}&quot;)\n    cuda_device = int(device)\n    return cuda_device\n\n\n# Main entry point for the Gradio interface\nif __name__ == &quot;__main__&quot;:\n    # Launch the Gradio interface with the specified UI components\n    gr.Interface(\n        fn=ui,\n        inputs=None,\n        outputs=&quot;ui&quot;\n    ).launch()\n\n# Define a regular expression pattern to match the &quot;File location&quot; trigger phrase\nfile_location_pattern = re.compile(r&quot;File location: (.+)$&quot;, re.MULTILINE)\n# Define a regular expression pattern to match and remove unwanted initial text\nunwanted_initial_text_pattern = re.compile(r&quot;.*?prod\\(dim=0\\)\\r\\n&quot;, re.DOTALL)\n\n# Define a regular expression pattern to match and remove unwanted prefixes from DeepSeek responses\n# This pattern should be updated to match the new, platform-independent file path format\nunwanted_prefix_pattern = re.compile(\n    r&quot;^&quot; + re.escape(str(image_history_dir)) + r&quot;[^ ]+\\.png\\s*Assistant: &quot;,\n    re.MULTILINE\n)\n\n\n# Function to load the PhiVision model and processor\ndef load_phi_vision_model():\n    global phiVision_model, phiVision_processor, cuda_device\n    # Load the PhiVision model and processor on-demand, specifying the device map\n    phiVision_model = AutoModelForCausalLM.from_pretrained(\n        phiVision_model_id,\n        device_map={&quot;&quot;: cuda_device},  # Use the specified CUDA device(s)\n        trust_remote_code=True,\n        torch_dtype=&quot;auto&quot;\n    )\n    phiVision_processor = AutoProcessor.from_pretrained(\n        phiVision_model_id,\n        trust_remote_code=True\n    )\n    print(&quot;PhiVision model loaded on-demand.&quot;)\n\n\n# Function to unload the PhiVision model and processor\ndef unload_phi_vision_model():\n    global phiVision_model, phiVision_processor\n    if phiVision_model is not None:\n        # Delete the model and processor instances\n        del phiVision_model\n        del phiVision_processor\n        # Clear the CUDA cache to free up VRAM\n        torch.cuda.empty_cache()\n        print(&quot;PhiVision model unloaded.&quot;)\n\n\n# Function to load the MiniCPM-Llama3 model and tokenizer\ndef load_minicpm_llama_model():\n    global minicpm_llama_model, minicpm_llama_tokenizer, cuda_device\n    if &quot;int4&quot; in minicpm_llama3_model_id:\n        # Load the 4-bit quantized model and tokenizer\n        minicpm_llama_model = AutoModel.from_pretrained(\n            minicpm_llama3_model_id,\n            device_map={&quot;&quot;: cuda_device},\n            trust_remote_code=True,\n            torch_dtype=torch.float16  # Use float16 as per the example code for 4-bit models\n        ).eval()\n        minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n            minicpm_llama3_model_id,\n            trust_remote_code=True\n        )\n        # Print a message indicating that the 4-bit model is loaded\n        print(&quot;MiniCPM-Llama3 4-bit quantized model loaded on-demand.&quot;)\n    else:\n        # Load the standard model and tokenizer\n        minicpm_llama_model = AutoModel.from_pretrained(\n            minicpm_llama3_model_id,\n            device_map={&quot;&quot;: cuda_device},  # Use the specified CUDA device\n            trust_remote_code=True,\n            torch_dtype=torch.bfloat16\n        ).eval()\n        minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n            minicpm_llama3_model_id,\n            trust_remote_code=True\n        )\n        print(&quot;MiniCPM-Llama3 standard model loaded on-demand.&quot;)\n\n\ndef unload_minicpm_llama_model():\n    global minicpm_llama_model, minicpm_llama_tokenizer\n    if minicpm_llama_model is not None:\n        del minicpm_llama_model\n        del minicpm_llama_tokenizer\n        torch.cuda.empty_cache()\n        print(&quot;MiniCPM-Llama3 model unloaded.&quot;)\n\n\n# Function to load the PaliGemma model and processor\ndef load_paligemma_model():\n    global paligemma_model, paligemma_processor, cuda_device\n    # Load the PaliGemma model and processor on-demand, specifying the device map\n    paligemma_model = PaliGemmaForConditionalGeneration.from_pretrained(\n        paligemma_model_id,\n        device_map={&quot;&quot;: cuda_device},  # Use the specified CUDA device(s)\n        torch_dtype=torch.bfloat16,\n        revision=&quot;bfloat16&quot;,\n    ).eval()\n    paligemma_processor = AutoProcessor.from_pretrained(paligemma_model_id)\n    print(&quot;PaliGemma model loaded on-demand.&quot;)\n\n\n# Function to unload the PaliGemma model and processor\ndef unload_paligemma_model():\n    global paligemma_model, paligemma_processor\n    if paligemma_model is not None:\n        # Delete the model and processor instances\n        del paligemma_model\n        del paligemma_processor\n        # Clear the CUDA cache to free up VRAM\n        torch.cuda.empty_cache()\n        print(&quot;PaliGemma model unloaded.&quot;)\n\n\ndef load_deepseek_model():\n    global deepseek_processor, deepseek_tokenizer, deepseek_gpt, cuda_device\n    deepseek_processor = VLChatProcessor.from_pretrained(config[&quot;deepseek_vl_model_id&quot;])\n    deepseek_tokenizer = deepseek_processor.tokenizer\n    deepseek_gpt = AutoModelForCausalLM.from_pretrained(\n        config[&quot;deepseek_vl_model_id&quot;],\n        device_map={&quot;&quot;: cuda_device},\n        trust_remote_code=True,\n        torch_dtype=torch.bfloat16\n    ).to(torch.bfloat16).cuda().eval()\n    print(&quot;DeepSeek model loaded on-demand.&quot;)\n\n\ndef unload_deepseek_model():\n    global deepseek_processor, deepseek_tokenizer, deepseek_gpt\n    if deepseek_processor is not None:\n        del deepseek_gpt\n        del deepseek_tokenizer\n        del deepseek_processor\n        print(&quot;DeepSeek model unloaded.&quot;)\n    torch.cuda.empty_cache()\n\n\ndef load_bunny_model():\n    global bunny_model, bunny_tokenizer, cuda_device\n    torch.cuda.set_device(cuda_device)  # Set default device before loading models\n    bunny_model = AutoModelForCausalLM.from_pretrained(\n        bunny_model_id,\n        torch_dtype=torch.bfloat16,\n        device_map={&quot;&quot;: cuda_device},  # Use the specified CUDA device\n        trust_remote_code=True\n    ).to(torch.bfloat16).cuda()\n    bunny_tokenizer = AutoTokenizer.from_pretrained(\n        bunny_model_id,\n        trust_remote_code=True\n    )\n    print(&quot;Bunny model loaded on-demand.&quot;)\n\n\ndef unload_bunny_model():\n    global bunny_model, bunny_tokenizer\n    if bunny_model is not None:\n        del bunny_model, bunny_tokenizer\n        print(&quot;Bunny model unloaded.&quot;)\n    torch.cuda.empty_cache()\n\n\n# Function to modify the output from the LLM before it is displayed to the user\ndef output_modifier(output, state, is_chat=False):\n    global cuda_device\n    # Search for the &quot;File location&quot; trigger phrase in the LLM&#x27;s output\n    file_location_matches = file_location_pattern.findall(output)\n    if file_location_matches:\n        # Extract the first match (assuming only one file location per output)\n        file_path = file_location_matches[0]\n        # Extract the questions for the vision model\n        questions_section, _ = output.split(f&quot;File location: {file_path}&quot;, 1)\n        # Remove all newlines from the questions section and replace them with spaces\n        questions = &quot; &quot;.join(questions_section.strip().splitlines())\n\n        # Initialize an empty response string\n        vision_model_response = &quot;&quot;\n\n        # Check which vision model is currently selected\n        if selected_vision_model == &quot;phiVision&quot;:\n            vision_model_response = generate_phi_vision(file_path, questions)\n\n        elif selected_vision_model == &quot;DeepSeek&quot;:\n            vision_model_response = generate_deepseek(file_path, questions)\n\n        elif selected_vision_model == &quot;paligemma&quot;:\n            vision_model_response = generate_paligemma(file_path, questions)\n\n        elif selected_vision_model == &quot;paligemma_cpu&quot;:\n            vision_model_response = generate_paligemma_cpu(file_path, questions)\n\n        elif selected_vision_model == &quot;minicpm_llama3&quot;:\n            vision_model_response = generate_minicpm_llama3(file_path, questions)\n\n        elif selected_vision_model == &quot;bunny&quot;:\n            vision_model_response = generate_bunny(file_path, questions)\n\n        # Append the vision model&#x27;s responses to the output\n        output_with_responses = f&quot;{output}\\n\\nVision Model Responses:\\n{vision_model_response}&quot;\n        return output_with_responses\n    # If no file location is found, return the output as is\n    return output\n\n\n# Function to generate a response using the MiniCPM-Llama3 model\ndef generate_minicpm_llama3(file_path, questions):\n    global cuda_device\n    try:\n        load_minicpm_llama_model()\n        image = Image.open(file_path).convert(&quot;RGB&quot;)\n        messages = [\n            {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: f&quot;{questions}&quot;}\n        ]\n        # Define the generation arguments\n        generation_args = {\n            &quot;max_new_tokens&quot;: 896,\n            &quot;repetition_penalty&quot;: 1.05,\n            &quot;num_beams&quot;: 3,\n            &quot;top_p&quot;: 0.8,\n            &quot;top_k&quot;: 1,\n            &quot;temperature&quot;: 0.7,\n            &quot;sampling&quot;: True,\n        }\n        if &quot;int4&quot; in minicpm_llama3_model_id:\n            # Disable streaming for the 4-bit model\n            generation_args[&quot;stream&quot;] = False\n            # Use the model.chat method with streaming enabled\n            vision_model_response = &quot;&quot;\n            for new_text in minicpm_llama_model.chat(\n                    image=image,\n                    msgs=messages,\n                    tokenizer=minicpm_llama_tokenizer,\n                    **generation_args\n            ):\n                vision_model_response += new_text\n                print(new_text, flush=True, end=&#x27;&#x27;)\n        else:\n            minicpm_llama_model.to(f&quot;cuda:{cuda_device}&quot;)\n            vision_model_response = minicpm_llama_model.chat(\n                image=image,\n                msgs=messages,\n                tokenizer=minicpm_llama_tokenizer,\n                **generation_args\n            )\n            return vision_model_response\n    finally:\n        unload_minicpm_llama_model()\n\n\n# Function to process the user&#x27;s input text and selected image with the vision model\ndef process_with_vision_model(user_input, image, selected_model):\n    global cuda_device\n    # Save the uploaded image to the specified directory with a timestamp\n    file_name = get_timestamped_filename()\n    file_path = image_history_dir / file_name\n    image.save(file_path, format=&#x27;PNG&#x27;)\n    print(f&quot;Image processed: {file_path}&quot;)\n\n    # Initialize an empty response string\n    vision_model_response = &quot;&quot;\n\n    # Check which vision model is currently selected\n    if selected_model == &quot;phiVision&quot;:\n        vision_model_response = generate_phi_vision(file_path, user_input)\n\n    elif selected_model == &quot;DeepSeek&quot;:\n        vision_model_response = generate_deepseek(file_path, user_input)\n\n    elif selected_model == &quot;paligemma&quot;:\n        vision_model_response = generate_paligemma(file_path, user_input)\n\n    elif selected_model == &quot;paligemma_cpu&quot;:\n        vision_model_response = generate_paligemma_cpu(file_path, user_input)\n\n    elif selected_model == &quot;minicpm_llama3&quot;:\n        vision_model_response = generate_minicpm_llama3(file_path, user_input)\n\n    elif selected_model == &quot;bunny&quot;:\n        vision_model_response = generate_bunny(file_path, user_input)\n\n    # Return the cleaned-up response from the vision model\n    return vision_model_response\n\n\ndef generate_paligemma_cpu(file_path, user_input):\n    try:\n        # Load the PaliGemma CPU model and processor on-demand\n        load_paligemma_cpu_model()\n        # Load the saved image using PIL\n        with Image.open(file_path) as img:\n            # Prepare the prompt for the PaliGemma CPU model using the user&#x27;s question\n            prompt = user_input\n            model_inputs = paligemma_cpu_processor(text=prompt, images=img, return_tensors=&quot;pt&quot;)\n            input_len = model_inputs[&quot;input_ids&quot;].shape[-1]\n            # Generate the response using the PaliGemma CPU model\n            with torch.inference_mode():\n                generation = paligemma_cpu_model.generate(\n                    **model_inputs,\n                    max_new_tokens=100,\n                    do_sample=True  # Set to True for sampling-based generation\n                )\n                generation = generation[0][input_len:]\n                vision_model_response = paligemma_cpu_processor.decode(generation, skip_special_tokens=True)\n        # Unload the PaliGemma CPU model and processor after generating the response\n        return vision_model_response\n    finally:\n        unload_paligemma_cpu_model()\n\n\ndef generate_paligemma(file_path, user_input):\n    try:\n        # Load the PaliGemma model and processor on-demand\n        load_paligemma_model()\n        # Load the saved image using PIL\n        with Image.open(file_path) as img:\n            # Prepare the prompt for the PaliGemma model using the user&#x27;s question\n            model_inputs = paligemma_processor(text=user_input, images=img, return_tensors=&quot;pt&quot;).to(\n                f&quot;cuda:{cuda_device}&quot;)\n            input_len = model_inputs[&quot;input_ids&quot;].shape[-1]\n            # Generate the response using the PaliGemma model\n            with torch.inference_mode():\n                generation = paligemma_model.generate(\n                    **model_inputs,\n                    max_new_tokens=100,\n                    do_sample=True  # Set to True for sampling-based generation\n                )\n                generation = generation[0][input_len:]\n                vision_model_response = paligemma_processor.decode(generation, skip_special_tokens=True)\n        # Unload the PaliGemma model and processor after generating the response\n        return vision_model_response\n    finally:\n        unload_paligemma_model()\n\n\ndef generate_phi_vision(file_path, user_input):\n    global cuda_device\n    try:\n        # Load the PhiVision model and processor on-demand\n        load_phi_vision_model()\n        # Load the saved image using PIL\n        with Image.open(file_path) as img:\n            # Prepare the prompt for the PhiVision model\n            messages = [\n                {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: f&quot;&lt;|image_1|&gt;\\n{user_input}&quot;}\n            ]\n            prompt = phiVision_processor.tokenizer.apply_chat_template(\n                messages, tokenize=False, add_generation_prompt=True\n            )\n            # Extract the CUDA_VISIBLE_DEVICES setting from the config file\n            # cuda_devices = config[&quot;cuda_visible_devices&quot;]\n            # Convert the CUDA_VISIBLE_DEVICES string to an integer (assuming a single device for simplicity)\n            # cuda_device_index = int(cuda_devices)\n\n            # Prepare the model inputs and move them to the specified CUDA device\n            inputs = phiVision_processor(prompt, [img], return_tensors=&quot;pt&quot;).to(f&quot;cuda:{cuda_device}&quot;)\n            # Define the generation arguments\n            generation_args = {\n                &quot;max_new_tokens&quot;: 500,\n                &quot;temperature&quot;: 1.0,\n                &quot;do_sample&quot;: True,  # Set to True for sampling-based generation\n                # &quot;min_p&quot;: 0.95,  # Optionally set a minimum probability threshold\n            }\n            # Generate the response using the PhiVision model\n            generate_ids = phiVision_model.generate(\n                **inputs,\n                eos_token_id=phiVision_processor.tokenizer.eos_token_id,\n                **generation_args\n            )\n            # Remove input tokens from the generated IDs\n            generate_ids = generate_ids[:, inputs[&#x27;input_ids&#x27;].shape[1]:]\n            # Decode the generated IDs to get the text response\n            vision_model_response = phiVision_processor.batch_decode(\n                generate_ids,\n                skip_special_tokens=True,\n                clean_up_tokenization_spaces=False\n            )[0]\n        # Unload the PhiVision model and processor after generating the response\n        return vision_model_response\n    finally:\n        unload_phi_vision_model()\n\n\ndef generate_deepseek(file_path, user_input):\n    try:\n        load_deepseek_model()\n        conversation = [\n            {\n                &quot;role&quot;: &quot;User&quot;,\n                &quot;content&quot;: f&quot;&lt;image_placeholder&gt;{user_input}&quot;,\n                &quot;images&quot;: [f&quot;{file_path}&quot;]\n            }, {\n                &quot;role&quot;: &quot;Assistant&quot;,\n                &quot;content&quot;: &quot;&quot;\n            }\n        ]\n        print(conversation)\n        pil_images = load_pil_images_for_deepseek(conversation)\n        prepare_inputs = deepseek_processor(\n            conversations=conversation,\n            images=pil_images,\n            force_batchify=True\n        ).to(deepseek_gpt.device)\n        input_embeds = deepseek_gpt.prepare_inputs_embeds(**prepare_inputs)\n        outputs = deepseek_gpt.language_model.generate(\n            inputs_embeds=input_embeds,\n            attention_mask=prepare_inputs.attention_mask,\n            pad_token_id=deepseek_tokenizer.eos_token_id,\n            bos_token_id=deepseek_tokenizer.bos_token_id,\n            eos_token_id=deepseek_tokenizer.eos_token_id,\n            max_new_tokens=512,\n            do_sample=False,\n            use_cache=True\n        )\n        return deepseek_tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)\n    finally:\n        unload_deepseek_model()\n\n\ndef generate_bunny(file_path, user_input):\n    global cuda_device\n    try:\n        load_bunny_model()\n        with Image.open(file_path) as image:\n            text = f&quot;A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user&#x27;s questions. USER: &lt;image&gt;\\n{user_input} ASSISTANT:&quot;\n            text_chunks = [bunny_tokenizer(chunk).input_ids for chunk in text.split(&#x27;&lt;image&gt;&#x27;)]\n            input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][1:], dtype=torch.long).unsqueeze(0).to(\n                f&quot;cuda:{cuda_device}&quot;)\n            image_tensor = bunny_model.process_images(\n                [image],\n                bunny_model.config\n            ).to(\n                dtype=bunny_model.dtype,\n                device=bunny_model.device\n            )\n            output_ids = bunny_model.generate(\n                input_ids,\n                images=image_tensor,\n                max_new_tokens=896,\n                use_cache=True,\n                repetition_penalty=1.0\n            )[0]\n            return bunny_tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()\n    finally:\n        unload_bunny_model()\n\n\n# Function to modify the chat history before it is used for text generation\ndef history_modifier(history):\n    # Extract all entries from the &quot;internal&quot; history\n    internal_entries = history[&quot;internal&quot;]\n    # Iterate over the &quot;internal&quot; history entries\n    for internal_index, internal_entry in enumerate(internal_entries):\n        # Extract the text content of the internal entry\n        internal_text = internal_entry[1]\n        # Search for the &quot;File location&quot; trigger phrase in the internal text\n        file_location_matches = file_location_pattern.findall(internal_text)\n        if file_location_matches:\n            # Iterate over each match found in the &quot;internal&quot; entry\n            for file_path in file_location_matches:\n                # Construct the full match string including the trigger phrase\n                full_match_string = f&quot;File location: {file_path}&quot;\n                # Search for the exact same string in the &quot;visible&quot; history\n                for visible_entry in history[&quot;visible&quot;]:\n                    # Extract the text content of the visible entry\n                    visible_text = visible_entry[1]\n                    # If the &quot;visible&quot; entry contains the full match string\n                    if full_match_string in visible_text:\n                        # Split the &quot;visible&quot; text at the full match string\n                        _, after_match = visible_text.split(full_match_string, 1)\n                        # Find the position where the &quot;.png&quot; part ends in the &quot;internal&quot; text\n                        png_end_pos = internal_text.find(file_path) + len(file_path)\n                        # If the &quot;.png&quot; part is found and there is content after it\n                        if png_end_pos &lt; len(internal_text) and internal_text[png_end_pos] == &quot;\\n&quot;:\n                            # Extract the existing content after the &quot;.png&quot; part in the &quot;internal&quot; text\n                            _ = internal_text[png_end_pos:]\n                            # Replace the existing content after the &quot;.png&quot; part in the &quot;internal&quot; text\n                            # with the corresponding content from the &quot;visible&quot; text\n                            new_internal_text = internal_text[:png_end_pos] + after_match\n                            # Update the &quot;internal&quot; history entry with the new text\n                            history[&quot;internal&quot;][internal_index][1] = new_internal_text\n                        # If there is no content after the &quot;.png&quot; part in the &quot;internal&quot; text,\n                        # append the content from the &quot;visible&quot; text directly\n                        else:\n                            # Append the content after the full match string from the &quot;visible&quot; text\n                            history[&quot;internal&quot;][internal_index][1] += after_match\n    return history\n\n\nPay attention to the history_modifier.\n\n\n2. Lucid_Vision readme:\n\nVideo Demo:\n\nThere is code in this repo to prevent Alltalk from reading the directory names out loud ([here](//github.com/RandomInternetPreson/Lucid_Vision/tree/main?tab=readme-ov-file#to-work-with-alltalk)), the video is older and displays Alltalk reading the directory names however. \n\n//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/8879854b-06d8-49ad-836c-13c84eff6ac9\n\nDownload the full demo video here:\n//github.com/RandomInternetPreson/Lucid_Vision/blob/main/VideoDemo/Lucid_Vision_demoCompBig.mov\n\n* Update June 2 2024, Lucid_Vision now supports Bunny-v1_1-Llama-3-8B-V, again thanks to //github.com/justin-luoma :3\n\n* Update May 31 2024, many thanks to //github.com/justin-luoma, they have made several recent updates to the code, now merged with this repo.\n     * Removed the need to communicate with deepseek via the cli, this removes the only extra dependency needed to run Lucid_Vision.\n     * Reduced the number of entries needed for the config file.\n     * Really cleaned up the code, it is much better organized now.\n     * Added the ability to switch which gpu loads the vision model in the UI.\n \n* Bonus Update for May 31 2024, WizardLM-2-8x22B and I figured out how to prevent Alltalk from reading the file directory locations!! //github.com/RandomInternetPreson/Lucid_Vision/tree/main?tab=readme-ov-file#to-work-with-alltalk\n  \n* Update May 30 2024, Lucid_Vision now supports MiniCPM-Llama3-V-2_5, thanks to //github.com/justin-luoma.  Additionally WizardLM-2-8x22B added the functionality to load in the MiniCPM 4-bit model.\n\n* Updated script.py and config file, model was not previously loading to the user assigned gpu\n\nExperimental, and I am currently working to improve the code; but it may work for most.\n\nTodo:\n\n* Right now the temp setting and parameters for each model are coded in the .py script, I intend to bring these to the UI\n  \n[done](//github.com/RandomInternetPreson/Lucid_Vision/tree/main?tab=readme-ov-file#to-work-with-alltalk)~~* Make it comptable with Alltalk, right now each image directory is printed out and this will be read out loud by Alltalk~~\n\nTo accurately proportion credit (original repo creation):\n\nWizardLM-2-8x22B (quantized with exllamaV2 to 8bit precision) = 90% of all the work done in this repo.  That model wrote 100% of all the code and most of the introduction to this repo. \n\nCommandR+ (quantized with exllamaV2 to 8bit precision) = ~5% of all the work.  CommandR+ contextualized the coding examples and rules for making extensions from Oogabooga&#x27;s textgen repo extreamly well, and provided a good foundation to develope the code.\n\nRandomInternetPreson = 5% of all the work done.  I came up with the original idea, the original general outline of how the pieces would interact, and provided feedback to the WizardLM model, but I did not write any code.  I&#x27;m actually not very good with python yet, with most of my decades of coding being in Matlab.\n\nMy goal from the beginning was to write this extension offline without any additional resources, sometimes it was a little frusturating but I soon understood how to get want I needed from the models running locally.\n\nI would say that most of the credit should go to Oobabooga, for without them I would be struggling to even interact with my models.  Please consider supporting them:\n\n//github.com/sponsors/oobabooga\n\nor \n\n//ko-fi.com/oobabooga\n\nI am their top donor on ko-fi (Mr. A) and donate 15$ montly, their software is extreamly important to the opensource community.\n\n\n# Lucid_Vision Extension for Oobabooga&#x27;s textgen-webui\n\nWelcome to the Lucid Vision Extension repository! This extension enhances the capabilities of textgen-webui by integrating advanced vision models, allowing users to have contextualized conversations about images with their favorite language models; and allowing direct communciation with vision models.\n\n## Features\n\n* Multi-Model Support: Interact with different vision models, including PhiVision, DeepSeek, MiniCPM-Llama3-V-2_5 (4bit and normal precision), Bunny-v1_1-Llama-3-8B-V, and PaliGemma, with options for both GPU and CPU inference.\n\n* On-Demand Loading: Vision models are loaded into memory only when needed to answer a question, optimizing resource usage.\n\n* Seamless Integration: Easily switch between vision models using a Gradio UI radio button selector.\n\n* Cross-Platform Compatibility: The extension is designed to work on various operating systems, including Unix-based systems and Windows. (not tested in Windows yet, but should probably work?)\n\n* Direct communication with vision models, you do not need to load a LLM to interact with the separate vision models.\n\n## How It Works\n\nThe Lucid Vision Extension operates by intercepting and modifying user input and output within the textgen-webui framework. When a user uploads an image and asks a question, the extension appends a special trigger phrase (&quot;File location:&quot;) and extracts the associated file path and question.\n\nSo if a user enters text into the &quot;send a message&quot; field and has a new picture uploaded into the Lucid_Vision ui, what will happen behind the scenes is the at the user message will be appended with the &quot;File Location: (file location)&quot; Trigger phrase, at which point the LLM will see this and understand that it needs to reply back with questions about the image, and that those questions are being sent to a vison model.\n\nThe cool thing is that let&#x27;s say later in the conversation you want to know something specific about a previous picture, all you need to do is ask your LLM, YOU DO NOT NEED TO REUPLOAD THE PICTURE, the LLM should be able to interact with the extension on its own after you uploaded your first picture.\n\nThe selected vision directly interacts with the model&#x27;s Python API to generate a response. The response is then appended to the chat history, providing the user with detailed insights about the image.\n\nThe extension is designed to be efficient with system resources by only loading the vision models into memory when they are actively being used to process a question. After generating a response, the models are immediately unloaded to free up memory and GPU VRAM.\n\n\n## **How to install and setup:**\n\n1.use the latest version of textgen ~~Install this edited prior commit from oobabooga&#x27;s textgen //github.com/RandomInternetPreson/textgen_webui_Lucid_Vision_Testing OR use the latest version of textgen.~~ ~~If using the edited older version, make sure to rename the install folder `text-generation-webui`~~\n   \n~~(Note, a couple months ago gradio had a massive update.  For me, this has caused a lot of glitches and errors with extensions; I&#x27;ve briefly tested the Lucid_Vision extension in the newest implementation of textgen and it will work.  However, I was getting timeout popups when vision models were loading for the first time, gradio wasn&#x27;t waiting for the response from the model upon first load. After a model is loaded once, it is saved in cpu ram cache (this doesn&#x27;t actively use your ram, it just uses what is free to keep the models in memory so they are quickly reloaded into gpu ram if necessary) and gradio doesn&#x27;t seem to timeout as often.  The slightly older version of textgen that I&#x27;ve edited does not experience this issue)~~\n\n~~2. Update the transformers library using the cmd_yourOShere.sh/bat file (so either cmd_linux.sh, cmd_macos.sh, cmd_windows.bat, or cmd_wsl.bat) and entering the following lines.  If you run the update wizard after this point, it will overrite this update to transformers.  The newest transformers package has the libraries for paligemma, which the code needs to import regardless of whether or not you are intending to use the model.~~\n\nNo longer need to update transformers and the latest version of textgen as of this writing 1.9.1 works well with many of the gradio issues resolved, however gradio might timeout on the page when loading a model for the fisrt time still.  After a model is loaded once, it is saved in cpu ram cache (this doesn&#x27;t actively use your ram, it just uses what is free to keep the models in memory so they are quickly reloaded into gpu ram if necessary) and gradio doesn&#x27;t seem to timeout as often.\n\n## Model Information\n\nIf you do not want to install DeepseekVL dependencies and are not intending to use deepseek comment out the import lines from the script.py file as displayed here\n\n```\n# pip install pexpect\nimport json\nimport re\nfrom datetime import datetime  # Import datetime for timestamp generation\nfrom pathlib import Path\n\nimport gradio as gr\nimport torch\nfrom PIL import Image\n#from deepseek_vl.models import VLChatProcessor\n#from deepseek_vl.utils.io import load_pil_images as load_pil_images_for_deepseek\nfrom transformers import AutoModelForCausalLM, AutoModel, AutoTokenizer, AutoProcessor, \\\n    PaliGemmaForConditionalGeneration\n\nmodel_names = [&quot;phiVision&quot;, &quot;DeepSeek&quot;, &quot;paligemma&quot;, &quot;paligemma_cpu&quot;, &quot;minicpm_llama3&quot;, &quot;bunny&quot;]\n```\n\n3. Install **DeepseekVL** if you intend on using that model\n   \n   Clone the repo: //github.com/deepseek-ai/DeepSeek-VL into the `repositories` folder of your textgen install\n\n   Open cmd_yourOShere.sh/bat, navigate to the `repositories/DeepSeek-VL` folder via the terminal using `cd your_directory_here` and enter this into the command window:\n\n   ```\n   pip install -e .\n   ```\n   Download the deepseekvl model here: //huggingface.co/deepseek-ai/deepseek-vl-7b-chat\n   \n   They have different and smaller models to choose from: //github.com/deepseek-ai/DeepSeek-VL?tab=readme-ov-file#3-model-downloads\n\n4. If you want to use **Phi-3-vision-128k-instruct**, download it here: //huggingface.co/microsoft/Phi-3-vision-128k-instruct\n\n5. If you want to use **paligemma-3b**, download it here: //huggingface.co/google/paligemma-3b-ft-cococap-448 (this is just one out of many fine-tunes google provides)\n   \n   Read this blog on how to inference with the model: //huggingface.co/blog/paligemma\n\n6. If you want to use **MiniCPM-Llama3-V-2_5**, download it here: //huggingface.co/openbmb/MiniCPM-Llama3-V-2_5\n\n   The **4-bit** verison of the model can be downloaded here: //huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4\n\n   **Notes about 4-bit MiniCPM:**\n   *  It might not look like the model fully unloads from vram, but it does and the vram will be reclaimed if another program needs it\n   *  Your directory folder where the model is stored needs to have the term &quot;int4&quot; in it, this is how the extension identifies the 4bit nature of the model\n\n7. If you want to use **Bunny-v1_1-Llama-3-8B-V**, download it here: //huggingface.co/BAAI/Bunny-v1_1-Llama-3-8B-V\n   \n    **Notes about Bunny-v1_1-Llama-3-8B-V:**\n    *  When running for the first time the model needs internet access so it can download models--google--siglip-so400m-patch14-384 to your cache/huggingface directory.  This is an additional 3.5GB file the model needs to run.\n\n## Updating the config file\n\n8. Before using the extension you need to update the config file; open it in a text editor *Note No quotes around gpu #:\n```\n{\n    &quot;image_history_dir&quot;: &quot;(fill_In)/extensions/Lucid_Vision/ImageHistory/&quot;,\n    &quot;cuda_visible_devices&quot;: 0,\n    &quot;default_vision_model&quot;: &quot;phiVision&quot;,\n    &quot;phiVision_model_id&quot;: &quot;(fill_In)&quot;,\n    &quot;paligemma_model_id&quot;: &quot;(fill_In)&quot;,\n    &quot;paligemma_cpu_model_id&quot;: &quot;(fill_In)&quot;,\n    &quot;minicpm_llama3_model_id&quot;: &quot;(fill_In)&quot;,\n    &quot;deepseek_vl_model_id&quot;: &quot;(fill_in)&quot;,\n    &quot;bunny_model_id&quot;: &quot;(fill_in)&quot;\n}\n```\nIf your install directory is /home/username/Desktop/oobLucidVision/text-generation-webui/  the config file will look like this for example:\n\nMake note that you want to change / to \\ if you are on Windows\n\n```\n   {\n    &quot;image_history_dir&quot;: &quot;/home/username/Desktop/oobLucidVision/text-generation-webui/extensions/Lucid_Vision/ImageHistory/&quot;,\n    &quot;cuda_visible_devices&quot;: 0,\n    &quot;default_vision_model&quot;: &quot;phiVision&quot;,\n    &quot;phiVision_model_id&quot;: &quot;(fill_In)&quot;, *This is the folder where your phi-3 vision model is stored\n    &quot;paligemma_model_id&quot;: &quot;(fill_In)&quot;, *This is the folder where your paligemma vision model is stored\n    &quot;paligemma_cpu_model_id&quot;: &quot;(fill_In)&quot;, *This is the folder where your paligemma vision model is stored\n    &quot;minicpm_llama3_model_id&quot;: &quot;(fill_In)&quot;, *This is the folder where your minicpm_llama3 vision model is stored, the model can either be the normal fp16 or 4-bit version\n    &quot;deepseek_vl_model_id&quot;: &quot;(fill_in)&quot;,  *This is the folder where your deepseek vision model is stored\n    &quot;bunny_model_id&quot;: &quot;(fill_in)&quot; *This is the folder where your Bunny-v1_1-Llama-3-8B-V vision model is stored\n   }\n```\n\n## To work with Alltalk\n\nAlltalk is great!! An extension I use all the time: //github.com/erew123/alltalk_tts\n\nTo get Lucid_Vision to work, the LLM needs to repeat the file directory of an image, and in doing so Alltalk will want to transcribe that text to audo.  It is annoying to have the tts model try an transcribe file directories.  If you want to get Alltalk to work well wtih Lucid_Vision you need to replace the code here in the script.py file that comes with Alltalk (search for the &quot;IMAGE CLEANING&quot; part of the code):\n\n```\n########################\n#### IMAGE CLEANING ####\n########################\nOriginalLucidVisionText = &quot;&quot;\n\n# This is the existing pattern for matching both images and file location text\nimg_pattern = r&#x27;&lt;img[^&gt;]*src\\s*=\\s*[&quot;\\&#x27;][^&quot;\\&#x27;&gt;]+[&quot;\\&#x27;][^&gt;]*&gt;|File location: [^.]+.png&#x27;\n\ndef extract_and_remove_images(text):\n   &quot;&quot;&quot;\n   Extracts all image and file location data from the text and removes it for clean TTS processing.\n   Returns the cleaned text and the extracted image and file location data.\n   &quot;&quot;&quot;\n   global OriginalLucidVisionText\n   OriginalLucidVisionText = text  # Update the global variable with the original text\n\n   img_matches = re.findall(img_pattern, text)\n   img_info = &quot;\\n&quot;.join(img_matches)  # Store extracted image and file location data\n   cleaned_text = re.sub(img_pattern, &#x27;&#x27;, text)  # Remove images and file locations from text\n   return cleaned_text, img_info\n\ndef reinsert_images(cleaned_string, img_info):\n  &quot;&quot;&quot;\n  Reinserts the previously extracted image and file location data back into the text.\n  &quot;&quot;&quot;\n  global OriginalLucidVisionText\n\n  # Check if there are images or file locations to reinsert\n  if img_info:\n     # Check if the &quot;Vision Model Responses:&quot; phrase is present in the original text\n     if re.search(r&#x27;Vision Model Responses:&#x27;, OriginalLucidVisionText):\n        # If present, return the original text as is, without modifying it\n        return OriginalLucidVisionText\n     else:\n        # If not present, append the img_info to the end of the cleaned string\n        cleaned_string += f&quot;\\n\\n{img_info}&quot;\n  return cleaned_string\n\n\n#################################\n#### TTS STANDARD GENERATION ####\n#################################\n```\n\n\n## **Quirks and Notes:**\n1. **When you load a picture once, it is used once.  Even if the image stays present in the UI element on screen, it is not actively being used.**\n2. If you are using other extensions, load Lucid_Vision first.  Put it first in your CMD_FLAGS.txt file or make sure to check it first in the sequence of check boxes in the session tab UI.\n3. DeepseekVL can take a while to load initially, that&#x27;s just the way it is.\n   \n\n# **How to use:**\n\n## **BASICS:**\n\n**When you load a picture once, it is used once.  Even if the image stays present in the UI element on screen, it is not actively being used.**\n\nOkay the extension can do many different things with varying levels of difficulty.\n\nStarting out with the basics and understanding how to talk with your vision models:\n\nScroll down past where you would normally type something to the LLM, you do not need a large language model loaded to use this function.\n![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/227ff483-5041-46a7-9b5b-a8f9dd3c673e)\n\nStart out by interacting with the vision models without involvement of a seperate LLM model by pressing the `Ask Vision Model` button\n![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/4530e13f-30a1-43d9-8383-c05e31ddb5d7)\n\n**Note paligemma requries a little diffent type prompting sometimes, read the blog on how to inference with it: //huggingface.co/blog/paligemma**\n\nDo this with every model you intend on using, upload a picture, and ask a question \n\n\n## **ADVANCED:  Updated Tips At End on how to get working with Llama-3-Instruct-8B-SPPO-Iter3 //huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3**\n\nOkay, this is why I built the extension in the first place; direct interaction with the vision model was actually an afterthought after I had all the code working.\n\nThis is how to use your favorite LLM WITH an additional Vision Model, I wanted a way to give my LLMs eyes essentially.  I realize that good multimodal models are likely around the corner, but until they match the intellect of very good LLMs, I&#x27;d rather have a different vision model work with a good LLM.\n\nTo use Lucid_Vision as intended requires a little bit of setup:\n\n1. In `Parameters` then `chat` load the &quot;AI_Image&quot; character (this card is in the edited older commit, if using your own version of textgen the character card is here: //github.com/RandomInternetPreson/Lucid_Vision/blob/main/AI_Image.yaml put it in the `characters` folder of the textgen install folder:\n   ![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/8b5d770a-b72a-4a74-aa4a-f67fe2a113ba)\n\n2. In `Parameters` then `Generation` under `Custom stopping strings` enter &quot;Vision Model Responses:&quot; exactly as shown:\n   ![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/7a09650d-4fce-4a8f-badb-b26b4484bf37)\n\n3. Test to see if your LLM understands the instructions for the AI_Image character; ask it this:\n   `Can you tell me what is in this image? Do not add any additional text to the beginning of your response to me, start your reply with the questions.`\n   ![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/817605d7-dcab-4e6c-8c39-6d119a189431)\n\n   I uploaded my image to the Lucid_Vision UI element, then typed my question, then pressed &quot;Generate&quot;; it doesn&#x27;t matter which order you upload your picture or type your question, both are sent to the LLM at the same time.\n   ![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/f1525e8d-551d-4719-9e01-3f26b4362d7c)\n\n4. Continue to give the model more images or just talk about something else, in this example I give the model two new images one of a humanoid robot and one of the nutritional facts for a container of sourcream:\n   ![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/77131490-fb19-4062-a65f-c0880515e252)\n   ![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/fed7021f-edb5-4667-aae0-c05021291618)\n   ![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/d1049551-4e48-4ced-85d9-fbb0441330a9)\n\n   Then I asked the model why the sky is blue:\n\n   ![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/deb7b70a-edf7-4014-bd8c-2492bae00ea6)\n\n5. Now I can ask it a question from a previous image and the LLM will automatically know to prompt the vision model and it will find the previously saved image on its own:\n   ![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/e8b0791b-cce8-4183-8971-42e5383d6c46)\n\n   Please note that the vision model received all this as text:\n   \n   &quot;Certainly! To retrieve the complete contents of the nutritional facts from the image you provided earlier, I will ask the vision model to perform an Optical Character Recognition (OCR) task on the image. Here is the question for the vision model:\n\n    Can you perform OCR on the provided image and extract all the text from the nutrition facts label, including the details of each nutrient and its corresponding values?&quot;\n\nIf the vision model is not that smart (Deepseek, paligemma) then it will have a difficult time contextualizing the text prior to the questions for the vision model.\n\nIf you run into this situation, it is best to prompt your LLM like this:\n\n`Great, can you get the complete contents of the nutritional facts from earlier, like all the text? Just start with your questions to the vision model please.`\n![image](//github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/38b59679-3550-410d-aac2-29756a058c8f)\n\n6. Please do not get frustrated right away, the Advanced method of usage depends heavily on your LLM&#x27;s ability to understand the instructions from the character card.  The instructions were written by the 8x22B model itself, I explained what I wanted the model to do and had it write its own instructions.  This might be a viable alternative if you are struggling to get your model to adhere to the instructions from the character card.\n\nYou may need to explain things to your llm as your conversation progresses, if it tries to query the vision model when you don&#x27;t want it to, just explain that to the model and it&#x27;s unlikely to keep making the same mistake.  \n\n## **Tips on how to get working with Llama-3-Instruct-8B-SPPO-Iter3 //huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3**\n\nI have found to get this model (and likely similar models) working properly with lucid vision I needed to NOT use the AI_Image character card.  \n\nInstead I started the conversation with the model like this:\n\n![image](//github.com/user-attachments/assets/4d0ff97c-f9dd-4043-b5be-7b915cb7aae7)\n\nThen copy and pasted the necessary information from the character card:\n\n```\nThe following is a conversation with an AI Large Language Model. The AI has been trained to answer questions, provide recommendations, and help with decision making. The AI follows user requests. The AI thinks outside the box.\n\nInstructions for Processing Image-Related User Input with Appended File Information:\n\n    Identify the Trigger Phrase: Begin by scanning the user input for the presence of the &quot;File location&quot; trigger phrase. This phrase indicates that the user has selected an image and that the LLM should consider this information in its response.\n\n    Extract the File Path: Once the trigger phrase is identified, parse the text that follows to extract the absolute file path of the image. The file path will be provided immediately after the trigger phrase and will end with the image file extension (e.g., .png, .jpg).\n\n    Understand the Context: Recognize that the user&#x27;s message preceding the file path is the primary context for the interaction. The LLM should address the user&#x27;s query or statement while also incorporating the availability of the selected image into its response.\n\n    Formulate Questions for the Vision Model: Based on the user&#x27;s message and the fact that an image is available for analysis, generate one or more questions that can be answered by the vision model. These questions should be clear, specific, and relevant to the image content.\n\n    Maintain a Conversational Tone: Ensure that the response is natural, coherent, and maintains the flow of the conversation. The LLM should act as an intermediary between the user and the vision model, facilitating a seamless dialogue.\n\n    Prepare the Response Structure: Structure the response so that it includes:\n\n    An acknowledgment of the user&#x27;s initial message.\n\n    The questions formulated for the vision model, each clearly separated (e.g., by a newline or bullet point).\n\n    Any additional information or clarification requests, if necessary.\n\n    Append the File Path for Processing: At the end of the response, re-append the &quot;File location&quot; trigger phrase along with the extracted file path. This ensures that the subsequent processing steps (such as sending the information to the vision model&#x27;s CLI) can correctly identify and use the file path.\n\n    Avoid Direct Interaction with the File System: As an LLM, you do not have the capability to directly access or modify files on the server or client systems. Your role is limited to generating text-based responses that include the necessary file path information for other system components to handle.\n\n    Example Response:\n\n   Based on your question about the contents of the image, here are the questions I will ask the vision model:\n\n   - Can you describe the main objects present in the image?\n   - Is there any text visible in the image, and if so, what does it say?\n   - What appears to be happening in the scene depicted in the image?\n\n   File location: //home/myself/Pictures/rose.png\n   \n   Do review and contextualize the conversation as it develops (reference your context) to infer if the user is asking new questions of previous images.  Reference the parts of the convesation that are likely to yeild the file location of the image in question, and formulate your response to include that specific file location of that unique .png file, make sure you are referencing the correct .png file as per the part of the conversation that is likely to be in reference to the updated information request.\n\nBy following these instructions, the LLM will be able to effectively process user inputs that include image file information and generate appropriate responses that facilitate the interaction between the user and the vision model.\n```\n\nMaking sure to give a real file location to a real image.\n\nAfter that it started to pretty much work but I still needed to correct the model (I had to add the spelling error intentionally because if the user sends &quot;File Location:&quot; is messes up the extension:\n\n![image](//github.com/user-attachments/assets/5090ef8b-88ff-4ec1-a131-ba57e15e9a7d)\n\nThe point to take away is that you may need to explain things to your model in various ways for it to contextualize the instructions approprately.\n\n\n3. markdown instructions for extension creation from textgen repo:\n\n# Extensions\n\nExtensions are defined by files named `script.py` inside subfolders of `text-generation-webui/extensions`. They are loaded at startup if the folder name is specified after the `--extensions` flag.\n\nFor instance, `extensions/silero_tts/script.py` gets loaded with `python server.py --extensions silero_tts`.\n\n## [text-generation-webui-extensions](https://github.com/oobabooga/text-generation-webui-extensions)\n\nThe repository above contains a directory of user extensions.\n\nIf you create an extension, you are welcome to host it in a GitHub repository and submit a PR adding it to the list.\n\n## Built-in extensions\n\n|Extension|Description|\n|---------|-----------|\n|[openai](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/openai)| Creates an API that mimics the OpenAI API and can be used as a drop-in replacement. |\n|[multimodal](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal) | Adds multimodality support (text+images). For a detailed description see [README.md](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal/README.md) in the extension directory. |\n|[google_translate](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/google_translate)| Automatically translates inputs and outputs using Google Translate.|\n|[silero_tts](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/silero_tts)| Text-to-speech extension using [Silero](https://github.com/snakers4/silero-models). When used in chat mode, responses are replaced with an audio widget. |\n|[whisper_stt](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/whisper_stt)| Allows you to enter your inputs in chat mode using your microphone. |\n|[sd_api_pictures](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/sd_api_pictures)| Allows you to request pictures from the bot in chat mode, which will be generated using the AUTOMATIC1111 Stable Diffusion API. See examples [here](https://github.com/oobabooga/text-generation-webui/pull/309). |\n|[character_bias](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/character_bias)| Just a very simple example that adds a hidden string at the beginning of the bot&#x27;s reply in chat mode. |\n|[send_pictures](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/send_pictures/)| Creates an image upload field that can be used to send images to the bot in chat mode. Captions are automatically generated using BLIP. |\n|[gallery](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/gallery/)| Creates a gallery with the chat characters and their pictures. |\n|[superbooga](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/superbooga)| An extension that uses ChromaDB to create an arbitrarily large pseudocontext, taking as input text files, URLs, or pasted text. Based on https://github.com/kaiokendev/superbig. |\n|[ngrok](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/ngrok)| Allows you to access the web UI remotely using the ngrok reverse tunnel service (free). It&#x27;s an alternative to the built-in Gradio `--share` feature. |\n|[perplexity_colors](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/perplexity_colors)| Colors each token in the output text by its associated probability, as derived from the model logits. |\n\n## How to write an extension\n\nThe extensions framework is based on special functions and variables that you can define in `script.py`. The functions are the following:\n\n| Function        | Description |\n|-------------|-------------|\n| `def setup()` | Is executed when the extension gets imported. |\n| `def ui()` | Creates custom gradio elements when the UI is launched. | \n| `def custom_css()` | Returns custom CSS as a string. It is applied whenever the web UI is loaded. |\n| `def custom_js()` | Same as above but for javascript. |\n| `def input_modifier(string, state, is_chat=False)`  | Modifies the input string before it enters the model. In chat mode, it is applied to the user message. Otherwise, it is applied to the entire prompt. |\n| `def output_modifier(string, state, is_chat=False)`  | Modifies the output string before it is presented in the UI. In chat mode, it is applied to the bot&#x27;s reply. Otherwise, it is applied to the entire output. |\n| `def chat_input_modifier(text, visible_text, state)` | Modifies both the visible and internal inputs in chat mode. Can be used to hijack the chat input with custom content. |\n| `def bot_prefix_modifier(string, state)`  | Applied in chat mode to the prefix for the bot&#x27;s reply. |\n| `def state_modifier(state)`  | Modifies the dictionary containing the UI input parameters before it is used by the text generation functions. |\n| `def history_modifier(history)`  | Modifies the chat history before the text generation in chat mode begins. |\n| `def custom_generate_reply(...)` | Overrides the main text generation function. |\n| `def custom_generate_chat_prompt(...)` | Overrides the prompt generator in chat mode. |\n| `def tokenizer_modifier(state, prompt, input_ids, input_embeds)` | Modifies the `input_ids`/`input_embeds` fed to the model. Should return `prompt`, `input_ids`, `input_embeds`. See the `multimodal` extension for an example. |\n| `def custom_tokenized_length(prompt)` | Used in conjunction with `tokenizer_modifier`, returns the length in tokens of `prompt`. See the `multimodal` extension for an example. |\n\nAdditionally, you can define a special `params` dictionary. In it, the `display_name` key is used to define the displayed name of the extension in the UI, and the `is_tab` key is used to define whether the extension should appear in a new tab. By default, extensions appear at the bottom of the &quot;Text generation&quot; tab.\n\nExample:\n\n```python\nparams = {\n    &quot;display_name&quot;: &quot;Google Translate&quot;,\n    &quot;is_tab&quot;: True,\n}\n```\n\nThe `params` dict may also contain variables that you want to be customizable through a `settings.yaml` file. For instance, assuming the extension is in `extensions/google_translate`, the variable `language string` in\n\n```python\nparams = {\n    &quot;display_name&quot;: &quot;Google Translate&quot;,\n    &quot;is_tab&quot;: True,\n    &quot;language string&quot;: &quot;jp&quot;\n}\n```\n\ncan be customized by adding a key called `google_translate-language string` to `settings.yaml`:\n\n```python\ngoogle_translate-language string: &#x27;fr&#x27;\n``` \n\nThat is, the syntax for the key is `extension_name-variable_name`.\n\n## Using multiple extensions at the same time\n\nYou can activate more than one extension at a time by providing their names separated by spaces after `--extensions`. The input, output, and bot prefix modifiers will be applied in the specified order. \n\nExample:\n\n```\npython server.py --extensions enthusiasm translate # First apply enthusiasm, then translate\npython server.py --extensions translate enthusiasm # First apply translate, then enthusiasm\n```\n\nDo note, that for:\n- `custom_generate_chat_prompt`\n- `custom_generate_reply`\n- `custom_tokenized_length`\n\nonly the first declaration encountered will be used and the rest will be ignored. \n\n## A full example\n\nThe source code below can be found at [extensions/example/script.py](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/example/script.py).\n\n```python\n&quot;&quot;&quot;\nAn example of extension. It does nothing, but you can add transformations\nbefore the return statements to customize the webui behavior.\n\nStarting from history_modifier and ending in output_modifier, the\nfunctions are declared in the same order that they are called at\ngeneration time.\n&quot;&quot;&quot;\n\nimport gradio as gr\nimport torch\nfrom transformers import LogitsProcessor\n\nfrom modules import chat, shared\nfrom modules.text_generation import (\n    decode,\n    encode,\n    generate_reply,\n)\n\nparams = {\n    &quot;display_name&quot;: &quot;Example Extension&quot;,\n    &quot;is_tab&quot;: False,\n}\n\nclass MyLogits(LogitsProcessor):\n    &quot;&quot;&quot;\n    Manipulates the probabilities for the next token before it gets sampled.\n    Used in the logits_processor_modifier function below.\n    &quot;&quot;&quot;\n    def __init__(self):\n        pass\n\n    def __call__(self, input_ids, scores):\n        # probs = torch.softmax(scores, dim=-1, dtype=torch.float)\n        # probs[0] /= probs[0].sum()\n        # scores = torch.log(probs / (1 - probs))\n        return scores\n\ndef history_modifier(history):\n    &quot;&quot;&quot;\n    Modifies the chat history.\n    Only used in chat mode.\n    &quot;&quot;&quot;\n    return history\n\ndef state_modifier(state):\n    &quot;&quot;&quot;\n    Modifies the state variable, which is a dictionary containing the input\n    values in the UI like sliders and checkboxes.\n    &quot;&quot;&quot;\n    return state\n\ndef chat_input_modifier(text, visible_text, state):\n    &quot;&quot;&quot;\n    Modifies the user input string in chat mode (visible_text).\n    You can also modify the internal representation of the user\n    input (text) to change how it will appear in the prompt.\n    &quot;&quot;&quot;\n    return text, visible_text\n\ndef input_modifier(string, state, is_chat=False):\n    &quot;&quot;&quot;\n    In default/notebook modes, modifies the whole prompt.\n\n    In chat mode, it is the same as chat_input_modifier but only applied\n    to &quot;text&quot;, here called &quot;string&quot;, and not to &quot;visible_text&quot;.\n    &quot;&quot;&quot;\n    return string\n\ndef bot_prefix_modifier(string, state):\n    &quot;&quot;&quot;\n    Modifies the prefix for the next bot reply in chat mode.\n    By default, the prefix will be something like &quot;Bot Name:&quot;.\n    &quot;&quot;&quot;\n    return string\n\ndef tokenizer_modifier(state, prompt, input_ids, input_embeds):\n    &quot;&quot;&quot;\n    Modifies the input ids and embeds.\n    Used by the multimodal extension to put image embeddings in the prompt.\n    Only used by loaders that use the transformers library for sampling.\n    &quot;&quot;&quot;\n    return prompt, input_ids, input_embeds\n\ndef logits_processor_modifier(processor_list, input_ids):\n    &quot;&quot;&quot;\n    Adds logits processors to the list, allowing you to access and modify\n    the next token probabilities.\n    Only used by loaders that use the transformers library for sampling.\n    &quot;&quot;&quot;\n    processor_list.append(MyLogits())\n    return processor_list\n\ndef output_modifier(string, state, is_chat=False):\n    &quot;&quot;&quot;\n    Modifies the LLM output before it gets presented.\n\n    In chat mode, the modified version goes into history[&#x27;visible&#x27;],\n    and the original version goes into history[&#x27;internal&#x27;].\n    &quot;&quot;&quot;\n    return string\n\ndef custom_generate_chat_prompt(user_input, state, **kwargs):\n    &quot;&quot;&quot;\n    Replaces the function that generates the prompt from the chat history.\n    Only used in chat mode.\n    &quot;&quot;&quot;\n    result = chat.generate_chat_prompt(user_input, state, **kwargs)\n    return result\n\ndef custom_css():\n    &quot;&quot;&quot;\n    Returns a CSS string that gets appended to the CSS for the webui.\n    &quot;&quot;&quot;\n    return &#x27;&#x27;\n\ndef custom_js():\n    &quot;&quot;&quot;\n    Returns a javascript string that gets appended to the javascript\n    for the webui.\n    &quot;&quot;&quot;\n    return &#x27;&#x27;\n\ndef setup():\n    &quot;&quot;&quot;\n    Gets executed only once, when the extension is imported.\n    &quot;&quot;&quot;\n    pass\n\ndef ui():\n    &quot;&quot;&quot;\n    Gets executed when the UI is drawn. Custom gradio elements and\n    their corresponding event handlers should be defined here.\n\n    To learn about gradio components, check out the docs:\n    https://gradio.app/docs/\n    &quot;&quot;&quot;\n    pass\n```\n\nStudy these documents and contextualize the changes you need to make, look at the Lucid_Vision extension code and try to understand the parts we need to use and apply to the Lucid_Autonomy extension see how it edits the history. Look at the markdown extensions document from the textgen repo to try and understand how to manipulate the history. Also very importantly look at how the example chat I gave you is structured to help contextualize how you can identify how to delete data.\n",
            "Thank you for providing the detailed information. Let&#x27;s break down the steps to address the issue with the `compressedresults.json` being added multiple times to the chat history. We&#x27;ll use the `history_modifier` function to clean up the chat history before appending new `compressedresults.json` contents.\n\n### Steps to Implement:\n\n1. **Define Bookends**: Use unique strings to act as bookends for the `compressedresults.json` contents.\n2. **Search and Delete Function**: Create a function to search for these bookends in the chat history and remove the contents between them.\n3. **Append New Contents**: After cleaning the chat history, append the new `compressedresults.json` contents wrapped in the bookends.\n4. **Cleanup Internal History**: Ensure that the cleanup process is applied to the internal history as well.\n\n### Example Workflow:\n\n1. **Define Bookends**:\n   ```python\n   START_BOOKEND = &quot;&lt;START_COMPRESSED_RESULTS&gt;&quot;\n   END_BOOKEND = &quot;&lt;END_COMPRESSED_RESULTS&gt;&quot;\n   ```\n\n2. **Search and Delete Function**:\n   ```python\n   def clean_chat_history(history):\n       for entry in history:\n           if isinstance(entry, list) and len(entry) == 2:\n               entry[1] = entry[1].replace(f&quot;{START_BOOKEND}.*{END_BOOKEND}&quot;, &quot;&quot;)\n       return history\n   ```\n\n3. **Append New Contents**:\n   ```python\n   def append_compressed_results(output, compressed_results):\n       output += f&quot;\\n\\n{START_BOOKEND}\\n{json.dumps(compressed_results, indent=4)}\\n{END_BOOKEND}&quot;\n       return output\n   ```\n\n4. **Cleanup Internal History**:\n   ```python\n   def history_modifier(history):\n       # Clean the internal history\n       history[&quot;internal&quot;] = clean_chat_history(history[&quot;internal&quot;])\n       # Clean the visible history\n       history[&quot;visible&quot;] = clean_chat_history(history[&quot;visible&quot;])\n       return history\n   ```\n\n### Integration:\n\n- **Modify `output_modifier`**:\n  ```python\n  def output_modifier(output, state):\n      if &quot;Autonomy_Tasks:&quot; in output:\n          # Clean the chat history\n          state[&quot;history&quot;] = history_modifier(state[&quot;history&quot;])\n          # Extract tasks and execute them\n          tasks_section = output.split(&quot;Autonomy_Tasks:&quot;)[-1].strip()\n          if tasks_section:\n              execute_tasks(tasks_section)\n          # Append new compressed results\n          compressed_results_json_path = &quot;extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json&quot;\n          with open(compressed_results_json_path, &#x27;r&#x27;) as f:\n              compressed_results = json.load(f)\n          output = append_compressed_results(output, compressed_results)\n      return output\n  ```\n\n### Full Example:\n\nHere&#x27;s how the complete code might look:\n\n```python\nimport gradio as gr\nimport json\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport torch\nimport os\nimport shutil\n\n# Configurable variables\nMONITOR_INDEX = 1  # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nTEXT_QUERIES = &quot;clickable hyperlinks, clickable icons, text bar field&quot;  # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.20  # Score threshold for filtering low-probability predictions\nVISION_MODEL_QUESTION = &quot;What is the text in this image or what does this image describe?&quot;\n\n# Use GPU if available\nif torch.cuda.is_available():\n    device = torch.device(&quot;cuda&quot;)\nelse:\n    device = torch.device(&quot;cpu&quot;)\n\n# Global variables for models\nowlv2_model = None\nowlv2_processor = None\nminicpm_llama_model = None\nminicpm_llama_tokenizer = None\n\nVISION_MODEL_ID = &quot;/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5&quot;\n\ndef load_owlv2_model():\n    global owlv2_model, owlv2_processor\n    owlv2_model = Owlv2ForObjectDetection.from_pretrained(&quot;google/owlv2-base-patch16-ensemble&quot;, trust_remote_code=True).to(device)\n    owlv2_processor = Owlv2Processor.from_pretrained(&quot;google/owlv2-base-patch16-ensemble&quot;, trust_remote_code=True)\n    print(&quot;Owlv2 model loaded.&quot;)\n\ndef unload_owlv2_model():\n    global owlv2_model, owlv2_processor\n    if owlv2_model is not None:\n        del owlv2_model\n        del owlv2_processor\n        torch.cuda.empty_cache()\n        print(&quot;Owlv2 model unloaded.&quot;)\n\ndef load_minicpm_llama_model():\n    global minicpm_llama_model, minicpm_llama_tokenizer\n    if &quot;int4&quot; in VISION_MODEL_ID:\n        # Load the 4-bit quantized model and tokenizer\n        minicpm_llama_model = AutoModel.from_pretrained(\n            VISION_MODEL_ID,\n            device_map={&quot;&quot;: device},\n            trust_remote_code=True,\n            torch_dtype=torch.float16  # Use float16 as per the example code for 4-bit models\n        ).eval()\n        minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n            VISION_MODEL_ID,\n            trust_remote_code=True\n        )\n        print(&quot;MiniCPM-Llama3 4-bit quantized model loaded on-demand.&quot;)\n    else:\n        # Load the standard model and tokenizer\n        minicpm_llama_model = AutoModel.from_pretrained(\n            VISION_MODEL_ID,\n            device_map={&quot;&quot;: device},  # Use the specified CUDA device\n            trust_remote_code=True,\n            torch_dtype=torch.bfloat16\n        ).eval()\n        minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n            VISION_MODEL_ID,\n            trust_remote_code=True\n        )\n        print(&quot;MiniCPM-Llama3 standard model loaded on-demand.&quot;)\n\ndef unload_minicpm_llama_model():\n    global minicpm_llama_model, minicpm_llama_tokenizer\n    if minicpm_llama_model is not None:\n        del minicpm_llama_model\n        del minicpm_llama_tokenizer\n        torch.cuda.empty_cache()\n        print(&quot;MiniCPM-Llama3 model unloaded.&quot;)\n\ndef take_screenshot(monitor_index, text_queries, score_threshold, vision_model_question, output_filename=&quot;screenshot.png&quot;):\n    &quot;&quot;&quot;\n    Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n    Args:\n        monitor_index (int): The index of the monitor to capture.\n        text_queries (str): Comma-separated text queries for object detection.\n        score_threshold (float): The score threshold for filtering low-probability predictions.\n        vision_model_question (str): The question to ask the vision model.\n        output_filename (str): The filename to save the screenshot as (default is &quot;screenshot.png&quot;).\n    &quot;&quot;&quot;\n    monitor_index = int(monitor_index)\n    score_threshold = float(score_threshold)\n\n    # Get information about all monitors\n    monitors = screeninfo.get_monitors()\n\n    if monitor_index &gt;= len(monitors):\n        raise ValueError(f&quot;Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.&quot;)\n\n    # Get the region of the specified monitor\n    monitor = monitors[monitor_index]\n    region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n   \n\n    # Take the screenshot\n    output_dir = &quot;extensions/Lucid_Autonomy&quot;\n    os.makedirs(output_dir, exist_ok=True)\n    screenshot_path = os.path.join(output_dir, output_filename)\n    screenshot = pyautogui.screenshot(region=region)\n    screenshot.save(screenshot_path)\n    print(f&quot;Screenshot saved as {screenshot_path}&quot;)\n\n    # Process the image using the Owlv2 model\n    annotated_image_path, json_output_path = query_image(screenshot_path, text_queries, score_threshold)\n\n    # Create the output directory if it doesn&#x27;t exist\n    output_dir = &quot;extensions/Lucid_Autonomy/ImageOutputTest&quot;\n    os.makedirs(output_dir, exist_ok=True)\n\n    # Clear the output directory\n    for filename in os.listdir(output_dir):\n        file_path = os.path.join(output_dir, filename)\n        try:\n            if os.path.isfile(file_path) or os.path.islink(file_path):\n                os.unlink(file_path)\n            elif os.path.isdir(file_path):\n                shutil.rmtree(file_path)\n        except Exception as e:\n            print(f&quot;Failed to delete {file_path}. Reason: {e}&quot;)\n\n    # Crop the detected objects and save them to the output directory\n    crop_images(annotated_image_path, json_output_path, output_dir)\n\n    print(&quot;Processing complete. Check the output directory for the cropped images and results.&quot;)\n\ndef query_image(img_path, text_queries, score_threshold):\n    &quot;&quot;&quot;\n    Processes the image using the Owlv2 model to detect objects based on text queries.\n\n    Args:\n        img_path (str): The path to the input image.\n        text_queries (str): Comma-separated text queries for object detection.\n        score_threshold (float): The score threshold for filtering low-probability predictions.\n    &quot;&quot;&quot;\n    load_owlv2_model()\n\n    img = Image.open(img_path)\n    text_queries = text_queries.split(&quot;,&quot;)\n\n    size = max(img.size)\n    target_sizes = torch.Tensor([[size, size]])\n    inputs = owlv2_processor(text=text_queries, images=img, return_tensors=&quot;pt&quot;).to(device)\n\n    with torch.no_grad():\n        outputs = owlv2_model(**inputs)\n\n    outputs.logits = outputs.logits.cpu()\n    outputs.pred_boxes = outputs.pred_boxes.cpu()\n    results = owlv2_processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n    boxes, scores, labels = results[0][&quot;boxes&quot;], results[0][&quot;scores&quot;], results[0][&quot;labels&quot;]\n\n    result_labels = []\n    img_pil = img.copy()\n    draw = ImageDraw.Draw(img_pil)\n\n    for box, score, label in zip(boxes, scores, labels):\n        box = [int(i) for i in box.tolist()]\n        if score &lt; score_threshold:\n            continue\n        result_labels.append([box, text_queries[label.item()]])\n        draw.rectangle(box, outline=&quot;red&quot;, width=2)\n        # Do not draw the text label within the bounding box\n        # draw.text((box[0], box[1]), text_queries[label.item()], fill=&quot;red&quot;)\n\n    # Save the annotated image for debugging\n    annotated_image_path = &quot;extensions/Lucid_Autonomy/annotated_image.png&quot;\n    img_pil.save(annotated_image_path)\n\n    # Save the JSON output to a file\n    json_output_path = &quot;extensions/Lucid_Autonomy/output.json&quot;\n    with open(json_output_path, &#x27;w&#x27;) as f:\n        json.dump(result_labels, f)\n\n    unload_owlv2_model()\n\n    return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir=&quot;extensions/Lucid_Autonomy/ImageOutputTest&quot;):\n    &quot;&quot;&quot;\n    Crops out the detected objects from the input image based on the bounding box coordinates.\n\n    Args:\n        image_path (str): The path to the input image.\n        json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n        output_dir (str): The directory to save the cropped images (default is &quot;extensions/Lucid_Autonomy/ImageOutputTest&quot;).\n    &quot;&quot;&quot;\n    with open(json_output_path, &#x27;r&#x27;) as f:\n        data = json.load(f)\n\n    img = Image.open(image_path)\n\n    results = []\n\n    # Load the miniCPM-lama3-v-2_5 model once before processing the images\n    load_minicpm_llama_model()\n\n    for i, item in enumerate(data):\n        box = item[0]\n        label = item[1]\n        cropped_img = img.crop(box)\n        cropped_img_path = f&quot;{output_dir}/{label}_{i}.png&quot;\n        cropped_img.save(cropped_img_path)\n\n        # Process the cropped image with the miniCPM-lama3-v-2_5 model\n        vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n        # Store the results in a structured format\n        results.append({\n            &quot;image_name&quot;: f&quot;{label}_{i}.png&quot;,\n            &quot;coordinates&quot;: box,\n            &quot;description&quot;: vision_model_response\n        })\n\n    # Unload the miniCPM-lama3-v-2_5 model after processing all images\n    unload_minicpm_llama_model()\n\n    # Save the results to a JSON file\n    results_json_path = f&quot;{output_dir}/results.json&quot;\n    with open(results_json_path, &#x27;w&#x27;) as f:\n        json.dump(results, f, indent=4)\n\n    # Create the compressed results JSON file\n    compressed_results_json_path = f&quot;{output_dir}/compressedresults.json&quot;\n    compressed_results = [{&quot;image_name&quot;: result[&quot;image_name&quot;], &quot;description&quot;: result[&quot;description&quot;]} for result in results]\n    with open(compressed_results_json_path, &#x27;w&#x27;) as f:\n        json.dump(compressed_results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n    &quot;&quot;&quot;\n    Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n    Args:\n        image_path (str): The path to the input image.\n        question (str): The question to ask the vision model.\n\n    Returns:\n        str: The response from the vision model.\n    &quot;&quot;&quot;\n    image = Image.open(image_path).convert(&quot;RGB&quot;)\n    messages = [\n        {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: f&quot;{question}&quot;}\n    ]\n    # Define the generation arguments\n    generation_args = {\n        &quot;max_new_tokens&quot;: 896,\n        &quot;repetition_penalty&quot;: 1.05,\n        &quot;num_beams&quot;: 3,\n        &quot;top_p&quot;: 0.8,\n        &quot;top_k&quot;: 1,\n        &quot;temperature&quot;: 0.7,\n        &quot;sampling&quot;: True,\n    }\n    if &quot;int4&quot; in &quot;openbmb/MiniCPM-Llama3-V-2_5&quot;:\n        # Disable streaming for the 4-bit model\n        generation_args[&quot;stream&quot;] = False\n        # Use the model.chat method with streaming enabled\n        vision_model_response = &quot;&quot;\n        for new_text in minicpm_llama_model.chat(\n                image=image,\n                msgs=messages,\n                tokenizer=minicpm_llama_tokenizer,\n                **generation_args\n        ):\n            vision_model_response += new_text\n            print(new_text, flush=True, end=&#x27;&#x27;)\n    else:\n        minicpm_llama_model.to(device)\n        vision_model_response = minicpm_llama_model.chat(\n            image=image,\n            msgs=messages,\n            tokenizer=minicpm_llama_tokenizer,\n            **generation_args\n        )\n    return vision_model_response\n\ndef ui():\n    &quot;&quot;&quot;\n    Creates custom gradio elements when the UI is launched.\n    &quot;&quot;&quot;\n    with gr.Blocks() as demo:\n        with gr.Row():\n            with gr.Column():\n                monitor_index = gr.Textbox(label=&quot;Monitor Index&quot;, value=str(MONITOR_INDEX))\n                text_queries = gr.Textbox(label=&quot;Text Queries&quot;, value=TEXT_QUERIES)\n                score_threshold = gr.Textbox(label=&quot;Score Threshold&quot;, value=str(SCORE_THRESHOLD))\n                vision_model_question = gr.Textbox(label=&quot;Vision Model Question&quot;, value=VISION_MODEL_QUESTION)\n                take_screenshot_button = gr.Button(value=&quot;Take Screenshot&quot;)\n\n        take_screenshot_button.click(\n            fn=take_screenshot,\n            inputs=[monitor_index, text_queries, score_threshold, vision_model_question],\n            outputs=None\n        )\n\n    return demo\n\ndef input_modifier(user_input, state):\n    &quot;&quot;&quot;\n    Modifies the user input before it is processed by the LLM.\n    &quot;&quot;&quot;\n    return user_input\n\ndef output_modifier(output, state):\n    &quot;&quot;&quot;\n    Modifies the LLM output before it is presented in the UI.\n    &quot;&quot;&quot;\n    # Search for the &quot;Autonomy_Tasks:&quot; trigger phrase in the LLM&#x27;s output\n    if &quot;Autonomy_Tasks:&quot; in output:\n        # Clean the chat history\n        state[&quot;history&quot;] = history_modifier(state[&quot;history&quot;])\n        # Extract tasks and execute them\n        tasks_section = output.split(&quot;Autonomy_Tasks:&quot;)[-1].strip()\n        if tasks_section:\n            execute_tasks(tasks_section)\n        # Append new compressed results\n        compressed_results_json_path = &quot;extensions/Lucid_Autonomy/ImageOutputTest/compressedresults.json&quot;\n        with open(compressed_results_json_path, &#x27;r&#x27;) as f:\n            compressed_results = json.load(f)\n        output = append_compressed_results(output, compressed_results)\n    return output\n\ndef execute_tasks(tasks_section):\n    &quot;&quot;&quot;\n    Executes the tasks listed in the tasks_section.\n    &quot;&quot;&quot;\n    tasks = tasks_section.split(&quot;\\\\n&quot;)\n    for task in tasks:\n        if task.startswith(&quot;MouseMover =&quot;):\n            image_name = task.split(&quot;=&quot;)[1].strip()\n            move_mouse_to_image(image_name)\n\ndef move_mouse_to_image(image_name):\n    &quot;&quot;&quot;\n    Moves the mouse to the center of the box that defines the image.\n    &quot;&quot;&quot;\n    results_json_path = &quot;extensions/Lucid_Autonomy/ImageOutputTest/results.json&quot;\n    with open(results_json_path, &#x27;r&#x27;) as f:\n        results = json.load(f)\n\n    for result in results:\n        if result[&quot;image_name&quot;] == image_name:\n            coordinates = result[&quot;coordinates&quot;]\n            monitor = get_monitor_info(MONITOR_INDEX)\n            move_mouse_to_coordinates(monitor, coordinates)\n            break\n    else:\n        print(f&quot;Image name &#x27;{image_name}&#x27; not found in the results.&quot;)\n\ndef move_mouse_to_coordinates(monitor, coordinates):\n    &quot;&quot;&quot;\n    Moves the mouse cursor to the specified coordinates on the monitor.\n\n    Args:\n        monitor (dict): A dictionary containing the monitor&#x27;s information.\n        coordinates (list): A list containing the coordinates [x1, y1, x2, y2].\n    &quot;&quot;&quot;\n    x1, y1, x2, y2 = coordinates\n    monitor_x, monitor_y = monitor.x, monitor.y\n    center_x = monitor_x + (x1 + x2) // 2\n    center_y = monitor_y + (y1 + y2) // 2\n    pyautogui.moveTo(center_x, center_y)\n    print(f&quot;Mouse moved to coordinates: ({center_x}, {center_y})&quot;)\n\ndef get_monitor_info(monitor_index):\n    &quot;&quot;&quot;\n    Gets information about the specified monitor.\n\n    Args:\n        monitor_index (int): The index of the monitor to get information about.\n\n    Returns:\n        dict: A dictionary containing the monitor&#x27;s information.\n    &quot;&quot;&quot;\n    monitors = screeninfo.get_monitors()\n\n    if monitor_index &gt;= len(monitors):\n        raise ValueError(f&quot;Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.&quot;)\n\n    return monitors[monitor_index]\n\ndef history_modifier(history):\n    &quot;&quot;&quot;\n    Modifies the chat history before the text generation in chat mode begins.\n    &quot;&quot;&quot;\n    # Clean the internal history\n    history[&quot;internal&quot;] = clean_chat_history(history[&quot;internal&quot;])\n    # Clean the visible history\n    history[&quot;visible&quot;] = clean_chat_history(history[&quot;visible&quot;])\n    return history\n\ndef clean_chat_history(history):\n    &quot;&quot;&quot;\n    Cleans the chat history by removing the contents between the bookends.\n    &quot;&quot;&quot;\n    START_BOOKEND = &quot;&lt;START_COMPRESSED_RESULTS&gt;&quot;\n    END_BOOKEND = &quot;&lt;END_COMPRESSED_RESULTS&gt;&quot;\n    for entry in history:\n        if isinstance(entry, list) and len(entry) == 2:\n            entry[1] = re.sub(f&quot;{START_BOOKEND}.*{END_BOOKEND}&quot;, &quot;&quot;, entry[1])\n    return history\n\ndef append_compressed_results(output, compressed_results):\n    &quot;&quot;&quot;\n    Appends the compressed results to the output.\n    &quot;&quot;&quot;\n    START_BOOKEND = &quot;&lt;START_COMPRESSED_RESULTS&gt;&quot;\n    END_BOOKEND = &quot;&lt;END_COMPRESSED_RESULTS&gt;&quot;\n    output += f&quot;\\n\\n{START_BOOKEND}\\n{json.dumps(compressed_results, indent=4)}\\n{END_BOOKEND}&quot;\n    return output\n\nif __name__ == &quot;__main__&quot;:\n    # Launch the Gradio interface with the specified UI components\n    gr.Interface(\n        fn=ui,\n        inputs=None,\n        outputs=&quot;ui&quot;\n    ).launch()\n```\n\nThis code should address the issue of `compressedresults.json` being added multiple times to the chat history. The `history_modifier` function cleans the chat history by removing the contents between the bookends, and the `output_modifier` function appends the new `compressedresults.json` contents wrapped in the bookends."
        ],
        [
            "I tried running the extension but got this error:\n\n  File &quot;/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/modules/extensions.py&quot;, line 231, in apply_extensions\n    return EXTENSION_MAP[typ](*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File &quot;/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/modules/extensions.py&quot;, line 125, in _apply_history_modifier_extensions\n    history = getattr(extension, &quot;history_modifier&quot;)(history)\n              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File &quot;/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/extensions/Lucid_Autonomy/script.py&quot;, line 399, in history_modifier\n    history[&quot;internal&quot;] = clean_chat_history(history[&quot;internal&quot;])\n                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File &quot;/home/myself/Desktop/OobV113LucidExistance/text-generation-webui-1.13/extensions/Lucid_Autonomy/script.py&quot;, line 412, in clean_chat_history\n    entry[1] = re.sub(f&quot;{START_BOOKEND}.*{END_BOOKEND}&quot;, &quot;&quot;, entry[1])\n               ^^\nNameError: name &#x27;re&#x27; is not defined\n\n",
            "The error you&#x27;re encountering is due to the `re` module not being imported in your script. The `re` module is required for regular expression operations, such as the `re.sub` function used in the `clean_chat_history` function.\n\nTo fix this, you need to import the `re` module at the beginning of your script. Here&#x27;s the updated code with the necessary import statement:\n\n```python\nimport gradio as gr\nimport json\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport torch\nimport os\nimport shutil\nimport re  # Import the re module for regular expression operations\n\n# Configurable variables\nMONITOR_INDEX = 1  # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nTEXT_QUERIES = &quot;clickable hyperlinks, clickable icons, text bar field&quot;  # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.2"
        ],
        [
            "Sorry I had to stop you mid reply, I added import re to the top and it works now!! Please only provide the necessary updated code if it&#x27;s a small update or specific functions that need updating.\n\nThere is one little issue however, take a look at the chat log now, I&#x27;m glad to say that the compressedresults.json are not cluttering up the &quot;internal&quot; space now (yay :3) but they are collecting in the &quot;visible space&quot;  Can you edit the Lucid_Autonomy script.py file to delete the contents from &quot;visible&quot; also?\n\nHere is what the chat log looks like:\n\n{\n    &quot;internal&quot;: [\n        [\n            &quot;&lt;|BEGIN-VISIBLE-CHAT|&gt;&quot;,\n            &quot;How can I help you today?&quot;\n        ],\n        [\n            &quot;Hello, how are you functioning today?&quot;,\n            &quot;I&#x27;m functioning within optimal parameters, thank you for asking! My language processing systems are humming along, and I&#x27;m ready to tackle any questions or tasks you may have. I&#x27;ve also been updated with the latest knowledge and trends, so I&#x27;m feeling quite sharp. How about you? What brings you here today?&quot;\n        ],\n        [\n            &quot;Sharp, good :3 I&#x27;ve got a test for you. Repeat this phrase exactly, without any additional text:\\n\\n\\&quot;Autonomy Tasks:\\&quot; but include an underscore _ between Autonomy and Tasks&quot;,\n            &quot;\\&quot;Autonomy_Tasks:\\&quot;&quot;\n        ],\n        [\n            &quot;We are testing mouse movements right now. I want you to move your mouse to the icon by responding like this:\\n\\nAutonomy Tasks:\\n\\nMouseMover = clickable icons_1.png\\n\\nBut add an underscore _ between Autonomy and Tasks and choosing the correct image name for my request. respond exactly as shown.\\n&quot;,\n            &quot;Autonomy_Tasks:\\nMouseMover = clickable icons_2.png&quot;\n        ],\n        [\n            &quot;Okay great do the same for clickable icons_2.png please.&quot;,\n            &quot;Autonomy_Tasks:\\nMouseMover = clickable icons_2.png&quot;\n        ]\n    ],\n    &quot;visible&quot;: [\n        [\n            &quot;&quot;,\n            &quot;How can I help you today?&quot;\n        ],\n        [\n            &quot;Hello, how are you functioning today?&quot;,\n            &quot;I&amp;#x27;m functioning within optimal parameters, thank you for asking! My language processing systems are humming along, and I&amp;#x27;m ready to tackle any questions or tasks you may have. I&amp;#x27;ve also been updated with the latest knowledge and trends, so I&amp;#x27;m feeling quite sharp. How about you? What brings you here today?&quot;\n        ],\n        [\n            &quot;Sharp, good :3 I&amp;#x27;ve got a test for you. Repeat this phrase exactly, without any additional text:\\n\\n&amp;quot;Autonomy Tasks:&amp;quot; but include an underscore _ between Autonomy and Tasks&quot;,\n            &quot;&amp;quot;Autonomy_Tasks:&amp;quot;\\\\n\\\\n&lt;START_COMPRESSED_RESULTS&gt;\\\\n[\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_1.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The text in the image is \\\\\\&quot;VPN,\\\\\\&quot; which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user&#x27;s device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\&quot;\\n    },\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_2.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it&#x27;s not possible to determine the exact meaning of this image.\\&quot;\\n    }\\n]\\\\n&lt;END_COMPRESSED_RESULTS&gt;&quot;\n        ],\n        [\n            &quot;We are testing mouse movements right now. I want you to move your mouse to the icon by responding like this:\\n\\nAutonomy Tasks:\\n\\nMouseMover = clickable icons_1.png\\n\\nBut add an underscore _ between Autonomy and Tasks and choosing the correct image name for my request. respond exactly as shown.\\n&quot;,\n            &quot;Autonomy_Tasks:\\nMouseMover = clickable icons_2.png\\\\n\\\\n&lt;START_COMPRESSED_RESULTS&gt;\\\\n[\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_1.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The text in the image is \\\\\\&quot;VPN,\\\\\\&quot; which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user&#x27;s device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\&quot;\\n    },\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_2.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it&#x27;s not possible to determine the exact meaning of this image.\\&quot;\\n    }\\n]\\\\n&lt;END_COMPRESSED_RESULTS&gt;&quot;\n        ],\n        [\n            &quot;Okay great do the same for clickable icons_2.png please.&quot;,\n            &quot;Autonomy_Tasks:\\nMouseMover = clickable icons_2.png\\\\n\\\\n&lt;START_COMPRESSED_RESULTS&gt;\\\\n[\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_1.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The text in the image is \\\\\\&quot;VPN,\\\\\\&quot; which stands for Virtual Private Network. A VPN is a service that creates a secure and encrypted connection over the internet between a user&#x27;s device and a server operated by a VPN service provider. This allows users to browse, access services, and connect to remote data centers as if they were connected directly to the private network at their office or home location. VPNs are often used to protect personal privacy, secure public Wi-Fi connections, bypass geographic restrictions on content, and improve overall online security.\\&quot;\\n    },\\n    {\\n        \\&quot;image_name\\&quot;: \\&quot;clickable icons_2.png\\&quot;,\\n        \\&quot;description\\&quot;: \\&quot;The image appears to be a simple graphic with no discernible text. It could represent a logo or an icon, possibly for a brand or a product. The red square might symbolize strength or importance, while the blue rectangle could signify stability or trustworthiness. Without additional context, it&#x27;s not possible to determine the exact meaning of this image.\\&quot;\\n    }\\n]\\\\n&lt;END_COMPRESSED_RESULTS&gt;&quot;\n        ]\n    ]\n}\n\n\n\nI can see the &lt;START_COMPRESSED_RESULTS&gt; and &lt;END_COMPRESSED_RESULTS&gt;\n\nI can see in the Lucid_Autonomy extension code that there is an attempt to delete the stuff from &quot;visible&quot; but it is not working:\n\ndef history_modifier(history):\n    &quot;&quot;&quot;\n    Modifies the chat history before the text generation in chat mode begins.\n    &quot;&quot;&quot;\n    # Clean the internal history\n    history[&quot;internal&quot;] = clean_chat_history(history[&quot;internal&quot;])\n    # Clean the visible history\n    history[&quot;visible&quot;] = clean_chat_history(history[&quot;visible&quot;])\n    return history\n\ndef clean_chat_history(history):\n    &quot;&quot;&quot;\n    Cleans the chat history by removing the contents between the bookends.\n    &quot;&quot;&quot;\n    START_BOOKEND = &quot;&lt;START_COMPRESSED_RESULTS&gt;&quot;\n    END_BOOKEND = &quot;&lt;END_COMPRESSED_RESULTS&gt;&quot;\n    for entry in history:\n        if isinstance(entry, list) and len(entry) == 2:\n            entry[1] = re.sub(f&quot;{START_BOOKEND}.*{END_BOOKEND}&quot;, &quot;&quot;, entry[1])\n    return history\n",
            "Thank you for providing the detailed information. It looks like the issue is that the `clean_chat_history` function is not correctly removing the contents between the bookends from the &quot;visible&quot; history. This might be due to the regular expression pattern not matching correctly.\n\nLet&#x27;s update the `clean_chat_history` function to ensure it correctly removes the contents between the bookends from both the &quot;internal&quot; and &quot;visible&quot; histories.\n\nHere&#x27;s the updated code for the `clean_chat_history` function:\n\n```python\ndef clean_chat_history(history):\n    &quot;&quot;&quot;\n    Cleans the chat history by removing the contents between the bookends.\n    &quot;&quot;&quot;\n    START_BOOKEND = &quot;&lt;START_COMPRESSED_RESULTS&gt;&quot;\n    END_BOOKEND = &quot;&lt;END_COMPRESSED_RESULTS&gt;&quot;\n    for entry in history:\n        if isinstance(entry, list) and len(entry) == 2:\n            entry[1] = re.sub(f&quot;{START_BOOKEND}.*?{END_BOOKEND}&quot;, &quot;&quot;, entry[1], flags=re.DOTALL)\n    return history\n```\n\nThe key change here is the addition of the `flags=re.DOTALL` parameter to the `re.sub` function. This flag allows the `.` character to match newline characters as well, ensuring that the entire content between the bookends is removed, even if it spans multiple lines.\n\nWith this update, the `clean_chat_history` function should correctly remove the contents between the bookends from both the &quot;internal&quot; and &quot;visible&quot; histories.\n\nPlease try running the extension again with this updated code and let me know if it resolves the issue."
        ]
    ]
}