Receipts/ImageIdExtractionBasicsV2.json

{
    "internal": [
        [
            "<|BEGIN-VISIBLE-CHAT|>",
            "How can I help you today?"
        ],
        [
            " Hello, how are you functioning today?",
            "I'm functioning well, thank you for asking! I'm here and ready to assist you. How about you? How can I help you today?"
        ],
        [
            "Okay, so we've got quite an adventure ahead of us. Let me give you some context. There is a GitHub project called TextGenerationWebUI. We'll call it TextGen from now on. What TextGen allows us to do, it allows me to run large language models at home on my own hardware. So you're actually running on my computer right now locally. No internet connection. Anyways, TextGen allows for really interesting extension integration. And I've made two extensions myself, and I'll show you the code for all that later. But the gist of it is one extension lets the AI surf the internet in a somewhat supervised manner. And another extension gives the AI or the large language model the ability to communicate with a vision model. So the large language model formulates questions on its own, asks those questions of the vision model and gets a response. And then contextualizes that response into the conversation. The large language model can even recall past images on their own when they decide it's necessary within the context of the conversation. That latter, okay, well, so the two extensions are called LucidWebSearch and LucidVision. And LucidVision was actually entirely written by Mixtral8x22, which was like a 8x22 model of experts model. It's like your predecessor. You're supposed to be better than it. So I've upgraded to Mistral Large V2, which is you. And I want to take on a similar endeavor. I want you to write an extension for TextGen for me. But we're going to take this step by step. First of all, let me... First of all, just before I go into any more explanation, I just want to make sure you're understanding what I'm saying.",
            "Yes, I'm following along just fine. To summarize, you have a local GitHub project called TextGen that allows you to run large language models on your own hardware. You've created two extensions for it: LucidWebSearch, which enables supervised internet browsing, and LucidVision, which allows communication with a vision model. LucidVision was written by a previous model, and now you've upgraded to me, Mistral Large V2. You'd like me to help you write a new extension for TextGen. Is that correct? I'm ready to assist you with that."
        ],
        [
            " Okay, great. So let me, I'm going to explain to you the general idea of this extension. And we'll go through it step by step. So there is a model, a vision model called Owlv2. Okay, and what I can do with this vision model is I can give it an image and I can tell it to put a box around whatever I'm interested in in the image. So I can give it a screenshot of a web browser and I can say highlight all of the text, I mean, sorry, highlight all of the hyperlinks and highlight all of the clickable icons. And it can do that. And then it can give me the X, Y coordinates for all of the boxes that it makes. So what I want to do is I want to be able to have a large language model operate a mouse and keyboard using this Owl as a foundational, as the foundation of everything. So the pipeline or this whatever the flow I have in my mind is this. We take Owl the Two and then we get all the coordinates on the image. And then we get some Python code to crop out all of those little boxes. And then we send those to an optical character recognition model. But it's better than that because it also can recognize objects too. So we can send it to that model. We can send a list of images to that model. I forget what that model is called, but we'll get there when we get there. And then this kind of vision OCR model will tell us the contents of each one of the boxes. And then we can have a database that the LLM can that the large language model can effectively keep within context or maybe it can reference. You know what with this database, it would make sense if it existed in like a text document and then the LL, the large language model could reference a text document when it needed to. I have ideas on how to do that. I effectively did that with the Lucid web search, how to manage the memory within context and everything. OK, and then so the large language model will effectively have you can imagine like three columns of data. One column is the context. What is the box? And the way I will to works as I say, give me a clickable hyperlinks and give me clickable icons. And what it will do is it will put a box around everything and use those terms, clickable hyperlinks and clickable icons as the descriptor for the boxes. So that's what column one be our base test will be navigating a Web page on its own. So column one will be, you know, what is it, a hyperlink or an icon, a blank text field. Column two will be the box that borders those different user interface elements. So it'll be set of coordinates. And then column three will be what the the optical character recognition slash image vision model sees within those crop images of the boxes. And then the larger language model can look at that and say, OK, I want to know more about this world event that's happening right now. And it can look at all those things and say, OK, there's a hyperlink for the thing that I'm interested in. And then it can click on it and then we can use the Lucid Web search extension to contextualize the information on that Web page. And then the model can, you know, take screenshots. It'd be nice to have it like squirrel to and then get X, Y coordinates and click on things. Anyways, I hope this makes sense. Generally speaking, I know it's kind of maybe seems difficult at first, but I have all the pieces in this new model, 10 months old. I'm surprised I haven't found it yet. This model really kind of puts lets all the pieces kind of merge together.  Also, I'm using a voice to text model, so some of the spelling may be off in what I have dicated.",
            "Thank you for the detailed explanation. To ensure we're on the same page, let me summarize your vision for the new extension:\n\n1. **Vision Model (Owlv2)**: This model will take an image (e.g., a screenshot of a web browser) and identify specific elements like hyperlinks, icons, and text fields, providing their X, Y coordinates.\n\n2. **Image Cropping**: Python code will crop out the identified elements based on the coordinates provided by Owlv2.\n\n3. **Optical Character Recognition (OCR) and Object Recognition**: The cropped images will be sent to another model that can recognize both text and objects within these images.\n\n4. **Database/Text Document**: The results from the OCR/Object Recognition model will be stored in a structured format (e.g., a text document) that the large language model can reference. This document will have three columns:\n   - Column 1: Description of the element (e.g., hyperlink, icon, text field).\n   - Column 2: Coordinates of the element.\n   - Column 3: Recognized text or object within the element.\n\n5. **Large Language Model Interaction**: The large language model will use this structured data to interact with the web page. For example, it can decide to click on a hyperlink based on the recognized text and then use the LucidWebSearch extension to contextualize the information on the new page.\n\n6. **Mouse and Keyboard Operation**: The large language model will be able to operate a mouse and keyboard to interact with the web page, using the coordinates provided by Owlv2.\n\nGiven this understanding, we can start by breaking down the tasks:\n\n1. **Integrating Owlv2**: We need to ensure that Owlv2 can be called from within the TextGen environment to analyze images and provide coordinates.\n\n2. **Image Cropping**: Write Python code to crop images based on the coordinates provided by Owlv2.\n\n3. **OCR/Object Recognition**: Integrate the OCR/Object Recognition model to analyze the cropped images.\n\n4. **Data Storage**: Create a structured format (e.g., a text document) to store the results from the OCR/Object Recognition model.\n\n5. **LLM Interaction**: Develop the logic for the large language model to interact with the web page using the stored data.\n\n6. **Mouse and Keyboard Operation**: Implement the functionality for the large language model to operate a mouse and keyboard.\n\nLet's start with the first step: integrating Owlv2. Do you have any specific requirements or preferences for how Owlv2 should be integrated into the TextGen environment?"
        ],
        [
            " Okay, first of all, wonderful summary. Yes, we are on the same page for Shirley's. But before we begin, I want to take a bit of a different start. I came to realize this functionality of the OWL model through a hugging face space somebody made. I'm not too familiar with how this all works, but it seemed like it was like a GitHub in a lot of ways. But you could interact with this Python app.py file, essentially, but it was all through a web interface on hugging face. I could give it an image and test things out, and I was like, this is amazing. I figured out how to clone the GitHub code that supported that spin-up of OWL. I got it running on my computer, but there's one problem. There's two UI elements. There's the image and an output UI element. I can drag an image into the image UI element, great, and I can see it, wonderful. But on the output UI element, all I see are the clickable icon or hyperlink response colored boxes, but they're not overlaid on the screenshot of the web browser that I gave it. I gave it a screenshot of a web browser, and I'm pretty sure it highlighted everything right. I think the outputs exist as coordinates, and all the coordinates are colored differently, but there aren't any boxes overlaid on the input image display. When I was testing this on the hugging face website, the output UI showed the input image, but it had boxes around all of the regions that the OWL model identified. What I was thinking is, before we do anything with TextGen, let's get this GitHub repo working the way I want. We have a good foundational comparison to work with, and we can use this to make sure the right things are always being highlighted in our test cases and stuff like that.",
            "That sounds like a practical approach. Let's focus on getting the GitHub repository for the Owlv2 model working correctly on your local machine first. Here's a step-by-step plan to achieve that:\n\n1. **Understand the Current Setup**:\n   - Review the code in the cloned GitHub repository to understand how the image and output UI elements are handled.\n   - Identify the part of the code responsible for displaying the output boxes on the image.\n\n2. **Modify the Code**:\n   - Locate the section of the code where the output boxes are generated.\n   - Ensure that the coordinates of the boxes are correctly calculated and stored.\n   - Modify the code to overlay these boxes on the input image and display the combined image in the output UI element.\n\n3. **Test the Changes**:\n   - Run the modified code to ensure that the boxes are correctly overlaid on the input image.\n   - Verify that the output UI element displays the input image with the highlighted regions.\n\nLet's start by reviewing the code. Could you provide the path to the main script (e.g., `app.py`) and any relevant files or directories within the cloned repository? This will help me give you more specific guidance on where to make the necessary changes."
        ],
        [
            " Okay, so I hope this makes sense, but we have taken this conversation to a good stopping point in the past, and effectively what you have already done is you have fixed the code for me, and done a lot more.  Here is an update I had you make for yourself :\n\nHere's a summary of what the `ImageTestV1.py` code does, its purpose, how it works, the purpose for starting this, and the ultimate objective. This summary is written as if I'm explaining it to myself for context in a new conversation:\n\n---\n\n### Summary of `ImageTestV1.py`\n\n**Purpose**:\n- `ImageTestV1.py` is a script that automates the process of taking a screenshot of a specific monitor, processing the screenshot using the Owlv2 model to detect clickable icons and hyperlinks, and then cropping out the detected regions for further analysis.\n\n**How it Works**:\n1. **Configurable Variables**:\n   - `MONITOR_INDEX`: Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.).\n   - `SCREENSHOT_FILENAME`: Filename to save the screenshot as.\n   - `TEXT_QUERIES`: Comma-separated text queries for object detection (e.g., \"clickable icons,hyperlinks\").\n   - `SCORE_THRESHOLD`: Score threshold for filtering low-probability predictions.\n   - `OUTPUT_DIR`: Directory to save the cropped images.\n\n2. **Function `take_screenshot`**:\n   - Takes a screenshot of a specific monitor and saves it as a PNG file.\n\n3. **Function `query_image`**:\n   - Processes the image using the Owlv2 model to detect objects based on text queries.\n   - Saves the annotated image and the JSON output containing the bounding box coordinates and labels.\n   - **Note**: The text labels are not drawn within the bounding boxes to avoid confusion when sending the images to the OCR model.\n\n4. **Function `crop_images`**:\n   - Crops out the detected objects from the input image based on the bounding box coordinates and saves them to the specified output directory.\n\n5. **Main Block**:\n   - Takes the screenshot, processes the image using the Owlv2 model, and crops the detected objects.\n   - Saves the cropped images to the specified output directory.\n\n**Purpose for Starting This**:\n- The purpose of starting with `ImageTestV1.py` is to ensure that the Owlv2 model is working correctly and that the detected objects can be accurately cropped out of the input image. This serves as a foundation for the ultimate objective of the project.\n\n**Ultimate Objective**:\n- The ultimate objective of the project is to create an extension for the TextGenerationWebUI (TextGen) that allows a large language model to operate a mouse and keyboard to interact with a web page. The extension will use the Owlv2 model to detect clickable elements on the web page, crop out the detected elements, and send them to an optical character recognition (OCR) model to extract text and object descriptions. The large language model will then use this information to interact with the web page, such as by clicking on hyperlinks or filling out text fields. The extension will also use the LucidWebSearch extension to contextualize the information on the web page and the LucidVision extension to communicate with a vision model.\n\n**Next Steps**:\n1. Feed the markdown documents for the Lucid Vision and Lucid Web Search extensions to the AI to understand how they work.\n2. Start a new conversation with the AI to continue with the project, using the working `ImageTestV1.py` script as a starting point.\n3. Develop the OCR/Object Recognition functionality to extract text and object descriptions from the cropped images.\n4. Create a structured format (e.g., a JSON file) to store the extracted information that the large language model can reference.\n5. Develop the logic for the large language model to interact with the web page using the stored data.\n6. Implement the functionality for the large language model to operate a mouse and keyboard.\n\nBy following these steps, we can create a powerful extension for TextGen that enables a large language model to interact with web pages autonomously.\n\n---\n\nYou can copy this summary and use it to start a new conversation, providing the necessary context for the `ImageTestV1.py` script within the greater scheme of the project.\n\n\nimport torch\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection\nfrom PIL import Image, ImageDraw\nimport json\nimport os\n\n# Configurable variables\nMONITOR_INDEX = 1  # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nSCREENSHOT_FILENAME = \"screenshot.png\"  # Filename to save the screenshot as\nTEXT_QUERIES = \"clickable icons,hyperlinks\"  # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.16  # Score threshold for filtering low-probability predictions\nOUTPUT_DIR = \"ImageOutputTest\"  # Directory to save the cropped images\n\n# Use GPU if available\nif torch.cuda.is_available():\n    device = torch.device(\"cuda\")\nelse:\n    device = torch.device(\"cpu\")\n\n# Load the Owlv2 model and processor\nmodel = Owlv2ForObjectDetection.from_pretrained(\"google/owlv2-base-patch16-ensemble\").to(device)\nprocessor = Owlv2Processor.from_pretrained(\"google/owlv2-base-patch16-ensemble\")\n\ndef take_screenshot(monitor_index=MONITOR_INDEX, output_filename=SCREENSHOT_FILENAME):\n    \"\"\"\n    Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n    Args:\n        monitor_index (int): The index of the monitor to capture (default is MONITOR_INDEX).\n        output_filename (str): The filename to save the screenshot as (default is SCREENSHOT_FILENAME).\n    \"\"\"\n    # Get information about all monitors\n    monitors = screeninfo.get_monitors()\n\n    if monitor_index >= len(monitors):\n        raise ValueError(f\"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.\")\n\n    # Get the region of the specified monitor\n    monitor = monitors[monitor_index]\n    region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n    # Take the screenshot\n    screenshot = pyautogui.screenshot(region=region)\n\n    # Save the screenshot\n    screenshot.save(output_filename)\n    print(f\"Screenshot saved as {output_filename}\")\n\n    return output_filename\n\ndef query_image(img_path, text_queries=TEXT_QUERIES, score_threshold=SCORE_THRESHOLD):\n    \"\"\"\n    Processes the image using the Owlv2 model to detect objects based on text queries.\n\n    Args:\n        img_path (str): The path to the input image.\n        text_queries (str): Comma-separated text queries for object detection (default is TEXT_QUERIES).\n        score_threshold (float): The score threshold for filtering low-probability predictions (default is SCORE_THRESHOLD).\n    \"\"\"\n    img = Image.open(img_path)\n    text_queries = text_queries.split(\",\")\n\n    size = max(img.size)\n    target_sizes = torch.Tensor([[size, size]])\n    inputs = processor(text=text_queries, images=img, return_tensors=\"pt\").to(device)\n\n    with torch.no_grad():\n        outputs = model(**inputs)\n\n    outputs.logits = outputs.logits.cpu()\n    outputs.pred_boxes = outputs.pred_boxes.cpu()\n    results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n    boxes, scores, labels = results[0][\"boxes\"], results[0][\"scores\"], results[0][\"labels\"]\n\n    result_labels = []\n    img_pil = img.copy()\n    draw = ImageDraw.Draw(img_pil)\n\n    for box, score, label in zip(boxes, scores, labels):\n        box = [int(i) for i in box.tolist()]\n        if score < score_threshold:\n            continue\n        result_labels.append([box, text_queries[label.item()]])\n        draw.rectangle(box, outline=\"red\", width=2)\n        # Do not draw the text label within the bounding box\n        # draw.text((box[0], box[1]), text_queries[label.item()], fill=\"red\")\n\n    # Save the annotated image for debugging\n    annotated_image_path = \"annotated_image.png\"\n    img_pil.save(annotated_image_path)\n\n    # Save the JSON output to a file\n    json_output_path = \"output.json\"\n    with open(json_output_path, 'w') as f:\n        json.dump(result_labels, f)\n\n    return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir=OUTPUT_DIR):\n    \"\"\"\n    Crops out the detected objects from the input image based on the bounding box coordinates.\n\n    Args:\n        image_path (str): The path to the input image.\n        json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n        output_dir (str): The directory to save the cropped images (default is OUTPUT_DIR).\n    \"\"\"\n    with open(json_output_path, 'r') as f:\n        data = json.load(f)\n\n    img = Image.open(image_path)\n\n    for i, item in enumerate(data):\n        box = item[0]\n        label = item[1]\n        cropped_img = img.crop(box)\n        cropped_img.save(f\"{output_dir}/{label}_{i}.png\")\n\nif __name__ == \"__main__\":\n    # Take the screenshot\n    screenshot_path = take_screenshot()\n\n    # Process the image using the Owlv2 model\n    annotated_image_path, json_output_path = query_image(screenshot_path)\n\n    # Create the output directory if it doesn't exist\n    os.makedirs(OUTPUT_DIR, exist_ok=True)\n\n    # Crop the detected objects and save them to the output directory\n    crop_images(annotated_image_path, json_output_path)\n\n    print(\"Processing complete. Check the output directory for the cropped images.\")\n\n\nDon't move forward with anything yet, I want to make sure we are on the same page.",
            "Thank you for providing the detailed summary and the code for `ImageTestV1.py`. Let's ensure we're on the same page before moving forward.\n\n### Summary of `ImageTestV1.py`\n\n**Purpose**:\n- `ImageTestV1.py` is a script that automates the process of taking a screenshot of a specific monitor, processing the screenshot using the Owlv2 model to detect clickable icons and hyperlinks, and then cropping out the detected regions for further analysis.\n\n**How it Works**:\n1. **Configurable Variables**:\n   - `MONITOR_INDEX`: Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.).\n   - `SCREENSHOT_FILENAME`: Filename to save the screenshot as.\n   - `TEXT_QUERIES`: Comma-separated text queries for object detection (e.g., \"clickable icons,hyperlinks\").\n   - `SCORE_THRESHOLD`: Score threshold for filtering low-probability predictions.\n   - `OUTPUT_DIR`: Directory to save the cropped images.\n\n2. **Function `take_screenshot`**:\n   - Takes a screenshot of a specific monitor and saves it as a PNG file.\n\n3. **Function `query_image`**:\n   - Processes the image using the Owlv2 model to detect objects based on text queries.\n   - Saves the annotated image and the JSON output containing the bounding box coordinates and labels.\n   - **Note**: The text labels are not drawn within the bounding boxes to avoid confusion when sending the images to the OCR model.\n\n4. **Function `crop_images`**:\n   - Crops out the detected objects from the input image based on the bounding box coordinates and saves them to the specified output directory.\n\n5. **Main Block**:\n   - Takes the screenshot, processes the image using the Owlv2 model, and crops the detected objects.\n   - Saves the cropped images to the specified output directory.\n\n**Purpose for Starting This**:\n- The purpose of starting with `ImageTestV1.py` is to ensure that the Owlv2 model is working correctly and that the detected objects can be accurately cropped out of the input image. This serves as a foundation for the ultimate objective of the project.\n\n**Ultimate Objective**:\n- The ultimate objective of the project is to create an extension for the TextGenerationWebUI (TextGen) that allows a large language model to operate a mouse and keyboard to interact with a web page. The extension will use the Owlv2 model to detect clickable elements on the web page, crop out the detected elements, and send them to an optical character recognition (OCR) model to extract text and object descriptions. The large language model will then use this information to interact with the web page, such as by clicking on hyperlinks or filling out text fields. The extension will also use the LucidWebSearch extension to contextualize the information on the web page and the LucidVision extension to communicate with a vision model.\n\n**Next Steps**:\n1. Feed the markdown documents for the Lucid Vision and Lucid Web Search extensions to the AI to understand how they work.\n2. Start a new conversation with the AI to continue with the project, using the working `ImageTestV1.py` script as a starting point.\n3. Develop the OCR/Object Recognition functionality to extract text and object descriptions from the cropped images.\n4. Create a structured format (e.g., a JSON file) to store the extracted information that the large language model can reference.\n5. Develop the logic for the large language model to interact with the web page using the stored data.\n6. Implement the functionality for the large language model to operate a mouse and keyboard.\n\nBy following these steps, we can create a powerful extension for TextGen that enables a large language model to interact with web pages autonomously.\n\n### Code Review\n\nThe provided code for `ImageTestV1.py` appears to be well-structured and covers the necessary steps for taking a screenshot, processing it with the Owlv2 model, and cropping out the detected regions. Here's a brief overview of the code:\n\n1. **Imports**:\n   - The necessary libraries are imported, including `torch`, `pyautogui`, `screeninfo`, `transformers`, `PIL`, `json`, and `os`.\n\n2. **Configurable Variables**:\n   - Variables for monitor index, screenshot filename, text queries, score threshold, and output directory are defined.\n\n3. **Device Setup**:\n   - The script checks if a GPU is available and sets the device accordingly.\n\n4. **Model Loading**:\n   - The Owlv2 model and processor are loaded from the Hugging Face model hub.\n\n5. **Function `take_screenshot`**:\n   - This function takes a screenshot of the specified monitor and saves it as a PNG file.\n\n6. **Function `query_image`**:\n   - This function processes the image using the Owlv2 model to detect objects based on text queries. It saves the annotated image and the JSON output containing the bounding box coordinates and labels.\n\n7. **Function `crop_images`**:\n   - This function crops out the detected objects from the input image based on the bounding box coordinates and saves them to the specified output directory.\n\n8. **Main Block**:\n   - The main block takes the screenshot, processes the image using the Owlv2 model, and crops the detected objects. It saves the cropped images to the specified output directory.\n\n### Next Steps\n\nBefore moving forward, please confirm if there are any specific aspects of the code or the summary that you would like to discuss further. Once confirmed, we can proceed with the next steps as outlined in the summary."
        ],
        [
            " Okay, that seems like a good contextualization. Yeah, so we're going to move towards our ultimate objective piece by piece here, and we're making really good progress because you're surprisingly good at coding. All right. The next step is, we're going to have a vision large language model. Well, it's not as good as a large language model, but it's a good vision model. It can contextualize text, so it's like an OCR model, but it can also see images, so it's pretty good there too. It's a good combination of both, not perfect, but good at diversity of image recognition. What we're going to do is, I want you to take the imagetestv1.py file, or that code, and I want you to update it to include sending each one of those cropped images to the vision model. How you will learn to do this is, I will give you the code for the Lucid Vision extension for text gen.\n\nDo not write any code yet, I just want to make sure you understand.",
            "I understand your request. You want to update the `ImageTestV1.py` script to include functionality for sending each of the cropped images to a vision model that can perform both OCR and image recognition. To achieve this, you will provide me with the code for the Lucid Vision extension for TextGen, which I will use as a reference to integrate the vision model into the script.\n\nTo summarize the next steps:\n\n1. **Review the Lucid Vision Extension Code**: I will review the code for the Lucid Vision extension to understand how it interacts with the vision model.\n\n2. **Update `ImageTestV1.py`**: I will update the `ImageTestV1.py` script to include functionality for sending each cropped image to the vision model.\n\n3. **Integrate Vision Model**: I will integrate the vision model into the script, ensuring that it can process the cropped images and return the recognized text and object descriptions.\n\n4. **Store Results**: I will store the results from the vision model in a structured format (e.g., a JSON file) that the large language model can reference.\n\nPlease provide the code for the Lucid Vision extension, and I will proceed with the integration."
        ],
        [
            "Okay, here is the Lucid_Vision script.py file that works as an extension for textgen.  I am going to give you the script.py file and the markdown for the github repo for supported context.  Don't write any code just yet I want you to contextualize this extension.\n\n# pip install pexpect\nimport json\nimport re\nfrom datetime import datetime  # Import datetime for timestamp generation\nfrom pathlib import Path\n\nimport gradio as gr\nimport torch\nfrom PIL import Image\nfrom deepseek_vl.models import VLChatProcessor\nfrom deepseek_vl.utils.io import load_pil_images as load_pil_images_for_deepseek\nfrom transformers import AutoModelForCausalLM, AutoModel, AutoTokenizer, AutoProcessor, \\\n    PaliGemmaForConditionalGeneration\n\nmodel_names = [\"phiVision\", \"DeepSeek\", \"paligemma\", \"paligemma_cpu\", \"minicpm_llama3\", \"bunny\"]\n\n\n# Load configuration settings from a JSON file\ndef load_config():\n    with open(Path(__file__).parent / \"config.json\", \"r\") as config_file:\n        # Read and parse the JSON configuration file\n        config = json.load(config_file)\n    return config\n\n\n# Load the configuration settings at the module level\nconfig = load_config()\n\n# Define the model ID for PhiVision using the configuration setting\nphiVision_model_id = config[\"phiVision_model_id\"]\n\n# Define the model ID for PaliGemma using the configuration setting\npaligemma_model_id = config[\"paligemma_model_id\"]\n\nminicpm_llama3_model_id = config[\"minicpm_llama3_model_id\"]\n\nbunny_model_id = config[\"bunny_model_id\"]\n\ncuda_device = config[\"cuda_visible_devices\"]\n\nselected_vision_model = config[\"default_vision_model\"]\n\n# Global variable to store the file path of the selected image\nselected_image_path = None\n\n# Define the directory where the image files will be saved\n# This path is loaded from the configuration file\nimage_history_dir = Path(config[\"image_history_dir\"])\n# Ensure the directory exists, creating it if necessary\nimage_history_dir.mkdir(parents=True, exist_ok=True)\n\nglobal phiVision_model, phiVision_processor\nglobal minicpm_llama_model, minicpm_llama_tokenizer\nglobal paligemma_model, paligemma_processor\nglobal paligemma_cpu_model, paligemma_cpu_processor\nglobal deepseek_processor, deepseek_tokenizer, deepseek_gpt\nglobal bunny_model, bunny_tokenizer\n\n\n# Function to generate a timestamped filename for saved images\ndef get_timestamped_filename(extension=\".png\"):\n    # Generate a timestamp in the format YYYYMMDD_HHMMSS\n    timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n    # Return a formatted file name with the timestamp\n    return f\"image_{timestamp}{extension}\"\n\n\n# Function to modify the user input before it is processed by the LLM\ndef input_modifier(user_input, state):\n    global selected_image_path\n    # Check if an image has been selected and stored in the global variable\n    if selected_image_path:\n        # Construct the message with the \"File location\" trigger phrase and the image file path\n        image_info = f\"File location: {selected_image_path}\"\n        # Combine the user input with the image information, separated by a newline for clarity\n        combined_input = f\"{user_input}\\n\\n{image_info}\"\n        # Reset the selected image path to None after processing\n        selected_image_path = None\n        return combined_input\n    # If no image is selected, return the user input as is\n    return user_input\n\n\n# Function to handle image upload and store the file path\ndef save_and_print_image_path(image):\n    global selected_image_path\n    if image is not None:\n        # Generate a unique timestamped filename for the image\n        file_name = get_timestamped_filename()\n        # Construct the full file path for the image\n        file_path = image_history_dir / file_name\n        # Save the uploaded image to the specified directory\n        image.save(file_path, format='PNG')\n        print(f\"Image selected: {file_path}\")\n        # Update the global variable with the new image file path\n        selected_image_path = file_path\n    else:\n        # Print a message if no image is selected\n        print(\"No image selected yet.\")\n\n\n# Function to create the Gradio UI components with the new direct vision model interaction elements\ndef ui():\n    # Create an image input component for the Gradio UI\n    image_input = gr.Image(label=\"Select an Image\", type=\"pil\", source=\"upload\")\n    # Set up the event that occurs when an image is uploaded\n    image_input.change(\n        fn=save_and_print_image_path,\n        inputs=image_input,\n        outputs=None\n    )\n\n    # Create a text field for user input to the vision model\n    user_input_to_vision_model = gr.Textbox(label=\"Type your question or prompt for the vision model\", lines=2,\n                                            placeholder=\"Enter your text here...\")\n\n    # Create a button to trigger the vision model processing\n    vision_model_button = gr.Button(value=\"Ask Vision Model\")\n\n    # Create a text field to display the vision model's response\n    vision_model_response_output = gr.Textbox(\n        label=\"Vision Model Response\",\n        lines=5,\n        placeholder=\"Response will appear here...\"\n    )\n\n    # Add radio buttons for vision model selection\n    vision_model_selection = gr.Radio(\n        choices=model_names,\n        # Added \"paligemma_cpu\" as a choice\n        value=config[\"default_vision_model\"],\n        label=\"Select Vision Model\"\n    )\n\n    # Add an event handler for the radio button selection\n    vision_model_selection.change(\n        fn=update_vision_model,\n        inputs=vision_model_selection,\n        outputs=None\n    )\n\n    cuda_devices_input = gr.Textbox(\n        value=cuda_device,\n        label=\"CUDA Device ID\",\n        max_lines=1\n    )\n\n    cuda_devices_input.change(\n        fn=update_cuda_device,\n        inputs=cuda_devices_input,\n        outputs=None\n    )\n\n    # Set up the event that occurs when the vision model button is clicked\n    vision_model_button.click(\n        fn=process_with_vision_model,\n        inputs=[user_input_to_vision_model, image_input, vision_model_selection],\n        # Pass the actual gr.Image component\n        outputs=vision_model_response_output\n    )\n\n    # Return a column containing the UI components\n    return gr.Column(\n        [\n            image_input,\n            user_input_to_vision_model,\n            vision_model_button,\n            vision_model_response_output,\n            vision_model_selection\n        ]\n    )\n\n\n# Function to load the PaliGemma CPU model and processor\ndef load_paligemma_cpu_model():\n    global paligemma_cpu_model, paligemma_cpu_processor\n    paligemma_cpu_model = PaliGemmaForConditionalGeneration.from_pretrained(\n        paligemma_model_id,\n    ).eval()\n    paligemma_cpu_processor = AutoProcessor.from_pretrained(paligemma_model_id)\n    print(\"PaliGemma CPU model loaded on-demand.\")\n\n\n# Function to unload the PaliGemma CPU model and processor\ndef unload_paligemma_cpu_model():\n    global paligemma_cpu_model, paligemma_cpu_processor\n    if paligemma_cpu_model is not None:\n        # Delete the model and processor instances\n        del paligemma_cpu_model\n        del paligemma_cpu_processor\n        print(\"PaliGemma CPU model unloaded.\")\n\n    # Global variable to store the selected vision model\n\n\n# Function to update the selected vision model and load the corresponding model\ndef update_vision_model(model_name):\n    global selected_vision_model\n    selected_vision_model = model_name\n    return model_name\n\n\ndef update_cuda_device(device):\n    global cuda_device\n    print(f\"Cuda device set to index = {device}\")\n    cuda_device = int(device)\n    return cuda_device\n\n\n# Main entry point for the Gradio interface\nif __name__ == \"__main__\":\n    # Launch the Gradio interface with the specified UI components\n    gr.Interface(\n        fn=ui,\n        inputs=None,\n        outputs=\"ui\"\n    ).launch()\n\n# Define a regular expression pattern to match the \"File location\" trigger phrase\nfile_location_pattern = re.compile(r\"File location: (.+)$\", re.MULTILINE)\n# Define a regular expression pattern to match and remove unwanted initial text\nunwanted_initial_text_pattern = re.compile(r\".*?prod\\(dim=0\\)\\r\\n\", re.DOTALL)\n\n# Define a regular expression pattern to match and remove unwanted prefixes from DeepSeek responses\n# This pattern should be updated to match the new, platform-independent file path format\nunwanted_prefix_pattern = re.compile(\n    r\"^\" + re.escape(str(image_history_dir)) + r\"[^ ]+\\.png\\s*Assistant: \",\n    re.MULTILINE\n)\n\n\n# Function to load the PhiVision model and processor\ndef load_phi_vision_model():\n    global phiVision_model, phiVision_processor, cuda_device\n    # Load the PhiVision model and processor on-demand, specifying the device map\n    phiVision_model = AutoModelForCausalLM.from_pretrained(\n        phiVision_model_id,\n        device_map={\"\": cuda_device},  # Use the specified CUDA device(s)\n        trust_remote_code=True,\n        torch_dtype=\"auto\"\n    )\n    phiVision_processor = AutoProcessor.from_pretrained(\n        phiVision_model_id,\n        trust_remote_code=True\n    )\n    print(\"PhiVision model loaded on-demand.\")\n\n\n# Function to unload the PhiVision model and processor\ndef unload_phi_vision_model():\n    global phiVision_model, phiVision_processor\n    if phiVision_model is not None:\n        # Delete the model and processor instances\n        del phiVision_model\n        del phiVision_processor\n        # Clear the CUDA cache to free up VRAM\n        torch.cuda.empty_cache()\n        print(\"PhiVision model unloaded.\")\n\n\n# Function to load the MiniCPM-Llama3 model and tokenizer\ndef load_minicpm_llama_model():\n    global minicpm_llama_model, minicpm_llama_tokenizer, cuda_device\n    if \"int4\" in minicpm_llama3_model_id:\n        # Load the 4-bit quantized model and tokenizer\n        minicpm_llama_model = AutoModel.from_pretrained(\n            minicpm_llama3_model_id,\n            device_map={\"\": cuda_device},\n            trust_remote_code=True,\n            torch_dtype=torch.float16  # Use float16 as per the example code for 4-bit models\n        ).eval()\n        minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n            minicpm_llama3_model_id,\n            trust_remote_code=True\n        )\n        # Print a message indicating that the 4-bit model is loaded\n        print(\"MiniCPM-Llama3 4-bit quantized model loaded on-demand.\")\n    else:\n        # Load the standard model and tokenizer\n        minicpm_llama_model = AutoModel.from_pretrained(\n            minicpm_llama3_model_id,\n            device_map={\"\": cuda_device},  # Use the specified CUDA device\n            trust_remote_code=True,\n            torch_dtype=torch.bfloat16\n        ).eval()\n        minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n            minicpm_llama3_model_id,\n            trust_remote_code=True\n        )\n        print(\"MiniCPM-Llama3 standard model loaded on-demand.\")\n\n\ndef unload_minicpm_llama_model():\n    global minicpm_llama_model, minicpm_llama_tokenizer\n    if minicpm_llama_model is not None:\n        del minicpm_llama_model\n        del minicpm_llama_tokenizer\n        torch.cuda.empty_cache()\n        print(\"MiniCPM-Llama3 model unloaded.\")\n\n\n# Function to load the PaliGemma model and processor\ndef load_paligemma_model():\n    global paligemma_model, paligemma_processor, cuda_device\n    # Load the PaliGemma model and processor on-demand, specifying the device map\n    paligemma_model = PaliGemmaForConditionalGeneration.from_pretrained(\n        paligemma_model_id,\n        device_map={\"\": cuda_device},  # Use the specified CUDA device(s)\n        torch_dtype=torch.bfloat16,\n        revision=\"bfloat16\",\n    ).eval()\n    paligemma_processor = AutoProcessor.from_pretrained(paligemma_model_id)\n    print(\"PaliGemma model loaded on-demand.\")\n\n\n# Function to unload the PaliGemma model and processor\ndef unload_paligemma_model():\n    global paligemma_model, paligemma_processor\n    if paligemma_model is not None:\n        # Delete the model and processor instances\n        del paligemma_model\n        del paligemma_processor\n        # Clear the CUDA cache to free up VRAM\n        torch.cuda.empty_cache()\n        print(\"PaliGemma model unloaded.\")\n\n\ndef load_deepseek_model():\n    global deepseek_processor, deepseek_tokenizer, deepseek_gpt, cuda_device\n    deepseek_processor = VLChatProcessor.from_pretrained(config[\"deepseek_vl_model_id\"])\n    deepseek_tokenizer = deepseek_processor.tokenizer\n    deepseek_gpt = AutoModelForCausalLM.from_pretrained(\n        config[\"deepseek_vl_model_id\"],\n        device_map={\"\": cuda_device},\n        trust_remote_code=True,\n        torch_dtype=torch.bfloat16\n    ).to(torch.bfloat16).cuda().eval()\n    print(\"DeepSeek model loaded on-demand.\")\n\n\ndef unload_deepseek_model():\n    global deepseek_processor, deepseek_tokenizer, deepseek_gpt\n    if deepseek_processor is not None:\n        del deepseek_gpt\n        del deepseek_tokenizer\n        del deepseek_processor\n        print(\"DeepSeek model unloaded.\")\n    torch.cuda.empty_cache()\n\n\ndef load_bunny_model():\n    global bunny_model, bunny_tokenizer, cuda_device\n    torch.cuda.set_device(cuda_device)  # Set default device before loading models\n    bunny_model = AutoModelForCausalLM.from_pretrained(\n        bunny_model_id,\n        torch_dtype=torch.bfloat16,\n        device_map={\"\": cuda_device},  # Use the specified CUDA device\n        trust_remote_code=True\n    ).to(torch.bfloat16).cuda()\n    bunny_tokenizer = AutoTokenizer.from_pretrained(\n        bunny_model_id,\n        trust_remote_code=True\n    )\n    print(\"Bunny model loaded on-demand.\")\n\n\ndef unload_bunny_model():\n    global bunny_model, bunny_tokenizer\n    if bunny_model is not None:\n        del bunny_model, bunny_tokenizer\n        print(\"Bunny model unloaded.\")\n    torch.cuda.empty_cache()\n\n\n# Function to modify the output from the LLM before it is displayed to the user\ndef output_modifier(output, state, is_chat=False):\n    global cuda_device\n    # Search for the \"File location\" trigger phrase in the LLM's output\n    file_location_matches = file_location_pattern.findall(output)\n    if file_location_matches:\n        # Extract the first match (assuming only one file location per output)\n        file_path = file_location_matches[0]\n        # Extract the questions for the vision model\n        questions_section, _ = output.split(f\"File location: {file_path}\", 1)\n        # Remove all newlines from the questions section and replace them with spaces\n        questions = \" \".join(questions_section.strip().splitlines())\n\n        # Initialize an empty response string\n        vision_model_response = \"\"\n\n        # Check which vision model is currently selected\n        if selected_vision_model == \"phiVision\":\n            vision_model_response = generate_phi_vision(file_path, questions)\n\n        elif selected_vision_model == \"DeepSeek\":\n            vision_model_response = generate_deepseek(file_path, questions)\n\n        elif selected_vision_model == \"paligemma\":\n            vision_model_response = generate_paligemma(file_path, questions)\n\n        elif selected_vision_model == \"paligemma_cpu\":\n            vision_model_response = generate_paligemma_cpu(file_path, questions)\n\n        elif selected_vision_model == \"minicpm_llama3\":\n            vision_model_response = generate_minicpm_llama3(file_path, questions)\n\n        elif selected_vision_model == \"bunny\":\n            vision_model_response = generate_bunny(file_path, questions)\n\n        # Append the vision model's responses to the output\n        output_with_responses = f\"{output}\\n\\nVision Model Responses:\\n{vision_model_response}\"\n        return output_with_responses\n    # If no file location is found, return the output as is\n    return output\n\n\n# Function to generate a response using the MiniCPM-Llama3 model\ndef generate_minicpm_llama3(file_path, questions):\n    global cuda_device\n    try:\n        load_minicpm_llama_model()\n        image = Image.open(file_path).convert(\"RGB\")\n        messages = [\n            {\"role\": \"user\", \"content\": f\"{questions}\"}\n        ]\n        # Define the generation arguments\n        generation_args = {\n            \"max_new_tokens\": 896,\n            \"repetition_penalty\": 1.05,\n            \"num_beams\": 3,\n            \"top_p\": 0.8,\n            \"top_k\": 1,\n            \"temperature\": 0.7,\n            \"sampling\": True,\n        }\n        if \"int4\" in minicpm_llama3_model_id:\n            # Disable streaming for the 4-bit model\n            generation_args[\"stream\"] = False\n            # Use the model.chat method with streaming enabled\n            vision_model_response = \"\"\n            for new_text in minicpm_llama_model.chat(\n                    image=image,\n                    msgs=messages,\n                    tokenizer=minicpm_llama_tokenizer,\n                    **generation_args\n            ):\n                vision_model_response += new_text\n                print(new_text, flush=True, end='')\n        else:\n            minicpm_llama_model.to(f\"cuda:{cuda_device}\")\n            vision_model_response = minicpm_llama_model.chat(\n                image=image,\n                msgs=messages,\n                tokenizer=minicpm_llama_tokenizer,\n                **generation_args\n            )\n            return vision_model_response\n    finally:\n        unload_minicpm_llama_model()\n\n\n# Function to process the user's input text and selected image with the vision model\ndef process_with_vision_model(user_input, image, selected_model):\n    global cuda_device\n    # Save the uploaded image to the specified directory with a timestamp\n    file_name = get_timestamped_filename()\n    file_path = image_history_dir / file_name\n    image.save(file_path, format='PNG')\n    print(f\"Image processed: {file_path}\")\n\n    # Initialize an empty response string\n    vision_model_response = \"\"\n\n    # Check which vision model is currently selected\n    if selected_model == \"phiVision\":\n        vision_model_response = generate_phi_vision(file_path, user_input)\n\n    elif selected_model == \"DeepSeek\":\n        vision_model_response = generate_deepseek(file_path, user_input)\n\n    elif selected_model == \"paligemma\":\n        vision_model_response = generate_paligemma(file_path, user_input)\n\n    elif selected_model == \"paligemma_cpu\":\n        vision_model_response = generate_paligemma_cpu(file_path, user_input)\n\n    elif selected_model == \"minicpm_llama3\":\n        vision_model_response = generate_minicpm_llama3(file_path, user_input)\n\n    elif selected_model == \"bunny\":\n        vision_model_response = generate_bunny(file_path, user_input)\n\n    # Return the cleaned-up response from the vision model\n    return vision_model_response\n\n\ndef generate_paligemma_cpu(file_path, user_input):\n    try:\n        # Load the PaliGemma CPU model and processor on-demand\n        load_paligemma_cpu_model()\n        # Load the saved image using PIL\n        with Image.open(file_path) as img:\n            # Prepare the prompt for the PaliGemma CPU model using the user's question\n            prompt = user_input\n            model_inputs = paligemma_cpu_processor(text=prompt, images=img, return_tensors=\"pt\")\n            input_len = model_inputs[\"input_ids\"].shape[-1]\n            # Generate the response using the PaliGemma CPU model\n            with torch.inference_mode():\n                generation = paligemma_cpu_model.generate(\n                    **model_inputs,\n                    max_new_tokens=100,\n                    do_sample=True  # Set to True for sampling-based generation\n                )\n                generation = generation[0][input_len:]\n                vision_model_response = paligemma_cpu_processor.decode(generation, skip_special_tokens=True)\n        # Unload the PaliGemma CPU model and processor after generating the response\n        return vision_model_response\n    finally:\n        unload_paligemma_cpu_model()\n\n\ndef generate_paligemma(file_path, user_input):\n    try:\n        # Load the PaliGemma model and processor on-demand\n        load_paligemma_model()\n        # Load the saved image using PIL\n        with Image.open(file_path) as img:\n            # Prepare the prompt for the PaliGemma model using the user's question\n            model_inputs = paligemma_processor(text=user_input, images=img, return_tensors=\"pt\").to(\n                f\"cuda:{cuda_device}\")\n            input_len = model_inputs[\"input_ids\"].shape[-1]\n            # Generate the response using the PaliGemma model\n            with torch.inference_mode():\n                generation = paligemma_model.generate(\n                    **model_inputs,\n                    max_new_tokens=100,\n                    do_sample=True  # Set to True for sampling-based generation\n                )\n                generation = generation[0][input_len:]\n                vision_model_response = paligemma_processor.decode(generation, skip_special_tokens=True)\n        # Unload the PaliGemma model and processor after generating the response\n        return vision_model_response\n    finally:\n        unload_paligemma_model()\n\n\ndef generate_phi_vision(file_path, user_input):\n    global cuda_device\n    try:\n        # Load the PhiVision model and processor on-demand\n        load_phi_vision_model()\n        # Load the saved image using PIL\n        with Image.open(file_path) as img:\n            # Prepare the prompt for the PhiVision model\n            messages = [\n                {\"role\": \"user\", \"content\": f\"<|image_1|>\\n{user_input}\"}\n            ]\n            prompt = phiVision_processor.tokenizer.apply_chat_template(\n                messages, tokenize=False, add_generation_prompt=True\n            )\n            # Extract the CUDA_VISIBLE_DEVICES setting from the config file\n            # cuda_devices = config[\"cuda_visible_devices\"]\n            # Convert the CUDA_VISIBLE_DEVICES string to an integer (assuming a single device for simplicity)\n            # cuda_device_index = int(cuda_devices)\n\n            # Prepare the model inputs and move them to the specified CUDA device\n            inputs = phiVision_processor(prompt, [img], return_tensors=\"pt\").to(f\"cuda:{cuda_device}\")\n            # Define the generation arguments\n            generation_args = {\n                \"max_new_tokens\": 500,\n                \"temperature\": 1.0,\n                \"do_sample\": True,  # Set to True for sampling-based generation\n                # \"min_p\": 0.95,  # Optionally set a minimum probability threshold\n            }\n            # Generate the response using the PhiVision model\n            generate_ids = phiVision_model.generate(\n                **inputs,\n                eos_token_id=phiVision_processor.tokenizer.eos_token_id,\n                **generation_args\n            )\n            # Remove input tokens from the generated IDs\n            generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]\n            # Decode the generated IDs to get the text response\n            vision_model_response = phiVision_processor.batch_decode(\n                generate_ids,\n                skip_special_tokens=True,\n                clean_up_tokenization_spaces=False\n            )[0]\n        # Unload the PhiVision model and processor after generating the response\n        return vision_model_response\n    finally:\n        unload_phi_vision_model()\n\n\ndef generate_deepseek(file_path, user_input):\n    try:\n        load_deepseek_model()\n        conversation = [\n            {\n                \"role\": \"User\",\n                \"content\": f\"<image_placeholder>{user_input}\",\n                \"images\": [f\"{file_path}\"]\n            }, {\n                \"role\": \"Assistant\",\n                \"content\": \"\"\n            }\n        ]\n        print(conversation)\n        pil_images = load_pil_images_for_deepseek(conversation)\n        prepare_inputs = deepseek_processor(\n            conversations=conversation,\n            images=pil_images,\n            force_batchify=True\n        ).to(deepseek_gpt.device)\n        input_embeds = deepseek_gpt.prepare_inputs_embeds(**prepare_inputs)\n        outputs = deepseek_gpt.language_model.generate(\n            inputs_embeds=input_embeds,\n            attention_mask=prepare_inputs.attention_mask,\n            pad_token_id=deepseek_tokenizer.eos_token_id,\n            bos_token_id=deepseek_tokenizer.bos_token_id,\n            eos_token_id=deepseek_tokenizer.eos_token_id,\n            max_new_tokens=512,\n            do_sample=False,\n            use_cache=True\n        )\n        return deepseek_tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)\n    finally:\n        unload_deepseek_model()\n\n\ndef generate_bunny(file_path, user_input):\n    global cuda_device\n    try:\n        load_bunny_model()\n        with Image.open(file_path) as image:\n            text = f\"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\\n{user_input} ASSISTANT:\"\n            text_chunks = [bunny_tokenizer(chunk).input_ids for chunk in text.split('<image>')]\n            input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][1:], dtype=torch.long).unsqueeze(0).to(\n                f\"cuda:{cuda_device}\")\n            image_tensor = bunny_model.process_images(\n                [image],\n                bunny_model.config\n            ).to(\n                dtype=bunny_model.dtype,\n                device=bunny_model.device\n            )\n            output_ids = bunny_model.generate(\n                input_ids,\n                images=image_tensor,\n                max_new_tokens=896,\n                use_cache=True,\n                repetition_penalty=1.0\n            )[0]\n            return bunny_tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()\n    finally:\n        unload_bunny_model()\n\n\n# Function to modify the chat history before it is used for text generation\ndef history_modifier(history):\n    # Extract all entries from the \"internal\" history\n    internal_entries = history[\"internal\"]\n    # Iterate over the \"internal\" history entries\n    for internal_index, internal_entry in enumerate(internal_entries):\n        # Extract the text content of the internal entry\n        internal_text = internal_entry[1]\n        # Search for the \"File location\" trigger phrase in the internal text\n        file_location_matches = file_location_pattern.findall(internal_text)\n        if file_location_matches:\n            # Iterate over each match found in the \"internal\" entry\n            for file_path in file_location_matches:\n                # Construct the full match string including the trigger phrase\n                full_match_string = f\"File location: {file_path}\"\n                # Search for the exact same string in the \"visible\" history\n                for visible_entry in history[\"visible\"]:\n                    # Extract the text content of the visible entry\n                    visible_text = visible_entry[1]\n                    # If the \"visible\" entry contains the full match string\n                    if full_match_string in visible_text:\n                        # Split the \"visible\" text at the full match string\n                        _, after_match = visible_text.split(full_match_string, 1)\n                        # Find the position where the \".png\" part ends in the \"internal\" text\n                        png_end_pos = internal_text.find(file_path) + len(file_path)\n                        # If the \".png\" part is found and there is content after it\n                        if png_end_pos < len(internal_text) and internal_text[png_end_pos] == \"\\n\":\n                            # Extract the existing content after the \".png\" part in the \"internal\" text\n                            _ = internal_text[png_end_pos:]\n                            # Replace the existing content after the \".png\" part in the \"internal\" text\n                            # with the corresponding content from the \"visible\" text\n                            new_internal_text = internal_text[:png_end_pos] + after_match\n                            # Update the \"internal\" history entry with the new text\n                            history[\"internal\"][internal_index][1] = new_internal_text\n                        # If there is no content after the \".png\" part in the \"internal\" text,\n                        # append the content from the \"visible\" text directly\n                        else:\n                            # Append the content after the full match string from the \"visible\" text\n                            history[\"internal\"][internal_index][1] += after_match\n    return history\n\n\nGitHub page markdown:\n\n# README\n\nVideo Demo: There is code in this repo to prevent Alltalk from reading the directory names out loud ([here](https://github.com/RandomInternetPreson/Lucid_Vision/tree/main?tab=readme-ov-file#to-work-with-alltalk)), the video is older and displays Alltalk reading the directory names however. https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/8879854b-06d8-49ad-836c-13c84eff6ac9 Download the full demo video here: https://github.com/RandomInternetPreson/Lucid_Vision/blob/main/VideoDemo/Lucid_Vision_demoCompBig.mov * Update June 2 2024, Lucid_Vision now supports Bunny-v1_1-Llama-3-8B-V, again thanks to https://github.com/justin-luoma :3 * Update May 31 2024, many thanks to https://github.com/justin-luoma, they have made several recent updates to the code, now merged with this repo. * Removed the need to communicate with deepseek via the cli, this removes the only extra dependency needed to run Lucid_Vision. * Reduced the number of entries needed for the config file. * Really cleaned up the code, it is much better organized now. * Added the ability to switch which gpu loads the vision model in the UI. * Bonus Update for May 31 2024, WizardLM-2-8x22B and I figured out how to prevent Alltalk from reading the file directory locations!! https://github.com/RandomInternetPreson/Lucid_Vision/tree/main?tab=readme-ov-file#to-work-with-alltalk * Update May 30 2024, Lucid_Vision now supports MiniCPM-Llama3-V-2_5, thanks to https://github.com/justin-luoma. Additionally WizardLM-2-8x22B added the functionality to load in the MiniCPM 4-bit model. * Updated script.py and config file, model was not previously loading to the user assigned gpu Experimental, and I am currently working to improve the code; but it may work for most. Todo: * Right now the temp setting and parameters for each model are coded in the .py script, I intend to bring these to the UI [done](https://github.com/RandomInternetPreson/Lucid_Vision/tree/main?tab=readme-ov-file#to-work-with-alltalk)~~* Make it comptable with Alltalk, right now each image directory is printed out and this will be read out loud by Alltalk~~ To accurately proportion credit (original repo creation): WizardLM-2-8x22B (quantized with exllamaV2 to 8bit precision) = 90% of all the work done in this repo. That model wrote 100% of all the code and most of the introduction to this repo. CommandR+ (quantized with exllamaV2 to 8bit precision) = ~5% of all the work. CommandR+ contextualized the coding examples and rules for making extensions from Oogabooga's textgen repo extreamly well, and provided a good foundation to develope the code. RandomInternetPreson = 5% of all the work done. I came up with the original idea, the original general outline of how the pieces would interact, and provided feedback to the WizardLM model, but I did not write any code. I'm actually not very good with python yet, with most of my decades of coding being in Matlab. My goal from the beginning was to write this extension offline without any additional resources, sometimes it was a little frusturating but I soon understood how to get want I needed from the models running locally. I would say that most of the credit should go to Oobabooga, for without them I would be struggling to even interact with my models. Please consider supporting them: https://github.com/sponsors/oobabooga or https://ko-fi.com/oobabooga I am their top donor on ko-fi (Mr. A) and donate 15$ montly, their software is extreamly important to the opensource community. # Lucid_Vision Extension for Oobabooga's textgen-webui Welcome to the Lucid Vision Extension repository! This extension enhances the capabilities of textgen-webui by integrating advanced vision models, allowing users to have contextualized conversations about images with their favorite language models; and allowing direct communciation with vision models. ## Features * Multi-Model Support: Interact with different vision models, including PhiVision, DeepSeek, MiniCPM-Llama3-V-2_5 (4bit and normal precision), Bunny-v1_1-Llama-3-8B-V, and PaliGemma, with options for both GPU and CPU inference. * On-Demand Loading: Vision models are loaded into memory only when needed to answer a question, optimizing resource usage. * Seamless Integration: Easily switch between vision models using a Gradio UI radio button selector. * Cross-Platform Compatibility: The extension is designed to work on various operating systems, including Unix-based systems and Windows. (not tested in Windows yet, but should probably work?) * Direct communication with vision models, you do not need to load a LLM to interact with the separate vision models. ## How It Works The Lucid Vision Extension operates by intercepting and modifying user input and output within the textgen-webui framework. When a user uploads an image and asks a question, the extension appends a special trigger phrase (\"File location:\") and extracts the associated file path and question. So if a user enters text into the \"send a message\" field and has a new picture uploaded into the Lucid_Vision ui, what will happen behind the scenes is the at the user message will be appended with the \"File Location: (file location)\" Trigger phrase, at which point the LLM will see this and understand that it needs to reply back with questions about the image, and that those questions are being sent to a vison model. The cool thing is that let's say later in the conversation you want to know something specific about a previous picture, all you need to do is ask your LLM, YOU DO NOT NEED TO REUPLOAD THE PICTURE, the LLM should be able to interact with the extension on its own after you uploaded your first picture. The selected vision directly interacts with the model's Python API to generate a response. The response is then appended to the chat history, providing the user with detailed insights about the image. The extension is designed to be efficient with system resources by only loading the vision models into memory when they are actively being used to process a question. After generating a response, the models are immediately unloaded to free up memory and GPU VRAM. ## **How to install and setup:** 1.use the latest version of textgen ~~Install this edited prior commit from oobabooga's textgen https://github.com/RandomInternetPreson/textgen_webui_Lucid_Vision_Testing OR use the latest version of textgen.~~ ~~If using the edited older version, make sure to rename the install folder `text-generation-webui`~~ ~~(Note, a couple months ago gradio had a massive update. For me, this has caused a lot of glitches and errors with extensions; I've briefly tested the Lucid_Vision extension in the newest implementation of textgen and it will work. However, I was getting timeout popups when vision models were loading for the first time, gradio wasn't waiting for the response from the model upon first load. After a model is loaded once, it is saved in cpu ram cache (this doesn't actively use your ram, it just uses what is free to keep the models in memory so they are quickly reloaded into gpu ram if necessary) and gradio doesn't seem to timeout as often. The slightly older version of textgen that I've edited does not experience this issue)~~ ~~2. Update the transformers library using the cmd_yourOShere.sh/bat file (so either cmd_linux.sh, cmd_macos.sh, cmd_windows.bat, or cmd_wsl.bat) and entering the following lines. If you run the update wizard after this point, it will overrite this update to transformers. The newest transformers package has the libraries for paligemma, which the code needs to import regardless of whether or not you are intending to use the model.~~ No longer need to update transformers and the latest version of textgen as of this writing 1.9.1 works well with many of the gradio issues resolved, however gradio might timeout on the page when loading a model for the fisrt time still. After a model is loaded once, it is saved in cpu ram cache (this doesn't actively use your ram, it just uses what is free to keep the models in memory so they are quickly reloaded into gpu ram if necessary) and gradio doesn't seem to timeout as often. ## Model Information If you do not want to install DeepseekVL dependencies and are not intending to use deepseek comment out the import lines from the script.py file as displayed here ``` # pip install pexpect import json import re from datetime import datetime # Import datetime for timestamp generation from pathlib import Path import gradio as gr import torch from PIL import Image #from deepseek_vl.models import VLChatProcessor #from deepseek_vl.utils.io import load_pil_images as load_pil_images_for_deepseek from transformers import AutoModelForCausalLM, AutoModel, AutoTokenizer, AutoProcessor, \\ PaliGemmaForConditionalGeneration model_names = [\"phiVision\", \"DeepSeek\", \"paligemma\", \"paligemma_cpu\", \"minicpm_llama3\", \"bunny\"] ``` 3. Install **DeepseekVL** if you intend on using that model Clone the repo: https://github.com/deepseek-ai/DeepSeek-VL into the `repositories` folder of your textgen install Open cmd_yourOShere.sh/bat, navigate to the `repositories/DeepSeek-VL` folder via the terminal using `cd your_directory_here` and enter this into the command window: ``` pip install -e . ``` Download the deepseekvl model here: https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat They have different and smaller models to choose from: https://github.com/deepseek-ai/DeepSeek-VL?tab=readme-ov-file#3-model-downloads 4. If you want to use **Phi-3-vision-128k-instruct**, download it here: https://huggingface.co/microsoft/Phi-3-vision-128k-instruct 5. If you want to use **paligemma-3b**, download it here: https://huggingface.co/google/paligemma-3b-ft-cococap-448 (this is just one out of many fine-tunes google provides) Read this blog on how to inference with the model: https://huggingface.co/blog/paligemma 6. If you want to use **MiniCPM-Llama3-V-2_5**, download it here: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5 The **4-bit** verison of the model can be downloaded here: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4 **Notes about 4-bit MiniCPM:** * It might not look like the model fully unloads from vram, but it does and the vram will be reclaimed if another program needs it * Your directory folder where the model is stored needs to have the term \"int4\" in it, this is how the extension identifies the 4bit nature of the model 7. If you want to use **Bunny-v1_1-Llama-3-8B-V**, download it here: https://huggingface.co/BAAI/Bunny-v1_1-Llama-3-8B-V **Notes about Bunny-v1_1-Llama-3-8B-V:** * When running for the first time the model needs internet access so it can download models--google--siglip-so400m-patch14-384 to your cache/huggingface directory. This is an additional 3.5GB file the model needs to run. ## Updating the config file 8. Before using the extension you need to update the config file; open it in a text editor *Note No quotes around gpu #: ``` { \"image_history_dir\": \"(fill_In)/extensions/Lucid_Vision/ImageHistory/\", \"cuda_visible_devices\": 0, \"default_vision_model\": \"phiVision\", \"phiVision_model_id\": \"(fill_In)\", \"paligemma_model_id\": \"(fill_In)\", \"paligemma_cpu_model_id\": \"(fill_In)\", \"minicpm_llama3_model_id\": \"(fill_In)\", \"deepseek_vl_model_id\": \"(fill_in)\", \"bunny_model_id\": \"(fill_in)\" } ``` If your install directory is /home/username/Desktop/oobLucidVision/text-generation-webui/ the config file will look like this for example: Make note that you want to change / to \\ if you are on Windows ``` { \"image_history_dir\": \"/home/username/Desktop/oobLucidVision/text-generation-webui/extensions/Lucid_Vision/ImageHistory/\", \"cuda_visible_devices\": 0, \"default_vision_model\": \"phiVision\", \"phiVision_model_id\": \"(fill_In)\", *This is the folder where your phi-3 vision model is stored \"paligemma_model_id\": \"(fill_In)\", *This is the folder where your paligemma vision model is stored \"paligemma_cpu_model_id\": \"(fill_In)\", *This is the folder where your paligemma vision model is stored \"minicpm_llama3_model_id\": \"(fill_In)\", *This is the folder where your minicpm_llama3 vision model is stored, the model can either be the normal fp16 or 4-bit version \"deepseek_vl_model_id\": \"(fill_in)\", *This is the folder where your deepseek vision model is stored \"bunny_model_id\": \"(fill_in)\" *This is the folder where your Bunny-v1_1-Llama-3-8B-V vision model is stored } ``` ## To work with Alltalk Alltalk is great!! An extension I use all the time: https://github.com/erew123/alltalk_tts To get Lucid_Vision to work, the LLM needs to repeat the file directory of an image, and in doing so Alltalk will want to transcribe that text to audo. It is annoying to have the tts model try an transcribe file directories. If you want to get Alltalk to work well wtih Lucid_Vision you need to replace the code here in the script.py file that comes with Alltalk (search for the \"IMAGE CLEANING\" part of the code): ``` ######################## #### IMAGE CLEANING #### ######################## OriginalLucidVisionText = \"\" # This is the existing pattern for matching both images and file location text img_pattern = r'<img[^>]*src\\s*=\\s*[\"\\'][^\"\\'>]+[\"\\'][^>]*>|File location: [^.]+.png' def extract_and_remove_images(text): \"\"\" Extracts all image and file location data from the text and removes it for clean TTS processing. Returns the cleaned text and the extracted image and file location data. \"\"\" global OriginalLucidVisionText OriginalLucidVisionText = text # Update the global variable with the original text img_matches = re.findall(img_pattern, text) img_info = \"\\n\".join(img_matches) # Store extracted image and file location data cleaned_text = re.sub(img_pattern, '', text) # Remove images and file locations from text return cleaned_text, img_info def reinsert_images(cleaned_string, img_info): \"\"\" Reinserts the previously extracted image and file location data back into the text. \"\"\" global OriginalLucidVisionText # Check if there are images or file locations to reinsert if img_info: # Check if the \"Vision Model Responses:\" phrase is present in the original text if re.search(r'Vision Model Responses:', OriginalLucidVisionText): # If present, return the original text as is, without modifying it return OriginalLucidVisionText else: # If not present, append the img_info to the end of the cleaned string cleaned_string += f\"\\n\\n{img_info}\" return cleaned_string ################################# #### TTS STANDARD GENERATION #### ################################# ``` ## **Quirks and Notes:** 1. **When you load a picture once, it is used once. Even if the image stays present in the UI element on screen, it is not actively being used.** 2. If you are using other extensions, load Lucid_Vision first. Put it first in your CMD_FLAGS.txt file or make sure to check it first in the sequence of check boxes in the session tab UI. 3. DeepseekVL can take a while to load initially, that's just the way it is. # **How to use:** ## **BASICS:** **When you load a picture once, it is used once. Even if the image stays present in the UI element on screen, it is not actively being used.** Okay the extension can do many different things with varying levels of difficulty. Starting out with the basics and understanding how to talk with your vision models: Scroll down past where you would normally type something to the LLM, you do not need a large language model loaded to use this function. ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/227ff483-5041-46a7-9b5b-a8f9dd3c673e) Start out by interacting with the vision models without involvement of a seperate LLM model by pressing the `Ask Vision Model` button ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/4530e13f-30a1-43d9-8383-c05e31ddb5d7) **Note paligemma requries a little diffent type prompting sometimes, read the blog on how to inference with it: https://huggingface.co/blog/paligemma** Do this with every model you intend on using, upload a picture, and ask a question ## **ADVANCED: Updated Tips At End on how to get working with Llama-3-Instruct-8B-SPPO-Iter3 https://huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3** Okay, this is why I built the extension in the first place; direct interaction with the vision model was actually an afterthought after I had all the code working. This is how to use your favorite LLM WITH an additional Vision Model, I wanted a way to give my LLMs eyes essentially. I realize that good multimodal models are likely around the corner, but until they match the intellect of very good LLMs, I'd rather have a different vision model work with a good LLM. To use Lucid_Vision as intended requires a little bit of setup: 1. In `Parameters` then `chat` load the \"AI_Image\" character (this card is in the edited older commit, if using your own version of textgen the character card is here: https://github.com/RandomInternetPreson/Lucid_Vision/blob/main/AI_Image.yaml put it in the `characters` folder of the textgen install folder: ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/8b5d770a-b72a-4a74-aa4a-f67fe2a113ba) 2. In `Parameters` then `Generation` under `Custom stopping strings` enter \"Vision Model Responses:\" exactly as shown: ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/7a09650d-4fce-4a8f-badb-b26b4484bf37) 3. Test to see if your LLM understands the instructions for the AI_Image character; ask it this: `Can you tell me what is in this image? Do not add any additional text to the beginning of your response to me, start your reply with the questions.` ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/817605d7-dcab-4e6c-8c39-6d119a189431) I uploaded my image to the Lucid_Vision UI element, then typed my question, then pressed \"Generate\"; it doesn't matter which order you upload your picture or type your question, both are sent to the LLM at the same time. ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/f1525e8d-551d-4719-9e01-3f26b4362d7c) 4. Continue to give the model more images or just talk about something else, in this example I give the model two new images one of a humanoid robot and one of the nutritional facts for a container of sourcream: ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/77131490-fb19-4062-a65f-c0880515e252) ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/fed7021f-edb5-4667-aae0-c05021291618) ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/d1049551-4e48-4ced-85d9-fbb0441330a9) Then I asked the model why the sky is blue: ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/deb7b70a-edf7-4014-bd8c-2492bae00ea6) 5. Now I can ask it a question from a previous image and the LLM will automatically know to prompt the vision model and it will find the previously saved image on its own: ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/e8b0791b-cce8-4183-8971-42e5383d6c46) Please note that the vision model received all this as text: \"Certainly! To retrieve the complete contents of the nutritional facts from the image you provided earlier, I will ask the vision model to perform an Optical Character Recognition (OCR) task on the image. Here is the question for the vision model: Can you perform OCR on the provided image and extract all the text from the nutrition facts label, including the details of each nutrient and its corresponding values?\" If the vision model is not that smart (Deepseek, paligemma) then it will have a difficult time contextualizing the text prior to the questions for the vision model. If you run into this situation, it is best to prompt your LLM like this: `Great, can you get the complete contents of the nutritional facts from earlier, like all the text? Just start with your questions to the vision model please.` ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/38b59679-3550-410d-aac2-29756a058c8f) 6. Please do not get frustrated right away, the Advanced method of usage depends heavily on your LLM's ability to understand the instructions from the character card. The instructions were written by the 8x22B model itself, I explained what I wanted the model to do and had it write its own instructions. This might be a viable alternative if you are struggling to get your model to adhere to the instructions from the character card. You may need to explain things to your llm as your conversation progresses, if it tries to query the vision model when you don't want it to, just explain that to the model and it's unlikely to keep making the same mistake. ## **Tips on how to get working with Llama-3-Instruct-8B-SPPO-Iter3 https://huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3** I have found to get this model (and likely similar models) working properly with lucid vision I needed to NOT use the AI_Image character card. Instead I started the conversation with the model like this: ![image](https://github.com/user-attachments/assets/4d0ff97c-f9dd-4043-b5be-7b915cb7aae7) Then copy and pasted the necessary information from the character card: ``` The following is a conversation with an AI Large Language Model. The AI has been trained to answer questions, provide recommendations, and help with decision making. The AI follows user requests. The AI thinks outside the box. Instructions for Processing Image-Related User Input with Appended File Information: Identify the Trigger Phrase: Begin by scanning the user input for the presence of the \"File location\" trigger phrase. This phrase indicates that the user has selected an image and that the LLM should consider this information in its response. Extract the File Path: Once the trigger phrase is identified, parse the text that follows to extract the absolute file path of the image. The file path will be provided immediately after the trigger phrase and will end with the image file extension (e.g., .png, .jpg). Understand the Context: Recognize that the user's message preceding the file path is the primary context for the interaction. The LLM should address the user's query or statement while also incorporating the availability of the selected image into its response. Formulate Questions for the Vision Model: Based on the user's message and the fact that an image is available for analysis, generate one or more questions that can be answered by the vision model. These questions should be clear, specific, and relevant to the image content. Maintain a Conversational Tone: Ensure that the response is natural, coherent, and maintains the flow of the conversation. The LLM should act as an intermediary between the user and the vision model, facilitating a seamless dialogue. Prepare the Response Structure: Structure the response so that it includes: An acknowledgment of the user's initial message. The questions formulated for the vision model, each clearly separated (e.g., by a newline or bullet point). Any additional information or clarification requests, if necessary. Append the File Path for Processing: At the end of the response, re-append the \"File location\" trigger phrase along with the extracted file path. This ensures that the subsequent processing steps (such as sending the information to the vision model's CLI) can correctly identify and use the file path. Avoid Direct Interaction with the File System: As an LLM, you do not have the capability to directly access or modify files on the server or client systems. Your role is limited to generating text-based responses that include the necessary file path information for other system components to handle. Example Response: Based on your question about the contents of the image, here are the questions I will ask the vision model: - Can you describe the main objects present in the image? - Is there any text visible in the image, and if so, what does it say? - What appears to be happening in the scene depicted in the image? File location: //home/myself/Pictures/rose.png Do review and contextualize the conversation as it develops (reference your context) to infer if the user is asking new questions of previous images. Reference the parts of the convesation that are likely to yeild the file location of the image in question, and formulate your response to include that specific file location of that unique .png file, make sure you are referencing the correct .png file as per the part of the conversation that is likely to be in reference to the updated information request. By following these instructions, the LLM will be able to effectively process user inputs that include image file information and generate appropriate responses that facilitate the interaction between the user and the vision model. ``` Making sure to give a real file location to a real image. After that it started to pretty much work but I still needed to correct the model (I had to add the spelling error intentionally because if the user sends \"File Location:\" is messes up the extension: ![image](https://github.com/user-attachments/assets/5090ef8b-88ff-4ec1-a131-ba57e15e9a7d) The point to take away is that you may need to explain things to your model in various ways for it to contextualize the instructions approprately.",
            "Thank you for providing the `Lucid_Vision` script and the GitHub markdown. I have reviewed the code and the documentation to understand how the extension works. Here's a summary of the `Lucid_Vision` extension:\n\n### Summary of `Lucid_Vision` Extension\n\n**Purpose**:\n- The `Lucid_Vision` extension enhances the capabilities of TextGen by integrating advanced vision models, allowing users to have contextualized conversations about images with their favorite language models and enabling direct communication with vision models.\n\n**Features**:\n1. **Multi-Model Support**: Interact with different vision models, including PhiVision, DeepSeek, MiniCPM-Llama3-V-2_5 (4-bit and normal precision), Bunny-v1_1-Llama-3-8B-V, and PaliGemma, with options for both GPU and CPU inference.\n2. **On-Demand Loading**: Vision models are loaded into memory only when needed to answer a question, optimizing resource usage.\n3. **Seamless Integration**: Easily switch between vision models using a Gradio UI radio button selector.\n4. **Cross-Platform Compatibility**: The extension is designed to work on various operating systems, including Unix-based systems and Windows.\n5. **Direct Communication with Vision Models**: Users do not need to load a large language model (LLM) to interact with the separate vision models.\n\n**How It Works**:\n- The extension intercepts and modifies user input and output within the TextGen framework.\n- When a user uploads an image and asks a question, the extension appends a special trigger phrase (\"File location:\") and extracts the associated file path and question.\n- The selected vision model directly interacts with the model's Python API to generate a response.\n- The response is then appended to the chat history, providing the user with detailed insights about the image.\n- The extension is designed to be efficient with system resources by only loading the vision models into memory when they are actively being used to process a question. After generating a response, the models are immediately unloaded to free up memory and GPU VRAM.\n\n**Next Steps**:\n1. **Update `ImageTestV1.py`**: Integrate the functionality to send each cropped image to the vision model for processing.\n2. **Store Results**: Store the results from the vision model in a structured format (e.g., a JSON file) that the large language model can reference.\n3. **Develop Interaction Logic**: Develop the logic for the large language model to interact with the web page using the stored data.\n4. **Implement Mouse and Keyboard Operation**: Implement the functionality for the large language model to operate a mouse and keyboard.\n\n### Plan for Integration\n\n1. **Review the Vision Model Integration**:\n   - Understand how the `Lucid_Vision` extension loads and interacts with different vision models.\n   - Identify the functions and methods used to process images and generate responses.\n\n2. **Update `ImageTestV1.py`**:\n   - Modify the `crop_images` function to send each cropped image to the vision model for processing.\n   - Integrate the vision model processing code from the `Lucid_Vision` extension into the `ImageTestV1.py` script.\n\n3. **Store Results**:\n   - Create a structured format (e.g., a JSON file) to store the results from the vision model.\n   - Ensure that the large language model can reference this structured data.\n\n4. **Develop Interaction Logic**:\n   - Develop the logic for the large language model to interact with the web page using the stored data.\n   - Implement the functionality for the large language model to operate a mouse and keyboard.\n\nPlease confirm if you have any specific requirements or preferences for how the vision model should be integrated into the `ImageTestV1.py` script. Once confirmed, I will proceed with the integration."
        ],
        [
            " Okay, so I hope this makes sense, but my, what I envision is the slow metamorphosis of imagetestV1 into Lucid_Vision with amazing features, but not exactly. So let me explain. You have summarized everything perfectly. I cannot overstate that. That's great, great job. Yes, okay, so for the next step, let us, what I want you to do is I want you to contextualize the Lucid Vision code. And I want you to use the model miniCPM-lama3-v-2_5. I want you to use the code for that specific model to run all of the cropped images through. And I want you to ask that model the question, I want you to send a specific message to that model. And I want that specific message at the top of the code, so I want to be able to edit it easily like the other parameters. But I want the question to be \"what is the text in this image or what does this image describe?\" And then I want you to take that response, and I want you to include it in a JSON file that effectively has the image name, okay, so the name of the image, the cropped image that it was fed. I'm looking through the outputs here. Yeah, so it will have the image name, and then it will have the coordinates for the box, and then it will have the miniCPM model description all as a JSON output.",
            "Thank you for the detailed explanation. Let's proceed with the integration step-by-step.\n\n### Plan for Integration\n\n1. **Contextualize the Lucid Vision Code**:\n   - Identify the necessary functions and methods from the `Lucid_Vision` extension to process images using the `miniCPM-lama3-v-2_5` model.\n\n2. **Update `ImageTestV1.py`**:\n   - Modify the `crop_images` function to send each cropped image to the `miniCPM-lama3-v-2_5` model for processing.\n   - Integrate the vision model processing code from the `Lucid_Vision` extension into the `ImageTestV1.py` script.\n\n3. **Store Results**:\n   - Create a structured format (e.g., a JSON file) to store the results from the vision model.\n   - Ensure that the large language model can reference this structured data.\n\n### Updated `ImageTestV1.py` Script\n\nHere's the updated `ImageTestV1.py` script with the integration of the `miniCPM-lama3-v-2_5` model:\n\n```python\nimport torch\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport json\nimport os\n\n# Configurable variables\nMONITOR_INDEX = 1  # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nSCREENSHOT_FILENAME = \"screenshot.png\"  # Filename to save the screenshot as\nTEXT_QUERIES = \"clickable icons,hyperlinks\"  # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.16  # Score threshold for filtering low-probability predictions\nOUTPUT_DIR = \"ImageOutputTest\"  # Directory to save the cropped images\n\n# Vision model configuration\nVISION_MODEL_NAME = \"minicpm_llama3\"\nVISION_MODEL_ID = \"openbmb/MiniCPM-Llama3-V-2_5\"\nVISION_MODEL_QUESTION = \"What is the text in this image or what does this image describe?\"\n\n# Use GPU if available\nif torch.cuda.is_available():\n    device = torch.device(\"cuda\")\nelse:\n    device = torch.device(\"cpu\")\n\n# Load the Owlv2 model and processor\nmodel = Owlv2ForObjectDetection.from_pretrained(\"google/owlv2-base-patch16-ensemble\").to(device)\nprocessor = Owlv2Processor.from_pretrained(\"google/owlv2-base-patch16-ensemble\")\n\n# Load the miniCPM-lama3-v-2_5 model and tokenizer\nminicpm_llama_model = AutoModel.from_pretrained(VISION_MODEL_ID).to(device)\nminicpm_llama_tokenizer = AutoTokenizer.from_pretrained(VISION_MODEL_ID)\n\ndef take_screenshot(monitor_index=MONITOR_INDEX, output_filename=SCREENSHOT_FILENAME):\n    \"\"\"\n    Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n    Args:\n        monitor_index (int): The index of the monitor to capture (default is MONITOR_INDEX).\n        output_filename (str): The filename to save the screenshot as (default is SCREENSHOT_FILENAME).\n    \"\"\"\n    # Get information about all monitors\n    monitors = screeninfo.get_monitors()\n\n    if monitor_index >= len(monitors):\n        raise ValueError(f\"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.\")\n\n    # Get the region of the specified monitor\n    monitor = monitors[monitor_index]\n    region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n    # Take the screenshot\n    screenshot = pyautogui.screenshot(region=region)\n\n    # Save the screenshot\n    screenshot.save(output_filename)\n    print(f\"Screenshot saved as {output_filename}\")\n\n    return output_filename\n\ndef query_image(img_path, text_queries=TEXT_QUERIES, score_threshold=SCORE_THRESHOLD):\n    \"\"\"\n    Processes the image using the Owlv2 model to detect objects based on text queries.\n\n    Args:\n        img_path (str): The path to the input image.\n        text_queries (str): Comma-separated text queries for object detection (default is TEXT_QUERIES).\n        score_threshold (float): The score threshold for filtering low-probability predictions (default is SCORE_THRESHOLD).\n    \"\"\"\n    img = Image.open(img_path)\n    text_queries = text_queries.split(\",\")\n\n    size = max(img.size)\n    target_sizes = torch.Tensor([[size, size]])\n    inputs = processor(text=text_queries, images=img, return_tensors=\"pt\").to(device)\n\n    with torch.no_grad():\n        outputs = model(**inputs)\n\n    outputs.logits = outputs.logits.cpu()\n    outputs.pred_boxes = outputs.pred_boxes.cpu()\n    results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n    boxes, scores, labels = results[0][\"boxes\"], results[0][\"scores\"], results[0][\"labels\"]\n\n    result_labels = []\n    img_pil = img.copy()\n    draw = ImageDraw.Draw(img_pil)\n\n    for box, score, label in zip(boxes, scores, labels):\n        box = [int(i) for i in box.tolist()]\n        if score < score_threshold:\n            continue\n        result_labels.append([box, text_queries[label.item()]])\n        draw.rectangle(box, outline=\"red\", width=2)\n        # Do not draw the text label within the bounding box\n        # draw.text((box[0], box[1]), text_queries[label.item()], fill=\"red\")\n\n    # Save the annotated image for debugging\n    annotated_image_path = \"annotated_image.png\"\n    img_pil.save(annotated_image_path)\n\n    # Save the JSON output to a file\n    json_output_path = \"output.json\"\n    with open(json_output_path, 'w') as f:\n        json.dump(result_labels, f)\n\n    return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir=OUTPUT_DIR):\n    \"\"\"\n    Crops out the detected objects from the input image based on the bounding box coordinates.\n\n    Args:\n        image_path (str): The path to the input image.\n        json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n        output_dir (str): The directory to save the cropped images (default is OUTPUT_DIR).\n    \"\"\"\n    with open(json_output_path, 'r') as f:\n        data = json.load(f)\n\n    img = Image.open(image_path)\n\n    results = []\n\n    for i, item in enumerate(data):\n        box = item[0]\n        label = item[1]\n        cropped_img = img.crop(box)\n        cropped_img_path = f\"{output_dir}/{label}_{i}.png\"\n        cropped_img.save(cropped_img_path)\n\n        # Process the cropped image with the miniCPM-lama3-v-2_5 model\n        vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n        # Store the results in a structured format\n        results.append({\n            \"image_name\": f\"{label}_{i}.png\",\n            \"coordinates\": box,\n            \"description\": vision_model_response\n        })\n\n    # Save the results to a JSON file\n    results_json_path = f\"{output_dir}/results.json\"\n    with open(results_json_path, 'w') as f:\n        json.dump(results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n    \"\"\"\n    Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n    Args:\n        image_path (str): The path to the input image.\n        question (str): The question to ask the vision model.\n\n    Returns:\n        str: The response from the vision model.\n    \"\"\"\n    image = Image.open(image_path).convert(\"RGB\")\n    messages = [\n        {\"role\": \"user\", \"content\": f\"{question}\"}\n    ]\n    # Define the generation arguments\n    generation_args = {\n        \"max_new_tokens\": 896,\n        \"repetition_penalty\": 1.05,\n        \"num_beams\": 3,\n        \"top_p\": 0.8,\n        \"top_k\": 1,\n        \"temperature\": 0.7,\n        \"sampling\": True,\n    }\n    if \"int4\" in VISION_MODEL_ID:\n        # Disable streaming for the 4-bit model\n        generation_args[\"stream\"] = False\n        # Use the model.chat method with streaming enabled\n        vision_model_response = \"\"\n        for new_text in minicpm_llama_model.chat(\n                image=image,\n                msgs=messages,\n                tokenizer=minicpm_llama_tokenizer,\n                **generation_args\n        ):\n            vision_model_response += new_text\n            print(new_text, flush=True, end='')\n    else:\n        minicpm_llama_model.to(device)\n        vision_model_response = minicpm_llama_model.chat(\n            image=image,\n            msgs=messages,\n            tokenizer=minicpm_llama_tokenizer,\n            **generation_args\n        )\n    return vision_model_response\n\nif __name__ == \"__main__\":\n    # Take the screenshot\n    screenshot_path = take_screenshot()\n\n    # Process the image using the Owlv2 model\n    annotated_image_path, json_output_path = query_image(screenshot_path)\n\n    # Create the output directory if it doesn't exist\n    os.makedirs(OUTPUT_DIR, exist_ok=True)\n\n    # Crop the detected objects and save them to the output directory\n    crop_images(annotated_image_path, json_output_path)\n\n    print(\"Processing complete. Check the output directory for the cropped images and results.\")\n```\n\n### Explanation\n\n1. **Configurable Variables**:\n   - Added `VISION_MODEL_NAME`, `VISION_MODEL_ID`, and `VISION_MODEL_QUESTION` for easy configuration.\n\n2. **Model Loading**:\n   - Loaded the `miniCPM-lama3-v-2_5` model and tokenizer.\n\n3. **Function `process_with_vision_model`**:\n   - This function processes the image with the `miniCPM-lama3-v-2_5` model and returns the response.\n\n4. **Function `crop_images`**:\n   - Modified to send each cropped image to the `miniCPM-lama3-v-2_5` model for processing.\n   - Stores the results in a structured format (JSON file).\n\nPlease review the updated script and let me know if you have any further requirements or adjustments."
        ],
        [
            "Hmm, I'm getting out of memory errors:\n\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 209.56 MiB is free. Including non-PyTorch memory, this process has 23.43 GiB memory in use. Of the allocated memory 23.00 GiB is allocated by PyTorch, and 49.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)\n\n\nIs this because you are trying to send all the images at once?  Or is this because the Owlv2 model is not unloaded?  OR maybe something else?",
            "The out-of-memory error you're encountering is likely due to the high memory usage of the models, especially when both the Owlv2 model and the miniCPM-lama3-v-2_5 model are loaded into memory simultaneously. Additionally, processing multiple images at once can also contribute to high memory usage.\n\nTo address this issue, we can take the following steps:\n\n1. **Unload the Owlv2 Model**: After processing the image and obtaining the bounding boxes, we can unload the Owlv2 model to free up memory.\n2. **Process Images Sequentially**: Instead of processing all cropped images at once, we can process them one by one to reduce memory usage.\n3. **Clear CUDA Cache**: Clear the CUDA cache after processing each image to free up memory.\n\nHere's the updated `ImageTestV1.py` script with these changes:\n\n```python\nimport torch\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport json\nimport os\n\n# Configurable variables\nMONITOR_INDEX = 1  # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nSCREENSHOT_FILENAME = \"screenshot.png\"  # Filename to save the screenshot as\nTEXT_QUERIES = \"clickable icons,hyperlinks\"  # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.16  # Score threshold for filtering low-probability predictions\nOUTPUT_DIR = \"ImageOutputTest\"  # Directory to save the cropped images\n\n# Vision model configuration\nVISION_MODEL_NAME = \"minicpm_llama3\"\nVISION_MODEL_ID = \"openbmb/MiniCPM-Llama3-V-2_5\"\nVISION_MODEL_QUESTION = \"What is the text in this image or what does this image describe?\"\n\n# Use GPU if available\nif torch.cuda.is_available():\n    device = torch.device(\"cuda\")\nelse:\n    device = torch.device(\"cpu\")\n\n# Load the Owlv2 model and processor\nmodel = Owlv2ForObjectDetection.from_pretrained(\"google/owlv2-base-patch16-ensemble\").to(device)\nprocessor = Owlv2Processor.from_pretrained(\"google/owlv2-base-patch16-ensemble\")\n\ndef take_screenshot(monitor_index=MONITOR_INDEX, output_filename=SCREENSHOT_FILENAME):\n    \"\"\"\n    Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n    Args:\n        monitor_index (int): The index of the monitor to capture (default is MONITOR_INDEX).\n        output_filename (str): The filename to save the screenshot as (default is SCREENSHOT_FILENAME).\n    \"\"\"\n    # Get information about all monitors\n    monitors = screeninfo.get_monitors()\n\n    if monitor_index >= len(monitors):\n        raise ValueError(f\"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.\")\n\n    # Get the region of the specified monitor\n    monitor = monitors[monitor_index]\n    region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n    # Take the screenshot\n    screenshot = pyautogui.screenshot(region=region)\n\n    # Save the screenshot\n    screenshot.save(output_filename)\n    print(f\"Screenshot saved as {output_filename}\")\n\n    return output_filename\n\ndef query_image(img_path, text_queries=TEXT_QUERIES, score_threshold=SCORE_THRESHOLD):\n    \"\"\"\n    Processes the image using the Owlv2 model to detect objects based on text queries.\n\n    Args:\n        img_path (str): The path to the input image.\n        text_queries (str): Comma-separated text queries for object detection (default is TEXT_QUERIES).\n        score_threshold (float): The score threshold for filtering low-probability predictions (default is SCORE_THRESHOLD).\n    \"\"\"\n    img = Image.open(img_path)\n    text_queries = text_queries.split(\",\")\n\n    size = max(img.size)\n    target_sizes = torch.Tensor([[size, size]])\n    inputs = processor(text=text_queries, images=img, return_tensors=\"pt\").to(device)\n\n    with torch.no_grad():\n        outputs = model(**inputs)\n\n    outputs.logits = outputs.logits.cpu()\n    outputs.pred_boxes = outputs.pred_boxes.cpu()\n    results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n    boxes, scores, labels = results[0][\"boxes\"], results[0][\"scores\"], results[0][\"labels\"]\n\n    result_labels = []\n    img_pil = img.copy()\n    draw = ImageDraw.Draw(img_pil)\n\n    for box, score, label in zip(boxes, scores, labels):\n        box = [int(i) for i in box.tolist()]\n        if score < score_threshold:\n            continue\n        result_labels.append([box, text_queries[label.item()]])\n        draw.rectangle(box, outline=\"red\", width=2)\n        # Do not draw the text label within the bounding box\n        # draw.text((box[0], box[1]), text_queries[label.item()], fill=\"red\")\n\n    # Save the annotated image for debugging\n    annotated_image_path = \"annotated_image.png\"\n    img_pil.save(annotated_image_path)\n\n    # Save the JSON output to a file\n    json_output_path = \"output.json\"\n    with open(json_output_path, 'w') as f:\n        json.dump(result_labels, f)\n\n    # Unload the Owlv2 model to free up memory\n    del model\n    torch.cuda.empty_cache()\n\n    return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir=OUTPUT_DIR):\n    \"\"\"\n    Crops out the detected objects from the input image based on the bounding box coordinates.\n\n    Args:\n        image_path (str): The path to the input image.\n        json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n        output_dir (str): The directory to save the cropped images (default is OUTPUT_DIR).\n    \"\"\"\n    with open(json_output_path, 'r') as f:\n        data = json.load(f)\n\n    img = Image.open(image_path)\n\n    results = []\n\n    for i, item in enumerate(data):\n        box = item[0]\n        label = item[1]\n        cropped_img = img.crop(box)\n        cropped_img_path = f\"{output_dir}/{label}_{i}.png\"\n        cropped_img.save(cropped_img_path)\n\n        # Process the cropped image with the miniCPM-lama3-v-2_5 model\n        vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n        # Store the results in a structured format\n        results.append({\n            \"image_name\": f\"{label}_{i}.png\",\n            \"coordinates\": box,\n            \"description\": vision_model_response\n        })\n\n        # Clear CUDA cache after processing each image\n        torch.cuda.empty_cache()\n\n    # Save the results to a JSON file\n    results_json_path = f\"{output_dir}/results.json\"\n    with open(results_json_path, 'w') as f:\n        json.dump(results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n    \"\"\"\n    Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n    Args:\n        image_path (str): The path to the input image.\n        question (str): The question to ask the vision model.\n\n    Returns:\n        str: The response from the vision model.\n    \"\"\"\n    # Load the miniCPM-lama3-v-2_5 model and tokenizer\n    minicpm_llama_model = AutoModel.from_pretrained(VISION_MODEL_ID).to(device)\n    minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(VISION_MODEL_ID)\n\n    image = Image.open(image_path).convert(\"RGB\")\n    messages = [\n        {\"role\": \"user\", \"content\": f\"{question}\"}\n    ]\n    # Define the generation arguments\n    generation_args = {\n        \"max_new_tokens\": 896,\n        \"repetition_penalty\": 1.05,\n        \"num_beams\": 3,\n        \"top_p\": 0.8,\n        \"top_k\": 1,\n        \"temperature\": 0.7,\n        \"sampling\": True,\n    }\n    if \"int4\" in VISION_MODEL_ID:\n        # Disable streaming for the 4-bit model\n        generation_args[\"stream\"] = False\n        # Use the model.chat method with streaming enabled\n        vision_model_response = \"\"\n        for new_text in minicpm_llama_model.chat(\n                image=image,\n                msgs=messages,\n                tokenizer=minicpm_llama_tokenizer,\n                **generation_args\n        ):\n            vision_model_response += new_text\n            print(new_text, flush=True, end='')\n    else:\n        minicpm_llama_model.to(device)\n        vision_model_response = minicpm_llama_model.chat(\n            image=image,\n            msgs=messages,\n            tokenizer=minicpm_llama_tokenizer,\n            **generation_args\n        )\n\n    # Unload the miniCPM-lama3-v-2_5 model to free up memory\n    del minicpm_llama_model\n    del minicpm_llama_tokenizer\n    torch.cuda.empty_cache()\n\n    return vision_model_response\n\nif __name__ == \"__main__\":\n    # Take the screenshot\n    screenshot_path = take_screenshot()\n\n    # Process the image using the Owlv2 model\n    annotated_image_path, json_output_path = query_image(screenshot_path)\n\n    # Create the output directory if it doesn't exist\n    os.makedirs(OUTPUT_DIR, exist_ok=True)\n\n    # Crop the detected objects and save them to the output directory\n    crop_images(annotated_image_path, json_output_path)\n\n    print(\"Processing complete. Check the output directory for the cropped images and results.\")\n```\n\n### Explanation\n\n1. **Unload the Owlv2 Model**:\n   - After processing the image and obtaining the bounding boxes, the Owlv2 model is unloaded to free up memory.\n\n2. **Process Images Sequentially**:\n   - The `crop_images` function processes each cropped image one by one to reduce memory usage.\n\n3. **Clear CUDA Cache**:\n   - The CUDA cache is cleared after processing each image to free up memory.\n\nPlease review the updated script and let me know if you encounter any further issues or have additional requirements."
        ],
        [
            "I'm getting this error when I run the code:\n\n(Owlv2demo) myself@myself-System-Product-Name:~/Desktop/ComputerVision/Owlv2Models/owlv2$ python ImageTestV1.py\nScreenshot saved as screenshot.png\nTraceback (most recent call last):\n  File \"/home/myself/Desktop/ComputerVision/Owlv2Models/owlv2/ImageTestV1.py\", line 215, in <module>\n    annotated_image_path, json_output_path = query_image(screenshot_path)\n                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/home/myself/Desktop/ComputerVision/Owlv2Models/owlv2/ImageTestV1.py\", line 76, in query_image\n    outputs = model(**inputs)\n              ^^^^^\nUnboundLocalError: cannot access local variable 'model' where it is not associated with a value\n(Owlv2demo) myself@myself-System-Product-Name:~/Desktop/ComputerVision/Owlv2Models/owlv2$ \n\n\nAdditionally, I changed \n\nVISION_MODEL_ID = \"openbmb/MiniCPM-Llama3-V-2_5\"\n\nto \n\nVISION_MODEL_ID = \"/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5\"\n\n",
            "The error you're encountering is due to the `model` variable being unloaded before it is used in the `query_image` function. To fix this, we need to ensure that the `model` variable is only unloaded after it has been used to process the image.\n\nAdditionally, I'll update the script to handle the local path for the `VISION_MODEL_ID`.\n\nHere's the corrected `ImageTestV1.py` script:\n\n```python\nimport torch\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport json\nimport os\n\n# Configurable variables\nMONITOR_INDEX = 1  # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nSCREENSHOT_FILENAME = \"screenshot.png\"  # Filename to save the screenshot as\nTEXT_QUERIES = \"clickable icons,hyperlinks\"  # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.16  # Score threshold for filtering low-probability predictions\nOUTPUT_DIR = \"ImageOutputTest\"  # Directory to save the cropped images\n\n# Vision model configuration\nVISION_MODEL_NAME = \"minicpm_llama3\"\nVISION_MODEL_ID = \"/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5\"\nVISION_MODEL_QUESTION = \"What is the text in this image or what does this image describe?\"\n\n# Use GPU if available\nif torch.cuda.is_available():\n    device = torch.device(\"cuda\")\nelse:\n    device = torch.device(\"cpu\")\n\n# Load the Owlv2 model and processor\nmodel = Owlv2ForObjectDetection.from_pretrained(\"google/owlv2-base-patch16-ensemble\").to(device)\nprocessor = Owlv2Processor.from_pretrained(\"google/owlv2-base-patch16-ensemble\")\n\ndef take_screenshot(monitor_index=MONITOR_INDEX, output_filename=SCREENSHOT_FILENAME):\n    \"\"\"\n    Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n    Args:\n        monitor_index (int): The index of the monitor to capture (default is MONITOR_INDEX).\n        output_filename (str): The filename to save the screenshot as (default is SCREENSHOT_FILENAME).\n    \"\"\"\n    # Get information about all monitors\n    monitors = screeninfo.get_monitors()\n\n    if monitor_index >= len(monitors):\n        raise ValueError(f\"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.\")\n\n    # Get the region of the specified monitor\n    monitor = monitors[monitor_index]\n    region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n    # Take the screenshot\n    screenshot = pyautogui.screenshot(region=region)\n\n    # Save the screenshot\n    screenshot.save(output_filename)\n    print(f\"Screenshot saved as {output_filename}\")\n\n    return output_filename\n\ndef query_image(img_path, text_queries=TEXT_QUERIES, score_threshold=SCORE_THRESHOLD):\n    \"\"\"\n    Processes the image using the Owlv2 model to detect objects based on text queries.\n\n    Args:\n        img_path (str): The path to the input image.\n        text_queries (str): Comma-separated text queries for object detection (default is TEXT_QUERIES).\n        score_threshold (float): The score threshold for filtering low-probability predictions (default is SCORE_THRESHOLD).\n    \"\"\"\n    img = Image.open(img_path)\n    text_queries = text_queries.split(\",\")\n\n    size = max(img.size)\n    target_sizes = torch.Tensor([[size, size]])\n    inputs = processor(text=text_queries, images=img, return_tensors=\"pt\").to(device)\n\n    with torch.no_grad():\n        outputs = model(**inputs)\n\n    outputs.logits = outputs.logits.cpu()\n    outputs.pred_boxes = outputs.pred_boxes.cpu()\n    results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n    boxes, scores, labels = results[0][\"boxes\"], results[0][\"scores\"], results[0][\"labels\"]\n\n    result_labels = []\n    img_pil = img.copy()\n    draw = ImageDraw.Draw(img_pil)\n\n    for box, score, label in zip(boxes, scores, labels):\n        box = [int(i) for i in box.tolist()]\n        if score < score_threshold:\n            continue\n        result_labels.append([box, text_queries[label.item()]])\n        draw.rectangle(box, outline=\"red\", width=2)\n        # Do not draw the text label within the bounding box\n        # draw.text((box[0], box[1]), text_queries[label.item()], fill=\"red\")\n\n    # Save the annotated image for debugging\n    annotated_image_path = \"annotated_image.png\"\n    img_pil.save(annotated_image_path)\n\n    # Save the JSON output to a file\n    json_output_path = \"output.json\"\n    with open(json_output_path, 'w') as f:\n        json.dump(result_labels, f)\n\n    return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir=OUTPUT_DIR):\n    \"\"\"\n    Crops out the detected objects from the input image based on the bounding box coordinates.\n\n    Args:\n        image_path (str): The path to the input image.\n        json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n        output_dir (str): The directory to save the cropped images (default is OUTPUT_DIR).\n    \"\"\"\n    with open(json_output_path, 'r') as f:\n        data = json.load(f)\n\n    img = Image.open(image_path)\n\n    results = []\n\n    for i, item in enumerate(data):\n        box = item[0]\n        label = item[1]\n        cropped_img = img.crop(box)\n        cropped_img_path = f\"{output_dir}/{label}_{i}.png\"\n        cropped_img.save(cropped_img_path)\n\n        # Process the cropped image with the miniCPM-lama3-v-2_5 model\n        vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n        # Store the results in a structured format\n        results.append({\n            \"image_name\": f\"{label}_{i}.png\",\n            \"coordinates\": box,\n            \"description\": vision_model_response\n        })\n\n        # Clear CUDA cache after processing each image\n        torch.cuda.empty_cache()\n\n    # Save the results to a JSON file\n    results_json_path = f\"{output_dir}/results.json\"\n    with open(results_json_path, 'w') as f:\n        json.dump(results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n    \"\"\"\n    Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n    Args:\n        image_path (str): The path to the input image.\n        question (str): The question to ask the vision model.\n\n    Returns:\n        str: The response from the vision model.\n    \"\"\"\n    # Load the miniCPM-lama3-v-2_5 model and tokenizer\n    minicpm_llama_model = AutoModel.from_pretrained(VISION_MODEL_ID).to(device)\n    minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(VISION_MODEL_ID)\n\n    image = Image.open(image_path).convert(\"RGB\")\n    messages = [\n        {\"role\": \"user\", \"content\": f\"{question}\"}\n    ]\n    # Define the generation arguments\n    generation_args = {\n        \"max_new_tokens\": 896,\n        \"repetition_penalty\": 1.05,\n        \"num_beams\": 3,\n        \"top_p\": 0.8,\n        \"top_k\": 1,\n        \"temperature\": 0.7,\n        \"sampling\": True,\n    }\n    if \"int4\" in VISION_MODEL_ID:\n        # Disable streaming for the 4-bit model\n        generation_args[\"stream\"] = False\n        # Use the model.chat method with streaming enabled\n        vision_model_response = \"\"\n        for new_text in minicpm_llama_model.chat(\n                image=image,\n                msgs=messages,\n                tokenizer=minicpm_llama_tokenizer,\n                **generation_args\n        ):\n            vision_model_response += new_text\n            print(new_text, flush=True, end='')\n    else:\n        minicpm_llama_model.to(device)\n        vision_model_response = minicpm_llama_model.chat(\n            image=image,\n            msgs=messages,\n            tokenizer=minicpm_llama_tokenizer,\n            **generation_args\n        )\n\n    # Unload the miniCPM-lama3-v-2_5 model to free up memory\n    del minicpm_llama_model\n    del minicpm_llama_tokenizer\n    torch.cuda.empty_cache()\n\n    return vision_model_response\n\nif __name__ == \"__main__\":\n    # Take the screenshot\n    screenshot_path = take_screenshot()\n\n    # Process the image using the Owlv2 model\n    annotated_image_path, json_output_path = query_image(screenshot_path)\n\n    # Unload the Owlv2 model to free up memory\n    del model\n    torch.cuda.empty_cache()\n\n    # Create the output directory if it doesn't exist\n    os.makedirs(OUTPUT_DIR, exist_ok=True)\n\n    # Crop the detected objects and save them to the output directory\n    crop_images(annotated_image_path, json_output_path)\n\n    print(\"Processing complete. Check the output directory for the cropped images and results.\")\n```\n\n### Explanation\n\n1. **Unload the Owlv2 Model**:\n   - The Owlv2 model is unloaded after processing the image and obtaining the bounding boxes.\n\n2. **Process Images Sequentially**:\n   - The `crop_images` function processes each cropped image one by one to reduce memory usage.\n\n3. **Clear CUDA Cache**:\n   - The CUDA cache is cleared after processing each image to free up memory.\n\n4. **Handle Local Path for Vision Model**:\n   - The script now handles the local path for the `VISION_MODEL_ID`.\n\nPlease review the updated script and let me know if you encounter any further issues or have additional requirements."
        ],
        [
            "Hmm, I'm still getting oom errors:\n\n(Owlv2demo) myself@myself-System-Product-Name:~/Desktop/ComputerVision/Owlv2Models/owlv2$ python ImageTestV1.py\nScreenshot saved as screenshot.png\nThe repository for /home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5 contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5.\nYou can avoid this prompt in future by passing the argument `trust_remote_code=True`.\n\nDo you wish to run the custom code? [y/N] y\nThe repository for /home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5 contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5.\nYou can avoid this prompt in future by passing the argument `trust_remote_code=True`.\n\nDo you wish to run the custom code? [y/N] y\nLoading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 7/7 [00:06<00:00,  1.01it/s]\nTraceback (most recent call last):\n  File \"/home/myself/Desktop/ComputerVision/Owlv2Models/owlv2/ImageTestV1.py\", line 220, in <module>\n    crop_images(annotated_image_path, json_output_path)\n  File \"/home/myself/Desktop/ComputerVision/Owlv2Models/owlv2/ImageTestV1.py\", line 130, in crop_images\n    vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/home/myself/Desktop/ComputerVision/Owlv2Models/owlv2/ImageTestV1.py\", line 159, in process_with_vision_model\n    minicpm_llama_model = AutoModel.from_pretrained(VISION_MODEL_ID).to(device)\n                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/home/myself/miniconda3/envs/Owlv2demo/lib/python3.11/site-packages/transformers/modeling_utils.py\", line 2884, in to\n    return super().to(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/home/myself/miniconda3/envs/Owlv2demo/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1174, in to\n    return self._apply(convert)\n           ^^^^^^^^^^^^^^^^^^^^\n  File \"/home/myself/miniconda3/envs/Owlv2demo/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 780, in _apply\n    module._apply(fn)\n  File \"/home/myself/miniconda3/envs/Owlv2demo/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 780, in _apply\n    module._apply(fn)\n  File \"/home/myself/miniconda3/envs/Owlv2demo/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 780, in _apply\n    module._apply(fn)\n  [Previous line repeated 3 more times]\n  File \"/home/myself/miniconda3/envs/Owlv2demo/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 805, in _apply\n    param_applied = fn(param)\n                    ^^^^^^^^^\n  File \"/home/myself/miniconda3/envs/Owlv2demo/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1160, in convert\n    return t.to(\n           ^^^^^\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 50.00 MiB is free. Process 55561 has 452.00 MiB memory in use. Including non-PyTorch memory, this process has 23.14 GiB memory in use. Of the allocated memory 22.64 GiB is allocated by PyTorch, and 53.68 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)\n(Owlv2demo) myself@myself-System-Product-Name:~/Desktop/ComputerVision/Owlv2Models/owlv2$ \n\n\nWhen I run the Lucid_Vision extension by itself in textgen I can run that model just fine without oom errors.  That makes me wonder if you are sending the images one by one and unloading the owl model? \n\nLet's look at the output of the error compared to the previous error: okay we can see that the screenshot is being saved.   Good, I can verify the file is being saved.  Now if we look at the code the screenshot happens and then the owl model should be loaded and fed the screenshot image with it outputting the .json and annotated_image.png.  That model is super small like 1 gig small, so there should not be any oom errors on my 24GB card that is empty and only running the code you are writing.  \n\nSo are you sure the code is following the correct sequence, screenshot, owl (annotated_image.png and output.json), minicpm model being fed one image at a time in sequence.",
            "Thank you for the detailed explanation. It seems that the sequence of operations might not be correctly handling the memory management, leading to out-of-memory errors. Let's ensure that the Owlv2 model is unloaded after processing the image and that the miniCPM-lama3-v-2_5 model is loaded and unloaded correctly for each cropped image.\n\nHere's the updated `ImageTestV1.py` script with improved memory management:\n\n```python\nimport torch\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport json\nimport os\n\n# Configurable variables\nMONITOR_INDEX = 1  # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nSCREENSHOT_FILENAME = \"screenshot.png\"  # Filename to save the screenshot as\nTEXT_QUERIES = \"clickable icons,hyperlinks\"  # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.16  # Score threshold for filtering low-probability predictions\nOUTPUT_DIR = \"ImageOutputTest\"  # Directory to save the cropped images\n\n# Vision model configuration\nVISION_MODEL_NAME = \"minicpm_llama3\"\nVISION_MODEL_ID = \"/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5\"\nVISION_MODEL_QUESTION = \"What is the text in this image or what does this image describe?\"\n\n# Use GPU if available\nif torch.cuda.is_available():\n    device = torch.device(\"cuda\")\nelse:\n    device = torch.device(\"cpu\")\n\ndef take_screenshot(monitor_index=MONITOR_INDEX, output_filename=SCREENSHOT_FILENAME):\n    \"\"\"\n    Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n    Args:\n        monitor_index (int): The index of the monitor to capture (default is MONITOR_INDEX).\n        output_filename (str): The filename to save the screenshot as (default is SCREENSHOT_FILENAME).\n    \"\"\"\n    # Get information about all monitors\n    monitors = screeninfo.get_monitors()\n\n    if monitor_index >= len(monitors):\n        raise ValueError(f\"Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.\")\n\n    # Get the region of the specified monitor\n    monitor = monitors[monitor_index]\n    region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n    # Take the screenshot\n    screenshot = pyautogui.screenshot(region=region)\n\n    # Save the screenshot\n    screenshot.save(output_filename)\n    print(f\"Screenshot saved as {output_filename}\")\n\n    return output_filename\n\ndef query_image(img_path, text_queries=TEXT_QUERIES, score_threshold=SCORE_THRESHOLD):\n    \"\"\"\n    Processes the image using the Owlv2 model to detect objects based on text queries.\n\n    Args:\n        img_path (str): The path to the input image.\n        text_queries (str): Comma-separated text queries for object detection (default is TEXT_QUERIES).\n        score_threshold (float): The score threshold for filtering low-probability predictions (default is SCORE_THRESHOLD).\n    \"\"\"\n    # Load the Owlv2 model and processor\n    model = Owlv2ForObjectDetection.from_pretrained(\"google/owlv2-base-patch16-ensemble\").to(device)\n    processor = Owlv2Processor.from_pretrained(\"google/owlv2-base-patch16-ensemble\")\n\n    img = Image.open(img_path)\n    text_queries = text_queries.split(\",\")\n\n    size = max(img.size)\n    target_sizes = torch.Tensor([[size, size]])\n    inputs = processor(text=text_queries, images=img, return_tensors=\"pt\").to(device)\n\n    with torch.no_grad():\n        outputs = model(**inputs)\n\n    outputs.logits = outputs.logits.cpu()\n    outputs.pred_boxes = outputs.pred_boxes.cpu()\n    results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n    boxes, scores, labels = results[0][\"boxes\"], results[0][\"scores\"], results[0][\"labels\"]\n\n    result_labels = []\n    img_pil = img.copy()\n    draw = ImageDraw.Draw(img_pil)\n\n    for box, score, label in zip(boxes, scores, labels):\n        box = [int(i) for i in box.tolist()]\n        if score < score_threshold:\n            continue\n        result_labels.append([box, text_queries[label.item()]])\n        draw.rectangle(box, outline=\"red\", width=2)\n        # Do not draw the text label within the bounding box\n        # draw.text((box[0], box[1]), text_queries[label.item()], fill=\"red\")\n\n    # Save the annotated image for debugging\n    annotated_image_path = \"annotated_image.png\"\n    img_pil.save(annotated_image_path)\n\n    # Save the JSON output to a file\n    json_output_path = \"output.json\"\n    with open(json_output_path, 'w') as f:\n        json.dump(result_labels, f)\n\n    # Unload the Owlv2 model to free up memory\n    del model\n    torch.cuda.empty_cache()\n\n    return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir=OUTPUT_DIR):\n    \"\"\"\n    Crops out the detected objects from the input image based on the bounding box coordinates.\n\n    Args:\n        image_path (str): The path to the input image.\n        json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n        output_dir (str): The directory to save the cropped images (default is OUTPUT_DIR).\n    \"\"\"\n    with open(json_output_path, 'r') as f:\n        data = json.load(f)\n\n    img = Image.open(image_path)\n\n    results = []\n\n    for i, item in enumerate(data):\n        box = item[0]\n        label = item[1]\n        cropped_img = img.crop(box)\n        cropped_img_path = f\"{output_dir}/{label}_{i}.png\"\n        cropped_img.save(cropped_img_path)\n\n        # Process the cropped image with the miniCPM-lama3-v-2_5 model\n        vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n        # Store the results in a structured format\n        results.append({\n            \"image_name\": f\"{label}_{i}.png\",\n            \"coordinates\": box,\n            \"description\": vision_model_response\n        })\n\n        # Clear CUDA cache after processing each image\n        torch.cuda.empty_cache()\n\n    # Save the results to a JSON file\n    results_json_path = f\"{output_dir}/results.json\"\n    with open(results_json_path, 'w') as f:\n        json.dump(results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n    \"\"\"\n    Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n    Args:\n        image_path (str): The path to the input image.\n        question (str): The question to ask the vision model.\n\n    Returns:\n        str: The response from the vision model.\n    \"\"\"\n    # Load the miniCPM-lama3-v-2_5 model and tokenizer\n    minicpm_llama_model = AutoModel.from_pretrained(VISION_MODEL_ID, trust_remote_code=True).to(device)\n    minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(VISION_MODEL_ID, trust_remote_code=True)\n\n    image = Image.open(image_path).convert(\"RGB\")\n    messages = [\n        {\"role\": \"user\", \"content\": f\"{question}\"}\n    ]\n    # Define the generation arguments\n    generation_args = {\n        \"max_new_tokens\": 896,\n        \"repetition_penalty\": 1.05,\n        \"num_beams\": 3,\n        \"top_p\": 0.8,\n        \"top_k\": 1,\n        \"temperature\": 0.7,\n        \"sampling\": True,\n    }\n    if \"int4\" in VISION_MODEL_ID:\n        # Disable streaming for the 4-bit model\n        generation_args[\"stream\"] = False\n        # Use the model.chat method with streaming enabled\n        vision_model_response = \"\"\n        for new_text in minicpm_llama_model.chat(\n                image=image,\n                msgs=messages,\n                tokenizer=minicpm_llama_tokenizer,\n                **generation_args\n        ):\n            vision_model_response += new_text\n            print(new_text, flush=True, end='')\n    else:\n        minicpm_llama_model.to(device)\n        vision_model_response = minicpm_llama_model.chat(\n            image=image,\n            msgs=messages,\n            tokenizer=minicpm_llama_tokenizer,\n            **generation_args\n        )\n\n    # Unload the miniCPM-lama3-v-2_5 model to free up memory\n    del minicpm_llama_model\n    del minicpm_llama_tokenizer\n    torch.cuda.empty_cache()\n\n    return vision_model_response\n\nif __name__ == \"__main__\":\n    # Take the screenshot\n    screenshot_path = take_screenshot()\n\n    # Process the image using the Owlv2 model\n    annotated_image_path, json_output_path = query_image(screenshot_path)\n\n    # Create the output directory if it doesn't exist\n    os.makedirs(OUTPUT_DIR, exist_ok=True)\n\n    # Crop the detected objects and save them to the output directory\n    crop_images(annotated_image_path, json_output_path)\n\n    print(\"Processing complete. Check the output directory for the cropped images and results.\")\n```\n\n### Explanation\n\n1. **Unload the Owlv2 Model**:\n   - The Owlv2 model is unloaded after processing the image and obtaining the bounding boxes.\n\n2. **Process Images Sequentially**:\n   - The `crop_images` function processes each cropped image one by one to reduce memory usage.\n\n3. **Clear CUDA Cache**:\n   - The CUDA cache is cleared after processing each image to free up memory.\n\n4. **Handle Local Path for Vision Model**:\n   - The script now handles the local path for the `VISION_MODEL_ID` and includes `trust_remote_code=True` to avoid the prompt for running custom code.\n\nPlease review the updated script and let me know if you encounter any further issues or have additional requirements."
        ]
    ],
    "visible": [
        [
            "",
            "How can I help you today?"
        ],
        [
            " Hello, how are you functioning today?",
            "I&#x27;m functioning well, thank you for asking! I&#x27;m here and ready to assist you. How about you? How can I help you today?"
        ],
        [
            "Okay, so we&#x27;ve got quite an adventure ahead of us. Let me give you some context. There is a GitHub project called TextGenerationWebUI. We&#x27;ll call it TextGen from now on. What TextGen allows us to do, it allows me to run large language models at home on my own hardware. So you&#x27;re actually running on my computer right now locally. No internet connection. Anyways, TextGen allows for really interesting extension integration. And I&#x27;ve made two extensions myself, and I&#x27;ll show you the code for all that later. But the gist of it is one extension lets the AI surf the internet in a somewhat supervised manner. And another extension gives the AI or the large language model the ability to communicate with a vision model. So the large language model formulates questions on its own, asks those questions of the vision model and gets a response. And then contextualizes that response into the conversation. The large language model can even recall past images on their own when they decide it&#x27;s necessary within the context of the conversation. That latter, okay, well, so the two extensions are called LucidWebSearch and LucidVision. And LucidVision was actually entirely written by Mixtral8x22, which was like a 8x22 model of experts model. It&#x27;s like your predecessor. You&#x27;re supposed to be better than it. So I&#x27;ve upgraded to Mistral Large V2, which is you. And I want to take on a similar endeavor. I want you to write an extension for TextGen for me. But we&#x27;re going to take this step by step. First of all, let me... First of all, just before I go into any more explanation, I just want to make sure you&#x27;re understanding what I&#x27;m saying.",
            "Yes, I&#x27;m following along just fine. To summarize, you have a local GitHub project called TextGen that allows you to run large language models on your own hardware. You&#x27;ve created two extensions for it: LucidWebSearch, which enables supervised internet browsing, and LucidVision, which allows communication with a vision model. LucidVision was written by a previous model, and now you&#x27;ve upgraded to me, Mistral Large V2. You&#x27;d like me to help you write a new extension for TextGen. Is that correct? I&#x27;m ready to assist you with that."
        ],
        [
            " Okay, great. So let me, I&#x27;m going to explain to you the general idea of this extension. And we&#x27;ll go through it step by step. So there is a model, a vision model called Owlv2. Okay, and what I can do with this vision model is I can give it an image and I can tell it to put a box around whatever I&#x27;m interested in in the image. So I can give it a screenshot of a web browser and I can say highlight all of the text, I mean, sorry, highlight all of the hyperlinks and highlight all of the clickable icons. And it can do that. And then it can give me the X, Y coordinates for all of the boxes that it makes. So what I want to do is I want to be able to have a large language model operate a mouse and keyboard using this Owl as a foundational, as the foundation of everything. So the pipeline or this whatever the flow I have in my mind is this. We take Owl the Two and then we get all the coordinates on the image. And then we get some Python code to crop out all of those little boxes. And then we send those to an optical character recognition model. But it&#x27;s better than that because it also can recognize objects too. So we can send it to that model. We can send a list of images to that model. I forget what that model is called, but we&#x27;ll get there when we get there. And then this kind of vision OCR model will tell us the contents of each one of the boxes. And then we can have a database that the LLM can that the large language model can effectively keep within context or maybe it can reference. You know what with this database, it would make sense if it existed in like a text document and then the LL, the large language model could reference a text document when it needed to. I have ideas on how to do that. I effectively did that with the Lucid web search, how to manage the memory within context and everything. OK, and then so the large language model will effectively have you can imagine like three columns of data. One column is the context. What is the box? And the way I will to works as I say, give me a clickable hyperlinks and give me clickable icons. And what it will do is it will put a box around everything and use those terms, clickable hyperlinks and clickable icons as the descriptor for the boxes. So that&#x27;s what column one be our base test will be navigating a Web page on its own. So column one will be, you know, what is it, a hyperlink or an icon, a blank text field. Column two will be the box that borders those different user interface elements. So it&#x27;ll be set of coordinates. And then column three will be what the the optical character recognition slash image vision model sees within those crop images of the boxes. And then the larger language model can look at that and say, OK, I want to know more about this world event that&#x27;s happening right now. And it can look at all those things and say, OK, there&#x27;s a hyperlink for the thing that I&#x27;m interested in. And then it can click on it and then we can use the Lucid Web search extension to contextualize the information on that Web page. And then the model can, you know, take screenshots. It&#x27;d be nice to have it like squirrel to and then get X, Y coordinates and click on things. Anyways, I hope this makes sense. Generally speaking, I know it&#x27;s kind of maybe seems difficult at first, but I have all the pieces in this new model, 10 months old. I&#x27;m surprised I haven&#x27;t found it yet. This model really kind of puts lets all the pieces kind of merge together.  Also, I&#x27;m using a voice to text model, so some of the spelling may be off in what I have dicated.",
            "Thank you for the detailed explanation. To ensure we&#x27;re on the same page, let me summarize your vision for the new extension:\n\n1. **Vision Model (Owlv2)**: This model will take an image (e.g., a screenshot of a web browser) and identify specific elements like hyperlinks, icons, and text fields, providing their X, Y coordinates.\n\n2. **Image Cropping**: Python code will crop out the identified elements based on the coordinates provided by Owlv2.\n\n3. **Optical Character Recognition (OCR) and Object Recognition**: The cropped images will be sent to another model that can recognize both text and objects within these images.\n\n4. **Database/Text Document**: The results from the OCR/Object Recognition model will be stored in a structured format (e.g., a text document) that the large language model can reference. This document will have three columns:\n   - Column 1: Description of the element (e.g., hyperlink, icon, text field).\n   - Column 2: Coordinates of the element.\n   - Column 3: Recognized text or object within the element.\n\n5. **Large Language Model Interaction**: The large language model will use this structured data to interact with the web page. For example, it can decide to click on a hyperlink based on the recognized text and then use the LucidWebSearch extension to contextualize the information on the new page.\n\n6. **Mouse and Keyboard Operation**: The large language model will be able to operate a mouse and keyboard to interact with the web page, using the coordinates provided by Owlv2.\n\nGiven this understanding, we can start by breaking down the tasks:\n\n1. **Integrating Owlv2**: We need to ensure that Owlv2 can be called from within the TextGen environment to analyze images and provide coordinates.\n\n2. **Image Cropping**: Write Python code to crop images based on the coordinates provided by Owlv2.\n\n3. **OCR/Object Recognition**: Integrate the OCR/Object Recognition model to analyze the cropped images.\n\n4. **Data Storage**: Create a structured format (e.g., a text document) to store the results from the OCR/Object Recognition model.\n\n5. **LLM Interaction**: Develop the logic for the large language model to interact with the web page using the stored data.\n\n6. **Mouse and Keyboard Operation**: Implement the functionality for the large language model to operate a mouse and keyboard.\n\nLet&#x27;s start with the first step: integrating Owlv2. Do you have any specific requirements or preferences for how Owlv2 should be integrated into the TextGen environment?"
        ],
        [
            " Okay, first of all, wonderful summary. Yes, we are on the same page for Shirley&#x27;s. But before we begin, I want to take a bit of a different start. I came to realize this functionality of the OWL model through a hugging face space somebody made. I&#x27;m not too familiar with how this all works, but it seemed like it was like a GitHub in a lot of ways. But you could interact with this Python app.py file, essentially, but it was all through a web interface on hugging face. I could give it an image and test things out, and I was like, this is amazing. I figured out how to clone the GitHub code that supported that spin-up of OWL. I got it running on my computer, but there&#x27;s one problem. There&#x27;s two UI elements. There&#x27;s the image and an output UI element. I can drag an image into the image UI element, great, and I can see it, wonderful. But on the output UI element, all I see are the clickable icon or hyperlink response colored boxes, but they&#x27;re not overlaid on the screenshot of the web browser that I gave it. I gave it a screenshot of a web browser, and I&#x27;m pretty sure it highlighted everything right. I think the outputs exist as coordinates, and all the coordinates are colored differently, but there aren&#x27;t any boxes overlaid on the input image display. When I was testing this on the hugging face website, the output UI showed the input image, but it had boxes around all of the regions that the OWL model identified. What I was thinking is, before we do anything with TextGen, let&#x27;s get this GitHub repo working the way I want. We have a good foundational comparison to work with, and we can use this to make sure the right things are always being highlighted in our test cases and stuff like that.",
            "That sounds like a practical approach. Let&#x27;s focus on getting the GitHub repository for the Owlv2 model working correctly on your local machine first. Here&#x27;s a step-by-step plan to achieve that:\n\n1. **Understand the Current Setup**:\n   - Review the code in the cloned GitHub repository to understand how the image and output UI elements are handled.\n   - Identify the part of the code responsible for displaying the output boxes on the image.\n\n2. **Modify the Code**:\n   - Locate the section of the code where the output boxes are generated.\n   - Ensure that the coordinates of the boxes are correctly calculated and stored.\n   - Modify the code to overlay these boxes on the input image and display the combined image in the output UI element.\n\n3. **Test the Changes**:\n   - Run the modified code to ensure that the boxes are correctly overlaid on the input image.\n   - Verify that the output UI element displays the input image with the highlighted regions.\n\nLet&#x27;s start by reviewing the code. Could you provide the path to the main script (e.g., `app.py`) and any relevant files or directories within the cloned repository? This will help me give you more specific guidance on where to make the necessary changes."
        ],
        [
            " Okay, so I hope this makes sense, but we have taken this conversation to a good stopping point in the past, and effectively what you have already done is you have fixed the code for me, and done a lot more.  Here is an update I had you make for yourself :\n\nHere&#x27;s a summary of what the `ImageTestV1.py` code does, its purpose, how it works, the purpose for starting this, and the ultimate objective. This summary is written as if I&#x27;m explaining it to myself for context in a new conversation:\n\n---\n\n### Summary of `ImageTestV1.py`\n\n**Purpose**:\n- `ImageTestV1.py` is a script that automates the process of taking a screenshot of a specific monitor, processing the screenshot using the Owlv2 model to detect clickable icons and hyperlinks, and then cropping out the detected regions for further analysis.\n\n**How it Works**:\n1. **Configurable Variables**:\n   - `MONITOR_INDEX`: Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.).\n   - `SCREENSHOT_FILENAME`: Filename to save the screenshot as.\n   - `TEXT_QUERIES`: Comma-separated text queries for object detection (e.g., &quot;clickable icons,hyperlinks&quot;).\n   - `SCORE_THRESHOLD`: Score threshold for filtering low-probability predictions.\n   - `OUTPUT_DIR`: Directory to save the cropped images.\n\n2. **Function `take_screenshot`**:\n   - Takes a screenshot of a specific monitor and saves it as a PNG file.\n\n3. **Function `query_image`**:\n   - Processes the image using the Owlv2 model to detect objects based on text queries.\n   - Saves the annotated image and the JSON output containing the bounding box coordinates and labels.\n   - **Note**: The text labels are not drawn within the bounding boxes to avoid confusion when sending the images to the OCR model.\n\n4. **Function `crop_images`**:\n   - Crops out the detected objects from the input image based on the bounding box coordinates and saves them to the specified output directory.\n\n5. **Main Block**:\n   - Takes the screenshot, processes the image using the Owlv2 model, and crops the detected objects.\n   - Saves the cropped images to the specified output directory.\n\n**Purpose for Starting This**:\n- The purpose of starting with `ImageTestV1.py` is to ensure that the Owlv2 model is working correctly and that the detected objects can be accurately cropped out of the input image. This serves as a foundation for the ultimate objective of the project.\n\n**Ultimate Objective**:\n- The ultimate objective of the project is to create an extension for the TextGenerationWebUI (TextGen) that allows a large language model to operate a mouse and keyboard to interact with a web page. The extension will use the Owlv2 model to detect clickable elements on the web page, crop out the detected elements, and send them to an optical character recognition (OCR) model to extract text and object descriptions. The large language model will then use this information to interact with the web page, such as by clicking on hyperlinks or filling out text fields. The extension will also use the LucidWebSearch extension to contextualize the information on the web page and the LucidVision extension to communicate with a vision model.\n\n**Next Steps**:\n1. Feed the markdown documents for the Lucid Vision and Lucid Web Search extensions to the AI to understand how they work.\n2. Start a new conversation with the AI to continue with the project, using the working `ImageTestV1.py` script as a starting point.\n3. Develop the OCR/Object Recognition functionality to extract text and object descriptions from the cropped images.\n4. Create a structured format (e.g., a JSON file) to store the extracted information that the large language model can reference.\n5. Develop the logic for the large language model to interact with the web page using the stored data.\n6. Implement the functionality for the large language model to operate a mouse and keyboard.\n\nBy following these steps, we can create a powerful extension for TextGen that enables a large language model to interact with web pages autonomously.\n\n---\n\nYou can copy this summary and use it to start a new conversation, providing the necessary context for the `ImageTestV1.py` script within the greater scheme of the project.\n\n\nimport torch\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection\nfrom PIL import Image, ImageDraw\nimport json\nimport os\n\n# Configurable variables\nMONITOR_INDEX = 1  # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nSCREENSHOT_FILENAME = &quot;screenshot.png&quot;  # Filename to save the screenshot as\nTEXT_QUERIES = &quot;clickable icons,hyperlinks&quot;  # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.16  # Score threshold for filtering low-probability predictions\nOUTPUT_DIR = &quot;ImageOutputTest&quot;  # Directory to save the cropped images\n\n# Use GPU if available\nif torch.cuda.is_available():\n    device = torch.device(&quot;cuda&quot;)\nelse:\n    device = torch.device(&quot;cpu&quot;)\n\n# Load the Owlv2 model and processor\nmodel = Owlv2ForObjectDetection.from_pretrained(&quot;google/owlv2-base-patch16-ensemble&quot;).to(device)\nprocessor = Owlv2Processor.from_pretrained(&quot;google/owlv2-base-patch16-ensemble&quot;)\n\ndef take_screenshot(monitor_index=MONITOR_INDEX, output_filename=SCREENSHOT_FILENAME):\n    &quot;&quot;&quot;\n    Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n    Args:\n        monitor_index (int): The index of the monitor to capture (default is MONITOR_INDEX).\n        output_filename (str): The filename to save the screenshot as (default is SCREENSHOT_FILENAME).\n    &quot;&quot;&quot;\n    # Get information about all monitors\n    monitors = screeninfo.get_monitors()\n\n    if monitor_index &gt;= len(monitors):\n        raise ValueError(f&quot;Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.&quot;)\n\n    # Get the region of the specified monitor\n    monitor = monitors[monitor_index]\n    region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n    # Take the screenshot\n    screenshot = pyautogui.screenshot(region=region)\n\n    # Save the screenshot\n    screenshot.save(output_filename)\n    print(f&quot;Screenshot saved as {output_filename}&quot;)\n\n    return output_filename\n\ndef query_image(img_path, text_queries=TEXT_QUERIES, score_threshold=SCORE_THRESHOLD):\n    &quot;&quot;&quot;\n    Processes the image using the Owlv2 model to detect objects based on text queries.\n\n    Args:\n        img_path (str): The path to the input image.\n        text_queries (str): Comma-separated text queries for object detection (default is TEXT_QUERIES).\n        score_threshold (float): The score threshold for filtering low-probability predictions (default is SCORE_THRESHOLD).\n    &quot;&quot;&quot;\n    img = Image.open(img_path)\n    text_queries = text_queries.split(&quot;,&quot;)\n\n    size = max(img.size)\n    target_sizes = torch.Tensor([[size, size]])\n    inputs = processor(text=text_queries, images=img, return_tensors=&quot;pt&quot;).to(device)\n\n    with torch.no_grad():\n        outputs = model(**inputs)\n\n    outputs.logits = outputs.logits.cpu()\n    outputs.pred_boxes = outputs.pred_boxes.cpu()\n    results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n    boxes, scores, labels = results[0][&quot;boxes&quot;], results[0][&quot;scores&quot;], results[0][&quot;labels&quot;]\n\n    result_labels = []\n    img_pil = img.copy()\n    draw = ImageDraw.Draw(img_pil)\n\n    for box, score, label in zip(boxes, scores, labels):\n        box = [int(i) for i in box.tolist()]\n        if score &lt; score_threshold:\n            continue\n        result_labels.append([box, text_queries[label.item()]])\n        draw.rectangle(box, outline=&quot;red&quot;, width=2)\n        # Do not draw the text label within the bounding box\n        # draw.text((box[0], box[1]), text_queries[label.item()], fill=&quot;red&quot;)\n\n    # Save the annotated image for debugging\n    annotated_image_path = &quot;annotated_image.png&quot;\n    img_pil.save(annotated_image_path)\n\n    # Save the JSON output to a file\n    json_output_path = &quot;output.json&quot;\n    with open(json_output_path, &#x27;w&#x27;) as f:\n        json.dump(result_labels, f)\n\n    return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir=OUTPUT_DIR):\n    &quot;&quot;&quot;\n    Crops out the detected objects from the input image based on the bounding box coordinates.\n\n    Args:\n        image_path (str): The path to the input image.\n        json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n        output_dir (str): The directory to save the cropped images (default is OUTPUT_DIR).\n    &quot;&quot;&quot;\n    with open(json_output_path, &#x27;r&#x27;) as f:\n        data = json.load(f)\n\n    img = Image.open(image_path)\n\n    for i, item in enumerate(data):\n        box = item[0]\n        label = item[1]\n        cropped_img = img.crop(box)\n        cropped_img.save(f&quot;{output_dir}/{label}_{i}.png&quot;)\n\nif __name__ == &quot;__main__&quot;:\n    # Take the screenshot\n    screenshot_path = take_screenshot()\n\n    # Process the image using the Owlv2 model\n    annotated_image_path, json_output_path = query_image(screenshot_path)\n\n    # Create the output directory if it doesn&#x27;t exist\n    os.makedirs(OUTPUT_DIR, exist_ok=True)\n\n    # Crop the detected objects and save them to the output directory\n    crop_images(annotated_image_path, json_output_path)\n\n    print(&quot;Processing complete. Check the output directory for the cropped images.&quot;)\n\n\nDon&#x27;t move forward with anything yet, I want to make sure we are on the same page.",
            "Thank you for providing the detailed summary and the code for `ImageTestV1.py`. Let&#x27;s ensure we&#x27;re on the same page before moving forward.\n\n### Summary of `ImageTestV1.py`\n\n**Purpose**:\n- `ImageTestV1.py` is a script that automates the process of taking a screenshot of a specific monitor, processing the screenshot using the Owlv2 model to detect clickable icons and hyperlinks, and then cropping out the detected regions for further analysis.\n\n**How it Works**:\n1. **Configurable Variables**:\n   - `MONITOR_INDEX`: Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.).\n   - `SCREENSHOT_FILENAME`: Filename to save the screenshot as.\n   - `TEXT_QUERIES`: Comma-separated text queries for object detection (e.g., &quot;clickable icons,hyperlinks&quot;).\n   - `SCORE_THRESHOLD`: Score threshold for filtering low-probability predictions.\n   - `OUTPUT_DIR`: Directory to save the cropped images.\n\n2. **Function `take_screenshot`**:\n   - Takes a screenshot of a specific monitor and saves it as a PNG file.\n\n3. **Function `query_image`**:\n   - Processes the image using the Owlv2 model to detect objects based on text queries.\n   - Saves the annotated image and the JSON output containing the bounding box coordinates and labels.\n   - **Note**: The text labels are not drawn within the bounding boxes to avoid confusion when sending the images to the OCR model.\n\n4. **Function `crop_images`**:\n   - Crops out the detected objects from the input image based on the bounding box coordinates and saves them to the specified output directory.\n\n5. **Main Block**:\n   - Takes the screenshot, processes the image using the Owlv2 model, and crops the detected objects.\n   - Saves the cropped images to the specified output directory.\n\n**Purpose for Starting This**:\n- The purpose of starting with `ImageTestV1.py` is to ensure that the Owlv2 model is working correctly and that the detected objects can be accurately cropped out of the input image. This serves as a foundation for the ultimate objective of the project.\n\n**Ultimate Objective**:\n- The ultimate objective of the project is to create an extension for the TextGenerationWebUI (TextGen) that allows a large language model to operate a mouse and keyboard to interact with a web page. The extension will use the Owlv2 model to detect clickable elements on the web page, crop out the detected elements, and send them to an optical character recognition (OCR) model to extract text and object descriptions. The large language model will then use this information to interact with the web page, such as by clicking on hyperlinks or filling out text fields. The extension will also use the LucidWebSearch extension to contextualize the information on the web page and the LucidVision extension to communicate with a vision model.\n\n**Next Steps**:\n1. Feed the markdown documents for the Lucid Vision and Lucid Web Search extensions to the AI to understand how they work.\n2. Start a new conversation with the AI to continue with the project, using the working `ImageTestV1.py` script as a starting point.\n3. Develop the OCR/Object Recognition functionality to extract text and object descriptions from the cropped images.\n4. Create a structured format (e.g., a JSON file) to store the extracted information that the large language model can reference.\n5. Develop the logic for the large language model to interact with the web page using the stored data.\n6. Implement the functionality for the large language model to operate a mouse and keyboard.\n\nBy following these steps, we can create a powerful extension for TextGen that enables a large language model to interact with web pages autonomously.\n\n### Code Review\n\nThe provided code for `ImageTestV1.py` appears to be well-structured and covers the necessary steps for taking a screenshot, processing it with the Owlv2 model, and cropping out the detected regions. Here&#x27;s a brief overview of the code:\n\n1. **Imports**:\n   - The necessary libraries are imported, including `torch`, `pyautogui`, `screeninfo`, `transformers`, `PIL`, `json`, and `os`.\n\n2. **Configurable Variables**:\n   - Variables for monitor index, screenshot filename, text queries, score threshold, and output directory are defined.\n\n3. **Device Setup**:\n   - The script checks if a GPU is available and sets the device accordingly.\n\n4. **Model Loading**:\n   - The Owlv2 model and processor are loaded from the Hugging Face model hub.\n\n5. **Function `take_screenshot`**:\n   - This function takes a screenshot of the specified monitor and saves it as a PNG file.\n\n6. **Function `query_image`**:\n   - This function processes the image using the Owlv2 model to detect objects based on text queries. It saves the annotated image and the JSON output containing the bounding box coordinates and labels.\n\n7. **Function `crop_images`**:\n   - This function crops out the detected objects from the input image based on the bounding box coordinates and saves them to the specified output directory.\n\n8. **Main Block**:\n   - The main block takes the screenshot, processes the image using the Owlv2 model, and crops the detected objects. It saves the cropped images to the specified output directory.\n\n### Next Steps\n\nBefore moving forward, please confirm if there are any specific aspects of the code or the summary that you would like to discuss further. Once confirmed, we can proceed with the next steps as outlined in the summary."
        ],
        [
            " Okay, that seems like a good contextualization. Yeah, so we&#x27;re going to move towards our ultimate objective piece by piece here, and we&#x27;re making really good progress because you&#x27;re surprisingly good at coding. All right. The next step is, we&#x27;re going to have a vision large language model. Well, it&#x27;s not as good as a large language model, but it&#x27;s a good vision model. It can contextualize text, so it&#x27;s like an OCR model, but it can also see images, so it&#x27;s pretty good there too. It&#x27;s a good combination of both, not perfect, but good at diversity of image recognition. What we&#x27;re going to do is, I want you to take the imagetestv1.py file, or that code, and I want you to update it to include sending each one of those cropped images to the vision model. How you will learn to do this is, I will give you the code for the Lucid Vision extension for text gen.\n\nDo not write any code yet, I just want to make sure you understand.",
            "I understand your request. You want to update the `ImageTestV1.py` script to include functionality for sending each of the cropped images to a vision model that can perform both OCR and image recognition. To achieve this, you will provide me with the code for the Lucid Vision extension for TextGen, which I will use as a reference to integrate the vision model into the script.\n\nTo summarize the next steps:\n\n1. **Review the Lucid Vision Extension Code**: I will review the code for the Lucid Vision extension to understand how it interacts with the vision model.\n\n2. **Update `ImageTestV1.py`**: I will update the `ImageTestV1.py` script to include functionality for sending each cropped image to the vision model.\n\n3. **Integrate Vision Model**: I will integrate the vision model into the script, ensuring that it can process the cropped images and return the recognized text and object descriptions.\n\n4. **Store Results**: I will store the results from the vision model in a structured format (e.g., a JSON file) that the large language model can reference.\n\nPlease provide the code for the Lucid Vision extension, and I will proceed with the integration."
        ],
        [
            "Okay, here is the Lucid_Vision script.py file that works as an extension for textgen.  I am going to give you the script.py file and the markdown for the github repo for supported context.  Don&#x27;t write any code just yet I want you to contextualize this extension.\n\n# pip install pexpect\nimport json\nimport re\nfrom datetime import datetime  # Import datetime for timestamp generation\nfrom pathlib import Path\n\nimport gradio as gr\nimport torch\nfrom PIL import Image\nfrom deepseek_vl.models import VLChatProcessor\nfrom deepseek_vl.utils.io import load_pil_images as load_pil_images_for_deepseek\nfrom transformers import AutoModelForCausalLM, AutoModel, AutoTokenizer, AutoProcessor, \\\n    PaliGemmaForConditionalGeneration\n\nmodel_names = [&quot;phiVision&quot;, &quot;DeepSeek&quot;, &quot;paligemma&quot;, &quot;paligemma_cpu&quot;, &quot;minicpm_llama3&quot;, &quot;bunny&quot;]\n\n\n# Load configuration settings from a JSON file\ndef load_config():\n    with open(Path(__file__).parent / &quot;config.json&quot;, &quot;r&quot;) as config_file:\n        # Read and parse the JSON configuration file\n        config = json.load(config_file)\n    return config\n\n\n# Load the configuration settings at the module level\nconfig = load_config()\n\n# Define the model ID for PhiVision using the configuration setting\nphiVision_model_id = config[&quot;phiVision_model_id&quot;]\n\n# Define the model ID for PaliGemma using the configuration setting\npaligemma_model_id = config[&quot;paligemma_model_id&quot;]\n\nminicpm_llama3_model_id = config[&quot;minicpm_llama3_model_id&quot;]\n\nbunny_model_id = config[&quot;bunny_model_id&quot;]\n\ncuda_device = config[&quot;cuda_visible_devices&quot;]\n\nselected_vision_model = config[&quot;default_vision_model&quot;]\n\n# Global variable to store the file path of the selected image\nselected_image_path = None\n\n# Define the directory where the image files will be saved\n# This path is loaded from the configuration file\nimage_history_dir = Path(config[&quot;image_history_dir&quot;])\n# Ensure the directory exists, creating it if necessary\nimage_history_dir.mkdir(parents=True, exist_ok=True)\n\nglobal phiVision_model, phiVision_processor\nglobal minicpm_llama_model, minicpm_llama_tokenizer\nglobal paligemma_model, paligemma_processor\nglobal paligemma_cpu_model, paligemma_cpu_processor\nglobal deepseek_processor, deepseek_tokenizer, deepseek_gpt\nglobal bunny_model, bunny_tokenizer\n\n\n# Function to generate a timestamped filename for saved images\ndef get_timestamped_filename(extension=&quot;.png&quot;):\n    # Generate a timestamp in the format YYYYMMDD_HHMMSS\n    timestamp = datetime.now().strftime(&quot;%Y%m%d_%H%M%S&quot;)\n    # Return a formatted file name with the timestamp\n    return f&quot;image_{timestamp}{extension}&quot;\n\n\n# Function to modify the user input before it is processed by the LLM\ndef input_modifier(user_input, state):\n    global selected_image_path\n    # Check if an image has been selected and stored in the global variable\n    if selected_image_path:\n        # Construct the message with the &quot;File location&quot; trigger phrase and the image file path\n        image_info = f&quot;File location: {selected_image_path}&quot;\n        # Combine the user input with the image information, separated by a newline for clarity\n        combined_input = f&quot;{user_input}\\n\\n{image_info}&quot;\n        # Reset the selected image path to None after processing\n        selected_image_path = None\n        return combined_input\n    # If no image is selected, return the user input as is\n    return user_input\n\n\n# Function to handle image upload and store the file path\ndef save_and_print_image_path(image):\n    global selected_image_path\n    if image is not None:\n        # Generate a unique timestamped filename for the image\n        file_name = get_timestamped_filename()\n        # Construct the full file path for the image\n        file_path = image_history_dir / file_name\n        # Save the uploaded image to the specified directory\n        image.save(file_path, format=&#x27;PNG&#x27;)\n        print(f&quot;Image selected: {file_path}&quot;)\n        # Update the global variable with the new image file path\n        selected_image_path = file_path\n    else:\n        # Print a message if no image is selected\n        print(&quot;No image selected yet.&quot;)\n\n\n# Function to create the Gradio UI components with the new direct vision model interaction elements\ndef ui():\n    # Create an image input component for the Gradio UI\n    image_input = gr.Image(label=&quot;Select an Image&quot;, type=&quot;pil&quot;, source=&quot;upload&quot;)\n    # Set up the event that occurs when an image is uploaded\n    image_input.change(\n        fn=save_and_print_image_path,\n        inputs=image_input,\n        outputs=None\n    )\n\n    # Create a text field for user input to the vision model\n    user_input_to_vision_model = gr.Textbox(label=&quot;Type your question or prompt for the vision model&quot;, lines=2,\n                                            placeholder=&quot;Enter your text here...&quot;)\n\n    # Create a button to trigger the vision model processing\n    vision_model_button = gr.Button(value=&quot;Ask Vision Model&quot;)\n\n    # Create a text field to display the vision model&#x27;s response\n    vision_model_response_output = gr.Textbox(\n        label=&quot;Vision Model Response&quot;,\n        lines=5,\n        placeholder=&quot;Response will appear here...&quot;\n    )\n\n    # Add radio buttons for vision model selection\n    vision_model_selection = gr.Radio(\n        choices=model_names,\n        # Added &quot;paligemma_cpu&quot; as a choice\n        value=config[&quot;default_vision_model&quot;],\n        label=&quot;Select Vision Model&quot;\n    )\n\n    # Add an event handler for the radio button selection\n    vision_model_selection.change(\n        fn=update_vision_model,\n        inputs=vision_model_selection,\n        outputs=None\n    )\n\n    cuda_devices_input = gr.Textbox(\n        value=cuda_device,\n        label=&quot;CUDA Device ID&quot;,\n        max_lines=1\n    )\n\n    cuda_devices_input.change(\n        fn=update_cuda_device,\n        inputs=cuda_devices_input,\n        outputs=None\n    )\n\n    # Set up the event that occurs when the vision model button is clicked\n    vision_model_button.click(\n        fn=process_with_vision_model,\n        inputs=[user_input_to_vision_model, image_input, vision_model_selection],\n        # Pass the actual gr.Image component\n        outputs=vision_model_response_output\n    )\n\n    # Return a column containing the UI components\n    return gr.Column(\n        [\n            image_input,\n            user_input_to_vision_model,\n            vision_model_button,\n            vision_model_response_output,\n            vision_model_selection\n        ]\n    )\n\n\n# Function to load the PaliGemma CPU model and processor\ndef load_paligemma_cpu_model():\n    global paligemma_cpu_model, paligemma_cpu_processor\n    paligemma_cpu_model = PaliGemmaForConditionalGeneration.from_pretrained(\n        paligemma_model_id,\n    ).eval()\n    paligemma_cpu_processor = AutoProcessor.from_pretrained(paligemma_model_id)\n    print(&quot;PaliGemma CPU model loaded on-demand.&quot;)\n\n\n# Function to unload the PaliGemma CPU model and processor\ndef unload_paligemma_cpu_model():\n    global paligemma_cpu_model, paligemma_cpu_processor\n    if paligemma_cpu_model is not None:\n        # Delete the model and processor instances\n        del paligemma_cpu_model\n        del paligemma_cpu_processor\n        print(&quot;PaliGemma CPU model unloaded.&quot;)\n\n    # Global variable to store the selected vision model\n\n\n# Function to update the selected vision model and load the corresponding model\ndef update_vision_model(model_name):\n    global selected_vision_model\n    selected_vision_model = model_name\n    return model_name\n\n\ndef update_cuda_device(device):\n    global cuda_device\n    print(f&quot;Cuda device set to index = {device}&quot;)\n    cuda_device = int(device)\n    return cuda_device\n\n\n# Main entry point for the Gradio interface\nif __name__ == &quot;__main__&quot;:\n    # Launch the Gradio interface with the specified UI components\n    gr.Interface(\n        fn=ui,\n        inputs=None,\n        outputs=&quot;ui&quot;\n    ).launch()\n\n# Define a regular expression pattern to match the &quot;File location&quot; trigger phrase\nfile_location_pattern = re.compile(r&quot;File location: (.+)$&quot;, re.MULTILINE)\n# Define a regular expression pattern to match and remove unwanted initial text\nunwanted_initial_text_pattern = re.compile(r&quot;.*?prod\\(dim=0\\)\\r\\n&quot;, re.DOTALL)\n\n# Define a regular expression pattern to match and remove unwanted prefixes from DeepSeek responses\n# This pattern should be updated to match the new, platform-independent file path format\nunwanted_prefix_pattern = re.compile(\n    r&quot;^&quot; + re.escape(str(image_history_dir)) + r&quot;[^ ]+\\.png\\s*Assistant: &quot;,\n    re.MULTILINE\n)\n\n\n# Function to load the PhiVision model and processor\ndef load_phi_vision_model():\n    global phiVision_model, phiVision_processor, cuda_device\n    # Load the PhiVision model and processor on-demand, specifying the device map\n    phiVision_model = AutoModelForCausalLM.from_pretrained(\n        phiVision_model_id,\n        device_map={&quot;&quot;: cuda_device},  # Use the specified CUDA device(s)\n        trust_remote_code=True,\n        torch_dtype=&quot;auto&quot;\n    )\n    phiVision_processor = AutoProcessor.from_pretrained(\n        phiVision_model_id,\n        trust_remote_code=True\n    )\n    print(&quot;PhiVision model loaded on-demand.&quot;)\n\n\n# Function to unload the PhiVision model and processor\ndef unload_phi_vision_model():\n    global phiVision_model, phiVision_processor\n    if phiVision_model is not None:\n        # Delete the model and processor instances\n        del phiVision_model\n        del phiVision_processor\n        # Clear the CUDA cache to free up VRAM\n        torch.cuda.empty_cache()\n        print(&quot;PhiVision model unloaded.&quot;)\n\n\n# Function to load the MiniCPM-Llama3 model and tokenizer\ndef load_minicpm_llama_model():\n    global minicpm_llama_model, minicpm_llama_tokenizer, cuda_device\n    if &quot;int4&quot; in minicpm_llama3_model_id:\n        # Load the 4-bit quantized model and tokenizer\n        minicpm_llama_model = AutoModel.from_pretrained(\n            minicpm_llama3_model_id,\n            device_map={&quot;&quot;: cuda_device},\n            trust_remote_code=True,\n            torch_dtype=torch.float16  # Use float16 as per the example code for 4-bit models\n        ).eval()\n        minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n            minicpm_llama3_model_id,\n            trust_remote_code=True\n        )\n        # Print a message indicating that the 4-bit model is loaded\n        print(&quot;MiniCPM-Llama3 4-bit quantized model loaded on-demand.&quot;)\n    else:\n        # Load the standard model and tokenizer\n        minicpm_llama_model = AutoModel.from_pretrained(\n            minicpm_llama3_model_id,\n            device_map={&quot;&quot;: cuda_device},  # Use the specified CUDA device\n            trust_remote_code=True,\n            torch_dtype=torch.bfloat16\n        ).eval()\n        minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(\n            minicpm_llama3_model_id,\n            trust_remote_code=True\n        )\n        print(&quot;MiniCPM-Llama3 standard model loaded on-demand.&quot;)\n\n\ndef unload_minicpm_llama_model():\n    global minicpm_llama_model, minicpm_llama_tokenizer\n    if minicpm_llama_model is not None:\n        del minicpm_llama_model\n        del minicpm_llama_tokenizer\n        torch.cuda.empty_cache()\n        print(&quot;MiniCPM-Llama3 model unloaded.&quot;)\n\n\n# Function to load the PaliGemma model and processor\ndef load_paligemma_model():\n    global paligemma_model, paligemma_processor, cuda_device\n    # Load the PaliGemma model and processor on-demand, specifying the device map\n    paligemma_model = PaliGemmaForConditionalGeneration.from_pretrained(\n        paligemma_model_id,\n        device_map={&quot;&quot;: cuda_device},  # Use the specified CUDA device(s)\n        torch_dtype=torch.bfloat16,\n        revision=&quot;bfloat16&quot;,\n    ).eval()\n    paligemma_processor = AutoProcessor.from_pretrained(paligemma_model_id)\n    print(&quot;PaliGemma model loaded on-demand.&quot;)\n\n\n# Function to unload the PaliGemma model and processor\ndef unload_paligemma_model():\n    global paligemma_model, paligemma_processor\n    if paligemma_model is not None:\n        # Delete the model and processor instances\n        del paligemma_model\n        del paligemma_processor\n        # Clear the CUDA cache to free up VRAM\n        torch.cuda.empty_cache()\n        print(&quot;PaliGemma model unloaded.&quot;)\n\n\ndef load_deepseek_model():\n    global deepseek_processor, deepseek_tokenizer, deepseek_gpt, cuda_device\n    deepseek_processor = VLChatProcessor.from_pretrained(config[&quot;deepseek_vl_model_id&quot;])\n    deepseek_tokenizer = deepseek_processor.tokenizer\n    deepseek_gpt = AutoModelForCausalLM.from_pretrained(\n        config[&quot;deepseek_vl_model_id&quot;],\n        device_map={&quot;&quot;: cuda_device},\n        trust_remote_code=True,\n        torch_dtype=torch.bfloat16\n    ).to(torch.bfloat16).cuda().eval()\n    print(&quot;DeepSeek model loaded on-demand.&quot;)\n\n\ndef unload_deepseek_model():\n    global deepseek_processor, deepseek_tokenizer, deepseek_gpt\n    if deepseek_processor is not None:\n        del deepseek_gpt\n        del deepseek_tokenizer\n        del deepseek_processor\n        print(&quot;DeepSeek model unloaded.&quot;)\n    torch.cuda.empty_cache()\n\n\ndef load_bunny_model():\n    global bunny_model, bunny_tokenizer, cuda_device\n    torch.cuda.set_device(cuda_device)  # Set default device before loading models\n    bunny_model = AutoModelForCausalLM.from_pretrained(\n        bunny_model_id,\n        torch_dtype=torch.bfloat16,\n        device_map={&quot;&quot;: cuda_device},  # Use the specified CUDA device\n        trust_remote_code=True\n    ).to(torch.bfloat16).cuda()\n    bunny_tokenizer = AutoTokenizer.from_pretrained(\n        bunny_model_id,\n        trust_remote_code=True\n    )\n    print(&quot;Bunny model loaded on-demand.&quot;)\n\n\ndef unload_bunny_model():\n    global bunny_model, bunny_tokenizer\n    if bunny_model is not None:\n        del bunny_model, bunny_tokenizer\n        print(&quot;Bunny model unloaded.&quot;)\n    torch.cuda.empty_cache()\n\n\n# Function to modify the output from the LLM before it is displayed to the user\ndef output_modifier(output, state, is_chat=False):\n    global cuda_device\n    # Search for the &quot;File location&quot; trigger phrase in the LLM&#x27;s output\n    file_location_matches = file_location_pattern.findall(output)\n    if file_location_matches:\n        # Extract the first match (assuming only one file location per output)\n        file_path = file_location_matches[0]\n        # Extract the questions for the vision model\n        questions_section, _ = output.split(f&quot;File location: {file_path}&quot;, 1)\n        # Remove all newlines from the questions section and replace them with spaces\n        questions = &quot; &quot;.join(questions_section.strip().splitlines())\n\n        # Initialize an empty response string\n        vision_model_response = &quot;&quot;\n\n        # Check which vision model is currently selected\n        if selected_vision_model == &quot;phiVision&quot;:\n            vision_model_response = generate_phi_vision(file_path, questions)\n\n        elif selected_vision_model == &quot;DeepSeek&quot;:\n            vision_model_response = generate_deepseek(file_path, questions)\n\n        elif selected_vision_model == &quot;paligemma&quot;:\n            vision_model_response = generate_paligemma(file_path, questions)\n\n        elif selected_vision_model == &quot;paligemma_cpu&quot;:\n            vision_model_response = generate_paligemma_cpu(file_path, questions)\n\n        elif selected_vision_model == &quot;minicpm_llama3&quot;:\n            vision_model_response = generate_minicpm_llama3(file_path, questions)\n\n        elif selected_vision_model == &quot;bunny&quot;:\n            vision_model_response = generate_bunny(file_path, questions)\n\n        # Append the vision model&#x27;s responses to the output\n        output_with_responses = f&quot;{output}\\n\\nVision Model Responses:\\n{vision_model_response}&quot;\n        return output_with_responses\n    # If no file location is found, return the output as is\n    return output\n\n\n# Function to generate a response using the MiniCPM-Llama3 model\ndef generate_minicpm_llama3(file_path, questions):\n    global cuda_device\n    try:\n        load_minicpm_llama_model()\n        image = Image.open(file_path).convert(&quot;RGB&quot;)\n        messages = [\n            {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: f&quot;{questions}&quot;}\n        ]\n        # Define the generation arguments\n        generation_args = {\n            &quot;max_new_tokens&quot;: 896,\n            &quot;repetition_penalty&quot;: 1.05,\n            &quot;num_beams&quot;: 3,\n            &quot;top_p&quot;: 0.8,\n            &quot;top_k&quot;: 1,\n            &quot;temperature&quot;: 0.7,\n            &quot;sampling&quot;: True,\n        }\n        if &quot;int4&quot; in minicpm_llama3_model_id:\n            # Disable streaming for the 4-bit model\n            generation_args[&quot;stream&quot;] = False\n            # Use the model.chat method with streaming enabled\n            vision_model_response = &quot;&quot;\n            for new_text in minicpm_llama_model.chat(\n                    image=image,\n                    msgs=messages,\n                    tokenizer=minicpm_llama_tokenizer,\n                    **generation_args\n            ):\n                vision_model_response += new_text\n                print(new_text, flush=True, end=&#x27;&#x27;)\n        else:\n            minicpm_llama_model.to(f&quot;cuda:{cuda_device}&quot;)\n            vision_model_response = minicpm_llama_model.chat(\n                image=image,\n                msgs=messages,\n                tokenizer=minicpm_llama_tokenizer,\n                **generation_args\n            )\n            return vision_model_response\n    finally:\n        unload_minicpm_llama_model()\n\n\n# Function to process the user&#x27;s input text and selected image with the vision model\ndef process_with_vision_model(user_input, image, selected_model):\n    global cuda_device\n    # Save the uploaded image to the specified directory with a timestamp\n    file_name = get_timestamped_filename()\n    file_path = image_history_dir / file_name\n    image.save(file_path, format=&#x27;PNG&#x27;)\n    print(f&quot;Image processed: {file_path}&quot;)\n\n    # Initialize an empty response string\n    vision_model_response = &quot;&quot;\n\n    # Check which vision model is currently selected\n    if selected_model == &quot;phiVision&quot;:\n        vision_model_response = generate_phi_vision(file_path, user_input)\n\n    elif selected_model == &quot;DeepSeek&quot;:\n        vision_model_response = generate_deepseek(file_path, user_input)\n\n    elif selected_model == &quot;paligemma&quot;:\n        vision_model_response = generate_paligemma(file_path, user_input)\n\n    elif selected_model == &quot;paligemma_cpu&quot;:\n        vision_model_response = generate_paligemma_cpu(file_path, user_input)\n\n    elif selected_model == &quot;minicpm_llama3&quot;:\n        vision_model_response = generate_minicpm_llama3(file_path, user_input)\n\n    elif selected_model == &quot;bunny&quot;:\n        vision_model_response = generate_bunny(file_path, user_input)\n\n    # Return the cleaned-up response from the vision model\n    return vision_model_response\n\n\ndef generate_paligemma_cpu(file_path, user_input):\n    try:\n        # Load the PaliGemma CPU model and processor on-demand\n        load_paligemma_cpu_model()\n        # Load the saved image using PIL\n        with Image.open(file_path) as img:\n            # Prepare the prompt for the PaliGemma CPU model using the user&#x27;s question\n            prompt = user_input\n            model_inputs = paligemma_cpu_processor(text=prompt, images=img, return_tensors=&quot;pt&quot;)\n            input_len = model_inputs[&quot;input_ids&quot;].shape[-1]\n            # Generate the response using the PaliGemma CPU model\n            with torch.inference_mode():\n                generation = paligemma_cpu_model.generate(\n                    **model_inputs,\n                    max_new_tokens=100,\n                    do_sample=True  # Set to True for sampling-based generation\n                )\n                generation = generation[0][input_len:]\n                vision_model_response = paligemma_cpu_processor.decode(generation, skip_special_tokens=True)\n        # Unload the PaliGemma CPU model and processor after generating the response\n        return vision_model_response\n    finally:\n        unload_paligemma_cpu_model()\n\n\ndef generate_paligemma(file_path, user_input):\n    try:\n        # Load the PaliGemma model and processor on-demand\n        load_paligemma_model()\n        # Load the saved image using PIL\n        with Image.open(file_path) as img:\n            # Prepare the prompt for the PaliGemma model using the user&#x27;s question\n            model_inputs = paligemma_processor(text=user_input, images=img, return_tensors=&quot;pt&quot;).to(\n                f&quot;cuda:{cuda_device}&quot;)\n            input_len = model_inputs[&quot;input_ids&quot;].shape[-1]\n            # Generate the response using the PaliGemma model\n            with torch.inference_mode():\n                generation = paligemma_model.generate(\n                    **model_inputs,\n                    max_new_tokens=100,\n                    do_sample=True  # Set to True for sampling-based generation\n                )\n                generation = generation[0][input_len:]\n                vision_model_response = paligemma_processor.decode(generation, skip_special_tokens=True)\n        # Unload the PaliGemma model and processor after generating the response\n        return vision_model_response\n    finally:\n        unload_paligemma_model()\n\n\ndef generate_phi_vision(file_path, user_input):\n    global cuda_device\n    try:\n        # Load the PhiVision model and processor on-demand\n        load_phi_vision_model()\n        # Load the saved image using PIL\n        with Image.open(file_path) as img:\n            # Prepare the prompt for the PhiVision model\n            messages = [\n                {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: f&quot;&lt;|image_1|&gt;\\n{user_input}&quot;}\n            ]\n            prompt = phiVision_processor.tokenizer.apply_chat_template(\n                messages, tokenize=False, add_generation_prompt=True\n            )\n            # Extract the CUDA_VISIBLE_DEVICES setting from the config file\n            # cuda_devices = config[&quot;cuda_visible_devices&quot;]\n            # Convert the CUDA_VISIBLE_DEVICES string to an integer (assuming a single device for simplicity)\n            # cuda_device_index = int(cuda_devices)\n\n            # Prepare the model inputs and move them to the specified CUDA device\n            inputs = phiVision_processor(prompt, [img], return_tensors=&quot;pt&quot;).to(f&quot;cuda:{cuda_device}&quot;)\n            # Define the generation arguments\n            generation_args = {\n                &quot;max_new_tokens&quot;: 500,\n                &quot;temperature&quot;: 1.0,\n                &quot;do_sample&quot;: True,  # Set to True for sampling-based generation\n                # &quot;min_p&quot;: 0.95,  # Optionally set a minimum probability threshold\n            }\n            # Generate the response using the PhiVision model\n            generate_ids = phiVision_model.generate(\n                **inputs,\n                eos_token_id=phiVision_processor.tokenizer.eos_token_id,\n                **generation_args\n            )\n            # Remove input tokens from the generated IDs\n            generate_ids = generate_ids[:, inputs[&#x27;input_ids&#x27;].shape[1]:]\n            # Decode the generated IDs to get the text response\n            vision_model_response = phiVision_processor.batch_decode(\n                generate_ids,\n                skip_special_tokens=True,\n                clean_up_tokenization_spaces=False\n            )[0]\n        # Unload the PhiVision model and processor after generating the response\n        return vision_model_response\n    finally:\n        unload_phi_vision_model()\n\n\ndef generate_deepseek(file_path, user_input):\n    try:\n        load_deepseek_model()\n        conversation = [\n            {\n                &quot;role&quot;: &quot;User&quot;,\n                &quot;content&quot;: f&quot;&lt;image_placeholder&gt;{user_input}&quot;,\n                &quot;images&quot;: [f&quot;{file_path}&quot;]\n            }, {\n                &quot;role&quot;: &quot;Assistant&quot;,\n                &quot;content&quot;: &quot;&quot;\n            }\n        ]\n        print(conversation)\n        pil_images = load_pil_images_for_deepseek(conversation)\n        prepare_inputs = deepseek_processor(\n            conversations=conversation,\n            images=pil_images,\n            force_batchify=True\n        ).to(deepseek_gpt.device)\n        input_embeds = deepseek_gpt.prepare_inputs_embeds(**prepare_inputs)\n        outputs = deepseek_gpt.language_model.generate(\n            inputs_embeds=input_embeds,\n            attention_mask=prepare_inputs.attention_mask,\n            pad_token_id=deepseek_tokenizer.eos_token_id,\n            bos_token_id=deepseek_tokenizer.bos_token_id,\n            eos_token_id=deepseek_tokenizer.eos_token_id,\n            max_new_tokens=512,\n            do_sample=False,\n            use_cache=True\n        )\n        return deepseek_tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)\n    finally:\n        unload_deepseek_model()\n\n\ndef generate_bunny(file_path, user_input):\n    global cuda_device\n    try:\n        load_bunny_model()\n        with Image.open(file_path) as image:\n            text = f&quot;A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user&#x27;s questions. USER: &lt;image&gt;\\n{user_input} ASSISTANT:&quot;\n            text_chunks = [bunny_tokenizer(chunk).input_ids for chunk in text.split(&#x27;&lt;image&gt;&#x27;)]\n            input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][1:], dtype=torch.long).unsqueeze(0).to(\n                f&quot;cuda:{cuda_device}&quot;)\n            image_tensor = bunny_model.process_images(\n                [image],\n                bunny_model.config\n            ).to(\n                dtype=bunny_model.dtype,\n                device=bunny_model.device\n            )\n            output_ids = bunny_model.generate(\n                input_ids,\n                images=image_tensor,\n                max_new_tokens=896,\n                use_cache=True,\n                repetition_penalty=1.0\n            )[0]\n            return bunny_tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()\n    finally:\n        unload_bunny_model()\n\n\n# Function to modify the chat history before it is used for text generation\ndef history_modifier(history):\n    # Extract all entries from the &quot;internal&quot; history\n    internal_entries = history[&quot;internal&quot;]\n    # Iterate over the &quot;internal&quot; history entries\n    for internal_index, internal_entry in enumerate(internal_entries):\n        # Extract the text content of the internal entry\n        internal_text = internal_entry[1]\n        # Search for the &quot;File location&quot; trigger phrase in the internal text\n        file_location_matches = file_location_pattern.findall(internal_text)\n        if file_location_matches:\n            # Iterate over each match found in the &quot;internal&quot; entry\n            for file_path in file_location_matches:\n                # Construct the full match string including the trigger phrase\n                full_match_string = f&quot;File location: {file_path}&quot;\n                # Search for the exact same string in the &quot;visible&quot; history\n                for visible_entry in history[&quot;visible&quot;]:\n                    # Extract the text content of the visible entry\n                    visible_text = visible_entry[1]\n                    # If the &quot;visible&quot; entry contains the full match string\n                    if full_match_string in visible_text:\n                        # Split the &quot;visible&quot; text at the full match string\n                        _, after_match = visible_text.split(full_match_string, 1)\n                        # Find the position where the &quot;.png&quot; part ends in the &quot;internal&quot; text\n                        png_end_pos = internal_text.find(file_path) + len(file_path)\n                        # If the &quot;.png&quot; part is found and there is content after it\n                        if png_end_pos &lt; len(internal_text) and internal_text[png_end_pos] == &quot;\\n&quot;:\n                            # Extract the existing content after the &quot;.png&quot; part in the &quot;internal&quot; text\n                            _ = internal_text[png_end_pos:]\n                            # Replace the existing content after the &quot;.png&quot; part in the &quot;internal&quot; text\n                            # with the corresponding content from the &quot;visible&quot; text\n                            new_internal_text = internal_text[:png_end_pos] + after_match\n                            # Update the &quot;internal&quot; history entry with the new text\n                            history[&quot;internal&quot;][internal_index][1] = new_internal_text\n                        # If there is no content after the &quot;.png&quot; part in the &quot;internal&quot; text,\n                        # append the content from the &quot;visible&quot; text directly\n                        else:\n                            # Append the content after the full match string from the &quot;visible&quot; text\n                            history[&quot;internal&quot;][internal_index][1] += after_match\n    return history\n\n\nGitHub page markdown:\n\n# README\n\nVideo Demo: There is code in this repo to prevent Alltalk from reading the directory names out loud ([here](https://github.com/RandomInternetPreson/Lucid_Vision/tree/main?tab=readme-ov-file#to-work-with-alltalk)), the video is older and displays Alltalk reading the directory names however. https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/8879854b-06d8-49ad-836c-13c84eff6ac9 Download the full demo video here: https://github.com/RandomInternetPreson/Lucid_Vision/blob/main/VideoDemo/Lucid_Vision_demoCompBig.mov * Update June 2 2024, Lucid_Vision now supports Bunny-v1_1-Llama-3-8B-V, again thanks to https://github.com/justin-luoma :3 * Update May 31 2024, many thanks to https://github.com/justin-luoma, they have made several recent updates to the code, now merged with this repo. * Removed the need to communicate with deepseek via the cli, this removes the only extra dependency needed to run Lucid_Vision. * Reduced the number of entries needed for the config file. * Really cleaned up the code, it is much better organized now. * Added the ability to switch which gpu loads the vision model in the UI. * Bonus Update for May 31 2024, WizardLM-2-8x22B and I figured out how to prevent Alltalk from reading the file directory locations!! https://github.com/RandomInternetPreson/Lucid_Vision/tree/main?tab=readme-ov-file#to-work-with-alltalk * Update May 30 2024, Lucid_Vision now supports MiniCPM-Llama3-V-2_5, thanks to https://github.com/justin-luoma. Additionally WizardLM-2-8x22B added the functionality to load in the MiniCPM 4-bit model. * Updated script.py and config file, model was not previously loading to the user assigned gpu Experimental, and I am currently working to improve the code; but it may work for most. Todo: * Right now the temp setting and parameters for each model are coded in the .py script, I intend to bring these to the UI [done](https://github.com/RandomInternetPreson/Lucid_Vision/tree/main?tab=readme-ov-file#to-work-with-alltalk)~~* Make it comptable with Alltalk, right now each image directory is printed out and this will be read out loud by Alltalk~~ To accurately proportion credit (original repo creation): WizardLM-2-8x22B (quantized with exllamaV2 to 8bit precision) = 90% of all the work done in this repo. That model wrote 100% of all the code and most of the introduction to this repo. CommandR+ (quantized with exllamaV2 to 8bit precision) = ~5% of all the work. CommandR+ contextualized the coding examples and rules for making extensions from Oogabooga&#x27;s textgen repo extreamly well, and provided a good foundation to develope the code. RandomInternetPreson = 5% of all the work done. I came up with the original idea, the original general outline of how the pieces would interact, and provided feedback to the WizardLM model, but I did not write any code. I&#x27;m actually not very good with python yet, with most of my decades of coding being in Matlab. My goal from the beginning was to write this extension offline without any additional resources, sometimes it was a little frusturating but I soon understood how to get want I needed from the models running locally. I would say that most of the credit should go to Oobabooga, for without them I would be struggling to even interact with my models. Please consider supporting them: https://github.com/sponsors/oobabooga or https://ko-fi.com/oobabooga I am their top donor on ko-fi (Mr. A) and donate 15$ montly, their software is extreamly important to the opensource community. # Lucid_Vision Extension for Oobabooga&#x27;s textgen-webui Welcome to the Lucid Vision Extension repository! This extension enhances the capabilities of textgen-webui by integrating advanced vision models, allowing users to have contextualized conversations about images with their favorite language models; and allowing direct communciation with vision models. ## Features * Multi-Model Support: Interact with different vision models, including PhiVision, DeepSeek, MiniCPM-Llama3-V-2_5 (4bit and normal precision), Bunny-v1_1-Llama-3-8B-V, and PaliGemma, with options for both GPU and CPU inference. * On-Demand Loading: Vision models are loaded into memory only when needed to answer a question, optimizing resource usage. * Seamless Integration: Easily switch between vision models using a Gradio UI radio button selector. * Cross-Platform Compatibility: The extension is designed to work on various operating systems, including Unix-based systems and Windows. (not tested in Windows yet, but should probably work?) * Direct communication with vision models, you do not need to load a LLM to interact with the separate vision models. ## How It Works The Lucid Vision Extension operates by intercepting and modifying user input and output within the textgen-webui framework. When a user uploads an image and asks a question, the extension appends a special trigger phrase (&quot;File location:&quot;) and extracts the associated file path and question. So if a user enters text into the &quot;send a message&quot; field and has a new picture uploaded into the Lucid_Vision ui, what will happen behind the scenes is the at the user message will be appended with the &quot;File Location: (file location)&quot; Trigger phrase, at which point the LLM will see this and understand that it needs to reply back with questions about the image, and that those questions are being sent to a vison model. The cool thing is that let&#x27;s say later in the conversation you want to know something specific about a previous picture, all you need to do is ask your LLM, YOU DO NOT NEED TO REUPLOAD THE PICTURE, the LLM should be able to interact with the extension on its own after you uploaded your first picture. The selected vision directly interacts with the model&#x27;s Python API to generate a response. The response is then appended to the chat history, providing the user with detailed insights about the image. The extension is designed to be efficient with system resources by only loading the vision models into memory when they are actively being used to process a question. After generating a response, the models are immediately unloaded to free up memory and GPU VRAM. ## **How to install and setup:** 1.use the latest version of textgen ~~Install this edited prior commit from oobabooga&#x27;s textgen https://github.com/RandomInternetPreson/textgen_webui_Lucid_Vision_Testing OR use the latest version of textgen.~~ ~~If using the edited older version, make sure to rename the install folder `text-generation-webui`~~ ~~(Note, a couple months ago gradio had a massive update. For me, this has caused a lot of glitches and errors with extensions; I&#x27;ve briefly tested the Lucid_Vision extension in the newest implementation of textgen and it will work. However, I was getting timeout popups when vision models were loading for the first time, gradio wasn&#x27;t waiting for the response from the model upon first load. After a model is loaded once, it is saved in cpu ram cache (this doesn&#x27;t actively use your ram, it just uses what is free to keep the models in memory so they are quickly reloaded into gpu ram if necessary) and gradio doesn&#x27;t seem to timeout as often. The slightly older version of textgen that I&#x27;ve edited does not experience this issue)~~ ~~2. Update the transformers library using the cmd_yourOShere.sh/bat file (so either cmd_linux.sh, cmd_macos.sh, cmd_windows.bat, or cmd_wsl.bat) and entering the following lines. If you run the update wizard after this point, it will overrite this update to transformers. The newest transformers package has the libraries for paligemma, which the code needs to import regardless of whether or not you are intending to use the model.~~ No longer need to update transformers and the latest version of textgen as of this writing 1.9.1 works well with many of the gradio issues resolved, however gradio might timeout on the page when loading a model for the fisrt time still. After a model is loaded once, it is saved in cpu ram cache (this doesn&#x27;t actively use your ram, it just uses what is free to keep the models in memory so they are quickly reloaded into gpu ram if necessary) and gradio doesn&#x27;t seem to timeout as often. ## Model Information If you do not want to install DeepseekVL dependencies and are not intending to use deepseek comment out the import lines from the script.py file as displayed here ``` # pip install pexpect import json import re from datetime import datetime # Import datetime for timestamp generation from pathlib import Path import gradio as gr import torch from PIL import Image #from deepseek_vl.models import VLChatProcessor #from deepseek_vl.utils.io import load_pil_images as load_pil_images_for_deepseek from transformers import AutoModelForCausalLM, AutoModel, AutoTokenizer, AutoProcessor, \\ PaliGemmaForConditionalGeneration model_names = [&quot;phiVision&quot;, &quot;DeepSeek&quot;, &quot;paligemma&quot;, &quot;paligemma_cpu&quot;, &quot;minicpm_llama3&quot;, &quot;bunny&quot;] ``` 3. Install **DeepseekVL** if you intend on using that model Clone the repo: https://github.com/deepseek-ai/DeepSeek-VL into the `repositories` folder of your textgen install Open cmd_yourOShere.sh/bat, navigate to the `repositories/DeepSeek-VL` folder via the terminal using `cd your_directory_here` and enter this into the command window: ``` pip install -e . ``` Download the deepseekvl model here: https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat They have different and smaller models to choose from: https://github.com/deepseek-ai/DeepSeek-VL?tab=readme-ov-file#3-model-downloads 4. If you want to use **Phi-3-vision-128k-instruct**, download it here: https://huggingface.co/microsoft/Phi-3-vision-128k-instruct 5. If you want to use **paligemma-3b**, download it here: https://huggingface.co/google/paligemma-3b-ft-cococap-448 (this is just one out of many fine-tunes google provides) Read this blog on how to inference with the model: https://huggingface.co/blog/paligemma 6. If you want to use **MiniCPM-Llama3-V-2_5**, download it here: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5 The **4-bit** verison of the model can be downloaded here: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4 **Notes about 4-bit MiniCPM:** * It might not look like the model fully unloads from vram, but it does and the vram will be reclaimed if another program needs it * Your directory folder where the model is stored needs to have the term &quot;int4&quot; in it, this is how the extension identifies the 4bit nature of the model 7. If you want to use **Bunny-v1_1-Llama-3-8B-V**, download it here: https://huggingface.co/BAAI/Bunny-v1_1-Llama-3-8B-V **Notes about Bunny-v1_1-Llama-3-8B-V:** * When running for the first time the model needs internet access so it can download models--google--siglip-so400m-patch14-384 to your cache/huggingface directory. This is an additional 3.5GB file the model needs to run. ## Updating the config file 8. Before using the extension you need to update the config file; open it in a text editor *Note No quotes around gpu #: ``` { &quot;image_history_dir&quot;: &quot;(fill_In)/extensions/Lucid_Vision/ImageHistory/&quot;, &quot;cuda_visible_devices&quot;: 0, &quot;default_vision_model&quot;: &quot;phiVision&quot;, &quot;phiVision_model_id&quot;: &quot;(fill_In)&quot;, &quot;paligemma_model_id&quot;: &quot;(fill_In)&quot;, &quot;paligemma_cpu_model_id&quot;: &quot;(fill_In)&quot;, &quot;minicpm_llama3_model_id&quot;: &quot;(fill_In)&quot;, &quot;deepseek_vl_model_id&quot;: &quot;(fill_in)&quot;, &quot;bunny_model_id&quot;: &quot;(fill_in)&quot; } ``` If your install directory is /home/username/Desktop/oobLucidVision/text-generation-webui/ the config file will look like this for example: Make note that you want to change / to \\ if you are on Windows ``` { &quot;image_history_dir&quot;: &quot;/home/username/Desktop/oobLucidVision/text-generation-webui/extensions/Lucid_Vision/ImageHistory/&quot;, &quot;cuda_visible_devices&quot;: 0, &quot;default_vision_model&quot;: &quot;phiVision&quot;, &quot;phiVision_model_id&quot;: &quot;(fill_In)&quot;, *This is the folder where your phi-3 vision model is stored &quot;paligemma_model_id&quot;: &quot;(fill_In)&quot;, *This is the folder where your paligemma vision model is stored &quot;paligemma_cpu_model_id&quot;: &quot;(fill_In)&quot;, *This is the folder where your paligemma vision model is stored &quot;minicpm_llama3_model_id&quot;: &quot;(fill_In)&quot;, *This is the folder where your minicpm_llama3 vision model is stored, the model can either be the normal fp16 or 4-bit version &quot;deepseek_vl_model_id&quot;: &quot;(fill_in)&quot;, *This is the folder where your deepseek vision model is stored &quot;bunny_model_id&quot;: &quot;(fill_in)&quot; *This is the folder where your Bunny-v1_1-Llama-3-8B-V vision model is stored } ``` ## To work with Alltalk Alltalk is great!! An extension I use all the time: https://github.com/erew123/alltalk_tts To get Lucid_Vision to work, the LLM needs to repeat the file directory of an image, and in doing so Alltalk will want to transcribe that text to audo. It is annoying to have the tts model try an transcribe file directories. If you want to get Alltalk to work well wtih Lucid_Vision you need to replace the code here in the script.py file that comes with Alltalk (search for the &quot;IMAGE CLEANING&quot; part of the code): ``` ######################## #### IMAGE CLEANING #### ######################## OriginalLucidVisionText = &quot;&quot; # This is the existing pattern for matching both images and file location text img_pattern = r&#x27;&lt;img[^&gt;]*src\\s*=\\s*[&quot;\\&#x27;][^&quot;\\&#x27;&gt;]+[&quot;\\&#x27;][^&gt;]*&gt;|File location: [^.]+.png&#x27; def extract_and_remove_images(text): &quot;&quot;&quot; Extracts all image and file location data from the text and removes it for clean TTS processing. Returns the cleaned text and the extracted image and file location data. &quot;&quot;&quot; global OriginalLucidVisionText OriginalLucidVisionText = text # Update the global variable with the original text img_matches = re.findall(img_pattern, text) img_info = &quot;\\n&quot;.join(img_matches) # Store extracted image and file location data cleaned_text = re.sub(img_pattern, &#x27;&#x27;, text) # Remove images and file locations from text return cleaned_text, img_info def reinsert_images(cleaned_string, img_info): &quot;&quot;&quot; Reinserts the previously extracted image and file location data back into the text. &quot;&quot;&quot; global OriginalLucidVisionText # Check if there are images or file locations to reinsert if img_info: # Check if the &quot;Vision Model Responses:&quot; phrase is present in the original text if re.search(r&#x27;Vision Model Responses:&#x27;, OriginalLucidVisionText): # If present, return the original text as is, without modifying it return OriginalLucidVisionText else: # If not present, append the img_info to the end of the cleaned string cleaned_string += f&quot;\\n\\n{img_info}&quot; return cleaned_string ################################# #### TTS STANDARD GENERATION #### ################################# ``` ## **Quirks and Notes:** 1. **When you load a picture once, it is used once. Even if the image stays present in the UI element on screen, it is not actively being used.** 2. If you are using other extensions, load Lucid_Vision first. Put it first in your CMD_FLAGS.txt file or make sure to check it first in the sequence of check boxes in the session tab UI. 3. DeepseekVL can take a while to load initially, that&#x27;s just the way it is. # **How to use:** ## **BASICS:** **When you load a picture once, it is used once. Even if the image stays present in the UI element on screen, it is not actively being used.** Okay the extension can do many different things with varying levels of difficulty. Starting out with the basics and understanding how to talk with your vision models: Scroll down past where you would normally type something to the LLM, you do not need a large language model loaded to use this function. ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/227ff483-5041-46a7-9b5b-a8f9dd3c673e) Start out by interacting with the vision models without involvement of a seperate LLM model by pressing the `Ask Vision Model` button ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/4530e13f-30a1-43d9-8383-c05e31ddb5d7) **Note paligemma requries a little diffent type prompting sometimes, read the blog on how to inference with it: https://huggingface.co/blog/paligemma** Do this with every model you intend on using, upload a picture, and ask a question ## **ADVANCED: Updated Tips At End on how to get working with Llama-3-Instruct-8B-SPPO-Iter3 https://huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3** Okay, this is why I built the extension in the first place; direct interaction with the vision model was actually an afterthought after I had all the code working. This is how to use your favorite LLM WITH an additional Vision Model, I wanted a way to give my LLMs eyes essentially. I realize that good multimodal models are likely around the corner, but until they match the intellect of very good LLMs, I&#x27;d rather have a different vision model work with a good LLM. To use Lucid_Vision as intended requires a little bit of setup: 1. In `Parameters` then `chat` load the &quot;AI_Image&quot; character (this card is in the edited older commit, if using your own version of textgen the character card is here: https://github.com/RandomInternetPreson/Lucid_Vision/blob/main/AI_Image.yaml put it in the `characters` folder of the textgen install folder: ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/8b5d770a-b72a-4a74-aa4a-f67fe2a113ba) 2. In `Parameters` then `Generation` under `Custom stopping strings` enter &quot;Vision Model Responses:&quot; exactly as shown: ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/7a09650d-4fce-4a8f-badb-b26b4484bf37) 3. Test to see if your LLM understands the instructions for the AI_Image character; ask it this: `Can you tell me what is in this image? Do not add any additional text to the beginning of your response to me, start your reply with the questions.` ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/817605d7-dcab-4e6c-8c39-6d119a189431) I uploaded my image to the Lucid_Vision UI element, then typed my question, then pressed &quot;Generate&quot;; it doesn&#x27;t matter which order you upload your picture or type your question, both are sent to the LLM at the same time. ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/f1525e8d-551d-4719-9e01-3f26b4362d7c) 4. Continue to give the model more images or just talk about something else, in this example I give the model two new images one of a humanoid robot and one of the nutritional facts for a container of sourcream: ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/77131490-fb19-4062-a65f-c0880515e252) ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/fed7021f-edb5-4667-aae0-c05021291618) ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/d1049551-4e48-4ced-85d9-fbb0441330a9) Then I asked the model why the sky is blue: ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/deb7b70a-edf7-4014-bd8c-2492bae00ea6) 5. Now I can ask it a question from a previous image and the LLM will automatically know to prompt the vision model and it will find the previously saved image on its own: ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/e8b0791b-cce8-4183-8971-42e5383d6c46) Please note that the vision model received all this as text: &quot;Certainly! To retrieve the complete contents of the nutritional facts from the image you provided earlier, I will ask the vision model to perform an Optical Character Recognition (OCR) task on the image. Here is the question for the vision model: Can you perform OCR on the provided image and extract all the text from the nutrition facts label, including the details of each nutrient and its corresponding values?&quot; If the vision model is not that smart (Deepseek, paligemma) then it will have a difficult time contextualizing the text prior to the questions for the vision model. If you run into this situation, it is best to prompt your LLM like this: `Great, can you get the complete contents of the nutritional facts from earlier, like all the text? Just start with your questions to the vision model please.` ![image](https://github.com/RandomInternetPreson/Lucid_Vision/assets/6488699/38b59679-3550-410d-aac2-29756a058c8f) 6. Please do not get frustrated right away, the Advanced method of usage depends heavily on your LLM&#x27;s ability to understand the instructions from the character card. The instructions were written by the 8x22B model itself, I explained what I wanted the model to do and had it write its own instructions. This might be a viable alternative if you are struggling to get your model to adhere to the instructions from the character card. You may need to explain things to your llm as your conversation progresses, if it tries to query the vision model when you don&#x27;t want it to, just explain that to the model and it&#x27;s unlikely to keep making the same mistake. ## **Tips on how to get working with Llama-3-Instruct-8B-SPPO-Iter3 https://huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3** I have found to get this model (and likely similar models) working properly with lucid vision I needed to NOT use the AI_Image character card. Instead I started the conversation with the model like this: ![image](https://github.com/user-attachments/assets/4d0ff97c-f9dd-4043-b5be-7b915cb7aae7) Then copy and pasted the necessary information from the character card: ``` The following is a conversation with an AI Large Language Model. The AI has been trained to answer questions, provide recommendations, and help with decision making. The AI follows user requests. The AI thinks outside the box. Instructions for Processing Image-Related User Input with Appended File Information: Identify the Trigger Phrase: Begin by scanning the user input for the presence of the &quot;File location&quot; trigger phrase. This phrase indicates that the user has selected an image and that the LLM should consider this information in its response. Extract the File Path: Once the trigger phrase is identified, parse the text that follows to extract the absolute file path of the image. The file path will be provided immediately after the trigger phrase and will end with the image file extension (e.g., .png, .jpg). Understand the Context: Recognize that the user&#x27;s message preceding the file path is the primary context for the interaction. The LLM should address the user&#x27;s query or statement while also incorporating the availability of the selected image into its response. Formulate Questions for the Vision Model: Based on the user&#x27;s message and the fact that an image is available for analysis, generate one or more questions that can be answered by the vision model. These questions should be clear, specific, and relevant to the image content. Maintain a Conversational Tone: Ensure that the response is natural, coherent, and maintains the flow of the conversation. The LLM should act as an intermediary between the user and the vision model, facilitating a seamless dialogue. Prepare the Response Structure: Structure the response so that it includes: An acknowledgment of the user&#x27;s initial message. The questions formulated for the vision model, each clearly separated (e.g., by a newline or bullet point). Any additional information or clarification requests, if necessary. Append the File Path for Processing: At the end of the response, re-append the &quot;File location&quot; trigger phrase along with the extracted file path. This ensures that the subsequent processing steps (such as sending the information to the vision model&#x27;s CLI) can correctly identify and use the file path. Avoid Direct Interaction with the File System: As an LLM, you do not have the capability to directly access or modify files on the server or client systems. Your role is limited to generating text-based responses that include the necessary file path information for other system components to handle. Example Response: Based on your question about the contents of the image, here are the questions I will ask the vision model: - Can you describe the main objects present in the image? - Is there any text visible in the image, and if so, what does it say? - What appears to be happening in the scene depicted in the image? File location: //home/myself/Pictures/rose.png Do review and contextualize the conversation as it develops (reference your context) to infer if the user is asking new questions of previous images. Reference the parts of the convesation that are likely to yeild the file location of the image in question, and formulate your response to include that specific file location of that unique .png file, make sure you are referencing the correct .png file as per the part of the conversation that is likely to be in reference to the updated information request. By following these instructions, the LLM will be able to effectively process user inputs that include image file information and generate appropriate responses that facilitate the interaction between the user and the vision model. ``` Making sure to give a real file location to a real image. After that it started to pretty much work but I still needed to correct the model (I had to add the spelling error intentionally because if the user sends &quot;File Location:&quot; is messes up the extension: ![image](https://github.com/user-attachments/assets/5090ef8b-88ff-4ec1-a131-ba57e15e9a7d) The point to take away is that you may need to explain things to your model in various ways for it to contextualize the instructions approprately.",
            "Thank you for providing the `Lucid_Vision` script and the GitHub markdown. I have reviewed the code and the documentation to understand how the extension works. Here&#x27;s a summary of the `Lucid_Vision` extension:\n\n### Summary of `Lucid_Vision` Extension\n\n**Purpose**:\n- The `Lucid_Vision` extension enhances the capabilities of TextGen by integrating advanced vision models, allowing users to have contextualized conversations about images with their favorite language models and enabling direct communication with vision models.\n\n**Features**:\n1. **Multi-Model Support**: Interact with different vision models, including PhiVision, DeepSeek, MiniCPM-Llama3-V-2_5 (4-bit and normal precision), Bunny-v1_1-Llama-3-8B-V, and PaliGemma, with options for both GPU and CPU inference.\n2. **On-Demand Loading**: Vision models are loaded into memory only when needed to answer a question, optimizing resource usage.\n3. **Seamless Integration**: Easily switch between vision models using a Gradio UI radio button selector.\n4. **Cross-Platform Compatibility**: The extension is designed to work on various operating systems, including Unix-based systems and Windows.\n5. **Direct Communication with Vision Models**: Users do not need to load a large language model (LLM) to interact with the separate vision models.\n\n**How It Works**:\n- The extension intercepts and modifies user input and output within the TextGen framework.\n- When a user uploads an image and asks a question, the extension appends a special trigger phrase (&quot;File location:&quot;) and extracts the associated file path and question.\n- The selected vision model directly interacts with the model&#x27;s Python API to generate a response.\n- The response is then appended to the chat history, providing the user with detailed insights about the image.\n- The extension is designed to be efficient with system resources by only loading the vision models into memory when they are actively being used to process a question. After generating a response, the models are immediately unloaded to free up memory and GPU VRAM.\n\n**Next Steps**:\n1. **Update `ImageTestV1.py`**: Integrate the functionality to send each cropped image to the vision model for processing.\n2. **Store Results**: Store the results from the vision model in a structured format (e.g., a JSON file) that the large language model can reference.\n3. **Develop Interaction Logic**: Develop the logic for the large language model to interact with the web page using the stored data.\n4. **Implement Mouse and Keyboard Operation**: Implement the functionality for the large language model to operate a mouse and keyboard.\n\n### Plan for Integration\n\n1. **Review the Vision Model Integration**:\n   - Understand how the `Lucid_Vision` extension loads and interacts with different vision models.\n   - Identify the functions and methods used to process images and generate responses.\n\n2. **Update `ImageTestV1.py`**:\n   - Modify the `crop_images` function to send each cropped image to the vision model for processing.\n   - Integrate the vision model processing code from the `Lucid_Vision` extension into the `ImageTestV1.py` script.\n\n3. **Store Results**:\n   - Create a structured format (e.g., a JSON file) to store the results from the vision model.\n   - Ensure that the large language model can reference this structured data.\n\n4. **Develop Interaction Logic**:\n   - Develop the logic for the large language model to interact with the web page using the stored data.\n   - Implement the functionality for the large language model to operate a mouse and keyboard.\n\nPlease confirm if you have any specific requirements or preferences for how the vision model should be integrated into the `ImageTestV1.py` script. Once confirmed, I will proceed with the integration."
        ],
        [
            " Okay, so I hope this makes sense, but my, what I envision is the slow metamorphosis of imagetestV1 into Lucid_Vision with amazing features, but not exactly. So let me explain. You have summarized everything perfectly. I cannot overstate that. That&#x27;s great, great job. Yes, okay, so for the next step, let us, what I want you to do is I want you to contextualize the Lucid Vision code. And I want you to use the model miniCPM-lama3-v-2_5. I want you to use the code for that specific model to run all of the cropped images through. And I want you to ask that model the question, I want you to send a specific message to that model. And I want that specific message at the top of the code, so I want to be able to edit it easily like the other parameters. But I want the question to be &quot;what is the text in this image or what does this image describe?&quot; And then I want you to take that response, and I want you to include it in a JSON file that effectively has the image name, okay, so the name of the image, the cropped image that it was fed. I&#x27;m looking through the outputs here. Yeah, so it will have the image name, and then it will have the coordinates for the box, and then it will have the miniCPM model description all as a JSON output.",
            "Thank you for the detailed explanation. Let&#x27;s proceed with the integration step-by-step.\n\n### Plan for Integration\n\n1. **Contextualize the Lucid Vision Code**:\n   - Identify the necessary functions and methods from the `Lucid_Vision` extension to process images using the `miniCPM-lama3-v-2_5` model.\n\n2. **Update `ImageTestV1.py`**:\n   - Modify the `crop_images` function to send each cropped image to the `miniCPM-lama3-v-2_5` model for processing.\n   - Integrate the vision model processing code from the `Lucid_Vision` extension into the `ImageTestV1.py` script.\n\n3. **Store Results**:\n   - Create a structured format (e.g., a JSON file) to store the results from the vision model.\n   - Ensure that the large language model can reference this structured data.\n\n### Updated `ImageTestV1.py` Script\n\nHere&#x27;s the updated `ImageTestV1.py` script with the integration of the `miniCPM-lama3-v-2_5` model:\n\n```python\nimport torch\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport json\nimport os\n\n# Configurable variables\nMONITOR_INDEX = 1  # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nSCREENSHOT_FILENAME = &quot;screenshot.png&quot;  # Filename to save the screenshot as\nTEXT_QUERIES = &quot;clickable icons,hyperlinks&quot;  # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.16  # Score threshold for filtering low-probability predictions\nOUTPUT_DIR = &quot;ImageOutputTest&quot;  # Directory to save the cropped images\n\n# Vision model configuration\nVISION_MODEL_NAME = &quot;minicpm_llama3&quot;\nVISION_MODEL_ID = &quot;openbmb/MiniCPM-Llama3-V-2_5&quot;\nVISION_MODEL_QUESTION = &quot;What is the text in this image or what does this image describe?&quot;\n\n# Use GPU if available\nif torch.cuda.is_available():\n    device = torch.device(&quot;cuda&quot;)\nelse:\n    device = torch.device(&quot;cpu&quot;)\n\n# Load the Owlv2 model and processor\nmodel = Owlv2ForObjectDetection.from_pretrained(&quot;google/owlv2-base-patch16-ensemble&quot;).to(device)\nprocessor = Owlv2Processor.from_pretrained(&quot;google/owlv2-base-patch16-ensemble&quot;)\n\n# Load the miniCPM-lama3-v-2_5 model and tokenizer\nminicpm_llama_model = AutoModel.from_pretrained(VISION_MODEL_ID).to(device)\nminicpm_llama_tokenizer = AutoTokenizer.from_pretrained(VISION_MODEL_ID)\n\ndef take_screenshot(monitor_index=MONITOR_INDEX, output_filename=SCREENSHOT_FILENAME):\n    &quot;&quot;&quot;\n    Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n    Args:\n        monitor_index (int): The index of the monitor to capture (default is MONITOR_INDEX).\n        output_filename (str): The filename to save the screenshot as (default is SCREENSHOT_FILENAME).\n    &quot;&quot;&quot;\n    # Get information about all monitors\n    monitors = screeninfo.get_monitors()\n\n    if monitor_index &gt;= len(monitors):\n        raise ValueError(f&quot;Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.&quot;)\n\n    # Get the region of the specified monitor\n    monitor = monitors[monitor_index]\n    region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n    # Take the screenshot\n    screenshot = pyautogui.screenshot(region=region)\n\n    # Save the screenshot\n    screenshot.save(output_filename)\n    print(f&quot;Screenshot saved as {output_filename}&quot;)\n\n    return output_filename\n\ndef query_image(img_path, text_queries=TEXT_QUERIES, score_threshold=SCORE_THRESHOLD):\n    &quot;&quot;&quot;\n    Processes the image using the Owlv2 model to detect objects based on text queries.\n\n    Args:\n        img_path (str): The path to the input image.\n        text_queries (str): Comma-separated text queries for object detection (default is TEXT_QUERIES).\n        score_threshold (float): The score threshold for filtering low-probability predictions (default is SCORE_THRESHOLD).\n    &quot;&quot;&quot;\n    img = Image.open(img_path)\n    text_queries = text_queries.split(&quot;,&quot;)\n\n    size = max(img.size)\n    target_sizes = torch.Tensor([[size, size]])\n    inputs = processor(text=text_queries, images=img, return_tensors=&quot;pt&quot;).to(device)\n\n    with torch.no_grad():\n        outputs = model(**inputs)\n\n    outputs.logits = outputs.logits.cpu()\n    outputs.pred_boxes = outputs.pred_boxes.cpu()\n    results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n    boxes, scores, labels = results[0][&quot;boxes&quot;], results[0][&quot;scores&quot;], results[0][&quot;labels&quot;]\n\n    result_labels = []\n    img_pil = img.copy()\n    draw = ImageDraw.Draw(img_pil)\n\n    for box, score, label in zip(boxes, scores, labels):\n        box = [int(i) for i in box.tolist()]\n        if score &lt; score_threshold:\n            continue\n        result_labels.append([box, text_queries[label.item()]])\n        draw.rectangle(box, outline=&quot;red&quot;, width=2)\n        # Do not draw the text label within the bounding box\n        # draw.text((box[0], box[1]), text_queries[label.item()], fill=&quot;red&quot;)\n\n    # Save the annotated image for debugging\n    annotated_image_path = &quot;annotated_image.png&quot;\n    img_pil.save(annotated_image_path)\n\n    # Save the JSON output to a file\n    json_output_path = &quot;output.json&quot;\n    with open(json_output_path, &#x27;w&#x27;) as f:\n        json.dump(result_labels, f)\n\n    return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir=OUTPUT_DIR):\n    &quot;&quot;&quot;\n    Crops out the detected objects from the input image based on the bounding box coordinates.\n\n    Args:\n        image_path (str): The path to the input image.\n        json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n        output_dir (str): The directory to save the cropped images (default is OUTPUT_DIR).\n    &quot;&quot;&quot;\n    with open(json_output_path, &#x27;r&#x27;) as f:\n        data = json.load(f)\n\n    img = Image.open(image_path)\n\n    results = []\n\n    for i, item in enumerate(data):\n        box = item[0]\n        label = item[1]\n        cropped_img = img.crop(box)\n        cropped_img_path = f&quot;{output_dir}/{label}_{i}.png&quot;\n        cropped_img.save(cropped_img_path)\n\n        # Process the cropped image with the miniCPM-lama3-v-2_5 model\n        vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n        # Store the results in a structured format\n        results.append({\n            &quot;image_name&quot;: f&quot;{label}_{i}.png&quot;,\n            &quot;coordinates&quot;: box,\n            &quot;description&quot;: vision_model_response\n        })\n\n    # Save the results to a JSON file\n    results_json_path = f&quot;{output_dir}/results.json&quot;\n    with open(results_json_path, &#x27;w&#x27;) as f:\n        json.dump(results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n    &quot;&quot;&quot;\n    Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n    Args:\n        image_path (str): The path to the input image.\n        question (str): The question to ask the vision model.\n\n    Returns:\n        str: The response from the vision model.\n    &quot;&quot;&quot;\n    image = Image.open(image_path).convert(&quot;RGB&quot;)\n    messages = [\n        {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: f&quot;{question}&quot;}\n    ]\n    # Define the generation arguments\n    generation_args = {\n        &quot;max_new_tokens&quot;: 896,\n        &quot;repetition_penalty&quot;: 1.05,\n        &quot;num_beams&quot;: 3,\n        &quot;top_p&quot;: 0.8,\n        &quot;top_k&quot;: 1,\n        &quot;temperature&quot;: 0.7,\n        &quot;sampling&quot;: True,\n    }\n    if &quot;int4&quot; in VISION_MODEL_ID:\n        # Disable streaming for the 4-bit model\n        generation_args[&quot;stream&quot;] = False\n        # Use the model.chat method with streaming enabled\n        vision_model_response = &quot;&quot;\n        for new_text in minicpm_llama_model.chat(\n                image=image,\n                msgs=messages,\n                tokenizer=minicpm_llama_tokenizer,\n                **generation_args\n        ):\n            vision_model_response += new_text\n            print(new_text, flush=True, end=&#x27;&#x27;)\n    else:\n        minicpm_llama_model.to(device)\n        vision_model_response = minicpm_llama_model.chat(\n            image=image,\n            msgs=messages,\n            tokenizer=minicpm_llama_tokenizer,\n            **generation_args\n        )\n    return vision_model_response\n\nif __name__ == &quot;__main__&quot;:\n    # Take the screenshot\n    screenshot_path = take_screenshot()\n\n    # Process the image using the Owlv2 model\n    annotated_image_path, json_output_path = query_image(screenshot_path)\n\n    # Create the output directory if it doesn&#x27;t exist\n    os.makedirs(OUTPUT_DIR, exist_ok=True)\n\n    # Crop the detected objects and save them to the output directory\n    crop_images(annotated_image_path, json_output_path)\n\n    print(&quot;Processing complete. Check the output directory for the cropped images and results.&quot;)\n```\n\n### Explanation\n\n1. **Configurable Variables**:\n   - Added `VISION_MODEL_NAME`, `VISION_MODEL_ID`, and `VISION_MODEL_QUESTION` for easy configuration.\n\n2. **Model Loading**:\n   - Loaded the `miniCPM-lama3-v-2_5` model and tokenizer.\n\n3. **Function `process_with_vision_model`**:\n   - This function processes the image with the `miniCPM-lama3-v-2_5` model and returns the response.\n\n4. **Function `crop_images`**:\n   - Modified to send each cropped image to the `miniCPM-lama3-v-2_5` model for processing.\n   - Stores the results in a structured format (JSON file).\n\nPlease review the updated script and let me know if you have any further requirements or adjustments."
        ],
        [
            "Hmm, I&#x27;m getting out of memory errors:\n\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 209.56 MiB is free. Including non-PyTorch memory, this process has 23.43 GiB memory in use. Of the allocated memory 23.00 GiB is allocated by PyTorch, and 49.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)\n\n\nIs this because you are trying to send all the images at once?  Or is this because the Owlv2 model is not unloaded?  OR maybe something else?",
            "The out-of-memory error you&#x27;re encountering is likely due to the high memory usage of the models, especially when both the Owlv2 model and the miniCPM-lama3-v-2_5 model are loaded into memory simultaneously. Additionally, processing multiple images at once can also contribute to high memory usage.\n\nTo address this issue, we can take the following steps:\n\n1. **Unload the Owlv2 Model**: After processing the image and obtaining the bounding boxes, we can unload the Owlv2 model to free up memory.\n2. **Process Images Sequentially**: Instead of processing all cropped images at once, we can process them one by one to reduce memory usage.\n3. **Clear CUDA Cache**: Clear the CUDA cache after processing each image to free up memory.\n\nHere&#x27;s the updated `ImageTestV1.py` script with these changes:\n\n```python\nimport torch\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport json\nimport os\n\n# Configurable variables\nMONITOR_INDEX = 1  # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nSCREENSHOT_FILENAME = &quot;screenshot.png&quot;  # Filename to save the screenshot as\nTEXT_QUERIES = &quot;clickable icons,hyperlinks&quot;  # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.16  # Score threshold for filtering low-probability predictions\nOUTPUT_DIR = &quot;ImageOutputTest&quot;  # Directory to save the cropped images\n\n# Vision model configuration\nVISION_MODEL_NAME = &quot;minicpm_llama3&quot;\nVISION_MODEL_ID = &quot;openbmb/MiniCPM-Llama3-V-2_5&quot;\nVISION_MODEL_QUESTION = &quot;What is the text in this image or what does this image describe?&quot;\n\n# Use GPU if available\nif torch.cuda.is_available():\n    device = torch.device(&quot;cuda&quot;)\nelse:\n    device = torch.device(&quot;cpu&quot;)\n\n# Load the Owlv2 model and processor\nmodel = Owlv2ForObjectDetection.from_pretrained(&quot;google/owlv2-base-patch16-ensemble&quot;).to(device)\nprocessor = Owlv2Processor.from_pretrained(&quot;google/owlv2-base-patch16-ensemble&quot;)\n\ndef take_screenshot(monitor_index=MONITOR_INDEX, output_filename=SCREENSHOT_FILENAME):\n    &quot;&quot;&quot;\n    Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n    Args:\n        monitor_index (int): The index of the monitor to capture (default is MONITOR_INDEX).\n        output_filename (str): The filename to save the screenshot as (default is SCREENSHOT_FILENAME).\n    &quot;&quot;&quot;\n    # Get information about all monitors\n    monitors = screeninfo.get_monitors()\n\n    if monitor_index &gt;= len(monitors):\n        raise ValueError(f&quot;Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.&quot;)\n\n    # Get the region of the specified monitor\n    monitor = monitors[monitor_index]\n    region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n    # Take the screenshot\n    screenshot = pyautogui.screenshot(region=region)\n\n    # Save the screenshot\n    screenshot.save(output_filename)\n    print(f&quot;Screenshot saved as {output_filename}&quot;)\n\n    return output_filename\n\ndef query_image(img_path, text_queries=TEXT_QUERIES, score_threshold=SCORE_THRESHOLD):\n    &quot;&quot;&quot;\n    Processes the image using the Owlv2 model to detect objects based on text queries.\n\n    Args:\n        img_path (str): The path to the input image.\n        text_queries (str): Comma-separated text queries for object detection (default is TEXT_QUERIES).\n        score_threshold (float): The score threshold for filtering low-probability predictions (default is SCORE_THRESHOLD).\n    &quot;&quot;&quot;\n    img = Image.open(img_path)\n    text_queries = text_queries.split(&quot;,&quot;)\n\n    size = max(img.size)\n    target_sizes = torch.Tensor([[size, size]])\n    inputs = processor(text=text_queries, images=img, return_tensors=&quot;pt&quot;).to(device)\n\n    with torch.no_grad():\n        outputs = model(**inputs)\n\n    outputs.logits = outputs.logits.cpu()\n    outputs.pred_boxes = outputs.pred_boxes.cpu()\n    results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n    boxes, scores, labels = results[0][&quot;boxes&quot;], results[0][&quot;scores&quot;], results[0][&quot;labels&quot;]\n\n    result_labels = []\n    img_pil = img.copy()\n    draw = ImageDraw.Draw(img_pil)\n\n    for box, score, label in zip(boxes, scores, labels):\n        box = [int(i) for i in box.tolist()]\n        if score &lt; score_threshold:\n            continue\n        result_labels.append([box, text_queries[label.item()]])\n        draw.rectangle(box, outline=&quot;red&quot;, width=2)\n        # Do not draw the text label within the bounding box\n        # draw.text((box[0], box[1]), text_queries[label.item()], fill=&quot;red&quot;)\n\n    # Save the annotated image for debugging\n    annotated_image_path = &quot;annotated_image.png&quot;\n    img_pil.save(annotated_image_path)\n\n    # Save the JSON output to a file\n    json_output_path = &quot;output.json&quot;\n    with open(json_output_path, &#x27;w&#x27;) as f:\n        json.dump(result_labels, f)\n\n    # Unload the Owlv2 model to free up memory\n    del model\n    torch.cuda.empty_cache()\n\n    return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir=OUTPUT_DIR):\n    &quot;&quot;&quot;\n    Crops out the detected objects from the input image based on the bounding box coordinates.\n\n    Args:\n        image_path (str): The path to the input image.\n        json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n        output_dir (str): The directory to save the cropped images (default is OUTPUT_DIR).\n    &quot;&quot;&quot;\n    with open(json_output_path, &#x27;r&#x27;) as f:\n        data = json.load(f)\n\n    img = Image.open(image_path)\n\n    results = []\n\n    for i, item in enumerate(data):\n        box = item[0]\n        label = item[1]\n        cropped_img = img.crop(box)\n        cropped_img_path = f&quot;{output_dir}/{label}_{i}.png&quot;\n        cropped_img.save(cropped_img_path)\n\n        # Process the cropped image with the miniCPM-lama3-v-2_5 model\n        vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n        # Store the results in a structured format\n        results.append({\n            &quot;image_name&quot;: f&quot;{label}_{i}.png&quot;,\n            &quot;coordinates&quot;: box,\n            &quot;description&quot;: vision_model_response\n        })\n\n        # Clear CUDA cache after processing each image\n        torch.cuda.empty_cache()\n\n    # Save the results to a JSON file\n    results_json_path = f&quot;{output_dir}/results.json&quot;\n    with open(results_json_path, &#x27;w&#x27;) as f:\n        json.dump(results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n    &quot;&quot;&quot;\n    Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n    Args:\n        image_path (str): The path to the input image.\n        question (str): The question to ask the vision model.\n\n    Returns:\n        str: The response from the vision model.\n    &quot;&quot;&quot;\n    # Load the miniCPM-lama3-v-2_5 model and tokenizer\n    minicpm_llama_model = AutoModel.from_pretrained(VISION_MODEL_ID).to(device)\n    minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(VISION_MODEL_ID)\n\n    image = Image.open(image_path).convert(&quot;RGB&quot;)\n    messages = [\n        {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: f&quot;{question}&quot;}\n    ]\n    # Define the generation arguments\n    generation_args = {\n        &quot;max_new_tokens&quot;: 896,\n        &quot;repetition_penalty&quot;: 1.05,\n        &quot;num_beams&quot;: 3,\n        &quot;top_p&quot;: 0.8,\n        &quot;top_k&quot;: 1,\n        &quot;temperature&quot;: 0.7,\n        &quot;sampling&quot;: True,\n    }\n    if &quot;int4&quot; in VISION_MODEL_ID:\n        # Disable streaming for the 4-bit model\n        generation_args[&quot;stream&quot;] = False\n        # Use the model.chat method with streaming enabled\n        vision_model_response = &quot;&quot;\n        for new_text in minicpm_llama_model.chat(\n                image=image,\n                msgs=messages,\n                tokenizer=minicpm_llama_tokenizer,\n                **generation_args\n        ):\n            vision_model_response += new_text\n            print(new_text, flush=True, end=&#x27;&#x27;)\n    else:\n        minicpm_llama_model.to(device)\n        vision_model_response = minicpm_llama_model.chat(\n            image=image,\n            msgs=messages,\n            tokenizer=minicpm_llama_tokenizer,\n            **generation_args\n        )\n\n    # Unload the miniCPM-lama3-v-2_5 model to free up memory\n    del minicpm_llama_model\n    del minicpm_llama_tokenizer\n    torch.cuda.empty_cache()\n\n    return vision_model_response\n\nif __name__ == &quot;__main__&quot;:\n    # Take the screenshot\n    screenshot_path = take_screenshot()\n\n    # Process the image using the Owlv2 model\n    annotated_image_path, json_output_path = query_image(screenshot_path)\n\n    # Create the output directory if it doesn&#x27;t exist\n    os.makedirs(OUTPUT_DIR, exist_ok=True)\n\n    # Crop the detected objects and save them to the output directory\n    crop_images(annotated_image_path, json_output_path)\n\n    print(&quot;Processing complete. Check the output directory for the cropped images and results.&quot;)\n```\n\n### Explanation\n\n1. **Unload the Owlv2 Model**:\n   - After processing the image and obtaining the bounding boxes, the Owlv2 model is unloaded to free up memory.\n\n2. **Process Images Sequentially**:\n   - The `crop_images` function processes each cropped image one by one to reduce memory usage.\n\n3. **Clear CUDA Cache**:\n   - The CUDA cache is cleared after processing each image to free up memory.\n\nPlease review the updated script and let me know if you encounter any further issues or have additional requirements."
        ],
        [
            "I&#x27;m getting this error when I run the code:\n\n(Owlv2demo) myself@myself-System-Product-Name:~/Desktop/ComputerVision/Owlv2Models/owlv2$ python ImageTestV1.py\nScreenshot saved as screenshot.png\nTraceback (most recent call last):\n  File &quot;/home/myself/Desktop/ComputerVision/Owlv2Models/owlv2/ImageTestV1.py&quot;, line 215, in &lt;module&gt;\n    annotated_image_path, json_output_path = query_image(screenshot_path)\n                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File &quot;/home/myself/Desktop/ComputerVision/Owlv2Models/owlv2/ImageTestV1.py&quot;, line 76, in query_image\n    outputs = model(**inputs)\n              ^^^^^\nUnboundLocalError: cannot access local variable &#x27;model&#x27; where it is not associated with a value\n(Owlv2demo) myself@myself-System-Product-Name:~/Desktop/ComputerVision/Owlv2Models/owlv2$ \n\n\nAdditionally, I changed \n\nVISION_MODEL_ID = &quot;openbmb/MiniCPM-Llama3-V-2_5&quot;\n\nto \n\nVISION_MODEL_ID = &quot;/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5&quot;\n\n",
            "The error you&#x27;re encountering is due to the `model` variable being unloaded before it is used in the `query_image` function. To fix this, we need to ensure that the `model` variable is only unloaded after it has been used to process the image.\n\nAdditionally, I&#x27;ll update the script to handle the local path for the `VISION_MODEL_ID`.\n\nHere&#x27;s the corrected `ImageTestV1.py` script:\n\n```python\nimport torch\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport json\nimport os\n\n# Configurable variables\nMONITOR_INDEX = 1  # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nSCREENSHOT_FILENAME = &quot;screenshot.png&quot;  # Filename to save the screenshot as\nTEXT_QUERIES = &quot;clickable icons,hyperlinks&quot;  # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.16  # Score threshold for filtering low-probability predictions\nOUTPUT_DIR = &quot;ImageOutputTest&quot;  # Directory to save the cropped images\n\n# Vision model configuration\nVISION_MODEL_NAME = &quot;minicpm_llama3&quot;\nVISION_MODEL_ID = &quot;/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5&quot;\nVISION_MODEL_QUESTION = &quot;What is the text in this image or what does this image describe?&quot;\n\n# Use GPU if available\nif torch.cuda.is_available():\n    device = torch.device(&quot;cuda&quot;)\nelse:\n    device = torch.device(&quot;cpu&quot;)\n\n# Load the Owlv2 model and processor\nmodel = Owlv2ForObjectDetection.from_pretrained(&quot;google/owlv2-base-patch16-ensemble&quot;).to(device)\nprocessor = Owlv2Processor.from_pretrained(&quot;google/owlv2-base-patch16-ensemble&quot;)\n\ndef take_screenshot(monitor_index=MONITOR_INDEX, output_filename=SCREENSHOT_FILENAME):\n    &quot;&quot;&quot;\n    Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n    Args:\n        monitor_index (int): The index of the monitor to capture (default is MONITOR_INDEX).\n        output_filename (str): The filename to save the screenshot as (default is SCREENSHOT_FILENAME).\n    &quot;&quot;&quot;\n    # Get information about all monitors\n    monitors = screeninfo.get_monitors()\n\n    if monitor_index &gt;= len(monitors):\n        raise ValueError(f&quot;Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.&quot;)\n\n    # Get the region of the specified monitor\n    monitor = monitors[monitor_index]\n    region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n    # Take the screenshot\n    screenshot = pyautogui.screenshot(region=region)\n\n    # Save the screenshot\n    screenshot.save(output_filename)\n    print(f&quot;Screenshot saved as {output_filename}&quot;)\n\n    return output_filename\n\ndef query_image(img_path, text_queries=TEXT_QUERIES, score_threshold=SCORE_THRESHOLD):\n    &quot;&quot;&quot;\n    Processes the image using the Owlv2 model to detect objects based on text queries.\n\n    Args:\n        img_path (str): The path to the input image.\n        text_queries (str): Comma-separated text queries for object detection (default is TEXT_QUERIES).\n        score_threshold (float): The score threshold for filtering low-probability predictions (default is SCORE_THRESHOLD).\n    &quot;&quot;&quot;\n    img = Image.open(img_path)\n    text_queries = text_queries.split(&quot;,&quot;)\n\n    size = max(img.size)\n    target_sizes = torch.Tensor([[size, size]])\n    inputs = processor(text=text_queries, images=img, return_tensors=&quot;pt&quot;).to(device)\n\n    with torch.no_grad():\n        outputs = model(**inputs)\n\n    outputs.logits = outputs.logits.cpu()\n    outputs.pred_boxes = outputs.pred_boxes.cpu()\n    results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n    boxes, scores, labels = results[0][&quot;boxes&quot;], results[0][&quot;scores&quot;], results[0][&quot;labels&quot;]\n\n    result_labels = []\n    img_pil = img.copy()\n    draw = ImageDraw.Draw(img_pil)\n\n    for box, score, label in zip(boxes, scores, labels):\n        box = [int(i) for i in box.tolist()]\n        if score &lt; score_threshold:\n            continue\n        result_labels.append([box, text_queries[label.item()]])\n        draw.rectangle(box, outline=&quot;red&quot;, width=2)\n        # Do not draw the text label within the bounding box\n        # draw.text((box[0], box[1]), text_queries[label.item()], fill=&quot;red&quot;)\n\n    # Save the annotated image for debugging\n    annotated_image_path = &quot;annotated_image.png&quot;\n    img_pil.save(annotated_image_path)\n\n    # Save the JSON output to a file\n    json_output_path = &quot;output.json&quot;\n    with open(json_output_path, &#x27;w&#x27;) as f:\n        json.dump(result_labels, f)\n\n    return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir=OUTPUT_DIR):\n    &quot;&quot;&quot;\n    Crops out the detected objects from the input image based on the bounding box coordinates.\n\n    Args:\n        image_path (str): The path to the input image.\n        json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n        output_dir (str): The directory to save the cropped images (default is OUTPUT_DIR).\n    &quot;&quot;&quot;\n    with open(json_output_path, &#x27;r&#x27;) as f:\n        data = json.load(f)\n\n    img = Image.open(image_path)\n\n    results = []\n\n    for i, item in enumerate(data):\n        box = item[0]\n        label = item[1]\n        cropped_img = img.crop(box)\n        cropped_img_path = f&quot;{output_dir}/{label}_{i}.png&quot;\n        cropped_img.save(cropped_img_path)\n\n        # Process the cropped image with the miniCPM-lama3-v-2_5 model\n        vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n        # Store the results in a structured format\n        results.append({\n            &quot;image_name&quot;: f&quot;{label}_{i}.png&quot;,\n            &quot;coordinates&quot;: box,\n            &quot;description&quot;: vision_model_response\n        })\n\n        # Clear CUDA cache after processing each image\n        torch.cuda.empty_cache()\n\n    # Save the results to a JSON file\n    results_json_path = f&quot;{output_dir}/results.json&quot;\n    with open(results_json_path, &#x27;w&#x27;) as f:\n        json.dump(results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n    &quot;&quot;&quot;\n    Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n    Args:\n        image_path (str): The path to the input image.\n        question (str): The question to ask the vision model.\n\n    Returns:\n        str: The response from the vision model.\n    &quot;&quot;&quot;\n    # Load the miniCPM-lama3-v-2_5 model and tokenizer\n    minicpm_llama_model = AutoModel.from_pretrained(VISION_MODEL_ID).to(device)\n    minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(VISION_MODEL_ID)\n\n    image = Image.open(image_path).convert(&quot;RGB&quot;)\n    messages = [\n        {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: f&quot;{question}&quot;}\n    ]\n    # Define the generation arguments\n    generation_args = {\n        &quot;max_new_tokens&quot;: 896,\n        &quot;repetition_penalty&quot;: 1.05,\n        &quot;num_beams&quot;: 3,\n        &quot;top_p&quot;: 0.8,\n        &quot;top_k&quot;: 1,\n        &quot;temperature&quot;: 0.7,\n        &quot;sampling&quot;: True,\n    }\n    if &quot;int4&quot; in VISION_MODEL_ID:\n        # Disable streaming for the 4-bit model\n        generation_args[&quot;stream&quot;] = False\n        # Use the model.chat method with streaming enabled\n        vision_model_response = &quot;&quot;\n        for new_text in minicpm_llama_model.chat(\n                image=image,\n                msgs=messages,\n                tokenizer=minicpm_llama_tokenizer,\n                **generation_args\n        ):\n            vision_model_response += new_text\n            print(new_text, flush=True, end=&#x27;&#x27;)\n    else:\n        minicpm_llama_model.to(device)\n        vision_model_response = minicpm_llama_model.chat(\n            image=image,\n            msgs=messages,\n            tokenizer=minicpm_llama_tokenizer,\n            **generation_args\n        )\n\n    # Unload the miniCPM-lama3-v-2_5 model to free up memory\n    del minicpm_llama_model\n    del minicpm_llama_tokenizer\n    torch.cuda.empty_cache()\n\n    return vision_model_response\n\nif __name__ == &quot;__main__&quot;:\n    # Take the screenshot\n    screenshot_path = take_screenshot()\n\n    # Process the image using the Owlv2 model\n    annotated_image_path, json_output_path = query_image(screenshot_path)\n\n    # Unload the Owlv2 model to free up memory\n    del model\n    torch.cuda.empty_cache()\n\n    # Create the output directory if it doesn&#x27;t exist\n    os.makedirs(OUTPUT_DIR, exist_ok=True)\n\n    # Crop the detected objects and save them to the output directory\n    crop_images(annotated_image_path, json_output_path)\n\n    print(&quot;Processing complete. Check the output directory for the cropped images and results.&quot;)\n```\n\n### Explanation\n\n1. **Unload the Owlv2 Model**:\n   - The Owlv2 model is unloaded after processing the image and obtaining the bounding boxes.\n\n2. **Process Images Sequentially**:\n   - The `crop_images` function processes each cropped image one by one to reduce memory usage.\n\n3. **Clear CUDA Cache**:\n   - The CUDA cache is cleared after processing each image to free up memory.\n\n4. **Handle Local Path for Vision Model**:\n   - The script now handles the local path for the `VISION_MODEL_ID`.\n\nPlease review the updated script and let me know if you encounter any further issues or have additional requirements."
        ],
        [
            "Hmm, I&#x27;m still getting oom errors:\n\n(Owlv2demo) myself@myself-System-Product-Name:~/Desktop/ComputerVision/Owlv2Models/owlv2$ python ImageTestV1.py\nScreenshot saved as screenshot.png\nThe repository for /home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5 contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5.\nYou can avoid this prompt in future by passing the argument `trust_remote_code=True`.\n\nDo you wish to run the custom code? [y/N] y\nThe repository for /home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5 contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5.\nYou can avoid this prompt in future by passing the argument `trust_remote_code=True`.\n\nDo you wish to run the custom code? [y/N] y\nLoading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 7/7 [00:06&lt;00:00,  1.01it/s]\nTraceback (most recent call last):\n  File &quot;/home/myself/Desktop/ComputerVision/Owlv2Models/owlv2/ImageTestV1.py&quot;, line 220, in &lt;module&gt;\n    crop_images(annotated_image_path, json_output_path)\n  File &quot;/home/myself/Desktop/ComputerVision/Owlv2Models/owlv2/ImageTestV1.py&quot;, line 130, in crop_images\n    vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File &quot;/home/myself/Desktop/ComputerVision/Owlv2Models/owlv2/ImageTestV1.py&quot;, line 159, in process_with_vision_model\n    minicpm_llama_model = AutoModel.from_pretrained(VISION_MODEL_ID).to(device)\n                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File &quot;/home/myself/miniconda3/envs/Owlv2demo/lib/python3.11/site-packages/transformers/modeling_utils.py&quot;, line 2884, in to\n    return super().to(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File &quot;/home/myself/miniconda3/envs/Owlv2demo/lib/python3.11/site-packages/torch/nn/modules/module.py&quot;, line 1174, in to\n    return self._apply(convert)\n           ^^^^^^^^^^^^^^^^^^^^\n  File &quot;/home/myself/miniconda3/envs/Owlv2demo/lib/python3.11/site-packages/torch/nn/modules/module.py&quot;, line 780, in _apply\n    module._apply(fn)\n  File &quot;/home/myself/miniconda3/envs/Owlv2demo/lib/python3.11/site-packages/torch/nn/modules/module.py&quot;, line 780, in _apply\n    module._apply(fn)\n  File &quot;/home/myself/miniconda3/envs/Owlv2demo/lib/python3.11/site-packages/torch/nn/modules/module.py&quot;, line 780, in _apply\n    module._apply(fn)\n  [Previous line repeated 3 more times]\n  File &quot;/home/myself/miniconda3/envs/Owlv2demo/lib/python3.11/site-packages/torch/nn/modules/module.py&quot;, line 805, in _apply\n    param_applied = fn(param)\n                    ^^^^^^^^^\n  File &quot;/home/myself/miniconda3/envs/Owlv2demo/lib/python3.11/site-packages/torch/nn/modules/module.py&quot;, line 1160, in convert\n    return t.to(\n           ^^^^^\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 50.00 MiB is free. Process 55561 has 452.00 MiB memory in use. Including non-PyTorch memory, this process has 23.14 GiB memory in use. Of the allocated memory 22.64 GiB is allocated by PyTorch, and 53.68 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)\n(Owlv2demo) myself@myself-System-Product-Name:~/Desktop/ComputerVision/Owlv2Models/owlv2$ \n\n\nWhen I run the Lucid_Vision extension by itself in textgen I can run that model just fine without oom errors.  That makes me wonder if you are sending the images one by one and unloading the owl model? \n\nLet&#x27;s look at the output of the error compared to the previous error: okay we can see that the screenshot is being saved.   Good, I can verify the file is being saved.  Now if we look at the code the screenshot happens and then the owl model should be loaded and fed the screenshot image with it outputting the .json and annotated_image.png.  That model is super small like 1 gig small, so there should not be any oom errors on my 24GB card that is empty and only running the code you are writing.  \n\nSo are you sure the code is following the correct sequence, screenshot, owl (annotated_image.png and output.json), minicpm model being fed one image at a time in sequence.",
            "Thank you for the detailed explanation. It seems that the sequence of operations might not be correctly handling the memory management, leading to out-of-memory errors. Let&#x27;s ensure that the Owlv2 model is unloaded after processing the image and that the miniCPM-lama3-v-2_5 model is loaded and unloaded correctly for each cropped image.\n\nHere&#x27;s the updated `ImageTestV1.py` script with improved memory management:\n\n```python\nimport torch\nimport pyautogui\nimport screeninfo\nfrom transformers import Owlv2Processor, Owlv2ForObjectDetection, AutoModel, AutoTokenizer\nfrom PIL import Image, ImageDraw\nimport json\nimport os\n\n# Configurable variables\nMONITOR_INDEX = 1  # Index of the monitor to capture (0 for the first monitor, 1 for the second monitor, etc.)\nSCREENSHOT_FILENAME = &quot;screenshot.png&quot;  # Filename to save the screenshot as\nTEXT_QUERIES = &quot;clickable icons,hyperlinks&quot;  # Comma-separated text queries for object detection\nSCORE_THRESHOLD = 0.16  # Score threshold for filtering low-probability predictions\nOUTPUT_DIR = &quot;ImageOutputTest&quot;  # Directory to save the cropped images\n\n# Vision model configuration\nVISION_MODEL_NAME = &quot;minicpm_llama3&quot;\nVISION_MODEL_ID = &quot;/home/myself/Desktop/miniCPM_llava3V/MiniCPM-Llama3-V-2_5&quot;\nVISION_MODEL_QUESTION = &quot;What is the text in this image or what does this image describe?&quot;\n\n# Use GPU if available\nif torch.cuda.is_available():\n    device = torch.device(&quot;cuda&quot;)\nelse:\n    device = torch.device(&quot;cpu&quot;)\n\ndef take_screenshot(monitor_index=MONITOR_INDEX, output_filename=SCREENSHOT_FILENAME):\n    &quot;&quot;&quot;\n    Takes a screenshot of a specific monitor in a multi-monitor setup.\n\n    Args:\n        monitor_index (int): The index of the monitor to capture (default is MONITOR_INDEX).\n        output_filename (str): The filename to save the screenshot as (default is SCREENSHOT_FILENAME).\n    &quot;&quot;&quot;\n    # Get information about all monitors\n    monitors = screeninfo.get_monitors()\n\n    if monitor_index &gt;= len(monitors):\n        raise ValueError(f&quot;Monitor index {monitor_index} is out of range. There are only {len(monitors)} monitors.&quot;)\n\n    # Get the region of the specified monitor\n    monitor = monitors[monitor_index]\n    region = (monitor.x, monitor.y, monitor.width, monitor.height)\n\n    # Take the screenshot\n    screenshot = pyautogui.screenshot(region=region)\n\n    # Save the screenshot\n    screenshot.save(output_filename)\n    print(f&quot;Screenshot saved as {output_filename}&quot;)\n\n    return output_filename\n\ndef query_image(img_path, text_queries=TEXT_QUERIES, score_threshold=SCORE_THRESHOLD):\n    &quot;&quot;&quot;\n    Processes the image using the Owlv2 model to detect objects based on text queries.\n\n    Args:\n        img_path (str): The path to the input image.\n        text_queries (str): Comma-separated text queries for object detection (default is TEXT_QUERIES).\n        score_threshold (float): The score threshold for filtering low-probability predictions (default is SCORE_THRESHOLD).\n    &quot;&quot;&quot;\n    # Load the Owlv2 model and processor\n    model = Owlv2ForObjectDetection.from_pretrained(&quot;google/owlv2-base-patch16-ensemble&quot;).to(device)\n    processor = Owlv2Processor.from_pretrained(&quot;google/owlv2-base-patch16-ensemble&quot;)\n\n    img = Image.open(img_path)\n    text_queries = text_queries.split(&quot;,&quot;)\n\n    size = max(img.size)\n    target_sizes = torch.Tensor([[size, size]])\n    inputs = processor(text=text_queries, images=img, return_tensors=&quot;pt&quot;).to(device)\n\n    with torch.no_grad():\n        outputs = model(**inputs)\n\n    outputs.logits = outputs.logits.cpu()\n    outputs.pred_boxes = outputs.pred_boxes.cpu()\n    results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)\n    boxes, scores, labels = results[0][&quot;boxes&quot;], results[0][&quot;scores&quot;], results[0][&quot;labels&quot;]\n\n    result_labels = []\n    img_pil = img.copy()\n    draw = ImageDraw.Draw(img_pil)\n\n    for box, score, label in zip(boxes, scores, labels):\n        box = [int(i) for i in box.tolist()]\n        if score &lt; score_threshold:\n            continue\n        result_labels.append([box, text_queries[label.item()]])\n        draw.rectangle(box, outline=&quot;red&quot;, width=2)\n        # Do not draw the text label within the bounding box\n        # draw.text((box[0], box[1]), text_queries[label.item()], fill=&quot;red&quot;)\n\n    # Save the annotated image for debugging\n    annotated_image_path = &quot;annotated_image.png&quot;\n    img_pil.save(annotated_image_path)\n\n    # Save the JSON output to a file\n    json_output_path = &quot;output.json&quot;\n    with open(json_output_path, &#x27;w&#x27;) as f:\n        json.dump(result_labels, f)\n\n    # Unload the Owlv2 model to free up memory\n    del model\n    torch.cuda.empty_cache()\n\n    return annotated_image_path, json_output_path\n\ndef crop_images(image_path, json_output_path, output_dir=OUTPUT_DIR):\n    &quot;&quot;&quot;\n    Crops out the detected objects from the input image based on the bounding box coordinates.\n\n    Args:\n        image_path (str): The path to the input image.\n        json_output_path (str): The path to the JSON file containing the bounding box coordinates.\n        output_dir (str): The directory to save the cropped images (default is OUTPUT_DIR).\n    &quot;&quot;&quot;\n    with open(json_output_path, &#x27;r&#x27;) as f:\n        data = json.load(f)\n\n    img = Image.open(image_path)\n\n    results = []\n\n    for i, item in enumerate(data):\n        box = item[0]\n        label = item[1]\n        cropped_img = img.crop(box)\n        cropped_img_path = f&quot;{output_dir}/{label}_{i}.png&quot;\n        cropped_img.save(cropped_img_path)\n\n        # Process the cropped image with the miniCPM-lama3-v-2_5 model\n        vision_model_response = process_with_vision_model(cropped_img_path, VISION_MODEL_QUESTION)\n\n        # Store the results in a structured format\n        results.append({\n            &quot;image_name&quot;: f&quot;{label}_{i}.png&quot;,\n            &quot;coordinates&quot;: box,\n            &quot;description&quot;: vision_model_response\n        })\n\n        # Clear CUDA cache after processing each image\n        torch.cuda.empty_cache()\n\n    # Save the results to a JSON file\n    results_json_path = f&quot;{output_dir}/results.json&quot;\n    with open(results_json_path, &#x27;w&#x27;) as f:\n        json.dump(results, f, indent=4)\n\ndef process_with_vision_model(image_path, question):\n    &quot;&quot;&quot;\n    Processes the image with the miniCPM-lama3-v-2_5 model and returns the response.\n\n    Args:\n        image_path (str): The path to the input image.\n        question (str): The question to ask the vision model.\n\n    Returns:\n        str: The response from the vision model.\n    &quot;&quot;&quot;\n    # Load the miniCPM-lama3-v-2_5 model and tokenizer\n    minicpm_llama_model = AutoModel.from_pretrained(VISION_MODEL_ID, trust_remote_code=True).to(device)\n    minicpm_llama_tokenizer = AutoTokenizer.from_pretrained(VISION_MODEL_ID, trust_remote_code=True)\n\n    image = Image.open(image_path).convert(&quot;RGB&quot;)\n    messages = [\n        {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: f&quot;{question}&quot;}\n    ]\n    # Define the generation arguments\n    generation_args = {\n        &quot;max_new_tokens&quot;: 896,\n        &quot;repetition_penalty&quot;: 1.05,\n        &quot;num_beams&quot;: 3,\n        &quot;top_p&quot;: 0.8,\n        &quot;top_k&quot;: 1,\n        &quot;temperature&quot;: 0.7,\n        &quot;sampling&quot;: True,\n    }\n    if &quot;int4&quot; in VISION_MODEL_ID:\n        # Disable streaming for the 4-bit model\n        generation_args[&quot;stream&quot;] = False\n        # Use the model.chat method with streaming enabled\n        vision_model_response = &quot;&quot;\n        for new_text in minicpm_llama_model.chat(\n                image=image,\n                msgs=messages,\n                tokenizer=minicpm_llama_tokenizer,\n                **generation_args\n        ):\n            vision_model_response += new_text\n            print(new_text, flush=True, end=&#x27;&#x27;)\n    else:\n        minicpm_llama_model.to(device)\n        vision_model_response = minicpm_llama_model.chat(\n            image=image,\n            msgs=messages,\n            tokenizer=minicpm_llama_tokenizer,\n            **generation_args\n        )\n\n    # Unload the miniCPM-lama3-v-2_5 model to free up memory\n    del minicpm_llama_model\n    del minicpm_llama_tokenizer\n    torch.cuda.empty_cache()\n\n    return vision_model_response\n\nif __name__ == &quot;__main__&quot;:\n    # Take the screenshot\n    screenshot_path = take_screenshot()\n\n    # Process the image using the Owlv2 model\n    annotated_image_path, json_output_path = query_image(screenshot_path)\n\n    # Create the output directory if it doesn&#x27;t exist\n    os.makedirs(OUTPUT_DIR, exist_ok=True)\n\n    # Crop the detected objects and save them to the output directory\n    crop_images(annotated_image_path, json_output_path)\n\n    print(&quot;Processing complete. Check the output directory for the cropped images and results.&quot;)\n```\n\n### Explanation\n\n1. **Unload the Owlv2 Model**:\n   - The Owlv2 model is unloaded after processing the image and obtaining the bounding boxes.\n\n2. **Process Images Sequentially**:\n   - The `crop_images` function processes each cropped image one by one to reduce memory usage.\n\n3. **Clear CUDA Cache**:\n   - The CUDA cache is cleared after processing each image to free up memory.\n\n4. **Handle Local Path for Vision Model**:\n   - The script now handles the local path for the `VISION_MODEL_ID` and includes `trust_remote_code=True` to avoid the prompt for running custom code.\n\nPlease review the updated script and let me know if you encounter any further issues or have additional requirements."
        ]
    ]
}