From 310ec39320d65f4170cad0994c8a04b30e409b81 Mon Sep 17 00:00:00 2001
From: Raghavan Muthuregunathan
Date: Sat, 23 Dec 2023 19:12:34 -0800
Subject: [PATCH 1/6] Article on how to build an application talking to images with Gemini and Spire

---
 ...o-images-in-pdfs-with-Gemini-and-spire.idx | 104 ++++++++++++++++++
 1 file changed, 104 insertions(+)
 create mode 100644 tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.idx

diff --git a/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.idx b/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.idx
new file mode 100644
index 00000000..48454358
--- /dev/null
+++ b/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.idx
@@ -0,0 +1,104 @@
---
title: "Talk to the images in PDFs with Gemini"
description: "Leverage the power of Gemini Pro Vision and Spire to answer natural language questions based on images in PDFs"
image: ""
authorUsername: "raghavan848"
---

Talk to the images in PDFs with Gemini
Introduction
With the advent of Retrieval Augmented Generation (RAG), many applications combine vector search with large language models to answer natural language questions based on the text in PDFs. In this article, we show how to use Gemini to answer natural language questions based on the images in a PDF.

This article leverages Spire.PDF and Gemini Pro Vision to build a multimodal RAG application that can talk to the images in a PDF and answer natural language questions about them.
Existing solutions:
There are example multimodal RAG solutions that use GPT-4V and LlamaIndex [1], an example that uses LangChain [2], and a recent tutorial based on Gemini and LlamaIndex [3]. However, there are very few examples that combine Spire.PDF and Gemini Pro Vision. For scholarly purposes, it is important to learn to extract images from PDFs ourselves and to call Gemini directly, without relying on popular off-the-shelf orchestration libraries.


### Step-by-Step Tutorial: Flask Web Application for Image-Based AI Responses

This tutorial explains a Flask web application that processes image and text inputs, extracts images from PDFs, and generates responses using the Google Gemini model.

#### Requirements
To run this application, a few Python libraries and a Google API key are required. Here is an example of what your `requirements.txt` file should include:

```
Flask
google-generativeai
Pillow
Spire.PDF
werkzeug
```

#### Gemini API Key
The Gemini API key is required for accessing Google's Gemini model. Keep this key secure and never share it publicly. In the code, it is set as:

```python
GOOGLE_API_KEY="YOUR_GOOGLE_API_KEY"
```

Replace `"YOUR_GOOGLE_API_KEY"` with your actual Google API key.

#### High-Level Pseudo Code
1. **Import necessary libraries**: Flask as the web framework, Pillow for image processing, Google's `google-generativeai` client for Gemini, and Spire.PDF for PDF processing.
2. **Initialize the Flask app** and configure the upload folder and allowed file extensions.
3. **Define helper functions**:
   - `allowed_file(filename)`: Check whether the uploaded file is of an allowed type.
   - `extract_images_from_pdf(pdf_path, output_directory)`: Extract images from a PDF and save them (a sketch of this helper follows the list below).
4. **Set up the Gemini model** with the API key.
5. **Define Flask routes**:
   - `index()`: Handle GET and POST requests. On POST, process the uploaded files, extract text and images, and use the Gemini model to generate a response.
6. **Run the Flask app**.
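For readers who want to see what step 3 could look like in practice, here is a minimal sketch of `extract_images_from_pdf`. It follows the image-extraction pattern from Spire.PDF's published Python examples (`PdfDocument`, `PdfImageHelper`); the exact class and method names may differ between Spire.PDF releases, so treat this as an illustrative sketch rather than the repository's actual implementation.

```python
# Illustrative sketch of extract_images_from_pdf using Spire.PDF.
# The PdfDocument / PdfImageHelper calls follow Spire.PDF's published Python
# examples; verify them against the Spire.PDF version pinned in requirements.txt.
import os
from spire.pdf import PdfDocument, PdfImageHelper  # import path may vary by release

def extract_images_from_pdf(pdf_path, output_directory):
    """Save every embedded image in pdf_path into output_directory and return the paths."""
    os.makedirs(output_directory, exist_ok=True)
    doc = PdfDocument()
    doc.LoadFromFile(pdf_path)

    helper = PdfImageHelper()
    saved_paths = []
    for page_index in range(doc.Pages.Count):
        page = doc.Pages.get_Item(page_index)
        for image_index, image_info in enumerate(helper.GetImagesInfo(page)):
            out_path = os.path.join(output_directory, f"page{page_index}_img{image_index}.png")
            image_info.Image.Save(out_path)  # assumed save signature
            saved_paths.append(out_path)

    doc.Close()
    return saved_paths
```

The returned paths are what the `index()` route can later open with Pillow and pass to Gemini alongside the user's question.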
#### Explaining the Prompt
The prompt is a string that tells the Gemini model how to treat the input images and text. It sets the context so the model knows it should answer the question using only the provided images, and it specifies how the response should be formatted and what it must contain, such as a citation of the image used for the answer.

#### Code Snippets

##### Configure Flask and File Uploads
```python
app = Flask(__name__)
UPLOAD_FOLDER = 'uploads/'
ALLOWED_EXTENSIONS = {'png', 'jpg', 'jpeg', 'gif'}
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
```

##### Extract Images from PDF
```python
def extract_images_from_pdf(pdf_path, output_directory):
    # PDF processing code here...
```

##### Gemini AI Model Configuration
```python
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel('gemini-pro-vision')
```

##### Flask Route for Handling Requests
```python
@app.route('/', methods=['GET', 'POST'])
def index():
    # Code to handle file uploads and text processing...
```

##### Running the Flask App
```python
if __name__ == '__main__':
    if not os.path.exists(UPLOAD_FOLDER):
        os.makedirs(UPLOAD_FOLDER)
    app.run(debug=True)
```

This code constructs a web application where users can upload images or PDFs, type a text question, and receive an AI-generated response grounded in the images. The application uses Google's Gemini model for generating responses and Flask as the web framework.

GitHub link with working code: https://github.com/Raghavan1988/talk_to_images_with_gemini/

### Conclusion
In summary, Gemini Pro Vision provides a powerful capability for understanding images. Once we can extract images from PDFs ourselves, building a wide range of multimodal RAG applications for healthcare and research becomes straightforward.


### References
[1] https://blog.llamaindex.ai/multi-modal-rag-621de7525fea
[2] https://python.langchain.com/docs/templates/rag-chroma-multi-modal
[3] https://blog.llamaindex.ai/llamaindex-gemini-8d7c3b9ea97e
[4] https://github.com/Raghavan1988/talk_to_images_with_gemini/

From 6bad68a9387d7cd8bbd824eba4a1965b6d2c5f30 Mon Sep 17 00:00:00 2001
From: Raghavan
Date: Sat, 23 Dec 2023 19:59:37 -0800
Subject: [PATCH 2/6] Update talk-to-images-in-pdfs-with-Gemini-and-spire.idx

---
 .../talk-to-images-in-pdfs-with-Gemini-and-spire.idx | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.idx b/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.idx
index 48454358..9a31828f 100644
--- a/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.idx
+++ b/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.idx
@@ -93,6 +93,15 @@ This code constructs a web application where users can upload images or PDFs, in

GitHub link with working code: https://github.com/Raghavan1988/talk_to_images_with_gemini/

+## Demo screenshots
+Contrastive Chain of Thought is a popular prompt engineering technique [5]. After uploading a survey paper on prompt engineering techniques, let's ask the question "How much did Contrastive CoT score in Arithmetic Reasoning?"
+We can see that the application was able to extract the relevant image and answer the question using Gemini Pro Vision.
+Answer: The image shows a table with the results of the Contrastive CoT model on the Arithmetic Reasoning task.
The table shows that the model achieves a score of 79.0, which is significantly higher than the score of 27.4 achieved by the standard prompting method. This demonstrates that the Contrastive CoT model is better able to solve arithmetic reasoning problems than standard prompting.
+
+It is important to note that LLMs may hallucinate.
+
+

### Conclusion
In summary, Gemini Pro Vision provides a powerful capability for understanding images. Once we can extract images from PDFs ourselves, building a wide range of multimodal RAG applications for healthcare and research becomes straightforward.
@@ -102,3 +111,5 @@ In summary, Gemini-pro-vision provides a powerful capability to understand image
[2] https://python.langchain.com/docs/templates/rag-chroma-multi-modal
[3] https://blog.llamaindex.ai/llamaindex-gemini-8d7c3b9ea97e
[4] https://github.com/Raghavan1988/talk_to_images_with_gemini/
+[5] https://arxiv.org/abs/2311.09277
+

From 2770ce0320350b7ed89929d04069c10ffd130463 Mon Sep 17 00:00:00 2001
From: Raghavan
Date: Sat, 23 Dec 2023 20:04:55 -0800
Subject: [PATCH 3/6] Update talk-to-images-in-pdfs-with-Gemini-and-spire.idx

---
 .../en/talk-to-images-in-pdfs-with-Gemini-and-spire.idx | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.idx b/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.idx
index 9a31828f..0517520a 100644
--- a/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.idx
+++ b/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.idx
@@ -98,9 +98,10 @@ Contrastive Chain of thought is a popular prompt engineering technique. [5] Afte
We can see that the application was able to extract the relevant image and answer the question using Gemini Pro Vision.
Answer: The image shows a table with the results of the Contrastive CoT model on the Arithmetic Reasoning task. The table shows that the model achieves a score of 79.0, which is significantly higher than the score of 27.4 achieved by the standard prompting method. This demonstrates that the Contrastive CoT model is better able to solve arithmetic reasoning problems than standard prompting.
-It is important to note that LLMs may hallucinate.
-
+![Question](https://freeimage.host/i/JRdxrxt)
+![Answer](https://freeimage.host/i/JRdxUbI)

+It is important to note that LLMs may hallucinate.

### Conclusion
In summary, Gemini Pro Vision provides a powerful capability for understanding images. Once we can extract images from PDFs ourselves, building a wide range of multimodal RAG applications for healthcare and research becomes straightforward.
From 608dadd7ede864444034fedd8362ed9de5942213 Mon Sep 17 00:00:00 2001
From: Raghavan
Date: Sat, 23 Dec 2023 22:59:05 -0800
Subject: [PATCH 4/6] Rename talk-to-images-in-pdfs-with-Gemini-and-spire.idx to talk-to-images-in-pdfs-with-Gemini-and-spire.mdx

Changing file name
---
 ...spire.idx => talk-to-images-in-pdfs-with-Gemini-and-spire.mdx} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename tutorials/en/{talk-to-images-in-pdfs-with-Gemini-and-spire.idx => talk-to-images-in-pdfs-with-Gemini-and-spire.mdx} (100%)

diff --git a/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.idx b/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.mdx
similarity index 100%
rename from tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.idx
rename to tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.mdx

From f970b0a2f8df88a54d90ee9203cede99be39c80b Mon Sep 17 00:00:00 2001
From: Raghavan Muthuregunathan
Date: Mon, 5 Feb 2024 07:27:47 -0800
Subject: [PATCH 5/6] Addressed reviewer comments 1) appropriate title 2) indentation

---
 ...o-images-in-pdfs-with-Gemini-and-spire.mdx | 31 ++++++++++---------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.mdx b/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.mdx
index 0517520a..8f11327b 100644
--- a/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.mdx
+++ b/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.mdx
@@ -5,8 +5,9 @@ image: ""
authorUsername: "raghavan848"
---
-Talk to the images in PDFs with Gemini
-Introduction
+# Talk to the images in PDFs with Gemini
+
+## Introduction
With the advent of Retrieval Augmented Generation (RAG), many applications combine vector search with large language models to answer natural language questions based on the text in PDFs. In this article, we show how to use Gemini to answer natural language questions based on the images in a PDF.

This article leverages Spire.PDF and Gemini Pro Vision to build a multimodal RAG application that can talk to the images in a PDF and answer natural language questions about them.
Existing solutions:
There are example multimodal RAG solutions that use GPT-4V and LlamaIndex [1], an example that uses LangChain [2], and a recent tutorial based on Gemini and LlamaIndex [3]. However, there are very few examples that combine Spire.PDF and Gemini Pro Vision. For scholarly purposes, it is important to learn to extract images from PDFs ourselves and to call Gemini directly, without relying on popular off-the-shelf orchestration libraries.

-### Step-by-Step Tutorial: Flask Web Application for Image-Based AI Responses
+## Step-by-Step Tutorial: Flask Web Application for Image-Based AI Responses

This tutorial explains a Flask web application that processes image and text inputs, extracts images from PDFs, and generates responses using the Google Gemini model.

-#### Requirements
+### Requirements
To run this application, a few Python libraries and a Google API key are required. Here is an example of what your `requirements.txt` file should include:

```
Flask
google-generativeai
Pillow
Spire.PDF
werkzeug
```

-#### Gemini API Key
+### Gemini API Key
The Gemini API key is required for accessing Google's Gemini model. Keep this key secure and never share it publicly. In the code, it is set as:

```python
GOOGLE_API_KEY="YOUR_GOOGLE_API_KEY"
```

Replace `"YOUR_GOOGLE_API_KEY"` with your actual Google API key.
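Before moving on to the pseudo code, one practical detail the snippet above does not show: the key can be read from an environment variable instead of being hard-coded, which keeps it out of source control. The sketch below is illustrative; the environment variable name, image path, prompt wording, and question are placeholders rather than the repository's actual values.

```python
# Illustrative sketch: configure Gemini with a key from the environment and
# ask Gemini Pro Vision a question about one extracted image.
import os
import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # export GOOGLE_API_KEY=... beforehand
model = genai.GenerativeModel('gemini-pro-vision')

image = Image.open("uploads/page0_img0.png")  # an image extracted from the uploaded PDF
prompt = "Answer the question using only the attached images and cite the image you used."
question = "How much did Contrastive CoT score in Arithmetic Reasoning?"

response = model.generate_content([prompt, question, image])
print(response.text)
```

`generate_content` accepts a list that mixes text and PIL images, which is what makes the multimodal question answering in this tutorial possible.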
-#### High-Level Pseudo Code
+### High-Level Pseudo Code
1. **Import necessary libraries**: Flask as the web framework, Pillow for image processing, Google's `google-generativeai` client for Gemini, and Spire.PDF for PDF processing.
2. **Initialize the Flask app** and configure the upload folder and allowed file extensions.
3. **Define helper functions**:
   - `allowed_file(filename)`: Check whether the uploaded file is of an allowed type.
   - `extract_images_from_pdf(pdf_path, output_directory)`: Extract images from a PDF and save them.
4. **Set up the Gemini model** with the API key.
5. **Define Flask routes**:
   - `index()`: Handle GET and POST requests. On POST, process the uploaded files, extract text and images, and use the Gemini model to generate a response.
6. **Run the Flask app**.

-#### Explaining the Prompt
+### Explaining the Prompt
The prompt is a string that tells the Gemini model how to treat the input images and text. It sets the context so the model knows it should answer the question using only the provided images, and it specifies how the response should be formatted and what it must contain, such as a citation of the image used for the answer.

-#### Code Snippets
+### Code Snippets

-##### Configure Flask and File Uploads
+#### Configure Flask and File Uploads
```python
app = Flask(__name__)
UPLOAD_FOLDER = 'uploads/'
ALLOWED_EXTENSIONS = {'png', 'jpg', 'jpeg', 'gif'}
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
```

-##### Extract Images from PDF
+#### Extract Images from PDF
```python
def extract_images_from_pdf(pdf_path, output_directory):
    # PDF processing code here...
```

-##### Gemini AI Model Configuration
+#### Gemini AI Model Configuration
```python
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel('gemini-pro-vision')
```

-##### Flask Route for Handling Requests
+#### Flask Route for Handling Requests
```python
@app.route('/', methods=['GET', 'POST'])
def index():
    # Code to handle file uploads and text processing...
```

-##### Running the Flask App
+#### Running the Flask App
```python
if __name__ == '__main__':
    if not os.path.exists(UPLOAD_FOLDER):
        os.makedirs(UPLOAD_FOLDER)
    app.run(debug=True)
```
@@ -103,11 +104,11 @@ Answer: The image shows a table with the results of the Contrastive COT model o

It is important to note that LLMs may hallucinate.

-### Conclusion
+## Conclusion
In summary, Gemini Pro Vision provides a powerful capability for understanding images. Once we can extract images from PDFs ourselves, building a wide range of multimodal RAG applications for healthcare and research becomes straightforward.

-### References
+## References
[1] https://blog.llamaindex.ai/multi-modal-rag-621de7525fea
[2] https://python.langchain.com/docs/templates/rag-chroma-multi-modal
[3] https://blog.llamaindex.ai/llamaindex-gemini-8d7c3b9ea97e

From 690cae2f8e0c986ff7f3186deaabf5d44bea0dc6 Mon Sep 17 00:00:00 2001
From: Raghavan Muthuregunathan
Date: Mon, 5 Feb 2024 07:37:30 -0800
Subject: [PATCH 6/6] Removing redundant title

---
 tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.mdx | 2 --
 1 file changed, 2 deletions(-)

diff --git a/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.mdx b/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.mdx
index 8f11327b..402ac1d7 100644
--- a/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.mdx
+++ b/tutorials/en/talk-to-images-in-pdfs-with-Gemini-and-spire.mdx
@@ -5,8 +5,6 @@ image: ""
authorUsername: "raghavan848"
---
-# Talk to the images in PDFs with Gemini
-
## Introduction
With the advent of Retrieval Augmented Generation (RAG), many applications combine vector search with large language models to answer natural language questions based on the text in PDFs.
In this article, we show how to use Gemini to answer natural language questions based on the images in a PDF.