Repository Link: https://github.com/iiteen/Health-Contracts
Team Name: Wall-E
Members:
- Krish Khadria
- Kundan Kumar
- Dhruv Goyal
This project is a chatbot designed to answer user queries by extracting and processing information from PDF documents. It uses OCR to convert PDF pages to text, then creates semantic chunks for efficient information retrieval. Finally, the chatbot uses a language model to generate precise answers based on the most relevant chunks of the document.
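The OCR stage described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the helper names `normalize_whitespace` and `extract_page_texts` are hypothetical, and the 300 DPI default is an assumption. `convert_from_path` (pdf2image) and `image_to_string` (pytesseract) are the usual entry points of those libraries.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces/tabs and drop the blank lines OCR tends to leave."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def extract_page_texts(pdf_path, dpi=300):
    """Render each PDF page to an image and OCR it (hypothetical helper).

    The third-party imports live inside the function so this module still
    loads on machines where pdf2image or pytesseract is not installed.
    """
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return [normalize_whitespace(pytesseract.image_to_string(page)) for page in pages]
```

Calling `extract_page_texts("contract.pdf")` (the file name is a placeholder) would return one cleaned string per page, ready to be passed to the chunking step.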
- OCR Text Extraction: Convert PDF pages into images and extract text using Tesseract OCR.
- Semantic Chunking: Create semantic chunks from the extracted text for better context understanding.
- Similarity Search: Search for the most relevant chunks in the document based on user queries.
- Answer Generation: Generate answers using the Ollama LLM based on the identified relevant chunks.
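To make the retrieval and answer-generation steps concrete, here is a toy sketch that uses only the standard library. It deliberately swaps in word-count vectors and cosine similarity as a stand-in for real embeddings and a vector store, so it shows the shape of the idea rather than the project's implementation. All function names and the sample chunks are invented for illustration, and the final call to the language model is left out.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercased bag-of-words vector (toy stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_chunks(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    q = word_counts(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, word_counts(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, chunks):
    """Assemble the text that would be sent to the language model."""
    context = "\n\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Made-up sample chunks, standing in for text extracted from a PDF.
sample_chunks = [
    "The provider must submit claims within 90 days of service.",
    "Either party may terminate the agreement with 60 days written notice.",
    "Payment is due within 30 days of receiving a clean claim.",
]
question = "When can the agreement be terminated?"
print(build_prompt(question, top_chunks(question, sample_chunks, k=1)))
```

In a real pipeline the word-count vectors would be replaced by sentence embeddings, and the prompt would be passed to the model served by Ollama.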
1. Clone the GitHub repo:
    git clone https://github.com/iiteen/Health-Contracts.git
    cd Health-Contracts
2. Create a conda environment. Open your Anaconda Prompt terminal and run:
    conda create -n hilabs python=3.9
    conda activate hilabs
    conda install -c conda-forge pdf2image
    conda install -c conda-forge pytesseract
    conda install -c conda-forge tesseract
    conda install -c conda-forge langchain
    conda install -c conda-forge langchain-community
    conda install -c conda-forge langchain-experimental
    pip install -U langchain-ollama
    pip install sentence-transformers
    conda install -c conda-forge chromadb
3. Download and install Ollama from the official Ollama download page.
4. Run the following command in your terminal:
    ollama run llama3.1
5. If you are using Windows, run the following in the Python terminal; otherwise Tesseract will raise an error.
   Note: Replace the paths to tesseract.exe and tessdata with your own.
    import pytesseract
    import os

    # These settings may be needed on Windows so pytesseract can find Tesseract
    pytesseract.pytesseract.tesseract_cmd = (
        r"C:\Users\sunil\miniconda3\envs\hilabs\Library\bin\tesseract.exe"
    )
    os.environ["TESSDATA_PREFIX"] = r"C:\Users\sunil\miniconda3\envs\hilabs\share\tessdata"
6. Download and install Node.js from the official Node.js download page.
7. In the Health-Contracts folder (the cloned repo folder), go to chat-front and start the front end:
    cd Health-Contracts
    cd chat-front
    npm install
    npm start
8. In the conda terminal, activate the environment you created earlier (hilabs) and start the backend. Replace the path in the cd command with the location of the flask_server folder in your cloned repo:
    conda activate hilabs
    cd C:\Users\HP\Downloads\Health-Contracts\flask_server
    pip install flask
    pip install flask_cors
    python app.py
- Step 7 starts the front end of the chatbot.
- Step 8 starts the backend of the chatbot.
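Before uploading a PDF, it can help to confirm that the Tesseract and Ollama setup above actually worked. The sketch below is one possible pre-flight check, not part of the repository. The function names are hypothetical, and it assumes Ollama is listening on its default local address (`http://localhost:11434`) and that Tesseract is on the PATH of the activated environment.

```python
import json
import shutil
import urllib.request


def find_tesseract():
    """Return the tesseract executable found on PATH, or None."""
    return shutil.which("tesseract")


def installed_models(tags_response):
    """Model names (without the ':tag' suffix) from a parsed /api/tags reply."""
    return [m.get("name", "").split(":")[0] for m in tags_response.get("models", [])]


def fetch_ollama_tags(base_url="http://localhost:11434", timeout=3):
    """Ask the local Ollama server which models it has; None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):  # connection errors or an unparsable reply
        return None


def preflight(tags_response, tesseract_path, model="llama3.1"):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if tesseract_path is None:
        problems.append("tesseract not found on PATH")
    if tags_response is None:
        problems.append("Ollama server not reachable")
    elif model not in installed_models(tags_response):
        problems.append(f"model {model} not pulled")
    return problems
```

Running `print(preflight(fetch_ollama_tags(), find_tesseract()))` in the hilabs environment prints an empty list when both tools are available, or a short description of what is missing.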
- Go to the chatbot interface (front end).
- Click the upload PDF button and upload your PDF (this may take up to 10 seconds).
- Enter your query in the message field and click Send.
- Wait for the chatbot to answer your query.
- If you suspect the chatbot isn't working, check the conda terminal in which you ran Step 8. The logs there can show that it is still processing and needs more time to answer the query.
- Note: To ask questions about another PDF, start a New Chat and upload the new PDF.
In case of any queries, feel free to contact us on: