A PDF query answering system that uses OCR to extract text from PDFs, creates semantic chunks for efficient retrieval, and generates answers using a language model based on relevant document sections.

iiteen/Medlens-Ai

Medlens Ai: Smart Querying for Healthcare PDFs

Repository Link: https://github.com/iiteen/Health-Contracts

Team Name: Wall-E

Members:

  • Krish Khadria
  • Kundan Kumar
  • Dhruv Goyal

Overview

This project is a chatbot designed to answer user queries by extracting and processing information from PDF documents. It uses OCR to convert PDF pages to text, then creates semantic chunks for efficient information retrieval. Finally, the chatbot uses a language model to generate precise answers based on the most relevant chunks of the document.

Features

  • OCR Text Extraction: Convert PDF pages into images and extract text using Tesseract OCR.
  • Semantic Chunking: Create semantic chunks from the extracted text for better context understanding.
  • Similarity Search: Search for the most relevant chunks in the document based on user queries.
  • Answer Generation: Generate answers using the Ollama LLM based on the identified relevant chunks.

Get Started

  1. Clone the GitHub repo:

    git clone https://github.com/iiteen/Health-Contracts.git
    cd Health-Contracts
  2. Create a conda environment and install the dependencies. Open your Anaconda Prompt terminal and run:

    conda create -n hilabs python=3.9
    conda activate hilabs
    conda install -c conda-forge pdf2image
    conda install -c conda-forge pytesseract
    conda install -c conda-forge tesseract
    conda install -c conda-forge langchain
    conda install -c conda-forge langchain-community
    conda install -c conda-forge langchain-experimental
    pip install -U langchain-ollama
    pip install sentence-transformers
    conda install -c conda-forge chromadb
  3. Download Ollama from the given site: Ollama Download

  4. Run the following command in your terminal:

    ollama run llama3.1
  5. If you are using Windows, run the following in a Python terminal; otherwise you will get an error from Tesseract:

    Note: Replace the paths below with the locations of your own tesseract.exe and tessdata.

    import pytesseract
    import os
    # On Windows, the Tesseract executable path and TESSDATA_PREFIX may need to be set explicitly
    pytesseract.pytesseract.tesseract_cmd = (
        r"C:\Users\sunil\miniconda3\envs\hilabs\Library\bin\tesseract.exe"
    )
    os.environ["TESSDATA_PREFIX"] = r"C:\Users\sunil\miniconda3\envs\hilabs\share\tessdata"
  6. Download Node.js from the following site: Node.js Download

  7. In the Health-Contracts folder (the cloned repo), go to chat-front, install the dependencies, and start the front-end:

    cd Health-Contracts
    cd chat-front
    npm install
    npm start
  8. In the Conda terminal, activate your previously created environment (hilabs), go to the flask_server folder of the cloned repo, install Flask, and start the back-end. Replace the path below with the location of flask_server on your machine:

    conda activate hilabs
    cd C:\Users\HP\Downloads\Health-Contracts\flask_server
    pip install flask
    pip install flask_cors
    python app.py
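
Before starting the servers, it can help to confirm that the setup steps above took effect. The following is a minimal sketch, not part of the repository: the module list assumes the usual import names for the packages installed in steps 2 and 8, the tool list assumes the executables are on your PATH, and the helper names `missing_modules` and `missing_tools` are hypothetical.

```python
import importlib.util
import shutil

# Assumed import names for the packages installed in steps 2 and 8.
REQUIRED_MODULES = [
    "pdf2image", "pytesseract", "langchain", "langchain_community",
    "langchain_experimental", "langchain_ollama", "sentence_transformers",
    "chromadb", "flask", "flask_cors",
]

# Assumed executable names for the tools installed in steps 2, 3, and 6.
REQUIRED_TOOLS = ["tesseract", "ollama", "node", "npm"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the active environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def missing_tools(names=REQUIRED_TOOLS):
    """Return the executables that are not found on PATH."""
    return [n for n in names if shutil.which(n) is None]


if __name__ == "__main__":
    print("Missing Python packages:", missing_modules() or "none")
    print("Missing command-line tools:", missing_tools() or "none")
```

Run it inside the activated hilabs environment. A tool reported as missing may still be installed but not on PATH, which is the situation step 5 works around for Tesseract on Windows.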

How to use the ChatBot

  • Step 7 starts the front-end of the ChatBot.
  • Step 8 starts the back-end of the ChatBot.
  • Go to the ChatBot interface (front-end).
  • Click the upload PDF button and upload your PDF (this may take up to 10 seconds).
  • Enter your query in the message field and click send.
  • Wait for the ChatBot to answer your query.
  • In case you suspect the ChatBot isn't working, check the conda terminal in which you ran Step 8. The logs there can show that it is still processing and needs more time to answer the query.
  • Note: To ask queries about another PDF, start a New Chat and upload the new PDF.
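
If the ChatBot does not respond, one quick check is whether both servers are actually listening. The sketch below is an illustration only: ports 3000 and 5000 are assumptions (common defaults for `npm start` dev servers and for Flask), not values confirmed by this repository, and `is_listening` is a hypothetical helper.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed default ports; adjust them if your terminals report different ones.
    for name, port in [("front-end", 3000), ("back-end", 5000)]:
        state = "up" if is_listening("127.0.0.1", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

A server that is reachable but slow to answer is usually still processing, which the logs from Step 8 would confirm.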

Check out our PDF ChatBot demo

Screen.Recording.2024-09-01.235143.mp4

Connect with us:

In case of any queries, feel free to contact us on:
