Commit 1e3d8f4

Update README.md
Added a reference to the video and a note that you can use a private teacher model.
1 parent 81cadf2 commit 1e3d8f4

1 file changed

+15
-6
lines changed

advanced_tutorials/llm_pdfs/README.md

@@ -1,25 +1,32 @@
1-
# ⚙️ Index Private PDFs for RAG and create Fine-Tuning Datasets from them
1+
# ⚙️ RAG and Fine-Tuning in Hopsworks - build a private PDF search system
2+
* [Helper video describing how to implement this LLM PDF system](https://www.youtube.com/watch?v=8YDANJ4Gbis)
23

3-
This project will take a google drive folder of PDF files that you provide and read them, index them in vector embeddings in Hopsworks for retrieval augmented generation (RAG) and create an instruction dataset for fine-tuning using a teacher model (GPT).
4+
# ⚙️ Index Private PDFs for RAG, create and serve fine-tuned models from them, and include UI for querying
45

6+
This project is an AI system built on Hopsworks that
7+
* creates vector embeddings for PDF files in a Google Drive folder (you can also use local or network directories) and indexes them for retrieval augmented generation (RAG) in the Hopsworks Feature Store with Vector Indexing
8+
* creates an instruction dataset for fine-tuning using a teacher model (GPT by default, but you can easily configure it to use a powerful private model such as Llama-3-70b)
9+
* trains a fine-tuned open-source foundation model (Mistral 7b by default, but easily changed to other models such as Llama-3-8b) and hosts it in the model registry
10+
* provides a UI, written in Streamlit/Python, for querying your PDFs that returns answers citing the page, paragraph, and URL of the source PDF.
511

612
![Hopsworks Architecture for Private PDFs Indexed for LLMs](../..//images/llm-pdfs-architecture.gif)
713

814
## 📖 Feature Pipeline
915
The Feature Pipeline does the following:
1016

1117
* Download any new PDFs from Google Drive.
12-
* Extract chunks of text from the PDFs and store them in a Feature Group in Hopsworks.
13-
* Use GPT to generate an instruction set for the fine-tuning a foundation LLM and store as a feature group in Hopsworks.
18+
* Extract chunks of text from the PDFs and store them in a Vector-Index enabled Feature Group in Hopsworks.
19+
* Use GPT (or Llama-3-70b) to generate an instruction dataset for fine-tuning a foundation LLM and store it as a feature group in Hopsworks.
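
The chunk-extraction step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Chunk` fields, the `chunk_pages` helper, and the blank-line splitting rule are all assumptions, and the page text is assumed to have been extracted from the PDF already. Keeping the page and paragraph number with each chunk is what would later allow the UI to cite its sources.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    # Hypothetical field names; the real feature group schema may differ.
    file_name: str
    page: int
    paragraph: int
    text: str


def chunk_pages(file_name: str, pages: List[str], min_chars: int = 20) -> List[Chunk]:
    """Split each page's text into paragraph chunks on blank lines.

    Paragraphs shorter than min_chars are dropped, and the remaining
    paragraphs on each page are numbered from 1.
    """
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        paragraphs = [p.strip() for p in page_text.split("\n\n")]
        kept = [p for p in paragraphs if len(p) >= min_chars]
        for paragraph_number, paragraph in enumerate(kept, start=1):
            chunks.append(Chunk(file_name, page_number, paragraph_number, paragraph))
    return chunks


# Toy input standing in for text already extracted from a two-page PDF.
pages = [
    "Hopsworks stores features in feature groups.\n\nShort.\n\n"
    "A vector index enables similarity search.",
    "LoRA fine-tunes a model by training small adapter matrices.",
]
chunks = chunk_pages("example.pdf", pages)
for chunk in chunks:
    print(chunk.page, chunk.paragraph, chunk.text)
```

In a real pipeline each chunk would also get an embedding vector before being written to the vector-index-enabled feature group; that step is omitted here because it depends on the embedding model you choose.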
1420

1521
## 🏃🏻‍♂️Training Pipeline
22+
This step is optional; run it only if you also want to create a fine-tuned model.
1623
The Training Pipeline does the following:
1724

18-
* Uses the instruction dataset and LoRA to fine-tune the open-source LLM (Mistral-7B-Instruct-v0.2 by default) .
25+
* Uses the instruction dataset and LoRA to fine-tune the open-source LLM (Mistral-7B-Instruct-v0.2 by default).
1926
* Saves the fine-tuned model to Hopsworks Model Registry.
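
To give a feel for why LoRA keeps fine-tuning cheap, here is a toy, pure-Python sketch of the low-rank update it trains. It is not the project's training code, which would rely on a deep learning framework; the matrix sizes, the `matmul` and `lora_effective_weight` helpers, and the value of `alpha` are illustrative assumptions. The frozen weight `W` stays untouched while two small matrices `A` and `B` are trained, and the effective weight is `W + (alpha / r) * B @ A`.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


d, k, r, alpha = 8, 6, 2, 4  # toy sizes, chosen only for illustration
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]      # frozen, d x k
a = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]   # trainable, r x k
b = [[0.0] * r for _ in range(d)]                                # trainable, d x r, zero init

full_params = d * k        # parameters updated by full fine-tuning of W
lora_params = r * (d + k)  # parameters updated by LoRA (A and B only)
print(full_params, lora_params)

# With B initialised to zero, the adapted weight starts out equal to W.
effective = lora_effective_weight(w, a, b, alpha)
print(effective == w)
```

For these toy sizes LoRA trains 28 parameters instead of 48; the gap grows quickly for the large weight matrices of a 7B-parameter model, where the rank `r` is tiny compared with `d` and `k`.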
2027

2128
## 🚀 Inference Pipeline
22-
* A chatbot written in Streamlit that answers questions about the PDFs you uploaded using RAG and an embedded LLM.
29+
* A chatbot written in Streamlit that answers questions about the PDFs you uploaded using RAG and your embedded LLM (either an off-the-shelf model, like Mistral-7B-Instruct-v0.2, or your fine-tuned LLM).
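
The retrieval half of RAG can be sketched as below. This is a simplified stand-in, assuming an in-memory list instead of the Hopsworks vector index and hand-written three-dimensional vectors instead of real embeddings; the `cosine`, `top_k`, and `build_prompt` helpers, the record keys, and the prompt wording are all hypothetical. It shows the general idea: rank chunks by similarity to the question, then pass the best ones to the LLM together with their source locations so the answer can cite them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the k records whose embeddings are most similar to the query."""
    ranked = sorted(index, key=lambda row: cosine(query_vec, row["embedding"]), reverse=True)
    return ranked[:k]


def build_prompt(question, hits):
    """Build an LLM prompt that tags each context chunk with its source."""
    context = "\n".join(
        f"[{h['file']} p.{h['page']} para.{h['paragraph']}] {h['text']}" for h in hits
    )
    return (
        "Answer using only the context and cite sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy index; real embeddings would come from an embedding model.
index = [
    {"file": "a.pdf", "page": 1, "paragraph": 2,
     "text": "Feature groups store features.", "embedding": [0.9, 0.1, 0.0]},
    {"file": "a.pdf", "page": 3, "paragraph": 1,
     "text": "LoRA trains low-rank adapters.", "embedding": [0.0, 0.2, 0.9]},
    {"file": "b.pdf", "page": 2, "paragraph": 4,
     "text": "Vector indexes support similarity search.", "embedding": [0.7, 0.6, 0.1]},
]
query_vec = [1.0, 0.3, 0.0]  # stand-in for the embedded question
hits = top_k(query_vec, index, k=2)
prompt = build_prompt("Where are features stored?", hits)
print(prompt)
```

The resulting prompt would then be sent to whichever LLM you serve, either the off-the-shelf model or your fine-tuned one.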
2330

2431
## 🕵🏻‍♂️ Google Drive Credentials Creation
2532

@@ -34,3 +41,5 @@ Next, integrate these files into your project:
3441
2. Place both `credentials.json` and `client_secret.json` files inside this credentials directory.
3542

3643
Now, you are ready to download your PDFs from the Google Drive!
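
Before running the download step, it can help to confirm that both files are in place. The snippet below is a small sketch of such a check; the `missing_credentials` helper is hypothetical, and the `credentials` path is an assumption about where the directory lives relative to where you run the script, so adjust it to your layout.

```python
from pathlib import Path

# File names taken from the setup steps above.
REQUIRED_FILES = ("credentials.json", "client_secret.json")


def missing_credentials(credentials_dir):
    """Return the required file names that are absent from credentials_dir."""
    directory = Path(credentials_dir)
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]


# "credentials" is an assumed relative path, not confirmed by the README.
missing = missing_credentials("credentials")
if missing:
    print("Missing credential files:", ", ".join(missing))
else:
    print("All Google Drive credential files found.")
```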
44+
45+
