Commit c9ae5a1

[FSTORE-1404] LLM PDF Tutorial (#266)

* LLM PDF Search Tutorial using RAG and Fine-Tuning

1 parent: d67d459
14 files changed: +2026 −1 lines

README.md (+3 −1)

```diff
@@ -42,14 +42,16 @@ In order to understand the tutorials you need to be familiar with general concep
 - [Iris](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/iris): Classify iris flower species.
 - [Loan Approval](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/loan_approval): Predict loan approvals.
 - Advanced Tutorials:
-- [Air Quality](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/air_quality): Predict the Air Quality value (PM2.5) in Europe and USA using weather features and air quality features of the previous days.
+- [Air Quality](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/air_quality): Creating an air quality AI assistant that displays and explains air quality indicators for specific dates or periods, using Function Calling for LLMs and a RAG approach without a vector database.
 - [Bitcoin](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/bitcoin): Predict Bitcoin price using timeseries features and tweets sentiment analysis.
 - [Citibike](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/citibike): Predict the number of citibike users on each citibike station in the New York City.
 - [Credit Scores](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/credit_scores): Predict clients' repayment abilities.
 - [Electricity](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/electricity): Predict the electricity prices in several Swedish cities based on weather conditions, previous prices, and Swedish holidays.
 - [NYC Taxi Fares](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/nyc_taxi_fares): Predict the fare amount for a taxi ride in New York City given the pickup and dropoff locations.
 - [Recommender System](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/recommender-system): Build a recommender system for fashion items.
 - [TimeSeries](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/timeseries): Timeseries price prediction.
+- [LLM PDF](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/llm_pdfs): An AI assistant that utilizes a Retrieval-Augmented Generation (RAG) system to provide accurate answers to user questions by retrieving relevant context from PDF documents.
+- [Fraud Cheque Detection](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/fraud_cheque_detection): Building an AI assistant that detects fraudulent scanned cheque images and generates explanations for the fraud classification, using a fine-tuned open-source LLM.
 - [Keras model and Sklearn Transformation Functions with Hopsworks Model Registry](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/transformation_functions/keras): How to register Sklearn Transformation Functions and Keras model in the Hopsworks Model Registry, how to retrieve them and then use in training and inference pipelines.
 - [PyTorch model and Sklearn Transformation Functions with Hopsworks Model Registry](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/transformation_functions/pytorch): How to register Sklearn Transformation Functions and PyTorch model in the Hopsworks Model Registry, how to retrieve them and then use in training and inference pipelines.
 - [Sklearn Transformation Functions With Hopsworks Model Registy](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/transformation_functions/sklearn): How to register sklearn.pipeline with transformation functions and classifier in Hopsworks Model Registry and use it in training and inference pipelines.
```
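The new LLM PDF entry describes a Retrieval-Augmented Generation (RAG) assistant. As a rough sketch of the retrieval step only (toy 2-D vectors and made-up chunk texts; the tutorial itself embeds PDF chunks with a real model and delegates search to the feature store's vector index, not to this brute-force scan):

```python
# Minimal RAG retrieval sketch: embed a question, rank stored chunk embeddings
# by cosine similarity, and paste the best chunks into the LLM prompt.
# All names and vectors below are illustrative, not the tutorial's code.
import numpy as np

def top_k_chunks(question_emb, chunk_embs, chunks, k=2):
    # Cosine similarity between the question and every chunk embedding
    q = question_emb / np.linalg.norm(question_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in order]

chunks = ["PDF page about invoices", "PDF page about refunds", "PDF page about login"]
chunk_embs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
question_emb = np.array([1.0, 0.1])

context = top_k_chunks(question_emb, chunk_embs, chunks, k=2)
prompt = "Answer using this context:\n" + "\n".join(context)
```

A production system would embed the question with the same sentence-embedding model used for the documents and let the index do the nearest-neighbour search.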
New file (+285 lines): a Jupyter notebook (Python 3 ipykernel, Python 3.11.7), rendered below.

## <span style="color:#ff5f27">📝 Imports </span>

```python
!pip install -r requirements.txt -q
```

```python
import PyPDF2
import pandas as pd
from sentence_transformers import SentenceTransformer

from functions.pdf_preprocess import (
    download_files_to_folder,
    process_pdf_file,
)
from functions.text_preprocess import process_text_data
import config

import warnings
warnings.filterwarnings('ignore')
```

## <span style="color:#ff5f27">💾 Download files from Google Drive </span>

```python
# Call the function to download files
new_files = download_files_to_folder(
    config.FOLDER_ID,
    config.DOWNLOAD_PATH,
)
```

## <span style="color:#ff5f27">🧬 Text Extraction </span>

```python
# Initialize an empty list
document_text = []

for file in new_files:
    process_pdf_file(
        file,
        document_text,
        config.DOWNLOAD_PATH,
    )
```

```python
# Create a DataFrame
columns = ["file_name", "file_link", "page_number", "text"]
df_text = pd.DataFrame(
    data=document_text,
    columns=columns,
)
# Display the DataFrame
df_text
```

```python
# Process text data using the process_text_data function
df_text_processed = process_text_data(df_text)

# Display the processed DataFrame
df_text_processed
```

## <span style="color:#ff5f27">⚙️ Embeddings Creation </span>

```python
# Load the SentenceTransformer model
model = SentenceTransformer(
    config.MODEL_SENTENCE_TRANSFORMER,
).to(config.DEVICE)
model.device
```

```python
# Generate embeddings for the 'text' column using the SentenceTransformer model
df_text_processed['embeddings'] = pd.Series(
    model.encode(df_text_processed['text']).tolist(),
)

# Create a new column 'context_id' with values ranging from 0 to the number of rows in the DataFrame
df_text_processed['context_id'] = [*range(df_text_processed.shape[0])]

# Display the resulting DataFrame with the added 'embeddings' and 'context_id' columns
df_text_processed
```

## <span style="color:#ff5f27;"> 🔮 Connecting to Hopsworks Feature Store </span>

```python
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()
```

## <span style="color:#ff5f27;"> 🪄 Feature Group Creation </span>

```python
from hsfs import embedding

# Create the Embedding Index
emb = embedding.EmbeddingIndex()

emb.add_embedding(
    "embeddings",
    model.get_sentence_embedding_dimension(),
)
```

```python
# Get or create the 'documents_fg' feature group
documents_fg = fs.get_or_create_feature_group(
    name="documents_fg",
    embedding_index=emb,
    primary_key=['context_id'],
    version=1,
    description='Information from various files, presenting details like file names, source links, and structured text excerpts from different pages and paragraphs.',
    online_enabled=True,
)

documents_fg.insert(df_text_processed)
```

## <span style="color:#ff5f27;">🪄 Feature View Creation </span>

```python
# Get or create the 'documents' feature view
feature_view = fs.get_or_create_feature_view(
    name="documents",
    version=1,
    description='Chunked context for RAG system',
    query=documents_fg.select(["file_name", "file_link", "page_number", "paragraph", "text"]),
)
```

---
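The notebook's embedding cell stores one fixed-length vector per text row and adds a sequential `context_id` primary key. A self-contained sketch with a stand-in `encode` function (the notebook uses a real SentenceTransformer; this toy encoder only exists to make the resulting frame shape concrete):

```python
import numpy as np
import pandas as pd

def encode(texts):
    # Stand-in for SentenceTransformer.encode: one 3-dim vector per text
    return np.array([[len(t), t.count(' '), 1.0] for t in texts])

df = pd.DataFrame({"text": ["first page text", "second page"]})

# Same pattern as the notebook: a list-valued embeddings column
df['embeddings'] = pd.Series(encode(df['text']).tolist())

# Sequential primary key, 0 .. n-1
df['context_id'] = [*range(df.shape[0])]
```

Each row now carries a plain Python list in `embeddings`, which is the representation the feature group's embedding index ingests.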
New file (+69 lines): a Python script version of the pipeline.

```python
import PyPDF2
import pandas as pd
from sentence_transformers import SentenceTransformer

from functions.pdf_preprocess import download_files_to_folder, process_pdf_file
from functions.text_preprocess import process_text_data
import config

import hopsworks


def pipeline():
    # Call the function to download files
    new_files = download_files_to_folder(
        config.FOLDER_ID,
        config.DOWNLOAD_PATH,
    )

    if len(new_files) == 0:
        print('⛳️ Your folder is up to date!')
        return

    # Initialize an empty list
    document_text = []

    for file in new_files:
        process_pdf_file(
            file,
            document_text,
            config.DOWNLOAD_PATH,
        )

    # Create a DataFrame
    columns = ["file_name", "page_number", "text"]
    df_text = pd.DataFrame(
        data=document_text,
        columns=columns,
    )

    # Process text data using the process_text_data function
    df_text_processed = process_text_data(df_text)

    # Retrieve a SentenceTransformer
    model = SentenceTransformer(
        config.MODEL_SENTENCE_TRANSFORMER,
    ).to(config.DEVICE)

    # Generate embeddings for the 'text' column using the SentenceTransformer model
    df_text_processed['embeddings'] = pd.Series(
        model.encode(df_text_processed['text']).tolist(),
    )

    # Create a new column 'context_id' with values ranging from 0 to the number of rows in the DataFrame
    df_text_processed['context_id'] = [*range(df_text_processed.shape[0])]

    project = hopsworks.login()

    fs = project.get_feature_store()

    documents_fg = fs.get_feature_group(
        name="documents_fg",
        version=1,
    )

    documents_fg.insert(df_text_processed)
    return


if __name__ == '__main__':
    pipeline()
```
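The script returns early when `download_files_to_folder` reports nothing new, so a scheduled run is cheap when the source folder has not changed. A minimal stand-in for that incremental check (`find_new_files`, the temp folder, and the `seen` set are illustrative, not the tutorial's actual helpers):

```python
import os
import tempfile

def find_new_files(folder, seen):
    # Return files in `folder` that are not in the `seen` set of processed names
    return [f for f in sorted(os.listdir(folder)) if f not in seen]

# Usage sketch: one already-processed file, one new one
tmp = tempfile.mkdtemp()
open(os.path.join(tmp, "a.pdf"), "w").close()
open(os.path.join(tmp, "b.pdf"), "w").close()

new = find_new_files(tmp, seen={"a.pdf"})
```

If `new` comes back empty, the caller can skip the whole extract-embed-insert sequence, exactly as the pipeline above does.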
