Commit e8b972e

Merge branch 'cocoindex-io:main' into expr-union-type-impl
2 parents 534c791 + cf3634c commit e8b972e

29 files changed

+231
-72
lines changed

README.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -137,7 +137,7 @@ It defines an index flow like this:
137137
| [Docs to Knowledge Graph](examples/docs_to_knowledge_graph) | Extract relationships from Markdown documents and build a knowledge graph |
138138
| [Embeddings to Qdrant](examples/text_embedding_qdrant) | Index documents in a Qdrant collection for semantic search |
139139
| [FastAPI Server with Docker](examples/fastapi_server_docker) | Run the semantic search server in a Dockerized FastAPI setup |
140-
| [Product_Taxonomy_Knowledge_Graph](examples/product_taxonomy_knowledge_graph) | Build knowledge graph for product recommendations |
140+
| [Product Recommendation](examples/product_recommendation) | Build real-time product recommendations with an LLM and a graph database |
141141
| [Image Search with Vision API](examples/image_search_example) | Generates detailed captions for images using a vision model, embeds them, enables live-updating semantic search via FastAPI and served on a React frontend|
142142

143143
More coming and stay tuned 👀!

docs/docs/getting_started/quickstart.md

Lines changed: 127 additions & 44 deletions
@@ -54,11 +54,7 @@ Create a new file `quickstart.py` and import the `cocoindex` library:
5454
import cocoindex
5555
```
5656
57-
Then we'll put the following pieces into the file:
58-
59-
* Define an indexing flow, which specifies a data flow to transform data from specified data source into a vector index.
60-
* Define a query handler, which can be used to query the vector index.
61-
* A main function, to interact with users and run queries using the query handler above.
57+
Then we'll create the indexing flow.
6258

6359
### Step 2.1: Define the indexing flow
6460

@@ -121,46 +117,14 @@ Notes:
121117

122118
6. In CocoIndex, a *collector* collects multiple entries of data together. In this example, the `doc_embeddings` collector collects data from all `chunk`s across all `doc`s, and using the collected data to build a vector index `"doc_embeddings"`, using `Postgres`.
123119

124-
### Step 2.2: Define the query handler
125-
126-
Starting from the query handler:
127-
128-
```python title="quickstart.py"
129-
query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
130-
name="SemanticsSearch",
131-
flow=text_embedding_flow,
132-
target_name="doc_embeddings",
133-
query_transform_flow=lambda text: text.transform(
134-
cocoindex.functions.SentenceTransformerEmbed(
135-
model="sentence-transformers/all-MiniLM-L6-v2")),
136-
default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)
137-
```
138-
139-
This handler queries the vector index `"doc_embeddings"`, and uses the same embedding model `"sentence-transformers/all-MiniLM-L6-v2"` to transform query text into vectors for similarity matching.
140-
141-
142-
### Step 2.3: Define the main function
120+
### Step 2.2: Define the main function
143121

144-
The main function is used to interact with users and run queries using the query handler above.
122+
We can provide an empty main function for now, with a `@cocoindex.main_fn()` decorator:
145123
146124
```python title="quickstart.py"
147125
@cocoindex.main_fn()
148126
def _main():
149-
# Run queries to demonstrate the query capabilities.
150-
while True:
151-
try:
152-
query = input("Enter search query (or Enter to quit): ")
153-
if query == '':
154-
break
155-
results, _ = query_handler.search(query, 10)
156-
print("\nSearch results:")
157-
for result in results:
158-
print(f"[{result.score:.3f}] {result.data['filename']}")
159-
print(f" {result.data['text']}")
160-
print("---")
161-
print()
162-
except KeyboardInterrupt:
163-
break
127+
pass
164128

165129
if __name__ == "__main__":
166130
_main()
@@ -171,7 +135,6 @@ The `@cocoindex.main_fn` declares a function as the main function for an indexin
171135
* Initialize the CocoIndex library states. Settings (e.g. database URL) are loaded from environment variables by default.
172136
* When the CLI is invoked with the `cocoindex` subcommand, the `cocoindex` CLI takes over control and provides convenient ways to manage the index. See the next step for more details.
173137
174-
175138
## Step 3: Run the indexing pipeline and queries
176139
177140
Specify the database URL by environment variable:
@@ -206,9 +169,129 @@ It will run for a few seconds and output the following statistics:
206169
documents: 3 added, 0 removed, 0 updated
207170
```
208171
209-
### Step 3.3: Run queries against the index
172+
## Step 4 (optional): Run queries against the index
173+
174+
CocoIndex excels at transforming your data and storing it (a.k.a. indexing).
175+
The goal of transforming your data is usually to query against it.
176+
Once your index is built, you can access the transformed data directly in the target database.
177+
CocoIndex also provides utilities for you to do this more seamlessly.
178+
179+
In this example, we'll use the [`psycopg` library](https://www.psycopg.org/) to connect to the database and run queries.
180+
Please make sure it's installed:
181+
182+
```bash
183+
pip install psycopg[binary,pool]
184+
```
185+
186+
### Step 4.1: Extract common transformations
187+
188+
The indexing flow and the query logic share one transformation: computing the embedding of a text.
189+
That is, both should use exactly the same embedding model and parameters.
190+
191+
Let's extract that into a function:
192+
193+
```python title="quickstart.py"
194+
@cocoindex.transform_flow()
195+
def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
196+
return text.transform(
197+
cocoindex.functions.SentenceTransformerEmbed(
198+
model="sentence-transformers/all-MiniLM-L6-v2"))
199+
```
200+
201+
`cocoindex.DataSlice[str]` represents certain data in the flow (e.g. a field in a data scope), with type `str` at runtime.
202+
Similar to `text_embedding_flow()` above, `text_to_embedding()` also constructs the flow instead of doing the computation directly,
203+
so the type it takes is `cocoindex.DataSlice[str]` instead of `str`.
204+
See [Data Slice](../core/flow_def#data-slice) for more details.
205+
206+
207+
Then the corresponding code in the indexing flow can be simplified by calling this function:
208+
209+
```python title="quickstart.py"
210+
...
211+
# Transform data of each chunk
212+
with doc["chunks"].row() as chunk:
213+
# Embed the chunk, put into `embedding` field
214+
chunk["embedding"] = text_to_embedding(chunk["text"])
215+
216+
# Collect the chunk into the collector.
217+
doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
218+
text=chunk["text"], embedding=chunk["embedding"])
219+
...
220+
```
221+
222+
The function decorator `@cocoindex.transform_flow()` is used to declare a function as a CocoIndex transform flow,
223+
i.e., a sub flow only performing transformations, without importing data from sources or exporting data to targets.
224+
The decorator is needed for evaluating the flow with specific input data in Step 4.2 below.
225+
226+
### Step 4.2: Provide the query logic
227+
228+
Now we can create a function to query the index upon a given input query:
229+
230+
```python title="quickstart.py"
231+
from psycopg_pool import ConnectionPool
232+
233+
def search(pool: ConnectionPool, query: str, top_k: int = 5):
234+
# Get the table name, for the export target in the text_embedding_flow above.
235+
table_name = cocoindex.utils.get_target_storage_default_name(text_embedding_flow, "doc_embeddings")
236+
# Evaluate the transform flow defined above with the input query, to get the embedding.
237+
query_vector = text_to_embedding.eval(query)
238+
# Run the query and get the results.
239+
with pool.connection() as conn:
240+
with conn.cursor() as cur:
241+
cur.execute(f"""
242+
SELECT filename, text, embedding <=> %s::vector AS distance
243+
FROM {table_name} ORDER BY distance LIMIT %s
244+
""", (query_vector, top_k))
245+
return [
246+
{"filename": row[0], "text": row[1], "score": 1.0 - row[2]}
247+
for row in cur.fetchall()
248+
]
249+
```
250+
251+
In the function above, most parts are standard query logic - you can use any libraries you like.
252+
There are two pieces of CocoIndex-specific logic:
253+
254+
1. Get the table name from the export target in the `text_embedding_flow` above.
255+
Since the table name for the `Postgres` target is not explicitly specified in the `export()` call,
256+
CocoIndex uses a default name.
257+
`cocoindex.utils.get_target_storage_default_name()` is a utility function to get the default table name for this case.
258+
259+
2. Evaluate the transform flow defined above with the input query, to get the embedding.
260+
It's done by the `eval()` method of the transform flow `text_to_embedding`.
261+
The return type of this method is `list[float]` as declared in the `text_to_embedding()` function (`cocoindex.DataSlice[list[float]]`).
262+
263+
### Step 4.3: Update the main function
264+
265+
Now we can update the main function to use the query function we just defined (this also needs `import os` at the top of the file, for `os.getenv()`):
266+
267+
```python title="quickstart.py"
268+
@cocoindex.main_fn()
269+
def _main():
270+
# Initialize the database connection pool.
271+
pool = ConnectionPool(os.getenv("COCOINDEX_DATABASE_URL"))
272+
# Run queries in a loop to demonstrate the query capabilities.
273+
while True:
274+
try:
275+
query = input("Enter search query (or Enter to quit): ")
276+
if query == '':
277+
break
278+
# Run the query function with the database connection pool and the query.
279+
results = search(pool, query)
280+
print("\nSearch results:")
281+
for result in results:
282+
print(f"[{result['score']:.3f}] {result['filename']}")
283+
print(f" {result['text']}")
284+
print("---")
285+
print()
286+
except KeyboardInterrupt:
287+
break
288+
```
289+
290+
It interacts with users and searches the database by calling the `search()` function created in Step 4.2.
291+
292+
### Step 4.4: Run queries against the index
210293
211-
Now we have the index built. We can run the same Python file without additional arguments, which will run the main function defined in Step 2.3:
294+
Now we can run the same Python file, which will run the new main function:
212295
213296
```bash
214297
python quickstart.py
@@ -222,5 +305,5 @@ Next, you may want to:
222305
223306
* Learn about [CocoIndex Basics](../core/basics.md).
224307
* Learn about other examples in the [examples](https://github.com/cocoindex-io/cocoindex/tree/main/examples) directory.
225-
* The `text_embedding` example is this quickstart with some polishing (loading environment variables from `.env` file, extract pieces shared by the indexing flow and query handler into a function).
308+
* The `text_embedding` example is this quickstart.
226309
* Pick other examples to learn upon your interest.
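
Reviewer note on the quickstart changes above: the new `search()` function orders rows by `embedding <=> %s::vector` (pgvector's cosine distance operator) and reports `1.0 - distance` as the score. The arithmetic can be sketched in plain Python without a database. This is only an illustration under stated assumptions: `cosine_distance` and `top_k_by_score` are hypothetical helper names, not part of CocoIndex, psycopg, or pgvector, and the in-memory `rows` list stands in for the Postgres table.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance, i.e. 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k_by_score(query_vector, rows, top_k=5):
    """Hypothetical in-memory stand-in for the SQL query.

    `rows` is a list of (filename, text, embedding) tuples.
    Mirrors `ORDER BY distance LIMIT top_k`, then converts distance to score.
    """
    with_distance = [
        (filename, text, cosine_distance(query_vector, embedding))
        for filename, text, embedding in rows
    ]
    with_distance.sort(key=lambda row: row[2])  # smallest distance first
    return [
        {"filename": filename, "text": text, "score": 1.0 - distance}
        for filename, text, distance in with_distance[:top_k]
    ]
```

Since the score is `1.0 - distance`, it equals the cosine similarity, so the closest rows get the highest scores and come first.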

examples/amazon_s3_embedding/pyproject.toml

Lines changed: 4 additions & 1 deletion
@@ -3,4 +3,7 @@ name = "amazon-s3-text-embedding"
33
version = "0.1.0"
44
description = "Simple example for cocoindex: build embedding index based on Amazon S3 files."
55
requires-python = ">=3.11"
6-
dependencies = ["cocoindex>=0.1.35", "python-dotenv>=1.0.1"]
6+
dependencies = ["cocoindex>=0.1.39", "python-dotenv>=1.0.1"]
7+
8+
[tool.setuptools]
9+
packages = []

examples/code_embedding/pyproject.toml

Lines changed: 4 additions & 1 deletion
@@ -3,4 +3,7 @@ name = "code-embedding"
33
version = "0.1.0"
44
description = "Simple example for cocoindex: build embedding index based on source code."
55
requires-python = ">=3.10"
6-
dependencies = ["cocoindex>=0.1.35", "python-dotenv>=1.0.1"]
6+
dependencies = ["cocoindex>=0.1.39", "python-dotenv>=1.0.1"]
7+
8+
[tool.setuptools]
9+
packages = []

examples/docs_to_knowledge_graph/pyproject.toml

Lines changed: 4 additions & 1 deletion
@@ -3,4 +3,7 @@ name = "manuals-to-kg"
33
version = "0.1.0"
44
description = "Simple example for cocoindex: extract triples from files and build knowledge graph."
55
requires-python = ">=3.10"
6-
dependencies = ["cocoindex>=0.1.35", "python-dotenv>=1.0.1"]
6+
dependencies = ["cocoindex>=0.1.39", "python-dotenv>=1.0.1"]
7+
8+
[tool.setuptools]
9+
packages = []

examples/gdrive_text_embedding/pyproject.toml

Lines changed: 4 additions & 1 deletion
@@ -3,4 +3,7 @@ name = "gdrive-text-embedding"
33
version = "0.1.0"
44
description = "Simple example for cocoindex: build embedding index based on Google Drive files."
55
requires-python = ">=3.11"
6-
dependencies = ["cocoindex>=0.1.35", "python-dotenv>=1.0.1"]
6+
dependencies = ["cocoindex>=0.1.39", "python-dotenv>=1.0.1"]
7+
8+
[tool.setuptools]
9+
packages = []

examples/manuals_llm_extraction/pyproject.toml

Lines changed: 4 additions & 1 deletion
@@ -4,7 +4,10 @@ version = "0.1.0"
44
description = "Simple example for cocoindex: extract structured information from a Markdown file using LLM."
55
requires-python = ">=3.10"
66
dependencies = [
7-
"cocoindex>=0.1.35",
7+
"cocoindex>=0.1.39",
88
"python-dotenv>=1.0.1",
99
"marker-pdf>=1.5.2",
1010
]
11+
12+
[tool.setuptools]
13+
packages = []

examples/pdf_embedding/pyproject.toml

Lines changed: 4 additions & 1 deletion
@@ -4,7 +4,10 @@ version = "0.1.0"
44
description = "Simple example for cocoindex: build embedding index based on local PDF files."
55
requires-python = ">=3.10"
66
dependencies = [
7-
"cocoindex>=0.1.35",
7+
"cocoindex>=0.1.39",
88
"python-dotenv>=1.0.1",
99
"marker-pdf>=1.5.2",
1010
]
11+
12+
[tool.setuptools]
13+
packages = []

examples/product_taxonomy_knowledge_graph/README.md renamed to examples/product_recommendation/README.md

Lines changed: 4 additions & 2 deletions
@@ -1,6 +1,8 @@
1-
# Build Real-Time Product Recommendation based on LLM Taxonomy Extraction and Knowledge Graph
1+
# Build Real-Time Recommendation Engine with LLM and Graph Database
22

3-
We will process a list of products and use LLM to extract the taxonomy and complimentary taxonomy for each product.
3+
We will build a real-time product recommendation engine with an LLM and a graph database. In particular, we will use the LLM to understand the category (taxonomy) of a product. In addition, we will use the LLM to enumerate complementary products, i.e. products that users are likely to buy together with the current one (e.g. a pencil and a notebook).
4+
5+
We will use a graph to explore the relationships between products, which can then be used for product recommendations or labeling.
46

57
Please drop [CocoIndex on Github](https://github.com/cocoindex-io/cocoindex) a star to support us and stay tuned for more updates. Thank you so much 🥥🤗. [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)
68

examples/product_taxonomy_knowledge_graph/pyproject.toml renamed to examples/product_recommendation/pyproject.toml

Lines changed: 4 additions & 1 deletion
@@ -3,4 +3,7 @@ name = "cocoindex-ecommerce-taxonomy"
33
version = "0.1.0"
44
description = "Simple example for CocoIndex: extract taxonomy from e-commerce products and build knowledge graph."
55
requires-python = ">=3.10"
6-
dependencies = ["cocoindex>=0.1.35", "python-dotenv>=1.0.1", "jinja2>=3.1.6"]
6+
dependencies = ["cocoindex>=0.1.39", "python-dotenv>=1.0.1", "jinja2>=3.1.6"]
7+
8+
[tool.setuptools]
9+
packages = []

examples/text_embedding/main.py

Lines changed: 28 additions & 6 deletions
@@ -1,8 +1,11 @@
1+
import os
12
from dotenv import load_dotenv
3+
from psycopg_pool import ConnectionPool
24

35
import cocoindex
46

5-
def text_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice:
7+
@cocoindex.transform_flow()
8+
def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
69
"""
710
Embed the text using a SentenceTransformer model.
811
This is a shared logic between indexing and querying, so extract it as a function.
@@ -17,7 +20,7 @@ def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
1720
Define an example flow that embeds text into a vector database.
1821
"""
1922
data_scope["documents"] = flow_builder.add_source(
20-
cocoindex.sources.LocalFile(path="markdown_files"))
23+
cocoindex.sources.LocalFile(path="markdown_files", included_patterns=["*.md"]))
2124

2225
doc_embeddings = data_scope.add_collector()
2326

@@ -40,26 +43,45 @@ def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
4043
field_name="embedding",
4144
metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
4245

43-
query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
46+
# Keep for now to allow CocoInsight to query.
47+
# Will be removed later after we expose `search()` below as a query function (https://github.com/cocoindex-io/cocoindex/issues/502).
48+
cocoindex.query.SimpleSemanticsQueryHandler(
4449
name="SemanticsSearch",
4550
flow=text_embedding_flow,
4651
target_name="doc_embeddings",
4752
query_transform_flow=text_to_embedding,
4853
default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)
4954

55+
def search(pool: ConnectionPool, query: str, top_k: int = 5):
56+
table_name = cocoindex.utils.get_target_storage_default_name(text_embedding_flow, "doc_embeddings")
57+
query_vector = text_to_embedding.eval(query)
58+
with pool.connection() as conn:
59+
with conn.cursor() as cur:
60+
cur.execute(f"""
61+
SELECT filename, location, text, embedding <=> %s::vector AS distance
62+
FROM {table_name}
63+
ORDER BY distance
64+
LIMIT %s
65+
""", (query_vector, top_k))
66+
return [
67+
{"filename": row[0], "location": row[1], "text": row[2], "score": 1.0 - row[3]}
68+
for row in cur.fetchall()
69+
]
70+
5071
@cocoindex.main_fn()
5172
def _run():
73+
pool = ConnectionPool(os.getenv("COCOINDEX_DATABASE_URL"))
5274
# Run queries in a loop to demonstrate the query capabilities.
5375
while True:
5476
try:
5577
query = input("Enter search query (or Enter to quit): ")
5678
if query == '':
5779
break
58-
results, _ = query_handler.search(query, 10)
80+
results = search(pool, query)
5981
print("\nSearch results:")
6082
for result in results:
61-
print(f"[{result.score:.3f}] {result.data['filename']}")
62-
print(f" {result.data['text']}")
83+
print(f"[{result['score']:.3f}] {result['filename']} location:{result['location']}")
84+
print(f" {result['text']}")
6385
print("---")
6486
print()
6587
except KeyboardInterrupt:
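
Reviewer note on the `main.py` change above: results are now plain dictionaries built from SQL rows, replacing the old `result.score` / `result.data[...]` objects. The row-to-dict mapping and the print format can be tried without a database. This is a rough sketch: `rows_to_results` and `format_result` are hypothetical names introduced only for illustration, and the hard-coded tuple (including the string used for `location`) is an assumed stand-in for what `cur.fetchall()` would return.

```python
def rows_to_results(rows):
    """Map (filename, location, text, distance) tuples to result dicts,
    using the same keys and score conversion as `search()` in the diff."""
    return [
        {"filename": row[0], "location": row[1], "text": row[2], "score": 1.0 - row[3]}
        for row in rows
    ]


def format_result(result):
    """Build the two lines that the `_run()` loop prints for each result."""
    header = f"[{result['score']:.3f}] {result['filename']} location:{result['location']}"
    return f"{header}\n    {result['text']}"


# Assumed sample row standing in for real query output.
sample_rows = [("intro.md", "0:120", "CocoIndex quickstart text", 0.25)]
for result in rows_to_results(sample_rows):
    print(format_result(result))
    print("---")
```

Keeping the mapping separate from the database call makes the output format easy to check without a running Postgres instance.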

examples/text_embedding/pyproject.toml

Lines changed: 8 additions & 1 deletion
@@ -3,4 +3,11 @@ name = "text-embedding"
33
version = "0.1.0"
44
description = "Simple example for cocoindex: build embedding index based on local text files."
55
requires-python = ">=3.10"
6-
dependencies = ["cocoindex>=0.1.35", "python-dotenv>=1.0.1"]
6+
dependencies = [
7+
"cocoindex>=0.1.39",
8+
"python-dotenv>=1.0.1",
9+
"psycopg[binary,pool]",
10+
]
11+
12+
[tool.setuptools]
13+
packages = []

examples/text_embedding_qdrant/pyproject.toml

Lines changed: 4 additions & 1 deletion
@@ -3,4 +3,7 @@ name = "text-embedding-qdrant"
33
version = "0.1.0"
44
description = "Simple example for cocoindex: build embedding index based on local text files."
55
requires-python = ">=3.10"
6-
dependencies = ["cocoindex>=0.1.35", "python-dotenv>=1.0.1"]
6+
dependencies = ["cocoindex>=0.1.39", "python-dotenv>=1.0.1"]
7+
8+
[tool.setuptools]
9+
packages = []
