Commit 8326ea5

Merge branch 'main' into standalone-cli
2 parents 75e2316 + 482750b commit 8326ea5

80 files changed (+1320 -1812 lines changed)

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ name = "cocoindex"
 # Version used for local development is always higher than others to take precedence.
 # Will be overridden for specific release versions.
 version = "999.0.0"
-edition = "2021"
+edition = "2024"

 [profile.release]
 codegen-units = 1

README.md

Lines changed: 2 additions & 1 deletion
@@ -132,11 +132,12 @@ It defines an index flow like this:
 | [Code Embedding](examples/code_embedding) | Index code embeddings for semantic search |
 | [PDF Embedding](examples/pdf_embedding) | Parse PDF and index text embeddings for semantic search |
 | [Manuals LLM Extraction](examples/manuals_llm_extraction) | Extract structured information from a manual using LLM |
+| [Amazon S3 Embedding](examples/amazon_s3_embedding) | Index text documents from Amazon S3 |
 | [Google Drive Text Embedding](examples/gdrive_text_embedding) | Index text documents from Google Drive |
 | [Docs to Knowledge Graph](examples/docs_to_knowledge_graph) | Extract relationships from Markdown documents and build a knowledge graph |
 | [Embeddings to Qdrant](examples/text_embedding_qdrant) | Index documents in a Qdrant collection for semantic search |
 | [FastAPI Server with Docker](examples/fastapi_server_docker) | Run the semantic search server in a Dockerized FastAPI setup |
-| [Product_Taxonomy_Knowledge_Graph](examples/product_taxonomy_knowledge_graph) | Build knowledge graph for product recommendations |
+| [Product Recommendation](examples/product_recommendation) | Build real-time product recommendations with LLM and graph database|
 | [Image Search with Vision API](examples/image_search_example) | Generates detailed captions for images using a vision model, embeds them, enables live-updating semantic search via FastAPI and served on a React frontend|

 More coming and stay tuned 👀!

docs/docs/core/basics.md

Lines changed: 10 additions & 19 deletions
@@ -1,17 +1,17 @@
 ---
-title: Basics
-description: "CocoIndex basic concepts: indexing flow, data, operations, data updates, etc."
+title: Indexing Basics
+description: "CocoIndex basic concepts for indexing: indexing flow, data, operations, data updates, etc."
 ---

-# CocoIndex Basics
+# CocoIndex Indexing Basics

 An **index** is a collection of data stored in a way that is easy for retrieval.

-CocoIndex is an ETL framework for building indexes from specified data sources, a.k.a. indexing. It also offers utilities for users to retrieve data from the indexes.
+CocoIndex is an ETL framework for building indexes from specified data sources, a.k.a. **indexing**. It also offers utilities for users to retrieve data from the indexes.

-## Indexing flow
+An **indexing flow** extracts data from specified data sources, upon specified transformations, and puts the transformed data into specified storage for later retrieval.

-An indexing flow extracts data from specified data sources, upon specified transformations, and puts the transformed data into specified storage for later retrieval.
+## Indexing flow elements

 An indexing flow has two aspects: data and operations on data.

@@ -42,7 +42,7 @@ An **operation** in an indexing flow defines a step in the flow. An operation is

 "import" and "transform" operations produce output data, whose data type is determined based on the operation spec and data types of input data (for "transform" operation only).

-### Example
+## An indexing flow example

 For the example shown in the [Quickstart](../getting_started/quickstart) section, the indexing flow is as follows:

@@ -60,7 +60,7 @@ This shows schema and example data for the indexing flow:

 ![Data Example](data_example.svg)

-### Life cycle of an indexing flow
+## Life cycle of an indexing flow

 An indexing flow, once set up, maintains a long-lived relationship between data source and data in target storage. This means:

@@ -95,19 +95,10 @@ CocoIndex works the same way, but with more powerful capabilities:

 This means when writing your flow operations, you can treat source data as if it were static - focusing purely on defining the transformation logic. CocoIndex takes care of maintaining the dynamic relationship between sources and target data behind the scenes.

-### Internal storage
+## Internal storage

 As an indexing flow is long-lived, it needs to store intermediate data to keep track of the states.
 CocoIndex uses internal storage for this purpose.

 Currently, CocoIndex uses Postgres database as the internal storage.
-See [Initialization](initialization) for configuring its location, and `cocoindex setup` CLI command (see [CocoIndex CLI](cli)) creates tables for the internal storage.
-
-## Retrieval
-
-There are two ways to retrieve data from target storage built by an indexing flow:
-
-* Query the underlying target storage directly for maximum flexibility.
-* Use CocoIndex *query handlers* for a more convenient experience with built-in tooling support (e.g. CocoInsight) to understand query performance against the target data.
-
-Query handlers are tied to specific indexing flows. They accept query inputs, transform them by defined operations, and retrieve matching data from the target storage that was created by the flow.
+See [Initialization](initialization) for configuring its location, and `cocoindex setup` CLI command (see [CocoIndex CLI](cli)) creates tables for the internal storage.

docs/docs/core/cli.mdx

Lines changed: 2 additions & 2 deletions
@@ -41,7 +41,7 @@ You may also provide a `cocoindex_cmd` argument to the `main_fn` decorator to ch

 ### Explicitly CLI Invoke

-An alterntive way is to use `cocoindex.cli.cli` (with type [`click.Group`](https://click.palletsprojects.com/en/stable/api/#click.Group)).
+An alternative way is to use `cocoindex.cli.cli` (with type [`click.Group`](https://click.palletsprojects.com/en/stable/api/#click.Group)).
 For example, you may invoke the CLI explicitly with additional arguments:

 <Tabs>
@@ -60,7 +60,7 @@ The following subcommands are available:

 | Subcommand | Description |
 | ---------- | ----------- |
-| `ls` | List all flows. |
+| `ls` | List all flows present in the current process. Or list all persisted flows under the current app namespace if `--all` is specified. |
 | `show` | Show the spec for a specific flow. |
 | `setup` | Check and apply backend setup changes for flows, including the internal and target storage (to export). |
 | `drop` | Drop the backend setup for specified flows. |

docs/docs/core/flow_def.mdx

Lines changed: 34 additions & 2 deletions
@@ -1,7 +1,6 @@
 ---
 title: Flow Definition
 description: Define a CocoIndex flow, by specifying source, transformations and storages, and connect input/output data of them.
-toc_max_heading_level: 4
 ---

 import Tabs from '@theme/Tabs';
@@ -146,8 +145,9 @@ def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataSco

 :::info

-In live update mode, for each refresh, CocoIndex will traverse the data source to figure out the changes,
+In live update mode, for each refresh, CocoIndex will list rows in the data source to figure out the changes based on metadata such as last modified time,
 and only perform transformations on changed source keys.
+If nothing changed during the last refresh cycle, only list operations will be performed, which is usually cheap for most data sources.

 :::

@@ -311,6 +311,38 @@ Following metrics are supported:

 ## Miscellaneous

+### Getting App Namespace
+
+You can use the [`app_namespace` setting](initialization#app-namespace) or `COCOINDEX_APP_NAMESPACE` environment variable to specify the app namespace,
+to organize flows across different environments (e.g., dev, staging, production), team members, etc.
+
+In the code, You can call `flow.get_app_namespace()` to get the app namespace, and use it to name certain backends. It takes the following arguments:
+
+* `trailing_delimiter` (optional): a string to append to the app namespace when it's not empty.
+
+e.g. when the current app namespace is `Staging`, `flow.get_app_namespace(trailing_delimiter='.')` will return `Staging.`.
+
+For example,
+
+<Tabs>
+<TabItem value="python" label="Python" default>
+
+```python
+doc_embeddings.export(
+    "doc_embeddings",
+    cocoindex.storages.Qdrant(
+        collection_name=cocoindex.get_app_namespace(trailing_delimiter='__') + "doc_embeddings",
+        ...
+    ),
+    ...
+)
+```
+
+</TabItem>
+</Tabs>
+
+It will use `Staging__doc_embeddings` as the collection name if the current app namespace is `Staging`, and use `doc_embeddings` if the app namespace is empty.
+
 ### Target Declarations

 Most time a target storage is created by calling `export()` method on a collector, and this `export()` call comes with configurations needed for the target storage, e.g. options for storage indexes.
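
Note on the `flow_def.mdx` change: the `trailing_delimiter` rule documented in the new "Getting App Namespace" section can be sketched in plain Python. This is a minimal illustration and not CocoIndex's implementation: `namespace_prefix` is a hypothetical helper that takes the namespace as an explicit argument and reproduces only the documented behavior (append the delimiter when the namespace is not empty).

```python
from typing import Optional


def namespace_prefix(app_namespace: str, trailing_delimiter: Optional[str] = None) -> str:
    """Hypothetical stand-in for the documented get_app_namespace() rule."""
    # An empty namespace yields an empty prefix, so no stray delimiter appears.
    if not app_namespace:
        return ""
    # Otherwise append the delimiter, if one was requested.
    return app_namespace + (trailing_delimiter or "")


# Collection names matching the two cases described in the added docs.
staging_name = namespace_prefix("Staging", "__") + "doc_embeddings"
default_name = namespace_prefix("", "__") + "doc_embeddings"
```

Under these assumptions `staging_name` is `Staging__doc_embeddings` and `default_name` is `doc_embeddings`, which is why the same flow code can be reused across namespaces without name collisions in the target storage.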

docs/docs/core/flow_methods.mdx

Lines changed: 1 addition & 1 deletion
@@ -105,7 +105,7 @@ A data source may enable one or multiple *change capture mechanisms*:
 * Configured with a [refresh interval](flow_def#refresh-interval), which is generally applicable to all data sources.

 * Specific data sources also provide their specific change capture mechanisms.
-For example, [`GoogleDrive` source](../ops/sources#googledrive) allows polling recent modified files.
+For example, [`AmazonS3` source](../ops/sources/#amazons3) watches S3 bucket's change events, and [`GoogleDrive` source](../ops/sources#googledrive) allows polling recent modified files.
 See documentations for specific data sources.

 Change capture mechanisms enable CocoIndex to continuously capture changes from the source data and update the target data accordingly, under live update mode.
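
Note on refresh-based change capture: the behavior described in the `flow_def.mdx` change (list rows, compare metadata such as last modified time, transform only changed source keys) can be sketched as follows. This is an illustrative sketch only: `detect_changes` and the key-to-timestamp dictionaries are hypothetical, and the handling of deleted keys is an added assumption that the diff does not describe.

```python
from typing import Dict, List, Tuple


def detect_changes(
    previous: Dict[str, float], listed: Dict[str, float]
) -> Tuple[List[str], List[str]]:
    """Return (keys to reprocess, keys to delete) from two listings.

    Both arguments map a source key to its last-modified timestamp
    (a hypothetical representation chosen for this sketch).
    """
    # A key needs reprocessing if it is new or its timestamp differs.
    changed = sorted(k for k, ts in listed.items() if previous.get(k) != ts)
    # Assumption: a key missing from the new listing is dropped from the target.
    deleted = sorted(k for k in previous if k not in listed)
    return changed, deleted


previous = {"a.md": 100.0, "b.md": 100.0, "c.md": 100.0}
listed = {"a.md": 100.0, "b.md": 250.0, "d.md": 300.0}
changed, deleted = detect_changes(previous, listed)
```

When the two listings are identical, both results are empty and only the listing cost is paid, which matches the added sentence that an unchanged refresh cycle performs only list operations.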

docs/docs/core/initialization.mdx

Lines changed: 13 additions & 0 deletions
@@ -83,8 +83,20 @@ if __name__ == "__main__":

 `cocoindex.Settings` is used to configure the CocoIndex library. It's a dataclass that contains the following fields:

+* `app_namespace` (type: `str`, required): The namespace of the application.
 * `database` (type: `DatabaseConnectionSpec`, required): The connection to the Postgres database.

+### App Namespace
+
+The `app_namespace` field helps organize flows across different environments (e.g., dev, staging, production), team members, etc. When set, it prefixes flow names with the namespace.
+
+For example, if the namespace is `Staging`, for a flow with name specified as `Flow1` in code, the full name of the flow will be `Staging.Flow1`.
+You can also get the current app namespace by calling `cocoindex.get_app_namespace()` (see [Getting App Namespace](flow_def#getting-app-namespace) for more details).
+
+If not set, all flows are in a default unnamed namespace.
+
+You can also control it by the `COCOINDEX_APP_NAMESPACE` environment variable.
+
 ### DatabaseConnectionSpec

 `DatabaseConnectionSpec` configures the connection to a database. Only Postgres is supported for now. It has the following fields:
@@ -116,6 +128,7 @@ Each setting field has a corresponding environment variable:

 | environment variable | corresponding field in `Settings` | required? |
 |---------------------|-------------------|----------|
+| `COCOINDEX_APP_NAMESPACE` | `app_namespace` | No |
 | `COCOINDEX_DATABASE_URL` | `database.url` | Yes |
 | `COCOINDEX_DATABASE_USER` | `database.user` | No |
 | `COCOINDEX_DATABASE_PASSWORD` | `database.password` | No |
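
Note on the `initialization.mdx` change: the naming rule from the new "App Namespace" section, combined with the `COCOINDEX_APP_NAMESPACE` variable added to the table, can be sketched as below. `full_flow_name` is a hypothetical helper written for illustration. The `.` separator comes from the `Staging.Flow1` example; returning the bare flow name for an empty namespace is an assumption, and how CocoIndex itself reads the variable is not shown in this diff.

```python
import os


def full_flow_name(flow_name: str, app_namespace: str = "") -> str:
    """Hypothetical helper: prefix a flow name with the app namespace."""
    # Assumption: with no namespace set, the flow keeps its bare name
    # (the "default unnamed namespace" in the docs).
    if not app_namespace:
        return flow_name
    return f"{app_namespace}.{flow_name}"


# Read the namespace from the environment, treating "unset" as empty.
namespace = os.environ.get("COCOINDEX_APP_NAMESPACE", "")
print(full_flow_name("Flow1", namespace))
```

Run with `COCOINDEX_APP_NAMESPACE=Staging` set, this sketch prints `Staging.Flow1`, so dev, staging, and production flows defined by the same code stay distinguishable by full name.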
