@@ -9,7 +9,7 @@ An **index** is a collection of data stored in a way that is easy for retrieval.
 CocoIndex is an ETL framework for building indexes from specified data sources, a.k.a. indexing. It also offers utilities for users to retrieve data from the indexes.

-## Indexing Flow
+## Indexing flow

 An indexing flow extracts data from specified data sources, applies specified transformations, and puts the transformed data into specified storage for later retrieval.
@@ -36,7 +36,7 @@ An **operation** in an indexing flow defines a step in the flow. An operation is
 * **Action**, which defines the behavior of the operation, e.g. *import*, *transform*, *for each*, *collect* and *export*.
   See [Flow Definition](flow_def) for more details on each action.

-* Some actions (i.e. "import", "transform" and "export") require an **Operation Spec**, which describes the specific behavior of the operation, e.g. a source to import from, a function describing the transformation behavior, a storage to export to as an index.
+* Some actions (i.e. "import", "transform" and "export") require an **Operation Spec**, which describes the specific behavior of the operation, e.g. a source to import from, a function describing the transformation behavior, a target storage to export to (as an index).
     * Each operation spec has an **operation type**, e.g. `LocalFile` (data source), `SplitRecursively` (function), `SentenceTransformerEmbed` (function), `Postgres` (storage).
     * CocoIndex framework maintains a set of supported operation types. Users can also implement their own.
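The relationship between actions, operation specs and operation types described in this hunk can be pictured as a small data model. The sketch below is only an illustration: `Operation`, `OperationSpec`, `ACTIONS_REQUIRING_SPEC` and the `path` parameter are hypothetical names chosen here, not CocoIndex's API. Only the action names and operation type names come from the text.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Hypothetical model for illustration only; not CocoIndex's actual API.
# Per the text, these three actions require an operation spec.
ACTIONS_REQUIRING_SPEC = {"import", "transform", "export"}


@dataclass
class OperationSpec:
    operation_type: str  # e.g. "LocalFile", "SplitRecursively", "Postgres"
    params: dict[str, Any] = field(default_factory=dict)  # assumed free-form settings


@dataclass
class Operation:
    action: str  # "import", "transform", "for each", "collect" or "export"
    spec: Optional[OperationSpec] = None

    def __post_init__(self) -> None:
        # Reject an operation whose action needs a spec but has none.
        if self.action in ACTIONS_REQUIRING_SPEC and self.spec is None:
            raise ValueError(f"action {self.action!r} requires an operation spec")


# A flow is then an ordered list of operations.
flow = [
    Operation("import", OperationSpec("LocalFile", {"path": "docs/"})),  # "path" is assumed
    Operation("transform", OperationSpec("SplitRecursively")),
    Operation("transform", OperationSpec("SentenceTransformerEmbed")),
    Operation("collect"),
    Operation("export", OperationSpec("Postgres")),
]
```

Keeping the spec separate from the action mirrors the text: the action says what kind of step this is, while the spec says which concrete source, function or storage carries it out.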
@@ -60,31 +60,40 @@ This shows schema and example data for the indexing flow:
-### Life Cycle of an Indexing Flow
+### Life cycle of an indexing flow

-An indexing flow, once set up, maintains a long-lived relationship between source data and indexes. This means:
+An indexing flow, once set up, maintains a long-lived relationship between data source and data in target storage. This means:
+
+1. The target storage created by the flow remains available for querying at any time
+
+2. As source data changes (new data added, existing data updated or deleted), data in the target storage is updated to reflect those changes,
+    at a pace determined by the update mode:
+
+    * **One time update**: Once triggered, CocoIndex updates the target data to reflect the version of source data up to the current moment.
+    * **Live update**: CocoIndex continuously watches the source data and updates the target data accordingly.
+
+    See more details in the [build / update target data](flow_methods#build--update-target-data) section.
+
+3. CocoIndex intelligently manages these updates by:
+    * Determining which parts of the target data need to be recomputed
+    * Reusing existing computations where possible
+    * Only reprocessing the minimum necessary data

-1. The indexes created by the flow remain available for querying at any time
-2. When source data changes, the indexes are automatically updated to reflect those changes
-3. CocoIndex intelligently manages these updates by:
-    - Determining which parts of the index need to be recomputed
-    - Reusing existing computations where possible
-    - Only reprocessing the minimum necessary data

 You can think of an indexing flow as similar to formulas in a spreadsheet:

-- In a spreadsheet, you define formulas that transform input cells into output cells
-- When input values change, the spreadsheet automatically recalculates affected outputs
-- You focus on defining the transformation logic, not managing updates
+* In a spreadsheet, you define formulas that transform input cells into output cells
+* When input values change, the spreadsheet recalculates affected outputs
+* You focus on defining the transformation logic, not managing updates

 CocoIndex works the same way, but with more powerful capabilities:

-- Instead of flat tables, CocoIndex models data in nested data structures, making it more natural to model complex data
-- Instead of simple cell-level formulas, you have operations like "for each" to apply the same formula across rows without repeating yourself
+* Instead of flat tables, CocoIndex models data in nested data structures, making it more natural to model complex data
+* Instead of simple cell-level formulas, you have operations like "for each" to apply the same formula across rows without repeating yourself

-This means when writing your flow operations, you can treat source data as if it were static - focusing purely on defining the transformation logic. CocoIndex takes care of maintaining the dynamic relationship between sources and indexes behind the scenes.
+This means when writing your flow operations, you can treat source data as if it were static - focusing purely on defining the transformation logic. CocoIndex takes care of maintaining the dynamic relationship between sources and target data behind the scenes.

-### Internal Storage
+### Internal storage

 As an indexing flow is long-lived, it needs to store intermediate data to keep track of its state.
 CocoIndex uses internal storage for this purpose.
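To make the life cycle in this hunk concrete, here is a minimal sketch of a one-time update that reprocesses only what changed. It is not CocoIndex's implementation: `fingerprint`, `transform`, `update_once` and the dictionary-based `source`, `target` and `state` are assumptions made for illustration, with `state` playing the role the text assigns to internal storage.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Content hash used to detect whether a source entry changed."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def transform(content: str) -> str:
    """Stand-in for the flow's transformation logic (assumed for illustration)."""
    return content.upper()


def update_once(source: dict, target: dict, state: dict) -> list:
    """Bring `target` in line with the current snapshot of `source`.

    `state` maps each source key to the fingerprint seen at the last update.
    Returns the keys that were actually recomputed.
    """
    recomputed = []
    for key, content in source.items():
        fp = fingerprint(content)
        if state.get(key) == fp:
            continue  # unchanged: reuse the existing target entry
        target[key] = transform(content)  # new or updated: recompute
        state[key] = fp
        recomputed.append(key)
    # Remove target entries whose source entry was deleted.
    for key in list(target):
        if key not in source:
            del target[key]
            state.pop(key, None)
    return recomputed
```

For example, after an initial `update_once` over two files, editing one file and deleting the other causes the next call to recompute only the edited file and drop the deleted one. A live update could be sketched as calling the same function whenever a change notification arrives.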
@@ -94,9 +103,9 @@ See [Initialization](initialization) for configuring its location, and `cocoinde
 ## Retrieval

-There are two ways to retrieve data from indexes built by an indexing flow:
+There are two ways to retrieve data from target storage built by an indexing flow:

-* Query the underlying index storage directly for maximum flexibility.
-* Use CocoIndex *query handlers* for a more convenient experience with built-in tooling support (e.g. CocoInsight) to understand query performance against the index.
+* Query the underlying target storage directly for maximum flexibility.
+* Use CocoIndex *query handlers* for a more convenient experience with built-in tooling support (e.g. CocoInsight) to understand query performance against the target data.

-Query handlers are tied to specific indexing flows. They accept query inputs, transform them by defined operations, and retrieve matching data from the index storage that was created by the flow.
+Query handlers are tied to specific indexing flows. They accept query inputs, transform them by defined operations, and retrieve matching data from the target storage that was created by the flow.
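The query handler behavior described in this hunk (accept a query, transform it with defined operations, retrieve matching data from the target storage) can be sketched with toy components. Everything below is an assumption for illustration and not CocoIndex's API: `embed` is a bag-of-words stand-in for a real embedding function, `target_rows` is an in-memory stand-in for target storage, and `query_handler` is a hypothetical name.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding function: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Rows as an indexing flow might have exported them: (text, vector).
target_rows = [
    (text, embed(text))
    for text in (
        "indexing flow basics",
        "query handlers and retrieval",
        "internal storage layout",
    )
]


def query_handler(query: str, top_k: int = 2) -> list:
    """Transform the query like the indexed data, then return the best matches."""
    query_vector = embed(query)  # same transformation as at indexing time
    scored = [(cosine(query_vector, vector), text) for text, vector in target_rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for score, text in scored[:top_k] if score > 0]
```

The point of the sketch is that the query passes through the same transformation as the indexed data, which is why a handler is tied to one specific flow.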