Commit efdbc34

Update docs for incremental update supports. (#256)
1 parent 523d593 commit efdbc34

File tree

4 files changed

+280
-68
lines changed

docs/docs/core/basics.md

Lines changed: 31 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
---
22
title: Basics
3-
description: CocoIndex Basics
3+
description: "CocoIndex basic concepts: indexing flow, data, operations, data updates, etc."
44
---
55

66
# CocoIndex Basics
@@ -9,7 +9,7 @@ An **index** is a collection of data stored in a way that is easy for retrieval.
99

1010
CocoIndex is an ETL framework for building indexes from specified data sources, a.k.a. indexing. It also offers utilities for users to retrieve data from the indexes.
1111

12-
## Indexing Flow
12+
## Indexing flow
1313

1414
An indexing flow extracts data from specified data sources, applies specified transformations, and puts the transformed data into specified storage for later retrieval.
1515

@@ -36,7 +36,7 @@ An **operation** in an indexing flow defines a step in the flow. An operation is
3636
* **Action**, which defines the behavior of the operation, e.g. *import*, *transform*, *for each*, *collect* and *export*.
3737
See [Flow Definition](flow_def) for more details on each action.
3838

39-
* Some actions (i.e. "import", "transform" and "export") require an **Operation Spec**, which describes the specific behavior of the operation, e.g. a source to import from, a function describing the transformation behavior, a storage to export to as an index.
39+
* Some actions (i.e. "import", "transform" and "export") require an **Operation Spec**, which describes the specific behavior of the operation, e.g. a source to import from, a function describing the transformation behavior, a target storage to export to (as an index).
4040
* Each operation spec has an **operation type**, e.g. `LocalFile` (data source), `SplitRecursively` (function), `SentenceTransformerEmbed` (function), `Postgres` (storage).
4141
* CocoIndex framework maintains a set of supported operation types. Users can also implement their own.
4242

@@ -60,31 +60,40 @@ This shows schema and example data for the indexing flow:
6060

6161
![Data Example](data_example.svg)
6262

63-
### Life Cycle of an Indexing Flow
63+
### Life cycle of an indexing flow
6464

65-
An indexing flow, once set up, maintains a long-lived relationship between source data and indexes. This means:
65+
An indexing flow, once set up, maintains a long-lived relationship between data source and data in target storage. This means:
66+
67+
1. The target storage created by the flow remains available for querying at any time
68+
69+
2. As source data changes (new data added, existing data updated or deleted), data in the target storage is updated to reflect those changes,
70+
   at a pace determined by the update mode:
71+
72+
* **One time update**: Once triggered, CocoIndex updates the target data to reflect the version of source data up to the current moment.
73+
* **Live update**: CocoIndex continuously watches the source data and updates the target data accordingly.
74+
75+
See more details in the [build / update target data](flow_methods#build--update-target-data) section.
76+
77+
3. CocoIndex intelligently manages these updates by:
78+
* Determining which parts of the target data need to be recomputed
79+
* Reusing existing computations where possible
80+
* Only reprocessing the minimum necessary data
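
The idea of reusing existing computations can be illustrated with a minimal sketch in plain Python. This is not CocoIndex's implementation: `MemoizedTransform` and `build_target` are hypothetical names, and the sketch assumes the whole source fits in a dict of key to content.

```python
import hashlib
from typing import Callable, Dict


class MemoizedTransform:
    """Hypothetical helper: caches results by content hash so unchanged content is not recomputed."""

    def __init__(self, fn: Callable[[str], str]) -> None:
        self.fn = fn
        self.cache: Dict[str, str] = {}  # content hash -> transformed value
        self.calls = 0                    # how many times fn actually ran

    def __call__(self, content: str) -> str:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in self.cache:
            self.cache[digest] = self.fn(content)
            self.calls += 1
        return self.cache[digest]


def build_target(source: Dict[str, str], transform: MemoizedTransform) -> Dict[str, str]:
    """Rebuild the target from the source; deleted keys simply drop out."""
    return {key: transform(content) for key, content in source.items()}


transform = MemoizedTransform(str.upper)
first = build_target({"a.txt": "hello", "b.txt": "world"}, transform)
calls_after_first = transform.calls
second = build_target({"a.txt": "hello", "b.txt": "world!", "c.txt": "new"}, transform)
calls_after_second = transform.calls
```

On the second run only the changed `b.txt` and the new `c.txt` trigger the transformation; the result for `a.txt` comes from the cache.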
6681

67-
1. The indexes created by the flow remain available for querying at any time
68-
2. When source data changes, the indexes are automatically updated to reflect those changes
69-
3. CocoIndex intelligently manages these updates by:
70-
- Determining which parts of the index need to be recomputed
71-
- Reusing existing computations where possible
72-
- Only reprocessing the minimum necessary data
7382

7483
You can think of an indexing flow as similar to formulas in a spreadsheet:
7584

76-
- In a spreadsheet, you define formulas that transform input cells into output cells
77-
- When input values change, the spreadsheet automatically recalculates affected outputs
78-
- You focus on defining the transformation logic, not managing updates
85+
* In a spreadsheet, you define formulas that transform input cells into output cells
86+
* When input values change, the spreadsheet recalculates affected outputs
87+
* You focus on defining the transformation logic, not managing updates
7988

8089
CocoIndex works the same way, but with more powerful capabilities:
8190

82-
- Instead of flat tables, CocoIndex models data in nested data structures, making it more natural to model complex data
83-
- Instead of simple cell-level formulas, you have operations like "for each" to apply the same formula across rows without repeating yourself
91+
* Instead of flat tables, CocoIndex models data in nested data structures, making it more natural to model complex data
92+
* Instead of simple cell-level formulas, you have operations like "for each" to apply the same formula across rows without repeating yourself
8493

85-
This means when writing your flow operations, you can treat source data as if it were static - focusing purely on defining the transformation logic. CocoIndex takes care of maintaining the dynamic relationship between sources and indexes behind the scenes.
94+
This means when writing your flow operations, you can treat source data as if it were static - focusing purely on defining the transformation logic. CocoIndex takes care of maintaining the dynamic relationship between sources and target data behind the scenes.
8695

87-
### Internal Storage
96+
### Internal storage
8897

8998
As an indexing flow is long-lived, it needs to store intermediate data to keep track of the states.
9099
CocoIndex uses internal storage for this purpose.
@@ -94,9 +103,9 @@ See [Initialization](initialization) for configuring its location, and `cocoinde
94103

95104
## Retrieval
96105

97-
There are two ways to retrieve data from indexes built by an indexing flow:
106+
There are two ways to retrieve data from target storage built by an indexing flow:
98107

99-
* Query the underlying index storage directly for maximum flexibility.
100-
* Use CocoIndex *query handlers* for a more convenient experience with built-in tooling support (e.g. CocoInsight) to understand query performance against the index.
108+
* Query the underlying target storage directly for maximum flexibility.
109+
* Use CocoIndex *query handlers* for a more convenient experience with built-in tooling support (e.g. CocoInsight) to understand query performance against the target data.
101110

102-
Query handlers are tied to specific indexing flows. They accept query inputs, transform them by defined operations, and retrieve matching data from the index storage that was created by the flow.
111+
Query handlers are tied to specific indexing flows. They accept query inputs, transform them by defined operations, and retrieve matching data from the target storage that was created by the flow.

docs/docs/core/flow_def.mdx

Lines changed: 75 additions & 35 deletions
Original file line numberDiff line numberDiff line change
@@ -1,14 +1,15 @@
11
---
22
title: Flow Definition
3-
description: CocoIndex Flow Definition
3+
description: Define a CocoIndex flow by specifying sources, transformations and storages, and connecting their input/output data.
44
---
55

66
import Tabs from '@theme/Tabs';
77
import TabItem from '@theme/TabItem';
88

99
# CocoIndex Flow Definition
1010

11-
In CocoIndex, to define an indexing flow, you provide a function to construct the flow, by adding operations and connecting them with fields.
11+
In CocoIndex, to define an indexing flow, you provide a function that imports source data, transforms it and puts the results into target storage (sinks).
12+
You connect the inputs and outputs of these operations through fields of data scopes.
1213

1314
## Entry Point
1415

@@ -43,7 +44,7 @@ demo_flow = cocoindex.flow.add_flow_def("DemoFlow", demo_flow_def)
4344
```
4445

4546
In both cases, `demo_flow` will be an object of class `cocoindex.Flow`.
46-
See [Flow Methods](/docs/core/flow_methods) for more details on it.
47+
See [Flow Running](/docs/core/flow_methods) for more details on it.
4748

4849
</TabItem>
4950
</Tabs>
@@ -52,7 +53,7 @@ See [Flow Methods](/docs/core/flow_methods) for more details on it.
5253

5354
The `FlowBuilder` object is the starting point to construct a flow.
5455

55-
### Import From Source
56+
### Import from source
5657

5758
`FlowBuilder` provides an `add_source()` method to import data from external sources.
5859
A *source spec* needs to be provided for any import operation, to describe the source and parameters related to the source.
@@ -64,7 +65,7 @@ Import must happen at the top level, and the field created by import must be in
6465
```python
6566
@cocoindex.flow_def(name="DemoFlow")
6667
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
67-
data_scope["documents"] = flow_builder.add_source(DemoSourceSpec(...))
68+
    data_scope["documents"] = flow_builder.add_source(DemoSourceSpec(...))
6869
......
6970
```
7071

@@ -74,17 +75,56 @@ def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataSco
7475
`add_source()` returns a `DataSlice`. Once external data sources are imported, you can further transform them using methods exposed by these data objects, as discussed in the following sections.
7576

7677
We'll describe different data objects in the next few sections.
77-
Note that the actual value of data is not available at the time when we define the flow: it's only available at runtime.
78+
79+
:::note
80+
81+
The actual value of data is not available at the time when we define the flow: it's only available at runtime.
7882
In a flow definition, you can use a data representation as input for operations, but not access the actual value.
7983

84+
:::
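
The distinction between a data representation and an actual value can be sketched in plain Python. This is only an illustration under assumptions, not how CocoIndex implements `DataSlice`; `SliceSketch` and `evaluate` are hypothetical names.

```python
from typing import Callable, List, Optional


class SliceSketch:
    """Hypothetical stand-in for a data representation: it records steps and holds no value."""

    def __init__(self, steps: Optional[List[Callable[[str], str]]] = None) -> None:
        self.steps: List[Callable[[str], str]] = list(steps or [])

    def transform(self, fn: Callable[[str], str]) -> "SliceSketch":
        # Definition time: only remember the function; nothing is computed yet.
        return SliceSketch(self.steps + [fn])

    def evaluate(self, value: str) -> str:
        # Runtime: apply the recorded steps to an actual value.
        for fn in self.steps:
            value = fn(value)
        return value


# Defining the pipeline touches no data.
summary = SliceSketch().transform(str.strip).transform(str.title)

# The actual value only appears when the pipeline is run.
result = summary.evaluate("  hello world  ")
```

While the pipeline is being defined, `summary` carries only the list of steps, which is why a flow definition cannot branch on the data itself.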
85+
86+
#### Refresh interval
87+
88+
You can provide a `refresh_interval` argument.
89+
When present, in the [live update mode](/docs/core/flow_methods#live-update), the data source is refreshed at the specified interval.
90+
91+
<Tabs>
92+
<TabItem value="python" label="Python" default>
93+
94+
The `refresh_interval` argument is of type `datetime.timedelta` (from Python's standard `datetime` module). For example, this refreshes the data source every minute:
95+
96+
```python
97+
@cocoindex.flow_def(name="DemoFlow")
98+
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
99+
    data_scope["documents"] = flow_builder.add_source(
100+
        DemoSourceSpec(...), refresh_interval=datetime.timedelta(minutes=1))
101+
    ......
102+
```
103+
104+
</TabItem>
105+
</Tabs>
106+
107+
:::info
108+
109+
In live update mode, for each refresh, CocoIndex will traverse the data source to figure out the changes,
110+
and only perform transformations on changed source keys.
111+
112+
:::
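
A rough sketch of how one refresh could work out which source keys changed, assuming each traversal yields a mapping from source key to some version marker (for example a modification time or content hash). `diff_snapshots` is a hypothetical helper for illustration, not part of the CocoIndex API.

```python
from typing import Dict, List, Tuple


def diff_snapshots(
    previous: Dict[str, str], current: Dict[str, str]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (added, updated, deleted) source keys between two traversals."""
    added = sorted(k for k in current if k not in previous)
    updated = sorted(k for k in current if k in previous and current[k] != previous[k])
    deleted = sorted(k for k in previous if k not in current)
    return added, updated, deleted


# Version markers are made-up values standing in for mtimes or hashes.
previous = {"a.md": "v1", "b.md": "v1", "c.md": "v1"}
current = {"a.md": "v1", "b.md": "v2", "d.md": "v1"}
added, updated, deleted = diff_snapshots(previous, current)
```

Only the keys in `added` and `updated` would need their transformations rerun, and the keys in `deleted` would have their target rows removed.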
113+
80114
## Data Scope
81115

82116
A **data scope** represents data for a certain unit, e.g. the top level scope (involving all data for a flow), for a document, or for a chunk.
83117
A data scope has a set of fields and collectors, and users can add new fields and collectors to it.
84118

85119
### Get or Add a Field
86120

87-
Get or add a field of a data scope (which is a data slice). Note that you cannot override an existing field.
121+
You can get or add a field of a data scope (which is a data slice).
122+
123+
:::note
124+
125+
You cannot override an existing field.
126+
127+
:::
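
The write-once rule can be mimicked with a small mapping in plain Python. `WriteOnceScope` is a hypothetical class for illustration only, and raising `KeyError` is an assumption: the source does not say which error CocoIndex reports.

```python
from typing import Any, Dict


class WriteOnceScope:
    """Hypothetical mapping that mimics the 'no override' rule for fields."""

    def __init__(self) -> None:
        self._fields: Dict[str, Any] = {}

    def __getitem__(self, name: str) -> Any:
        return self._fields[name]

    def __setitem__(self, name: str, value: Any) -> None:
        # Assumption: the error type is chosen for this sketch only.
        if name in self._fields:
            raise KeyError(f"field {name!r} already exists and cannot be overridden")
        self._fields[name] = value


scope = WriteOnceScope()
scope["documents"] = "source rows"
try:
    scope["documents"] = "something else"
    overridden = True
except KeyError:
    overridden = False
```

Because a field never changes once added, every later operation that reads it sees the same definition.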
88128

89129
<Tabs>
90130
<TabItem value="python" label="Python" default>
@@ -95,20 +135,20 @@ Getting and setting a field of a data scope is done by the `[]` operator with a
95135
@cocoindex.flow_def(name="DemoFlow")
96136
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
97137
98-
# Add "documents" to the top-level data scope.
99-
data_scope["documents"] = flow_builder.add_source(DemoSourceSpec(...))
138+
    # Add "documents" to the top-level data scope.
139+
    data_scope["documents"] = flow_builder.add_source(DemoSourceSpec(...))
100140
101-
# Each row of "documents" is a child scope.
102-
with data_scope["documents"].row() as document:
141+
    # Each row of "documents" is a child scope.
142+
    with data_scope["documents"].row() as document:
103143
104-
# Get "content" from the document scope, transform, and add "summary" to scope.
105-
document["summary"] = field1_row["content"].transform(DemoFunctionSpec(...))
144+
        # Get "content" from the document scope, transform, and add "summary" to scope.
145+
        document["summary"] = document["content"].transform(DemoFunctionSpec(...))
106146
```
107147

108148
</TabItem>
109149
</Tabs>
110150

111-
### Add a Collector
151+
### Add a collector
112152

113153
See [Data Collector](#data-collector) below for more details.
114154

@@ -132,17 +172,17 @@ Other arguments can be passed in as positional arguments or keyword arguments, a
132172
```python
133173
@cocoindex.flow_def(name="DemoFlow")
134174
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
135-
...
136-
data_scope["field2"] = data_scope["field1"].transform(
137-
DemoFunctionSpec(...),
138-
arg1, arg2, ..., key0=kwarg0, key1=kwarg1, ...)
139-
...
175+
    ...
176+
    data_scope["field2"] = data_scope["field1"].transform(
177+
        DemoFunctionSpec(...),
178+
        arg1, arg2, ..., key0=kwarg0, key1=kwarg1, ...)
179+
    ...
140180
```
141181

142182
</TabItem>
143183
</Tabs>
144184

145-
### For Each Row
185+
### For each row
146186

147187
If the data slice has `Table` type, you can call the `row()` method to obtain a child scope representing each row, to apply operations on each row.
148188

@@ -161,7 +201,7 @@ def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataSco
161201
</TabItem>
162202
</Tabs>
163203

164-
### Get a Sub Field
204+
### Get a sub field
165205

166206
If the data slice has `Struct` type, you can obtain a data slice on a specific sub field of it, similar to getting a field of a data scope.
167207

@@ -192,14 +232,14 @@ For example,
192232
```python
193233
@cocoindex.flow_def(name="DemoFlow")
194234
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
195-
...
196-
demo_collector = data_scope.add_collector()
197-
with data_scope["documents"].row() as document:
198235
...
199-
demo_collector.collect(id=cocoindex.GeneratedField.UUID,
200-
filename=document["filename"],
201-
summary=document["summary"])
202-
...
236+
    demo_collector = data_scope.add_collector()
237+
    with data_scope["documents"].row() as document:
238+
        ...
239+
        demo_collector.collect(id=cocoindex.GeneratedField.UUID,
240+
                               filename=document["filename"],
241+
                               summary=document["summary"])
242+
    ...
203243
```
204244

205245
</TabItem>
@@ -228,13 +268,13 @@ Export must happen at the top level of a flow, i.e. not within any child scopes
228268
```python
229269
@cocoindex.flow_def(name="DemoFlow")
230270
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
231-
...
232-
demo_collector = data_scope.add_collector()
233-
...
234-
demo_collector.export(
235-
"demo_storage", DemoStorageSpec(...),
236-
primary_key_fields=["field1"],
237-
vector_index=[("field2", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
271+
    ...
272+
    demo_collector = data_scope.add_collector()
273+
    ...
274+
    demo_collector.export(
275+
        "demo_storage", DemoStorageSpec(...),
276+
        primary_key_fields=["field1"],
277+
        vector_index=[("field2", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
238278
```
239279

240280
</TabItem>
