
Commit 2bb8f7f

Docs: update incremental by partition concepts (#3688)
1 parent f5558b6 commit 2bb8f7f

1 file changed: +38 -14 lines changed

docs/concepts/models/model_kinds.md

+38-14
Original file line number | Diff line number | Diff line change
@@ -121,9 +121,9 @@ WHERE
121121
```
122122

123123
### Idempotency
124-
It is recommended that queries of models of this kind are [idempotent](../glossary.md#idempotency) to prevent unexpected results during data [restatement](../plans.md#restatement-plans).
124+
We recommend making sure incremental by time range model queries are [idempotent](../glossary.md#idempotency) to prevent unexpected results during data [restatement](../plans.md#restatement-plans).
125125

126-
Note, however, that upstream models and tables can impact a model's idempotency. For example, referencing an upstream model of kind [FULL](#full) in the model query automatically causes the model to be non-idempotent.
126+
Note, however, that upstream models and tables can impact a model's idempotency. For example, referencing an upstream model of kind [FULL](#full) in the model query automatically causes the model to be non-idempotent because its data could change on every model execution.
127127

128128
### Materialization strategy
129129
Depending on the target engine, models of the `INCREMENTAL_BY_TIME_RANGE` kind are materialized using the following strategies:
@@ -142,11 +142,25 @@ Depending on the target engine, models of the `INCREMENTAL_BY_TIME_RANGE` kind a
142142

143143
Models of the `INCREMENTAL_BY_PARTITION` kind are computed incrementally based on partition. A set of columns defines the model's partitioning key, and a partition is the group of rows with the same partitioning key value.
144144

145-
This model kind is designed for the scenario where data rows should be loaded and updated as a group based on their shared value for the partitioning key. This kind may be used with any SQL engine; SQLMesh will automatically create partitioned tables on engines that support explicit table partitioning (e.g., [BigQuery](https://cloud.google.com/bigquery/docs/creating-partitioned-tables), [Databricks](https://docs.databricks.com/en/sql/language-manual/sql-ref-partition.html)).
145+
!!! info "Should you use this model kind?"
146146

147-
If a partitioning key in newly loaded data is not present in the model table, the new partitioning key and its data rows are inserted. If a partitioning key in newly loaded data is already present in the model table, **all the partitioning key's existing data rows in the model table are replaced** with the partitioning key's data rows in the newly loaded data. If a partitioning key is present in the model table but not present in the newly loaded data, the partitioning key's existing data rows are not modified and remain in the model table.
147+
Any model kind can use a partitioned table by specifying the [`partitioned_by` key](../models/overview.md#partitioned_by) in the `MODEL` DDL. The "partition" in `INCREMENTAL_BY_PARTITION` is about how the data is **loaded** when the model runs.
148148

149-
This kind is a good fit for datasets that have the following traits:
149+
`INCREMENTAL_BY_PARTITION` models are inherently [non-idempotent](../glossary.md#idempotency), so restatements and other actions can cause data loss. This makes them more complex to manage than other model kinds.
150+
151+
In most scenarios, an `INCREMENTAL_BY_TIME_RANGE` model can meet your needs and will be easier to manage. The `INCREMENTAL_BY_PARTITION` model kind should only be used when the data must be loaded by partition (usually for performance reasons).
152+
153+
This model kind is designed for the scenario where data rows should be loaded and updated as a group based on their shared value for the partitioning key.
154+
155+
It may be used with any SQL engine. SQLMesh will automatically create partitioned tables on engines that support explicit table partitioning (e.g., [BigQuery](https://cloud.google.com/bigquery/docs/creating-partitioned-tables), [Databricks](https://docs.databricks.com/en/sql/language-manual/sql-ref-partition.html)).
156+
157+
New rows are loaded based on their partitioning key value:
158+
159+
- If a partitioning key in newly loaded data is not present in the model table, the new partitioning key and its data rows are inserted.
160+
- If a partitioning key in newly loaded data is already present in the model table, **all the partitioning key's existing data rows in the model table are replaced** with the partitioning key's data rows in the newly loaded data.
161+
- If a partitioning key is present in the model table but not present in the newly loaded data, the partitioning key's existing data rows are not modified and remain in the model table.
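
The three loading rules above can be sketched in a few lines of Python. This is only an illustration of the semantics, not how SQLMesh implements them: the `load_by_partition` function and the in-memory lists of row dictionaries are hypothetical stand-ins for the model table and the newly loaded data.

```python
def load_by_partition(table, new_rows, key):
    """Return the table contents after loading new_rows by partition.

    Hypothetical helper for illustration only; rows are plain dicts.
    """
    # Partitioning key values that appear in the newly loaded data.
    incoming_keys = {row[key] for row in new_rows}

    # Existing rows survive only if their partition is absent from the new data.
    kept = [row for row in table if row[key] not in incoming_keys]

    # New partitions are inserted; partitions already present are fully replaced.
    return kept + list(new_rows)


existing = [
    {"user_segment": "A", "value": 1},
    {"user_segment": "A", "value": 2},
    {"user_segment": "B", "value": 3},
]
incoming = [
    {"user_segment": "A", "value": 10},  # replaces both existing "A" rows
    {"user_segment": "C", "value": 20},  # new partition, inserted
]

result = load_by_partition(existing, incoming, key="user_segment")
```

After the load, partition `B` is untouched, both original `A` rows are gone in favor of the single new `A` row, and partition `C` has been added.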
162+
163+
This kind should only be used for datasets that have the following traits:
150164

151165
* The dataset's records can be grouped by a partitioning key.
152166
* Each record has a partitioning key associated with it.
@@ -183,12 +197,22 @@ MODEL (
183197
);
184198
```
185199

186-
This is a fuller example of how you would use this model kind in practice to avoid backfilling too many partitions and/or limiting the partitions to backfill based on time ranges.
200+
!!! warning "Only full restatements supported"
201+
202+
Partial data [restatements](../plans.md#restatement-plans) are used to reprocess part of a table's data (usually a limited time range).
203+
204+
Partial data restatement is not supported for `INCREMENTAL_BY_PARTITION` models. If you restate an `INCREMENTAL_BY_PARTITION` model, its entire table will be recreated from scratch.
205+
206+
Restating `INCREMENTAL_BY_PARTITION` models may lead to data loss and should be performed with care.
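
As a rough illustration of one way a full restatement can lose data, the sketch below assumes a model whose query only returns the partitions that changed recently. The `run_model` and `restate_model` functions and the sample rows are hypothetical and are not part of SQLMesh.

```python
def run_model(table, query_result, key="user_segment"):
    """Incremental run: replace only the partitions present in query_result."""
    incoming = {row[key] for row in query_result}
    return [row for row in table if row[key] not in incoming] + list(query_result)


def restate_model(query_result):
    """Full restatement: the table is rebuilt from the current query result alone."""
    return list(query_result)


# Hypothetical query results on two consecutive days (assumption: the query
# only returns partitions that changed on that day).
day_1 = [{"user_segment": "Power User", "users": 10}]
day_2 = [{"user_segment": "New User", "users": 3}]

# Two normal runs accumulate both partitions in the table.
table = run_model([], day_1)
table = run_model(table, day_2)
segments_before = {row["user_segment"] for row in table}

# Restating on day 2 recreates the table from what the query returns now.
restated = restate_model(day_2)
segments_after = {row["user_segment"] for row in restated}

# Partitions that existed before the restatement but not after it.
lost_segments = segments_before - segments_after
```

Under these assumptions, the partition loaded on day 1 is no longer returned by the query, so it is missing from the restated table.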
207+
208+
### Example
209+
210+
This is a fuller example of how you would use this model kind in practice. It limits the number of partitions to backfill based on time range in the `partitions_to_update` CTE.
187211

188212
```sql linenums="1"
189213
MODEL (
190214
name demo.incremental_by_partition_demo,
191-
kind INCREMENTAL_BY_PARTITION,
215+
kind INCREMENTAL_BY_PARTITION,
192216
partitioned_by user_segment,
193217
);
194218

@@ -221,7 +245,7 @@ SELECT
221245
usage_count,
222246
feature_utilization_score,
223247
user_segment,
224-
CASE
248+
CASE
225249
WHEN usage_count > 100 AND feature_utilization_score > 0.7 THEN 'Power User'
226250
WHEN usage_count > 50 THEN 'Regular User'
227251
WHEN usage_count IS NULL THEN 'New User'
@@ -357,18 +381,18 @@ Redshift supports only the `UPDATE` or `DELETE` actions for the `WHEN MATCHED` c
357381

358382
### Merge Filter Expression
359383

360-
The `MERGE` statement typically induces a full table scan of the existing table, which can be problematic with large data volumes.
384+
The `MERGE` statement typically induces a full table scan of the existing table, which can be problematic with large data volumes.
361385

362386
Prevent a full table scan by passing filtering conditions to the `merge_filter` parameter.
363387

364-
The `merge_filter` accepts a single or a conjunction of predicates to be used in the `ON` clause of the `MERGE` operation:
388+
The `merge_filter` accepts a single or a conjunction of predicates to be used in the `ON` clause of the `MERGE` operation:
365389

366390
```sql linenums="1" hl_lines="5"
367391
MODEL (
368392
name db.employee_contracts,
369393
kind INCREMENTAL_BY_UNIQUE_KEY (
370394
unique_key id,
371-
merge_filter source._operation IS NULL AND target.contract_date > dateadd(day, -7, current_date)
395+
merge_filter source._operation IS NULL AND target.contract_date > dateadd(day, -7, current_date)
372396
)
373397
);
374398
```
@@ -935,7 +959,7 @@ GROUP BY
935959

936960
### Reset SCD Type 2 Model (clearing history)
937961

938-
SCD Type 2 models are designed by default to protect the data that has been captured because it is not possible to recreate the history once it has been lost.
962+
SCD Type 2 models are designed by default to protect the data that has been captured because it is not possible to recreate the history once it has been lost.
939963
However, there are cases where you may want to clear the history and start fresh.
940964
For this use case you will want to start by setting `disable_restatement` to `false` in the model definition.
941965

@@ -949,9 +973,9 @@ MODEL (
949973
);
950974
```
951975

952-
Plan/apply this change to production.
976+
Plan/apply this change to production.
953977
Then you will want to [restate the model](../plans.md#restatement-plans).
954-
978+
955979
```bash
956980
sqlmesh plan --restate-model db.menu_items
957981
```
