docs/concepts/models/model_kinds.md (+38 -14)
@@ -121,9 +121,9 @@ WHERE
```

### Idempotency
-It is recommended that queries of models of this kind are [idempotent](../glossary.md#idempotency) to prevent unexpected results during data [restatement](../plans.md#restatement-plans).
+We recommend making incremental by time range model queries [idempotent](../glossary.md#idempotency) to prevent unexpected results during data [restatement](../plans.md#restatement-plans).

-Note, however, that upstream models and tables can impact a model's idempotency. For example, referencing an upstream model of kind [FULL](#full) in the model query automatically causes the model to be non-idempotent.
+Note, however, that upstream models and tables can impact a model's idempotency. For example, referencing an upstream model of kind [FULL](#full) in the model query automatically causes the model to be non-idempotent because its data could change on every model execution.
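+
+A rough illustration (the `raw.events` table and its columns are illustrative): filtering on the `@start_ds` / `@end_ds` macros that SQLMesh provides to incremental by time range models keeps a query idempotent, while filtering relative to the current clock does not:
+
+```sql linenums="1"
+-- Idempotent: re-running the model for the same interval returns the same rows
+SELECT id, event_date
+FROM raw.events
+WHERE event_date BETWEEN @start_ds AND @end_ds;
+
+-- Not idempotent: the result depends on when the model happens to run
+SELECT id, event_date
+FROM raw.events
+WHERE event_date > CURRENT_DATE - INTERVAL '1' DAY;
+```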

### Materialization strategy
Depending on the target engine, models of the `INCREMENTAL_BY_TIME_RANGE` kind are materialized using the following strategies:
@@ -142,11 +142,25 @@ Depending on the target engine, models of the `INCREMENTAL_BY_TIME_RANGE` kind a

Models of the `INCREMENTAL_BY_PARTITION` kind are computed incrementally based on partition. A set of columns defines the model's partitioning key, and a partition is the group of rows with the same partitioning key value.

-This model kind is designed for the scenario where data rows should be loaded and updated as a group based on their shared value for the partitioning key. This kind may be used with any SQL engine; SQLMesh will automatically create partitioned tables on engines that support explicit table partitioning (e.g., [BigQuery](https://cloud.google.com/bigquery/docs/creating-partitioned-tables), [Databricks](https://docs.databricks.com/en/sql/language-manual/sql-ref-partition.html)).
+!!! info "Should you use this model kind?"

-If a partitioning key in newly loaded data is not present in the model table, the new partitioning key and its data rows are inserted. If a partitioning key in newly loaded data is already present in the model table, **all the partitioning key's existing data rows in the model table are replaced** with the partitioning key's data rows in the newly loaded data. If a partitioning key is present in the model table but not present in the newly loaded data, the partitioning key's existing data rows are not modified and remain in the model table.
+    Any model kind can use a partitioned table by specifying the [`partitioned_by` key](../models/overview.md#partitioned_by) in the `MODEL` DDL. The "partition" in `INCREMENTAL_BY_PARTITION` is about how the data is **loaded** when the model runs.
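+
+    For instance, a `FULL` model (the name and partition column here are illustrative) can still write to a partitioned table:
+
+    ```sql linenums="1"
+    MODEL (
+      name demo.full_model_partitioned,
+      kind FULL,
+      partitioned_by user_segment,
+    );
+    ```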

-This kind is a good fit for datasets that have the following traits:
+    `INCREMENTAL_BY_PARTITION` models are inherently [non-idempotent](../glossary.md#idempotency), so restatements and other actions can cause data loss. This makes them more complex to manage than other model kinds.
+
+    In most scenarios, an `INCREMENTAL_BY_TIME_RANGE` model can meet your needs and will be easier to manage. The `INCREMENTAL_BY_PARTITION` model kind should only be used when the data must be loaded by partition (usually for performance reasons).
+
+This model kind is designed for the scenario where data rows should be loaded and updated as a group based on their shared value for the partitioning key.
+
+It may be used with any SQL engine. SQLMesh will automatically create partitioned tables on engines that support explicit table partitioning (e.g., [BigQuery](https://cloud.google.com/bigquery/docs/creating-partitioned-tables), [Databricks](https://docs.databricks.com/en/sql/language-manual/sql-ref-partition.html)).
+
+New rows are loaded based on their partitioning key value:
+
+- If a partitioning key in newly loaded data is not present in the model table, the new partitioning key and its data rows are inserted.
+- If a partitioning key in newly loaded data is already present in the model table, **all the partitioning key's existing data rows in the model table are replaced** with the partitioning key's data rows in the newly loaded data.
+- If a partitioning key is present in the model table but not present in the newly loaded data, the partitioning key's existing data rows are not modified and remain in the model table.
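+
+For example (segment names invented for illustration): if the existing table holds partitions `basic` and `premium`, and a new run loads data for `premium` and `trial`, then `premium` is replaced wholesale, `trial` is inserted, and `basic` is left untouched.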
+
+This kind should only be used for datasets that have the following traits:

* The dataset's records can be grouped by a partitioning key.
* Each record has a partitioning key associated with it.
@@ -183,12 +197,22 @@ MODEL (
);
```

-This is a fuller example of how you would use this model kind in practice to avoid backfilling too many partitions and/or limiting the partitions to backfill based on time ranges.
+!!! warning "Only full restatements supported"
+
+    Partial data [restatements](../plans.md#restatement-plans) are used to reprocess part of a table's data (usually a limited time range).
+
+    Partial data restatement is not supported for `INCREMENTAL_BY_PARTITION` models. If you restate an `INCREMENTAL_BY_PARTITION` model, its entire table will be recreated from scratch.
+
+    Restating `INCREMENTAL_BY_PARTITION` models may lead to data loss and should be performed with care.
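+
+    If you do need to restate one, the request looks something like `sqlmesh plan --restate-model demo.incremental_by_partition_demo` (using the example model defined below); the entire table is rebuilt when the plan is applied.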
+
+### Example
+
+This is a fuller example of how you would use this model kind in practice. It limits the number of partitions to backfill based on time range in the `partitions_to_update` CTE.

```sql linenums="1"
MODEL (
  name demo.incremental_by_partition_demo,
-  kind INCREMENTAL_BY_PARTITION,
+  kind INCREMENTAL_BY_PARTITION,
  partitioned_by user_segment,
);
@@ -221,7 +245,7 @@ SELECT
  usage_count,
  feature_utilization_score,
  user_segment,
-  CASE
+  CASE
    WHEN usage_count > 100 AND feature_utilization_score > 0.7 THEN 'Power User'
    WHEN usage_count > 50 THEN 'Regular User'
    WHEN usage_count IS NULL THEN 'New User'
@@ -357,18 +381,18 @@ Redshift supports only the `UPDATE` or `DELETE` actions for the `WHEN MATCHED` c

### Merge Filter Expression

-The `MERGE` statement typically induces a full table scan of the existing table, which can be problematic with large data volumes.
+The `MERGE` statement typically induces a full table scan of the existing table, which can be problematic with large data volumes.

Prevent a full table scan by passing filtering conditions to the `merge_filter` parameter.

-The `merge_filter` accepts a single or a conjunction of predicates to be used in the `ON` clause of the `MERGE` operation:
+The `merge_filter` accepts a single predicate or a conjunction of predicates to be used in the `ON` clause of the `MERGE` operation:

```sql linenums="1" hl_lines="5"
MODEL (
  name db.employee_contracts,
  kind INCREMENTAL_BY_UNIQUE_KEY (
    unique_key id,
-    merge_filter source._operation IS NULL AND target.contract_date > dateadd(day, -7, current_date)
+    merge_filter source._operation IS NULL AND target.contract_date > dateadd(day, -7, current_date)
  )
);
```
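+
+Conceptually (a simplified sketch, not the exact SQL SQLMesh emits; aliasing and quoting will differ), the `merge_filter` predicates are appended to the unique key match in the `ON` clause so the engine can prune rows before matching:
+
+```sql linenums="1"
+MERGE INTO db.employee_contracts AS target
+USING (...) AS source
+ON target.id = source.id
+  AND source._operation IS NULL
+  AND target.contract_date > dateadd(day, -7, current_date)
+WHEN MATCHED THEN UPDATE SET ...
+WHEN NOT MATCHED THEN INSERT ...;
+```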
@@ -935,7 +959,7 @@ GROUP BY

### Reset SCD Type 2 Model (clearing history)

-SCD Type 2 models are designed by default to protect the data that has been captured because it is not possible to recreate the history once it has been lost.
+SCD Type 2 models are designed by default to protect the data that has been captured because it is not possible to recreate the history once it has been lost.
However, there are cases where you may want to clear the history and start fresh.
For this use case, you will want to start by setting `disable_restatement` to `false` in the model definition.
@@ -949,9 +973,9 @@ MODEL (
);
```

-Plan/apply this change to production.
+Plan/apply this change to production.
0 commit comments