docs/concepts/models/model_kinds.md
+133 -131
@@ -138,137 +138,6 @@ Depending on the target engine, models of the `INCREMENTAL_BY_TIME_RANGE` kind a
138
138
| Postgres | DELETE by time range, then INSERT |
139
139
| DuckDB | DELETE by time range, then INSERT |
140
140
141
-
## INCREMENTAL_BY_PARTITION
142
-
143
-
Models of the `INCREMENTAL_BY_PARTITION` kind are computed incrementally based on partition. A set of columns defines the model's partitioning key, and a partition is the group of rows with the same partitioning key value.
144
-
145
-
!!! info "Should you use this model kind?"
146
-
147
-
Any model kind can use a partitioned table by specifying the [`partitioned_by` key](../models/overview.md#partitioned_by) in the `MODEL` DDL. The "partition" in `INCREMENTAL_BY_PARTITION` is about how the data is **loaded** when the model runs.
148
-
149
-
`INCREMENTAL_BY_PARTITION` models are inherently [non-idempotent](../glossary.md#idempotency), so restatements and other actions can cause data loss. This makes them more complex to manage than other model kinds.
150
-
151
-
In most scenarios, an `INCREMENTAL_BY_TIME_RANGE` model can meet your needs and will be easier to manage. The `INCREMENTAL_BY_PARTITION` model kind should only be used when the data must be loaded by partition (usually for performance reasons).
152
-
153
-
This model kind is designed for the scenario where data rows should be loaded and updated as a group based on their shared value for the partitioning key.
154
-
155
-
It may be used with any SQL engine. SQLMesh will automatically create partitioned tables on engines that support explicit table partitioning (e.g., [BigQuery](https://cloud.google.com/bigquery/docs/creating-partitioned-tables), [Databricks](https://docs.databricks.com/en/sql/language-manual/sql-ref-partition.html)).
156
-
157
-
New rows are loaded based on their partitioning key value:
158
-
159
-
- If a partitioning key in newly loaded data is not present in the model table, the new partitioning key and its data rows are inserted.
160
-
- If a partitioning key in newly loaded data is already present in the model table, **all the partitioning key's existing data rows in the model table are replaced** with the partitioning key's data rows in the newly loaded data.
161
-
- If a partitioning key is present in the model table but not present in the newly loaded data, the partitioning key's existing data rows are not modified and remain in the model table.
162
-
163
-
This kind should only be used for datasets that have the following traits:
164
-
165
-
* The dataset's records can be grouped by a partitioning key.
166
-
* Each record has a partitioning key associated with it.
167
-
* It is appropriate to upsert records, so existing records can be overwritten by new arrivals when their partitioning keys match.
168
-
* All existing records associated with a given partitioning key can be removed or overwritten when any new record has the partitioning key value.
169
-
170
-
The column defining the partitioning key is specified in the model's `MODEL` DDL `partitioned_by` key. This example shows the `MODEL` DDL for an `INCREMENTAL_BY_PARTITION` model whose partition key is the row's value for the `region` column:
171
-
172
-
```sql linenums="1" hl_lines="4"
173
-
MODEL (
174
-
name db.events,
175
-
kind INCREMENTAL_BY_PARTITION,
176
-
partitioned_by region,
177
-
);
178
-
```
179
-
180
-
Compound partition keys are also supported, such as `region` and `department`:
181
-
182
-
```sql linenums="1" hl_lines="4"
183
-
MODEL (
184
-
name db.events,
185
-
kind INCREMENTAL_BY_PARTITION,
186
-
partitioned_by (region, department),
187
-
);
188
-
```
189
-
190
-
Date and/or timestamp column expressions are also supported (varies by SQL engine). This BigQuery example's partition key is based on the month each row's `event_date` occurred:
191
-
192
-
```sql linenums="1" hl_lines="4"
193
-
MODEL (
194
-
name db.events,
195
-
kind INCREMENTAL_BY_PARTITION,
196
-
partitioned_by DATETIME_TRUNC(event_date, MONTH)
197
-
);
198
-
```
199
-
200
-
!!! warning "Only full restatements supported"
201
-
202
-
Partial data [restatements](../plans.md#restatement-plans) are used to reprocess part of a table's data (usually a limited time range).
203
-
204
-
Partial data restatement is not supported for `INCREMENTAL_BY_PARTITION` models. If you restate an `INCREMENTAL_BY_PARTITION` model, its entire table will be recreated from scratch.
205
-
206
-
Restating `INCREMENTAL_BY_PARTITION` models may lead to data loss and should be performed with care.
207
-
208
-
### Example
209
-
210
-
This is a fuller example of how you would use this model kind in practice. It limits the number of partitions to backfill based on time range in the `partitions_to_update` CTE.
211
-
212
-
```sql linenums="1"
213
-
MODEL (
214
-
name demo.incremental_by_partition_demo,
215
-
kind INCREMENTAL_BY_PARTITION,
216
-
partitioned_by user_segment,
217
-
);
218
-
219
-
-- This is the source of truth for what partitions need to be updated and will join to the product usage data
220
-
-- This could be an INCREMENTAL_BY_TIME_RANGE model that reads in the user_segment values last updated in the past 30 days to reduce scope
221
-
-- Use this strategy to reduce full restatements
222
-
WITH partitions_to_update AS (
223
-
SELECT DISTINCT
224
-
user_segment
225
-
FROM demo.incremental_by_time_range_demo -- upstream table tracking which user segments to update
226
-
WHERE last_updated_at BETWEEN DATE_SUB(@start_dt, INTERVAL 30 DAY) AND @end_dt
227
-
),
228
-
229
-
product_usage AS (
230
-
SELECT
231
-
product_id,
232
-
customer_id,
233
-
last_usage_date,
234
-
usage_count,
235
-
feature_utilization_score,
236
-
user_segment
237
-
FROM sqlmesh-public-demo.tcloud_raw_data.product_usage
238
-
WHERE user_segment IN (SELECT user_segment FROM partitions_to_update) -- partition filter applied here
239
-
)
240
-
241
-
SELECT
242
-
product_id,
243
-
customer_id,
244
-
last_usage_date,
245
-
usage_count,
246
-
feature_utilization_score,
247
-
user_segment,
248
-
CASE
249
-
WHEN usage_count > 100 AND feature_utilization_score > 0.7 THEN 'Power User'
250
-
WHEN usage_count > 50 THEN 'Regular User'
251
-
WHEN usage_count IS NULL THEN 'New User'
252
-
ELSE 'Light User'
253
-
END as user_type
254
-
FROM product_usage
255
-
```
256
-
257
-
**Note**: Partial data [restatement](../plans.md#restatement-plans) is not supported for this model kind, which means that the entire table will be recreated from scratch if restated. This may lead to data loss.
258
-
259
-
### Materialization strategy
260
-
Depending on the target engine, models of the `INCREMENTAL_BY_PARTITION` kind are materialized using the following strategies:
| Databricks | REPLACE WHERE by partitioning key |
265
-
| Spark | INSERT OVERWRITE by partitioning key |
266
-
| Snowflake | DELETE by partitioning key, then INSERT |
267
-
| BigQuery | DELETE by partitioning key, then INSERT |
268
-
| Redshift | DELETE by partitioning key, then INSERT |
269
-
| Postgres | DELETE by partitioning key, then INSERT |
270
-
| DuckDB | DELETE by partitioning key, then INSERT |
271
-
272
141
## INCREMENTAL_BY_UNIQUE_KEY
273
142
274
143
Models of the `INCREMENTAL_BY_UNIQUE_KEY` kind are computed incrementally based on a key that is unique for each data row.
@@ -1018,3 +887,136 @@ Due to there being no standard, each vendor has a different implementation with
1018
887
We would recommend using standard SQLMesh model types in the first instance. However, if you do need to use Managed models, you still gain other SQLMesh benefits like the ability to use them in [virtual environments](../../concepts/overview#build-a-virtual-environment).
1019
888
1020
889
See [Managed Models](./managed_models.md) for more information on which engines are supported and which properties are available.
890
+
891
+
## INCREMENTAL_BY_PARTITION
892
+
893
+
Models of the `INCREMENTAL_BY_PARTITION` kind are computed incrementally based on partition. A set of columns defines the model's partitioning key, and a partition is the group of rows with the same partitioning key value.
894
+
895
+
!!! question "Should you use this model kind?"
896
+
897
+
Any model kind can use a partitioned **table** by specifying the [`partitioned_by` key](../models/overview.md#partitioned_by) in the `MODEL` DDL.
898
+
899
+
The "partition" in `INCREMENTAL_BY_PARTITION` is about how the data is **loaded** when the model runs.
900
+
901
+
`INCREMENTAL_BY_PARTITION` models are inherently [non-idempotent](../glossary.md#idempotency), so restatements and other actions can cause data loss. This makes them more complex to manage than other model kinds.
902
+
903
+
In most scenarios, an `INCREMENTAL_BY_TIME_RANGE` model can meet your needs and will be easier to manage. The `INCREMENTAL_BY_PARTITION` model kind should only be used when the data must be loaded by partition (usually for performance reasons).
904
+
905
+
This model kind is designed for the scenario where data rows should be loaded and updated as a group based on their shared value for the partitioning key.
906
+
907
+
It may be used with any SQL engine. SQLMesh will automatically create partitioned tables on engines that support explicit table partitioning (e.g., [BigQuery](https://cloud.google.com/bigquery/docs/creating-partitioned-tables), [Databricks](https://docs.databricks.com/en/sql/language-manual/sql-ref-partition.html)).
908
+
909
+
New rows are loaded based on their partitioning key value:
910
+
911
+
- If a partitioning key in newly loaded data is not present in the model table, the new partitioning key and its data rows are inserted.
912
+
- If a partitioning key in newly loaded data is already present in the model table, **all the partitioning key's existing data rows in the model table are replaced** with the partitioning key's data rows in the newly loaded data.
913
+
- If a partitioning key is present in the model table but not present in the newly loaded data, the partitioning key's existing data rows are not modified and remain in the model table.
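The three loading rules above can be sketched in plain Python. This is only an illustration of the semantics, not how SQLMesh or any engine implements them. The function `load_by_partition` and the sample rows are hypothetical names invented for this sketch.

```python
from collections import defaultdict


def load_by_partition(table, new_rows, key):
    """Return the table contents after loading new_rows partition by partition.

    Hypothetical helper for illustration only; it mimics the documented rules
    and is not part of SQLMesh.
    """
    # Group the newly loaded rows by their partitioning key value.
    incoming = defaultdict(list)
    for row in new_rows:
        incoming[row[key]].append(row)

    # Rule 3: partitions absent from the new data are left untouched.
    kept = [row for row in table if row[key] not in incoming]

    # Rules 1 and 2: every incoming partition is written in full, which
    # inserts new partitions and replaces existing ones wholesale.
    return kept + [row for rows in incoming.values() for row in rows]


table = [
    {"region": "emea", "event": "a"},
    {"region": "emea", "event": "b"},
    {"region": "apac", "event": "c"},
]
new_rows = [
    {"region": "emea", "event": "d"},  # existing key: both old emea rows are replaced
    {"region": "amer", "event": "e"},  # new key: inserted
]

result = load_by_partition(table, new_rows, "region")
```

Note that the single new `emea` row replaces both existing `emea` rows, so this load pattern can drop rows that are not re-supplied.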
914
+
915
+
This kind should only be used for datasets that have the following traits:
916
+
917
+
* The dataset's records can be grouped by a partitioning key.
918
+
* Each record has a partitioning key associated with it.
919
+
* It is appropriate to upsert records, so existing records can be overwritten by new arrivals when their partitioning keys match.
920
+
* All existing records associated with a given partitioning key can be removed or overwritten when any new record has the partitioning key value.
921
+
922
+
The column defining the partitioning key is specified in the model's `MODEL` DDL `partitioned_by` key. This example shows the `MODEL` DDL for an `INCREMENTAL_BY_PARTITION` model whose partition key is the row's value for the `region` column:
923
+
924
+
```sql linenums="1" hl_lines="4"
925
+
MODEL (
926
+
name db.events,
927
+
kind INCREMENTAL_BY_PARTITION,
928
+
partitioned_by region,
929
+
);
930
+
```
931
+
932
+
Compound partition keys are also supported, such as `region` and `department`:
933
+
934
+
```sql linenums="1" hl_lines="4"
935
+
MODEL (
936
+
name db.events,
937
+
kind INCREMENTAL_BY_PARTITION,
938
+
partitioned_by (region, department),
939
+
);
940
+
```
941
+
942
+
Date and/or timestamp column expressions are also supported (varies by SQL engine). This BigQuery example's partition key is based on the month each row's `event_date` occurred:
943
+
944
+
```sql linenums="1" hl_lines="4"
945
+
MODEL (
946
+
name db.events,
947
+
kind INCREMENTAL_BY_PARTITION,
948
+
partitioned_by DATETIME_TRUNC(event_date, MONTH)
949
+
);
950
+
```
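To see how an expression-based key groups rows, the following Python sketch truncates dates to the first day of their month, which is the grouping that a month-level truncation implies. It assumes `event_date` holds plain dates; `month_partition_key` is a hypothetical helper for illustration and does not reproduce BigQuery's function.

```python
from datetime import date


def month_partition_key(event_date: date) -> date:
    """Map a date to the first day of its month (illustrative only)."""
    return event_date.replace(day=1)


event_dates = [date(2024, 1, 3), date(2024, 1, 28), date(2024, 2, 1)]

# Rows whose dates fall in the same month share one partitioning key.
keys = {month_partition_key(d) for d in event_dates}
```

Under this grouping, a new row dated anywhere in January would cause all existing January rows to be replaced on the next load.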
951
+
952
+
!!! warning "Only full restatements supported"
953
+
954
+
Partial data [restatements](../plans.md#restatement-plans) are used to reprocess part of a table's data (usually a limited time range).
955
+
956
+
Partial data restatement is not supported for `INCREMENTAL_BY_PARTITION` models. If you restate an `INCREMENTAL_BY_PARTITION` model, its entire table will be recreated from scratch.
957
+
958
+
Restating `INCREMENTAL_BY_PARTITION` models may lead to data loss and should be performed with care.
959
+
960
+
### Example
961
+
962
+
This is a fuller example of how you would use this model kind in practice. It limits the number of partitions to backfill based on time range in the `partitions_to_update` CTE.
963
+
964
+
```sql linenums="1"
965
+
MODEL (
966
+
name demo.incremental_by_partition_demo,
967
+
kind INCREMENTAL_BY_PARTITION,
968
+
partitioned_by user_segment,
969
+
);
970
+
971
+
-- This is the source of truth for what partitions need to be updated and will join to the product usage data
972
+
-- This could be an INCREMENTAL_BY_TIME_RANGE model that reads in the user_segment values last updated in the past 30 days to reduce scope
973
+
-- Use this strategy to reduce full restatements
974
+
WITH partitions_to_update AS (
975
+
SELECT DISTINCT
976
+
user_segment
977
+
FROM demo.incremental_by_time_range_demo -- upstream table tracking which user segments to update
978
+
WHERE last_updated_at BETWEEN DATE_SUB(@start_dt, INTERVAL 30 DAY) AND @end_dt
979
+
),
980
+
981
+
product_usage AS (
982
+
SELECT
983
+
product_id,
984
+
customer_id,
985
+
last_usage_date,
986
+
usage_count,
987
+
feature_utilization_score,
988
+
user_segment
989
+
FROM sqlmesh-public-demo.tcloud_raw_data.product_usage
990
+
WHERE user_segment IN (SELECT user_segment FROM partitions_to_update) -- partition filter applied here
991
+
)
992
+
993
+
SELECT
994
+
product_id,
995
+
customer_id,
996
+
last_usage_date,
997
+
usage_count,
998
+
feature_utilization_score,
999
+
user_segment,
1000
+
CASE
1001
+
WHEN usage_count > 100 AND feature_utilization_score > 0.7 THEN 'Power User'
1002
+
WHEN usage_count > 50 THEN 'Regular User'
1003
+
WHEN usage_count IS NULL THEN 'New User'
1004
+
ELSE 'Light User'
1005
+
END as user_type
1006
+
FROM product_usage
1007
+
```
1008
+
1009
+
**Note**: Partial data [restatement](../plans.md#restatement-plans) is not supported for this model kind, which means that the entire table will be recreated from scratch if restated. This may lead to data loss.
1010
+
1011
+
### Materialization strategy
1012
+
Depending on the target engine, models of the `INCREMENTAL_BY_PARTITION` kind are materialized using the following strategies: