docs/comparisons.md

| Feature | dbt | SQLMesh |
| --- | --- | --- |
| `Semantic validation` | ❌ | ✅ |
| `Transpilation` | ❌ | ✅ |
| `Unit tests` | ❌ | ✅ |
| `Data audits` | ✅ | ✅ |
| `Column level lineage` | ❌ | ✅ |
| `Accessible incremental models` | ❌ | ✅ |
| `Downstream impact planner` | ❌ | ✅ |

### Environments
Development and staging environments in dbt are costly to make and not fully representative of what will go into production.

The standard approach to creating a new environment in dbt is to rerun your entire warehouse in the new environment. This may work at small scales, but even then it wastes time and money. Here's why:

The first part of running a data transformation system is repeatedly iterating through three steps: create or modify model code, execute the models, evaluate the outputs. Practitioners may repeat these steps many times in a day's work.

### Incremental models
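In dbt, an incremental model runs in one of two modes depending on whether the target table already exists, typically branching on the `is_incremental()` macro. A minimal sketch of that pattern (the table and column names here are illustrative):

```sql
-- dbt-style incremental model (illustrative sketch)
SELECT *
FROM raw.events
{% if is_incremental() %}
  -- only on incremental runs: find the date boundary with a subquery
  WHERE ds > (SELECT MAX(ds) FROM {{ this }})
{% endif %}
```
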
Manually specifying macros to find date boundaries is repetitive and error-prone.

The example above shows how incremental models behave differently in dbt depending on whether they have been run before. As models become more complex, the cognitive burden of having two run modes, "first time full refresh" vs. "subsequent incremental", increases.

SQLMesh keeps track of which date ranges exist, producing a simplified and efficient query as follows:

```sql
-- sqlmesh incremental
SELECT *
FROM raw.events e
JOIN raw.event_dims d
  ON e.id = d.id AND d.ds BETWEEN @start_ds AND @end_ds
WHERE d.ds BETWEEN @start_ds AND @end_ds
```

#### Data leakage
dbt does not check whether the data inserted into an incremental table should be there or not. This can lead to problems and consistency issues, such as late-arriving data overriding past partitions. These problems are called "data leakage."

SQLMesh wraps all queries in a subquery with a time filter under the hood to enforce that the data inserted for a particular batch is as expected and reproducible every time.

In addition, dbt only supports the 'insert/overwrite' incremental load pattern for systems that natively support it. SQLMesh enables 'insert/overwrite' on any system because it is the most robust approach to incremental loading; 'append' pipelines risk data inaccuracy in the many scenarios where a pipeline may run more than once for a given date.

This example shows the time filtering subquery SQLMesh applies to all queries as a guard against data leakage:

```sql
-- without data leakage guard
SELECT *
FROM raw.events e
JOIN raw.event_dims d
  ON e.id = d.id AND d.ds BETWEEN @start_ds AND @end_ds
WHERE d.ds BETWEEN @start_ds AND @end_ds

-- with automated data leakage guard
SELECT *
FROM (
  SELECT *
  FROM raw.events e
  JOIN raw.event_dims d
    ON e.id = d.id AND d.ds BETWEEN @start_ds AND @end_ds
  WHERE d.ds BETWEEN @start_ds AND @end_ds
)
WHERE ds BETWEEN @start_ds AND @end_ds
```

#### Data completeness
Incremental tables can also end up with incomplete date ranges, for example:

```
Missing past data: ?, 2022-01-02, 2022-01-03
Data gap: 2022-01-01, ?, 2022-01-03
```

SQLMesh will automatically fill these data gaps on the next run.

#### Performance
Subqueries that look for MAX(date) could have a performance impact on the primary query. SQLMesh is able to avoid these extra subqueries.

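A minimal sketch of the difference, with table, column, and date values assumed for illustration:

```sql
-- dbt-style incremental filter: an extra subquery against the target table
WHERE ds > (SELECT MAX(ds) FROM analytics.events)

-- SQLMesh: tracked intervals mean @start_ds and @end_ds render to literal dates
WHERE ds BETWEEN '2022-01-02' AND '2022-01-03'
```
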
SQLMesh is able to [batch](../concepts/models/overview#batch_size) up backfills.

### SQL understanding
dbt heavily relies on [Jinja](https://jinja.palletsprojects.com/en/3.1.x/). It has no understanding of SQL and treats all queries as raw strings without context. This means that simple syntax errors like trailing commas are difficult to debug and require a full run to detect.

SQLMesh supports Jinja, but it does not rely on it; instead, it parses and understands SQL through [SQLGlot](https://github.com/tobymao/sqlglot). Simple errors can be detected at compile time, so you no longer have to wait minutes or even longer to see that you've referenced a column incorrectly or missed a comma.

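For example, a model like this (an illustrative snippet) fails at parse time in SQLMesh instead of surfacing as a warehouse error minutes into a run:

```sql
SELECT
  event_id,
  event_ds,  -- trailing comma before FROM is caught when the query is parsed
FROM raw.events
```
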
Additionally, having a first-class understanding of SQL supercharges SQLMesh with features such as transpilation, column-level lineage, and automatic change categorization.

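Transpilation, for example, means a model written against one dialect can be rendered for another engine. A sketch of the idea, with dialects and functions chosen for illustration:

```sql
-- written once, e.g. in DuckDB dialect
SELECT EPOCH_MS(1618088028295)

-- rendered for Spark by transpilation
SELECT TIMESTAMP_MILLIS(1618088028295)
```
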
### Testing
Data quality checks such as detecting NULL values and duplicated rows are extremely valuable for detecting upstream data issues and large scale problems. However, they are not meant for testing edge cases or business logic, and they are not sufficient for creating robust data pipelines.

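A typical check of that kind, sketched in plain SQL with illustrative names, flags NULL or duplicated keys but says nothing about whether the transformation logic itself is correct:

```sql
-- audit-style data quality check: any returned rows indicate a problem
SELECT id, COUNT(*) AS occurrences
FROM analytics.events
GROUP BY id
HAVING COUNT(*) > 1 OR id IS NULL
```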