@@ -42,6 +42,7 @@ The following are DAGs grouped by their primary tag:

| DAG ID | Schedule Interval |
| --------------------------------------------------------------------------------- | ----------------- |
+ | [`batched_update`](#batched_update) | `None` |
| [`recreate_audio_popularity_calculation`](#recreate_audio_popularity_calculation) | `None` |
| [`recreate_image_popularity_calculation`](#recreate_image_popularity_calculation) | `None` |
| [`report_pending_reported_media`](#report_pending_reported_media) | `@weekly` |
@@ -112,6 +113,7 @@ The following is documentation associated with each DAG (where available):

1. [`add_license_url`](#add_license_url)
1. [`airflow_log_cleanup`](#airflow_log_cleanup)
1. [`audio_data_refresh`](#audio_data_refresh)
+ 1. [`batched_update`](#batched_update)
1. [`check_silenced_dags`](#check_silenced_dags)
1. [`create_filtered_audio_index`](#create_filtered_audio_index)
1. [`create_filtered_image_index`](#create_filtered_image_index)
@@ -219,6 +221,80 @@ and related PRs:

- [[Feature] Data refresh orchestration DAG](https://github.com/WordPress/openverse-catalog/issues/353)
- [[Feature] Merge popularity calculations and data refresh into a single DAG](https://github.com/WordPress/openverse-catalog/issues/453)
+ ## `batched_update`
+
+ Batched Update DAG
+
+ This DAG is used to run a batched SQL update on a media table in the Catalog
+ database. It is automatically triggered by the `popularity_refresh` DAGs to
+ refresh popularity data using newly calculated constants, but can also be
+ triggered manually with custom SQL operations.
+
+ The DAG must be run with a valid dag_run configuration specifying the SQL
+ commands to be run. The DAG will then split the rows to be updated into
+ batches, and report to Slack when all batches have been updated. It handles all
+ deadlocking and timeout concerns, ensuring that the provided SQL is run without
+ interfering with ingestion. For more information, see the implementation plan:
+ https://docs.openverse.org/projects/proposals/popularity_optimizations/20230420-implementation_plan_popularity_optimizations.html#special-considerations-avoiding-deadlocks-and-timeouts
+
+ By default, the DAG runs as a dry run, logging the generated SQL without
+ actually executing it. To actually perform the update, the `dry_run` parameter
+ must be explicitly set to `false` in the configuration.
+
+ Required Dagrun Configuration parameters:
+
+ - query_id: a string identifier which will be appended to the temporary table
+   used in the update
+ - table_name: the name of the table to update. Must be a valid media table.
+ - select_query: a SQL `WHERE` clause used to select the rows that will be
+   updated
+ - update_query: the SQL `UPDATE` expression to be run on all selected rows
+
+ Optional params:
+
+ - dry_run: bool, whether to run in dry-run mode, logging the generated SQL
+   without executing it. True by default.
+ - batch_size: int, the number of records to process in each batch. By default,
+   10_000.
+ - update_timeout: int, the number of seconds to run an individual batch update
+   before timing out. By default, 3600 (one hour).
+ - batch_start: int, the index into the temp table at which to start the update.
+   By default, this is 0 and all rows in the temp table are updated.
+ - resume_update: boolean indicating whether to attempt to resume an update
+   using an existing temp table matching the `query_id`. When True, a new temp
+   table is not created.
+
+ An example dag_run configuration used to set the thumbnails of all Flickr
+ images to null would look like this:
+
+ ```
+ {
+   "query_id": "my_flickr_query",
+   "table_name": "image",
+   "select_query": "WHERE provider='flickr'",
+   "update_query": "SET thumbnail=null",
+   "batch_size": 10,
+   "dry_run": false
+ }
+ ```
+
+ It is possible to resume an update from an arbitrary starting point on an
+ existing temp table, for example if a DAG succeeds in creating the temp table
+ but fails midway through the update. To do so, set the `resume_update` param to
+ True and select your desired `batch_start`. For instance, if the example DAG
+ given above failed after processing the first 50_000 records, you might run:
+
+ ```
+ {
+   "query_id": "my_flickr_query",
+   "table_name": "image",
+   "select_query": "WHERE provider='flickr'",
+   "update_query": "SET thumbnail=null",
+   "batch_size": 10,
+   "batch_start": 50000,
+   "resume_update": true,
+   "dry_run": false
+ }
+ ```
+
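The batching strategy described above (select the matching rows into a temp table, then update them in fixed-size, resumable batches) can be sketched roughly as follows. This is an illustrative sketch only, not the DAG's actual implementation: the function name, the temp table naming, the `ROW_NUMBER()` setup, and the `identifier` column are all assumptions made for the example.

```python
def make_batched_queries(query_id, table_name, select_query, update_query,
                         total_rows, batch_size=10_000, batch_start=0):
    """Generate the setup query and per-batch UPDATE statements (hypothetical)."""
    temp_table = f"rows_to_update_{query_id}"
    # One-time setup: materialize the rows matched by select_query into a
    # temp table with sequential row IDs, so batches can address stable ranges.
    setup = (
        f"CREATE TABLE {temp_table} AS "
        f"SELECT ROW_NUMBER() OVER () AS row_id, identifier "
        f"FROM {table_name} {select_query};"
    )
    # Each batch updates only the rows whose row_id falls in its range,
    # keeping individual transactions short to limit deadlocks and timeouts.
    # batch_start lets a failed run resume partway through the temp table.
    batches = []
    for start in range(batch_start, total_rows, batch_size):
        end = min(start + batch_size, total_rows)
        batches.append(
            f"UPDATE {table_name} {update_query} "
            f"WHERE identifier IN ("
            f"SELECT identifier FROM {temp_table} "
            f"WHERE row_id > {start} AND row_id <= {end});"
        )
    return setup, batches
```

Under this sketch, resuming with `batch_start=50_000` simply starts the range loop at row 50_000 of the existing temp table instead of rebuilding it.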
## `check_silenced_dags`

### Silenced DAGs check