feat: better worker tracking from the db #499

cevian · 2025-02-19T20:36:26Z

Records more information about worker status in the database, which gives more visibility from the db layer.

JamesGuthrie

Initial pass.

JamesGuthrie · 2025-02-20T08:23:21Z

projects/pgai/pgai/cli.py

-                        log.error("the pgai extension is not installed")
+                        err_msg = "the pgai extension is not installed"
+                        log.error(err_msg)
+                        await worker_tracking.save_vectorizer_error(None, err_msg)


worker_tracking is not yet initialized here.

JamesGuthrie · 2025-02-20T08:26:02Z

projects/pgai/pgai/cli.py

+                        db_url, poll_interval, features, __version__
+                    )
+                    await worker_tracking.start()
+                    asyncio.create_task(worker_tracking.heartbeat())


It feels like creating this task should be part of worker_tracking.start(). Is there any reason not to do that?

Also: if there are no vectorizers, this doesn't actually work (the heartbeat never runs), because the current task never yields control back to the event loop. Replacing time.sleep with an awaited asyncio.sleep below fixes it.
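To illustrate the starvation described above, here is a minimal, self-contained sketch. It is not the pgai code: `heartbeat` and `poll_loop` are hypothetical stand-ins for `worker_tracking.heartbeat()` and the polling loop in cli.py, and the intervals are shortened so it runs quickly.

```python
import asyncio
import time


async def heartbeat(beats: list, interval: float) -> None:
    # Hypothetical stand-in for worker_tracking.heartbeat():
    # record a timestamp, then yield to the event loop until the next beat.
    while True:
        beats.append(time.monotonic())
        await asyncio.sleep(interval)


async def poll_loop(blocking: bool, rounds: int = 3, poll_interval: float = 0.05) -> int:
    # Hypothetical stand-in for the polling loop; returns how many beats ran.
    beats: list = []
    task = asyncio.create_task(heartbeat(beats, poll_interval / 5))
    for _ in range(rounds):
        if blocking:
            # Blocks the whole event loop, so the heartbeat task is never scheduled.
            time.sleep(poll_interval)
        else:
            # Suspends only this task, so the heartbeat task runs in the meantime.
            await asyncio.sleep(poll_interval)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return len(beats)


if __name__ == "__main__":
    print("beats with time.sleep:   ", asyncio.run(poll_loop(blocking=True)))
    print("beats with asyncio.sleep:", asyncio.run(poll_loop(blocking=False)))
```

With `blocking=True` the loop never awaits, so the heartbeat coroutine does not get a chance to start before it is cancelled.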

JamesGuthrie · 2025-02-20T08:34:10Z

projects/pgai/pgai/vectorizer/worker_tracking/worker_tracking.py

+            num_errors = self.num_errors_since_last_heartbeat
+            self.num_errors_since_last_heartbeat = 0
+            error_message = self.error_message
+            self.error_message = None
+            num_successes = self.num_successes_since_last_heartbeat
+            self.num_successes_since_last_heartbeat = 0
+            await cur.execute(
+                "select ai._worker_heartbeat(%s, %s, %s, %s)",
+                (self.worker_id, num_successes, num_errors, error_message),
+            )


I suspect that you have a race condition here.

You spawn the async task for heartbeat(), which periodically does a heartbeat (calling _heartbeat). Simultaneously you have the "parent" async task running, which calls force_heartbeat in a number of locations.
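If the suspected race is two `_heartbeat` calls overlapping at the `await` (one from the periodic task, one from `force_heartbeat`), one possible way to serialize them is an `asyncio.Lock`. This is only a sketch under that assumption: `TrackerSketch` is a hypothetical stand-in for `WorkerTracking`, the `sent` list replaces the call to `ai._worker_heartbeat`, and the sleep simulates the database round trip.

```python
import asyncio


class TrackerSketch:
    """Hypothetical stand-in for WorkerTracking; not the class from this PR."""

    def __init__(self) -> None:
        self.num_successes_since_last_heartbeat = 0
        self.sent: list = []  # stands in for values passed to ai._worker_heartbeat
        self._lock = asyncio.Lock()

    async def _heartbeat(self) -> None:
        # Only one heartbeat may be in flight at a time, so the periodic task
        # and force_heartbeat() never overlap between snapshot and write.
        async with self._lock:
            num_successes = self.num_successes_since_last_heartbeat
            self.num_successes_since_last_heartbeat = 0
            await asyncio.sleep(0.01)  # simulated database round trip
            self.sent.append(num_successes)

    async def force_heartbeat(self) -> None:
        await self._heartbeat()

    async def heartbeat(self, interval: float) -> None:
        while True:
            await self._heartbeat()
            await asyncio.sleep(interval)


async def demo() -> list:
    tracker = TrackerSketch()
    periodic = asyncio.create_task(tracker.heartbeat(0.005))
    for _ in range(5):
        tracker.num_successes_since_last_heartbeat += 1
        await tracker.force_heartbeat()
    periodic.cancel()
    try:
        await periodic
    except asyncio.CancelledError:
        pass
    return tracker.sent


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Every success recorded in `demo()` is reported by exactly one heartbeat, whichever task happens to send it.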

JamesGuthrie · 2025-02-20T08:34:44Z

projects/pgai/tests/vectorizer/cli/test_openai_vectorizer.py

@@ -89,18 +89,42 @@ def test_process_vectorizer(
        f"items={num_items}-batch_size={batch_size}-"
        f"custom_base_url={openai_proxy_url is not None}.yaml"
    )
-    logging.getLogger("vcr").setLevel(logging.DEBUG)
+    # logging.getLogger("vcr").setLevel(logging.DEBUG)


stray comment

JamesGuthrie · 2025-02-20T08:34:50Z

projects/pgai/tests/vectorizer/cli/test_openai_vectorizer.py


    with vcr_.use_cassette(cassette):
        result = run_vectorizer_worker(cli_db_url, vectorizer_id, concurrency)

    assert not result.exception
    assert result.exit_code == 0
+    print(f"result: {result.stdout}")


intentional?

JamesGuthrie · 2025-02-20T10:55:32Z

projects/pgai/pgai/cli.py

                sys.exit(1)

        if once:
+            await worker_tracking.force_heartbeat()
            return
        log.info(f"sleeping for {poll_interval_str} before polling for new work")
        time.sleep(poll_interval)


Suggested change

-        time.sleep(poll_interval)
+        await asyncio.sleep(poll_interval)

JamesGuthrie · 2025-02-20T10:57:04Z

projects/pgai/pgai/cli.py

+                    worker_tracking = WorkerTracking(
+                        db_url, poll_interval, features, __version__
+                    )
+                    await worker_tracking.start()


This is probably not a big issue, but a failure to create or start the worker_tracking will result in the worker_tracking being silently broken.
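One way to make such a failure visible rather than silent, sketched under assumptions: `start_tracking` and `FailingTracker` are hypothetical names, and whether the worker should continue without tracking or exit is a design decision this sketch does not settle.

```python
import asyncio
import logging
from typing import Any, Optional

log = logging.getLogger("worker_tracking_sketch")


class FailingTracker:
    """Hypothetical tracker whose start() always fails, for illustration."""

    async def start(self) -> None:
        raise ConnectionError("could not register worker")


async def start_tracking(tracker: Any) -> Optional[Any]:
    # Return the tracker only if it started. On failure, log the traceback and
    # return None so callers have to handle "tracking disabled" explicitly.
    try:
        await tracker.start()
    except Exception:
        log.exception("worker tracking failed to start; continuing without it")
        return None
    return tracker


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(asyncio.run(start_tracking(FailingTracker())))
```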

jgpruitt · 2025-02-20T14:27:47Z