
WIP: feat: add support for generating embeddings for external documents #442

Draft: wants to merge 19 commits into base: main from s3-integration-feature-branch.
Conversation

@smoya (Contributor) commented Feb 6, 2025

Integration branch. Work is in progress.

---------

Co-authored-by: Adol Rodriguez <adolsalamanca@gmail.com>
@smoya force-pushed the s3-integration-feature-branch branch from 554c4dc to b5cf803 on February 10, 2025 at 10:18
* feat: tweak chunking func to support doc chunking in the future (#407)


---------

Co-authored-by: Sergio Moya <1083296+smoya@users.noreply.github.com>
Co-authored-by: Adol Rodriguez <adolsalamanca@gmail.com>
smoya and others added 3 commits February 10, 2025 16:00
…457)

continue the work to support s3 document loading

Co-authored-by: Sergio Moya <1083296+smoya@users.noreply.github.com>
Co-authored-by: Adol Rodriguez <adolsalamanca@gmail.com>
another iteration on implementation of documents processing.
@adolsalamanca force-pushed the s3-integration-feature-branch branch 2 times, most recently from b783aac to d7fdcaa on February 11, 2025 at 13:40
@smoya force-pushed the s3-integration-feature-branch branch 2 times, most recently from 330e86c to b28f186 on February 18, 2025 at 10:37
@smoya force-pushed the s3-integration-feature-branch branch from c96b0e8 to b2f01b4 on February 18, 2025 at 11:08
@jgpruitt (Collaborator) left a comment:

First pass. Only looked at the extension.

Comment on lines +408 to +411
( $sql$create table %I.%I(%s
, queued_at pg_catalog.timestamptz not null default now()
, retries pg_catalog.int4 default 0
, retry_after pg_catalog.timestamptz)$sql$

Suggested change
( $sql$create table %I.%I(%s
, queued_at pg_catalog.timestamptz not null default now()
, retries pg_catalog.int4 default 0
, retry_after pg_catalog.timestamptz)$sql$
( $sql$
create table %I.%I
( %s
, queued_at pg_catalog.timestamptz not null default now()
, retries pg_catalog.int4 default 0
, retry_after pg_catalog.timestamptz
)
$sql$

Incredibly nitpicky, sorry.


-------------------------------------------------------------------------------
-- loading_document
create or replace function ai.loading_document

IMO, loading_uri would be a more appropriate name. It describes "where" we are loading from.

@@ -0,0 +1,83 @@
-------------------------------------------------------------------------------
-- loading_row
create or replace function ai.loading_row

IMO, loading_column would be a more appropriate name.


It would also feel more ergonomic, since you're passing a column name to this function:

select ai.loading_column('my_column');

Comment on lines +142 to +147
perform ai._validate_parsing(jsonb_build_object(
'parsing', parsing,
'loading', loading,
'source_schema', _source_schema,
'source_table', _source_table
));

I would pass these elements as individual arguments rather than constructing a jsonb object.

-------------------------------------------------------------------------------
-- _validate_parsing
create or replace function ai._validate_parsing
( config pg_catalog.jsonb -- has to contain both loading and parsing config

I would create distinct parameters for each element you need. This makes it more clear what the function expects and can better enforce that everything needed is provided and in the right types.
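To illustrate the point, here is a minimal sketch in Python rather than plpgsql (the function names `validate_parsing_dict` and `validate_parsing_args` are hypothetical, not part of the extension): a single config blob forces hand-rolled key checks at runtime, while explicit parameters make the requirements visible in the signature itself.

```python
# Hypothetical sketch: a single dict/jsonb config vs. explicit parameters.

def validate_parsing_dict(config: dict) -> None:
    # The jsonb-style approach: every required key must be checked by hand,
    # and a typo in a key name is only caught here at runtime.
    for key in ("loading", "parsing", "source_schema", "source_table"):
        if key not in config:
            raise ValueError(f"missing required config key: {key}")


def validate_parsing_args(
    loading: dict, parsing: dict, source_schema: str, source_table: str
) -> None:
    # The explicit approach: the signature documents and enforces what is
    # required; a missing argument fails immediately at the call site.
    if not source_schema or not source_table:
        raise ValueError("source_schema and source_table must be non-empty")


# The dict version only fails once we inspect the blob:
try:
    validate_parsing_dict({"loading": {}, "parsing": {}})
except ValueError as e:
    print(e)  # missing required config key: source_schema
```

The same trade-off applies to the SQL function: distinct typed parameters let Postgres enforce presence and types, instead of deferring everything to jsonb inspection inside the function body.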

( ai.loading_document(), 'public', 'thing' )
""",
"function ai.loading_document() does not exist",
),

To ensure that our type check works, add a "bad" test for loading_row and loading_document on a column of an unsupported type, like the weight column.

Also, add "bad" tests for a column that doesn't exist.
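Such cases could follow the same `(sql, expected_error)` tuple shape the test file already uses for its `bad` list. A hedged sketch (the exact SQL and error strings below are hypothetical and would depend on what the extension actually raises):

```python
# Hypothetical additions to the existing `bad = [...]` list of
# (sql, expected_error) pairs; the error texts are placeholders.
bad_extra = [
    (
        # column of an unsupported type, e.g. the numeric weight column
        "select ai.loading_row('weight')",
        "column weight has an unsupported type",
    ),
    (
        # column that does not exist on the source table
        "select ai.loading_row('no_such_column')",
        "column no_such_column does not exist",
    ),
]

# Each case pairs the statement under test with the expected failure.
assert all(len(case) == 2 for case in bad_extra)
```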

""",
]
bad = [
(

add a case for a non-existent column

Comment on lines -263 to -293
# bob should have select on the source table
cur.execute("select has_table_privilege('bob', 'website.blog', 'select')")
actual = cur.fetchone()[0]
assert actual

# bob should have select, update, delete on the queue table
cur.execute(
f"select has_table_privilege('bob', '{vec.queue_schema}.{vec.queue_table}', 'select, update, delete')"
)
actual = cur.fetchone()[0]
assert actual

# bob should have select, insert, update on the target table
cur.execute(
f"select has_table_privilege('bob', '{vec.target_schema}.{vec.target_table}', 'select, insert, update')"
)
actual = cur.fetchone()[0]
assert actual

# bob should have select on the view
cur.execute(
f"select has_table_privilege('bob', '{vec.view_schema}.{vec.view_name}', 'select')"
)
actual = cur.fetchone()[0]
assert actual

# bob should have select on the vectorizer table
cur.execute("select has_table_privilege('bob', 'ai.vectorizer', 'select')")
actual = cur.fetchone()[0]
assert actual


why are we nuking these?

@@ -193,6 +203,7 @@ def test_vectorizer_timescaledb():
db_url("test"), autocommit=True, row_factory=namedtuple_row
) as con:
with con.cursor() as cur:
cur.execute("set statement_timeout = '5s'")

why is this added?

# Insert a sample PDF as bytea
cur.execute("""
insert into vec.doc_bytea(content)
values (decode('255044462D312E340A25', 'hex')) -- Start of PDF file magic bytes

neat!
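For what it's worth, the fixture above can be sanity-checked outside the database: Postgres' `decode(..., 'hex')` corresponds to Python's `bytes.fromhex`, and the hex literal decodes to the opening bytes of a PDF header.

```python
# Verify that the hex fixture used in the test really is a PDF header.
# Postgres' decode('255044462D312E340A25', 'hex') == bytes.fromhex(...) here.
magic = bytes.fromhex("255044462D312E340A25")
print(magic)  # b'%PDF-1.4\n%'
assert magic.startswith(b"%PDF-")
```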

4 participants