
WIP: feat: add support for generating embeddings for external documents #442

Draft: wants to merge 19 commits into base: main from s3-integration-feature-branch.
Conversation

@smoya (Contributor) commented Feb 6, 2025

Integration branch. Work is in progress.

---------

Co-authored-by: Adol Rodriguez <adolsalamanca@gmail.com>
@smoya force-pushed the s3-integration-feature-branch branch from 554c4dc to b5cf803 on February 10, 2025 at 10:18
* feat: tweak chunking func to support doc chunking in the future (#407)


---------

Co-authored-by: Sergio Moya <1083296+smoya@users.noreply.github.com>
Co-authored-by: Adol Rodriguez <adolsalamanca@gmail.com>
smoya and others added 3 commits February 10, 2025 16:00
…457)

continue the work to support s3 document loading

Co-authored-by: Sergio Moya <1083296+smoya@users.noreply.github.com>
Co-authored-by: Adol Rodriguez <adolsalamanca@gmail.com>
another iteration on implementation of documents processing.
@adolsalamanca force-pushed the s3-integration-feature-branch branch 2 times, most recently from b783aac to d7fdcaa on February 11, 2025 at 13:40
@smoya force-pushed the s3-integration-feature-branch branch 2 times, most recently from 330e86c to b28f186 on February 18, 2025 at 10:37
@smoya force-pushed the s3-integration-feature-branch branch from c96b0e8 to b2f01b4 on February 18, 2025 at 11:08
@jgpruitt (Collaborator) left a comment:

First pass. Only looked at the extension.

Comment on lines +408 to +411
( $sql$create table %I.%I(%s
, queued_at pg_catalog.timestamptz not null default now()
, retries pg_catalog.int4 default 0
, retry_after pg_catalog.timestamptz)$sql$

Suggested change
( $sql$create table %I.%I(%s
, queued_at pg_catalog.timestamptz not null default now()
, retries pg_catalog.int4 default 0
, retry_after pg_catalog.timestamptz)$sql$
( $sql$
create table %I.%I
( %s
, queued_at pg_catalog.timestamptz not null default now()
, retries pg_catalog.int4 default 0
, retry_after pg_catalog.timestamptz
)
$sql$

Incredibly nitpicky, sorry.


-------------------------------------------------------------------------------
-- loading_document
create or replace function ai.loading_document

IMO, loading_uri would be a more appropriate name. It describes "where" we are loading from.

@@ -0,0 +1,83 @@
-------------------------------------------------------------------------------
-- loading_row
create or replace function ai.loading_row

IMO, loading_column would be a more appropriate name.


It would also feel more ergonomic, since you're passing a column name to this function:

select ai.loading_column('my_column');

Comment on lines +142 to +147
perform ai._validate_parsing(jsonb_build_object(
'parsing', parsing,
'loading', loading,
'source_schema', _source_schema,
'source_table', _source_table
));

I would pass these elements as individual arguments rather than constructing a jsonb object.

-------------------------------------------------------------------------------
-- _validate_parsing
create or replace function ai._validate_parsing
( config pg_catalog.jsonb -- has to contain both loading and parsing config

I would create distinct parameters for each element you need. This makes it more clear what the function expects and can better enforce that everything needed is provided and in the right types.
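To illustrate the point, here is a minimal sketch in Python rather than plpgsql (the function names `validate_parsing_dict` and `validate_parsing_args` are hypothetical, not part of the extension): a single config blob forces hand-rolled key checks at runtime, while explicit parameters make the requirements visible in the signature itself.

```python
# Hypothetical sketch: a single dict/jsonb config vs. explicit parameters.

def validate_parsing_dict(config: dict) -> None:
    # The jsonb-style approach: every required key must be checked by hand,
    # and a typo in a key name is only caught here at runtime.
    for key in ("loading", "parsing", "source_schema", "source_table"):
        if key not in config:
            raise ValueError(f"missing required config key: {key}")


def validate_parsing_args(
    loading: dict, parsing: dict, source_schema: str, source_table: str
) -> None:
    # The explicit approach: the signature documents and enforces what is
    # required; a missing argument fails immediately at the call site.
    if not source_schema or not source_table:
        raise ValueError("source_schema and source_table must be non-empty")


# The dict version only fails once we inspect the blob:
try:
    validate_parsing_dict({"loading": {}, "parsing": {}})
except ValueError as e:
    print(e)  # missing required config key: source_schema
```

The same trade-off applies to the SQL function: distinct typed parameters let Postgres enforce presence and types, instead of deferring everything to jsonb inspection inside the function body.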

( ai.loading_document(), 'public', 'thing' )
""",
"function ai.loading_document() does not exist",
),

To ensure that our type check works, add a "bad" test for loading_row and loading_document on a column of an unsupported type, like the weight column.

Also, add "bad" tests for a column that doesn't exist.
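Such cases could follow the same `(sql, expected_error)` tuple shape the test file already uses for its `bad` list. A hedged sketch (the exact SQL and error strings below are hypothetical and would depend on what the extension actually raises):

```python
# Hypothetical additions to the existing `bad = [...]` list of
# (sql, expected_error) pairs; the error texts are placeholders.
bad_extra = [
    (
        # column of an unsupported type, e.g. the numeric weight column
        "select ai.loading_row('weight')",
        "column weight has an unsupported type",
    ),
    (
        # column that does not exist on the source table
        "select ai.loading_row('no_such_column')",
        "column no_such_column does not exist",
    ),
]

# Each case pairs the statement under test with the expected failure.
assert all(len(case) == 2 for case in bad_extra)
```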

""",
]
bad = [
(

add a case for a non-existent column

Comment on lines -263 to -293
# bob should have select on the source table
cur.execute("select has_table_privilege('bob', 'website.blog', 'select')")
actual = cur.fetchone()[0]
assert actual

# bob should have select, update, delete on the queue table
cur.execute(
f"select has_table_privilege('bob', '{vec.queue_schema}.{vec.queue_table}', 'select, update, delete')"
)
actual = cur.fetchone()[0]
assert actual

# bob should have select, insert, update on the target table
cur.execute(
f"select has_table_privilege('bob', '{vec.target_schema}.{vec.target_table}', 'select, insert, update')"
)
actual = cur.fetchone()[0]
assert actual

# bob should have select on the view
cur.execute(
f"select has_table_privilege('bob', '{vec.view_schema}.{vec.view_name}', 'select')"
)
actual = cur.fetchone()[0]
assert actual

# bob should have select on the vectorizer table
cur.execute("select has_table_privilege('bob', 'ai.vectorizer', 'select')")
actual = cur.fetchone()[0]
assert actual


why are we nuking these?

@@ -193,6 +203,7 @@ def test_vectorizer_timescaledb():
db_url("test"), autocommit=True, row_factory=namedtuple_row
) as con:
with con.cursor() as cur:
cur.execute("set statement_timeout = '5s'")

why is this added?

# Insert a sample PDF as bytea
cur.execute("""
insert into vec.doc_bytea(content)
values (decode('255044462D312E340A25', 'hex')) -- Start of PDF file magic bytes

neat!
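For what it's worth, the fixture above can be sanity-checked outside the database: Postgres' `decode(..., 'hex')` corresponds to Python's `bytes.fromhex`, and the hex literal decodes to the opening bytes of a PDF header.

```python
# Verify that the hex fixture used in the test really is a PDF header.
# Postgres' decode('255044462D312E340A25', 'hex') == bytes.fromhex(...) here.
magic = bytes.fromhex("255044462D312E340A25")
print(magic)  # b'%PDF-1.4\n%'
assert magic.startswith(b"%PDF-")
```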

4 participants