-
Notifications
You must be signed in to change notification settings - Fork 210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: feat: add support for generating embeddings for external documents #442
base: main
Are you sure you want to change the base?
Conversation
--------- Co-authored-by: Adol Rodriguez <adolsalamanca@gmail.com>
554c4dc
to
b5cf803
Compare
* feat: tweak chunking func to support doc chunking in the future (#407) --------- Co-authored-by: Sergio Moya <1083296+smoya@users.noreply.github.com> Co-authored-by: Adol Rodriguez <adolsalamanca@gmail.com>
…457) continue the work to support s3 document loading Co-authored-by: Sergio Moya <1083296+smoya@users.noreply.github.com> Co-authored-by: Adol Rodriguez <adolsalamanca@gmail.com>
another iteration on implementation of documents processing.
b783aac
to
d7fdcaa
Compare
we're adding a couple of new columns to the vectorizer queue tables so we can track whether the vectorizers are being properly generated or require a retry.
330e86c
to
b28f186
Compare
c96b0e8
to
b2f01b4
Compare
…ment (#480) --------- Co-authored-by: Sergio Moya <1083296+smoya@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass. Only looked at the extension.
( $sql$create table %I.%I(%s | ||
, queued_at pg_catalog.timestamptz not null default now() | ||
, retries pg_catalog.int4 default 0 | ||
, retry_after pg_catalog.timestamptz)$sql$ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
( $sql$create table %I.%I(%s | |
, queued_at pg_catalog.timestamptz not null default now() | |
, retries pg_catalog.int4 default 0 | |
, retry_after pg_catalog.timestamptz)$sql$ | |
( $sql$ | |
create table %I.%I | |
( %s | |
, queued_at pg_catalog.timestamptz not null default now() | |
, retries pg_catalog.int4 default 0 | |
, retry_after pg_catalog.timestamptz | |
) | |
$sql$ |
incredibly nit picky. sorry.
|
||
------------------------------------------------------------------------------- | ||
-- loading_document | ||
create or replace function ai.loading_document |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, loading_uri
would be a more appropriate name. It describes "where" we are loading from.
@@ -0,0 +1,83 @@ | |||
------------------------------------------------------------------------------- | |||
-- loading_row | |||
create or replace function ai.loading_row |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, loading_column
would be a more appropriate name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would also "feel more ergonomic". You're passing a column name to this function
select ai.loading_column('my_column');
perform ai._validate_parsing(jsonb_build_object( | ||
'parsing', parsing, | ||
'loading', loading, | ||
'source_schema', _source_schema, | ||
'source_table', _source_table | ||
)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would pass these elements as individual arguments rather than constructing a jsonb object.
------------------------------------------------------------------------------- | ||
-- _validate_parsing | ||
create or replace function ai._validate_parsing | ||
( config pg_catalog.jsonb -- has to contain both loading and parsing config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would create distinct parameters for each element you need. This makes it more clear what the function expects and can better enforce that everything needed is provided and in the right types.
( ai.loading_document(), 'public', 'thing' ) | ||
""", | ||
"function ai.loading_document() does not exist", | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To ensure that our type check works, add a "bad" test for loading_row and loading_document on a column of an unsupported type, like the weight
column.
Also, add "bad" tests for a column that doesn't exist.
""", | ||
] | ||
bad = [ | ||
( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a case for a non-existent column
# bob should have select on the source table | ||
cur.execute("select has_table_privilege('bob', 'website.blog', 'select')") | ||
actual = cur.fetchone()[0] | ||
assert actual | ||
|
||
# bob should have select, update, delete on the queue table | ||
cur.execute( | ||
f"select has_table_privilege('bob', '{vec.queue_schema}.{vec.queue_table}', 'select, update, delete')" | ||
) | ||
actual = cur.fetchone()[0] | ||
assert actual | ||
|
||
# bob should have select, insert, update on the target table | ||
cur.execute( | ||
f"select has_table_privilege('bob', '{vec.target_schema}.{vec.target_table}', 'select, insert, update')" | ||
) | ||
actual = cur.fetchone()[0] | ||
assert actual | ||
|
||
# bob should have select on the view | ||
cur.execute( | ||
f"select has_table_privilege('bob', '{vec.view_schema}.{vec.view_name}', 'select')" | ||
) | ||
actual = cur.fetchone()[0] | ||
assert actual | ||
|
||
# bob should have select on the vectorizer table | ||
cur.execute("select has_table_privilege('bob', 'ai.vectorizer', 'select')") | ||
actual = cur.fetchone()[0] | ||
assert actual | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we nuking these?
@@ -193,6 +203,7 @@ def test_vectorizer_timescaledb(): | |||
db_url("test"), autocommit=True, row_factory=namedtuple_row | |||
) as con: | |||
with con.cursor() as cur: | |||
cur.execute("set statement_timeout = '5s'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this added?
# Insert a sample PDF as bytea | ||
cur.execute(""" | ||
insert into vec.doc_bytea(content) | ||
values (decode('255044462D312E340A25', 'hex')) -- Start of PDF file magic bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neat!
Integration branch. Work is in progress.