
Commit e3800f8

Add documentation for multi-schemas
1 parent 80c360e commit e3800f8

8 files changed: +55 −29 lines

Diff for: README.md

+22-3
@@ -178,6 +178,24 @@ curl -X 'POST' \
   }'
 ```
 
+##### Connecting multi-schemas
+You can connect multiple schemas through a single db connection if you want to create SQL joins between schemas.
+Currently only `BigQuery`, `Snowflake`, `Databricks` and `Postgres` support this feature.
+To use multi-schemas, instead of sending the `schema` in the `connection_uri`, set the schemas in the `schemas` param, like this:
+
+```
+curl -X 'POST' \
+  '<host>/api/v1/database-connections' \
+  -H 'accept: application/json' \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "alias": "my_db_alias",
+    "use_ssh": false,
+    "connection_uri": "snowflake://<user>:<password>@<organization>-<account-name>/<database>",
+    "schemas": ["schema_1", "schema_2", ...]
+  }'
+```
+
 ##### Connecting to supported Data warehouses and using SSH
 You can find the details on how to connect to the supported data warehouses in the [docs](https://dataherald.readthedocs.io/en/latest/api.create_database_connection.html)
 
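As an aside, the multi-schema request body above can also be built programmatically before sending it; a minimal sketch in Python (the alias and schema names are placeholders taken from the curl example, not fixed values):

```python
import json

# Multi-schema connection body, mirroring the curl example above.
# The alias and schema names are placeholders.
payload = {
    "alias": "my_db_alias",
    "use_ssh": False,
    # When using `schemas`, the schema is NOT embedded in the connection_uri.
    "connection_uri": "snowflake://<user>:<password>@<organization>-<account-name>/<database>",
    "schemas": ["schema_1", "schema_2"],
}

body = json.dumps(payload)
```

Serialising the body first makes it easy to log or validate the payload before POSTing it to `/api/v1/database-connections`.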
@@ -194,7 +212,8 @@ While only the Database scan part is required to start generating SQL, adding ve
 #### Scanning the Database
 The database scan is used to gather information about the database including table and column names and identifying low cardinality columns and their values to be stored in the context store and used in the prompts to the LLM.
 In addition, it retrieves logs, which consist of historical queries associated with each database table. These records are then stored within the query_history collection. The historical queries retrieved encompass data from the past three months and are grouped based on query and user.
-db_connection_id is the id of the database connection you want to scan, which is returned when you create a database connection.
+The db_connection_id param is the id of the database connection you want to scan; it is returned when you create a database connection.
+The ids param lists the table_description ids that you want to scan.
 You can trigger a scan of a database from the `POST /api/v1/table-descriptions/sync-schemas` endpoint. Example below
 
 
@@ -205,11 +224,11 @@ curl -X 'POST' \
   -H 'Content-Type: application/json' \
   -d '{
     "db_connection_id": "db_connection_id",
-    "table_names": ["table_name"]
+    "ids": ["<table_description_id_1>", "<table_description_id_2>", ...]
   }'
 ```
 
-Since the endpoint identifies low cardinality columns (and their values) it can take time to complete. Therefore while it is possible to trigger a scan on the entire DB by not specifying the `table_names`, we recommend against it for large databases.
+Since the endpoint identifies low cardinality columns (and their values) it can take time to complete.
 
 #### Get logs per db connection
 Once a database has been scanned you can use this endpoint to retrieve the table logs
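A small helper can make the scan request body explicit; a sketch (the helper name and the empty-list guard are illustrative, not part of the API):

```python
import json

def build_scan_request(db_connection_id, table_description_ids):
    """Build the JSON body for POST /api/v1/table-descriptions/sync-schemas."""
    if not table_description_ids:
        # Illustrative guard: the endpoint expects explicit ids to scan.
        raise ValueError("provide at least one table_description id")
    return json.dumps(
        {"db_connection_id": db_connection_id, "ids": list(table_description_ids)}
    )

body = build_scan_request("db_connection_id", ["<table_description_id_1>"])
```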

Diff for: dataherald/tests/test_api.py

-21
@@ -12,24 +12,3 @@
 def test_heartbeat():
     response = client.get("/api/v1/heartbeat")
     assert response.status_code == HTTP_200_CODE
-
-
-def test_scan_all_tables():
-    response = client.post(
-        "/api/v1/table-descriptions/sync-schemas",
-        json={"db_connection_id": "64dfa0e103f5134086f7090c"},
-    )
-    assert response.status_code == HTTP_201_CODE
-
-
-def test_scan_one_table():
-    try:
-        client.post(
-            "/api/v1/table-descriptions/sync-schemas",
-            json={
-                "db_connection_id": "64dfa0e103f5134086f7090c",
-                "table_names": ["foo"],
-            },
-        )
-    except ValueError as e:
-        assert str(e) == "No table found"

Diff for: docs/api.create_database_connection.rst

+24-1
@@ -26,6 +26,9 @@ Once the database connection is established, it retrieves the table names and cr
   "alias": "string",
   "use_ssh": false,
   "connection_uri": "string",
+  "schemas": [
+    "string"
+  ],
   "path_to_credentials_file": "string",
   "llm_api_key": "string",
   "ssh_settings": {
@@ -189,7 +192,7 @@ Connections to supported Data warehouses
 -----------------------------------------
 
 The format of the ``connection_uri`` parameter in the API call will depend on the data warehouse type you are connecting to.
-You can find samples and how to generate them :ref:<below >.
+You can find samples and how to generate them below.
 
 Postgres
 ^^^^^^^^^^^^
@@ -324,3 +327,23 @@ Example::
   "connection_uri": bigquery://v2-real-estate/K2
 
 
+**Connecting multi-schemas**
+
+You can connect multiple schemas through a single db connection if you want to create SQL joins between schemas.
+Currently only ``BigQuery``, ``Snowflake``, ``Databricks`` and ``Postgres`` support this feature.
+To use multi-schemas, instead of sending the ``schema`` in the ``connection_uri``, set the schemas in the ``schemas`` param, like this:
+
+**Example**
+
+.. code-block:: bash
+
+   curl -X 'POST' \
+     '<host>/api/v1/database-connections' \
+     -H 'accept: application/json' \
+     -H 'Content-Type: application/json' \
+     -d '{
+       "alias": "my_db_alias_identifier",
+       "use_ssh": false,
+       "connection_uri": "snowflake://<user>:<password>@<organization>-<account-name>/<database>",
+       "schemas": ["foo", "bar"]
+     }'
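For reference, the same request can be prepared from Python with only the standard library; a minimal sketch (the localhost URL stands in for `<host>`, and the request is built but not sent):

```python
import json
import urllib.request

payload = {
    "alias": "my_db_alias_identifier",
    "use_ssh": False,
    "connection_uri": "snowflake://<user>:<password>@<organization>-<account-name>/<database>",
    "schemas": ["foo", "bar"],
}

# Placeholder host; substitute your deployment's <host> before sending.
req = urllib.request.Request(
    "http://localhost/api/v1/database-connections",
    data=json.dumps(payload).encode("utf-8"),
    headers={"accept": "application/json", "Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would actually send it; omitted here.
```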

Diff for: docs/api.get_table_description.rst

+1
@@ -24,6 +24,7 @@ HTTP 200 code response
   "table_schema": "string",
   "status": "NOT_SCANNED | SYNCHRONIZING | DEPRECATED | SCANNED | FAILED"
   "error_message": "string",
+  "table_schema": "string",
   "columns": [
     {
       "name": "string",

Diff for: docs/api.list_database_connections.rst

+2
@@ -21,6 +21,7 @@ HTTP 200 code response
   "dialect": "databricks",
   "use_ssh": false,
   "connection_uri": "foooAABk91Q4wjoR2h07GR7_72BdQnxi8Rm6i_EjyS-mzz_o2c3RAWaEqnlUvkK5eGD5kUfE5xheyivl1Wfbk_EM7CgV4SvdLmOOt7FJV-3kG4zAbar=",
+  "schemas": null,
   "path_to_credentials_file": null,
   "llm_api_key": null,
   "ssh_settings": null
@@ -31,6 +32,7 @@ HTTP 200 code response
   "dialect": "postgres",
   "use_ssh": true,
   "connection_uri": null,
+  "schemas": null,
   "path_to_credentials_file": "bar-LWxPdFcjQw9lU7CeK_2ELR3jGBq0G_uQ7E2rfPLk2RcFR4aDO9e2HmeAQtVpdvtrsQ_0zjsy9q7asdsadXExYJ0g==",
   "llm_api_key": "gAAAAABlCz5TeU0ym4hW3bf9u21dz7B9tlnttOGLRDt8gq2ykkblNvpp70ZjT9FeFcoyMv-Csvp3GNQfw66eYvQBrcBEPsLokkLO2Jc2DD-Q8Aw6g_8UahdOTxJdT4izA6MsiQrf7GGmYBGZqbqsjTdNmcq661wF9Q==",
   "ssh_settings": {

Diff for: docs/api.list_table_description.rst

+1
@@ -33,6 +33,7 @@ HTTP 200 code response
   "table_schema": "string",
   "status": "NOT_SCANNED | SYNCHRONIZING | DEPRECATED | SCANNED | FAILED"
   "error_message": "string",
+  "table_schema": "string",
   "columns": [
     {
       "name": "string",

Diff for: docs/api.refresh_table_description.rst

+1
@@ -34,6 +34,7 @@ HTTP 201 code response
   "table_schema": "string",
   "status": "NOT_SCANNED | SYNCHRONIZING | DEPRECATED | SCANNED | FAILED"
   "error_message": "string",
+  "table_schema": "string",
   "columns": [
     {
       "name": "string",

Diff for: docs/api.scan_table_description.rst

+4-4
@@ -9,7 +9,7 @@ which consist of historical queries associated with each database table. These r
 query_history collection. The historical queries retrieved encompass data from the past three months and are grouped
 based on query and user.
 
-It can scan all db tables or if you specify a `table_names` then It will only scan those tables.
+The ``ids`` param sets the table description ids that you want to scan.
 
 The process is carried out through Background Tasks, ensuring that even if it operates slowly, taking several minutes, the HTTP response remains swift.
 
@@ -23,7 +23,7 @@ Request this ``POST`` endpoint::
 
 {
   "db_connection_id": "string",
-  "table_names": ["string"] # Optional
+  "ids": ["string"]
 }
 
 **Responses**
@@ -36,7 +36,6 @@ HTTP 201 code response
 
 **Request example**
 
-To scan all the tables in a db don't specify a `table_names`
 
 .. code-block:: rst
 
@@ -45,5 +44,6 @@ To scan all the tables in a db don't specify a `table_names`
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
-     "db_connection_id": "db_connection_id"
+     "db_connection_id": "db_connection_id",
+     "ids": ["14e52c5f7d6dc4bc510d6d27", "15e52c5f7d6dc4bc510d6d34"]
    }'
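The example ids above look like 24-character hex ObjectIds; if that holds for your deployment (an assumption based only on the examples, not something the API documents here), a quick client-side sanity check is possible before triggering a scan:

```python
import re

# Assumption: ids follow the MongoDB ObjectId shape seen in the example above.
OBJECT_ID_RE = re.compile(r"^[0-9a-f]{24}$")

ids = ["14e52c5f7d6dc4bc510d6d27", "15e52c5f7d6dc4bc510d6d34"]
valid = all(OBJECT_ID_RE.match(i) for i in ids)
```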
