
fix(go/adbc/driver/snowflake): use one session for connection #2494

Merged
zeroshade merged 4 commits into apache:main from snowflake-sessions on Feb 18, 2025

Conversation

@zeroshade (Member) commented Jan 28, 2025

Instead of creating a separate connection for simple queries like SELECT COUNT and the metadata queries, we'll keep using the single connection to maintain the existing session throughout. This should avoid the problem mentioned in #2128 where a new session cancels the remaining COPY INTO queries. It also simplifies some of the work on metadata queries so that we don't need to explicitly propagate catalog, schema and database names between connections.

Fixes #2517
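In database/sql terms, the idea is roughly the sketch below (illustrative only; the DSN, table name, and function are placeholders, not the driver's actual code). A query issued through the pool can land on a brand-new connection, and therefore a new Snowflake session, while a query issued through a pinned *sql.Conn stays on the session the rest of the work is using.

package example

import (
	"context"
	"database/sql"

	_ "github.com/snowflakedb/gosnowflake" // registers the "snowflake" driver with database/sql
)

// Illustrative only: dsn and the query are placeholders.
func countRows(ctx context.Context, dsn string) (int64, error) {
	db, err := sql.Open("snowflake", dsn)
	if err != nil {
		return 0, err
	}
	defer db.Close()

	// Before: issuing the query through the pool (db.QueryRowContext) may run it
	// on a brand-new connection, i.e. a new Snowflake session with its own
	// current database/schema, and the new session can interfere with queries
	// still running on the original session.
	//
	// After: pin one connection and run everything on it, so all queries share
	// a single session and its catalog/schema state.
	conn, err := db.Conn(ctx)
	if err != nil {
		return 0, err
	}
	defer conn.Close()

	var n int64
	err = conn.QueryRowContext(ctx, "SELECT COUNT(*) FROM my_table").Scan(&n)
	return n, err
}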

github-actions bot modified the milestone: ADBC Libraries 17 (Jan 28, 2025)
@zeroshade (Member Author)

@davlee1972 @Zan-L @Niivii Could all three of you try testing using this PR and confirm that it eliminates the issue you were having with adbc_ingest?

@zeroshade (Member Author)

@davlee1972 you would probably need to pull down this PR and build it yourself to test it. If you need, I can build it and upload the shared lib and wheel.

@Niivii commented Jan 29, 2025

@davlee1972 you would probably need to pull down this PR and build it yourself to test it. If you need, I can build it and upload the shared lib and wheel.

This would really help me at least. I do not have enough knowledge in Go to build everything from scratch 🥲

I pulled the nightly 1.5 easily, but I saw it was 8 days old, so I guess it won't have this change!

@zeroshade (Member Author)

@Niivii here's a wheel I built using my branch. This should be installable and testable (I hope)

pr-wheel.zip

@davlee1972 commented Jan 29, 2025

I installed the 1.5.0dev wheel and COPY INTO operations are still getting truncated.

Three COPY INTO operations had been running for 30 seconds before they were killed by the SELECT COUNT(*) call.

Error messages:

time="2025-01-29T18:19:06-05:00" level=error msg="failed to get response. err: context canceled" func=gosnowflake.postRestfulQueryHelper file="restful.go:290"

time="2025-01-29T18:19:06-05:00" level=error msg="error: context canceled" func="gosnowflake.(*snowflakeConn).queryContextInternal" file="connection.go:413"

time="2025-01-29T18:19:06-05:00" level=error msg="error: context canceled" func="gosnowflake.(*snowflakeConn).queryContextInternal" file="connection.go:413"

time="2025-01-29T18:19:06-05:00" level=error msg="error: context canceled" func="gosnowflake.(*snowflakeConn).queryContextInternal" file="connection.go:413"


dbapi.py:937

return _blocking_call(self._stmt.execute_update, (), {}, self._stmt.cancel)

InternalError: INTERNAL: some files not loaded by COPY command, 96 files remain after 5 retries

@zeroshade (Member Author)

Well, "failed to get response" is at least a different error. Is there any way you can give me a reproducer with synthetic data?

I know that you're testing with internal stuff, but I haven't been able to recreate the truncated copies myself.

@Niivii commented Jan 30, 2025

Hello, this didn't solve the issue; I still get some remaining COPY commands.
We opened a ticket with Snowflake to get more details on the failed query.

This seems to happen more often with large amounts of data. A couple million rows go through without issue.
Several GB in memory triggers it consistently, from what I've experienced.

@zeroshade (Member Author)

@Niivii would you be able to share your reproducer? If you can't share the data, can you generate synthetic data that I would be able to reproduce it with?

@Niivii commented Jan 30, 2025

I will make a small Python project to simulate what is done and generate some dummy data. Not sure I will have the time before the weekend, though. Will let you know.

@zeroshade (Member Author)

@davlee1972 @Niivii Checking in again to see if either of you can provide a reproducer that I can use to debug and test this.

@Niivii commented Feb 10, 2025

Hello @zeroshade,

This should do the trick:

import adbc_driver_snowflake.dbapi
import numpy as np
import polars as pl


def main() -> None:
    big_df = generate_big_dataframe(num_rows=5000000)
    with create_snowflake_connection() as snowflake_connection:
        big_df.write_database(table_name="my_table", connection=snowflake_connection,
                              if_table_exists="append", engine="adbc")


def generate_big_dataframe(num_rows: int) -> pl.DataFrame:
    rng = np.random.default_rng(12345)
    data = {
        "column_int": rng.integers(low=-10000, high=10000, size=num_rows),
        "column_float": rng.random(size=num_rows),
        "column_string": rng.choice(["A", "b", "cdefghijkl", "âozZEzaéXZez"], size=num_rows),
    }
    return pl.DataFrame(data=data)


def create_snowflake_connection() -> adbc_driver_snowflake.dbapi.Connection:
    return adbc_driver_snowflake.dbapi.connect(db_kwargs={
        "username": "myusername",
        "adbc.snowflake.sql.account": "myaccount",
        "adbc.snowflake.sql.auth_type": "auth_ext_browser",
        "adbc.snowflake.sql.warehouse": "mywarehouse",
        "adbc.snowflake.sql.role": "myrole",
        "adbc.snowflake.sql.db": "mydb",
        "adbc.snowflake.sql.schema": "myschema",
    })


if __name__ == "__main__":
    main()

python>=3.13
polars>=1.22.0
adbc-driver-snowflake>=1.4.0
pyarrow>=19.0.0
numpy>=2.2.2

Let me know if you can reproduce.
The problem might come from the 'append' mode with existing data.

@zeroshade (Member Author)

Thanks @Niivii, with the help of that script I was able to narrow down and determine the precise cause of the issue. While this change is needed, that issue will ultimately need another fix (I'm going to start working on it now).

In the meantime, this does provide a good fix for some issues, including #2517. (I've updated the PR description so that it marks the correct issue as fixed.)

@lidavidm Any further comments on this, or do you think it's good to merge as-is while I work on the actual solution for the COPY INTO problem?

@zeroshade (Member Author)

@Niivii @davlee1972 I think I figured it out without needing a larger refactor. Here's a built version that you can try; hopefully it fixes the issue on your end (it fixed my reproducer, at least). Please report back!

adbc_driver_snowflake-1.5.0.tar.gz

@Niivii commented Feb 12, 2025

This didn't solve the issue on my side, sadly 😭
Still getting remaining COPY INTO queries (screenshot attached):
adbc_driver_manager.InternalError: INTERNAL: some files not loaded by COPY command, 1 files remain after 5 retries
Tested with roughly 10M rows (6 columns: 2 numbers, 1 TIMESTAMP_NTZ, 3 doubles).

@Niivii commented Feb 12, 2025

Disclaimer: I am not a Go dev!
I had a quick look at the stopFn handler and there is a timing issue, IMHO.
If executeCopyQuery is slow, it could trigger an early cancel() after the retries.
Could a local, independent retry context help avoid premature cancellation of the main context?
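A rough Go sketch of that suggestion (hypothetical; copyRemainingFiles, the function name, and the timeout are made up, not the driver's real API):

package example

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical sketch: the retry loop gets its own context instead of the main
// ingestion context, so giving up on (or cancelling) the retries cannot cancel
// COPY INTO queries still running under the main context.
func retryRemainingCopies(copyRemainingFiles func(context.Context) (int, error)) error {
	retryCtx, cancelRetry := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancelRetry()

	const maxRetries = 5
	for attempt := 0; attempt < maxRetries; attempt++ {
		remaining, err := copyRemainingFiles(retryCtx)
		if err != nil {
			return err
		}
		if remaining == 0 {
			return nil
		}
	}
	return fmt.Errorf("some files not loaded by COPY command after %d retries", maxRetries)
}

The trade-off in this sketch is that deriving retryCtx from context.Background() means cancelling the parent no longer stops the retries; it is only meant to illustrate isolating the retry cancellation from the main context.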

Also worth mentioning: it is systematically the first COPY INTO query that is slow.
Stats of the first failed query (screenshot attached):

@zeroshade (Member Author)

@Niivii Okay, I think I spotted where the other potential race condition was:

  1. stopFn gets called, leading to it closing the channels
  2. the goroutine that calls executeCopyQuery has already checked that the channel is still open, but hasn't called g.Go yet
  3. stopFn calls g.Wait, which returns immediately because there are no running goroutines yet, so stopFn continues on
  4. g.Go is finally called, kicking off executeCopyQuery, which slips in and runs after stopFn has already moved on

That's my theory as to how it still happened in your test. So I've updated stopFn to make sure it won't call g.Wait until after that goroutine finishes running, i.e. until after it has kicked off any calls to executeCopyQuery, so that g.Wait will definitely wait and we can't end up in between. I've attached a new build that should include this fix; hopefully it finally fixes the problem for you!
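A hypothetical sketch of that ordering, using golang.org/x/sync/errgroup (the copyScheduler type and the executeCopy parameter are invented for illustration; this is not the driver's actual code):

package example

import (
	"context"

	"golang.org/x/sync/errgroup"
)

type copyScheduler struct {
	work         chan string   // staged files waiting for a COPY INTO; placeholder type
	launcherDone chan struct{} // closed once the launcher goroutine exits
	g            *errgroup.Group
}

func newCopyScheduler(ctx context.Context, executeCopy func(context.Context, string) error) *copyScheduler {
	g, gctx := errgroup.WithContext(ctx)
	s := &copyScheduler{
		work:         make(chan string),
		launcherDone: make(chan struct{}),
		g:            g,
	}

	// Launcher goroutine: every executeCopy is started via g.Go before
	// launcherDone is closed, so stopFn can order itself after all g.Go calls.
	go func() {
		defer close(s.launcherDone)
		for f := range s.work {
			f := f
			g.Go(func() error { return executeCopy(gctx, f) })
		}
	}()
	return s
}

// stopFn stops accepting new work, waits for the launcher goroutine to finish
// issuing its g.Go calls, and only then waits on the group. Calling g.Wait
// before the launcher exits could return early while a g.Go call is still
// pending, letting a COPY INTO start after stopFn has already moved on.
func (s *copyScheduler) stopFn() error {
	close(s.work)    // no more files will be scheduled
	<-s.launcherDone // the launcher has issued every g.Go call it will make
	return s.g.Wait()
}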

If not, then I'm probably going to have to rework and refactor the entire process here, which I'm trying to avoid...

adbc_driver_snowflake-1.5.0.tar.gz

adbc_driver_snowflake_manylinux.whl.tar.gz

@lidavidm (Member)

Is there a cleaner architecture we can achieve by refactoring? At some point it feels a bit brittle 😅 (that said I'm happy to merge this for the time being if it does fix everything)

@Niivii commented Feb 13, 2025

(screenshot attached)
🎉 🎉 🎉

@davlee1972 commented Feb 13, 2025

Unfortunately this fix didn't work on my end after installing the manylinux whl file.

The first time, it aborted one active COPY INTO.
The second time, it aborted two active COPY INTOs.

Unfortunately I'm having a corporate issue between my personal GitHub and Copilot GitHub accounts, so here is a camera photo of the error.

PXL_20250213_213714782 NIGHT

The error I'm seeing is "64 files remain after 5 retries".

I'm wondering why we have all these COPY INTO operations with no rows loaded, and if there happen to be 5 of these in a row, will it abort prematurely?

@zeroshade (Member Author)

I'm wondering why we have all these COPY INTO operations with no rows loaded, and if there happen to be 5 of these in a row, will it abort prematurely?

It's the finishCopy part. Once all the uploading has completed, there's a loop that calls executeCopy if not all of the files have been loaded yet. It makes 5 attempts at performing the last copies and then aborts if there are still files that weren't copied. But with the latest changes, it should wait until the existing COPY INTO commands complete before moving on to the final attempts, so I don't know how you're getting the abort in this scenario. I can't see any way in the code that it could end up in the retry loop until the other COPY INTO commands finish...
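Roughly, the flow described above looks like the following sketch (under assumptions; countRemainingFiles and executeCopy are placeholder names, not the driver's actual functions):

package example

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// Sketch of the described flow: wait for the in-flight COPY INTO queries, then
// make a bounded number of retry attempts for anything still unloaded.
func finishCopy(
	ctx context.Context,
	g *errgroup.Group, // group running the in-flight COPY INTO queries
	countRemainingFiles func(context.Context) (int, error),
	executeCopy func(context.Context) error,
) error {
	// Wait for the in-flight COPY INTO queries first, so the retry loop only
	// sees files that genuinely were not picked up.
	if err := g.Wait(); err != nil {
		return err
	}

	const maxRetries = 5
	for attempt := 0; attempt < maxRetries; attempt++ {
		remaining, err := countRemainingFiles(ctx)
		if err != nil {
			return err
		}
		if remaining == 0 {
			return nil
		}
		// Issue another COPY INTO for whatever is staged but still unloaded.
		if err := executeCopy(ctx); err != nil {
			return err
		}
	}

	remaining, err := countRemainingFiles(ctx)
	if err != nil {
		return err
	}
	if remaining > 0 {
		return fmt.Errorf("some files not loaded by COPY command, %d files remain after %d retries", remaining, maxRetries)
	}
	return nil
}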

Can you give me any more information about the shape of your data? How many rows/columns? I want to see if I can reproduce the failure you're seeing so I can debug how it's happening.

@lidavidm (Member)

Should we merge this for the time being while we continue investigating? It seems it fixes at least a couple of issues, if not all of them.

@zeroshade (Member Author)

@lidavidm I agree, let's merge this to fix the issues it does fix, and I'll continue investigating.

zeroshade merged commit 7706ace into apache:main on Feb 18, 2025
38 of 40 checks passed
@zeroshade (Member Author)

@davlee1972 Please feel free to reply here if you can give me more information (number of rows / columns / batches, etc) about your dataset that can help me to try to reproduce your failure (also, please try to build it fresh locally with the updated changes and see if that helps). In the meantime, I have some ideas about reducing the extraneous copy calls that have 0 rows inserted.

zeroshade deleted the snowflake-sessions branch on February 18, 2025 at 23:29
@Niivii commented Feb 19, 2025

@davlee1972 Please feel free to reply here if you can give me more information (number of rows / columns / batches, etc) about your dataset that can help me to try to reproduce your failure (also, please try to build it fresh locally with the updated changes and see if that helps). In the meantime, I have some ideas about reducing the extraneous copy calls that have 0 rows inserted.

From his photo, you can see the number of inserted rows in Snowflake on the right.
That's roughly 5 times more than on my side, but of course it depends on the data types and number of columns too.
I guess you might be able to replicate it by tweaking my script and adding a 0 to the number of rows, if you have the memory for it on your computer (should be around ~5 GB?):
big_df = generate_big_dataframe(num_rows=50000000)

Otherwise, try to saturate your RAM with the biggest dataframe possible and see how it goes 🤷

@davlee1972

Finally got my dual-login issue with GitHub resolved (for the next 90 days)...

Here's my test code using the NYC Taxi Trip Data.

https://www.kaggle.com/datasets/elemento/nyc-yellow-taxi-trip-data?resource=download

It aborted after coming up with 5 empty COPY INTOs.

import pyarrow.dataset as ds
import adbc_driver_snowflake.dbapi as dbapi

csv_dataset = ds.dataset("archive", format="csv")

# db_kwargs (the Snowflake connection options/credentials) is assumed to be defined elsewhere
db_args = {}
db_args["db_kwargs"] = db_kwargs
db_args["autocommit"] = True

conn = dbapi.connect(**db_args)

cursor = conn.cursor()

cursor.adbc_ingest(
    table_name="taxi_data",
    data=csv_dataset,
    mode="create_append",
)

Output:

Exception ignored in: <function Connection.__del__ at 0x7ff8738f8790>
Traceback (most recent call last):
  File "/home/my_login/    miniconda3/lib/python3.9/site-packages/adbc_driver_manager/dbapi.py", line 352, in __del__
    self.close()
  File "/home/my_login/    miniconda3/lib/python3.9/site-packages/adbc_driver_manager/dbapi.py", line 332, in close
    self._conn.close()
  File "adbc_driver_manager/_lib.pyx", line 1046, in adbc_driver_manager._lib.AdbcConnection.close
  File "adbc_driver_manager/_lib.pyx", line 333, in adbc_driver_manager._lib._AdbcHandle._check_open_children
  File "adbc_driver_manager/_lib.pyx", line 335, in adbc_driver_manager._lib._AdbcHandle._check_open_children
RuntimeError: Cannot close AdbcConnection with open AdbcStatement
Exception ignored in: <function Connection.__del__ at 0x7ff8738f8790>
Traceback (most recent call last):
  File "/home/my_login/    miniconda3/lib/python3.9/site-packages/adbc_driver_manager/dbapi.py", line 352, in __del__
    self.close()
  File "/home/my_login/    miniconda3/lib/python3.9/site-packages/adbc_driver_manager/dbapi.py", line 332, in close
    self._conn.close()
  File "adbc_driver_manager/_lib.pyx", line 1046, in adbc_driver_manager._lib.AdbcConnection.close
  File "adbc_driver_manager/_lib.pyx", line 333, in adbc_driver_manager._lib._AdbcHandle._check_open_children
  File "adbc_driver_manager/_lib.pyx", line 335, in adbc_driver_manager._lib._AdbcHandle._check_open_children
RuntimeError: Cannot close AdbcConnection with open AdbcStatement
Exception ignored in: <function Connection.__del__ at 0x7ff8738f8790>
Traceback (most recent call last):
  File "/home/my_login/    miniconda3/lib/python3.9/site-packages/adbc_driver_manager/dbapi.py", line 352, in __del__
    self.close()
  File "/home/my_login/    miniconda3/lib/python3.9/site-packages/adbc_driver_manager/dbapi.py", line 332, in close
    self._conn.close()
  File "adbc_driver_manager/_lib.pyx", line 1046, in adbc_driver_manager._lib.AdbcConnection.close
  File "adbc_driver_manager/_lib.pyx", line 333, in adbc_driver_manager._lib._AdbcHandle._check_open_children
  File "adbc_driver_manager/_lib.pyx", line 335, in adbc_driver_manager._lib._AdbcHandle._check_open_children
RuntimeError: Cannot close AdbcConnection with open AdbcStatement
time="2025-02-19T15:42:23-05:00" level=error msg="error: context canceled" func="gosnowflake.(*snowflakeConn).queryContextInternal" file="connection.go:413"
Exception ignored in: <function Connection.__del__ at 0x7ff8738f8790>
Traceback (most recent call last):
  File "/home/my_login/    miniconda3/lib/python3.9/site-packages/adbc_driver_manager/dbapi.py", line 352, in __del__
    self.close()
  File "/home/my_login/    miniconda3/lib/python3.9/site-packages/adbc_driver_manager/dbapi.py", line 332, in close
    self._conn.close()
  File "adbc_driver_manager/_lib.pyx", line 1046, in adbc_driver_manager._lib.AdbcConnection.close
  File "adbc_driver_manager/_lib.pyx", line 333, in adbc_driver_manager._lib._AdbcHandle._check_open_children
  File "adbc_driver_manager/_lib.pyx", line 335, in adbc_driver_manager._lib._AdbcHandle._check_open_children
RuntimeError: Cannot close AdbcConnection with open AdbcStatement
---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
Cell In[19], line 14
     10 conn = dbapi.connect(**db_args)
     12 cursor = conn.cursor()
---> 14 cursor.adbc_ingest(
     15     table_name="taxi_data",
     16     data=csv_dataset,
     17     mode="create_append",
     18 )

File ~/miniconda3/lib/python3.9/site-packages/adbc_driver_manager/dbapi.py:937, in Cursor.adbc_ingest(self, table_name, data, mode, catalog_name, db_schema_name, temporary)
    934     self._stmt.bind_stream(handle)
    936 self._last_query = None
--> 937 return _blocking_call(self._stmt.execute_update, (), {}, self._stmt.cancel)

File ~/miniconda3/lib/python3.9/site-packages/adbc_driver_manager/_lib.pyx:1573, in adbc_driver_manager._lib._blocking_call_impl()

File ~/miniconda3/lib/python3.9/site-packages/adbc_driver_manager/_lib.pyx:1566, in adbc_driver_manager._lib._blocking_call_impl()

File ~/miniconda3/lib/python3.9/site-packages/adbc_driver_manager/_lib.pyx:1299, in adbc_driver_manager._lib.AdbcStatement.execute_update()

File ~/miniconda3/lib/python3.9/site-packages/adbc_driver_manager/_lib.pyx:260, in adbc_driver_manager._lib.check_error()

InternalError: INTERNAL: some files not loaded by COPY command, 44 files remain after 5 retries

(screenshots attached)

@zeroshade (Member Author)

Nice, I'll see if I can reproduce with the same data. Hopefully it'll be consistent and I can fix this. Thanks @davlee1972

Successfully merging this pull request may close these issues.

go/adbc/driver/snowflake: failing to get columns when current schema is not set