feat: support non vectorized managed function #1373

Open: wants to merge 17 commits into base: main

Conversation

Contributor

@jialuoo jialuoo commented Feb 6, 2025

b/391680147

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added labels on Feb 6, 2025: "size: xl" (Pull request size is extra large.) and "api: bigquery" (Issues related to the googleapis/python-bigquery-dataframes API.)
@jialuoo jialuoo self-assigned this Feb 6, 2025
@jialuoo jialuoo requested a review from shobsi February 7, 2025 18:27
@jialuoo jialuoo marked this pull request as ready for review February 7, 2025 18:27
@jialuoo jialuoo requested review from a team as code owners February 7, 2025 18:27
Contributor

@shobsi shobsi left a comment

I'll review the test code in the next batch

def wrapper(func):
    nonlocal input_types, output_type

    if not callable(func):
Contributor

Can lines 808-837 be put in a common function?

Contributor Author

@jialuoo jialuoo Feb 19, 2025

Yeah, there is a TODO on top of the wrapper. I'll use another PR to do it later if you agree.

Contributor

I see, let's create an issue for tracking

assets can be located through the following properties set in the
object:

`bigframes_managed_function` - The bigquery managed function
Contributor

We should document bigframes_bigquery_function (related to the other comment)

@@ -570,11 +647,12 @@ def try_delattr(attr):
     func.bigframes_cloud_function = (
         remote_function_client.get_cloud_function_fully_qualified_name(cf_name)
     )
-    func.bigframes_remote_function = (
+    func.bigframes_function = (
Contributor

[nit] I think for clarity we should call the new attribute "bigframes_bigquery_function".

# TODO(jialuo): Deprecate the "bigframes_remote_function" attribute.
# We have some tests using pre-defined remote_function that were
# defined based on "bigframes_remote_function" instead of
# "bigframes_bigquery_function". So we need to fix those pre-defined
Contributor

Let's chat offline about which tests need the logic here to depend on both attributes. If possible, we should rely on the new attribute and keep the older attribute only for backward compatibility.
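
A minimal sketch of the fallback described in this comment, using the attribute names from this thread. `resolve_bigquery_function_name` is a hypothetical helper for illustration, not part of bigframes, and the function name assigned below is made up:

```python
def resolve_bigquery_function_name(func):
    """Return the BigQuery function name attached to `func`, or None.

    The new attribute is checked first; the legacy attribute is consulted
    only as a backward-compatible fallback.
    """
    for attr in ("bigframes_bigquery_function", "bigframes_remote_function"):
        name = getattr(func, attr, None)
        if name is not None:
            return name
    return None


def legacy_udf(x):
    return x


# Hypothetical fully qualified name, set only through the legacy attribute.
legacy_udf.bigframes_remote_function = "my-project.my_dataset.my_function"
resolved = resolve_bigquery_function_name(legacy_udf)
```

With this shape, callers read a single helper, and the legacy attribute can later be dropped from the tuple without touching call sites.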

    is_row_processor,
):
    """Create a BigQuery managed function."""
    self._create_bq_connection()
Contributor

A connection is not mandatory for a managed function.

ibis_signature.output_type
),
language="python",
runtime_version="python-3.11",
Contributor

We should pick this up from the environment instead of hard coding
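
One way this could look, as a sketch only: derive the label from the running interpreter. The "python-X.Y" format is inferred from the hard-coded value in the diff, `default_runtime_version` is a hypothetical helper, and nothing here checks whether BigQuery actually supports the resulting version:

```python
import sys


def default_runtime_version() -> str:
    """Build a runtime label such as "python-3.11" from the current interpreter."""
    return f"python-{sys.version_info.major}.{sys.version_info.minor}"


runtime_version = default_runtime_version()
```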


managed_function_options = {
    "runtime_version": runtime_version,
    "entry_point": "managed_func",
Contributor

[nit] maybe call it "bigframes_handler"


udf = cloudpickle.loads({pickled})

def managed_func(*args, **kwargs):
Contributor

I think kwargs is redundant here, we can just use args
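
A sketch of a handler that takes positional arguments only, using the handler name suggested earlier in this review. It assumes the arguments always arrive positionally. `make_handler` is a hypothetical stand-in for the code template in the diff, which builds `udf` with `cloudpickle.loads`:

```python
def make_handler(udf):
    """Wrap `udf` in a handler that accepts positional arguments only."""

    def bigframes_handler(*args):
        # Forward the positional arguments unchanged to the user function.
        return udf(*args)

    return bigframes_handler


def square(x):
    return x * x


handler = make_handler(square)
result = handler(4)
```

Dropping `**kwargs` also makes an unexpected keyword argument fail loudly with a `TypeError` instead of being forwarded silently.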

self._try_delattr(func, "is_row_processor")
self._try_delattr(func, "ibis_node")

bq_function_name = name if name else func.__name__
Contributor

Let's not use func.__name__: multiple users using a common name with entirely different code could end up overwriting each other. See how provision_bq_remote_function determines the name of the BQ function from the hash of the user code + dependencies.
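
A sketch of the hash-based naming idea, not the actual logic in provision_bq_remote_function. `hashed_function_name`, the `bigframes` prefix, and the choice of SHA-256 are assumptions for illustration; in the real flow the bytes would presumably be the serialized user function:

```python
import hashlib


def hashed_function_name(code_bytes, packages=(), prefix="bigframes"):
    """Derive a function name from the user code and its dependencies.

    Identical code and packages always map to the same name, while any
    difference in either yields a different name, so unrelated functions
    that merely share a Python name do not overwrite each other.
    """
    hasher = hashlib.sha256()
    hasher.update(code_bytes)
    # Sort so that the order in which packages are listed does not matter.
    for package in sorted(packages):
        hasher.update(package.encode("utf-8"))
    return f"{prefix}_{hasher.hexdigest()}"


name_a = hashed_function_name(b"def f(x): return x * x", ["pandas", "numpy"])
name_b = hashed_function_name(b"def f(x): return x + x", ["pandas", "numpy"])
```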



@pytest.fixture(scope="module")
def bq_cf_connection() -> str:
Contributor

Since a connection is optional for a managed UDF, let's run the tests without one. We can have one or two separate tests in the large tests that specifically test an explicit connection.

pd_int64_col = scalars_pandas_df["int64_col"]
pd_int64_col_filter = pd_int64_col.notnull()
pd_int64_col_filtered = pd_int64_col[pd_int64_col_filter]
pd_result_col = pd_int64_col_filtered.apply(lambda x: x * x)
Contributor

No need to use an independent lambda, .apply(square) would work on a pandas series. (If you found such usage elsewhere, it was probably written before the remote function could be applied on scalar directly - the op in line 62)

Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: xl Pull request size is extra large.
2 participants