Skip to content

Allow parsing byte literals as FixedSizeBinary #15686

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
leoyvens opened this issue Apr 11, 2025 · 4 comments · Fixed by #15726 · May be fixed by #15687
Closed

Allow parsing byte literals as FixedSizeBinary #15686

leoyvens opened this issue Apr 11, 2025 · 4 comments · Fixed by #15726 · May be fixed by #15687
Labels
enhancement New feature or request

Comments

@leoyvens
Copy link
Contributor

Is your feature request related to a problem or challenge?

Say you have a column byte_column of type FixedSizeBinary. Doing where byte_column = x'deadbeef' will fail, because the literal is parsed as Binary.

Describe the solution you'd like

Add a option to parse literals as FixedSizeBinary.

Describe alternatives you've considered

Coercing the LHS to Binary is less performant. Unconditionally parsing literals as FixedSizeBytes would be a breaking change.

Additional context

I am working on this.

@alamb
Copy link
Contributor

alamb commented Apr 12, 2025

Coercing the LHS to Binary is less performant. Unconditionally parsing literals as FixedSizeBytes would be a breaking change.

What about coercing the RHS to FixedSizeBinary?

So like

where byte_column = x'deadbeef'

Becomes

where byte_column = arrow_cast(x'deadbeef', FixedSizeBinary)

This would have the benefit of:

  1. Easier to use (no config flag)
  2. Could handle non exactly sizes things like casting 0xdead to 0x0000dead for comparisoon with 4 byte integers for example
  3. Would be just as performant

@leoyvens
Copy link
Contributor Author

@alamb thank you for looking at this. Avoiding a config flag would be nice. But I'm skeptical of the proposed coercion.

If we coerce binary to fixed(N) when encountering any expression of type: fixed(N) = binary, this would be a fallible coercion. While the infallible coercion of fixed(N) to binary would allow the expression to be evaluated for all values. DataFusion has already chosen this more lenient coercion for lists. The following:

select arrow_cast([1], 'FixedSizeList(1, Int64)') = [1, 2];

Evaluates today to false instead of giving a coercion error.

@alamb
Copy link
Contributor

alamb commented Apr 14, 2025

Here is a sql reproducer of what happens today

> create table t as values (arrow_cast(x'deadbeef', 'FixedSizeBinary(4)'));
0 row(s) fetched.
Elapsed 0.006 seconds.

> select column1 = x'deadbeef' from t;
Error during planning: Cannot infer common argument type for comparison operation FixedSizeBinary(4) = Binary

Coercing the LHS to Binary is less performant. Unconditionally parsing literals as FixedSizeBytes would be a breaking change.

I wonder how much less performant this in in practice 🤔 In theory we can have custom kernels with custom code for different array sizes, but I think in practice it actually ends up always checking the length

I really think we should avoid a new config option if possible

@leoyvens
Copy link
Contributor Author

leoyvens commented Apr 15, 2025

Ok I hear you, I'll try to do this without touching the parser and its configs.

A concern with an LHS coercion is efficient pushdown. col = x'lit' can make use of statistics and parquet-level pruning, while col::bytea = x'lit' is a full-column scan.

I will try to pursue the following approach:

  1. We do the LHS coercion.
  2. The ExprSimplifier is extended to expressions of the form:
Cast(fixed_size_binary_col, 'Binary') = x'lit'

If the lengths match, the cast is moved to the RHS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
2 participants