bug: reading Delta Lake tables hits arrow_scan issue in 10.0.0 #10829

Open
1 task done
lostmygithubaccount opened this issue Feb 12, 2025 · 7 comments · May be fixed by #10833
Labels
bug Incorrect behavior inside of ibis

Comments

@lostmygithubaccount
Member

What happened?

I noticed that reading Delta Lake tables was failing; it seems to be related to upgrading to 10.0.0. A simple reproduction, after installing into a fresh virtual environment on main:

(ibis) cody@dkdcascend ibis % ipy
Python 3.12.8 (main, Jan 14 2025, 23:36:58) [Clang 19.1.6 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.32.0 -- An enhanced Interactive Python. Type '?' for help.

[ins] In [1]: import ibis

[ins] In [2]: ibis.__version__
Out[2]: '10.0.0'

[ins] In [3]: ibis.options.interactive = True

[ins] In [4]: t = ibis.examples.penguins.fetch()

[ins] In [5]: t.to_delta("penguins.delta")

[ins] In [6]: t = ibis.read_delta("penguins.delta")

[ins] In [7]: t
Out[7]: ---------------------------------------------------------------------------
InvalidInputException                     Traceback (most recent call last)
File ~/code/ibis/.venv/lib/python3.12/site-packages/IPython/core/formatters.py:770, in PlainTextFormatter.__call__(self, obj)
    763 stream = StringIO()
    764 printer = pretty.RepresentationPrinter(stream, self.verbose,
    765     self.max_width, self.newline,
    766     max_seq_length=self.max_seq_length,
    767     singleton_pprinters=self.singleton_printers,
    768     type_pprinters=self.type_printers,
    769     deferred_pprinters=self.deferred_printers)
--> 770 printer.pretty(obj)
    771 printer.flush()
    772 return stream.getvalue()

File ~/code/ibis/.venv/lib/python3.12/site-packages/IPython/lib/pretty.py:419, in RepresentationPrinter.pretty(self, obj)
    408                         return meth(obj, self, cycle)
    409                 if (
    410                     cls is not object
    411                     # check if cls defines __repr__
   (...)
    417                     and callable(_safe_getattr(cls, "__repr__", None))
    418                 ):
--> 419                     return _repr_pprint(obj, self, cycle)
    421     return _default_pprint(obj, self, cycle)
    422 finally:

File ~/code/ibis/.venv/lib/python3.12/site-packages/IPython/lib/pretty.py:794, in _repr_pprint(obj, p, cycle)
    792 """A pprint that just redirects to the normal repr function."""
    793 # Find newlines and replace them with p.break_()
--> 794 output = repr(obj)
    795 lines = output.splitlines()
    796 with p.group():

File ~/code/ibis/ibis/expr/types/core.py:83, in Expr.__repr__(self)
     81 def __repr__(self) -> str:
     82     if ibis.options.interactive:
---> 83         return _capture_rich_renderable(self)
     84     else:
     85         return self._noninteractive_repr()

File ~/code/ibis/ibis/expr/types/core.py:63, in _capture_rich_renderable(renderable)
     61 console = Console(force_terminal=False)
     62 with console.capture() as capture:
---> 63     console.print(renderable)
     64 return capture.get().rstrip()

File ~/code/ibis/.venv/lib/python3.12/site-packages/rich/console.py:1705, in Console.print(self, sep, end, style, justify, overflow, no_wrap, emoji, markup, highlight, width, height, crop, soft_wrap, new_line_start, *objects)
   1703 if style is None:
   1704     for renderable in renderables:
-> 1705         extend(render(renderable, render_options))
   1706 else:
   1707     for renderable in renderables:

File ~/code/ibis/.venv/lib/python3.12/site-packages/rich/console.py:1306, in Console.render(self, renderable, options)
   1304 renderable = rich_cast(renderable)
   1305 if hasattr(renderable, "__rich_console__") and not isclass(renderable):
-> 1306     render_iterable = renderable.__rich_console__(self, _options)
   1307 elif isinstance(renderable, str):
   1308     text_renderable = self.render_str(
   1309         renderable, highlight=_options.highlight, markup=_options.markup
   1310     )

File ~/code/ibis/ibis/expr/types/core.py:106, in Expr.__rich_console__(self, console, options)
    103 if opts.interactive:
    104     from ibis.expr.types.pretty import to_rich
--> 106     rich_object = to_rich(self, console_width=console_width)
    107 else:
    108     rich_object = Text(self._noninteractive_repr())

File ~/code/ibis/ibis/expr/types/pretty.py:279, in to_rich(expr, max_rows, max_columns, max_length, max_string, max_depth, console_width)
    275     return _to_rich_scalar(
    276         expr, max_length=max_length, max_string=max_string, max_depth=max_depth
    277     )
    278 else:
--> 279     return _to_rich_table(
    280         expr,
    281         max_rows=max_rows,
    282         max_columns=max_columns,
    283         max_length=max_length,
    284         max_string=max_string,
    285         max_depth=max_depth,
    286         console_width=console_width,
    287     )

File ~/code/ibis/ibis/expr/types/pretty.py:358, in _to_rich_table(tablish, max_rows, max_columns, max_length, max_string, max_depth, console_width)
    355     if orig_ncols > len(computed_cols):
    356         table = table.select(*computed_cols)
--> 358 result = table.limit(max_rows + 1).to_pyarrow()
    359 # Now format the columns in order, stopping if the console width would
    360 # be exceeded.
    361 col_info = []

File ~/code/ibis/ibis/expr/types/core.py:579, in Expr.to_pyarrow(self, params, limit, **kwargs)
    551 @experimental
    552 def to_pyarrow(
    553     self,
   (...)
    557     **kwargs: Any,
    558 ) -> pa.Table:
    559     """Execute expression and return results in as a pyarrow table.
    560
    561     This method is eager and will execute the associated expression
   (...)
    577         A pyarrow table holding the results of the executed expression.
    578     """
--> 579     return self._find_backend(use_default=True).to_pyarrow(
    580         self, params=params, limit=limit, **kwargs
    581     )

File ~/code/ibis/ibis/backends/duckdb/__init__.py:1314, in Backend.to_pyarrow(self, expr, params, limit, **kwargs)
   1303 def to_pyarrow(
   1304     self,
   1305     expr: ir.Expr,
   (...)
   1310     **kwargs: Any,
   1311 ) -> pa.Table:
   1312     table = self._to_duckdb_relation(
   1313         expr, params=params, limit=limit, **kwargs
-> 1314     ).arrow()
   1315     return expr.__pyarrow_result__(table, data_mapper=DuckDBPyArrowData)

InvalidInputException: Invalid Input Error: arrow_scan: get_next failed(): IOError: Repetition level histogram size mismatch

[ins] In [8]: exit
(ibis) cody@dkdcascend ibis % uv pip list | grep pyarrow
pyarrow                   19.0.0
pyarrow-hotfix            0.6

What version of ibis are you using?

main/10.0.0

What backend(s) are you using, if any?

duckdb

Relevant log output

Code of Conduct

  • I agree to follow this project's Code of Conduct
@cpcloud
Member

cpcloud commented Feb 12, 2025

I think this is a bug in DuckDB. Can you try running your code without Ibis involved?

@lostmygithubaccount
Member Author

it seems fine without Ibis:

[ins] In [1]: import ibis

[ins] In [2]: ibis.options.interactive = True

[ins] In [3]: t = ibis.examples.penguins.fetch()

[ins] In [4]: t.to_delta("penguins.delta", mode="overwrite")

[ins] In [5]: import duckdb

[ins] In [6]: con = duckdb.connect()

[ins] In [7]: r = con.sql("select * from delta_scan('penguins.delta');")

[ins] In [8]: r
Out[8]:
┌───────────┬───────────┬────────────────┬───────────────┬───────────────────┬─────────────┬─────────┬───────┐
│  species  │  island   │ bill_length_mm │ bill_depth_mm │ flipper_length_mm │ body_mass_g │   sex   │ year  │
│  varchar  │  varchar  │     double     │    double     │       int64       │    int64    │ varchar │ int64 │
├───────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼─────────┼───────┤
│ Adelie    │ Torgersen │           39.1 │          18.7 │               181 │        3750 │ male    │  2007 │
│ Adelie    │ Torgersen │           39.5 │          17.4 │               186 │        3800 │ female  │  2007 │
│ Adelie    │ Torgersen │           40.3 │          18.0 │               195 │        3250 │ female  │  2007 │
│ Adelie    │ Torgersen │           NULL │          NULL │              NULL │        NULL │ NULL    │  2007 │
│ Adelie    │ Torgersen │           36.7 │          19.3 │               193 │        3450 │ female  │  2007 │
│ Adelie    │ Torgersen │           39.3 │          20.6 │               190 │        3650 │ male    │  2007 │
│ Adelie    │ Torgersen │           38.9 │          17.8 │               181 │        3625 │ female  │  2007 │
│ Adelie    │ Torgersen │           39.2 │          19.6 │               195 │        4675 │ male    │  2007 │
│ Adelie    │ Torgersen │           34.1 │          18.1 │               193 │        3475 │ NULL    │  2007 │
│ Adelie    │ Torgersen │           42.0 │          20.2 │               190 │        4250 │ NULL    │  2007 │
│   ·       │   ·       │             ·  │            ·  │                ·  │          ·  │  ·      │    ·  │
│   ·       │   ·       │             ·  │            ·  │                ·  │          ·  │  ·      │    ·  │
│   ·       │   ·       │             ·  │            ·  │                ·  │          ·  │  ·      │    ·  │
│ Chinstrap │ Dream     │           50.2 │          18.8 │               202 │        3800 │ male    │  2009 │
│ Chinstrap │ Dream     │           45.6 │          19.4 │               194 │        3525 │ female  │  2009 │
│ Chinstrap │ Dream     │           51.9 │          19.5 │               206 │        3950 │ male    │  2009 │
│ Chinstrap │ Dream     │           46.8 │          16.5 │               189 │        3650 │ female  │  2009 │
│ Chinstrap │ Dream     │           45.7 │          17.0 │               195 │        3650 │ female  │  2009 │
│ Chinstrap │ Dream     │           55.8 │          19.8 │               207 │        4000 │ male    │  2009 │
│ Chinstrap │ Dream     │           43.5 │          18.1 │               202 │        3400 │ female  │  2009 │
│ Chinstrap │ Dream     │           49.6 │          18.2 │               193 │        3775 │ male    │  2009 │
│ Chinstrap │ Dream     │           50.8 │          19.0 │               210 │        4100 │ male    │  2009 │
│ Chinstrap │ Dream     │           50.2 │          18.7 │               198 │        3775 │ female  │  2009 │
├───────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴─────────┴───────┤
│ 344 rows (20 shown)                                                                              8 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

[ins] In [9]: r.arrow()
Out[9]:
pyarrow.Table
species: string
island: string
bill_length_mm: double
bill_depth_mm: double
flipper_length_mm: int64
body_mass_g: int64
sex: string
year: int64
----
species: [["Adelie","Adelie","Adelie","Adelie","Adelie",...,"Chinstrap","Chinstrap","Chinstrap","Chinstrap","Chinstrap"]]
island: [["Torgersen","Torgersen","Torgersen","Torgersen","Torgersen",...,"Dream","Dream","Dream","Dream","Dream"]]
bill_length_mm: [[39.1,39.5,40.3,null,36.7,...,55.8,43.5,49.6,50.8,50.2]]
bill_depth_mm: [[18.7,17.4,18,null,19.3,...,19.8,18.1,18.2,19,18.7]]
flipper_length_mm: [[181,186,195,null,193,...,207,202,193,210,198]]
body_mass_g: [[3750,3800,3250,null,3450,...,4000,3400,3775,4100,3775]]
sex: [["male","female","female",null,"female",...,"male","female","male","male","female"]]
year: [[2007,2007,2007,2007,2007,...,2009,2009,2009,2009,2009]]

(and similarly via the duckdb CLI)

I'm noticing the same error on the docs here too (in the Delta Lake tab under file formats): https://ibis-project.org/how-to/input-output/basics#file-formats

@cpcloud
Member

cpcloud commented Feb 12, 2025

We're not using delta_scan though, we're using the pyarrow dataset reader.

You need to do this, which does reproduce it without Ibis:

In [1]: from deltalake import DeltaTable

In [2]: dt = DeltaTable("/tmp/penguins.delta")

In [3]: ds = dt.to_pyarrow_dataset()

In [4]: import duckdb

In [5]: con = duckdb.connect()

In [6]: con.register("ds", ds)
Out[6]: <duckdb.duckdb.DuckDBPyConnection at 0x7f65ae0621b0>

In [7]: res = con.sql("select * from ds").arrow()
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>:1                                                                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
InvalidInputException: Invalid Input Error: arrow_scan: get_next failed(): IOError: Repetition level histogram size mismatch

@cpcloud
Member

cpcloud commented Feb 12, 2025

We can try switching to delta_scan though. IIRC we're using the deltalake implementation because the DuckDB extension had some showstopping bugs.
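
For anyone who needs a stopgap, here is a minimal sketch of reading the table through DuckDB's `delta_scan` instead of the pyarrow dataset reader. It mirrors the `delta_scan` query shown earlier in this thread and is not how Ibis implements `read_delta`. The helper names `delta_scan_query` and `read_delta_via_duckdb` are hypothetical, and the sketch assumes DuckDB can load its delta extension on first use, as it appeared to in the session above.

```python
def delta_scan_query(path: str) -> str:
    """Build a SELECT over DuckDB's delta_scan table function.

    Single quotes in the path are doubled so the SQL string literal stays valid.
    """
    escaped = path.replace("'", "''")
    return f"SELECT * FROM delta_scan('{escaped}')"


def read_delta_via_duckdb(path: str):
    """Return the Delta table at `path` as a pyarrow Table (hypothetical helper).

    duckdb is imported lazily so the query builder above works without it.
    """
    import duckdb

    con = duckdb.connect()
    # Same call chain as the session above: con.sql(...) then .arrow()
    return con.sql(delta_scan_query(path)).arrow()
```

Calling `read_delta_via_duckdb("penguins.delta")` should correspond to the `r.arrow()` call in the earlier session. Whether the extension bugs mentioned above still apply is not something this thread settles.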

@cpcloud
Member

cpcloud commented Feb 12, 2025

Oh, this looks entirely like a deltalake or pyarrow bug:

In [1]: from deltalake import DeltaTable

In [2]: dt = DeltaTable("/tmp/penguins.delta")

In [3]: ds = dt.to_pyarrow_dataset()

In [4]: ds.to_table()
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>:1                                                                                    │
│                                                                                                  │
│ in pyarrow._dataset.Dataset.to_table:574                                                         │
│                                                                                                  │
│ in pyarrow._dataset.Scanner.to_table:3865                                                        │
│                                                                                                  │
│ in pyarrow.lib.pyarrow_internal_check_status:155                                                 │
│                                                                                                  │
│ in pyarrow.lib.check_status:92                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OSError: Repetition level histogram size mismatch
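
The narrowing done over the last few comments (pyarrow alone, then DuckDB over the pyarrow dataset, then Ibis) can be sketched as a small script that reports the lowest layer that fails. The helper names `first_failing_layer` and `delta_read_steps` are hypothetical; the library calls inside are the ones already used in this thread.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Step = Tuple[str, Callable[[], object]]


def first_failing_layer(steps: Sequence[Step]) -> Optional[Tuple[str, Exception]]:
    """Run steps in order; return (name, error) for the first one that raises."""
    for name, run in steps:
        try:
            run()
        except Exception as exc:
            return name, exc
    return None


def delta_read_steps(path: str) -> List[Step]:
    """Steps ordered from the lowest layer (pyarrow) up to Ibis."""

    def pyarrow_only():
        from deltalake import DeltaTable

        return DeltaTable(path).to_pyarrow_dataset().to_table()

    def duckdb_over_pyarrow():
        import duckdb
        from deltalake import DeltaTable

        ds = DeltaTable(path).to_pyarrow_dataset()
        con = duckdb.connect()
        con.register("ds", ds)
        return con.sql("select * from ds").arrow()

    def ibis_read():
        import ibis

        return ibis.read_delta(path).to_pyarrow()

    return [
        ("pyarrow dataset", pyarrow_only),
        ("duckdb over pyarrow dataset", duckdb_over_pyarrow),
        ("ibis.read_delta", ibis_read),
    ]
```

Running `first_failing_layer(delta_read_steps("/tmp/penguins.delta"))` in an affected environment would be expected to name the pyarrow step, which matches the traceback above.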

@cpcloud
Member

cpcloud commented Feb 12, 2025

Okay, I think it's entirely PyArrow and has already been fixed upstream:

apache/arrow#45283

@cpcloud
Member

cpcloud commented Feb 12, 2025

We can leave this open until 19.0.1 is out and we're testing it in Ibis's CI.

Thanks for the report!
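
Until then, a quick way to check whether an environment has the pyarrow version reported in this issue is sketched below. It assumes only what the thread states: 19.0.0 showed the failure and the fix is expected in 19.0.1. Other versions are not classified, and the names `KNOWN_AFFECTED`, `is_known_affected` and `installed_pyarrow_version` are hypothetical.

```python
from importlib import metadata
from typing import Optional

# Assumption: only the version reported in this issue is listed as affected.
KNOWN_AFFECTED = {"19.0.0"}


def is_known_affected(version: str) -> bool:
    """True if `version` is one this thread reported as failing."""
    return version.strip() in KNOWN_AFFECTED


def installed_pyarrow_version() -> Optional[str]:
    """Return the installed pyarrow version, or None if it is not installed."""
    try:
        return metadata.version("pyarrow")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_pyarrow_version()
    if version is None:
        print("pyarrow is not installed")
    elif is_known_affected(version):
        print(f"pyarrow {version} matches the version reported in this issue")
    else:
        print(f"pyarrow {version} is not the version reported in this issue")
```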
