Astropy CSV table reader using pyarrow #17706

Draft: wants to merge 14 commits into base: main

Conversation

@taldcroft (Member) commented Feb 1, 2025

Description

This pull request is a draft implementation of a fast CSV reader for astropy that uses pyarrow.csv.read_csv. This was discussed in #16869.

Before going much further, I am hoping to get feedback on the general implementation and API. The goal is an interface that will be familiar to astropy io.ascii users while exposing some of the additional features provided by pyarrow read_csv. The interface is not yet complete, but the idea is to keep it clean and consistent with astropy.

A quick demonstration notebook that you can use to play with this is at: https://gist.github.com/taldcroft/ac15bc516a7bf7c76f9eec644c787298
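
For orientation, here is a minimal sketch, assuming nothing about the PR's actual API, of the pipeline the wrapper builds on: call pyarrow.csv.read_csv and convert the resulting Arrow table into an astropy Table. The file name is hypothetical and the options shown are only a small subset of what pyarrow exposes.

    # Minimal sketch (not the PR's actual API): read a CSV with pyarrow and
    # wrap the result in an astropy Table. "data.csv" is a hypothetical file.
    import pyarrow.csv as pa_csv
    from astropy.table import Table

    arrow_table = pa_csv.read_csv(
        "data.csv",
        parse_options=pa_csv.ParseOptions(delimiter=","),
        convert_options=pa_csv.ConvertOptions(null_values=["", "NULL"]),
    )

    # Each pyarrow ChunkedArray becomes a NumPy array; a dict of arrays is a
    # valid astropy Table constructor argument.
    tbl = Table(
        {
            name: col.to_numpy()
            for name, col in zip(arrow_table.column_names, arrow_table.columns)
        }
    )

The wrapper in this PR layers astropy-style options on top of this (e.g. include_names, dtypes, null_values, as seen in the review snippets further down).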

Fixes #16869

Related

pandas-dev/pandas#54466

  • By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

github-actions bot (Contributor) commented Feb 1, 2025

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

github-actions bot (Contributor) commented Feb 1, 2025

👋 Thank you for your draft pull request! Did you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

@pllim pllim added this to the v7.1.0 milestone Feb 3, 2025
@taldcroft taldcroft requested review from hamogu, mhvk and dhomeier February 6, 2025 16:54
@pllim (Member) left a comment

Thanks! I want to benchmark this, but does that mean we need to install pyarrow in https://github.com/astropy/astropy/blob/main/.github/workflows/ci_benchmark.yml?

[2 inline review comments on astropy/io/misc/pyarrow/csv.py, both marked outdated and resolved]

@taldcroft (Member, Author) replied:

> Thanks! I want to benchmark this, but does that mean we need to install pyarrow in https://github.com/astropy/astropy/blob/main/.github/workflows/ci_benchmark.yml?

There are one-time benchmarks here: #16869 (comment). These demonstrate that pyarrow read_csv() appears to be a factor of 10 faster than any of the other readers.
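
For anyone wanting to reproduce a number like that locally, a one-off timing comparison could look like the sketch below (the file name is hypothetical; the linked comment has the actual benchmark results).

    # Rough timing sketch; "big_table.csv" is a hypothetical large CSV file.
    import time

    import pyarrow.csv as pa_csv
    from astropy.io import ascii

    path = "big_table.csv"

    t0 = time.perf_counter()
    arrow_table = pa_csv.read_csv(path)
    t_pyarrow = time.perf_counter() - t0

    t0 = time.perf_counter()
    astropy_table = ascii.read(path, format="csv", fast_reader=True)
    t_fast_c = time.perf_counter() - t0

    print(f"pyarrow: {t_pyarrow:.2f} s   astropy fast C reader: {t_fast_c:.2f} s")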

@mhvk (Contributor) left a comment

Nice! I like the general idea; my only more major comment is that I'm not sure one should add the commented-line skipper at this initial stage.

A follow-up, I guess, would be to make this the default "first try" if pyarrow is available, and then deprecate the fast reader?

It does seem Table.{from,to}_pyarrow methods would be reasonable, but better as follow-up.
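
A rough sketch of that "first try" control flow, under the assumption that plain CSV reads get routed to pyarrow when it is importable; the helper name is made up and this is not what the PR implements.

    # Illustrative only: prefer pyarrow for plain CSV reads when it is
    # installed, otherwise fall back to the existing astropy reader.
    # read_csv_first_try is a made-up name, not part of this PR.
    from astropy.io import ascii
    from astropy.table import Table


    def read_csv_first_try(path, **kwargs) -> Table:
        try:
            import pyarrow.csv as pa_csv
        except ImportError:
            pa_csv = None

        # In this sketch, only keyword-free calls take the pyarrow path; a real
        # implementation would translate the supported keywords instead.
        if pa_csv is not None and not kwargs:
            arrow_table = pa_csv.read_csv(path)
            return Table(
                {
                    name: col.to_numpy()
                    for name, col in zip(arrow_table.column_names, arrow_table.columns)
                }
            )

        return ascii.read(path, format="csv", **kwargs)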

[9 inline review comments on astropy/io/misc/pyarrow/csv.py, most marked outdated and resolved]

@hamogu (Member) commented Feb 7, 2025

Since @dhomeier has shown pyarrow to be significantly faster, it would be good to have it for the biggest tables. And this is a relatively thin wrapper just to match the API we are used to, so why not?
I do wonder (similar to @mhvk) how far it makes sense to go in adding capabilities that are not native to pyarrow (e.g. comment characters). Is it worth the pure-Python preprocessing at all? Would that dilute the advertised point "this is super fast and super memory-efficient, so use it for tables in the GB range"?

For smaller tables we have other established solutions which are more flexible (not least our own pure-Python readers and our own C reader). How many GB-sized tables are there in the wild with commented lines that are not in the header? I'm just worried about user confusion along the lines of "It's reading this table just fine, but that almost-identical table (with comment lines) crashes with a Python out-of-memory error". Of course, that only applies to the biggest tables of them all. For CSV files in the 0.5-1 GB range, this would probably still be faster AND would fit into memory (and maybe not be too slow) on modern machines. So it's a trade-off.
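
To make that trade-off concrete, the kind of pure-Python preprocessing being discussed could look like the sketch below: it buffers a comment-stripped copy of the file in memory before handing it to pyarrow, and that extra copy is where the memory concern for multi-GB tables comes from. The function name is illustrative, not the PR's implementation.

    # Illustrative sketch, not the PR's implementation: strip comment lines in
    # pure Python, then let pyarrow parse the filtered bytes. The filtered copy
    # is held entirely in memory, which roughly doubles peak memory for big files.
    import io

    import pyarrow.csv as pa_csv


    def read_csv_skipping_comments(path: str, comment: str = "#"):
        comment_bytes = comment.encode()
        with open(path, "rb") as fh:
            filtered = b"".join(
                line for line in fh if not line.lstrip().startswith(comment_bytes)
            )
        return pa_csv.read_csv(io.BytesIO(filtered))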

@neutrinoceros (Contributor) left a comment

This may be a bit too technical for a first round of review, but I wanted to get this kind of feedback in early too, so it doesn't grow into too much of a pain later. Here are a couple of suggestions and comments, mostly about type annotations and internal consistency.

[10 inline review comments on astropy/io/misc/pyarrow/csv.py, most marked outdated and resolved]

@taldcroft (Member, Author) left a comment

Thanks for the great comments! I think I've addressed them all, or at least responded. Sounds like I have agreement to keep going on this and start working on tests, docs, etc.?

[10 inline review comments on astropy/io/misc/pyarrow/csv.py, most marked outdated and resolved]


Contributor suggested a change on:

    def get_convert_options(
        include_names: list | None,

Suggested change:

    -    include_names: list | None,
    +    include_names: list[str] | None,

Contributor suggested a change on:

    def get_convert_options(
        include_names: list | None,
        dtypes: dict[str, "npt.DTypeLike"] | None,
        null_values: list | None,

Suggested change:

    -    null_values: list | None,
    +    null_values: list[str] | None,

Contributor suggested a change on:

    def get_read_options(
        header_start: int | None,
        data_start: int | None,
        names: list | None,

Suggested change:

    -    names: list | None,
    +    names: list[str] | None,

Contributor suggested a change on:

        include_names: list[str] | None = None,
        dtypes: dict[str, "npt.DTypeLike"] | None = None,
        comment: str | None = None,
        null_values: list | None = None,

Suggested change:

    -    null_values: list | None = None,
    +    null_values: list[str] | None = None,
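
Collecting the suggestions above, the annotated signatures would read roughly as follows; only the parameters visible in the snippets above are shown, and the bodies are elided.

    # Signatures with the suggested list[str] annotations applied. Parameters
    # not visible in the review snippets are omitted; bodies are elided.
    import numpy.typing as npt


    def get_convert_options(
        include_names: list[str] | None,
        dtypes: dict[str, npt.DTypeLike] | None,
        null_values: list[str] | None,
    ):
        ...


    def get_read_options(
        header_start: int | None,
        data_start: int | None,
        names: list[str] | None,
    ):
        ...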

Development: successfully merging this pull request may close the issue "Consider using pyarrow under the hood for fast ASCII reading".

5 participants