Astropy CSV table reader using pyarrow #17706

Draft: wants to merge 14 commits into base: main

Conversation

@taldcroft (Member) commented Feb 1, 2025

Description

This pull request is a draft implementation of a fast CSV reader for astropy that uses pyarrow.csv.read_csv. This was discussed in #16869.

Before going much further, I am hoping to get feedback on the general implementation and API. The goal is an interface that will be familiar to astropy io.ascii users while exposing some of the additional features provided by pyarrow read_csv. The interface is not yet complete, but the idea is to keep it clean and consistent with astropy.

A quick demonstration notebook that you can use to play with this is at: https://gist.github.com/taldcroft/ac15bc516a7bf7c76f9eec644c787298
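
For orientation, here is a minimal sketch, assuming nothing about the PR's actual API, of the pipeline the wrapper builds on: call pyarrow.csv.read_csv and convert the resulting Arrow table into an astropy Table. The file name is hypothetical and the options shown are only a small subset of what pyarrow exposes.

    # Minimal sketch (not the PR's actual API): read a CSV with pyarrow and
    # wrap the result in an astropy Table. "data.csv" is a hypothetical file.
    import pyarrow.csv as pa_csv
    from astropy.table import Table

    arrow_table = pa_csv.read_csv(
        "data.csv",
        parse_options=pa_csv.ParseOptions(delimiter=","),
        convert_options=pa_csv.ConvertOptions(null_values=["", "NULL"]),
    )

    # Each pyarrow ChunkedArray becomes a NumPy array; a dict of arrays is a
    # valid astropy Table constructor argument.
    tbl = Table(
        {
            name: col.to_numpy()
            for name, col in zip(arrow_table.column_names, arrow_table.columns)
        }
    )

The wrapper in this PR layers astropy-style options on top of this (e.g. include_names, dtypes, null_values, as seen in the review snippets further down).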

Fixes #16869

Related

pandas-dev/pandas#54466

  • By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

github-actions bot (Contributor) commented Feb 1, 2025

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

github-actions bot (Contributor) commented Feb 1, 2025

👋 Thank you for your draft pull request! Did you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

@pllim pllim added this to the v7.1.0 milestone Feb 3, 2025
@taldcroft taldcroft requested review from hamogu, mhvk and dhomeier February 6, 2025 16:54
@pllim (Member) left a comment

Thanks! I want to benchmark this, but does that mean we need to install pyarrow in https://github.com/astropy/astropy/blob/main/.github/workflows/ci_benchmark.yml?

[2 inline review comments on astropy/io/misc/pyarrow/csv.py, both marked outdated and resolved]

@taldcroft (Member, Author) replied:

> Thanks! I want to benchmark this, but does that mean we need to install pyarrow in https://github.com/astropy/astropy/blob/main/.github/workflows/ci_benchmark.yml?

There are one-time benchmarks here: #16869 (comment). These demonstrate that pyarrow read_csv() appears to be a factor of 10 faster than any of the other readers.
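
For anyone wanting to reproduce a number like that locally, a one-off timing comparison could look like the sketch below (the file name is hypothetical; the linked comment has the actual benchmark results).

    # Rough timing sketch; "big_table.csv" is a hypothetical large CSV file.
    import time

    import pyarrow.csv as pa_csv
    from astropy.io import ascii

    path = "big_table.csv"

    t0 = time.perf_counter()
    arrow_table = pa_csv.read_csv(path)
    t_pyarrow = time.perf_counter() - t0

    t0 = time.perf_counter()
    astropy_table = ascii.read(path, format="csv", fast_reader=True)
    t_fast_c = time.perf_counter() - t0

    print(f"pyarrow: {t_pyarrow:.2f} s   astropy fast C reader: {t_fast_c:.2f} s")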

@mhvk (Contributor) left a comment

Nice! I like the general idea; my only more major comment is that I'm not sure one should add the commented-line skipper at this initial stage.

A follow-up, I guess, would be to make this the default "first try" if pyarrow is available, and then deprecate the fast reader?

It does seem Table.{from,to}_pyarrow methods would be reasonable, but better as follow-up.
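
A rough sketch of that "first try" control flow, under the assumption that plain CSV reads get routed to pyarrow when it is importable; the helper name is made up and this is not what the PR implements.

    # Illustrative only: prefer pyarrow for plain CSV reads when it is
    # installed, otherwise fall back to the existing astropy reader.
    # read_csv_first_try is a made-up name, not part of this PR.
    from astropy.io import ascii
    from astropy.table import Table


    def read_csv_first_try(path, **kwargs) -> Table:
        try:
            import pyarrow.csv as pa_csv
        except ImportError:
            pa_csv = None

        # In this sketch, only keyword-free calls take the pyarrow path; a real
        # implementation would translate the supported keywords instead.
        if pa_csv is not None and not kwargs:
            arrow_table = pa_csv.read_csv(path)
            return Table(
                {
                    name: col.to_numpy()
                    for name, col in zip(arrow_table.column_names, arrow_table.columns)
                }
            )

        return ascii.read(path, format="csv", **kwargs)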

[9 inline review comments on astropy/io/misc/pyarrow/csv.py, most marked outdated and resolved]

@hamogu (Member) commented Feb 7, 2025

Since @dhomeier has shown pyarrow to be significantly faster, it would be good to have it for the biggest tables. And this is a relatively thin wrapper just to match the API we are used to, so why not?
I do wonder (similar to @mhvk) how far it makes sense to go in adding capabilities that are not native to pyarrow (e.g. comment characters). Is it worth the pure-Python preprocessing at all? Would that dilute the advertised point "this is super fast and super memory-efficient, so use it for tables in the GB range"?

For smaller tables we have other established solutions which are more flexible (not least our own pure-Python readers and our own C reader). How many GB-sized tables are there in the wild with commented lines that are not in the header? I'm just worried about user confusion along the lines of "It's reading this table just fine, but that almost-identical table (with comment lines) crashes with a Python out-of-memory error". Of course, that only applies to the biggest tables of them all. For CSV files in the 0.5-1 GB range, this would probably still be faster AND would fit into memory (and maybe not be too slow) on modern machines. So it's a trade-off.
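
To make that trade-off concrete, the kind of pure-Python preprocessing being discussed could look like the sketch below: it buffers a comment-stripped copy of the file in memory before handing it to pyarrow, and that extra copy is where the memory concern for multi-GB tables comes from. The function name is illustrative, not the PR's implementation.

    # Illustrative sketch, not the PR's implementation: strip comment lines in
    # pure Python, then let pyarrow parse the filtered bytes. The filtered copy
    # is held entirely in memory, which roughly doubles peak memory for big files.
    import io

    import pyarrow.csv as pa_csv


    def read_csv_skipping_comments(path: str, comment: str = "#"):
        comment_bytes = comment.encode()
        with open(path, "rb") as fh:
            filtered = b"".join(
                line for line in fh if not line.lstrip().startswith(comment_bytes)
            )
        return pa_csv.read_csv(io.BytesIO(filtered))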

@neutrinoceros (Contributor) left a comment

This may be a bit too technical for a first round of review, but I wanted to get this kind of feedback in early too, so it doesn't grow into too much of a pain later. Here are a couple of suggestions and comments, mostly about type annotations and internal consistency.

[10 inline review comments on astropy/io/misc/pyarrow/csv.py, most marked outdated and resolved]

@taldcroft (Member, Author) left a comment

Thanks for the great comments! I think I've addressed them all, or at least responded. Sounds like I have agreement to keep going on this and start working on tests, docs, etc.?

[10 inline review comments on astropy/io/misc/pyarrow/csv.py, most marked outdated and resolved]


Contributor suggested a change on:

    def get_convert_options(
        include_names: list | None,

Suggested change:

    -    include_names: list | None,
    +    include_names: list[str] | None,

Contributor suggested a change on:

    def get_convert_options(
        include_names: list | None,
        dtypes: dict[str, "npt.DTypeLike"] | None,
        null_values: list | None,

Suggested change:

    -    null_values: list | None,
    +    null_values: list[str] | None,

Contributor suggested a change on:

    def get_read_options(
        header_start: int | None,
        data_start: int | None,
        names: list | None,

Suggested change:

    -    names: list | None,
    +    names: list[str] | None,

Contributor suggested a change on:

        include_names: list[str] | None = None,
        dtypes: dict[str, "npt.DTypeLike"] | None = None,
        comment: str | None = None,
        null_values: list | None = None,

Suggested change:

    -    null_values: list | None = None,
    +    null_values: list[str] | None = None,
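
Collecting the suggestions above, the annotated signatures would read roughly as follows; only the parameters visible in the snippets above are shown, and the bodies are elided.

    # Signatures with the suggested list[str] annotations applied. Parameters
    # not visible in the review snippets are omitted; bodies are elided.
    import numpy.typing as npt


    def get_convert_options(
        include_names: list[str] | None,
        dtypes: dict[str, npt.DTypeLike] | None,
        null_values: list[str] | None,
    ):
        ...


    def get_read_options(
        header_start: int | None,
        data_start: int | None,
        names: list[str] | None,
    ):
        ...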

Development: successfully merging this pull request may close the issue "Consider using pyarrow under the hood for fast ASCII reading".

5 participants