o Use numpy unique instead of polars #1150

Closed

Conversation

rajeeja
Contributor

@rajeeja rajeeja commented Feb 4, 2025

No description provided.

…ion needed. Also eliminate the need for left join
@rajeeja rajeeja requested a review from philipc2 February 4, 2025 17:21
@philipc2
Member

philipc2 commented Feb 4, 2025

@rajeeja

Polars is significantly faster than NumPy. I'm curious as to why we would want to switch back?

@rajeeja
Contributor Author

rajeeja commented Feb 4, 2025

> @rajeeja
>
> Polars is significantly faster than NumPy. I'm curious as to why we would want to switch back?

Agreed, Polars is multi-threaded. The NumPy version here avoids the conversion overhead of going NumPy arrays → Polars → NumPy multiple times, and there is no conversion between DataFrame representations either. Which one wins depends on the problem size we want to optimize for.
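For context, a minimal sketch of the pure-NumPy route being discussed, assuming the input is an array with one (lon, lat) row per pixel corner. The `corner_coords` array and its values are hypothetical and not taken from the PR; the point is only that `np.unique` with `return_inverse=True` yields both the unique nodes and the per-corner indices in one call, with no DataFrame round trip or join.

```python
import numpy as np

# Hypothetical input: one (lon, lat) row per pixel corner,
# with corners shared between pixels appearing more than once.
corner_coords = np.array(
    [
        [0.0, 0.0],
        [1.0, 0.0],
        [0.0, 0.0],
        [1.0, 1.0],
        [1.0, 0.0],
    ]
)

# Unique rows (sorted lexicographically) plus, for every input row,
# the index of its unique row.
unique_nodes, inverse = np.unique(corner_coords, axis=0, return_inverse=True)

# The shape of `inverse` for axis=0 has differed between NumPy releases,
# so flatten it to 1-D before using it as an index.
inverse = inverse.reshape(-1)

# Indexing the unique nodes with the inverse rebuilds the original rows.
assert np.array_equal(unique_nodes[inverse], corner_coords)
```

The trade-off raised in this thread still applies: this call sorts the rows in a single thread, so it may be slower than Polars on large inputs even though it skips the conversions.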

@rajeeja
Contributor Author

rajeeja commented Feb 4, 2025

This also looks cleaner, but we can go with the Polars version and accept some overhead for small/medium problems, depending on our target n_pix size.

@rajeeja
Contributor Author

rajeeja commented Feb 4, 2025

Running a benchmark of the two versions across a range of pixel counts would be good.
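A rough sketch of what such a benchmark could look like, not the benchmark used in this PR. The helper names (`make_coords`, `best_time`, and so on), the synthetic data, and the problem sizes are all assumptions for illustration. The Polars timing only runs if Polars is installed, and it covers `unique()` plus the NumPy → Polars conversion, not the join step.

```python
import time

import numpy as np

try:
    import polars as pl  # optional; only the NumPy timing runs without it
except ImportError:
    pl = None


def make_coords(n_points, n_distinct, seed=0):
    """Hypothetical test data: n_points (lon, lat) rows drawn from n_distinct nodes."""
    rng = np.random.default_rng(seed)
    nodes = rng.uniform(-180.0, 180.0, size=(n_distinct, 2))
    return nodes[rng.integers(0, n_distinct, size=n_points)]


def numpy_unique(coords):
    """Count unique rows with NumPy."""
    return np.unique(coords, axis=0).shape[0]


def polars_unique(coords):
    """Count unique rows with Polars, including the NumPy -> Polars conversion."""
    df = pl.DataFrame(
        {
            "lon": np.ascontiguousarray(coords[:, 0]),
            "lat": np.ascontiguousarray(coords[:, 1]),
        }
    )
    return df.unique().height


def best_time(fn, coords, repeats=3):
    """Return (result, best wall-clock seconds over `repeats` runs)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(coords)
        best = min(best, time.perf_counter() - start)
    return result, best


candidates = {"numpy": numpy_unique}
if pl is not None:
    candidates["polars"] = polars_unique

results = {}
for n_points in (10_000, 100_000):  # small sizes; scale up toward the target n_pix
    coords = make_coords(n_points, n_distinct=n_points // 4)
    for name, fn in candidates.items():
        n_unique, seconds = best_time(fn, coords)
        results[(name, n_points)] = (n_unique, seconds)
        print(f"{name:>6} n={n_points:>9,} unique={n_unique:>8,} {seconds:.4f}s")
```

Taking the best of a few repeats reduces noise from caching and thread start-up. Timings will vary by machine, so the sizes should be pushed up to the n_pix range that actually matters.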

@philipc2
Member

philipc2 commented Feb 4, 2025

From an email with @erogluorhan and @florianziemen last month:

I was able to significantly optimize the Numpy unique calls with Pandas; here are some figures:

For a resolution level of 10 (i.e. n_side==1024, n_pixels== ~12.6M):

NumPy unique took 1 min 34 secs on my machine, while the pandas equivalent (calculation of unique nodes and inverse indices combined) took only ~16.4 seconds.

With Polars, it's only a couple of seconds compared to Pandas.
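As a quick check of the figures above, here is a small sketch that assumes the usual HEALPix relations (n_side = 2^level and n_pix = 12 · n_side²). The function name is hypothetical; the timings are the ones quoted in this comment.

```python
def healpix_pixel_count(level):
    """Return (n_side, n_pix) for a resolution level, assuming HEALPix conventions."""
    n_side = 2**level
    return n_side, 12 * n_side**2


n_side, n_pixels = healpix_pixel_count(10)
print(n_side, n_pixels)  # 1024 12582912, i.e. about 12.6M pixels

# Timings quoted above: NumPy unique vs. the pandas equivalent.
numpy_seconds = 60 + 34
pandas_seconds = 16.4
speedup = numpy_seconds / pandas_seconds
print(f"pandas is roughly {speedup:.1f}x faster than NumPy unique here")
```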

@rajeeja
Contributor Author

rajeeja commented Feb 4, 2025

> From an email with @erogluorhan and @florianziemen last month:
>
> I was able to significantly optimize the Numpy unique calls with Pandas; here are some figures:
> For a resolution level of 10 (i.e. n_side==1024, n_pixels== ~12.6M):
> NumPy unique took 1 min 34 secs on my machine, while the pandas equivalent (calculation of unique nodes and inverse indices combined) took only ~16.4 seconds.
>
> With Polars, it's only a couple of seconds compared to Pandas.

Wow, I just tried it and saw the same thing you describe, for 10M pixel corner coordinates:

NumPy unique() time: 47.6 secs

Polars unique() + join time: 5.26 secs

@rajeeja rajeeja closed this Feb 4, 2025