base: linearbinning
Pipweights #1
Conversation
This looks great; PIP weights are obviously a common use case, so it's definitely worth considering supporting them natively.
I've only skimmed this, but I'm not entirely sure I follow the angular weights. I would naively expect this could just be implemented as a custom weighting function, without passing down new arrays from the Python wrappers. But is the idea that one needs to specify a function of costheta, and that it needs to be set at runtime?
On the 5% increase in runtime: is that with PIP weights or angular weights (or both)? Does it scale with the number of PIP weights?
Yes, one should be able to provide an arbitrary function of costheta, hence the tabulated values. In practice, we want to input DD_parent(costheta)/DD_pip(costheta), see e.g. eq. 9 of https://arxiv.org/pdf/2007.09005.pdf
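To make this concrete, here is a minimal sketch of a tabulated angular-weight lookup with linear interpolation in costheta; the struct and function names are illustrative, not the actual tab_weight_struct used in the branch.

```c
/* Hypothetical sketch (not the branch's actual tab_weight_struct):
 * the angular weight W(costheta) = DD_parent(costheta) / DD_pip(costheta)
 * is tabulated on a regular costheta grid and looked up by linear
 * interpolation at pair-counting time. */
typedef struct {
    double costheta_min;   /* lower edge of the tabulated range */
    double costheta_step;  /* spacing of the regular grid */
    int nbins;             /* number of tabulated values */
    const double *weights; /* tabulated DD_parent/DD_pip values */
} angular_weight_table;

static inline double angular_weight(const angular_weight_table *tab, double costheta)
{
    double x = (costheta - tab->costheta_min) / tab->costheta_step;
    if (x <= 0.0) return tab->weights[0];
    if (x >= (double)(tab->nbins - 1)) return tab->weights[tab->nbins - 1];
    int i = (int) x;
    double f = x - (double) i;
    return (1.0 - f) * tab->weights[i] + f * tab->weights[i + 1];
}
```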
The 5% increase in runtime was w.r.t. the previous implementation using weight_type = "pair_product". This represents the overhead due to the more flexible weight structure (an increased number of "if" statements and a larger memory footprint). The runtime scales with the number of PIP weights, indeed. Typically, for a box size of 1000 Mpc/h, linear binning up to 200 Mpc/h, and 1e5 objects:
That would still make it the fastest pair counter with PIP + angular weights on the market, I think ;)
4b91ef2 to 70797ec
Hi @lgarrison, @manodeep,
This is an interesting question. Most of the implementations online (e.g. https://github.com/WojciechMula/sse-popcount) are focused on the case where you have a long vector of ints and want to popcnt the whole thing, while this data is "transposed" such that you want the individual popcnt result of each 64-bit int in the vector.
I've only skimmed the benchmarks from that repo, but they suggest that it might be sufficient to just always use SSE; e.g. this implementation often seems faster than AVX for 64 bytes: https://github.com/WojciechMula/sse-popcount/blob/6feb3dba32c526b17de01e931c116900e3a23104/popcnt-cpu.cpp#L59
But also, these implementations are summing the results, which we don't want, so that may affect which method is fastest for us. For AVX-512 there is a dedicated per-element popcount instruction, but it is not available on all CPUs. So I would actually start by trying a scalar popcnt even in the vector kernels, like GCC's __builtin_popcount().
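As a quick sketch of the per-element (rather than summed) popcount pattern described here, assuming a plain scalar loop over 64-bit words; function and variable names are illustrative:

```c
#include <stdint.h>
#include <stddef.h>

/* Per-element popcount: one result per 64-bit word, not a single sum
 * over the whole buffer. With -mpopcnt (or -march=native), GCC/Clang
 * compile __builtin_popcountll to a single POPCNT instruction, which
 * is why a scalar loop is a reasonable baseline even inside the
 * otherwise vectorized kernels. */
static void popcount_each(const uint64_t *restrict in, int *restrict out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = __builtin_popcountll(in[i]);
    }
}
```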
Thank you very much for the feedback! I will start with the simple __builtin_popcount() from GCC, have everything else properly implemented, and we can come back to this later as you suggest if performance is lacking.
Right - forgot to comment here. @adematti Since I am not involved in the DESI stuff, I don't really know what you are attempting. Could you please outline some pseudo-code for the actual operation you are trying to code up?
In terms of merging with the main repo, it would depend on the complexity of the involved code. I have not gotten any request for a custom weighting scheme, but that's not to say that such a feature couldn't be broadly useful. However, you might get better mileage out of keeping this as a separate fork - that will potentially allow you to create a pair-counter optimised for your specific use-case.
On the hardware side, I am curious - how important is performance for this PIP pair-counter? And what are the typical particle numbers in the cells?
Thanks for the feedback @manodeep! For the PIP part I am simply trying to do:
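For reference, a hedged sketch of the standard pairwise-inverse-probability (PIP) pair weight along these lines; the exact convention in the branch may differ, and the names here are illustrative:

```c
#include <stdint.h>

/* Hedged sketch of the PIP pair weight: with nbits bitwise realizations
 * packed into nwords 64-bit words per galaxy, the pair weight is
 * nbits / popcount(b1 & b2), i.e. the inverse of the fraction of
 * realizations in which both galaxies are selected. Conventions differ
 * in detail (e.g. an extra bit for the fiducial catalog), so this is
 * illustrative only. */
static inline double pip_pair_weight(const uint64_t *b1, const uint64_t *b2,
                                     int nwords, int nbits)
{
    int nboth = 0;
    for (int w = 0; w < nwords; w++) {
        nboth += __builtin_popcountll(b1[w] & b2[w]);
    }
    /* Guard against a zero count; it should not occur for pairs of
     * objects that are both present in the observed catalog. */
    return (nboth > 0) ? (double) nbits / (double) nboth : 0.0;
}
```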
Attempt to implement PIP / angular weights in Corrfunc, for DESI usage.
We first focus on the theory DD fallback kernel.
PIP weights require passing integer (bitwise) weights. We require that these occupy the same number of bytes as the coordinate arrays (i.e. we request int32 arrays if the coordinate arrays are float32, and int64 if they are float64), which may help with SIMD operations.
This lets us keep most of the C code as is.
We add a field, num_integer_weights, to weight_struct to keep track of the number of integer bitwise weights; the remaining weights are treated as individual (floating-point) weights. A list of weight arrays should then be provided to the Python API.
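As a rough illustration of the layout described above (not the branch's actual weight_struct; apart from num_integer_weights the names are made up):

```c
#define MAX_NUM_WEIGHTS 10  /* illustrative cap on the number of weight arrays */

/* The first num_integer_weights entries hold bitwise (PIP) weights,
 * stored with the same byte width as the coordinates (int32 for
 * float32 positions, int64 for float64); the remaining entries are
 * ordinary floating-point "individual" weights. */
typedef struct {
    void *weights[MAX_NUM_WEIGHTS];
    int num_weights;          /* total number of weight arrays */
    int num_integer_weights;  /* leading entries that are bitwise weights */
} weight_struct_sketch;
```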
This all works fine with the fallback kernel, but I could not find an obvious counterpart in terms of SIMD instructions: a per-element popcount only seems to be provided by AVX-512... Any idea @lgarrison, @manodeep?
In addition, we need to apply angular weights, i.e. weights depending on the cosine of the angle between the two galaxies. I also drafted an implementation for those, tab_weight_struct; this requires a bit of plumbing, however. The good news is that, compared to the linearbinning branch, there is no real increase in computing time for weight_type = None, and about +5% for pair_product.
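For reference, the cosine entering these tabulated weights is just the normalized dot product of the two position vectors (observer at the origin); a minimal sketch, with illustrative names:

```c
#include <math.h>

/* costheta between the lines of sight to the two galaxies, with the
 * observer at the origin; this is the argument of the tabulated
 * angular weight sketched earlier in the thread. */
static inline double pair_costheta(double x1, double y1, double z1,
                                   double x2, double y2, double z2)
{
    const double dot = x1 * x2 + y1 * y2 + z1 * z2;
    const double norm = sqrt((x1 * x1 + y1 * y1 + z1 * z1) *
                             (x2 * x2 + y2 * y2 + z2 * z2));
    return dot / norm;
}
```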
@lgarrison, @manodeep, any opinion on all of this? These changes are somewhat specific, so I'd understand if you would prefer to keep them separate from the main repo/master.