add support for bm25 and tfidf #2567

Open
wants to merge 10 commits into base: branch-25.04
Conversation


@jperez999 jperez999 commented Feb 3, 2025

This PR adds support for TF-IDF and BM25 preprocessing of sparse matrices. It supports encoding values in raft device COO or CSR sparse matrices. The statistical recording (fit) phase is split from the transformation phase, which allows fitting data in batches and then transforming a target input. The class also supports exporting/loading, so you can fit in one place and transform in a separate environment. This builds on #2353
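
A rough sketch of the intended workflow, to make the fit/transform split concrete. SparseEncoder and fit() appear in this PR's diff; the constructor argument, the transform/save/load_encoder names, and the namespace are assumptions for illustration, not the final API:

raft::resources handle;
// fit document-frequency statistics over one or more COO/CSR batches
raft::sparse::matrix::SparseEncoder<float, int> encoder(num_features);
encoder.fit(handle, batch_coo);              // repeat per batch
encoder.save(handle, "encoder_state.txt");   // export the fitted statistics

// ...later, possibly in a different process or environment:
auto* loaded = raft::sparse::matrix::load_encoder<float, int>(handle, "encoder_state.txt");
loaded->transform(handle, target_coo, out_values, /*bm25=*/true);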

@jperez999 jperez999 requested review from a team as code owners February 3, 2025 21:41
@jperez999 jperez999 self-assigned this Feb 3, 2025
@jperez999 jperez999 requested a review from cjnolet February 3, 2025 21:42
@jperez999 jperez999 added the enhancement, feature request, and non-breaking labels Feb 3, 2025
saveFile << fullIdLen << " ";
// serialize_mdspan<IndexType>(handle, oss, featIdCount_md.view());
for (int i = 0; i < vocabSize; i++) {
saveFile << featIdCount[i] << " ";
Author
I could not use serialize_mdspan; it does not work well when reading from a file that has other information in it.

Author
It would probably be better to use a map here to save only the indexes/values that are non-zero. That would save time and space. Thoughts?

Member

I could not use serialize_mdspan; it does not work well when reading from a file that has other information in it.

How did serialize_mdspan not work well? We have code like https://github.com/rapidsai/cuvs/blob/49298b22956fdf4d3966825ae9aa41e1aa94975b/cpp/src/neighbors/detail/dataset_serialize.hpp#L79-L83 that serializes both scalar values and mdspan values.

One of the advantages of serialize_mdspan is that it's compatible with numpy. We're also using it for most of our other serialization, so it would be nice to be consistent.
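
For reference, a minimal sketch of that pattern applied here (assuming featIdCount_md is the device mdarray behind featIdCount and that the scalars are written to the same stream; illustrative only, not this PR's code):

#include <raft/core/serialize.hpp>

// write the scalar header fields and the count array through one ostream,
// so deserialization can read them back in the same order
raft::serialize_scalar(handle, os, fullIdLen);
raft::serialize_scalar(handle, os, vocabSize);
raft::serialize_mdspan(handle, os, featIdCount_md.view());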

cudaMemset(counts, 0, nnz * sizeof(IndexType));
_fit_feats(d_cols.data_handle(), counts, nnz, featIdCount);
cudaFree(counts);
cudaDeviceSynchronize();
Member

Do we need to synchronize the entire device here? If we need to synchronize, would syncing just the stream be sufficient?
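
For what it's worth, a minimal sketch of the stream-only alternative (assuming the work was enqueued on stream, i.e. the handle's stream):

// synchronize only the stream the kernels were launched on...
cudaStreamSynchronize(stream);
// ...or equivalently, through the raft handle:
raft::resource::sync_stream(handle);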

raft::sparse::op::coo_sort(
nnz, nnz, nnz, d_cols.data_handle(), d_rows.data_handle(), d_vals.data_handle(), stream);
IndexType* counts;
cudaMallocManaged(&counts, nnz * sizeof(IndexType));
Member

Can we use RMM managed_memory_resource instead here?
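
Something along these lines, as a sketch (variable names are illustrative):

#include <rmm/device_uvector.hpp>
#include <rmm/mr/device/managed_memory_resource.hpp>

// stream-ordered, RAII-managed allocation instead of raw cudaMallocManaged/cudaFree
rmm::mr::managed_memory_resource managed_mr;
rmm::device_uvector<IndexType> counts(nnz, stream, &managed_mr);
// pass counts.data() where the raw pointer was used; no explicit cudaFree needed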

* The array holding the feature(column) occurrence counts for all fitted inputs.
* @param[in] counts
* The array representing value changes in rows input.
* @param[in] out_values
Member

nitpick:

Suggested change
* @param[in] out_values
* @param[out] out_values

template <typename ValueType = float, typename IndexType = int>
class SparseEncoder {
private:
int* featIdCount;
Member

Should this be IndexType instead of int?

Suggested change
int* featIdCount;
IndexType* featIdCount;


* */
template <typename ValueType, typename IndexType>
void SparseEncoder<ValueType, IndexType>::fit(raft::resources& handle,
Member

nitpick: handle should be const

Suggested change
void SparseEncoder<ValueType, IndexType>::fit(raft::resources& handle,
void SparseEncoder<ValueType, IndexType>::fit(raft::resources const& handle,

void SparseEncoder<ValueType, IndexType>::_fit_feats(IndexType* cols,
IndexType* counts,
IndexType nnz,
IndexType* results)
Member

This function should accept a raft handle. Since it's launching CUDA kernels, it needs to use the handle's CUDA stream when launching them.
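
Roughly something like this (a sketch only; blockSize, num_blocks, and numFeats are taken from the surrounding PR code):

template <typename ValueType, typename IndexType>
void SparseEncoder<ValueType, IndexType>::_fit_feats(raft::resources const& handle,
                                                     IndexType* cols,
                                                     IndexType* counts,
                                                     IndexType nnz,
                                                     IndexType* results)
{
  // launch on the handle's stream rather than the default stream
  cudaStream_t stream = raft::resource::get_cuda_stream(handle);
  raft::sparse::matrix::detail::_scan<<<blockSize, num_blocks, 0, stream>>>(cols, nnz, counts);
  raft::sparse::matrix::detail::_fit_compute_occurs<<<blockSize, num_blocks, 0, stream>>>(
    cols, nnz, counts, results, numFeats);
}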

Comment on lines +232 to +234
raft::sparse::matrix::detail::_scan<<<blockSize, num_blocks>>>(cols, nnz, counts);
raft::sparse::matrix::detail::_fit_compute_occurs<<<blockSize, num_blocks>>>(
cols, nnz, counts, results, numFeats);
Member

I think we can probably simplify this.

This seems to be doing something like np.bincount: counting up the document frequencies of each term so that we can compute the idf for tf-idf/BM25.

I think we can avoid the _scan/_fit_compute_occurs kernels here entirely (and also remove the coo_sort that is performed previously) by using cub. For example, other code inside of raft uses cub and HistogramEven to perform the bincount operation (like

/**
 * @brief function to calculate the bincounts of number of samples in every label
 * @tparam DataT: type of the data samples
 * @tparam LabelT: type of the labels
 * @param labels: the pointer to the array containing labels for every data sample (1 x nRows)
 * @param binCountArray: pointer to the 1D array that contains the count of samples per cluster (1 x
 * nLabels)
 * @param nRows: number of data samples
 * @param nUniqueLabels: number of Labels
 * @param workspace: device buffer containing workspace memory
 * @param stream: the cuda stream where to launch this kernel
 */
template <typename DataT, typename LabelT>
void countLabels(const LabelT* labels,
                 DataT* binCountArray,
                 int nRows,
                 int nUniqueLabels,
                 rmm::device_uvector<char>& workspace,
                 cudaStream_t stream)
{
  int num_levels            = nUniqueLabels + 1;
  LabelT lower_level        = 0;
  LabelT upper_level        = nUniqueLabels;
  size_t temp_storage_bytes = 0;
  rmm::device_uvector<int> countArray(nUniqueLabels, stream);
  RAFT_CUDA_TRY(cub::DeviceHistogram::HistogramEven(nullptr,
                                                    temp_storage_bytes,
                                                    labels,
                                                    binCountArray,
                                                    num_levels,
                                                    lower_level,
                                                    upper_level,
                                                    nRows,
                                                    stream));
  workspace.resize(temp_storage_bytes, stream);
  RAFT_CUDA_TRY(cub::DeviceHistogram::HistogramEven(workspace.data(),
                                                    temp_storage_bytes,
                                                    labels,
                                                    binCountArray,
                                                    num_levels,
                                                    lower_level,
                                                    upper_level,
                                                    nRows,
                                                    stream));
}
etc)
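
Concretely, a sketch of the same idea adapted to this use case, counting how many times each column index appears (names are illustrative, not this PR's code):

template <typename IndexType>
void count_feature_occurrences(const IndexType* cols,   // column index of each nonzero
                               IndexType* featIdCount,  // [numFeats] output bincounts
                               IndexType nnz,
                               IndexType numFeats,
                               rmm::device_uvector<char>& workspace,
                               cudaStream_t stream)
{
  size_t temp_storage_bytes = 0;
  // first call sizes the temporary storage, second call computes the histogram
  RAFT_CUDA_TRY(cub::DeviceHistogram::HistogramEven(nullptr, temp_storage_bytes, cols,
                                                    featIdCount, numFeats + 1, IndexType{0},
                                                    numFeats, nnz, stream));
  workspace.resize(temp_storage_bytes, stream);
  RAFT_CUDA_TRY(cub::DeviceHistogram::HistogramEven(workspace.data(), temp_storage_bytes, cols,
                                                    featIdCount, numFeats + 1, IndexType{0},
                                                    numFeats, nnz, stream));
}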

Labels
3 - Ready for Review, CMake, cpp, enhancement, feature request, non-breaking
Projects
Status: In Progress
3 participants