add support for bm25 and tfidf #2567

Open
wants to merge 10 commits into base: branch-25.04
Conversation


@jperez999 jperez999 commented Feb 3, 2025

This PR adds support for TF-IDF and BM25 preprocessing of sparse matrices. It supports encoding values in raft device COO or CSR sparse matrices. The statistical recording (fit) phase is split from the transformation phase, which allows fitting data in batches and then transforming a target input. The class also supports exporting/loading, so you can fit in one place and transform in a separate environment. This builds on #2353
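
A rough sketch of the intended workflow, to make the fit/transform split concrete. SparseEncoder and fit() appear in this PR's diff; the constructor argument, the transform/save/load_encoder names, and the namespace are assumptions for illustration, not the final API:

raft::resources handle;
// fit document-frequency statistics over one or more COO/CSR batches
raft::sparse::matrix::SparseEncoder<float, int> encoder(num_features);
encoder.fit(handle, batch_coo);              // repeat per batch
encoder.save(handle, "encoder_state.txt");   // export the fitted statistics

// ...later, possibly in a different process or environment:
auto* loaded = raft::sparse::matrix::load_encoder<float, int>(handle, "encoder_state.txt");
loaded->transform(handle, target_coo, out_values, /*bm25=*/true);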

@jperez999 jperez999 requested review from a team as code owners February 3, 2025 21:41
@jperez999 jperez999 self-assigned this Feb 3, 2025
@jperez999 jperez999 requested a review from cjnolet February 3, 2025 21:42
@jperez999 jperez999 added the enhancement, feature request, and non-breaking labels Feb 3, 2025
saveFile << fullIdLen << " ";
// serialize_mdspan<IndexType>(handle, oss, featIdCount_md.view());
for (int i = 0; i < vocabSize; i++) {
saveFile << featIdCount[i] << " ";
Author
I could not use serialize_mdspan; it does not work well when reading from a file that has other information in it.

Author
It would probably be better to use a map here to save only the indexes/values that are non-zero. That would save time and space. Thoughts?

Member

I could not use serialize_mdspan; it does not work well when reading from a file that has other information in it.

How did serialize_mdspan not work well? We have code like https://github.com/rapidsai/cuvs/blob/49298b22956fdf4d3966825ae9aa41e1aa94975b/cpp/src/neighbors/detail/dataset_serialize.hpp#L79-L83 that serializes both scalar values and mdspan values.

One of the advantages of serialize_mdspan is that it's compatible with numpy. We're also using it for most of our other serialization, so it would be nice to be consistent.
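
For reference, a minimal sketch of that pattern applied here (assuming featIdCount_md is the device mdarray behind featIdCount and that the scalars are written to the same stream; illustrative only, not this PR's code):

#include <raft/core/serialize.hpp>

// write the scalar header fields and the count array through one ostream,
// so deserialization can read them back in the same order
raft::serialize_scalar(handle, os, fullIdLen);
raft::serialize_scalar(handle, os, vocabSize);
raft::serialize_mdspan(handle, os, featIdCount_md.view());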

cudaMemset(counts, 0, nnz * sizeof(IndexType));
_fit_feats(d_cols.data_handle(), counts, nnz, featIdCount);
cudaFree(counts);
cudaDeviceSynchronize();
Member

Do we need to synchronize the entire device here? If we need to synchronize, would syncing just the stream be sufficient?
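
For what it's worth, a minimal sketch of the stream-only alternative (assuming the work was enqueued on stream, i.e. the handle's stream):

// synchronize only the stream the kernels were launched on...
cudaStreamSynchronize(stream);
// ...or equivalently, through the raft handle:
raft::resource::sync_stream(handle);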

raft::sparse::op::coo_sort(
nnz, nnz, nnz, d_cols.data_handle(), d_rows.data_handle(), d_vals.data_handle(), stream);
IndexType* counts;
cudaMallocManaged(&counts, nnz * sizeof(IndexType));
Member

Can we use RMM managed_memory_resource instead here?
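
Something along these lines, as a sketch (variable names are illustrative):

#include <rmm/device_uvector.hpp>
#include <rmm/mr/device/managed_memory_resource.hpp>

// stream-ordered, RAII-managed allocation instead of raw cudaMallocManaged/cudaFree
rmm::mr::managed_memory_resource managed_mr;
rmm::device_uvector<IndexType> counts(nnz, stream, &managed_mr);
// pass counts.data() where the raw pointer was used; no explicit cudaFree needed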

* The array holding the feature(column) occurrence counts for all fitted inputs.
* @param[in] counts
* The array representing value changes in rows input.
* @param[in] out_values
Member

nitpick:

Suggested change
* @param[in] out_values
* @param[out] out_values

template <typename ValueType = float, typename IndexType = int>
class SparseEncoder {
private:
int* featIdCount;
Member

Should this be IndexType instead of int?

Suggested change
int* featIdCount;
IndexType* featIdCount;


* */
template <typename ValueType, typename IndexType>
void SparseEncoder<ValueType, IndexType>::fit(raft::resources& handle,
Member

nitpick: handle should be const

Suggested change
void SparseEncoder<ValueType, IndexType>::fit(raft::resources& handle,
void SparseEncoder<ValueType, IndexType>::fit(raft::resources const& handle,

void SparseEncoder<ValueType, IndexType>::_fit_feats(IndexType* cols,
IndexType* counts,
IndexType nnz,
IndexType* results)
Member

This function should accept a raft handle. Since it's launching CUDA kernels, it needs to use the handle's CUDA stream when launching them.
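
Roughly something like this (a sketch only; blockSize, num_blocks, and numFeats are taken from the surrounding PR code):

template <typename ValueType, typename IndexType>
void SparseEncoder<ValueType, IndexType>::_fit_feats(raft::resources const& handle,
                                                     IndexType* cols,
                                                     IndexType* counts,
                                                     IndexType nnz,
                                                     IndexType* results)
{
  // launch on the handle's stream rather than the default stream
  cudaStream_t stream = raft::resource::get_cuda_stream(handle);
  raft::sparse::matrix::detail::_scan<<<blockSize, num_blocks, 0, stream>>>(cols, nnz, counts);
  raft::sparse::matrix::detail::_fit_compute_occurs<<<blockSize, num_blocks, 0, stream>>>(
    cols, nnz, counts, results, numFeats);
}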

Comment on lines +232 to +234
raft::sparse::matrix::detail::_scan<<<blockSize, num_blocks>>>(cols, nnz, counts);
raft::sparse::matrix::detail::_fit_compute_occurs<<<blockSize, num_blocks>>>(
cols, nnz, counts, results, numFeats);
Member

I think we can probably simplify this.

This seems to be doing something like np.bincount: counting up the document frequencies of each term so that we can compute the idf for tf-idf/BM25.

I think we can avoid the _scan/_fit_compute_occurs kernels here entirely (and also remove the coo_sort that is performed previously) by using cub. For example, other code inside of raft uses cub and HistogramEven to perform the bincount operation (like

/**
 * @brief function to calculate the bincounts of number of samples in every label
 * @tparam DataT: type of the data samples
 * @tparam LabelT: type of the labels
 * @param labels: the pointer to the array containing labels for every data sample (1 x nRows)
 * @param binCountArray: pointer to the 1D array that contains the count of samples per cluster (1 x
 * nLabels)
 * @param nRows: number of data samples
 * @param nUniqueLabels: number of Labels
 * @param workspace: device buffer containing workspace memory
 * @param stream: the cuda stream where to launch this kernel
 */
template <typename DataT, typename LabelT>
void countLabels(const LabelT* labels,
                 DataT* binCountArray,
                 int nRows,
                 int nUniqueLabels,
                 rmm::device_uvector<char>& workspace,
                 cudaStream_t stream)
{
  int num_levels            = nUniqueLabels + 1;
  LabelT lower_level        = 0;
  LabelT upper_level        = nUniqueLabels;
  size_t temp_storage_bytes = 0;
  rmm::device_uvector<int> countArray(nUniqueLabels, stream);
  RAFT_CUDA_TRY(cub::DeviceHistogram::HistogramEven(nullptr,
                                                    temp_storage_bytes,
                                                    labels,
                                                    binCountArray,
                                                    num_levels,
                                                    lower_level,
                                                    upper_level,
                                                    nRows,
                                                    stream));
  workspace.resize(temp_storage_bytes, stream);
  RAFT_CUDA_TRY(cub::DeviceHistogram::HistogramEven(workspace.data(),
                                                    temp_storage_bytes,
                                                    labels,
                                                    binCountArray,
                                                    num_levels,
                                                    lower_level,
                                                    upper_level,
                                                    nRows,
                                                    stream));
}
etc)
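
Concretely, a sketch of the same idea adapted to this use case, counting how many times each column index appears (names are illustrative, not this PR's code):

template <typename IndexType>
void count_feature_occurrences(const IndexType* cols,   // column index of each nonzero
                               IndexType* featIdCount,  // [numFeats] output bincounts
                               IndexType nnz,
                               IndexType numFeats,
                               rmm::device_uvector<char>& workspace,
                               cudaStream_t stream)
{
  size_t temp_storage_bytes = 0;
  // first call sizes the temporary storage, second call computes the histogram
  RAFT_CUDA_TRY(cub::DeviceHistogram::HistogramEven(nullptr, temp_storage_bytes, cols,
                                                    featIdCount, numFeats + 1, IndexType{0},
                                                    numFeats, nnz, stream));
  workspace.resize(temp_storage_bytes, stream);
  RAFT_CUDA_TRY(cub::DeviceHistogram::HistogramEven(workspace.data(), temp_storage_bytes, cols,
                                                    featIdCount, numFeats + 1, IndexType{0},
                                                    numFeats, nnz, stream));
}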

Labels
3 - Ready for Review, CMake, cpp, enhancement, feature request, non-breaking
Projects
Status: In Progress
3 participants