add support for bm25 and tfidf #2567
base: branch-25.04
Conversation
```cpp
saveFile << fullIdLen << " ";
// serialize_mdspan<IndexType>(handle, oss, featIdCount_md.view());
for (int i = 0; i < vocabSize; i++) {
  saveFile << featIdCount[i] << " ";
```
I could not use serialize_mdspan; it does not work well when reading from a file that contains other information.
It would probably be better to use a map here to save only the indexes/values that are not zero. This would save time and space. Thoughts?
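As a minimal sketch of that idea (names here are illustrative, not the PR's API): instead of streaming the full dense `featIdCount` array, persist only the non-zero entries as (index, count) pairs behind a stored size, so sparse vocabularies cost space proportional to their non-zeros.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helpers: save/load only the non-zero feature counts as
// (index, count) pairs, prefixed by the number of stored pairs.
std::string save_sparse_counts(const std::map<int, int>& feat_counts)
{
  std::ostringstream os;
  os << feat_counts.size() << " ";  // number of stored pairs
  for (const auto& [idx, count] : feat_counts) {
    os << idx << " " << count << " ";
  }
  return os.str();
}

std::map<int, int> load_sparse_counts(std::istream& is)
{
  std::size_t n = 0;
  is >> n;
  std::map<int, int> out;
  for (std::size_t i = 0; i < n; ++i) {
    int idx = 0, count = 0;
    is >> idx >> count;
    out[idx] = count;
  }
  return out;
}
```

Because the pair count is written first, this also coexists cleanly with other data in the same stream: the reader consumes exactly the pairs it was told about and stops.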
> could not use the serialize_mdspan, it does not work well when reading from a file that has other information on it.
How did serialize_mdspan not work well? We have code like https://github.com/rapidsai/cuvs/blob/49298b22956fdf4d3966825ae9aa41e1aa94975b/cpp/src/neighbors/detail/dataset_serialize.hpp#L79-L83 that serializes both scalar values and mdspan values.
One of the advantages of serialize_mdspan is that it's compatible with numpy. We're also using it for most other serialization, so it would be nice to be consistent.
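The pattern in the linked dataset_serialize.hpp (scalars and arrays interleaved on one stream) can be mimicked generically; this sketch uses plain iostreams rather than the raft API, with hypothetical names, to show why sharing a stream works as long as reads mirror writes exactly:

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Write a scalar header followed by a raw array to the same stream.
void write_blob(std::ostream& os, std::int64_t len, const std::vector<float>& data)
{
  os.write(reinterpret_cast<const char*>(&len), sizeof(len));  // scalar first
  os.write(reinterpret_cast<const char*>(data.data()),         // then the array
           static_cast<std::streamsize>(data.size() * sizeof(float)));
}

// Read back in the same order: the scalar tells us how much array follows,
// so other data may safely follow in the same stream.
std::vector<float> read_blob(std::istream& is, std::int64_t& len)
{
  is.read(reinterpret_cast<char*>(&len), sizeof(len));
  std::vector<float> data(static_cast<std::size_t>(len));
  is.read(reinterpret_cast<char*>(data.data()),
          static_cast<std::streamsize>(data.size() * sizeof(float)));
  return data;
}
```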
```cpp
cudaMemset(counts, 0, nnz * sizeof(IndexType));
_fit_feats(d_cols.data_handle(), counts, nnz, featIdCount);
cudaFree(counts);
cudaDeviceSynchronize();
```
Do we need to synchronize the entire device here? If we need to synchronize at all, would just sync'ing the stream be sufficient?
```cpp
raft::sparse::op::coo_sort(
  nnz, nnz, nnz, d_cols.data_handle(), d_rows.data_handle(), d_vals.data_handle(), stream);
IndexType* counts;
cudaMallocManaged(&counts, nnz * sizeof(IndexType));
```
Can we use RMM managed_memory_resource instead here?
```cpp
 * The array holding the feature(column) occurrence counts for all fitted inputs.
 * @param[in] counts
 * The array representing value changes in rows input.
 * @param[in] out_values
```
nitpick:

```diff
- * @param[in] out_values
+ * @param[out] out_values
```
```cpp
template <typename ValueType = float, typename IndexType = int>
class SparseEncoder {
 private:
  int* featIdCount;
```
Should this be IndexType instead of int?

```diff
- int* featIdCount;
+ IndexType* featIdCount;
```
```cpp
 * */
template <typename ValueType, typename IndexType>
void SparseEncoder<ValueType, IndexType>::fit(raft::resources& handle,
```
nitpick: handle should be const

```diff
-void SparseEncoder<ValueType, IndexType>::fit(raft::resources& handle,
+void SparseEncoder<ValueType, IndexType>::fit(raft::resources const& handle,
```
```cpp
void SparseEncoder<ValueType, IndexType>::_fit_feats(IndexType* cols,
                                                     IndexType* counts,
                                                     IndexType nnz,
                                                     IndexType* results)
```
This function should accept a raft handle. Since it's launching cuda kernels, it needs to use the handle's cuda stream when launching them.
```cpp
raft::sparse::matrix::detail::_scan<<<blockSize, num_blocks>>>(cols, nnz, counts);
raft::sparse::matrix::detail::_fit_compute_occurs<<<blockSize, num_blocks>>>(
  cols, nnz, counts, results, numFeats);
```
I think we can probably simplify this. This seems to be doing something like np.bincount: counting up the document frequency of each term so that we can compute the idf for tf-idf/BM25. I think we can avoid the _scan/_fit_compute_occurs kernels here entirely (and also remove the requirement for the coo_sort performed previously) by using cub. For example, other code inside raft uses cub's HistogramEven to perform the bincount operation, like raft/cpp/include/raft/stats/detail/silhouette_score.cuh, lines 98 to 146 at 842afd7:
```cpp
/**
 * @brief function to calculate the bincounts of number of samples in every label
 * @tparam DataT: type of the data samples
 * @tparam LabelT: type of the labels
 * @param labels: the pointer to the array containing labels for every data sample (1 x nRows)
 * @param binCountArray: pointer to the 1D array that contains the count of samples per cluster (1 x
 * nLabels)
 * @param nRows: number of data samples
 * @param nUniqueLabels: number of Labels
 * @param workspace: device buffer containing workspace memory
 * @param stream: the cuda stream where to launch this kernel
 */
template <typename DataT, typename LabelT>
void countLabels(const LabelT* labels,
                 DataT* binCountArray,
                 int nRows,
                 int nUniqueLabels,
                 rmm::device_uvector<char>& workspace,
                 cudaStream_t stream)
{
  int num_levels            = nUniqueLabels + 1;
  LabelT lower_level        = 0;
  LabelT upper_level        = nUniqueLabels;
  size_t temp_storage_bytes = 0;
  rmm::device_uvector<int> countArray(nUniqueLabels, stream);
  RAFT_CUDA_TRY(cub::DeviceHistogram::HistogramEven(nullptr,
                                                    temp_storage_bytes,
                                                    labels,
                                                    binCountArray,
                                                    num_levels,
                                                    lower_level,
                                                    upper_level,
                                                    nRows,
                                                    stream));
  workspace.resize(temp_storage_bytes, stream);
  RAFT_CUDA_TRY(cub::DeviceHistogram::HistogramEven(workspace.data(),
                                                    temp_storage_bytes,
                                                    labels,
                                                    binCountArray,
                                                    num_levels,
                                                    lower_level,
                                                    upper_level,
                                                    nRows,
                                                    stream));
}
```
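To make the suggestion concrete: on the host, the document-frequency count that the histogram replaces is just a bincount over the COO column indices, with no sort needed. This is a CPU analogue only (the actual PR would use cub on device); names are illustrative.

```cpp
#include <vector>

// CPU analogue of the proposed cub::DeviceHistogram::HistogramEven call:
// count how many stored non-zeros fall in each feature column. With at
// most one entry per (document, term) pair, counts[c] is the document
// frequency of term c, which feeds the idf term of tf-idf / BM25.
std::vector<int> bincount(const std::vector<int>& cols, int num_feats)
{
  std::vector<int> counts(num_feats, 0);
  for (int c : cols) {
    counts[c]++;  // no coo_sort or prefix scan required
  }
  return counts;
}
```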
This PR adds support for tf-idf and BM25 preprocessing of sparse matrices. It supports encoding values in raft device COO or CSR sparse matrices. It splits the statistical recording (fit) phase from the transformation phase, which allows batch-fitting data and then transforming a target input. The class also supports exporting/loading, so you can fit in one place and transform in a separate environment. This builds on #2353.
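The fit/transform split described above can be sketched on the host as follows. This is a simplified illustration, not the PR's API: the class name, members, and the exact smoothed-idf formula are assumptions chosen for clarity.

```cpp
#include <cmath>
#include <vector>

// Hypothetical host-side sketch of the fit/transform split: fit_doc()
// accumulates per-feature document frequencies over one or more batches;
// transform() then reweights a stored value as tf * idf.
struct TfidfSketch {
  int num_docs = 0;
  std::vector<int> doc_freq;  // per-feature document frequency

  explicit TfidfSketch(int num_feats) : doc_freq(num_feats, 0) {}

  // cols holds the feature index of each stored non-zero in one document
  void fit_doc(const std::vector<int>& cols)
  {
    ++num_docs;
    std::vector<bool> seen(doc_freq.size(), false);
    for (int c : cols) {
      if (!seen[c]) {  // count each feature once per document
        seen[c] = true;
        ++doc_freq[c];
      }
    }
  }

  // Smoothed idf (illustrative formula): rarer terms get larger weights.
  double transform(int col, double tf) const
  {
    double idf = std::log((1.0 + num_docs) / (1.0 + doc_freq[col])) + 1.0;
    return tf * idf;
  }
};
```

Because fit only accumulates counts, it can be called repeatedly on batches before any transform, and the counts are exactly the state that export/load would need to persist.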