[WIP] add new operator LU factorization #1019

Chuancysun · 2024-04-30T11:28:40Z

Thanks for your contribution and we appreciate it a lot. 🚀🚀

1. Motivation

add floating point operator LU factorization

2. Modification

add implementation of floating point LU factorization

3. Test Report

3.1 Modification Details

3.1.1 Accuracy Acceptance Standard

For static threshold standard details, see: MLU-OPS™ Accuracy Acceptance Standard.

static threshold
- diff1
  - float32 mlu diff1 <= 1e-5
  - [*] float32 mlu diff1 <= 3e-3
  - float16 mlu diff1 <= 3e-3
- diff2
  - float32 mlu diff2 <= 1e-5
  - [* ] float32 mlu diff2 <= 3e-3
  - float16 mlu diff2 <= 3e-3
- diff3
  - mlu diff3 == 0
  - mlu diff3_1 == 0
  - mlu diff3_2 == 0
dynamic threshold
- diff1: mlu diff1 <= max(baseline diff1 * 10, static threshold)
- diff2: mlu diff2 <= max(baseline diff2 * 10, static threshold)
- diff3: mlu diff3 <= max(baseline diff3 * 10, static threshold)
  - float32, threshold = 1e-5
  - float16, threshold = 1e-3

3.1.2 Operator Scheme checklist

Supported hardware
- [* ] MLU370
- MLU590
Job types
- BLOCK
- UNION1
- UNION2
- UNION4
- [* ] The operator will dynamically select the most suitable task type, for example, UNION8

3.2 Accuracy Test

3.2.1 Accuracy Test

If you have checked the following items, please tick the relevant box.

3.3 Performance Test

Platform：MLU370

----------- case0 -----------
case0
[Op name ]: sgetrf
[Shape ]: input.shape=[256,256], output.shape=[256,256]
[Data type] ]: float32
[MLU Hardware Time ]: 6460 (us)
[MLU Interface Time ]: 15336.7 (us)
[MLU IO Efficiency ]: 0.00026419
[MLU Compute Efficiency ]: 9.90712e-06
[MLU Workspace Size ]: -1 (Bytes)
[MLU Kernel Name(s) ]: {}
[MLU TheoryOps ]: 65536 (Ops)
[MLU TheoryIOs ]: 524288 (Bytes)
[MLU ComputeForce ]: 1.024e+12 (op/s)
[MLU IoBandWidth ]: 307.2 (GB/s)
[GPU Hardware Time ]: -1 (us)
[GPU IO Efficiency ]: -1
[GPU Compute Efficiency ]: -1
[GPU Workspace Size ]: -1 (Bytes)
[Diffs]:
[output]
DIFF1: 1.798500e-04
DIFF2: 7.016698e-04
[^ OK ] ../../test/mlu_op_gtest/pb_gtest/src/zoo/sgetrf/test_case/case0.prototxt
[ OK ] sgetrf/TestSuite.mluOp/0 (36 ms)
[----------] 1 test from sgetrf/TestSuite (36 ms total)

[----------] Global test environment tear-down
[ SUMMARY ] Total 1 cases of 1 op(s).
ALL PASSED.
[==========] 1 test case from 1 test suite ran. (3727 ms total)
[ PASSED ] 1 test case.

3.4 Summary Analysis

Please give a brief overview here, if you want to note and summarize the content.

kernels/sgetrf/add_union1.mlu

PetrelYy · 2024-05-22T09:32:28Z

kernels/sgetrf/add_union1.mlu

+ * permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included


补充 proto，docs/bangc-docs/user_guide/9_operators/index.rst 算子说明
可参考 https://github.com/Cambricon/mlu-ops/pull/662/files#diff-7f0a558d8f985a4ebd89cd6674a4bf1a91549ddcc6e708a897f351cb2006f0e8

已添加算子实现方案

mlu_op.h

test/mlu_op_gtest/pb_gtest/src/zoo/sgetrf/sgetrf.cpp

kernels/sgetrf/add_union1.mlu

kernels/sgetrf/trsm_union1.mlu

docs/design_docs/sgetrf/sgetrf.md

ArtIntAI · 2024-05-31T02:55:34Z

kernels/sgetrf/sgetrf_mlu.cpp

+    nb = get_sgetrf_native_nb(m, n);
+
+    float *workspace;
+    cnrtMalloc((void **)&workspace, nb * nb * sizeof(float));


一般workspace是用户自己传进来的，算子是提供getWorkspace的接口来让用户去分配对应的空间，可以参考https://github.com/Cambricon/mlu-ops/blob/master/mlu_op.h#L3716

ArtIntAI · 2024-05-31T02:56:25Z

mlu_op.h

+  typedef enum
+  {
+    MLUOP_STATUS_SUCCESS = 0, /*!< The operation is successfully completed. */
+    MLUOP_STATUS_NOT_INITIALIZED = 1,


建议头文件回退其他修改，只增加自己这次提交的部分。

kernels/sgetrf2/mycnrtMemcpy2D_union1.mlu

mlu_op.h

test/mlu_op_gtest/pb_gtest/src/zoo/sgerf2/sgetrf.cpp

kernels/sgetrf2/myinverse_union1.mlu

kernels/sgetrf2/sgetrf_mlu.cpp

docs/design_docs/sgetrf/sgetrf.md

mlu_op.h

kernels/sgetrf2/sgetrf2.h

kernels/sgetrf2/mycnrtMemcpy2D_union1.mlu

ArtIntAI · 2024-08-14T07:18:52Z

kernels/sgetrf2/myinverse_union1.mlu

+            {
+                temp += mul_result[k];
+            }
+            temp = temp * -1.0 * diag_element;


这里标量计算太慢了，不能向量化调用bangc接口吗

这里是将一个向量中的所有元素中的值相加，文档没有找到能实现此功能的bang函数

reduce_sum或者sumpool可以试试

查看文档后发现这两个函数的功能跟算子的逻辑不符

kernels/sgetrf2/swap_union1.mlu

docs/user_guide/9_operators/index.rst

ArtIntAI · 2024-11-21T02:16:56Z

mlu_op.h

+ *
+ * @par Data Layout
+ * - The supported combinations of data types are shown below:
+ * - size_t(size)


这里data layout也标记成None吧，workspace接口不做额外说明

ArtIntAI · 2024-11-21T02:18:34Z

mlu_op.h

+ * - size_t(size)
+ *
+ * @par Scale Limitation
+ * - The dimension of input tensor must be either 2, 3 or 4.


这里的限制也标记成None吧

ArtIntAI · 2024-11-21T02:19:03Z

mlu_op.h

+ *
+ * @par Data Type
+ * - The supported combinations of data types are shown below:
+ *   - float(x) - float(y)


不也支持complex吗？

ArtIntAI · 2024-11-21T02:28:05Z

mlu_op.h

+ * Considering the size of the GDRAM, the space occupied by the input matrix should not exceed 7GB.
+ *
+ * @par API Dependency
+ * - The allocated extra workspace should be passed to ::mluOpSgetrf2 to perform the LU operation.


Before calling this function to perform ::mluOpRoiPointPool3d, you need to

get the size of workspace by ::mluOpGetRoiPointPool3dWorkspaceSize.
这里的API依赖描述类似这样，这里和workspace接口的一样了！

mlu_op.h

test/mlu_op_gtest/pb_gtest/src/zoo/sgetrf2/sgetrf.cpp

docs/design_docs/sgetrf/sgetrf.md

test/mlu_op_gtest/pb_gtest/src/zoo/sgetrf2/sgetrf.cpp

docs/user_guide/9_operators/index.rst

kernels/sgetrf2/sgetrf.cpp

ArtIntAI · 2024-11-22T02:40:17Z

kernels/sgetrf2/sgetrf2_native.cpp

+        }
+      }
+    }
+    k_dim.x = dim_x;


是这部分内容来抽取到policyFunc，参考下面的实现
void policyFuncBallQuery(const mluOpHandle_t &handle,
const mluOpTensorDescriptor_t &desc, cnrtDim3_t *k_dim,
cnrtFunctionType_t *k_type) {
size_t cluster_num = mluop::runtime::getClusterLimitCapability(handle);
VLOG(5) << "In current device, cluster_num:" << cluster_num;
size_t core_in_cluster = handle->core_num_per_cluster;
VLOG(5) << "In current device, core_in_cluster:" << core_in_cluster;

size_t total_data_num = desc->total_element_num;

// On a core, a lot of new_xyz data element can be stored; but only one data
// element can be processed at a time. So a cluster can only process four data
// element.
size_t needed_cluster_num =
(total_data_num + core_in_cluster - 1) / core_in_cluster;
*k_type = cnrtFuncTypeUnion1;
k_dim->x = core_in_cluster;
k_dim->y =
needed_cluster_num > cluster_num ? cluster_num : needed_cluster_num;
k_dim->z = 1;
}

AndyQiao0828 · 2024-11-22T02:37:27Z

docs/user_guide/9_operators/index.rst

@@ -793,3 +793,10 @@ mluOpLgamma

 - ``x`` 为输入张量。

+.. Sgetrf2::


.. Sgetrf2::>>>.. _sgetrf2:

AndyQiao0828 · 2024-11-22T02:38:10Z

docs/user_guide/9_operators/index.rst

@@ -793,3 +793,10 @@ mluOpLgamma

 - ``x`` 为输入张量。

+.. Sgetrf2::


这个新增内容在手册的更新历史章节也补充下哈

这个能具体解释一下吗

AndyQiao0828 · 2024-11-22T02:43:18Z

docs/user_guide/9_operators/index.rst

+
+mluOpSgetrf2
+---------------
+执行 LU 分解，将一个矩阵分解为一个下三角矩阵（L）和一个上三角矩阵（U），参数``mode``用来指定是否进行选主元操作。


参数mode用来》》参数 mode 用来

这个是什么意思？

就是 mode 跟周围文字前后各保留一个空格

AndyQiao0828 · 2024-11-22T02:47:59Z

docs/user_guide/9_operators/index.rst

+---------------
+执行 LU 分解，将一个矩阵分解为一个下三角矩阵（L）和一个上三角矩阵（U），参数``mode``用来指定是否进行选主元操作。
+
+该算子包含7个输入：handle 为操作句柄，x_desc 与 x 分别描述并提供输入矩阵的信息；两个输出：y_desc 与 y 分别描述并存储输出矩阵的信息；此外，还包含一个参数 mode，用于指定是否进行选主元，值为0表示选择非主元模式，ipiv表示置换矩阵，以及一个 workspace 用于临时存储计算过程中的数据。


有公式的话，需要补充下公式。

Suggested change

该算子包含7个输入：handle 为操作句柄，x_desc 与 x 分别描述并提供输入矩阵的信息；两个输出：y_desc 与 y 分别描述并存储输出矩阵的信息；此外，还包含一个参数 mode，用于指定是否进行选主元，值为0表示选择非主元模式，ipiv表示置换矩阵，以及一个 workspace 用于临时存储计算过程中的数据。

该算子包含7个输入：其中，``handle`` 为操作句柄，``x_desc`` 与 ``x`` 分别描述并提供输入矩阵的信息；两个输出：``y_desc`` 与 ``y`` 分别描述并存储输出矩阵的信息；此外，还包含一个参数 ``mode`` 用于指定是否进行选主元操作，值为0时，表示选择非主元模式，``ipiv`` 表示置换矩阵，以及 ``workspace`` 用于临时存储计算过程中的数据。

》》7个输入中提到了三个（handle，x_desc和x）？mode、ipiv和workspace也是输入吗？

AndyQiao0828 · 2024-11-22T02:49:26Z

mlu_op.h

@@ -14523,6 +14523,132 @@ mluOpLgamma(mluOpHandle_t handle,
            const mluOpTensorDescriptor_t y_desc,
            void *y);

+/*!


这几个接口属于哪个group呢？参照其他补充下哈：
// Group:Lgamma

AndyQiao0828 · 2024-11-22T03:03:52Z

mlu_op.h

+ * INTEGER array, dimension (m);
+ * The pivot indices; row i of the matrix was interchanged with row IPIV(i)


Suggested change

* INTEGER array, dimension (m);

* The pivot indices; row i of the matrix was interchanged with row IPIV(i)

* An integer array, dimension (m);

* The pivot indices; row i of the matrix was interchanged with row IPIV(i).

需要优化，说明ipiv的含义。写成完整的句子：这里dimension跟pivot indices是啥关系。

AndyQiao0828 · 2024-11-22T03:04:30Z

mlu_op.h

+ * INTEGER array, dimension (m);
+ * The pivot indices; row i of the matrix was interchanged with row IPIV(i)
+ *
+ * @param[out] info


在参数解释最前面先说明infor的含义

AndyQiao0828 · 2024-11-22T03:06:09Z

mlu_op.h

+ *                 to solve a system of equations.
+ *
+ * @param[in] mode
+ *   option to perform operation with pivoting/no pivoting versions


Suggested change

* option to perform operation with pivoting/no pivoting versions

* Option to perform the operation with pivoting/no pivoting versions

都有什么mode呢？

已在代码中修改

AndyQiao0828 · 2024-11-22T03:07:04Z

mlu_op.h

+ * - The data layout of y should be MLUOP_LAYOUT_ARRAY.
+ *
+ * @par Scale Limitation
+ * - The dimension of input tensor must be either 2, 3 or 4.


The dimension of input tensor must be 2, 3 or 4.

AndyQiao0828 · 2024-11-22T03:09:48Z

mlu_op.h

+ * @param[out] info
+ *     -     = 0:  successful exit
+ *     -     < 0:  if INFO = -i, the i-th argument had an illegal value
+ *                 or another error occured, such as memory allocation failed.


这是什么意思呢？

kernels/sgetrf2/sgetrf.cpp

shunshen93 · 2024-11-22T08:48:48Z

test/mlu_op_gtest/pb_gtest/src/zoo/sgetrf2/test_case/case0.prototxt

@@ -0,0 +1,41 @@
+op_name: "test_sgetrf2"
+input {
+  id: "input"


在590上测试精度超出阈值，且generator产生的case，590精度也会超出阈值

在590上测试精度超出阈值，且generator产生的case，590精度也会超出阈值

能否给出矩阵输入规模等信息及输出？

[22,29,206]，DTYPE_COMPLEX_FLOAT的case会超时

[22,29,206]，DTYPE_COMPLEX_FLOAT的case会超时

请问[22,29,206]对应矩阵规模的[batch,m,n]/[m,n,batch]吗？我分别测试了这两种情况都没有问题

是跑的complex类型吗？

已复现并修改代码

op_name: "test_xgetrf" input { id: "input" shape: { dims: 1 dims: 22 dims: 129 dims: 206 } layout: LAYOUT_ARRAY dtype: DTYPE_COMPLEX_FLOAT random_data: { seed: 25 upper_bound: 10.0 lower_bound: -10.0 distribution: UNIFORM } } output { id: "output" shape: { dims: 1 dims: 22 dims: 129 dims: 206 } layout: LAYOUT_ARRAY dtype: DTYPE_COMPLEX_FLOAT } output { id: "output2" shape { dims: 1 dims: 22 dims: 129 } layout: LAYOUT_ARRAY dtype: DTYPE_INT32 } xgetrf_param{ mode: 0 } test_param: { error_func: DIFF1 error_func: DIFF2 error_threshold: 0.003 error_threshold: 0.003 baseline_device: CPU }

测试下这个case

这个case已复现并修复，请问mlu端怎么关闭nan/inf的检查呢？包含nan/inf的case的测试标准是什么呢？

mlu需要支持nan/inf的检查。nan/inf的结果需要和竞品一致。如果结果不一样，需要解释原因，比如算法不一致或者其它原因

kernels/sgetrf2/sgetrf2.h

kernels/sgetrf2/scal_ger_union1.mlu

ArtIntAI · 2024-11-25T06:42:05Z

另外针对用户感知到的一些tensor信息，如下所列，支持的做下测试，不支持的可以参考下其他算子做好参数拦截
1.large tensor(tensor单个维度超过2G num, tensor的所有维度乘积超过2G num)
2. inplace，输入和输出tensor地址一致
3. stride，如果不支持做好参数检查报错
4. 广播，如果不支持做好参数检查报错
5. 输入和输出包含nan/inf时精度是否和GPU精度对齐
6. 输入tensor是0元素，某个维度是0

ArtIntAI · 2024-11-27T12:33:03Z

kernels/sgetrf2/sgetrf2_native.cpp

+
+  if (dtype == MLUOP_DTYPE_COMPLEX_FLOAT) {
+    if (batch > 1) {
+      k_type = CNRT_FUNC_TYPE_UNION8;


板卡上不一定有这个类型，建议参考这里进行设置：

mlu-ops/kernels/dynamic_point_to_voxel/dynamic_point_to_voxel_backward/dynamic_point_to_voxel_backward.cpp

Line 191 in 5ae8c94

*k_type = mluop::runtime::getJobLimitCapabilityCnrtFuncType(handle);

mlu-ops/kernels/fft/c2c_fft/c2c_fft_host.cpp

Line 1668 in 5ae8c94

int task_type = mluop::runtime::getJobLimitCapability(handle);

ArtIntAI · 2024-11-27T12:42:29Z

kernels/sgetrf2/sgetrf.cpp

+    transpose(handle, MLUOP_DTYPE_COMPLEX_FLOAT, batch, m, n, (float *)x,
+              (float *)y, handle->queue);
+  } else {
+    cnrtMemcpy((float *)y, (float *)x, batch * m * n * sizeof(float),


不建议使用cnrtMemcpy和cnrtMemset，cnrtQueueSync，会对上层使用mlu_graph有问题
建议cnrtMemcpy使用片上的__memcpy来替换
cnrtMemset使用片上设置数据来替换
cnrtQueueSync可以去掉，对于同一个queue来说，queue内的kernel调用（使用<<<>>>）是串行的

docs/design_docs/xgetrf/xgetrf.md

kernels/xgetrf/xgetrf_mlu.cpp

shunshen93 · 2025-01-03T06:05:34Z

test/mlu_op_gtest/pb_gtest/src/zoo/xgetrf/test_case/case0.prototxt

+output {
+  id: "output2"
+  shape {
+    dims: 1024


output2的shape为[1,1,1024]

output1表示LU，output2表示P，2个output都需要和对应的baseline进行对比。
此外还需要output3来验证还原性

shunshen93 · 2025-01-03T09:04:51Z

test/mlu_op_gtest/pb_gtest/src/zoo/xgetrf/xgetrf.cpp

+    cpu_fp32_output_[1][i] = pivots[i];
+  }
+
+  if (tensor_desc_[0].tensor->dtype == MLUOP_DTYPE_FLOAT) {


添加cpu测试逻辑

shunshen93 · 2025-01-06T02:24:31Z

test/mlu_op_gtest/pb_gtest/src/zoo/xgetrf/xgetrf.cpp

+}
+
+void XgetrfExecutor::cpuCompute() {
+  auto count = parser_->input(0)->shape_count;


lu_facto如下：
P, L, U = torch.linalg.lu(A)
LU, P = torch.linalg.lu_factor(A)
其中L， U 是LU的下三角和上三角

参考svd的测试方法，需要测试：

验证还原性：验证MLU的P * L * U 与 GPU 的P * L * U 的误差满足动态阈值（L, U由LU得到）

验证 MLU 的 LU 矩阵与 baseline 的 LU 矩阵误差满足动态阈值

验证MLU的 P 矩阵与baseline的 P 矩阵误差满足动态阈值

所以output需要存储P, LU, P*LU共3个矩阵
其中，动态阈值对比diff1,diff2,diff4

主要存在三类问题：
1.非主元情况下的精度问题偶尔会不满足动态阈值，复现后发现是因为非主元情况下可能会遇到主元很小，做除法后误差放大，非主元本身就是一种不稳定的算法；
2.选主元情况下的置换矩阵不一致，但最终的矩阵满足动态阈值；复现后发现是因为由于误差积累，当存在两个数非常接近，但在mlu中是数1大于数2而gpu中数2大于数1，但数1和数2仅相差小数点后两位，相对误差也满足动态阈值和静态阈值，所以验证置换矩阵本身是不合理的，应该只验证PLU后的矩阵。
3.nan和inf。torch中对nan和inf的行为与正常逻辑有些不一致，复现时发现有的case中的计算，比如inf/inf应该期望获得nan而torch返回的0，然后我们放到magma中去验证时发现获得的也是nan，但由于缺少内部实现不知道torch是否有特殊处理

生成1000个case，40个错误，出错概率挺高的。应该如何验证正确性呢？是否可以加上一些限制，使随机生成的case满足精度需求？

AndyQiao0828 · 2025-01-13T01:52:03Z

mlu_op.h

+// Group:Xgetrf
+/*!
+ * @brief Calculates the size of the workspace required for the LU decomposition and initializes a workspace pointer.
+ * This function must be called before performing LU decomposition using mluOpXgetrf.


mluOpXgetrf》》::mluOpXgetrf

AndyQiao0828 · 2025-01-13T02:05:41Z

mlu_op.h

+ *
+ * @par Data Type
+ * - The supported combinations of data types are shown below:
+ *   - size_t( size)


请确认数据类型这块儿是否正确~

AndyQiao0828 · 2025-01-13T02:07:23Z

mlu_op.h

+ * Pointer to the MLU memory that is used as an extra workspace for the
+ * ::mluOpXgetrf.


for the ::mluOpXgetrf. 》》for the ::mluOpXgetrf operation.

AndyQiao0828 · 2025-01-13T02:08:29Z

mlu_op.h

+ * @param[in, out] dipiv
+ * An array containing the pivot indices. The value dipiv[i] is the
+ * row index that was swapped with row i. The array is updated during
+ * the execution to reflect the new row indices after each swap.The dipiv


swap.The>>swap. The

AndyQiao0828 · 2025-01-13T02:08:59Z

mlu_op.h

+ * An array containing the pivot indices. The value dipiv[i] is the
+ * row index that was swapped with row i. The array is updated during
+ * the execution to reflect the new row indices after each swap.The dipiv
+ * array is used to track the row permutation (pivoting),and it is


(pivoting),and>>> (pivoting), and

AndyQiao0828 · 2025-01-13T02:21:14Z

mlu_op.h

+ *
+ * @param[out] info
+ * Error code indicating the validity of the input arguments
+ *     -     = 0:  successful exit


末尾句号

AndyQiao0828 · 2025-01-13T02:22:13Z

mlu_op.h

+ *     -     = 0:  successful exit
+ *     -     < 0:  an error occurred, with the value indicating the parameter number that was invalid.
+ *     -     > 0:  if INFO = i, U(i,i) is exactly zero. The factorization
+ *                 has been completed, but the factor U is exactly
+ *                 singular, and division by zero will occur if it is used
+ *                 to solve a system of equations.


句首字母大写。
Successfully

An error

If INFO

AndyQiao0828 · 2025-01-13T02:23:17Z

mlu_op.h

+ * option to perform the operation with pivoting/no pivoting versions
+ * -     = 0: perform the operation without pivoting.
+ * -     = 1: perform the operation with pivoting.


Option to perform the operation with pivoting/no pivoting versions:

= 0: Perform the operation without pivoting.

= 1: Perform the operation with pivoting.

AndyQiao0828 · 2025-01-13T02:23:48Z

mlu_op.h

+ * - The data layout of y should be MLUOP_LAYOUT_ARRAY.
+ *
+ * @par Scale Limitation
+ * - None


末尾句号

AndyQiao0828 · 2025-01-13T02:25:11Z

mlu_op.h

+ *
+ * @par API Dependency
+ * - Before calling this function to perform ::mluOpXgetrf, you need to get the size of workspace by
+ * ::mluOpGetXgetrfWorkspaceSize to perform the LU operation.


to perform the LU operation 这块删掉吧，会误以为是mluOpGetXgetrfWorkspaceSize 的功能

Chuancysun mentioned this pull request May 20, 2024

【新算子】- linalg.lu 算子开发 #1007

Open

ArtIntAI reviewed May 22, 2024

View reviewed changes

kernels/sgetrf/add_union1.mlu Outdated Show resolved Hide resolved

PetrelYy reviewed May 22, 2024

View reviewed changes

mlu_op.h Outdated Show resolved Hide resolved

PetrelYy reviewed May 22, 2024

View reviewed changes

test/mlu_op_gtest/pb_gtest/src/zoo/sgetrf/sgetrf.cpp Outdated Show resolved Hide resolved

PetrelYy reviewed May 22, 2024

View reviewed changes

kernels/sgetrf/add_union1.mlu Outdated Show resolved Hide resolved

ArtIntAI reviewed May 29, 2024

View reviewed changes

kernels/sgetrf/trsm_union1.mlu Outdated Show resolved Hide resolved

ArtIntAI reviewed May 29, 2024

View reviewed changes

kernels/sgetrf/trsm_union1.mlu Outdated Show resolved Hide resolved

ArtIntAI reviewed May 29, 2024

View reviewed changes

docs/design_docs/sgetrf/sgetrf.md Outdated Show resolved Hide resolved

ArtIntAI reviewed May 29, 2024

View reviewed changes

docs/design_docs/sgetrf/sgetrf.md Outdated Show resolved Hide resolved

ArtIntAI reviewed May 31, 2024

View reviewed changes

ArtIntAI reviewed Jul 23, 2024

View reviewed changes

kernels/sgetrf2/mycnrtMemcpy2D_union1.mlu Outdated Show resolved Hide resolved

ArtIntAI reviewed Jul 23, 2024

View reviewed changes

mlu_op.h Outdated Show resolved Hide resolved

ArtIntAI reviewed Jul 23, 2024

View reviewed changes

test/mlu_op_gtest/pb_gtest/src/zoo/sgerf2/sgetrf.cpp Outdated Show resolved Hide resolved

ArtIntAI reviewed Jul 23, 2024

View reviewed changes

kernels/sgetrf2/myinverse_union1.mlu Outdated Show resolved Hide resolved

ArtIntAI reviewed Jul 23, 2024

View reviewed changes

kernels/sgetrf2/myinverse_union1.mlu Outdated Show resolved Hide resolved

ArtIntAI reviewed Jul 23, 2024

View reviewed changes

kernels/sgetrf2/sgetrf_mlu.cpp Outdated Show resolved Hide resolved