GPU Matrices
BIDMat currently has GPU versions of FMat, IMat and SMat, which are respectively GMat, GIMat and GSMat. Most operators and matrix functions defined for CPU matrices also work on GPU matrices, and these operations are also defined on the Mat superclass. This allows generic code to be written once and run on either the CPU host or the GPU, with sparse or dense data.
For example, the inner loop of LDA (Latent Dirichlet Allocation) looks like this:

def uupdate(sdata:Mat, user:Mat, ipass:Int):Unit = {
  for (i <- 0 until opts.uiter) {
    val preds = DDS(mm, user, sdata)
    val dc = sdata.contents
    val pc = preds.contents
    max(opts.weps, pc, pc)
    pc ~ dc / pc
    val unew = user ∘ (mm * preds) + opts.alpha
    if (opts.exppsi) exppsi(unew, unew)
    user <-- unew
  }
}
The matrix sdata can be either sparse or dense, and CPU- or GPU-based. DDS returns the product of mm and user at the non-zeros of sdata, which for a dense sdata is just the full product.
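To illustrate this genericity, here is a minimal sketch (not part of BIDMat) of a helper written against the Mat superclass. It assumes the standard BIDMat imports and that sum and the elementwise ∘ operator dispatch on Mat, as in the LDA code above; the GMat(a) conversion used to build the GPU argument is described in the next paragraph.

import BIDMat.{Mat, FMat, GMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._

// Hypothetical helper: sum of squares of each column, written against Mat
// so the same code runs on CPU (FMat) or GPU (GMat) arguments.
def colSumSquares(a:Mat):Mat = {
  sum(a ∘ a, 1)                       // elementwise square, then sum down each column
}

val a = rand(1000, 10)                // CPU matrix
val cpuRes = colSumSquares(a)         // computed on the CPU, returns an FMat
val gpuRes = colSumSquares(GMat(a))   // same code, computed on the GPU, returns a GMat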
You can convert to GPU matrices with constructors for each type, e.g. GMat(a) constructs a GMat from an FMat source (and returns its argument if a is already a GMat). Similarly, GIMat(mi) and GSMat(s) construct GIMats and GSMats from IMat and SMat arguments respectively. GPU matrices have the same toString format as the corresponding CPU matrix type, so they look the same when returned from interactive commands, e.g.:
> val a = rand(4,6)
a: BIDMat.FMat =
   0.67636   0.15675   0.43748  0.081511   0.46293  0.097704
   0.31279   0.69485   0.91233   0.87120   0.12652   0.71330
   0.10547   0.88596   0.58793   0.90858   0.45308   0.45136
   0.83065   0.84817  0.080891  0.022294   0.73676   0.14168

> GMat(a)
res12: BIDMat.GMat =
   0.67636   0.15675   0.43748  0.081511   0.46293  0.097704
   0.31279   0.69485   0.91233   0.87120   0.12652   0.71330
   0.10547   0.88596   0.58793   0.90858   0.45308   0.45136
   0.83065   0.84817  0.080891  0.022294   0.73676   0.14168
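The other conversions follow the same pattern. A brief sketch (this assumes IMat(f) converts an FMat to integers, sparse(f) converts a dense FMat to an SMat, and FMat(g) copies a GMat back to the host, mirroring GMat(a) above):

val mi = IMat(10 * rand(4, 6))     // integer CPU matrix
val gi = GIMat(mi)                 // copy it to the GPU

val s = sparse(rand(4, 6))         // sparse CPU matrix (SMat) built from a dense one
val gs = GSMat(s)                  // sparse GPU copy

val back = FMat(GMat(rand(4, 6)))  // and back again: GPU to CPU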
You can access elements of GPU matrices with indices, e.g. a(0,0), but element access is normally only useful for debugging or interactive exploration. Pulling single elements across the CPU/GPU boundary is very expensive, and will normally nullify any benefit from computing on the GPU.
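A rough sketch of the difference (not a benchmark): the loop below crosses the bus once per element, while the GPU-side reduction moves only one value. It assumes single-element reads and the sum function behave on GMats as they do on FMats.

val g = GMat(rand(1000, 1000))

// Anti-pattern: every g(i, 0) copies a single value across the PCI bus.
var s = 0f
for (i <- 0 until g.nrows) {
  s += g(i, 0)
}

// Better: reduce on the GPU, then read back a single element.
val s2 = sum(g(?, 0))(0, 0)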
GIMats support block indexing. For integer (GIMat) matrices ii and jj, you can read a block of a GMat aa as

aa(ii,jj)

or set the contents of a block to a GMat RHS bb as:

aa(ii,jj) = bb

Block indexing can be combined with integer arguments for single row or column access:

aa(5,jj)

or

aa(ii,0) = bb(?,2)
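A short sketch putting these together (irow is BIDMat's constructor for an integer row vector, and ones builds an FMat of ones):

val aa = GMat(rand(8, 8))
val ii = GIMat(irow(0, 2, 4))    // GPU index matrix selecting rows 0, 2 and 4
val jj = GIMat(irow(1, 3))       // GPU index matrix selecting columns 1 and 3

val block = aa(ii, jj)           // 3x2 GMat read out of aa
aa(ii, jj) = GMat(ones(3, 2))    // write a GMat RHS back into the same block
aa(5, jj) = block(0, ?)          // integer row index combined with a GIMat column index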
Calculations on the GPU have to be implemented in GPU memory to avoid transit across the PCI bus. Operators, functions and block access on GPU matrices normally require all arguments to be in GPU memory. It would be possible to transfer data automatically between CPU and GPU for mixed-locus operations, but this is not implemented in the library, for a few reasons. First of all, it's not clear which location to prefer for mixed two-argument operations: the GPU is typically faster, but transiting the bus will remove the benefit much of the time, while the CPU has much more memory (and a garbage collector), so memory problems are much easier to avoid by computing on the CPU host. Secondly, storage needs to be allocated for the result of many operations, and the same trade-offs recur. Thirdly, when used with BIDMach, the Learner class normally sites calculations either entirely on the GPU or entirely on the CPU, so a mixed-locus operation usually indicates an error. By omitting implicit cast operators that move arguments between CPU and GPU, we force exceptions on mixed-locus operations, which helps identify and fix such problems earlier.
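In practice this means converting explicitly at the CPU/GPU boundary and keeping each computation on one device. A sketch (FMat(g) is assumed to copy a GMat back to the host, mirroring the GMat(a) conversion above):

val a = rand(4, 6)          // CPU matrix (FMat)
val g = GMat(rand(6, 3))    // GPU matrix (GMat)

// a * g                    // mixed-locus: throws an exception rather than copying implicitly
val prod = GMat(a) * g      // move a to the GPU explicitly and multiply there
val back = FMat(prod)       // pull the result back to the CPU only when it is needed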
Matrix caching applies to both CPU and GPU matrices, but it was developed specifically for GPUs, whose C++ runtime lacks memory management. Caching is described in detail in the next chapter, but there are a couple of subtleties in using it without "leaking" uncached matrices, which we review here.
When evaluating operations with constants like a+2, the cache key is built from the GUID of a and the value 2. This allows different constant expressions like a+3, a+4 etc. to be used without aliasing. But the consequence is that iterative calculations with a changing constant will not be cached. For example, in the moving-average expression:
alpha = 1f/n
newB = alpha * A + (1 - alpha) * B
alpha varies from one iteration to the next, so each iteration will produce a different cached copy. One solution is to use a 1x1 matrix container for alpha, whose locus (CPU or GPU) should match that of A and B. The value can be updated using the set method:

alpha.set(1f/n)

and then alpha * A and (1 - alpha) will both use the same cached container from one iteration to the next.
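Putting this together, here is a sketch of the moving-average update with a cached 1x1 container (wrapping ones(1,1) in GMat is one way to build it; the locus is chosen here to match GPU-resident A and B):

val A = GMat(rand(1000, 100))
val B = GMat(rand(1000, 100))
val alpha = GMat(ones(1, 1))     // 1x1 container with the same locus as A and B

for (n <- 1 to 10) {
  alpha.set(1f/n)                // update the value in place; alpha's GUID does not change
  val newB = alpha * A + (1 - alpha) * B   // both terms reuse the same cached containers each iteration
  B <-- newB
}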