GPU Matrices
BIDMat currently has GPU versions of FMat, IMat and SMat, which are respectively GMat, GIMat and GSMat. Most operators and matrix functions defined for CPU matrices also work on GPU matrices, and these operations are also defined on the Mat superclass. This allows generic code to be written once and run on either the CPU host or the GPU, with sparse or dense data.
For example, the inner loop of LDA (Latent Dirichlet Allocation) looks like this:

def uupdate(sdata:Mat, user:Mat, ipass:Int):Unit = {
  for (i <- 0 until opts.uiter) {
    val preds = DDS(mm, user, sdata)
    val dc = sdata.contents
    val pc = preds.contents
    max(opts.weps, pc, pc)
    pc ~ dc / pc
    val unew = user ∘ (mm * preds) + opts.alpha
    if (opts.exppsi) exppsi(unew, unew)
    user <-- unew
  }
}
The matrix sdata can be either sparse or dense, and CPU- or GPU-based. DDS returns the product of mm and user at the non-zeros of sdata, which for a dense sdata is just the full product.
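To illustrate this genericity, here is a minimal sketch (not part of BIDMat) of a helper written against the Mat superclass. It assumes the standard BIDMat imports and that sum and the elementwise ∘ operator dispatch on Mat, as in the LDA code above; the GMat(a) conversion used to build the GPU argument is described in the next paragraph.

import BIDMat.{Mat, FMat, GMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._

// Hypothetical helper: sum of squares of each column, written against Mat
// so the same code runs on CPU (FMat) or GPU (GMat) arguments.
def colSumSquares(a:Mat):Mat = {
  sum(a ∘ a, 1)                       // elementwise square, then sum down each column
}

val a = rand(1000, 10)                // CPU matrix
val cpuRes = colSumSquares(a)         // computed on the CPU, returns an FMat
val gpuRes = colSumSquares(GMat(a))   // same code, computed on the GPU, returns a GMat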
You can convert to GPU matrices with constructors for each type, e.g. GMat(a) constructs a GMat from an FMat source (and returns its argument if a is already a GMat). Similarly, GIMat(mi) and GSMat(s) construct GIMats and GSMats from IMat and SMat arguments respectively. GPU matrices have the same toString format as the corresponding CPU matrix type, so they look the same when returned from interactive commands, e.g.:
> val a = rand(4,6)
a: BIDMat.FMat =
   0.67636   0.15675   0.43748  0.081511   0.46293  0.097704
   0.31279   0.69485   0.91233   0.87120   0.12652   0.71330
   0.10547   0.88596   0.58793   0.90858   0.45308   0.45136
   0.83065   0.84817  0.080891  0.022294   0.73676   0.14168

> GMat(a)
res12: BIDMat.GMat =
   0.67636   0.15675   0.43748  0.081511   0.46293  0.097704
   0.31279   0.69485   0.91233   0.87120   0.12652   0.71330
   0.10547   0.88596   0.58793   0.90858   0.45308   0.45136
   0.83065   0.84817  0.080891  0.022294   0.73676   0.14168
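The other conversions follow the same pattern. A brief sketch (this assumes IMat(f) converts an FMat to integers, sparse(f) converts a dense FMat to an SMat, and FMat(g) copies a GMat back to the host, mirroring GMat(a) above):

val mi = IMat(10 * rand(4, 6))     // integer CPU matrix
val gi = GIMat(mi)                 // copy it to the GPU

val s = sparse(rand(4, 6))         // sparse CPU matrix (SMat) built from a dense one
val gs = GSMat(s)                  // sparse GPU copy

val back = FMat(GMat(rand(4, 6)))  // and back again: GPU to CPU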
You can access elements of GPU matrices with indices, e.g. a(0,0), but element access is normally only useful for debugging or interactive exploration. Pulling single elements across the CPU/GPU boundary is very expensive, and will normally nullify any benefit from computing on the GPU.
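A rough sketch of the difference (not a benchmark): the loop below crosses the bus once per element, while the GPU-side reduction moves only one value. It assumes single-element reads and the sum function behave on GMats as they do on FMats.

val g = GMat(rand(1000, 1000))

// Anti-pattern: every g(i, 0) copies a single value across the PCI bus.
var s = 0f
for (i <- 0 until g.nrows) {
  s += g(i, 0)
}

// Better: reduce on the GPU, then read back a single element.
val s2 = sum(g(?, 0))(0, 0)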
GIMats support block indexing. For integer (GIMat) matrices ii and jj, you can read a block of a GMat aa as

aa(ii,jj)

or set the contents of a block to a GMat RHS bb as:

aa(ii,jj) = bb

Block indexing can be combined with integer arguments for single row or column access:

aa(5,jj)

or

aa(ii,0) = bb(?,2)
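A short sketch putting these together (irow is BIDMat's constructor for an integer row vector, and ones builds an FMat of ones):

val aa = GMat(rand(8, 8))
val ii = GIMat(irow(0, 2, 4))    // GPU index matrix selecting rows 0, 2 and 4
val jj = GIMat(irow(1, 3))       // GPU index matrix selecting columns 1 and 3

val block = aa(ii, jj)           // 3x2 GMat read out of aa
aa(ii, jj) = GMat(ones(3, 2))    // write a GMat RHS back into the same block
aa(5, jj) = block(0, ?)          // integer row index combined with a GIMat column index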
Calculations on the GPU have to be implemented in GPU memory to avoid transit across the PCI bus. Operators, functions and block access on GPU matrices normally require all arguments to be in GPU memory. It would be possible to transfer data automatically between CPU and GPU for mixed-locus operations, but this is not implemented in the library, for a few reasons. First of all, it's not clear which location to prefer for mixed two-argument operations: the GPU is typically faster, but transiting the bus will remove the benefit much of the time, while the CPU has much more memory (and a garbage collector), so memory problems are much easier to avoid by computing on the CPU host. Secondly, storage needs to be allocated for the result of many operations, and the same trade-offs recur. Thirdly, when used with BIDMach, the Learner class normally sites calculations either entirely on the GPU or entirely on the CPU, so a mixed-locus operation usually indicates an error. By omitting implicit cast operators that move arguments between CPU and GPU, we force exceptions on mixed-locus operations, which helps identify and fix such problems earlier.
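In practice this means converting explicitly at the CPU/GPU boundary and keeping each computation on one device. A sketch (FMat(g) is assumed to copy a GMat back to the host, mirroring the GMat(a) conversion above):

val a = rand(4, 6)          // CPU matrix (FMat)
val g = GMat(rand(6, 3))    // GPU matrix (GMat)

// a * g                    // mixed-locus: throws an exception rather than copying implicitly
val prod = GMat(a) * g      // move a to the GPU explicitly and multiply there
val back = FMat(prod)       // pull the result back to the CPU only when it is needed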
Matrix caching applies to both CPU and GPU matrices, but it was developed specifically for GPUs, whose C++ runtime lacks memory management. Caching is described in detail in the next chapter, but there are a couple of subtleties in using it without "leaking" uncached matrices, which we review here.
When evaluating operations with constants like a+2, the cache key is built from the GUID of a and the value 2. This allows different constant expressions like a+3, a+4 etc. to be used without aliasing. But the consequence is that iterative calculations with a changing constant will not be cached. For example, in the moving-average expression:
alpha = 1f/n
newB = alpha * A + (1 - alpha) * B
alpha varies from one iteration to the next, so each iteration will produce a different cached copy. One solution is to use a 1x1 matrix container for alpha, whose locus (CPU or GPU) should match that of A and B. The value can be updated using the set method:

alpha.set(1f/n)

and then alpha * A and (1 - alpha) will both use the same cached container from one iteration to the next.
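Putting this together, here is a sketch of the moving-average update with a cached 1x1 container (wrapping ones(1,1) in GMat is one way to build it; the locus is chosen here to match GPU-resident A and B):

val A = GMat(rand(1000, 100))
val B = GMat(rand(1000, 100))
val alpha = GMat(ones(1, 1))     // 1x1 container with the same locus as A and B

for (n <- 1 to 10) {
  alpha.set(1f/n)                // update the value in place; alpha's GUID does not change
  val newB = alpha * A + (1 - alpha) * B   // both terms reuse the same cached containers each iteration
  B <-- newB
}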