
Loading and saving matrices

jcanny edited this page May 26, 2014 · 27 revisions

Matlab and HDF5 Files

BIDMat supports two main file types: a Matlab-compatible HDF5 format, and a simple custom binary format with optional gzip or lz4 compression. The first format makes it easy to exchange data with Matlab, Scipy and many other tools. The second format is generally much faster.

Reading Mat Files

Matrices in Matlab-compatible HDF5 format can be read with commands like this:

scala> val a:IMat = load("d:\\data\\sentiment\\data1.mat","tokens")

scala> val b:SMat = load("d:\\data\\sentiment\\data2.mat","trigrams")

The load command takes a filename argument, followed by the name of a variable in the file. Assuming the data were created by Matlab (with the "-v7.3" option to save), the variable name is the name of the object saved in Matlab.

Note that each variable declaration includes a matrix type. This is important. The load function can return FMat, DMat, IMat, SMat, SDMat, CMat, CSMat or String objects, and its actual return type is AnyRef. Providing a type declaration for the assigned value or variable tells the compiler exactly what type to expect, and allows the variable to be bound to the correct type. Note that CSMat is similar to Matlab's "cell matrix" and its elements may be of any of the above types. Most commonly, though, a CSMat will hold string data.
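To illustrate why the declared type matters, here is a minimal, self-contained sketch of one way a Scala function can take its result type from the declaration at the call site. It is not BIDMat's implementation: LoadSketch, its in-memory store, and the single-argument load are hypothetical stand-ins for a file and its variables.

```scala
object LoadSketch {
  // Hypothetical stand-in for the contents of a file:
  // variable names mapped to values whose static type is only AnyRef.
  private val store: Map[String, AnyRef] = Map(
    "tokens" -> Array(1, 2, 3),
    "label"  -> "sentiment"
  )

  // T is not given explicitly; the compiler infers it from the type
  // declared on the value that receives the result.
  def load[T](varName: String): T = store(varName).asInstanceOf[T]

  val tokens: Array[Int] = load("tokens") // T inferred as Array[Int]
  val label: String      = load("label")  // T inferred as String
}
```

Without the declaration on the left-hand side, the compiler has nothing to infer T from, so the value cannot be bound to a useful type.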

The underlying representation is HDF5, a widely used format for storing matrices of scientific data, and the format now used by Matlab. Matlab's version of this format is prefaced by a 512-byte header. That is the only difference between Matlab's HDF5 files and non-Matlab HDF5 files. Without the header, though, Matlab will not read a data file. It will also complain if certain metadata on each array are missing. So it's best to use a save function that is compatible with Matlab.

You can also load and save non-Matlab compatible HDF5 files using saveAsHDF5(fname,varname). The file contents are the same as saveAs(fname,varname) but the Matlab-compatible header is not created. You can read these files using the same load(fname,varname) command.

Saving Data to Files

Saving variables to a file is straightforward:

scala> saveAs("d:\\data\\sentiment\\data1.mat", a, "tokens", b, "trigrams")

You can save an arbitrary number of variables to a file. The first argument to saveAs is a filename, and the remaining arguments form an alternating list of variables from the environment and their String names. The effect is that variable a is saved as "tokens", b is saved as "trigrams", and so on. In fact, a and b don't have to be references to matrices; they can be any expressions that return the appropriate matrix types.
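The alternating value/name convention can be sketched as follows. This is only an illustration of how such an argument list might be paired up, not BIDMat's code; SaveArgsSketch and pairUp are hypothetical names.

```scala
object SaveArgsSketch {
  // Split an alternating (value, name, value, name, ...) list into
  // (name, value) pairs, rejecting malformed argument lists.
  def pairUp(args: AnyRef*): List[(String, AnyRef)] = {
    require(args.length % 2 == 0, "expected alternating values and names")
    args.grouped(2).map { pair =>
      pair(1) match {
        case name: String => (name, pair(0))
        case other =>
          throw new IllegalArgumentException(s"expected a String name, got $other")
      }
    }.toList
  }
}
```

For example, SaveArgsSketch.pairUp(a, "tokens", b, "trigrams") would yield the pairs ("tokens", a) and ("trigrams", b).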

You can load this data directly into Matlab with the load command (which doesn't need the "-v7.3" option). It will create variables named "tokens" and "trigrams" that are, respectively, a dense matrix of int32 and a sparse matrix of double.

Limitations

Not all Matlab types are supported. Currently there are dense matrices of double, float and int32, and sparse double matrices (you can also save and load sparse matrices with float coefficients, which do not exist in Matlab). String data are stored as uint16, which matches well with the internal formats of Matlab and Java/Scala, and will be read by Matlab as strings. A CSMat of string data will be read by Matlab as a cell array of strings.

Unfortunately, cell arrays of strings are very inefficient in HDF5. Matlab really only has cell string arrays to handle variable-length strings. As in Matlab, the contents of each cell are stored as a separate array. In HDF5, compression only happens within a given array (i.e. within one string), so arrays of short strings, like dictionaries, cannot be compressed at all. It would be better to use another format, e.g. a sparse array of uint16, that could hold variable-length strings for I/O and be converted to a cell string array for manipulation.

BIDMat Files

BIDMat includes a simple binary file format for high-speed load/save of compressed or uncompressed data. Each file holds a single matrix of a particular type. We recommend expressing the matrix type in the file name, although it can also be read from a header in the file. To save an FMat a you can do:

> saveFMat("/data/mymat.fmat", a)      or
> saveFMat("/data/mymat.fmat.gz", a)   or
> saveFMat("/data/mymat.fmat.lz4", a)

Each command saves the matrix in BIDMat binary format. The first command stores the matrix uncompressed. The second two commands save with gzip or lz4 compression respectively. The compression type in those cases is determined by the file extension. To load the data from these files you use corresponding load commands:

> val x = loadFMat("/data/mymat.fmat")      or
> val x = loadFMat("/data/mymat.fmat.gz")   or
> val x = loadFMat("/data/mymat.fmat.lz4")
In each case x will have type FMat, and the correct decompression method will be inferred from the file name.
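The filename-based choice of compression described above can be sketched like this. It is a minimal illustration that assumes only the ".gz" and ".lz4" suffixes matter; CompressionSketch and fromFileName are hypothetical names, not part of BIDMat.

```scala
object CompressionSketch {
  sealed trait Compression
  case object Uncompressed extends Compression
  case object Gzip extends Compression
  case object Lz4 extends Compression

  // Pick the compression method from the file name's suffix;
  // any other suffix is treated as uncompressed.
  def fromFileName(fname: String): Compression =
    if (fname.endsWith(".gz")) Gzip
    else if (fname.endsWith(".lz4")) Lz4
    else Uncompressed
}
```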

BIDMat File Compression

LZ4 compression is typically 5-20 times faster than low-compression gzip. File sizes are larger, but the faster load/save times are a big advantage in most applications we have looked at. The default gzip compression level is 3, which also favors faster compression for somewhat higher file sizes.

It's possible to override the default compression (e.g. to save with a filename that lacks ".gz" or ".lz4") using an optional third argument:

> saveFMat("/data/mymat.fmat", a, compress)
where compress is an Int: compress=2 does gzip compression and compress=3 does lz4 compression. With gzip, you can further tailor the compression level from level 1 (faster, lower compression) to level 9 (slower, higher compression) using the global variable:
> Mat.compressionLevel
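To get a feel for what the level setting trades off, the sketch below measures compressed size at a given level using the JDK's java.util.zip.Deflater, which implements the DEFLATE algorithm that gzip is built on. It does not go through BIDMat's save path; LevelSketch and deflatedSize are hypothetical names.

```scala
import java.util.zip.Deflater

object LevelSketch {
  // Compress `data` at the given level (1 = fastest, 9 = smallest)
  // and return the number of compressed bytes produced.
  def deflatedSize(data: Array[Byte], level: Int): Int = {
    val deflater = new Deflater(level)
    deflater.setInput(data)
    deflater.finish()
    val buf = new Array[Byte](8192)
    var total = 0
    while (!deflater.finished()) total += deflater.deflate(buf)
    deflater.end() // release native resources
    total
  }
}
```

Calling deflatedSize on a sample of your own data at levels 1, 3 and 9 gives a rough idea of how much extra compression the slower levels buy.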

BIDMat File Format with LZ4 Compression

BIDMat files include a 4-word (16 byte) binary header. The first word specifies the matrix type. It has the form:

WXYZ00ABC (decimal digits)
WXYZ = version number (currently zero)
A = matrix type: 1 (dense), 2 (sparse), 3 (sparse, norows), 4 (3-tensor), 5 (4-tensor), 6 (5-tensor)
B = data type: 0 (byte), 1 (int), 2 (long), 3 (float), 4 (double), 5 (complex float), 6 (complex double)
C = index type (sparse matrices only): 1 (int), 2 (long)

The next 3 words are respectively:

nrows (Int)
ncols (Int)
nnz   (Int) for sparse matrices, zero for dense matrices
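The digit layout of the first header word can be decoded with integer division and remainder. This is a sketch based only on the layout listed above; HeaderSketch, TypeWord and decode are hypothetical names, and reading the word from the file (including its byte order) is left out.

```scala
object HeaderSketch {
  final case class TypeWord(version: Int, matrixType: Int, dataType: Int, indexType: Int)

  // Decode WXYZ00ABC: C is the units digit, B the tens digit,
  // A the hundreds digit, and WXYZ everything above the two zero digits.
  def decode(word: Int): TypeWord = TypeWord(
    version    = word / 100000,
    matrixType = (word / 100) % 10,
    dataType   = (word / 10) % 10,
    indexType  = word % 10
  )
}
```

Under this layout, a word of 231 decodes to version 0, matrix type 2 (sparse), data type 3 (float) and index type 1 (int).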