Loading and saving matrices
BIDMat supports two main file types: a Matlab-compatible HDF5 format, and a simple custom binary format with optional gzip or lz4 compression. The first format makes it easy to exchange data with Matlab, Scipy and many other tools. The second format is generally much faster.
Matrices in Matlab-compatible HDF5 format can be read with commands like this:
scala> val a:IMat = load("d:\\data\\sentiment\\data1.mat","tokens")
scala> val b:SMat = load("d:\\data\\sentiment\\data2.mat","trigrams")
The load command takes a filename argument, followed by the name of a variable in the file. Assuming the data were created by Matlab (with the "-v7.3" option to save), the variable name is the name of the object saved in Matlab.
Note that each variable declaration includes a matrix type. This is important. The load function can return FMat, DMat, IMat, SMat, SDMat, CMat, CSMat or String objects, and its actual return type is AnyRef. Providing a type declaration for the assigned value or variable tells the compiler exactly what type to expect, and allows the variable to be bound to the correct type. Note that CSMat is similar to Matlab's "cell matrix" and its elements may be of any of the above types. Most commonly though, the CSMat will hold string data.
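As a minimal sketch of why the type declaration matters, the self-contained Scala below uses a hypothetical stand-in for load: a generic method over an in-memory map. It is an assumption for illustration and not necessarily how BIDMat implements load. It shows how the type declared at the call site determines what an untyped value gets bound to.

```scala
object LoadSketch {
  // Hypothetical stand-in for a data file: variable names mapped to untyped values.
  private val fakeFile: Map[String, AnyRef] = Map(
    "tokens" -> Array[Int](1, 2, 3),
    "label"  -> "positive"
  )

  // T is inferred from the type declared at the call site.
  // The stored value is only known to be an AnyRef until it is cast.
  def load[T](varname: String): T = fakeFile(varname).asInstanceOf[T]

  def main(args: Array[String]): Unit = {
    val tokens: Array[Int] = load("tokens") // T inferred as Array[Int]
    val label: String = load("label")       // T inferred as String
    println(tokens.sum)                     // prints 6
    println(label)                          // prints positive
  }
}
```

In this sketch, a declaration that does not match the stored value (for example `val x: String = load("tokens")`) compiles but fails at run time with a ClassCastException, so the declared type has to match what was saved.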
The underlying representation is HDF5, a widely-used format for storing matrices of scientific data, and the format now used by Matlab. Matlab's version of this format is prefaced by a 512-byte header. That is the only difference between Matlab's HDF5 files and non-Matlab HDF5 files. Without the header though, Matlab will not read a data file. It will also complain if certain metadata on each array are missing. So it's best to use a save function that is compatible with Matlab.
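The effect of the 512-byte header can be checked from the leading bytes of a file. The sketch below is not part of BIDMat, and the object and method names are hypothetical. It looks for the 8-byte HDF5 format signature (taken from the HDF5 file format specification) at offset 0, as in a plain HDF5 file, or at offset 512, as in a file with the Matlab-style header described above.

```scala
object Hdf5Offset {
  // The 8-byte HDF5 format signature: 0x89 'H' 'D' 'F' '\r' '\n' 0x1a '\n'.
  val signature: Array[Byte] =
    Array[Int](0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a).map(_.toByte)

  // True if the signature appears in `bytes` starting at `offset`.
  def hasSignatureAt(bytes: Array[Byte], offset: Int): Boolean =
    bytes.length >= offset + signature.length &&
      bytes.slice(offset, offset + signature.length).sameElements(signature)

  // Classify the leading bytes of a file by where the signature sits.
  def classify(bytes: Array[Byte]): String =
    if (hasSignatureAt(bytes, 0)) "plain HDF5"
    else if (hasSignatureAt(bytes, 512)) "HDF5 after a 512-byte header"
    else "unrecognized"

  def main(args: Array[String]): Unit = {
    // Synthetic byte arrays standing in for the start of real files.
    val plain = signature ++ Array.fill[Byte](8)(0)
    val withHeader = Array.fill[Byte](512)(0) ++ signature
    println(classify(plain))      // prints plain HDF5
    println(classify(withHeader)) // prints HDF5 after a 512-byte header
  }
}
```

To use it on a real file, pass the first 520 bytes of the file to classify.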
You can also load and save non-Matlab compatible HDF5 files using saveAsHDF5(fname,varname). The file contents are the same as saveAs(fname,varname) but the Matlab-compatible header is not created. You can read these files using the same load(fname,varname) command.
Saving variables to a file is straightforward:
scala> saveAs("d:\\data\\sentiment\\data1.mat", a, "tokens", b, "trigrams")
You can save an arbitrary number of variables to a file. The first argument to saveAs is a filename, and the remaining arguments form an alternating list of variables from the environment and String names. The effect is that variable a is saved as "tokens", b is saved as "trigrams", etc. In fact a and b don't have to be references to matrices; they could be any expressions that return the appropriate matrix types.
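The calling convention can be pictured with a small sketch. The helper pairUp below is hypothetical and not part of BIDMat. It only shows how an alternating list of values and names can be split into (name, value) pairs, with plain arrays standing in for matrices.

```scala
object AlternatingArgs {
  // Hypothetical helper: split (value, name, value, name, ...) into name -> value pairs.
  def pairUp(args: AnyRef*): List[(String, AnyRef)] = {
    require(args.length % 2 == 0, "expected alternating values and names")
    args.grouped(2).map {
      case Seq(value, name: String) => name -> value
      case other =>
        throw new IllegalArgumentException(s"each value must be followed by a String name: $other")
    }.toList
  }

  def main(args: Array[String]): Unit = {
    // Plain arrays stand in for the matrices a and b.
    val a = Array[Int](1, 2, 3)
    val b = Array[Double](0.5, 1.5)
    val pairs = pairUp(a, "tokens", b, "trigrams")
    println(pairs.map(_._1).mkString(", ")) // prints tokens, trigrams
  }
}
```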
You can load this data directly into Matlab with the load command (which doesn't need the "-v7.3" option). It will create variables named "tokens" and "trigrams" that are respectively a dense matrix of int32, and a sparse matrix of double.
Not all Matlab types are supported. Currently the supported types are dense matrices of double, float and int32, and sparse double matrices (you can also save and load sparse matrices with float coefficients, which do not exist in Matlab). String data are stored as uint16, which matches well with the internal formats of Matlab and Java/Scala, and will be read by Matlab as strings. A CSMat of string data will be read by Matlab as a cell array of strings. Unfortunately, this is very inefficient in HDF5. Matlab really only has cell string arrays to handle variable-length strings. As in Matlab, the contents of each cell are stored as a separate array. In HDF5, compression only happens within a given array (i.e. within one string). Arrays of short strings, like dictionaries, cannot be compressed at all. It would be better to use another format, e.g. a sparse array of uint16, that could hold variable-length strings for I/O and be converted to a cell string array for manipulation.
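The idea of holding many variable-length strings in a single array can be sketched as follows. This is one possible layout and not something BIDMat provides; the names Packed, pack and unpack are hypothetical. All strings are concatenated into one array of 16-bit characters, and a second array of offsets records where each string starts, so a compressor would see one long array rather than many tiny ones.

```scala
object PackedStrings {
  // All strings concatenated into one array of 16-bit chars, plus start offsets.
  // offsets has one more entry than there are strings; the last entry is the total length.
  final case class Packed(chars: Array[Char], offsets: Array[Int])

  def pack(strings: Seq[String]): Packed = {
    val offsets = strings.scanLeft(0)(_ + _.length).toArray
    Packed(strings.mkString.toCharArray, offsets)
  }

  def unpack(p: Packed): List[String] =
    p.offsets.zip(p.offsets.drop(1)).map { case (start, end) =>
      new String(p.chars, start, end - start)
    }.toList

  def main(args: Array[String]): Unit = {
    val dict = Seq("the", "of", "and")
    val packed = pack(dict)
    println(packed.offsets.mkString(",")) // prints 0,3,5,8
    println(unpack(packed) == dict.toList) // prints true
  }
}
```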
BIDMat includes a simple binary file format for high-speed load/save of compressed or uncompressed data. BIDMat files include a 4-word (16 byte) binary header. The first word specifies the matrix type. It has the form:
WXYZ00ABC (decimal digits)
WXYZ = version number (currently zero)
A = matrix type: 1 (dense), 2 (sparse), 3 (sparse, norows), 4 (3-tensor), 5 (4-tensor), 6 (5-tensor)
B = data type: 0 (byte), 1 (int), 2 (long), 3 (float), 4 (double), 5 (complex float), 6 (complex double)
C = index type (sparse matrices only): 1 (int), 2 (long)
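Reading the layout above as decimal digit positions, the type word can be decoded with integer division and remainders. The sketch below is not BIDMat's own reader, and the names HeaderWord, TypeWord and decode are hypothetical. It takes the word as an already-parsed Int and makes no assumption about how the header is read from disk.

```scala
object HeaderWord {
  // Fields of the type word WXYZ00ABC, read as decimal digits.
  final case class TypeWord(version: Int, matrixType: Int, dataType: Int, indexType: Int)

  def decode(word: Int): TypeWord = TypeWord(
    version    = word / 100000,     // WXYZ: digits above the two zeros
    matrixType = (word / 100) % 10, // A: hundreds digit
    dataType   = (word / 10) % 10,  // B: tens digit
    indexType  = word % 10          // C: units digit
  )

  def main(args: Array[String]): Unit = {
    println(decode(130)) // prints TypeWord(0,1,3,0): dense, float, no index type
    println(decode(231)) // prints TypeWord(0,2,3,1): sparse, float, int indices
  }
}
```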