
String matrices

jcanny edited this page Jun 15, 2014 · 10 revisions

Matrices of Strings

BIDMat includes two matrix types intended to store strings, CSMat and SBMat.

CSMat

CSMat is a dense matrix whose elements are Java strings. You can create CSMats e.g. with the csrow and cscol functions:

> val x=csrow("to","be","or","not","to","be")
x: BIDMat.CSMat = to,be,or,not,to,be
> val y=CSMat(1,6)
y: BIDMat.CSMat = NULL,NULL,NULL,NULL,NULL,NULL
> val z = x on y
z: BIDMat.CSMat =
    to    be    or   not    to    be
  NULL  NULL  NULL  NULL  NULL  NULL
> z(?,0->2)
res4: BIDMat.CSMat =
    to    be
  NULL  NULL

Note that CSMats can have null elements. CSMat supports "structural" operations such as slicing and horizontal and vertical concatenation, but none of the algebraic operations. It also does not support comparison operators for now. It does support an element-wise string concatenation operator + and a Kronecker product operator ** (or Unicode ⊗). Examples:

> x + x
res5: BIDMat.CSMat = toto,bebe,oror,notnot,toto,bebe
> x ** cscol("1","2")
res6: BIDMat.CSMat =
   to1   be1   or1  not1   to1   be1
   to2   be2   or2  not2   to2   be2

CSMats have limited functionality because BIDMat by preference represents collections of strings using integer references into dictionary objects (see the next section).
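
As a rough illustration of what "integer references into a dictionary" means, the sketch below encodes a token sequence as a list of distinct strings plus one integer id per token. It is a hand-rolled example in plain Scala; DictEncode and encode are hypothetical names and do not reflect BIDMat's dictionary API.

```scala
// Illustrative only: a hand-rolled dictionary encoding, not BIDMat's dictionary type.
object DictEncode {
  // Returns (dictionary, ids) such that dictionary(ids(i)) == tokens(i).
  def encode(tokens: Seq[String]): (Vector[String], Vector[Int]) = {
    val dict  = tokens.distinct.toVector     // distinct strings in first-seen order
    val index = dict.zipWithIndex.toMap      // string -> integer id
    (dict, tokens.map(index).toVector)
  }
}
```

Once encoded this way, the token sequence is an ordinary integer vector, and each distinct string is stored only once.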

CSMats can be stored in a Matlab (-v7.3) file in a format compatible with Matlab. But we found that this format is slow to read and write and gives little or no compression, because each string is treated as a separate block of data in HDF5 and compressed separately.

SBMat

SBMat is a specialization of BIDMat's sparse matrix type to byte contents. In an SBMat, each *column* represents a single string, so an SBMat can hold only a vector of strings (not a 2d matrix). The data field of an SBMat is a byte array containing the entire, concatenated contents of all the strings. The internal jc array of the sparse matrix points to the start of each string in the array. This representation is much more memory-efficient and much more IO-friendly than CSMat.
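
The layout described above (one concatenated byte array plus an array of start offsets) can be modelled in a few lines of plain Scala. This is a standalone sketch, not BIDMat's SBMat class: PackedStrings and pack are hypothetical names, the offsets are 0-based, and UTF-8 encoding is an assumption made for the example.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Illustrative only: a standalone model of the layout, not BIDMat's SBMat class.
// Offsets here are 0-based; the indexing convention BIDMat uses internally may differ.
final case class PackedStrings(data: Array[Byte], jc: Array[Int]) {
  def length: Int = jc.length - 1

  // String i occupies the bytes from jc(i) up to (but excluding) jc(i + 1).
  def apply(i: Int): String =
    new String(data, jc(i), jc(i + 1) - jc(i), UTF_8)
}

object PackedStrings {
  def pack(strings: Seq[String]): PackedStrings = {
    val bytes = strings.map(_.getBytes(UTF_8))   // UTF-8 is an assumption of this sketch
    // Running total of byte lengths: the start offset of each string,
    // plus one final entry marking the end of the last string.
    val jc = bytes.scanLeft(0)(_ + _.length).toArray
    PackedStrings(Array.concat(bytes: _*), jc)
  }
}
```

Packing n strings this way needs only two arrays, one of bytes and one of n + 1 integers, rather than n separate string objects, which is the kind of saving the text attributes to SBMat.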

SBMats are ideal for IO of dictionary data. They don't support much else. You can slice columns of an SBMat to get substrings, but many other operations (e.g. slicing rows) are not safe and can lead to bad consequences.
