String matrices
BIDMat includes two matrix types intended to store strings, CSMat and SBMat.
CSMat is a dense matrix whose elements are Java strings. You can create CSMats e.g. with the csrow and cscol functions:
> val x=csrow("to","be","or","not","to","be")
x: BIDMat.CSMat = to,be,or,not,to,be
> val y=CSMat(1,6)
y: BIDMat.CSMat = NULL,NULL,NULL,NULL,NULL,NULL
> val z = x on y
z: BIDMat.CSMat =
to be or not to be
NULL NULL NULL NULL NULL NULL
> z(?,0->2)
res4: BIDMat.CSMat =
to be
NULL NULL
Note that CSMats can have null elements. CSMat supports "structural" operations such as slicing and horizontal and vertical concatenation, but none of the algebraic operations. It also doesn't support comparison operators for now. It does support a concatenation operator + and a Kronecker product operator ** (or the Unicode operator ⊗). Examples:
> x + x
res5: BIDMat.CSMat = toto,bebe,oror,notnot,toto,bebe
> x ** cscol("1","2")
res6: BIDMat.CSMat =
to1 be1 or1 not1 to1 be1
to2 be2 or2 not2 to2 be2
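To make the semantics of the two operators concrete, here is a minimal plain-Scala sketch that reproduces the results above on nested vectors. It does not use BIDMat; the object `StringKron`, the type alias `StrMat`, and the functions `concat` and `kron` are hypothetical names chosen for illustration, and the block ordering of the Kronecker product is inferred from the output shown above.

```scala
// Plain-Scala illustration only; none of these names are part of BIDMat.
object StringKron {
  // Row-major string matrix: the outer Vector holds rows.
  type StrMat = Vector[Vector[String]]

  // Element-wise concatenation of two matrices of the same shape.
  def concat(a: StrMat, b: StrMat): StrMat =
    a.zip(b).map { case (ra, rb) => ra.zip(rb).map { case (s, t) => s + t } }

  // Kronecker product with string concatenation in place of multiplication:
  // for a (m x n) and b (p x q), element (i*p + k, j*q + l) is a(i)(j) + b(k)(l).
  def kron(a: StrMat, b: StrMat): StrMat =
    a.flatMap { ra =>
      b.map { rb =>
        ra.flatMap(s => rb.map(t => s + t))
      }
    }
}
```

With a 1x2 row `("to", "be")` and a 2x1 column `("1", "2")`, `kron` yields the rows `to1 be1` and `to2 be2`, matching the shape of `x ** cscol("1","2")`.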
CSMats have limited functionality because BIDMat by preference represents collections of strings using integer references into dictionary objects (see the next section).
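As a rough picture of what "integer references into a dictionary" means, the following plain-Scala sketch encodes a token sequence as a list of distinct strings plus integer codes. It is not BIDMat's dictionary API; `DictSketch` and `encode` are hypothetical names, and the first-seen ordering of the dictionary is an assumption made for the example.

```scala
// Plain-Scala illustration only; not BIDMat's dictionary class.
object DictSketch {
  // Returns the distinct strings (in first-seen order) and, for each input
  // token, the integer position of that token in the dictionary.
  def encode(tokens: Seq[String]): (Vector[String], Vector[Int]) = {
    val dict = tokens.distinct.toVector
    val index = dict.zipWithIndex.toMap
    (dict, tokens.map(index).toVector)
  }
}
```

Once encoded this way, the tokens become an ordinary integer vector, and each distinct string is stored only once.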
CSMats can be stored in a Matlab (-v7.3) file in a format compatible with Matlab. But we found that this format is slow to read and write: each string is treated as a separate block of data in HDF5 and compressed separately, which also yields little or no compression.
SBMat is a specialization of BIDMat's sparse matrix type to byte contents. In an SBMat, each *column* represents a single string, so an SBMat can hold only a vector of strings (not a 2d matrix). The data field of an SBMat is a byte array containing the entire, concatenated contents of all the strings. The internal jc array of the sparse matrix points to the start of each string in that byte array. This representation is much more memory-efficient and much more IO-friendly than CSMat.
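The layout described above can be sketched in plain Scala as one concatenated byte array plus an array of start offsets. This is not the SBMat class itself: `PackedStringsSketch`, `Packed`, and `pack` are hypothetical names, and the UTF-8 encoding and 0-based offsets (with one extra trailing entry marking the end of the last string) are assumptions made for illustration.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Plain-Scala illustration of the layout; not BIDMat's SBMat.
object PackedStringsSketch {
  // data: all strings' bytes, concatenated.
  // jc:   jc(i) is the start of string i in data; jc(i + 1) is where it ends.
  final class Packed(val data: Array[Byte], val jc: Array[Int]) {
    def length: Int = jc.length - 1
    def apply(i: Int): String =
      new String(data, jc(i), jc(i + 1) - jc(i), UTF_8)
  }

  def pack(strings: Seq[String]): Packed = {
    val encoded = strings.map(_.getBytes(UTF_8))
    // Running sum of byte lengths gives the start offset of every string.
    val jc = encoded.scanLeft(0)((acc, bytes) => acc + bytes.length).toArray
    new Packed(Array.concat(encoded: _*), jc)
  }
}
```

Because the strings live in a single contiguous array, the whole collection can be read or written as two flat arrays rather than one block per string.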
SBMats are ideal for IO of dictionary data. They don't support much else. You can slice columns of an SBMat to get substrings, but many other operations (e.g. slicing rows) can have bad consequences.