WIP: Read undocumented datatypes #23
Sorry for the delay; I was on vacation with limited email access. This would be great to have. Given that it looks like the subsystem is effectively a MAT file of its own, I wonder if we could create a separate

function read_subsystem_matfile{N}(data::Array{Uint8,N})
    # A Matfile is stored within this matrix's data
    f = IOBuffer(data[:])
This should be IOBuffer(vec(data)) to avoid making a copy.
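To illustrate the review comment above, here is a minimal sketch in current Julia syntax (the PR itself predates it). The array contents are made up; the point is only that indexing with a colon allocates a new vector, while `vec` returns a reshaped view onto the same memory.

```julia
# A small 2-D byte array standing in for the subsystem matrix data (made-up values).
data = UInt8[0x01 0x03; 0x02 0x04]

copied = data[:]     # colon indexing allocates a fresh vector
shared = vec(data)   # vec reshapes without copying; memory is shared with `data`

# Mutating the original shows which of the two is an independent copy.
data[1, 1] = 0xff

# An IOBuffer over the shared vector reads the same underlying bytes.
io = IOBuffer(shared)
first_byte = read(io, UInt8)
```

After the mutation, `copied[1]` keeps the old value while `shared[1]` and `first_byte` see the new one.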
Not a problem. It's just something that I've been spending bits of spare time on as I get a chance. I did initially have the subsystem wrapped in a MatlabFile type, and we still can do that, but unfortunately it doesn't help with parsing the header fields. They skip the text header (which is reasonable and we can add an offset to the IOBuffer), but then don't have the same sort of spacing between the endianness and version fields and the data start. So we can't punt to MatOpen to do the endianness/version checking, but there could be other advantages. (I have a hunch that in their implementation, they never seek to offset 128, and instead jump to the next multiple of 8 after reading the header, which allows for this craziness.)

The reason I outlined a dict for the subsystem within the Matlabv5File type is because I think I'll do some upfront work in parsing a binary data structure that contains the class information: they store an array of sequential strings (without lengths) for the class structure information and then refer to those strings by array index (and not total offset).

I appreciate your feedback! I'm still learning Julia.
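The string table described above can be sketched as follows, in current Julia syntax. This is only an illustration: the comment says the strings carry no lengths, so the sketch assumes each one is terminated by a NUL byte, and both the function name `parse_name_table` and the sample names are hypothetical.

```julia
# Split a buffer of back-to-back strings into a table of names.
# ASSUMPTION: every string ends with a NUL byte (0x00); the thread only says
# that the strings are stored sequentially without lengths.
function parse_name_table(bytes::Vector{UInt8})
    names = String[]
    start = 1
    for (i, b) in enumerate(bytes)
        if b == 0x00
            push!(names, String(bytes[start:i-1]))
            start = i + 1
        end
    end
    return names
end

# Made-up table contents for illustration.
raw = collect(codeunits("MyClass\0width\0height\0"))
names = parse_name_table(raw)

# Other structures can then refer to a string by its position in the table,
# not by its byte offset.
second = names[2]
```

Parsing the table once up front means later lookups are plain array indexing, which matches the motivation for keeping the parsed subsystem in a dict on the file object.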
I'm still working on this as time permits. I've not had much time and the format ended up being way more convoluted than I initially thought. It's really wonky. I've been slowly reverse-engineering and documenting the format in an IJulia notebook… and I think I finally have something working! Now I need to start moving the code over into MAT.jl and put together some more test cases that might more thoroughly explore the format's fields. But it's looking promising. And maybe others can make use of this information, too (ping @matthew-brett: you wrote the SciPy parser, no? You may be interested in this, too.)
Matlab's opaque classes (handle classes) are stored in an undocumented manner. This defines the class id for mxOPAQUE_OBJECT and enables reading of the object data.
If subsys_offset is nonzero (or not spaces, yay backwards compatibility!), then there's an extra unnamed miMATRIX at the end of the file. This matrix contains the data of a complete matfile (except without the header), containing all the data for class objects in an undocumented layout. Crazily, this subsystem can additionally include another matrix element that contains yet another matfile.
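As a rough sketch of the check described in this commit message (current Julia syntax), the subsystem offset occupies bytes 117 to 124 of the 128-byte v5 header, and both all zeros and all spaces mean that no subsystem is present. The function name `subsystem_offset` is hypothetical, and the sketch assumes a little-endian file; a real reader would consult the endian indicator first.

```julia
# Return the subsystem offset stored in a 128-byte v5 header, or `nothing`.
function subsystem_offset(header::Vector{UInt8})
    length(header) == 128 || throw(ArgumentError("expected a 128-byte header"))
    field = header[117:124]
    # All zeros or all spaces both mean "no subsystem data".
    if all(==(0x00), field) || all(==(0x20), field)
        return nothing
    end
    # ASSUMPTION: little-endian file; convert to host byte order.
    return ltoh(reinterpret(UInt64, field)[1])
end

# A header filled with spaces, as older writers would produce.
header = fill(0x20, 128)
no_subsys = subsystem_offset(header)

# Store a made-up offset of 4096 in little-endian order.
header[117:124] .= reinterpret(UInt8, [htol(UInt64(4096))])
offset = subsystem_offset(header)
```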
Class objects need access to the MCOS subsystem data when read, so read it first if it is there.
Instead of passing the IO stream and swap_bytes boolean around separately, pass the matfile to the higher level functions instead. This will allow them to have access to the subsystem.
(Via parameterization). This improves performance now that I pass around the matfile more
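The refactoring in the last two commit messages might look roughly like this in current Julia syntax. The type `SketchMatFile`, its fields, and `read_uint32` are illustrative names, not the actual MAT.jl definitions. Parameterizing on the stream type keeps the `io` field concretely typed, which is the usual reason such a change helps performance.

```julia
# Hypothetical container bundling the stream, the byte-swap flag, and the
# parsed subsystem; the names are assumptions, not MAT.jl's API.
struct SketchMatFile{T<:IO}
    io::T
    swap_bytes::Bool
    subsystem::Dict{String,Any}
end

# Readers take the file object instead of separate (io, swap_bytes) arguments,
# so they can also reach the subsystem when they need it.
function read_uint32(f::SketchMatFile)
    v = read(f.io, UInt32)
    return f.swap_bytes ? bswap(v) : v
end

bytes = UInt8[0x01, 0x00, 0x00, 0x00]
native = SketchMatFile(IOBuffer(bytes), false, Dict{String,Any}())
swapped = SketchMatFile(IOBuffer(bytes), true, Dict{String,Any}())

a = read_uint32(native)
b = read_uint32(swapped)
```

Here `b` is the byte-swapped counterpart of `a`, whatever the host's endianness.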
Still need to add tests and implement parsing of nested objects
As an aside, here's a fun Matlab WTF (in 2013b):
Basically, the only thing that Matlab uses to see if it should parse out nested objects is the type and first element or two of the Uint32 array itself. It works with longer arrays, too. Absolutely incredible. Never save any Uint32 arrays as class properties if they might start with 0xdd000000! It seemed like that's exactly how I'd have to implement this, but I couldn't really believe that it was how Matlab did it. Unfortunately, it's the way they do it and it's all I have to go on, too. That said, it's not hard to be a bit smarter about the failure modes...
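A minimal sketch of the heuristic described above, in current Julia syntax: the function name `looks_like_object_ref` is hypothetical, and the sketch checks only the element type and the first element, whereas the comment says Matlab may look at the first element or two.

```julia
# Marker value mentioned in the comment above; an 8-digit hex literal is a UInt32.
const OBJECT_MARKER = 0xdd000000

# Anything that is not a UInt32 array is never treated as an object reference.
looks_like_object_ref(x) = false

# A UInt32 array is treated as one if it starts with the marker.
# ASSUMPTION: only the first element is checked here.
looks_like_object_ref(x::AbstractArray{UInt32}) =
    !isempty(x) && first(x) == OBJECT_MARKER

# Ordinary numeric data that happens to start with the marker is misclassified,
# which is exactly the failure mode the comment warns about.
plain = UInt32[0xdd000000, 2, 1, 1]
misread = looks_like_object_ref(plain)
```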
Quite amusing. It's hard to believe how much worse that is than the approach we took with JLD.
Extend the test to contain empty, 2d- and 3d- array test cases
Crazily, they didn't fix it when they moved to version 7.3, either. The above code behaves identically across all save versions. I've not dug into the format for 7.3 yet, but this would seem to imply that they could possibly share code.
I have adapted the code in here (and the IJulia notebook linked) to work with v7.3 MAT files. My purpose was to read
There are also various discussion items that have not been resolved, e.g.: should the
BTW that was some epic reverse engineering @mbauman!
Your notebook has been extremely helpful for me in implementing a v7.3 parser in Python, supporting a range of built-in class-like types and custom classes. However, one thing confuses me:

    if flag == 0
        # This means that the property is stored in the names array
        d[names[name_idx]] = names[heap_idx]
    elseif flag == 1
        # The property is stored in the MCOS FileWrapper__ heap
        d[names[name_idx]] = heap[heap_idx+3] # But... the index is off by 3!? Crazy.
    elseif flag == 2
        # The property is a boolean, and the heap_idx itself is the value
        @assert 0 <= heap_idx <= 1 "boolean flag has a value other than 0 or 1"
        d[names[name_idx]] = bool(heap_idx)
    else
        error("unknown flag ",flag, " for property ",names[name_idx], " with heap index ",heap_idx)
    end

The comment "The property is a boolean, and the heap_idx itself is the value" made sense to me at first, before I met a counterexample with
I'm seeking help because I was unable to write a custom class example with fields saved with flag
Things I've tried so far:
I would like to know how you found flag 2 and its meaning. Could you provide a sample class definition or some assistance? Sorry if it is inappropriate to ask under this PR.
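For readers following along in current Julia, the quoted dispatch can be sketched as a small function. It follows the notebook's interpretation of the three flags, including the reading of flag 2 that this comment questions. The name `decode_property!` and the sample `names` and `heap` contents are made up for illustration.

```julia
# Decode one (name_idx, flag, heap_idx) triple into the property dict `d`,
# mirroring the quoted notebook code in current Julia syntax.
function decode_property!(d::Dict{String,Any}, names, heap, name_idx, flag, heap_idx)
    key = names[name_idx]
    if flag == 0
        # Value is itself an entry of the names array.
        d[key] = names[heap_idx]
    elseif flag == 1
        # Value lives in the heap, with the index offset by 3 as in the notebook.
        d[key] = heap[heap_idx + 3]
    elseif flag == 2
        # Notebook's interpretation: heap_idx is itself a boolean value.
        0 <= heap_idx <= 1 || error("boolean flag has a value other than 0 or 1")
        d[key] = Bool(heap_idx)
    else
        error("unknown flag $flag for property $key with heap index $heap_idx")
    end
    return d
end

# Made-up inputs: three placeholder heap slots, then one real value.
names = ["color", "red", "size", "visible"]
heap = Any[nothing, nothing, nothing, [1.0 2.0; 3.0 4.0]]

d = Dict{String,Any}()
decode_property!(d, names, heap, 1, 0, 2)   # flag 0: value taken from names
decode_property!(d, names, heap, 3, 1, 1)   # flag 1: value taken from heap[1 + 3]
decode_property!(d, names, heap, 4, 2, 1)   # flag 2: boolean under the notebook's reading
```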
I know this is an ancient PR, but I'm wondering if people are still interested in supporting this in MAT.jl. I have reverse-engineered how MATLAB stores classes in the MAT v7.3 format. I am planning to write up my findings in a couple of posts on my homepage. So far two have been written up:
There are a few more things that I'll write up as soon as I find the time.
@mbauman I'd be interested to know to what extent my findings will match up with what you have studied. I have not looked at MAT v5, and I expect that there'll be a lot of similarities.
@Jeroen-van-der-Meer Happy to give you access to this repo if you'd like to improve the package.
I've begun work on reverse-engineering some of Matlab's undocumented datatypes, most prominently their new-style class objects. My group works very extensively with class-based objects within Matlab, and so I've always found that to be a roadblock in picking up, e.g., Octave or NumPy or Julia (yes, yes, I know I can just save them as structs, but that adds additional complexities on the Matlab side that I'd prefer not to deal with). Plus, this has been a very interesting challenge.
I'm making good progress (just within v5 files), but I'm at the point where I'll need to do some major refactoring; all of the composite datatype readers (e.g., cells, structs, etc) will need knowledge of the information contained within the subsystem. I think it'd make sense to make them all functions of the Matlabv5File (which would then additionally contain the parsed subsystem information). I figured I'd ping you before making major changes… does this seem reasonable to you? Or do you have other suggestions?