WIP: Read undocumented datatypes #23

Open
wants to merge 10 commits into
base: master

mbauman
Member

@mbauman mbauman commented Jan 4, 2014

I've begun work on reverse-engineering some of Matlab's undocumented datatypes, most prominently their new-style class objects. My group works very extensively with class-based objects within Matlab, and so I've always found that to be a roadblock in picking up, e.g., Octave or NumPy or Julia (yes, yes, I know I can just save them as structs, but that adds additional complexities on the Matlab side that I'd prefer not to deal with). Plus, this has been a very interesting challenge.

I'm making good progress (just within v5 files), but I'm at the point where I'll need to do some major refactoring; all of the composite datatype readers (e.g., cells, structs, etc) will need knowledge of the information contained within the subsystem. I think it'd make sense to make them all functions of the Matlabv5File (which would then additionally contain the parsed subsystem information). I figured I'd ping you before making major changes… does this seem reasonable to you? Or do you have other suggestions?

@simonster
Member

Sorry for the delay; I was on vacation with limited email access. This would be great to have.

Given that it looks like the subsystem is effectively a MAT file of its own, I wonder if we could create a separate Matlabv5File for it, e.g. by adding offset and eof fields to Matlabv5File or creating a "SubIO" type, and then just read that file to populate the subsystem Dict. You know more about the subsystem format than I do, though, and depending on the particulars (e.g. if the matrix holding the subsystem could be compressed) your current approach may be substantially simpler.
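To make the "SubIO" idea concrete, here is a minimal sketch in Python (not MAT.jl code) of a read-only window over part of a larger stream. The class name, its constructor arguments, and the method set are all hypothetical; the thread only proposes offset and eof fields, not this interface.

```python
import io


class SubIO:
    """Read-only view of the byte range [offset, eof) of a seekable binary stream.

    Hypothetical helper: the name and interface are illustrative only.
    """

    def __init__(self, stream, offset, eof):
        if not 0 <= offset <= eof:
            raise ValueError("expected 0 <= offset <= eof")
        self._stream = stream
        self._offset = offset
        self._length = eof - offset
        self._pos = 0  # position relative to the start of the window

    def tell(self):
        return self._pos

    def seek(self, pos):
        if not 0 <= pos <= self._length:
            raise ValueError("seek outside of the sub-stream")
        self._pos = pos
        return pos

    def read(self, n=-1):
        remaining = self._length - self._pos
        if n < 0 or n > remaining:
            n = remaining  # never read past the end of the window
        self._stream.seek(self._offset + self._pos)
        data = self._stream.read(n)
        self._pos += len(data)
        return data


# Usage: expose only the middle of a larger buffer to a reader.
outer = io.BytesIO(b"HEADERsubsystemTRAILER")
sub = SubIO(outer, offset=6, eof=15)
print(sub.read(3))  # first three bytes of the window
print(sub.read())   # the rest of the window, stopping at eof
```

A wrapper like this would let the same element reader run on the embedded file without copying it. It would not help if the enclosing matrix were compressed, which is the caveat raised above.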


function read_subsystem_matfile{N}(data::Array{Uint8,N})
# A Matfile is stored within this matrix's data
f = IOBuffer(data[:])

This should be IOBuffer(vec(data)) to avoid making a copy.

@mbauman
Member Author

mbauman commented Jan 7, 2014

Not a problem. It's just something that I've been spending bits of spare time on as I get a chance. I did initially have the subsystem wrapped in a MatlabFile type, and we still can do that, but unfortunately it doesn't help with parsing the header fields. They skip the text header (which is reasonable and we can add an offset to the IOBuffer), but then don't have the same sort of spacing between the endianness and version fields and the data start. So we can't punt to MatOpen to do the endianness/version checking, but there could be other advantages. (I have a hunch that in their implementation, they never seek to offset 128, and instead jump to the next multiple of 8 after reading the header, which allows for this craziness).
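As a rough illustration of the hunch above, the Python sketch below reads a 2-byte version field and a 2-byte endianness indicator, then skips to the next multiple of 8 instead of seeking to offset 128. The function name, the 4-byte header length, and the field order (version first, then indicator, as in the regular v5 header) are assumptions, not details confirmed in this thread.

```python
import struct


def read_subsystem_header(buf):
    """Parse a text-less header and return (byte_order, version, data_start).

    Assumed layout (hypothetical): 2-byte version, 2-byte endian indicator,
    then padding up to the next multiple of 8.
    """
    if len(buf) < 4:
        raise ValueError("header too short")
    endian_tag = bytes(buf[2:4])
    if endian_tag == b"IM":
        byte_order = "<"  # bytes appear swapped, so the data is little-endian
    elif endian_tag == b"MI":
        byte_order = ">"
    else:
        raise ValueError("unrecognised endian indicator: %r" % endian_tag)
    (version,) = struct.unpack(byte_order + "H", bytes(buf[0:2]))
    header_len = 4
    data_start = (header_len + 7) // 8 * 8  # round up to a multiple of 8
    return byte_order, version, data_start


# Usage with a made-up little-endian header followed by payload bytes.
order, version, start = read_subsystem_header(b"\x00\x01IM\x00\x00\x00\x00payload")
print(order, hex(version), start)
```

Rounding up rather than hard-coding an offset means the same routine would also cope with a header of a different length, if the hunch is right.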

The reason I outlined a dict for the subsystem within the Matlabv5File type is because I think I'll do some upfront work in parsing a binary data structure that contains the class information — they store an array of sequential strings (without lengths) for the class structure information and then refer to those strings by array index (and not total offset).
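The up-front parsing described here could look something like the following Python sketch: split the string block once, then resolve references by position. It assumes the strings are NUL-terminated and that indices are 1-based; neither is stated in this thread, and the function names are made up.

```python
def split_names(blob, count):
    """Split a block of back-to-back strings into a list of `count` names.

    Assumption: each string ends with a NUL byte (no length prefix).
    """
    parts = bytes(blob).split(b"\x00")
    if len(parts) < count:
        raise ValueError("fewer strings than expected")
    return [p.decode("ascii") for p in parts[:count]]


def name_at(names, index):
    """Look up a name by 1-based array index (assumed convention)."""
    if not 1 <= index <= len(names):
        raise IndexError("name index out of range: %d" % index)
    return names[index - 1]


# Usage with made-up data; trailing NULs stand in for alignment padding.
names = split_names(b"Count\x00KeyType\x00ValueType\x00\x00\x00", 3)
print(name_at(names, 2))
```

Parsing the table once and caching it is what makes index-based references cheap; scanning for the n-th terminator on every lookup would be the alternative.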

I appreciate your feedback! I'm still learning Julia.

@mbauman
Member Author

mbauman commented Feb 20, 2014

I'm still working on this as time permits. I've not had much time and the format ended up being way more convoluted than I initially thought. It's really wonky. I've been slowly reverse engineering and documenting the format in an IJulia notebook… and I think I finally have something working!

Now I need to start moving the code over into MAT.jl and put together some more test cases that might more thoroughly explore the format's fields. But it's looking promising. And maybe others can make use of this information, too (ping @matthew-brett — you wrote the SciPy parser, no? You may be interested in this, too.)

Matlab's opaque classes (handle classes) are stored in an undocumented manner.
This defines the class id for mxOPAQUE_OBJECT and enables reading of the
object data.

If subsys_offset is nonzero (or not spaces, yay backwards compatibility!), then
there's an extra unnamed miMATRIX at the end of the file. This matrix contains
the data of a complete matfile (except without the header), containing all the
data for class objects in an undocumented layout.  Crazily, this subsystem can
additionally include another matrix element that contains yet another matfile.

Class objects need access to the MCOS subsystem data when read, so read it first if it is there.

Instead of passing the IO stream and swap_bytes boolean around separately, pass the matfile to the higher level functions instead. This will allow them to have access to the subsystem.

(Via parameterization).  This improves performance now that I pass around the matfile more

Still need to add tests and implement parsing of nested objects
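
The subsys_offset check mentioned in the commit notes above can be sketched in Python as follows. It assumes the usual 128-byte v5 header layout (116 bytes of text, then an 8-byte subsystem data offset, then the version and endian fields); the function name and return convention are hypothetical.

```python
import struct


def subsystem_offset(header, byte_order="<"):
    """Return the subsystem data offset, or None if the file has no subsystem.

    Assumption: `header` is the 128-byte v5 header and the offset field
    occupies bytes 116..123 (0-based). All zeros or all spaces mean "absent".
    """
    if len(header) < 128:
        raise ValueError("a v5 header is 128 bytes long")
    raw = bytes(header[116:124])
    if raw == b"\x00" * 8 or raw == b" " * 8:
        return None
    (offset,) = struct.unpack(byte_order + "Q", raw)
    return offset


# Usage with two made-up headers: one padded with spaces, one with an offset.
text = b"MATLAB 5.0 MAT-file".ljust(116, b" ")
flags = b"\x00\x01IM"
print(subsystem_offset(text + b" " * 8 + flags))
print(subsystem_offset(text + struct.pack("<Q", 4096) + flags))
```

Treating the all-spaces case as "absent" covers files whose text header was simply padded through that field.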
@mbauman
Member Author

mbauman commented Mar 16, 2014

As an aside, here's a fun Matlab WTF (in 2013b):

>> m = containers.Map('1',uint32([3707764736 1 1 1 1]))
m = 
  Map with properties:
        Count: 1
      KeyType: char
    ValueType: any
>> m('1')
ans =
  3707764736           1           1           1           1
>> save('m.mat','m')
>> load m.mat
>> m('1') % No longer a uint32 array, but rather a reference back to the map itself!
ans = 
  Map with properties:
        Count: 1
      KeyType: char
    ValueType: any
>> m('2') = uint32([3707764736 0 0]);
>> m('2')
ans =
  3707764736           0           0
>> save('m.mat','m')
>> load m.mat
>> m('2') % No longer a uint32 array, but empty instead!
ans =
     []
>> m('3') = uint32([3707764736 2 1 1 2 1]);
>> save('m.mat','m')
>> load m.mat
% Seg-fault!

Basically, the only thing that Matlab uses to see if it should parse out nested objects is the type and first element or two of the Uint32 array itself. It works with longer arrays, too. Absolutely incredible. Never save any Uint32 arrays as class properties if they might start with 0xdd000000!

It seemed like that's exactly how I'd have to implement this, but I couldn't really believe that it was how Matlab did it. Unfortunately, it's the way they do it and it's all I have to go on, too. That said, it's not hard to be a bit smarter about the failure modes...
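
One way to be "a bit smarter about the failure modes" is sketched below in Python: apply the same heuristic (uint32 type plus the 0xdd000000 marker in the first element), but fall back to the raw array if resolving the reference fails. The function names and the `resolve` callback are hypothetical, and the set of exceptions treated as "not really a reference" is an assumption.

```python
OBJECT_REF_MAGIC = 0xDD000000  # equals 3707764736, the value used in the transcript


def looks_like_object_reference(values, dtype_name):
    """Heuristic from the thread: a uint32 array whose first element is the marker."""
    return (
        dtype_name == "uint32"
        and len(values) >= 1
        and values[0] == OBJECT_REF_MAGIC
    )


def decode_uint32_property(values, resolve):
    """Try to resolve an object reference; keep the raw data if that fails.

    `resolve` is a hypothetical callback that turns reference metadata into
    an object and raises on malformed input.
    """
    if not looks_like_object_reference(values, "uint32"):
        return values
    try:
        return resolve(values)
    except (IndexError, KeyError, ValueError):
        return values  # plain data that merely starts with the marker


def strict_resolver(values):
    # Stand-in resolver for the demo: rejects anything it cannot interpret.
    raise ValueError("not a valid reference")


print(decode_uint32_property([3707764736, 0, 0], strict_resolver))
print(decode_uint32_property([1, 2, 3], strict_resolver))
```

This does not remove the ambiguity (a plain array can still collide with a valid reference), but it turns the crash and data-loss cases into a recoverable fallback.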

@timholy
Member

timholy commented Mar 16, 2014

Quite amusing. It's hard to believe how much worse that is than the approach we took with JLD.

@mbauman
Member Author

mbauman commented Mar 18, 2014

Crazily, they didn't fix it when they moved to version 7.3, either. The above code behaves identically across all save versions. I've not dug into the format for 7.3 yet, but this would seem to imply that they could possibly share code.

@jebej
Contributor

jebej commented Apr 28, 2020

I have adapted the code in here (and the IJulia notebook linked) to work with v7.3 MAT files. My purpose was to read datetime values, which works fine. Should I make another PR?

There are also various discussion items that have not been resolved, e.g.: should the MCOS variable be read when the file is opened, so that references to the #subsystem# need to be processed only once?

@jebej
Contributor

jebej commented Apr 28, 2020

BTW that was some epic reverse engineering @mbauman !

@jjyyxx

jjyyxx commented Dec 17, 2020

@mbauman

Your notebook has been extremely helpful for me to implement a v7.3 parser in python, supporting a range of built-in class-like types and custom classes.

However, one thing that confuses me is this:

if flag == 0
    # This means that the property is stored in the names array
    d[names[name_idx]] = names[heap_idx]
elseif flag == 1
    # The property is stored in the MCOS FileWrapper__ heap
    d[names[name_idx]] = heap[heap_idx+3] # But... the index is off by 3!? Crazy.
elseif flag == 2
    # The property is a boolean, and the heap_idx itself is the value
    @assert 0 <= heap_idx <= 1 "boolean flag has a value other than 0 or 1"
    d[names[name_idx]] = bool(heap_idx)
else
    error("unknown flag ",flag, " for property ",names[name_idx], " with heap index ",heap_idx)
end

The comment "The property is a boolean, and the heap_idx itself is the value" made sense to me at first, until I met a counterexample with heap_idx equal to -1.
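
For reference, here is a Python port of the snippet above that tolerates such values instead of asserting. It keeps the snippet's 1-based indices (hence the `- 1` adjustments) and returns any flag-2 value other than 0 or 1 unchanged. What those other values mean is unknown, so passing them through is only a placeholder, and the function name is made up.

```python
def decode_property(flag, name_idx, heap_idx, names, heap):
    """Return (property_name, value) for one property entry.

    Indices are 1-based, as in the Julia snippet; `names` and `heap` are
    ordinary Python lists.
    """
    name = names[name_idx - 1]
    if flag == 0:
        # The value is itself an entry of the names array.
        value = names[heap_idx - 1]
    elif flag == 1:
        # The value lives in the heap, shifted by 3 as in the original code.
        value = heap[heap_idx + 3 - 1]
    elif flag == 2:
        # 0 and 1 read as booleans; anything else (e.g. -1) is kept as-is
        # because its meaning is not established.
        value = bool(heap_idx) if heap_idx in (0, 1) else heap_idx
    else:
        raise ValueError(
            "unknown flag %r for property %s with heap index %r"
            % (flag, name, heap_idx)
        )
    return name, value


# Usage with made-up tables.
names = ["Priority", "hello"]
heap = ["h1", "h2", "h3", "h4", "h5"]
print(decode_property(2, 1, 1, names, heap))
print(decode_property(2, 1, -1, names, heap))
```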

I'm seeking help because I was unable to write a custom class example with fields saved with flag 2. That is, I can't determine the exact condition for flag 2. I can only reproduce by creating instances of RTW.TimingInterface (field Priority) or coder.types.AggregateElement (field TargetOffset). However, they are built-in classes that are hard to inspect.

Things I've tried so far:

  1. Get metaclass of RTW.TimingInterface, and inspect its properties. I made my own class align with RTW.TimingInterface, but still could not reproduce this behavior.
  2. I noticed that the fields mentioned above (for example, RTW.TimingInterface#Priority) enforce scalar input. I added property validation (a size specification) to my custom class to hint that the property is a scalar, but still could not reproduce this behavior.

I would like to know how you found flag 2 and its meaning. Could you provide a sample class definition or some assistance?

Sorry if this PR is an inappropriate place to ask.

@Jeroen-van-der-Meer
Contributor

Jeroen-van-der-Meer commented Aug 7, 2024

I know this is an ancient PR, but I'm wondering if people are still interested in supporting this in MAT.jl.

I have reverse-engineered how MATLAB stores classes in the MAT v7.3 format. I am planning to write up my findings in a couple of posts on my homepage. So far two have been written up:

  1. In the first post I've just given a general review of how MATLAB stores simple objects in v7.3. None of this is new, and if you're familiar with the MAT.jl code then you can safely ignore this.
  2. In the second post I've outlined how MATLAB stores classes. The content of this post should allow you to parse nested trees of classes that have simple values (doubles, chars, cell arrays, etc.) at their leaves. MATLAB happens to store the entire 'tree structure' in an obfuscated stream of bytes in the hidden /#refs# dataset of the file. Deobfuscating this stream is a bit tricky.

There are a few more things that I'll write up as soon as I find the time.

  • If your class has default values or it has properties that have their types specified, the encoding changes, and I'll highlight how to deal with that.
  • Once you understand classes, it's pretty easy to parse datetimes and tables, and I'll elaborate a bit on how that's done.
  • Strings and enums are encoded in a rather peculiar way, not consistent with other objects, and they have to be dealt with separately.

@mbauman I'd be interested to know to what extent my findings will match up with what you have studied. I have not looked at MAT v5, and I expect that there'll be a lot of similarities.

@ViralBShah
Contributor

@Jeroen-van-der-Meer Happy to give you access to this repo if you'd like to improve the package.
