Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request: binary serialization/deserialization #358

Closed
CyberBrainX opened this issue Nov 9, 2016 · 35 comments
Closed

Request: binary serialization/deserialization #358

CyberBrainX opened this issue Nov 9, 2016 · 35 comments
Assignees
Labels
aspect: binary formats BSON, CBOR, MessagePack, UBJSON kind: enhancement/improvement state: help needed the issue needs help to proceed
Milestone

Comments

@CyberBrainX
Copy link

Wouldn’t it be useful to have the possibility to serialize the json object to a more compact form?
This would make sense for storing the container or sending it over an internet connection.

The MessagePack (http://msgpack.org/) format or some proprietary format directly derived from your internal representation would be fine.
Or could you probably just provide some generic interface for the binary serialization/deserialization?

I think this approach could be more useable than just gzipping the whole JSON data.

As an example this library did implement some alternative representations of a json container: http://greatpanic.com/progdocs/UniversalContainer.html
(But this library doesn’t work properly and doesn’t use C++ 11)

@nlohmann
Copy link
Owner

nlohmann commented Nov 9, 2016

I also had a look at MessagePack, and it seems to be a reasonable approach. As it can express more than JSON, I hope that users would understand that the library would not be able to parse arbitrary MessagePack input.

@nlohmann nlohmann self-assigned this Nov 11, 2016
@nlohmann
Copy link
Owner

nlohmann commented Nov 11, 2016

I started to implement MessagePack. Commit 543745a contains a basic serializer/deserializer for JSON values as static functions. It works with a std::vector<uint8_t> and currently has no diagnosis information in case of a parse error.

Feedback would be greatly appreciated:

  • naming of the functions
  • static function vs. member function vs. free function
  • move implementation outside the main header?
  • organization of the functions (I am thinking of reorganizing the large if in the parser, maybe using re2c)
  • improvements of the low-level shifting/conversion stuff
  • byte vector vs. containers vs. ???

@nlohmann nlohmann added state: please discuss please discuss the issue or vote for your favorite option kind: improvement and removed kind: improvement labels Nov 11, 2016
@CyberBrainX
Copy link
Author

Wow, I couldn’t believe how fast you did implement a first version. And it seems to be working!
I think it’s a great benefit being able to exchange data with the MessagePack world. Except binary data, but binary data and JSON is a bad match anyway (base 64 encoding).

I did test the space savings in some of my use cases and got the following results.
In simple cases (small data set) with almost no numeric data the saving was 12% over the compacted JSON file. Zipping alone showed a 15% reduction. First converting to msgpack and zipping the result gave a 19% reduction. As expected, with more numeric data the compression rate did grow constantly. Nice!

  1. The naming to_msgpack and from_msgpack makes sense and is future save if you like to add more serializers/deserializers.
  2. I prefer the static function. It’s the right context and can be still used in a flexible way.
  3. At the moment it doesn’t hurt to have just one header. It’s just more convenient.
  4. I’am not sure if re2c would be necessary in the parsing part. A hand crafted parser could be faster (the same is true for the json part - rapid json). Using a switch statement (for the fixed values) or a jump table could possibly speed up parsing of the MessagePack data(needs some tests to be verified).
  5. The low-level shifting/conversion stuff could be improved but the code will get more complicated (dealing with endianness).
  6. A byte vector is fine. It has some nice properties. A lot of frameworks have chosen byte vectors for handling serialized data.

Thank you for your great JSON implementation!

@nlohmann nlohmann added this to the Release 2.0.8 milestone Nov 15, 2016
@user1095108
Copy link

I beg to differ. MsgPack is not standardized, whereas CBOR, for example, is. There are a myriad of binary data formats out there, but, for better or worse, CBOR is standardized in RFC 7049, whereas MsgPack is not.

@marton78
Copy link

+1 for CBOR!

@nlohmann
Copy link
Owner

I shall have a look at CBOR.

@nlohmann nlohmann removed this from the Release 2.0.8 milestone Nov 23, 2016
@CyberBrainX
Copy link
Author

Ok, standards are great, I like them. CBOR is a PROPOSED standard and could still get changed. The proposal has some nice properties and it seems to be an improved MessagePack protocol. Unfortunately it’s not widely adopted and there are a lot of poor implementations. There’s no modern C++ 11 implementation and the old style C++ implementation is looking more like C code.

@user1095108
Copy link

Oh, you wanted to call a library, I thought you wanted to write everything yourselves. But anyway, even it is "only" a proposed standard, this is still better than MSGPACK. Should it change, we will know about it, but we might not find out if MSGPACK changes.

nlohmann added a commit that referenced this issue Nov 26, 2016
@nlohmann
Copy link
Owner

I started to also implement CBOR. As the implementation tends to be rather verbose, I am thinking of moving the binary serialization/deserialization into a new header. In any case, I need to add more test cases.

@nlohmann nlohmann removed the state: please discuss please discuss the issue or vote for your favorite option label Nov 26, 2016
@marton78
Copy link

marton78 commented Nov 27, 2016

Hi @nlohmann, I've had a look at your code. The implementation could be less verbose by

  1. Exploiting the orthogonality between length and content type. You could write a decode_length helper function that is called for all content types that can have a length. The function would take the length from the first byte if first_byte & 0x1F < 0x18 and call decode_uint8 / 16 / 32 / 64 otherwise. Then it would return a pointer to the first content byte. Similar for encoding.
  2. Writing a jump table with 256 entries. This way you would also be sure that you don't forget anything.

@nlohmann
Copy link
Owner

@marton78 Good points. There may be even more possibilities to generalize between the MessagePack and CBOR sections given the two formats are so similar.

@nlohmann
Copy link
Owner

@marton78 I started to refactor the code and indeed it got much more concise, see a820d68. However, I think even after more refactoring, the whole CBOR/MessagePack business needs some 1000 lines of code...

@nlohmann
Copy link
Owner

nlohmann commented Dec 1, 2016

I really want to move the code for MessagePack and CBOR out of the main header (it's over 1000 lines of code...) and I need some ideas how to best do this.

My thoughts:

  • C++ does not allow to extend the basic_json class in an additional file, so I can't just move the static functions (to_msgpack, from_msgpack, etc.) to a new header.
  • I could move these functions in the nlohmann namespace. So far, I think I do not rely on private functions of the basic_json class, so this could be possible.

The syntax would then be:

// assuming vec is a std::vector<uint8_t>
json j = nlohmann::from_msgpack(vec);
auto vec2 = nlohmann::to_msgpack(j);

I would be happy about opinions, pointers to best practices, etc.

Furthermore, any feedback on the interfaces, in particular using std::vector<uint8_t> as input/ouput would be more than welcome.

@nlohmann nlohmann added the state: help needed the issue needs help to proceed label Dec 1, 2016
@gregmarr
Copy link
Contributor

gregmarr commented Dec 1, 2016

You could keep it in the class but in another file through a #include in the middle of the class definition, but it's really ugly.

If you did need access to private functions/data, you could forward declare a msgpack class in the nlohmann namespace and make it a friend of basic_json.

Probably the easiest is to just put the declarations of the long statics in the class definition, and then put the full definitions in another header.

@user1095108
Copy link

Just write a separate library. You are outgrowing the bounds of this one. Slap in some RPC while writing the new one :)

@nlohmann
Copy link
Owner

nlohmann commented Dec 1, 2016

If I would include the other header (say, json_binary.hpp), I would end json.hpp's independence, right? I would like to have the binary stuff separate from the rest, so you would only need to include the other header if you need MessagePack/CBOR support.

@gregmarr
Copy link
Contributor

gregmarr commented Dec 1, 2016

Yes, the second header should only be needed for callers of those functions.

@user1095108
Copy link

I think the best thing to do would be to make a new repository entirely.

@TurpentineDistillery
Copy link

Or keep the functionality in the same file, ifdefed-out by default. Whoever needs it can enable it with #define NLOHMANN_JSON_ENABLE_CBOR In the source, or via a compiler switch.

@nlohmann
Copy link
Owner

nlohmann commented Dec 1, 2016

Any idea how to move the code into a second file without also needing a preprocessor macro?

@TurpentineDistillery
Copy link

Put declarations in the main file, and definitions in another. Such files are typically named *.inl

@TurpentineDistillery
Copy link

I.e what @gregmarr said

@nlohmann
Copy link
Owner

nlohmann commented Dec 1, 2016

Or another idea (just thinking): What about splitting the code into several files for development and have them merge by a script into a single (large) header?

@nlohmann
Copy link
Owner

nlohmann commented Dec 1, 2016

(Maybe my judgement of a large header is wrong, but I fear that letting it grow (a) increases compilation times, (b) makes understanding the code harder, and (c) pushes a lot of functionality into code that may only need a fraction.)

@TurpentineDistillery
Copy link

A big benefit of having one file with declarations and one-or-many files with definitions is that it would allow for explicit template instantiation, so that the compilation overhead related to json.hpp can be factored out into a separate translation unit.

@nlohmann
Copy link
Owner

I am currently cleaning up and bringing the test coverage to 100%. I hope to release CBOR/MessagePack by next week.

@nlohmann
Copy link
Owner

Merged e906933.

@marton78
Copy link

Hi @nlohmann, great job! I have looked at your code, some remarks:

  • When converting single / double, you have reimplemented memcpy. Why don't you just use it? (or std::copy, would be actually even nicer).
  • you could #define all those magic numbers (or put them in an enum)
  • msgpack / cbor are so similar that a lot of code is duplicated. probably you could template those conversions based on serialization format (so, two structs: msgpack and cbor, both of which contain all magic numbers, limits for the integers to fit in the first byte and the conversion functions from small integer to one byte)

All in all very nice job, thanks!

@nlohmann
Copy link
Owner

@marton78 Thanks for the feedback.

  • Regarding memcpy: I need to have another look.
  • Regarding the magic numbers / reduce code duplication: There are in fact a lot of similarities between CBOR and MessagePack, but I doubt that creating a configurable super type from which both formats inherit makes the code much more readable. I may play around with it, but right now I'd rather focus on writing down the limitations of the implementations (i.e., which types are not supported).

@nlohmann
Copy link
Owner

@marton78 Sorry, but I can't find the spot where I can use memcpy/std::copy.

@TurpentineDistillery
Copy link

Maybe it's this part?

            case value_t::number_float:
            {
                // Double-Precision Float
                v.push_back(0xfb);
                const uint8_t* helper = reinterpret_cast<const uint8_t*>(&(j.m_value.number_float));
                for (size_t i = 0; i < 8; ++i)
                {
                    v.push_back(helper[7 - i]);
                }
                break;
            }

Also, if this part of code expects the number_float_t to be double, a static_assert would be appropriate?

@marton78
Copy link

marton78 commented Dec 15, 2016

Yes, it's this part, std::copy_backward could be used here instead of memcpy though, since copying is done in reverse.

@TurpentineDistillery
Copy link

This looks like the endianness handling business?
If so, it can't be assumed that the host platform is little-endian. The ntoh and hton family functions are there to take the host endianness into account and reverse the bytes efficiently as appropriate, but these are not portable. It looks like boost endian library also does this and compiles to efficient byte-swapping instructions too (not saying we should add boost dependency).

@user1095108
Copy link

Lol definetly not, look at my swap file to get some ideas:

https://github.com/user1095108/generic/blob/master/swap.hpp

@nlohmann
Copy link
Owner

FYI: I asked to advertize the latest release on the CBOR (cbor/cbor.github.io#25) and MessagePack (msgpack/msgpack#221) website.

@nlohmann nlohmann added the aspect: binary formats BSON, CBOR, MessagePack, UBJSON label Mar 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aspect: binary formats BSON, CBOR, MessagePack, UBJSON kind: enhancement/improvement state: help needed the issue needs help to proceed
Projects
None yet
Development

No branches or pull requests

6 participants