Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Another idea about type support #774

Closed
nlohmann opened this issue Oct 8, 2017 · 28 comments
Closed

Another idea about type support #774

nlohmann opened this issue Oct 8, 2017 · 28 comments
Labels
state: help needed the issue needs help to proceed state: please discuss please discuss the issue or vote for your favorite option state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated

Comments

@nlohmann
Copy link
Owner

nlohmann commented Oct 8, 2017

The problem

The library currently offers some template arguments to configure the types used for numbers, strings, booleans, objects, and arrays. However, this type support is very limited, and even simple things like using std::unordered_map for objects or an std::vector with a user-defined allocator for arrays do not work. Even worse, there are tons of std::string usages in the code even though we actually have a type basic_json::string_t which is a typedef for the passed string type (see #766).

With issue #456 came the idea to replace the long template parameter list with a policy class. I like the idea, but I am afraid that this change does not improve things much with respect to the mentioned problems.

Another solution

One possible solution for a true type configuration could be the following: we define interfaces for each type that must be implemented by concrete implementations to be used in the basic_json class which would be refactored to only rely on this interface.

For instance, we could have a class hierarchy

class abstract_value
{
public:
    virtual ~abstract_value() = 0;

    virtual bool empty() const noexcept = 0;
    virtual size_type size() const noexcept = 0;
    virtual size_type max_size() const noexcept = 0;
    virtual void clear() noexcept = 0;
    
    virtual bool is_primitive() const noexcept = 0;
    virtual bool is_structured() const noexcept = 0;
    virtual const char* name() const noexcept = 0;
    virtual value_t type() const noexcept = 0;
    
    ...
};
class abstract_primitive_value : public abstract_value
{
public:
    bool empty() const noexcept override { return false; }
    size_type size() const noexcept override { return 1; }
    size_type max_size() const noexcept override { return 1; }
    
    bool is_primitive() const noexcept override { return true; };
    bool is_structured() const noexcept override { return false; };
    
    ...
};
class abstract_boolean_value : public abstract_primitive_value
{
public:
    const char* name() const noexcept override { return "boolean"; }
    value_t type() const noexcept override { return value_t::boolean; }
    
    ...
};

and finally

class boolean_value_imp : public abstract_boolean_value
{
public:
    boolean_value_imp() = default;
    boolean_value_imp(const bool v) : m_value(v) {}
    
    void clear() noexcept override
    {
        m_value = false;
    }
    
    ...
    
private:
    bool m_value = false;
};

to implement the Boolean value.

We could then have a traits class like

struct default_value_traits {
    using null_value = null_value_imp;
    using boolean_value = boolean_value_imp;
    using string_value = string_value_imp;
    using array_value = array_value_imp;
    using object_value = object_value_imp;
    ...
};

and finally a value like

template<typename traits = default_value_traits>
union json_value
{
    using null_value = typename traits::null_value;
    using boolean_value = typename traits::boolean_value;
    using string_value = typename traits::string_value;
    using array_value = typename traits::array_value;
    using object_value = typename traits::object_value;

    null_value *null;
    boolean_value *boolean;
    string_value *string;
    array_value *array;
    object_value *object;
    
    ...
}

We would provide the default implementations wrapping bool, int64_t, uint64_t, std::string, std::vector, and std::map as default_value_traits. If someone likes a different type, one just needs to implement the respective interface, add the implementation to a different traits class and pass it to json_value.

My questions

  1. What do you think about this approach?

  2. What types should we make configurable. I currently think of

    • null
    • boolean
    • number (integer)
    • number (unsigned integer)
    • number (floating-point)
    • string
    • object
    • array
  3. Should the keys of objects be of the same string type? What about strings used during serialization?

  4. What kind of constructors should we require? I sketched that Booleans should be constructible from bool. But what else? Should we require strings to be constructible from const char*? Or also std::string?

  5. How much overhead do we need to pay with this approach?

  6. How much behavior should we require? For instance, should a type bring its own dump() function? Should a number type bring its from_string(const char*) function for parsing?

This is a rough idea right now, but I would really appreciate feedback. I know there is a lot of work to be done, but I really want to get rid of all the limitations I described above.

@nlohmann nlohmann added state: help needed the issue needs help to proceed state: please discuss please discuss the issue or vote for your favorite option labels Oct 8, 2017
@theodelrieu
Copy link
Contributor

  1. I think we can avoid using virtual functions, seems to me that we should define Concepts for each type you mentioned. Then we can merge those which are similar.

  2. Seems reasonable, I don't really know about null though.

  3. The strict minimum? Should every type only take const char* and handle the parsing itself?

  4. If we don't use virtual functions, the overhead will mostly depend on the interface implementation.

  5. There's a lot of discussions about parsing customization points (especially in the std-proposal mailing list IIRC), we might adopt some of the suggestions? dump/parse is the strict minimum.

@gregmarr
Copy link
Contributor

gregmarr commented Oct 8, 2017

Should the keys of objects be of the same string type? What about strings used during serialization?

The type of the keys of the objects would, I think, be determined by a type in the object implementation. I could see a case for having one trait for the value string type, and another one for the serialization type. This might require conversion functions between the two types.

@theodelrieu
Copy link
Contributor

I've just seen Vinnie's Falco CppCon talk, and I agree that defining and using concepts is the way to go (the part about solving the allocator awareness problem was really interesting).

@nlohmann
Copy link
Owner Author

Thanks, I'll have a look.

@nlohmann
Copy link
Owner Author

@theodelrieu Thanks for the reference! Though I did not get all the details, this seems to be exactly the thing we need!!! I also realized @vinniefalco commented on some issues here before. :)

@vinniefalco
Copy link

Thanks for the mention!

I don't think the feature is necessary. I believe users want performance and ease of use. They want to customize the allocator, they want fast parsing, they want efficient manipulation of the contents of a JSON container. I don't think it is important to let users control the type of integer for representing numbers. Furthermore by allowing customization points, the implementation becomes highly constrained.

In fact we need the opposite, we need to refactor the interface so that the implementation is even less constrained. std::string should not be used at all in any part of a JSON library. Instead it should use std::string_view (or its equivalent).

Every JSON library that I have seen uses separate classes to represent each of the JSON values. These classes are first-class types. In this library you can see them with names such as json::array or json::object. These objects are copyable and assignable and form the building blocks for the JSON container. In my opinion, this is the wrong approach. The problem with using these first class types is that it makes using an allocator difficult. You have the same problem as the standard containers. The allocator must be propagated to every new element. And the syntax is going to be terrible. Imagine:

// create a new, empty array
auto js = json::array<my_alloc>(param);

Another problem with storing an allocator in every JSON value is that of storage. A stateful allocator is going to require at least 4 or 8 bytes depending on architecture. This will quickly add up for even modestly sized JSON documents.

If we can't use first-class types then how will we construct and return values? I think it is worth investigating the use of iterators. A JSON document can have a single associated allocator and expose begin and end functions to allow traversal of the top layer. Upon dereferencing, the iterator would return a proxy. So these would be InputIterator types. The proxy will expose sufficient functionality to allow the caller to determine the type of value and then call an appropriate member to view the value as a standard type: string_view, double, or long long. For the case of arrays and objects there will again be a proxy value that exposes a range. The array range would be integer-indexed while the object range would be indexed by strings.

I have to stress that the returned proxy would not be copyable, or else we are back to the problem of having to store numerous copies of an allocator.

I also suspect that in the "ideal" JSON library, the data structure produced by parsing should be different from the data structure produced by assembling a JSON object procedurally from code. It seems there could be performance gains from simply retaining the input buffer on a parse and setting up views over that data. For things like arrays and objects which require some extra data, an associated allocator may be invoked. To support parsing of multiple discontiguous input strings, a parser can handle the case where a string is split across buffers by performing a small allocation to linearize the string. I believe this hybrid approach could produce very nice results.

But if we use a different representation for parsing than editing how would that work? There could be a json_view type of class which represents a read-only, immutable JSON document. This abstracts the implementation for storing the tree from the way the tree is accessed by callers. In other words the JSON equivalent of string_view.

I have not explored these ideas very much but if you think they are interesting I would be happy to schedule a Google Hangout where we can get together and talk about them, and do some exploratory coding up of declarations in a shared code document (like http://codeshare.io/).

What do you think?

@gregmarr
Copy link
Contributor

gregmarr commented Oct 25, 2017

In fact we need the opposite, we need to refactor the interface so that the implementation is even less constrained. std::string should not be used at all in any part of a JSON library. Instead it should use std::string_view (or its equivalent).

This library currently only requires C++11, not C++17. Using std::string_view everywhere would require dropping support for compilers that only support C++14 and earlier. Replacing std::string with std::string_view everywhere would not be possible in general, as std::string_view is non-owning. There may be some cases where it's possible, but in others, it would require separate storage for the string, so you're back to having std::string or equivalent for that storage.

@vinniefalco
Copy link

vinniefalco commented Oct 25, 2017

std::string_view (or its equivalent)

@gregmarr
Copy link
Contributor

I really need to see Vinnie's talk. I was in the SG14 meeting, so I couldn't see it live.

@gregmarr
Copy link
Contributor

Yes, we could have an equivalent class for compilers that don't support C++17, but that doesn't eliminate the other part of the problem, the fact that it doesn't actually provide any storage.

@vinniefalco
Copy link

The storage is provided by the JSON container

@gregmarr
Copy link
Contributor

Exactly: There may be some cases where it's possible, but in others, it would require separate storage for the string, so you're back to having std::string or equivalent for that storage.

So now you have some arbitrary storage, and a string_view. Why is that better than string?

@nlohmann
Copy link
Owner Author

Thanks for the quick response!

I agree that ease of use is the main reason for people using this library. Secondary reasons are performance (and allocators, of course #766 #25), but also a lot of people ask how they can control various details of the library's behavior such as the number serialization format (#777). I also heard requests for specific number formats (#799) or maintaining the insertion order in objects (#727). Also the parser strategy (pull vs. push) is a frequently discussed issue.

With the sketched approach, I wanted to decouple the value types from the rest of the library. @theodelrieu made a similar proposal to split the actual JSON-specific code from the underlying values. I hoped that by specifying the interfaces, the whole library becomes cleaner.

At the same time, I also fear that this not only would mean a complete rewrite, but also make things "configurable" in such a clumsy way, that hardly anyone would use this feature...

Your description of allocator usage shows me that the described approach would be less helpful as I thought - but I have little to no knowledge on allocators in the first place...

I have not fully understood how your iterator-based approach looks like. So I really second the idea of getting together in a Hangout.

On the idea of having a view on parsed data: This sounds interesting, but I am not sure whether we should prioritize this. I am also not sure whether this is possible, because strings may need to be processed to cope with "\uxxxx" literals.

In any way - I would love to progress the whole type idea, because it has been a hack from the start and it would feel good to have solved this in a nice way. However, I would like to maintain support for C++11 compilers (which still seems to be an issue for many systems), and also would like to keep the library's interface as "STL-like" as it is right now.

@vinniefalco
Copy link

the parser strategy (pull vs. push) is a frequently discussed issue.

Yes, and also incremental parsing (does this library support that?) I have not yet found a JSON library that allows pieces of input to be supplied in separate calls. This would be very useful for implementing a beast Body container.

On the idea of having a view on parsed data: ... I am not sure whether we should prioritize this. I am also not sure whether this is possible, because strings may need to be processed to cope with "\uxxxx" literals.

Right, so a bit of background. I approach library development from an interface perspective. It seems to me that the interface is the most important aspect of a library, as implementation can always be changed later in a manner that is transparent to users.

I'm not suggesting that having a view on parsed data is a priority for implementation, but rather I am saying that a good design will allow for this use-case. It should not be the only parser implementation - a good design will allow multiple parser implementations to present a unified interface for the caller to view the results, while each implementation can have radically different techniques. For example, if the user cares about "\uxxxx" literals then they use a general purpose, allocating parser. If they want pure speed without decoding those literals (a common use-case in many REST APIs) they can choose a different parser. In each case the parser exports its results in some kind of view that has a uniform interface.

I have not fully understood how your iterator-based approach looks like. So I really second the idea of getting together in a Hangout.

To be honest I don't understand it fully either :) I just have a collection of ideas gained from working with JSON in a few projects. They need to be refined to be usable. I think with your considerable experience (I didn't even know about "\uxxxx" literals) we might come up with something good.

I would like to maintain support for C++11 compilers

I totally agree. It is far too early for anyone to be dropping support for C++11 (Beast requires only C++11 and likely will remain that way for years). A JSON library could provide its own string_view.

@vinniefalco
Copy link

Consider joining the cpplang Slack: https://cpplang.now.sh/
My userid is @vinnie

@theodelrieu
Copy link
Contributor

Thanks for your input! I'll gladly join the Hangouts :)

@gregmarr
Copy link
Contributor

FYI, there is also a Slack workspace for this project. https://nlohmannjson.slack.com/ I am on both.
I don't have time today for a hangout, so if you could summarize the conclusions, that would be great. Thanks for your interest @vinniefalco

@vinniefalco
Copy link

I need an invite to the JSON slack (vinnie.falco@gmail.com)

@gregmarr
Copy link
Contributor

Invite sent.

@nlohmann
Copy link
Owner Author

I had a brief chat with @vinniefalco on Wednesday and he explained to me the issues he sees in the current library implementation. I try to summarize them as I understood them - please correct me if I'm wrong:

  • The way values are stored is unfriendly for user-defined allocators. Vinnie would rather see a storage that is independent from STL containers and uses some node-based approach that uses a single allocator for each added node.
  • Currently, a json value owns all data it represents. Hence, combining values (e.g., with j["other"] = j2;) leads to copies which could be avoided if we take an approach where we store a reference to j2 instead.
  • When a JSON text is parsed, all data is sequentially copied into the JSON value hierarchy. If we only want to read from the JSON value, then we could avoid copies if we would only store references to the JSON text and provide a "JSON view" on the input buffer.

As for a Hangout: How about 8 p.m. CET (http://everytimezone.com/#2017-10-30,1860,cn3) on Monday, October 30?

@theodelrieu
Copy link
Contributor

I'll be available from November 3rd unfortunately, if you don't mind reporting the meeting, I'd really like to participate :)

@nlohmann
Copy link
Owner Author

I am OK with waiting until November 3.

@vinniefalco
Copy link

vinniefalco commented Oct 28, 2017

@nlohmann Also, there is no single implementation which is optimal for all use-cases. The simplest example is "parse to read-only view" versus "edit a new json document". Another example, consider an "insert-only" container.

@nlohmann
Copy link
Owner Author

nlohmann commented Nov 3, 2017

Sorry, I won’t make it tonight.

@theodelrieu
Copy link
Contributor

Hi, let's post on Slack our disponibilities, to try to meet next week ;)

@quicknir
Copy link

I just wanted to throw in my 0.02 here. It's not possible to be optimal for all use cases, but one valuable technique is for your API to have layers: there are lower layers that are still public. These layers make fewer assumptions, are less convenient for common use cases, but are very flexible and "zero cost".

IMHO, for a json library there should always be a lowest level public API which is applying a visitor to a stream of data assumed to be in json format (or I guess, multiple functions, one for each fundamental stream of data you can deal with).

template <class F>
void apply_json_visitor(const InputTypeStream&, F&& f);

The exact interface provided by the visitor and how it receives callbacks is obviously something that requires some thought. But it is not really that complicated. The library should actually call to this function for all of its high level use cases (like creating json objects), so you have that as a sanity check. This very cleanly separates parsing logic, from data structure logic. The link between them is of course an appropriate visitor, that simply populates a recursive datastructure as it receives callbacks.

Fun fact: passing a trivial visitor to the apply_json_visitor will be a maximally efficient way to verify that the stream is valid json. AFAIK this can only currently be done in this library by actually extracting a data structure which will trigger many heap allocations. Just a fun example!

Once you make this separation, it becomes easier to inject customization points, because it's clear where they belong. You can also progressively layer things: you can create visitors that make certain assumptions, but allow certain customizations as well, to simplify semi-common use cases like changing the canonical data structures it's parsed into. You can also write a visitor that provides the input iterator type interface. To tie in with some of @vinniefalco 's ideas: if you want to ignore certain types of tokens and quickly parse others, you may not need a data structure at all, just a very simple visitor that has null implementations for some callbacks and immediately handles others on the spot.

The only thing this doesn't give you I suppose, is if you really want to manually control the progress of the parser as it steps through. AFAIK though, in order to do this you have to step away from a true recursive descent algorithm and actually keep an explicit stack (which may or may not be slower; would definitely require some thought). I think though that the approach I'm suggesting will cover most use cases. It's basically inversion of control: for loop for std::for_each. The main thing you get when you can manually control step through is the ability to exit parsing early. You can still do this with an exception, but if you want to exit early as an optimization then an exception will probably be too slow.

@stale
Copy link

stale bot commented Dec 21, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated label Dec 21, 2017
@stale stale bot closed this as completed Dec 28, 2017
@tiagomacarios
Copy link

@nlohmann Any update on using wstrings? I am working on a project were this library would be the perfect fit, but the whole codebase is wstring...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
state: help needed the issue needs help to proceed state: please discuss please discuss the issue or vote for your favorite option state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated
Projects
None yet
Development

No branches or pull requests

6 participants