NaN canonicalization #134
Good observation! I agree.
I can think of two design principles that would motivate this: …
Are those the principles motivating this suggestion, or am I off base? Regardless, are the underlying design principles documented anywhere?
I'm thinking about this in the context of this part of the interface-types motivation:
To promote maximal reuse, we should discourage APIs that rely on NaN bits to convey meaningful information, because such APIs wouldn't be accessible from JS or other languages with a single NaN.
The reasoning seems to be that interface types should restrict communication between components to information that all languages can retain in their obvious representation without loss. That seems to add yet more scope creep to interface types, and fairly fuzzy scope at that. Also, there's nothing that says that an interface-types …

There also might be a cost to this. I imagine some day we will have bulk operations for lifting and lowering …
To be sure, I'm still exploring the space here. One option would be to say that it's nondeterministic whether NaNs are canonicalized at boundaries. That would let implementations skip the overhead, but still signal the intent of …

Another option would be to observe that NaN canonicalization is SIMD-optimizable, so we may be able to make it fairly fast. It still wouldn't be free though, especially for small-to-medium arrays.

Does OCaml have a way of representing a value which is unboxed if it's in the 31-bit integer range, and boxed otherwise? Could it use its tagged integer representation for this? If so, it wouldn't have to box in the common case, and it wouldn't have to fail on values that other languages accept.
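To make the bit-level effect concrete, here is a minimal Python sketch. The helper names and the choice of canonical bit pattern (the quiet NaN 0x7FF8000000000000 commonly used by engines) are my own illustration, not anything specified by this proposal:

```python
import struct

CANONICAL_NAN = 0x7FF8000000000000  # quiet NaN: sign 0, exponent all ones, zero payload

def f64_bits(x: float) -> int:
    """Reinterpret a double as its 64-bit pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def f64_from_bits(b: int) -> float:
    """Reinterpret a 64-bit pattern as a double."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def canonicalize(x: float) -> float:
    """What a canonicalizing boundary would do: collapse every NaN to
    one bit pattern, pass every other value through unchanged."""
    return f64_from_bits(CANONICAL_NAN) if x != x else x

# A NaN carrying a nonzero payload in its low mantissa bits:
payload_nan = f64_from_bits(0x7FF800000000BEEF)

assert f64_bits(canonicalize(payload_nan)) == CANONICAL_NAN  # payload lost
assert canonicalize(1.5) == 1.5                              # non-NaNs untouched
```

The per-value cost is just the `x != x` test plus a conditional select, which is why a SIMD implementation over arrays is plausible, though not free.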
Regardless of what the answer to this technical question is, as of the vote on the scope yesterday, this sort of question seems squarely within the scope of interface types (and the component model), specifically under promoting cross-language interoperability. This isn't the first time this sort of question has popped up and it won't be the last. In all these questions, there is an inherent tension between expressing every last thing a particular language might want and defining language-agnostic interfaces that are widely implementable.

By way of comparison, I'd be surprised if COM or protobufs make any guarantees about preserving NaN payloads across interfaces. That would mean they've implicitly chosen the "non-deterministically canonicalize NaNs" route @sunfishcode suggested as another option above. Given wasm's emphasis on determinism, it makes sense for us to ask if we should choose differently.

I think there's a third motivation in addition to the two @tlively gave: if …
Yes, these sorts of questions will come up regularly. One way to resolve them is to have everyone fight over an answer, pick a winner, and then apply that solution to everyone. That strategy results in lots of fights and comes out with winners and losers. Another way to resolve them is to find a way to accommodate everyone's needs. Adapters (even simple ones) and fusion provide a great way to make this possible. For example, a producer of an API for which NaNs are supposed to be insignificant can use an …

Meanwhile, an API intended for efficient numeric programs (and which has little interest in JS programs) can still use interface types as a means of efficient shared-nothing data transfer. Their needs are not bottlenecked by others' irrelevant needs. As a bonus, if the numeric program using the API happens to rely on NaN canonicalization for its own purposes, it can use …

I would rather interface types offer options to people than impose choices on people (so long, of course, as it also provides sufficient isolation that others' choice among the available options does not interfere with one's own choice).
This is perfectly reasonable: to have more than one coercion operation. That is one of the fundamental merits of the adapter fusion approach.
I'm not sure I follow the counterargument. The consumer is generally free to do whatever it wants with the data. If you didn't provide …
Should it be possible to declare an interface with an …?

Choices so far include: …
The choice I suggested was: …
In fact, the (non-default) lowerer for JS could lower to the number type for non-NaNs and lower to BigInt for NaNs.
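A sketch of that lowering policy, with Python standing in for the JS embedding and an integer of raw bits standing in for BigInt (the function names and tagging scheme are my own invention):

```python
import struct

def f64_bits(x: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def f64_from_bits(b: int) -> float:
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def lower(bits: int):
    """Lower a wasm f64 (given as raw bits): a plain number for non-NaN
    values, the raw bits (BigInt-like) for NaNs so the payload survives."""
    x = f64_from_bits(bits)
    return x if x == x else ("nan_bits", bits)

def lift(v) -> int:
    """Inverse direction: re-lift either form back to raw f64 bits."""
    return v[1] if isinstance(v, tuple) else f64_bits(v)

assert lift(lower(0x7FF800000000BEEF)) == 0x7FF800000000BEEF  # payload round-trips
assert lower(f64_bits(2.5)) == 2.5                            # common case is a number
```

The design point this illustrates: the common, non-NaN case stays on the cheap native representation, and only the rare NaN case pays for a boxed representation.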
It seems like a reasonable balance that also allows for incremental progress would be …
Depends on what the goal is. If maximum interoperability is the goal, then canonicalization seems essential.
When I was helping with the early days of Kotlin, they wanted to interop with Java but also be null-safe, which posed an obvious problem. The solution strategy I developed for them, at a high level, was to have an ergonomic default—the type-checker would treat Java values as non-null but automatically insert checks where such assumptions were made—but also a more manual fallback—the type-checker would also recognize that Java values were potentially null and still permit programmers to test them for nullness before checks were automatically inserted. That interop strategy was quite successful and is analogous to what I am suggesting here.
The central question is whether the abstract set of values admitted by an …

The concrete experience we have from years of IEEE 754 usage in the wild is that non-canonical NaNs aren't useful in practice (other than for NaN-boxing, which wouldn't be a cross-component thing) and mostly only serve to cause language/toolchain vendors to waste time worrying about them, so if indeed the runtime cost is negligible, then I don't see why we wouldn't take the opportunity to (slightly) improve the robustness and simplicity of the ecosystem.
That is no longer an interoperability argument. That's fine; I'm just pointing out that the argument has moved to cleaning up the past (in a non-opt-in fashion). Regardless of what we decide here, languages and tooling will have to worry about NaNs. I don't see how canonicalizing over the boundary will help with that. Even for languages that rely on NaN-canonicalization (e.g. for branch-free NaN-insensitive equality comparisons or hashing), if here you choose a different canonical NaN than the one they chose for their runtime, then they'll have to recanonicalize everything anyways. I worry that extending the scope of interface types beyond efficient transfer/communication/interaction to the point that we have to try to anticipate/review all programs' needs to come to an answer makes for an infeasible and contentious goal.
That's not the goal; see NaN-boxing
In practice (e.g., JS engines today), the canonical NaN bit pattern is an arbitrary, globally configurable constant, so as long as it is standardized, it can be …
The context here is the Component Model, and the goals and use cases (recently confirmed by the CG) now definitely extend past simply "transfer/communication/interaction" to robust composition of components implemented in different languages. Maybe it's a bad idea and we'll fail -- but I believe that's within the scope of this proposal.
So a component I would expect to be supported by such a model is floating-point libraries (e.g. ones providing functions like …).

If a goal of the component model is to be able to make libraries like libc into components …
There are existing multi-language systems, some of which have formal guarantees about how components from different languages compose and interact. The norm in these systems is, when composing components from different languages, to insert coercions at the boundary point that are necessary for the two specific languages at hand. (Typically these coercions are auto-generated from the types, though sometimes they can be explicitly specified by the person doing the composing.) So, if you're composing two components of the same language, then you insert identity coercions. But if you're converting between different languages with different representations of the same concept, then you insert a coercion that maps between those representations as faithfully as possible (e.g. mapping arbitrary 64-bit doubles to [non-NaN doubles + NaN] in the obvious fashion, and then selecting a specific 64-bit NaN representation in the reverse direction). The relevant composition theorems still hold with such coercions. One common composition theorem multi-language systems strive for is that composing two components of language L within language L results in the same behavior as composing two components of language L each as part of the multi-language system. That makes it possible for a program to, say, …

On the other hand, having a "global" bottleneck representation hinders, rather than aids, multi-language systems. It limits what languages you can add to the system, because you've required that their natural coercions be at least as expressive as the bottleneck (more formally speaking, that they have a semi-invertible surjection to the bottleneck representation). Or it means that, as you consider more languages, you'll have to narrow your bottleneck further. This is the issue I raised with 31-bit integers being the "natural/ergonomic" counterpart in OCaml. So my understanding of multi-language systems suggests that …
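The coercion shape described above can be sketched as follows (Python; the abstract [non-NaN doubles + NaN] view and all names are my own illustration). Note how the forward direction is faithful on non-NaN values while the reverse direction must commit to one concrete NaN bit pattern:

```python
import struct

CANONICAL_NAN = 0x7FF8000000000000  # the one pattern chosen for the reverse direction

def f64_from_bits(b: int) -> float:
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def to_abstract(bits: int):
    """Coerce raw f64 bits to the abstract value: a real number, or 'the' NaN."""
    x = f64_from_bits(bits)
    return ("num", x) if x == x else ("nan",)

def from_abstract(v) -> int:
    """Reverse coercion: a specific bit pattern must be selected for NaN."""
    if v[0] == "num":
        return struct.unpack("<Q", struct.pack("<d", v[1]))[0]
    return CANONICAL_NAN

# Faithful on non-NaNs; distinct NaN payloads collapse to the chosen pattern:
assert to_abstract(from_abstract(("num", 3.0))) == ("num", 3.0)
assert from_abstract(to_abstract(0x7FF800000000BEEF)) == CANONICAL_NAN
```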
Libraries like libc are specific examples given of what would stay as shared-everything modules (possibly imported and reused inside a component). Components are not meant to subsume intra-language modules and linking -- even when components exist, you'll naturally want to use language-native modules and ABIs for rich, deep language integration.
Which systems are you talking about, concretely? Because, if you look at COM or microservices-composed-with-gRPC (which you could roughly summarize as the two large industrial endeavors in this multi-language-composition space in the past), none of the things you're saying hold. It's possible you're thinking about systems that compose, say, within the .NET universe, or within the JVM universe, and those systems will have a natural inclination to unify the world on the runtime's native object/string/etc concepts, but with both wasm and the component model we're talking about a very different setting.

Fundamentally, the problem with having canonicalization be a pairwise detail is that you lose any abstract semantics for a component in isolation. That is, if I give you a component A, you can't tell me, semantically, what it precisely does without knowing which component B uses it and which languages the two components were implemented in. That's the opposite of "black box reuse", which is one of the fundamental goals of components.
Hmm, one side effect of rebasing interface types on top of the component model that I hadn't thought of before is that we no longer have a roadmap for intra-component FFI. Previously we had been punting all issues of inter-language communication to IT, even for boundaries between modules in the same trust domain. For that fully trusted FFI use case, preserving NaN bits is a perfectly reasonable thing to want to do, and I think that's what @RossTate is getting at above. Do we have a story for creating polyglot components? One way to sidestep the whole problem would be to have both canonicalized and non-canonicalized float interface types, but that seems like a big hammer. Would it also be possible to lift an f64 with NaN boxing into a sum type and lower it back to a NaN-boxing f64 on the other side as a runtime no-op?
My operating assumption is: regardless of same-language vs. different-language, all the code inside a single component either is generated by the same toolchain or adheres to the same ABI (like this one). Because of that, there's no strong need for interface types or shared-nothing linking: everyone just follows the same ABI using the same linear memory and trusts that everyone else does as well. (Interface Types are for when you don't want to assume those things.)
Yes: use core wasm and use the module-linking features to link those core modules together (noting that module-linking features don't require interface types and are happy to link up core modules as shown). |
Thanks, @tlively, for effectively rephrasing one of my concerns. My connection to …

However, the floating-point libraries I mentioned still seem like perfect examples of what a (multi-language) component model should support. They are easily shared-nothing: they provide solely "pure" input-output functions that don't have state or even need a shadow stack. And they are regularly used in multi-language settings: Java programs using …

But I think the higher-level issue here is agreeing upon what a Multi-Language Component Model is. From your presentation, I understood a component to be a piece of software that implements some API and is self-contained/isolated (i.e. shared-nothing) except through explicit data exchanges (via interface types). To me, the above floating-point libraries match that description. I suspect we roughly agree on that high-level description of components—where we disagree is on what multi-language means. From the discussion above, my sense is that the interpretation of multi-language y'all are arguing for is that all expressible component APIs can be conveniently implemented and utilized to their full extent by "any" language. But to me, multi-language means that a component implementing an API can be implemented by "any" language capable of implementing that API. So if the API is to maintain a stateful counter, then "pure" languages like Haskell (or, even more constraining, Coq) are probably not what you're going to implement your component with. And if the API requires preserving NaNs, then JavaScript is probably not what you're going to implement your component with. And if that API offers more precision than some other components (or languages) need, then those other components simply won't utilize the full benefits of the services your component offers (and no one is hurt by that).
I consider my interpretation to be additive—the more languages you include in "any", the more APIs you can (conveniently) support—whereas the other interpretation seems to be subtractive—the more languages you include in "any", the fewer APIs you're allowed to support. I don't see the value of the subtractive interpretation (are we going to restrict all components to be pure functions so that Haskell/Coq can call them conveniently?), but I do see value in the additive interpretation.

An example that comes to mind is decimal values. C# and .NET offer direct support for 128-bit decimal floating-point values, including syntactic integration into the C# language and hardware acceleration in the runtime. This is extremely valuable to financial programs. With the subtractive interpretation, we wouldn't add something like …

With the additive interpretation, we would add something like …

I would like interface types to provide a system where people can deliberately write different components of a program in different languages according to the strengths of those languages and then conveniently compose those components in order to collectively construct programs that no one programming language could write well. That's what a multi-language component model means to me, and to me that means that interface types should broaden rather than restrict interactions.
I don't think the general question you're asking can be answered definitively in the abstract with either of the extreme positions you're suggesting ("only if all languages" vs. "only if any language"). It's easy to think of examples where either extreme position will lead to bad outcomes, and thus I don't think we can simply argue for one or the other in the abstract. Rather, as is usual with standards design, we have to consider the practical reality of use cases and weigh pros vs. cons, case by case. There are real practical downsides (listed above) to allowing non-canonical NaNs to cross boundaries, and I think all the use cases for supporting non-canonical NaNs are hypothetical. Moreover, in line with what @tlively suggested, if real use cases did emerge, the right way to support them would be to add a second type; this would be a clear and useful signal to all implementations involved that they should take the extra pains to preserve the NaN payload. (E.g., a JS binding could then produce something other than a ….)
Okay, so we're back to not treating this as a problem about multi-language interop, but rather specifically about floating-point.
I gave real existing libraries that real existing languages currently link to in a cross-language shared-nothing manner, and the specifications of those APIs explicitly state requirements (in line with IEEE 754-2019 recommendations) that cannot be supported with NaN canonicalization. As many language runtimes link against a foreign-implemented library for these floating-point functions (and expect it to preserve NaN payloads per IEEE 754 recommendations), such a component would provide a service that could be shared by many programs implemented across many languages. Could you articulate why you believe this is not a viable use case for interface types?

While you've listed some hypothetical downsides, they did not seem to me to be articulated in sufficient depth to establish that they are real and practical. It would help me understand your perspective better if you were to elaborate on (one of) them further. For example, you mention tooling, but I don't see how NaN canonicalization would affect tools like LLVM/Binaryen—you have to know how the rest of the program handles (or ignores) NaNs, which only the programmer knows, and hence there already exist various compiler flags to indicate how much flexibility to grant the compiler with respect to floating-point values. Maybe you have something else in mind, but without an elaboration on what that something is, I have a hard time seeing how NaN canonicalization would have a real practical benefit for tooling.
Late to this party, and I am not an expert in this area so feel free to ignore, but my gut reaction is that defaulting to canonicalization would not be desirable. My thinking:
So my solution would be for …

That, or introduce a …
If we were to compile that OpenLibm …
It would be easy to modify the code to do what .NET does to ensure IEEE compliance. Right now, the functions in those libraries typically have just one line that relies on the fact that either … or ….
More generally, on a platform where …
Two thoughts:
I get why …
I've looked around for a source for .NET needing NaN payload propagation functionality and haven't been able to find one yet.
@sunfishcode There are some links I can fish up on this that you might like, but it looks like I won't have time to re-find them until Friday. Sorry for the delay.
First, for some context on IEEE 754, this document provides some useful background for the most recent changes:
Among these new operations are …

As for .NET, unfortunately the language spec significantly underspecifies many things, so you have to dig through the code comments and repo history for the specification "in practice". That link above is one example I found indicating they want IEEE conformance. This comment suggests they believe it makes the platform more appealing to numerical libraries: …
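For reference, IEEE 754-2019's recommended payload operations can be sketched like this in Python (my simplification: for binary64 I treat the payload as the 51 mantissa bits below the quiet bit, and follow the standard's convention that getPayload returns -1 for non-NaN inputs; consult the standard for the precise semantics):

```python
import struct

QUIET_BIT = 1 << 51           # quiet-NaN bit of a binary64 mantissa
PAYLOAD_MASK = QUIET_BIT - 1  # the 51 payload bits below it
EXP_ALL_ONES = 0x7FF << 52    # exponent field of any NaN or infinity

def bits(x: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def from_bits(b: int) -> float:
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def get_payload(x: float) -> float:
    """getPayload-style accessor: the NaN's payload as a nonnegative
    integer-valued float, or -1.0 if x is not a NaN."""
    if x == x:
        return -1.0
    return float(bits(x) & PAYLOAD_MASK)

def set_payload(p: int) -> float:
    """setPayload-style constructor: a quiet NaN carrying payload p
    (+0.0 if p doesn't fit, mirroring the standard's convention)."""
    if not (0 <= p <= PAYLOAD_MASK):
        return 0.0
    return from_bits(EXP_ALL_ONES | QUIET_BIT | p)

assert get_payload(set_payload(1954)) == 1954.0
assert get_payload(2.0) == -1.0
```

The point relevant to this thread: these are standardized, spec-recommended operations whose results a canonicalizing boundary would destroy.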
Interestingly, there happens to be this recent issue, filed by a customer, about a problem caused by (unintentional) NaN canonicalization, expressing this concern: …
That seems like a concern that would apply to components for (de)serializing numerical data. Digging around, it seems that some statistics software indeed makes use of NaN payloads. One concrete example I found was that R uses payloads to distinguish between NA and ordinary NaN.
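As a hedged illustration of R's scheme (R is documented to encode its NA_real_ as a NaN whose low 32-bit word is 1954; the helpers below are my own sketch, working on raw bits only), note how boundary canonicalization would collapse NA into an ordinary NaN:

```python
R_NA_BITS = 0x7FF00000000007A2      # R's documented NA_real_: a NaN with low word 1954
CANONICAL_NAN = 0x7FF8000000000000  # the canonical quiet NaN most engines use

def is_nan_bits(b: int) -> bool:
    """NaN iff the exponent is all ones and the mantissa is nonzero."""
    return (b & 0x7FF0000000000000) == 0x7FF0000000000000 and (b & 0x000FFFFFFFFFFFFF) != 0

def is_r_na(b: int) -> bool:
    """R distinguishes NA from ordinary NaN by the payload's low word."""
    return is_nan_bits(b) and (b & 0xFFFFFFFF) == 1954

def canonicalize_bits(b: int) -> int:
    """A canonicalizing boundary, expressed on raw bits."""
    return CANONICAL_NAN if is_nan_bits(b) else b

assert is_r_na(R_NA_BITS)
assert not is_r_na(CANONICAL_NAN)
assert not is_r_na(canonicalize_bits(R_NA_BITS))  # NA collapses to plain NaN
```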
Hope that helped satiate your curiosity!
The evidence I would like to see is users (of floating-point libraries) actually caring about NaN payloads. From what I've heard of the history of this feature, the use case that motivated the whole concept of NaN payloads (stuffing error codes into NaNs that implicitly propagate along with the float values) never actually materialized. Or, said differently: if IT canonicalized NaN payloads, it sounds like no end user would actually notice, much less complain.

And, to reiterate what I said above: if someone did actually notice, then due again to the high divergence in NaN-payload propagation, the right solution would be to add a separate type that explicitly called out "I care about the NaN payload", so that the interface contract is explicit and extra steps could be taken in languages that don't propagate NaN payloads by default. This isn't hypothetical: this is what SpiderMonkey actually had to do in the past to implement the .wast test suite when it was compiled to JS, and what SpiderMonkey-on-wasm would want to do if such NaN-payload use cases ever materialized.
R programs are linked against a component (not written in R) providing IEEE-compliant implementations of common floating-point functions. These programs rely on the fact that such a component preserves payloads and thereby maintains the distinction between NA and NaN.
Googling how R uses NaN payloads, I only see mention of the special "NA" value, with links to R documentation explaining that the propagation of NA is unreliable and varies by platform and compiler, which reinforces the overall point made by this issue: non-canonical NaNs are a source of portability problems and should not be relied upon. Moreover, due to this loose R spec language, coercing NA to a canonical NaN is allowed, so it's not clear whether, to my final question, any end user would actually notice NA vs. NaN.

More generally, my assumption here is that the cross-language interop in the particular case of R is only between the usual C-family set of ABI-compatible languages, which would naturally depend on shared-everything linking due to the prevalent use of shared linear memory in these math-intensive libraries (e.g., when passing a matrix or large vector). Thus, I don't think this concrete example would be an instance of cross-component NaN payload propagation, but rather another instance of shared-everything linking. Another example would be Rust using C-implemented …

Most generally, due to the R-documented portability problems, if R wanted to provide deterministic NA semantics and support fully cross-language shared-nothing linking, then the only robust option would be to keep NA as an implementation detail of the R component and reflect NA in public interfaces as explicit types that will show up well in all languages, like variants with error cases that have explicit payloads.
I know we are talking about IEEE 754 here, where NaN bit patterns were designed to provide error handling over the different scenarios where NaN arises... but such error handling is not really used. A modern and serious revision of the numerical representations used for computations (this is what …).

In this format, there is no such thing as multiple kinds of NaN, under the premise that they are a waste of bit patterns that could instead be used to improve accuracy.
Ah, very good point: if there is some new variation on IEEE 754 floats on the horizon that is a drop-in replacement for IEEE 754 floats except for the removal of non-canonical NaNs, that seems like an additional argument in favor of not including them in component interface float semantics.

Incidentally, I'm actively working on a rebase of this proposal onto the recently-merged Module Linking rebase (itself rebased onto the Component Model, as proposed earlier in the year), and I'm planning to include the fix to this issue. In particular, the current idea is for …
It seems potentially noteworthy that even though NaN representation distinguishability is itself left implementation-defined by ECMA-262, behavior is specified to be deterministic in the operation that makes it potentially observable: …
Edit: lol I meant to set b[0], not a[0] there, oops. In V8 the representation propagates, in Spidermonkey it canonicalizes, but in both it does so reliably. This seems to make a case for at least determinacy, esp. given the role TypedArray plays in bridging JS & WASM. I'm totally out of my depth so this hardly counts as a meaningful opinion - I just figured it might be worth describing the specifics of how JS handles it in the spec & in practice because JS has been mentioned as a factor a few times and determinacy is one of the questions, but the specified-determinacy of Float32Array and Float64Array did not appear to have been previously noted. |
Some popular source languages don't have the ability to preserve NaN bit patterns. For example, JS has only a single NaN value and makes no guarantees about preserving specific bit patterns. NaN bit patterns could become an API feature that some source languages support and others don't. Consequently, interface types should consider canonicalizing NaNs in f32 and f64 values at interface boundaries.