-
Notifications
You must be signed in to change notification settings - Fork 601
IV vs PV serialisation #18958
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IV vs PV serialisation #18958
Conversation
|
Sub::Quote, JSON::Tiny, and Mojo::JSON also look at the flags (indirectly) to decide how to serialize scalars. They all pass their tests with this branch. |
|
I've raised this PR for further discussion on p5p@ |
|
The intent to preserve initial-stringiness can’t be achieved without making numification not set IOK/NOK. Consider (on upstream): The Basically, in order to preserve initial number/string it seems like we have to:
Per @Grinnz via IRC:
Sereal also prefers the number, as does CBOR::Free. CBOR::XS prefers the string. |
|
@FGasper Sure - we haven't claimed this PR is in any way sufficient for making all code able to distinguish. It's just a good start. It helps in some situations and - we hope - doesn't make anything else worse. There are of course many follow-ups that can be done afterwards. |
|
Just to establish a destination for this journey, I believe it should ideally be the case that after my $x = "6";
my $y = $x + 36;the flags on Basically:
I must add that I have no sense of the feasibility of getting there without too much breakage, but is there consensus that this is the platonic ideal? |
This PR on its own is not intended to fix how existing serialization code picks types, but to allow new code to be written that can make those choices accurately. Existing serializers need to cope with how perl currently sets its flags, so they need to use heuristics. With this PR merged, any value with POK set is guaranteed to have started as a string. So a POK+IOK value, like in your example, is a string. No heuristic needed. Taking advantage of this would require updates to serializers, who would need to have different behavior on different perl versions. It would also be worth trying to avoid setting IOK/NOK when numfying strings. That can be addressed separately though. |
|
On Wed, Jul 28, 2021 at 01:46:23AM -0700, Aristotle Pagaltzis wrote:
Just to establish a destination for this journey, I believe it should ideally be the case that after
````perl
my $x = "6";
my $y = $x + 36;
````
the flags on `$x` would be POK=on IOK=off and on `$y` POK=off IOK=on, but `$x` would also have pIOK=on (assuming my understanding of the flags is correct…).
Note that currently the IOK vs pIOK flags (and similarly for NOK/pNOK)
are used to distinguish between an accurate and an approximate conversion:
$ p -MDevel::Peek -e'my $x = "6"; my $y = int($x); Dump $x' 2>&1 | grep FLAGS
FLAGS = (IOK,POK,IsCOW,pIOK,pPOK)
$ p -MDevel::Peek -e'my $x = "6.1"; my $y = int($x); Dump $x' 2>&1 | grep FLAGS
FLAGS = (NOK,POK,IsCOW,pIOK,pNOK,pPOK)
…--
You're only as old as you look.
|
|
General ping here - What's the overall state? The general feel I get from reading the comments is "this is fine, but not sufficient to give us proper type-intent distinction". Which is what we expected. I don't feel anyone has said "Don't do this", just people have said "This isn't enough". Is that accurate? If so perhaps we can merge this and continue working |
|
On Tue, Aug 10, 2021 at 02:17:07AM -0700, Paul Evans wrote:
General ping here - What's the overall state?
The general feel I get from reading the comments is "this is fine, but not sufficient to give us proper type-intent distinction". Which is what we expected. I don't feel anyone has said "Don't do this", just people have said "This isn't enough".
Is that accurate? If so perhaps we can merge this an
I'm worried about the performance implications. Various hot places in the
core (such as pp_add()) take optimised paths in the presence of IOK, but
not pIOK.
…--
The crew of the Enterprise encounter an alien life form which is
surprisingly neither humanoid nor made from pure energy.
-- Things That Never Happen in "Star Trek" #22
|
|
Relating to @iabyn's comment, and a few thoughts of my own on other PRs: Do we have a good way to benchmark these sorts of things? Primarily in terms of CPU time, but also perhaps in memory usage. I've had a few things I've wanted to change but wanted to find out what the {CPU, memory} impact of doing so would be. |
|
|
On Wed, Aug 11, 2021 at 6:19 PM iabyn ***@***.***> wrote:
I'm worried about the performance implications. Various hot places in the
core (such as pp_add()) take optimised paths in the presence of IOK, but
not pIOK.
Can this concern be alleviated by instead having the core base those
optimisation choices on the presence of the pIOK flag ?
(That's my fallback plan if I find that the loss of the IOK flag affects
those modules of mine that are looking for the IOK flag. In fact, I'm
thinking I should probably just make the switch to the pIOK flag anyway.)
Cheers,
Rob
|
No, pIOK can be set when IV isn't an accurate representation: |
|
Rebased, and with better comments in the commit Update SvPV() etc to avoid calling sv_2pv_flags() for cached IV strings |
As far as I can see this doesn't change when the IOK or pIOK flags are set. |
This was the intent. The macro code goes from to ie just the flags test changes. If I understand superscalar CPUs correctly, likely this only adds one cycle to the execution, as the two flag tests can dispatch in parallel. Old and new code both depends on the result of (all the) flags tests before it can resolve the branch direction. I don't know if it's even written down anywhere, but the basic "data structure with cached values and flags" design paradigm seems to be
The last one meaning that at times where the structure is officially and so pointer as shown is stored, and the code uses the offset shown, what might actually be allocated is this: and the reading code is always "memory at stored pointer plus offset" independent of what is actually allocated. |
As far as I know this invokes undefined behaviour (a pointer pointing outside an object), but we've been doing it forever. |
|
Summarizing the various objections:
To distinguish between "started as string" and "started as number" we can simply test: There's no need for extra flags changes. Are there any other objections to this PR? |
POK vs POKp have the same problem. SvPV uses the cached string representation only when SVf_POK is set, so merging this PR will break it. It seems that in practice "p" flags don't mean anything. |
|
On Thu, 27 Jan 2022, 10:55 Karen Etheridge, ***@***.***> wrote:
If someone has this branch running somewhere, a test run against
JSON::Schema::Modern would be appreciated, as this distribution peeks at
the PV, IV and NV flags for deriving "true" string/numeric properties.
thanks
FWIW, that never worked reliably. A classic case is "0e0" another is " 0 ",
another is "0000". Variants thereof are also problematic. Sereal had to
address many of these cases with string inspection. IMO the change proposal
can only improve things.
Yves
… |
|
I rebased, resolved conflicts and added two commits @tonycoz offered on a PR to my branch on my fork. "It's not easy being green" We're good? |
| Perl itself can represent but do not map one-to-one into external formats, so | ||
| need some amount of approximation or encapsulation. | ||
|
|
||
| =item * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still related here? This looks to be unrelated change from a different commit elsewhen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you try hard enough you can dereference a Regexp as a scalar. (Assignable) LVALUEs are also scalars:
$ perl -MDevel::Peek -e '$a = \(substr $^X, 0, 1); Dump $$a'
SV = PVLV(0x20ae4e8) at 0x2085fb0
REFCNT = 1
FLAGS = (GMG,SMG)
IV = 0
NV = 0
PV = 0
MAGIC = 0x209d648
MG_VIRTUAL = &PL_vtbl_substr
MG_TYPE = PERL_MAGIC_substr(x)
TYPE = x
TARGOFF = 0
TARGLEN = 1
TARG = 0x20860e8
FLAGS = 0
SV = PV(0x20869c0) at 0x20860e8
REFCNT = 2
FLAGS = (POK,pPOK)
PV = 0x209d6d0 "/usr/bin/perl"\0
CUR = 13
LEN = 15
I'm not confident that these are the only two cases, so didn't want to claim a definitive list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few stylistic/docs things, plus a request for a quick-and-easy tool to implement what’s written out here manually.
Thanks!
| else if (flags & SVf_POK) | ||
| serealise as string | ||
| else | ||
| the existing guesswork ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know there’s been discussion of this, but FWIW this might be bit nicer if it were wrapped up some new API call … maybe SvORIGTYPE()? Then the delta could just mention that API call and maybe some of the flag-manipulation changes, with most of these other details in some sort of .pod document. This discussion may be of especial interest outside the scope of perldelta.
It just seems like a lot of explanation for a changelog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding something like that would be useful, but I don't think it needs to hold up this particular PR. We'd verymuch like to get this change in before 5.35.9 (due in 3 days from now). Extending the XS-author API later would be a fine addition which we can do afterwards, perhaps as part of a wider look at "how we do type-like information".
| Perl itself can represent but do not map one-to-one into external formats, so | ||
| need some amount of approximation or encapsulation. | ||
|
|
||
| =item * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems in particular like it ought to be a different section from the much-more-public original-types change.
The audiences for these two things are different: C API callers on the one, Perl-maintainer Jedi wizards on the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, probably we can do better. But for a dev release perldelta, I didn't think we needed to get this perfect.
(As in, I don't think that this should block this PR)
| } | ||
| else { | ||
| /* something special or undef */ | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above: wrapping this up as SvORIGTYPE or some such would be amazing and simplify both the documentation and the lives of folks (like me!) who want to use this new hotness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, agree, but I wasn't sure on exactly what, and I didn't see it as a blocker on this PR.
|
On Mon, 5 Jul 2021 at 11:45, Nicholas Clark ***@***.***> wrote:
Some external serialised data formats such as JSON can distinguish between
values that are numbers and values that are strings, and services that
communicate in JSON often rely on this distinction being possible. Until
now, Perl has a problem with reliably generating output that meets these
expectations, because the internal state of $answer after these two code
paths had been functionally identical:
my $answer = 42;
print "The answer is $answer\n";
my $answer = "42";
print "That doesn't look right\n"
unless $answer == 6 * 9;
This branch
- changes Perl_sv_2pv_flags so that it does not set the SVf_POK flag
when caching the string representation of an integer value
- updates the macros SvPV etc so that they continue to use the cached
value directly (and hence do not start calling into Perl_sv_2pv_flags
where they had not previously)
- updates Perl_sv_setsv_flags to copy the changed flag state accurately
This permits serealisers to reliably distinguish between values first
written as numbers, and values first written as strings.
These changes exposed a bug in Data::Dumper (now fixed), where it was
already failing to handle NVs and overloaded references for certain cases.
All core tests still pass, along with the CPAN modules Sereal, JSON::XS,
Cpanel::JSON::XS and all their dependencies.
Either I have misunderstood the discussion we had on this, and the
explanation above or the patch isn't working as expected:
nicholas/IV-vs-PV:/git_tree/perl$ ./perl -Ilib -MDevel::Peek -e'my $s= " 0
"; my $i=0+$s; Dump($s)'
SV = PVIV(0x55d490dacc08) at 0x55d490daa8e0
REFCNT = 1
FLAGS = (IOK,POK,IsCOW,pIOK,pPOK)
IV = 0
PV = 0x55d490db64c8 " 0 "\0
CUR = 3
LEN = 10
COW_REFCNT = 1
The result is that $s is IOK and POK, even tho $s started off as a string,
and contains leading and trailing spaces in it.
Is this what you intended? I thought the intent of this patch was to ensure
that code like this would leave $s pIOK but NOT IOK.
Yves
|
|
@demerphq There's two sub-problems: Sometimes "stringy" values end up with IOK or NOK accidentally, and sometimes numbers end up with POK accidentally. They are two separate sides of the same overall concern. This patch only addresses the second kind - it ensures that numbers don't accidentally gain POK as well. So if you see a value with POK you know it was intended to be a string. This patch itself doesn't address the converse problem (of strings gaining IOK|NOK); we'd have to do that separately. |
…an IV. This permits XS code (such as serialisers) to distinguish between values that started as IVs but had a string representation cached, and values that started as PVs but had an (exact) integer representation cached. As implemented, the change in flags means that Perl_sv_2pv_flags() will be entered each time the string for an IV is wanted. The next commit will fix SvPV() and the other macros to avoid calling Perl_sv_2pv_flags() more than once, restoring the previous behaviour.
This "teaches" them the new SV flags combination implemented by the previous commit.
…lags. Previously Perl_sv_setsv_flags() would gleefully turn on SVf_POK for these values, which meant that any copy no longer propagated the (new) state that says "this value started as an integer".
This needs an update to known_pod_issues.dat because the Pod has trailing whitespace to avoid parts of the same verbatim code block getting separated.
I don't think that I can explain it better than that (or add to that usefully, without clarification on what). |
|
Anyone have any reasons why this shouldn't be accepted for inclusion in 5.36? |
leonerd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
On Fri, 18 Feb 2022 at 15:35, Neil Bowers ***@***.***> wrote:
Anyone have any reasons why this shouldn't be accepted for inclusion in
5.36?
It might cause some turbulence, but it will be worth it if it does. I am in
favour of merging it.
cheers,
Yves
|
Considering that this branch is enough to ensure originally-strings and originally-numbers stay distinguishable as such, is any strong reason left to still do that part? |
Since PR Perl#18958, values that start as IVs will not get their POK flags set when they are used as strings. This is meant to aid in serialization, allowing the "original" type of a value to be preserved. For NV values, the POK flag was already usually not being set, because the string form of a float could change based on the locale changing. However, for Inf and NaN values, the POK flag would still be enabled. Also, POK would be set for all floats if USE_LOCALE_NUMERIC was not defined. Update Perl_sv_2pv_flags to only enable the POKp flag when storing the PV for Inf or NaN values, or all NVs when USE_LOCALE_NUMERIC is not defined.
Since PR #18958, values that start as IVs will not get their POK flags set when they are used as strings. This is meant to aid in serialization, allowing the "original" type of a value to be preserved. For NV values, the POK flag was already usually not being set, because the string form of a float could change based on the locale changing. However, for Inf and NaN values, the POK flag would still be enabled. Also, POK would be set for all floats if USE_LOCALE_NUMERIC was not defined. Update Perl_sv_2pv_flags to only enable the POKp flag when storing the PV for Inf or NaN values, or all NVs when USE_LOCALE_NUMERIC is not defined.
Some external serialised data formats such as JSON can distinguish between values that are numbers and values that are strings, and services that communicate in JSON often rely on this distinction being possible. Until now, Perl has a problem with reliably generating output that meets these expectations, because the internal state of
$answerafter these two code paths had been functionally identical:This branch
Perl_sv_2pv_flagsso that it does not set theSVf_POKflag when caching the string representation of an integer valueSvPVetc so that they continue to use the cached value directly (and hence do not start calling intoPerl_sv_2pv_flagswhere they had not previously)Perl_sv_setsv_flagsto copy the changed flag state accuratelyThis permits serealisers to reliably distinguish between values first written as numbers, and values first written as strings.
These changes exposed a bug in Data::Dumper (now fixed), where it was already failing to handle NVs and overloaded references for certain cases. All core tests still pass, along with the CPAN modules
Sereal,JSON::XS,Cpanel::JSON::XSand all their dependencies.