Skip to content

Conversation

@nwc10
Copy link
Contributor

@nwc10 nwc10 commented Jul 5, 2021

Some external serialised data formats such as JSON can distinguish between values that are numbers and values that are strings, and services that communicate in JSON often rely on this distinction being possible. Until now, Perl has a problem with reliably generating output that meets these expectations, because the internal state of $answer after these two code paths had been functionally identical:

my $answer = 42;
print "The answer is $answer\n";

my $answer = "42";
print "That doesn't look right\n"
    unless $answer == 6 * 9;

This branch

  • changes Perl_sv_2pv_flags so that it does not set the SVf_POK flag when caching the string representation of an integer value
  • updates the macros SvPV etc so that they continue to use the cached value directly (and hence do not start calling into Perl_sv_2pv_flags where they had not previously)
  • updates Perl_sv_setsv_flags to copy the changed flag state accurately

This permits serealisers to reliably distinguish between values first written as numbers, and values first written as strings.

These changes exposed a bug in Data::Dumper (now fixed), where it was already failing to handle NVs and overloaded references for certain cases. All core tests still pass, along with the CPAN modules Sereal, JSON::XS, Cpanel::JSON::XS and all their dependencies.

@haarg
Copy link
Contributor

haarg commented Jul 5, 2021

Sub::Quote, JSON::Tiny, and Mojo::JSON also look at the flags (indirectly) to decide how to serialize scalars. They all pass their tests with this branch.

@leonerd
Copy link
Contributor

leonerd commented Jul 22, 2021

I've raised this PR for further discussion on p5p@
https://www.nntp.perl.org/group/perl.perl5.porters/2021/07/msg260916.html

@FGasper
Copy link
Contributor

FGasper commented Jul 22, 2021

The intent to preserve initial-stringiness can’t be achieved without making numification not set IOK/NOK.

Consider (on upstream):

perl -MCpanel::JSON::XS -e'my $x = "6"; my $y = $x + 0; print encode_json([$x])
[6]

The $x + 0 sets IOK, which causes JSON::XS to output a number.

Basically, in order to preserve initial number/string it seems like we have to:

  • prevent stringification from setting POK (done here)
  • prevent numification from setting IOK (not done here)

Per @Grinnz via IRC:

JSON::XS prefers the string value, JSON::PP and Mojo::JSON and Cpanel::JSON::XS prefer the number value because accidentally successfully numifying a stringy number is less of a false positive

Sereal also prefers the number, as does CBOR::Free. CBOR::XS prefers the string.

@leonerd
Copy link
Contributor

leonerd commented Jul 22, 2021

@FGasper Sure - we haven't claimed this PR is in any way sufficient for making all code able to distinguish. It's just a good start. It helps in some situations and - we hope - doesn't make anything else worse. There are of course many follow-ups that can be done afterwards.

@ap
Copy link
Contributor

ap commented Jul 28, 2021

Just to establish a destination for this journey, I believe it should ideally be the case that after

my $x = "6";
my $y = $x + 36;

the flags on $x would be POK=on IOK=off and on $y POK=off IOK=on, but $x would also have pIOK=on (assuming my understanding of the flags is correct…).

Basically:

  • String operators always yield string values, and mathematical operators numeric values
  • Intermediate conversions are cached and kept, but they don’t affect the “official” “type” of the scalar

I must add that I have no sense of the feasibility of getting there without too much breakage, but is there consensus that this is the platonic ideal?

@haarg
Copy link
Contributor

haarg commented Jul 28, 2021

The intent to preserve initial-stringiness can’t be achieved without making numification not set IOK/NOK.

Consider (on upstream):

perl -MCpanel::JSON::XS -e'my $x = "6"; my $y = $x + 0; print encode_json([$x])
[6]

This PR on its own is not intended to fix how existing serialization code picks types, but to allow new code to be written that can make those choices accurately. Existing serializers need to cope with how perl currently sets its flags, so they need to use heuristics. With this PR merged, any value with POK set is guaranteed to have started as a string. So a POK+IOK value, like in your example, is a string. No heuristic needed. Taking advantage of this would require updates to serializers, who would need to have different behavior on different perl versions.

It would also be worth trying to avoid setting IOK/NOK when numfying strings. That can be addressed separately though.

@iabyn
Copy link
Contributor

iabyn commented Aug 3, 2021 via email

@leonerd
Copy link
Contributor

leonerd commented Aug 10, 2021

General ping here - What's the overall state?

The general feel I get from reading the comments is "this is fine, but not sufficient to give us proper type-intent distinction". Which is what we expected. I don't feel anyone has said "Don't do this", just people have said "This isn't enough".

Is that accurate? If so perhaps we can merge this and continue working

@iabyn
Copy link
Contributor

iabyn commented Aug 11, 2021 via email

@leonerd
Copy link
Contributor

leonerd commented Aug 12, 2021

Relating to @iabyn's comment, and a few thoughts of my own on other PRs: Do we have a good way to benchmark these sorts of things? Primarily in terms of CPU time, but also perhaps in memory usage. I've had a few things I've wanted to change but wanted to find out what the {CPU, memory} impact of doing so would be.

@hvds
Copy link
Contributor

hvds commented Aug 12, 2021

Relating to @iabyn's comment, and a few thoughts of my own on other PRs: Do we have a good way to benchmark these sorts of things? Primarily in terms of CPU time, but also perhaps in memory usage. I've had a few things I've wanted to change but wanted to find out what the {CPU, memory} impact of doing so would be.

Porting/bench.pl is very good for instruction-level timing on code snippets. AFAIR it'll show pagefaults, but won't directly show changes to memory usage.

@sisyphus
Copy link
Contributor

sisyphus commented Aug 16, 2021 via email

@tonycoz
Copy link
Contributor

tonycoz commented Sep 1, 2021

Can this concern be alleviated by instead having the core base those optimisation choices on the presence of the pIOK flag ?

No, pIOK can be set when IV isn't an accurate representation:

$ ./perl -Ilib -MDevel::Peek -e 'my $x = "1.1"; my $y = "a" x $x; Dump($x)'
SV = PVNV(0x559f54764340) at 0x559f54788578
  REFCNT = 1
  FLAGS = (NOK,POK,IsCOW,pIOK,pNOK,pPOK)
  IV = 1
  NV = 1.1
  PV = 0x559f547d0610 "1.1"\0
  CUR = 3
  LEN = 10
  COW_REFCNT = 1

@nwc10
Copy link
Contributor Author

nwc10 commented Oct 5, 2021

Rebased, and with better comments in the commit Update SvPV() etc to avoid calling sv_2pv_flags() for cached IV strings

@tonycoz
Copy link
Contributor

tonycoz commented Oct 7, 2021

I'm worried about the performance implications. Various hot places in the core (such as pp_add()) take optimised paths in the presence of IOK, but not pIOK.

As far as I can see this doesn't change when the IOK or pIOK flags are set.

@nwc10
Copy link
Contributor Author

nwc10 commented Oct 7, 2021

As far as I can see this doesn't change when the IOK or pIOK flags are set.

This was the intent. The macro code goes from

((SvFLAGS(sv) & (SVf_POK|SVs_GMG)) == SVf_POK)
? ... do stuff inline
: ... call a function

to

(((SvFLAGS(sv) & (SVf_POK|SVs_GMG)) == SVf_POK)
 || (SvFLAGS(sv) & (SVf_IOK|SVp_POK|SVs_GMG)) == (SVf_IOK|SVp_POK))
? ... do stuff inline
: ... call a function

ie just the flags test changes. If I understand superscalar CPUs correctly, likely this only adds one cycle to the execution, as the two flag tests can dispatch in parallel. Old and new code both depends on the result of (all the) flags tests before it can resolve the branch direction.

I don't know if it's even written down anywhere, but the basic "data structure with cached values and flags" design paradigm seems to be

  • have a data structure with space to cache conversions and flags
  • have macros which inline small code which can quickly test the flags for common cases, and where possible use cached values
  • else call into a function for the slow path, which can hopefully cache values and set flags to permit fast paths in the future
  • arrange all the various SV data structures so that the cache slots are at the same offsets from (stored) pointers, so that the reading code in the macros doesn't need to branch

The last one meaning that at times where the structure is officially

+-------+-------+-------+-------+------+
| stash | magic |  cur  |  len  |  IV  |
+-------+-------+-------+-------+------+
^
|
stored pointer

<----------- offset ---------->

and so pointer as shown is stored, and the code uses the offset shown, what might actually be allocated is this:

                               -+------+-
                                |  IV  |
                               -+------+-
                                ^
                                |
                                real pointer from malloc() or similar                              
^
|
stored pointer points over here

and the reading code is always "memory at stored pointer plus offset" independent of what is actually allocated.

@tonycoz
Copy link
Contributor

tonycoz commented Oct 13, 2021

and so pointer as shown is stored, and the code uses the offset shown, what might actually be allocated is this:

                               -+------+-
                                |  IV  |
                               -+------+-
                                ^
                                |
                                real pointer from malloc() or similar                              
^
|
stored pointer points over here

and the reading code is always "memory at stored pointer plus offset" independent of what is actually allocated.

As far as I know this invokes undefined behaviour (a pointer pointing outside an object), but we've been doing it forever.

@tonycoz
Copy link
Contributor

tonycoz commented Dec 14, 2021

Summarizing the various objections:

Issue Resolution
@FGasper Basically, in order to preserve initial number/string it seems like we have to:... prevent numification from setting IOK (not done here).1 Two other similar comments. 2 3 this isn't an issue, the patch allows a serializer to distinguish between values that started as strings and values that started as numbers.
@iabyn I'm worried about the performance implications. Various hot places in the core (such as pp_add()) take optimised paths in the presence of IOK, but not pIOK. 1 This PR doesn't change when IOK is set or unset.
(edit) Existing serializers aren't using the correct type Existing serializers need to be updated

To distinguish between "started as string" and "started as number" we can simply test:

    if (SvIsBOOL(sv)) {
        # it's boolean
    }
    if (SvPOK(sv)) {
        # started as string
    }
    else if (SvNIOK(sv)) {
        # started as number
    }

There's no need for extra flags changes.

Are there any other objections to this PR?

@xenu
Copy link
Member

xenu commented Dec 14, 2021

I'm worried about the performance implications. Various hot places in the core (such as pp_add()) take optimised paths in the presence of IOK, but not pIOK.

This PR doesn't change when IOK is set or unset.

POK vs POKp have the same problem. SvPV uses the cached string representation only when SVf_POK is set, so merging this PR will break it.

It seems that in practice "p" flags don't mean anything.

@demerphq
Copy link
Collaborator

demerphq commented Jan 27, 2022 via email

@nwc10
Copy link
Contributor Author

nwc10 commented Feb 12, 2022

I rebased, resolved conflicts and added two commits @tonycoz offered on a PR to my branch on my fork.

"It's not easy being green"

We're good?

Perl itself can represent but do not map one-to-one into external formats, so
need some amount of approximation or encapsulation.

=item *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still related here? This looks to be unrelated change from a different commit elsewhen

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you try hard enough you can dereference a Regexp as a scalar. (Assignable) LVALUEs are also scalars:

$ perl -MDevel::Peek -e '$a = \(substr $^X, 0, 1); Dump $$a'
SV = PVLV(0x20ae4e8) at 0x2085fb0
  REFCNT = 1
  FLAGS = (GMG,SMG)
  IV = 0
  NV = 0
  PV = 0
  MAGIC = 0x209d648
    MG_VIRTUAL = &PL_vtbl_substr
    MG_TYPE = PERL_MAGIC_substr(x)
  TYPE = x
  TARGOFF = 0
  TARGLEN = 1
  TARG = 0x20860e8
  FLAGS = 0
  SV = PV(0x20869c0) at 0x20860e8
    REFCNT = 2
    FLAGS = (POK,pPOK)
    PV = 0x209d6d0 "/usr/bin/perl"\0
    CUR = 13
    LEN = 15

I'm not confident that these are the only two cases, so didn't want to claim a definitive list.

Copy link
Contributor

@FGasper FGasper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few stylistic/docs things, plus a request for a quick-and-easy tool to implement what’s written out here manually.

Thanks!

else if (flags & SVf_POK)
serealise as string
else
the existing guesswork ...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know there’s been discussion of this, but FWIW this might be bit nicer if it were wrapped up some new API call … maybe SvORIGTYPE()? Then the delta could just mention that API call and maybe some of the flag-manipulation changes, with most of these other details in some sort of .pod document. This discussion may be of especial interest outside the scope of perldelta.

It just seems like a lot of explanation for a changelog.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding something like that would be useful, but I don't think it needs to hold up this particular PR. We'd verymuch like to get this change in before 5.35.9 (due in 3 days from now). Extending the XS-author API later would be a fine addition which we can do afterwards, perhaps as part of a wider look at "how we do type-like information".

Perl itself can represent but do not map one-to-one into external formats, so
need some amount of approximation or encapsulation.

=item *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems in particular like it ought to be a different section from the much-more-public original-types change.

The audiences for these two things are different: C API callers on the one, Perl-maintainer Jedi wizards on the other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, probably we can do better. But for a dev release perldelta, I didn't think we needed to get this perfect.

(As in, I don't think that this should block this PR)

}
else {
/* something special or undef */
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above: wrapping this up as SvORIGTYPE or some such would be amazing and simplify both the documentation and the lives of folks (like me!) who want to use this new hotness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agree, but I wasn't sure on exactly what, and I didn't see it as a blocker on this PR.

@demerphq
Copy link
Collaborator

demerphq commented Feb 17, 2022 via email

@leonerd
Copy link
Contributor

leonerd commented Feb 17, 2022

@demerphq There's two sub-problems: Sometimes "stringy" values end up with IOK or NOK accidentally, and sometimes numbers end up with POK accidentally. They are two separate sides of the same overall concern.

This patch only addresses the second kind - it ensures that numbers don't accidentally gain POK as well. So if you see a value with POK you know it was intended to be a string. This patch itself doesn't address the converse problem (of strings gaining IOK|NOK); we'd have to do that separately.

nwc10 and others added 6 commits February 17, 2022 17:55
…an IV.

This permits XS code (such as serialisers) to distinguish between values
that started as IVs but had a string representation cached, and values that
started as PVs but had an (exact) integer representation cached.

As implemented, the change in flags means that Perl_sv_2pv_flags() will be
entered each time the string for an IV is wanted. The next commit will fix
SvPV() and the other macros to avoid calling Perl_sv_2pv_flags() more than
once, restoring the previous behaviour.
This "teaches" them the new SV flags combination implemented by the previous
commit.
…lags.

Previously Perl_sv_setsv_flags() would gleefully turn on SVf_POK for these
values, which meant that any copy no longer propagated the (new) state
that says "this value started as an integer".
This needs an update to known_pod_issues.dat because the Pod has trailing
whitespace to avoid parts of the same verbatim code block getting separated.
@nwc10
Copy link
Contributor Author

nwc10 commented Feb 17, 2022

This patch only addresses the second kind - it ensures that numbers don't accidentally gain POK as well. So if you see a value with POK you know it was intended to be a string. This patch itself doesn't address the converse problem (of strings gaining IOK|NOK); we'd have to do that separately.

I don't think that I can explain it better than that (or add to that usefully, without clarification on what).

@neilb
Copy link
Contributor

neilb commented Feb 18, 2022

Anyone have any reasons why this shouldn't be accepted for inclusion in 5.36?

Copy link
Contributor

@leonerd leonerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@demerphq
Copy link
Collaborator

demerphq commented Feb 19, 2022 via email

@nwc10 nwc10 closed this Feb 19, 2022
@nwc10 nwc10 deleted the IV-vs-PV branch February 19, 2022 20:17
@ap
Copy link
Contributor

ap commented Feb 20, 2022

So if you see a value with POK you know it was intended to be a string. This patch itself doesn't address the converse problem (of strings gaining IOK|NOK); we'd have to do that separately.

Considering that this branch is enough to ensure originally-strings and originally-numbers stay distinguishable as such, is any strong reason left to still do that part?

haarg added a commit to haarg/perl5 that referenced this pull request Mar 15, 2022
Since PR Perl#18958, values that start as IVs will not get their POK flags
set when they are used as strings. This is meant to aid in
serialization, allowing the "original" type of a value to be preserved.

For NV values, the POK flag was already usually not being set, because
the string form of a float could change based on the locale changing.
However, for Inf and NaN values, the POK flag would still be enabled.
Also, POK would be set for all floats if USE_LOCALE_NUMERIC was not
defined.

Update Perl_sv_2pv_flags to only enable the POKp flag when storing the
PV for Inf or NaN values, or all NVs when USE_LOCALE_NUMERIC is not
defined.
leonerd pushed a commit that referenced this pull request Mar 15, 2022
Since PR #18958, values that start as IVs will not get their POK flags
set when they are used as strings. This is meant to aid in
serialization, allowing the "original" type of a value to be preserved.

For NV values, the POK flag was already usually not being set, because
the string form of a float could change based on the locale changing.
However, for Inf and NaN values, the POK flag would still be enabled.
Also, POK would be set for all floats if USE_LOCALE_NUMERIC was not
defined.

Update Perl_sv_2pv_flags to only enable the POKp flag when storing the
PV for Inf or NaN values, or all NVs when USE_LOCALE_NUMERIC is not
defined.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.