-
Notifications
You must be signed in to change notification settings - Fork 580
Create CvSUBOVERRIDE to speed up common sub idioms #21824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: blead
Are you sure you want to change the base?
Conversation
This new code path does two separate things designed to allow common function patterns to be replaced with a static C function at runtime: First it adds two additional fields to CV and a check inside pp_entersub that, if set, will call a C function instead of setting up and calling a perl stack frame. This call can return false to indicate it was unsuccessful and allow the original optree to operate as normal. Second it adds an initial optimization that identifies a common accessor method pattern and attaches them with a static hash-key lookup function. The optimization is smart enough to handle cases when the function is also a mutator and passes control back to the optree to be executed. This combination enables common OO-style accessor to return as quick as an XS version is able to. A compiler flag, -Accflags=-DPERL_CV_OVERRIDE, must be given to enable the feature; both the check inside pp_entersub as well as initial optimization.
Two general comments. First, AFAIKT, nowhere do you state what actual "common accessor method pattern"(s) it overrides. Second, You have a couple of non-trivial C functions with no code comments at the top to say what they do. Then a specific observation: I doubt that this code will work correctly under PERL_RC_STACK. As a general rule, no new code in core should include dSP and all the macros associated with it. See the new section "Reference-counted argument stack" in perlguts.pod. I haven't understood the PR well enough yet (see the the two general comments above) to form an opinion on it. Do you have any firm ideas/plans for further common method patterns to override? |
Plus my overall thinking is that this is quite a large body of work to surprise on folks from a new contributor. I would have much preferred to have some initial discussion on the aims and directions first, so we can give some more specific advice on how best to go about doing it. |
Some further observations:
|
I apologize, you are correct, I did not state it explicitly. It is hidden away in the
I will plan to add those, it was simply an oversight on my part. The quick summary of the two functions are: S_defgv_hek_accessor: This is the function that runs at runtime attempting to optimize an accessor pattern. It will check that the number of arguments to a function call is exactly 1, and confirms that the single item is a hashref. If so, it returns the hash value of a static key stored in CvSUBOVERRIDEAUX. S_cv_override_create: This is the optimizer step that looks at the function as a whole and determines if an optimzation can be made. If so, it enables CvIsSUBOVERRIDE, sets the function to call to S_defgv_hek_accessor, and saves the constant key.
That's very possible. Some time ago I did attempt to compile with PERL_RC_STACK and PERL_CV_OVERRIDE, and my memory is that nothing immediately failed. I will make sure to check again and work on adjusting it to take care of the related macros before taking the draft status off this PR.
My goal initially was to support how OO framework generate their accessors, and the patterns above covers some of the more common frameworks, Moo and Moose specifically. I have thought about what else it could be used for, and I'm not going to say all of those ideas are valuable, just that they are simple patterns that I've used that maybe don't require a full stack to be created in order to function in all cases.
The strength of the way it works currently is that the static function doesn't have to handle the entire body of the function; it can see that the conditional it is optimizing for is not true and allow the optree to take over. So in the case of This PR is the result of an exercise in digging deeper into the perl internals, something I feel has been a good exercise for me, personally. I opened this PR because I had this code complete and seems from a surface level to be an improvement for at least some workloads. I opened it as a draft PR specifically to check to see if the changes were something others would be interested in. If this is the type of optimization that doesn't fit in the project, I am happy to close the PR and move on to something else. If the idea has legs, I'm happy to do any changes in comments, style or methodology to bring it up to par with expectations. |
That is a good description of what it's doing. I plan to add documentation about it.
That section is copied from
Had not thought about a CV getting freed. I will see what I can do. |
A similar one that I have often seen is |
On Fri, Jan 12, 2024 at 10:06:25AM -0800, Jon Gentle wrote:
> 3. The SV stored in CvSUBOVERRIDEAUX(cv) doesn't appear to be freed when the CV is freed.
Had not thought about a CV getting freed. I will see what I can do.
And I've just remembered: in addition, the pointer to the SV needs copying
(and the SV's ref count incremented) each time the CV is cloned by
S_cv_clone() - cloning being making a new anon sub at runtime by copying
from a sub prototype, as used in 'my $s = sub { ....}'.
You also need to duplicate the SV (in the SVt_PVCV/SVt_PVFM case in
S_sv_dup_common()) when the CV is being duplicated, which is what happens
when a new ithread is being created.
It would also be worthwhile adding tests which fail without these fixups.
t/op/svleak.t is a good place to detect the SV not being freed. In the
eleak() loop, you could both declare a named sub and then delete it from
the symbol table.
Not being cloned might be harder to test: anon CVs are only cloned when
they are acting as a closure, i.e. they reference a lexical var in an
outer scope, and I guess in that case it wouldn't be a candidate for
CvSUBOVERRIDE() anyway?
In t/op/threads.t: in a child thread you could call the accessor method
and see if it returns the right hash element.
…--
A walk of a thousand miles begins with a single step...
then continues for another 1,999,999 or so.
|
@atrodo, this pull request has acquired merge conflicts. Could you resolve them and re-push? Thanks. |
CV body struct needs a brainstorming. CVFILE still leaks like a thunderstorm with CvDYNFILE(), |
I32 xcv_depth /* >= 2 indicates recursive call */ | ||
I32 xcv_depth; /* >= 2 indicates recursive call */ \ | ||
bool (*xcv_suboverride)(pTHX_ CV *); \ | ||
SV * xcv_suboverride_aux |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This adds 16 bytes to all SV body structs, does anyone have ideas how that can be prevented?
HV* xmg_stash (1st field of all SV body structs, regardless if a ghost field or alloced field), is 99.999% unused for CV*s.
CUR/LEN/SvPVX() is 10% used, and are part of the minimally used PP prototypes system and also used by XSUB AUTOLOAD layer.
HV* xcv_stash and GV*/HEK* xcv_gv_u are weird and I don't understand why they can't be a union. HV* xcv_stash and HEK* xcv_gv_u (16 bytes) only exist for my $sub
and default off RV2CV no GV optimization, to regenerate the GV* if needed from just a CV* ptr. There needs to be 2 ptrs to something for RV2CV to work, but why blessed HV*, CV specific HV*, and (GV* || HEK*) at all times? That is 3 ptrs of data.
CV body field HV* xcv_stash and CV body field HV* xcv_gv_u.xcv_gv->sv_any->xnv_u.xgv_stash are always identical in the Perl VM AFAIK.
CV* CvOUTSIDE() is a PP sub only concept, XSUBs can't be nested, static/hidden private scope/layered or closured against other XSUBs or C functions or against PP CV* PP lang level scopes. Maybe that CvOUTSIDE can be recycled for something else. Maybe field CvOUTSIDE() which is 8 bytes/void *
can be replaced by a U16 offset into a PP sub's PADLIST SV* array. U32 cv_flags_t
is in great danger of running out of bits. highest flag right now is 0x40 00 00, Im counting only 9 unused bits left in that U32.
An XSUB CV* currently has ALOT of unused void *s. 1 goes to CPAN XS authors as XS_ANY. but there are atleast 2 more void s that are unused. cv_outside is always NULL. and cv_padlist I repurposed for no-threads perl XS handshake ABI compatibility metadata since the CV pointer is the 1 and only C language argument passed to a boot XSUB, and handshake metadata has no choice but to be delivered through the CV* cv argument
because there is no my_perl
variable on a no thd perl. That handshake metadata void * doesn't need to exist, except for boot XSUBs, and only for a very short temporary time, only when that boot XSUB is 1st time ever executed using DynaLoader or XSLoader into a brand new in address space .so/.dll. 100% proven on all OSes and all build Configure permutations, static linked, boot XSUBs dont need a XS handshake test since it was proved at libperl.so CC time or link time that, that boot XSUB can't be separated by the end user from libperl.so.
The optimizations in this PR, should be part of a "nextgen" XSUB C level ABI of some sort that many people have wanted for 20+ years. The mandatory prolog in C/machine code, of all XSUBs Why so much mandatory work in the 100K-1 million unique XSUBs, that are currently being executed/or sit inside physical ram this wall clock secon, on planet earth ? Why isn't incoming arg count, a size_t or U32 passed on C stack (nowadays always a CPU register except for i386) like all other C/C++ vararg code on earth? A cleaner/simplified "BEGIN{} time burn my XSUB into the caller as a keyword/OP struct" public API, vs the current cargo culted CPAN XS copy pastes, would be nice. |
Can't this PR just be a very well written and very well C lang level optimized CPAN XS module instead if thats all this PR is supposed to do? But counterpoint, other than its own maintainer/author and max a dozen other humans, nobody else will know of or use that CPAN XS module, |
Someone who knows how to change the OP structs and OP tree needs to implement that, you dont need an XSUB, that all looks const foldable to my eye.
Watch a video on youtube how to code in C. This is too many turning machine or per machine things to do in a row or pattern match for automatically without developer intervention, in my opinion
IDK, maybe lint or perlcritic should be used instead to identify pointless PP lang derefs. How many times SV GMG executes is another question if you wanna XS-ify that PP src code. Someone might complain, maybe nobody complains.
This is better done on a optree level than a hybrid of an XSUB and then a PP SUB executing. A sub or P5 "call frame" should be one or the other on a whiteboard, not both.
This is already done, or non-merged PRs or P5P RT patches exist to do this. Someone asked I worked this particular optimization and I submitted a RT patch finishing the optimization, but it stalled out because I didn't know how to proceed b/c core tests failed, because the exactly working of fatal runtime errors slightly changed, and there were no responses if we (perl ecosystem) cares on STDERR byte equality for fatal developer errors.
blead perl's If pp_ref() is doing what I described above, its performance would be 2-5 CPU opcodes and nothing more than HEK* == addr comparison instead of QEMU/Bochs style ultra heavy weight and slow current impl. |
This new code path does two separate things designed to allow common function patterns to be replaced with a static C function at runtime:
First it adds two additional fields to CV and a check inside pp_entersub that, if set, will call a C function instead of setting up a perl stack frame and calling the original function. This call can return false to indicate it was unsuccessful and allow the original optree to operate as normal.
Second it adds an initial optimization that identifies a common accessor method pattern and attaches them with a static hash-key lookup function. The optimization is smart enough to handle cases when the function is also a mutator and passes control back to the optree to be executed.
This combination enables common OO-style accessor to return as quick as an XS version is able to.
A compiler flag, -Accflags=-DPERL_CV_OVERRIDE, must be given to enable the feature; both the check inside pp_entersub as well as initial optimization.
I have opened this as draft PR, I believe the concept in general is technically viable but I am unsure if it is a good fit. I can appreciate that as a new contributor this is touching some important bits, but I had an interest in learning about perl's internals and had this as a specific goal in mind. I had previously opened an earlier draft PR with a much different in-progress version; the feedback I received from that significantly changed the final approach.
I benchmarked the change with Porting/bench.pl, which my interpretation of the information indicates that on average the call:: benchmarks are about 2% slower overall, but the new call::sub::accessor benchmark is about 220% faster. I understand that making every call slower, even if by a small amount, is not going to work with every workload, and so I decided to put the optimization behind a compiler flag.
I did another benchmark by creating a script to compare several popular object systems. The results of those appear to show that this optimization allows accessors that use the optimized pattern to be within the range of speed of some popular XS accessor modules.
Both the script I created above as well as my output from Porting/bench.pl is available in the below gist. In both cases
blead
refers to a compiled version of perl without-Accflags=-DPERL_CV_OVERRIDE
, andhacked
refers to a compiled version of perl with that flag defined.https://gist.github.com/atrodo/98f6cde36c8fa9954de34957be876493