-
Notifications
You must be signed in to change notification settings - Fork 780
Accelerate multianewArray on P for 2D arrays #21870
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Related issue: #2424 |
|
@zl-wang FYI |
|
The code is working and as far as I can tell without error for arr[0][0], arr[n][0], and arr[n][n] (where it should correctly engage the out of line code). |
|
Performance: where the load is basically me allocating a bunch of 2D arrays: arr[n][0]
The throughput unit is 1e6. |
|
The offheap implementation is not done yet, but shouldn't be hard. |
|
i supposed the "length" above is the 1st dimension size (i.e. n). the 2nd dimension size is also n? or it varied in your tests? |
|
The 2nd dimension is zero as outlined in the problem statement. When the 2nd dimension is not zero, the performance of default and accelerated code are pretty much the same (since the accelerated code just call anyway). |
|
In what is the weirdest bug I have seen, the evaluator wouldn't work if I set Update: resolved, I was not handling the case where the shiftAmount for compressed pointers is zero correctly, which happens to be the case when AOT was disabled. |
943358d to
6eb0434
Compare
|
A I will rebase the code and try again. https://hyc-runtimes-jenkins.swg-devops.com/job/Pipeline-Build-Test-Personal/28072/console Update: test passed https://hyc-runtimes-jenkins.swg-devops.com/job/Pipeline-Build-Test-Personal/28154/ |
6eb0434 to
a6a3981
Compare
|
Generated machine code, for reference: |
| TR::Node *callNode = outlinedHelperCall->getCallNode(); | ||
| TR::Register *reg; | ||
| if (callNode->getFirstChild() == node->getFirstChild()) | ||
| { | ||
| reg = callNode->getFirstChild()->getRegister(); | ||
| if (reg) | ||
| dependencies->addPostCondition(reg, TR::RealRegister::NoReg); | ||
| } | ||
| if (callNode->getSecondChild() == node->getSecondChild()) | ||
| { | ||
| reg = callNode->getSecondChild()->getRegister(); | ||
| if (reg) | ||
| dependencies->addPostCondition(reg, TR::RealRegister::NoReg); | ||
| } | ||
| if (callNode->getThirdChild() == node->getThirdChild()) | ||
| { | ||
| reg = callNode->getThirdChild()->getRegister(); | ||
| if (reg) | ||
| dependencies->addPostCondition(reg, TR::RealRegister::NoReg); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am pretty certain these chunk of code does nothing, since if a child is shared by the callNode and the node (first if block), then the register must have already been added to the post condition in the code block above.
However, the x evaluator included it, so I am leaving it here for now. Feedback appreciated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if what you mentioned is true, these blocks of code can only cause bugs, instead of nothing. you cannot add a register to a dep condition repeatedly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. I forgot that while on x there is union to handle repeated registers gracefully, there isn't on p.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a check so at least the same register won't be inserted twice. I'm still not sure if this is needed at all, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to me, less code is always better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just worry that there might be a case where this is required that I haven't thought about. After all the x evaluator should not just have a block of useless code.
da6339d to
25294da
Compare
|
I have eliminated all the initialisation to zero to the heap since per our discussion offline they should already by 0. |
|
give it a thorough test to confirm everything is right |
|
The functional test is still running; here's the result for the performance test:
newest is the build after I removed the code to initialise the zero fields. Functional test https://hyc-runtimes-jenkins.swg-devops.com/job/Build_JDK21_ppc64le_linux_Personal/1013/ |
0c83a81 to
e78e7ed
Compare
|
please take a look at non-zero 2nd dimension or 3rd dimension here: https://github.ibm.com/runtimes/openj9-jit-z/issues/1010 |
|
I read through the issue and I agree that it doesn't seem compilcated, it's just going to be very large. Should I start with the implementation? |
|
yes, give it a try. we may be able to do it in one go ... |
d31a119 to
0bf7e23
Compare
|
Newest performance numbers, including non-zero second dimensions
|
|
Functional test running: https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/52887/ |
|
nice results. at least, 20x improvement? the perf numbers' unit is unclear |
|
The unit is 1e6 iterations per cycle. |
| // oolJumpLabel is a common point that all branches will jump to. | ||
| // From this label, we branch to OOL code. | ||
| // We do this instead of jumping directly to OOL code from mainline | ||
| // since according to the x and z implementation that is not possible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the reason it isn't possible? any insights here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was no further elaboration in the x and z implementation, and during my tests the direct jump was not failing, but I can't be sure if any subtle failure occured.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might be for arranging a single return-point for the rest of OOL and/or Snippet.
| TR::Register *dimsPtrReg = cg->evaluate(node->getFirstChild()); | ||
| // number of dimensions - compile time constant | ||
| uint32_t nDims = node->getSecondChild()->get32bitIntegralValue(); | ||
| // class pointer of objects in the array - in the 2D case this is the class of the subarray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment looks not right: you directly store classReg to firstDim array's clazz field. That means classReg is firstDim's array class, instead of subarray's class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are on the same page here, it's the class of the subarray, not the class of the component of the subarray.
zl-wang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
review complete. once comments are addressed, i can approve it.
| // === jump point because according to the x and z implementation we can't jump straight to oolFail | ||
| generateLabelInstruction(cg, TR::InstOpCode::label, node, oolJumpLabel); | ||
| generateLabelInstruction(cg, TR::InstOpCode::b, node, oolFailLabel); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since dimsPtrReg and classReg (arguments for the call later) are never modified in this seq, i would agree these callUses* are not necessary. re the secondChild, it will need to be put in a register before the call, hopefully OOL adds the necessary dependency there. you can adjust the code below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed them now.
| } | ||
|
|
||
| generateDepLabelInstruction(cg, TR::InstOpCode::label, node, endLabel, dependencies); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am wondering why a targetRegisterFinal is necessary. isn't targetReg good already (presumably OOL jumps back to endLabel if we went down to it)? this mr just makes the path-length longer without much purposes.
On a second thought, could it be due to: in OOL, dimsPtrReg and targetReg should be both dependent on gr3 (first arg and result respectively). Here, targetReg is not hard-dep on gr3 ... there might be unforeseen occasions without targetRegisterFinal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I'm not sure about it other than to say it's copied from the x/z implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not the most ideal situation, but acceptable for the time being. need to examine register dependency overall to come up with the ideal solution.
|
|
||
| // the maximum number of reference fields in the outer array | ||
| uintptr_t maxObjectSizeInRefFields = cg->getMaxObjectSizeGuaranteedNotToOverflow() / referenceFieldSize; | ||
| // the maximum number of elements in the inner array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want you to waste instructions on these kind of useless big constants. use the single constant of maxObjectSizeInRefFields as the guard is good enough to me. any reasonable constant will do actually. re-materializing different constants is a total waste. imagine a max TLH of 128KB or even 2MB, this guard testing never catch the overflow ... the overflow will be caught at heapTop comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
|
Performance on dacapo graphchi using the same long option from z: Numbers are benchmark time in msecs; lower is better Forty iterations:
Averages:
|
|
wow ... really nice improvement. @vijaysun-omr @r30shah |
|
Indeed, very good results. Thanks for sharing. |
| // 1. the header of a contiguous array | ||
| // 2. (firstDimLen * referenceFieldSize) | ||
| // 3. alignment padding | ||
| // temp1Reg = firstDimLenReg * referenceFieldSize; referenceFieldSize can only be 4 or 8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't the previous codegen better with a single shiftLeft of trailingZero(refFieldSize)?
| int32_t shiftAmount = TR::Compiler->om.compressedReferenceShift(); | ||
| if (shiftAmount != 0) | ||
| { | ||
| // firstDimLenReg can be used as a temp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can do a single shift outside the loop. but [n][0] probably isn't common enough in reality, this is acceptable for the time being as it is.
|
Jenkins test sanity aix,plinux jdk8,jdk21 |
|
The updated zero-second-dimension case had no discernable impact on throughput, but at least nothing broke. |
8d13e54 to
a92c611
Compare
|
The last build will crash, should be fixed now |
a92c611 to
d349d7e
Compare
|
It turns out the callUses* code in the register dependencies is doing something after all, for the build will crash without it. I rolled back the changes now and it should be good now. |
zl-wang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| } | ||
|
|
||
| generateDepLabelInstruction(cg, TR::InstOpCode::label, node, endLabel, dependencies); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not the most ideal situation, but acceptable for the time being. need to examine register dependency overall to come up with the ideal solution.
|
Jenkins test sanity aix,plinux jdk8,jdk21 |
|
The code is working, finally. The problem was that the size of an array header is not the power of 2 as I assumed when using offheap, so I changed the shift instruction into a multiply instruction. I also updated the final register to duplicate what was done on x, just in case that matters. |
|
Jenkins test sanity aix,plinux jdk8,jdk21 |
The procedure to calculate the size of 2D array component was implemented as part of eclipse-openj9#21870. This part of code is usable on other platforms so I moved it to a cross platform area. signed-off-by: Ehsan Kiani Far <ehsan.kianifar@gmail.com>
The procedure to calculate the size of 2D array component was implemented as part of eclipse-openj9#21870. This part of code is usable on other platforms so I moved it to a cross platform area. signed-off-by: Ehsan Kiani Far <ehsan.kianifar@gmail.com>
The procedure to calculate the size of 2D array component was implemented as part of eclipse-openj9#21870. This part of code is usable on other platforms so I moved it to a cross platform area. signed-off-by: Ehsan Kiani Far <ehsan.kianifar@gmail.com>
The procedure to calculate the size of 2D array component was implemented as part of eclipse-openj9#21870. This part of code is usable on other platforms so I moved it to a cross platform area. signed-off-by: Ehsan Kiani Far <ehsan.kianifar@gmail.com>
The procedure to calculate the size of 2D array component was implemented as part of eclipse-openj9#21870. This part of code is usable on other platforms so I moved it to a cross platform area. signed-off-by: Ehsan Kiani Far <ehsan.kianifar@gmail.com>

Supports the inlining of multianewArray on P when a new 2D array is created with the second dimension having a length of zero.