-
Notifications
You must be signed in to change notification settings - Fork 396
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the bitPermute Evaluator for Z #2330
Conversation
b07cbb0
to
9601a73
Compare
Leaving this as a WIP for now as there are a couple of MemoryReference related things I need to verify and we might be able to come up with a better sequence of instructions to perform this bitPermute operation after @fjeremic takes a look. |
Functionality of the bitPermute opcode is summarized well by this comment: For comparison on Z, I generated what the IL and subsequent instructions look like on Z when bitPermute is not enabled on the platform (this is with no optimizations running, lastOptIndex=0):
And the instructions for the entire bitPermute operation (i.e the for loop):
looks like ~18 instructions in the loop |
The implemented evaluator here will generate the following set of instructions to do the entire bitPermute operation if the array length is constant: IL:
Instructions:
A similar sequence to that of lines 737-743 is generated 30 more times (constant length array size=32). This approach avoids branching that the old sequence of instructions perform and has less memory operations compared to before. In the case where we don't the array length is not a constant, the generated instructions are similar but now require branching. Instructions generated by the evaluator for the above IL will be:
~10 instructions repeated in loop. |
@fjeremic Let me know if you can spot an area where we can use more optimal instructions. |
@@ -14117,6 +14117,168 @@ bool isSettingOp(uint8_t byteValue, TR::InstOpCode::Mnemonic SI_opcode) | |||
} | |||
|
|||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's clean up this area of code and replace the triple new line with a single new line.
TR::Register * | ||
OMR::Z::TreeEvaluator::bitpermuteEvaluator(TR::Node *node, TR::CodeGenerator *cg) | ||
{ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove the unnecessary new line?
OMR::Z::TreeEvaluator::bitpermuteEvaluator(TR::Node *node, TR::CodeGenerator *cg) | ||
{ | ||
|
||
TR_ASSERT(node->getNumChildren() == 3, "Wrong number of children in bitpermuteEvaluator"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I have no quarrel with additional asserts, shouldn't this be the job of the IL verifier? i.e. we already have a guard for this here:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed. I didn't know about the IL Verifier check. This check also exists in the X/codegen evaluator, so it should be removed from there as well.
|
||
bool nodeIs64Bit = node->getSize() == 8; | ||
auto valueReg = cg->evaluate(value); // this will either be a constant or a load | ||
auto addrReg = cg->evaluate(addr); // this will either be a constant or a load |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think these comments are true. We only know what the data type will be. IMO they are misleading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
|
||
TR::Register *tmpReg = cg->allocateRegister(TR_GPR); | ||
|
||
//Zero result reg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space needed after //
(there are several occurences of this in the PR - let's fix them all and capitalize the first word). Should also move this comment down to the generateRRInstruction
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
if (!(node->getDataType() == TR::Int32 || node->getDataType() == TR::Int16 || node->getDataType() == TR::Int8 || node->getDataType() == TR::Int64)) | ||
{ | ||
TR_ASSERT(0, " Unsupported DataType for bitPermute op"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
generateRRInstruction(cg, TR::InstOpCode::getXORRegOpCode(), node, tmpReg, tmpReg); | ||
|
||
//load the shift amount into tmpReg | ||
//TODO: Check to see if we need sourceMR->stopUsingMemRefRegister(cg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically yes, though in practice we have code that handles this automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay. Then I'll remove these from here.
auto lengthReg = cg->evaluate(length); | ||
auto indexReg = cg->allocateRegister(TR_GPR); | ||
|
||
TR::RegisterDependencyConditions* deps = generateRegisterDependencyConditions((uint8_t)0, (uint8_t)4, cg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the casts in the arguments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
|
||
// Load array length into index and test to see if it's already zero | ||
// cc=0 if register value is 0, cc=1 if value < 1, cc=2 if value>0 | ||
generateRRInstruction(cg, TR::InstOpCode::LTGR, node, indexReg, lengthReg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to be careful here if running on 31-bit? What is the type of lengthReg
on 31-bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think we can get away with using an LTR
here because lengthReg is always a TR::Int32
as per the IL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correction: the correct implementation here would be:
generateRRInstruction(cg, TR::InstOpCode::getLoadTestRegOpCode(), node, indexReg, lengthReg);
where the getLoadTestRegOpCode()
query is defined as below:
https://github.com/eclipse/omr/blob/58194bc93b1569ed1fe2a994fb0b0d7276a9e0e2/compiler/z/codegen/InstOpCode.cpp#L254-L255
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was done incorrectly before. The correct query was added here: f1f0697
generateS390LabelInstruction(cg,TR::InstOpCode::LABEL, node, startLabel); | ||
|
||
// now subtract 1 from indexReg | ||
generateRILInstruction(cg, node->getDataType() == TR::Int64 ? TR::InstOpCode::AGFI : TR::InstOpCode::AFI, node, indexReg, -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use
generateS390ImmOp(cg, TR::InstOpCode::getAddOpCode(), node, indexReg, indexReg, -1);
here instead and let the helper pick the best instruction? All add instructions produce the same condition code results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
I'll stop the review here and let you do a first round of cleanup before reviewing again. |
We should definitely not do this. The amount of icache we will consume will be way worse than using a loop with a branch on count. This will be well predicted. In the case of constant length do we know what the typical constants that get passed in are? Is it always 8, 16, 32, or 64, or some other value? Perhaps we can have the best of both worlds and unroll the loop manually a few times via Duff's device or unroll fully for say maximum length of 4, otherwise fall back to a single iteration loop. |
Some more reviews on the generated code:
I feel like there is a use case for |
9601a73
to
0233803
Compare
Sounds like that might be a better option. However this was done in reference to the original X implementation (So we will need to change it there as well). As mentioned in the comment I linked (#2156 (comment)), this was considered to be an optimization for a common case under JProfiling:
I don't observe any patterns in how the tril tests are written for this. But they seem to be testing the boundaries. And I'm not familiar with how JProfiling works exactly to know what the typical constants would be.
The XGR was used to zero out
I addressed this in one of your code review comments. I agree, we can save two instructions by removing this. But I'm not sure who the right person is that should perform this check. There are other areas in the JIT where we do this masking. And the architecture doesn't say anything about considering only the first 5 bits (it considers the first 6, hence why we never mask for 64 bit shifts). |
0233803
to
ceca9a7
Compare
@fjeremic I've addressed your comments. PTAL. |
I don't think tril tests will tell you common use patterns. The point of tril tests is to have maximal coverage and test every single possibility that may arise. It does not tell us which ones are actually common in production code. Since this IL is currently only used for profiling we need to understand the common number of "things" we profile at a time. Surely we don't profile more than a handful of values at a single site, but quantitatively how many is "handful"? |
TR::Node *addr = node->getChild(1); | ||
TR::Node *length = node->getChild(2); | ||
|
||
bool nodeIs64Bit = node->getSize() == 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable is unused. Can we remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 537a564
TR::MemoryReference * tempMR = generateS390MemoryReference(tmpReg, 0, cg); | ||
generateRSInstruction(cg, TR::InstOpCode::getShiftRightLogicalSingleOpCode(), node, tmpReg, valueReg, tempMR); | ||
|
||
if (!(node->getDataType() == TR::Int64)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This path has an if
and an else
. Can we swap the blocks to avoid the negative expression in the if
clause?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in dcdefce
// This is equivalent to doing a `tmpReg & 0x1`. But on 64-bit we would have to use | ||
// two AND immediate instructions. So instead we use a single RISBG instruction. | ||
TR::InstOpCode::Mnemonic opCode = cg->getS390ProcessorInfo()->supportsArch(TR_S390ProcessorInfo::TR_zEC12) ? TR::InstOpCode::RISBGN : TR::InstOpCode::RISBG; | ||
generateRIEInstruction(cg, opCode, node, tmpReg, tmpReg, 63, 191, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we able to use generateShiftAndKeepSelected64Bit
here? Also what if we're on a z196? This would cause a SIGILL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow what you mean. Why would this cause a SIGILL on z196? The RISBG Instruction is available on it from what what I can see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I meant to say z9, not z196. Apologies for the mistake. RISBG
is only supported on z10 onward:
We should use the above API and not worry about this (it already handles the z9 case).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adressed in ef34b3e
As mentioned, the |
I'm not sure I follow your comment here. Would you be able to provide an example? To give a bit of background, the first two instructions are trying to do what a |
There may be a possible point of improvement here. If the result of |
generateRXYInstruction(cg, TR::InstOpCode::LGB, node, tmpReg, sourceMR); | ||
|
||
// Create memory reference using tmpReg (shift amount), then shift valueReg by shift amount | ||
TR::MemoryReference * tempMR = generateS390MemoryReference(tmpReg, 0, cg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get a better name for this memory reference (similarly in the else
path)? Perhaps shiftAmountMR
? Also would it be possible to fix the pointer location inconsistencies throughout this change? I'm seen all three flavors of type* name
, type * name
, and type *name
. Let's stick to one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in: 62c4af1
// This will generate a RISBG instruction (if supported). | ||
// This is equivalent to doing a `tmpReg & 0x1`. But on 64-bit we would have to use | ||
// two AND immediate instructions. So instead we use a single RISBG instruction. | ||
generateShiftAndKeepSelected64Bit(node, cg, tmpReg, tmpReg, 63, 63, 0, true, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we carry out the shift and AND operation using a single RISBG
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, just saw RISBG
type instructions do not have register flavors. I don't think this would be possible then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, to address your earlier question as well, I did explore the use of RISBG, but due to shift amounts being specified as immediates, it wasn't possible. I was thinking that the RISBG could be used as a way to preserve a specific bit (where the bit position is inside a register) and then we can shift it right accordingly. However due to no register flavours, this wasn't possible.
But taking a second look, there may still an opportunity for the constant length case. The left shift (done after the RISBG) amount is a constant here:
generateRSInstruction(cg, TR::InstOpCode::getShiftLeftLogicalSingleOpCode(), node, tmpReg, tmpReg, x);
And we can specify the left shift amount (x
) as a constant via RISBG. So we may not need this final shift and can do it via RISBG.
I'll test this out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I tested this out and it looks like it's working. For the constant length case, we now have a guaranteed 4 instructions for every "loop iteration". Before it was anywhere from 5-6 instructions. The following code snippet from the commit summarizes what the RISBG instruction accomplishes:
// This will generate a RISBG instruction (if it's supported, otherwise two shift instructions).
// A RISBG instruction is equivalent to doing a `(tmpReg & 0x1) << x`. But for a 64-bit value we would have to use
// two AND immediate instructions and a shift instruction to do this. So instead we use a single RISBG instruction.
generateShiftAndKeepSelected64Bit(node, cg, tmpReg, tmpReg, 63 - x, 63 - x, x, true, false);
In the case where a RISBG instruction is not supported by the architecture, we will get two shift instructions instead, which will result in 5 instructions instead of 4.
The commit for this change is here: c6f746f
// Create memory reference using tmpReg (shift amount), then shift valueReg by shift amount | ||
TR::MemoryReference * tempMR = generateS390MemoryReference(tmpReg, 0, cg); | ||
generateRSInstruction(cg, TR::InstOpCode::getShiftRightLogicalSingleOpCode(), node, tmpReg, valueReg, tempMR); | ||
if (node->getDataType() == TR::Int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to handle register pairs on 31-bit here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, we are no longer handling the 31-bit case. So the bitPermute IL will be disabled on those architectures. This is addressed here: 1807405
Introducing a branch here would have negative effect on performance as these operations are very fast. A shift and an OR is not expensive at all.
I assumed there was a flavor of |
if (length->getOpCode().isLoadConst()) | ||
{ | ||
// Manage the constant length case | ||
int arrayLen = length->getIntegerNodeValue<int>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try to avoid using primitive types as they have no guarantees on lower bound width. Use int32_t
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in: 62c4af1
// Zero result reg | ||
generateRRInstruction(cg, TR::InstOpCode::getXORRegOpCode(), node, resultReg, resultReg); | ||
|
||
if (length->getOpCode().isLoadConst()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see this become if (length->getOpCode().isLoadConst() && length->getIntegerNodeValue<int32_t>() <= bitPermuteConstantUnrollThreshold)
where this static const
is defined to be some sane value so we don't end up unrolling 32 or 64 times in the worst cases. We need to determine by the use case what that threshold should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Final note before merging I'd like to see a z14 path using |
return true; | ||
} | ||
|
||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just:
return TR::Compiler->target.is64Bit() || self()->use64BitRegsOn32Bit();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done:
e3cd762
|
||
TR::Register *tmpReg = cg->allocateRegister(TR_GPR); | ||
|
||
static const int8_t BIT_PERMUTE_CONSTANT_UNROLL_THRESHOLD = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a local variable and should be named as such (bitPermuteConstantUnrollThreshold
). I'd like to see some documentation about this variable and what it means along with why it is currently defined to be 4. If you've done empirical measurements to find out the best value I'd like to see the test posted to either the PR or a new issue and the comment here should link to where future users may reference and reproduce your results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed in : 95356e1
Code Documentation : 1d71fdf
Comments detailing empirical measurements: #2330 (comment)
#2330 (comment)
|
||
TR::Register *resultReg = cg->allocateRegister(TR_GPR); | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove the double space?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (length->getOpCode().isLoadConst() && length->getIntegerNodeValue<int32_t>() <= BIT_PERMUTE_CONSTANT_UNROLL_THRESHOLD) | ||
{ | ||
// Manage the constant length case | ||
int32_t arrayLen = length->getIntegerNodeValue<int>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we pull this outside of the above if and use arrayLen
instead of length->getIntegerNodeValue<int32_t>()
? Also this line seems to be extracting an int
but the type of the variable is int32_t
. Can we fix this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed here: 95356e1
/// for arrays of size 5 and greater) | ||
else if (cg->getS390ProcessorInfo()->supportsArch(TR_S390ProcessorInfo::TR_z14) && | ||
length->getOpCode().isLoadConst() && | ||
length->getIntegerNodeValue<int>() > BIT_PERMUTE_CONSTANT_UNROLL_THRESHOLD && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line shouldn't be necessary. It is always true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, some old code I forgot to remove. Fixed here: 5231065
length->getIntegerNodeValue<int>() <= 16 ) | ||
{ | ||
char mask[16] = {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0}; | ||
char inverse[16] = {63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63,63, 63}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be marked as const
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, marking these as const will require a const_cast
to remove the const
ness when calling findOrCreateConstant(node, mask, 16)
.
The correct solution would be to change those APIs to accept const void *
but this PR wouldn't be the right place for that. I'll open an issue to address this larger problem.
char mask[16] = {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0}; | ||
char inverse[16] = {63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63,63, 63}; | ||
|
||
int32_t arrayLen = length->getIntegerNodeValue<int>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can reuse this variable after making the above change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed: 95356e1
|
||
TR::Register *VectorIndices = cg->allocateRegister(TR_VRF); //V17 | ||
TR::Register *VectorSource = cg->allocateRegister(TR_VRF); //V16 and V18 | ||
TR::Register *tmpVector = cg->allocateRegister(TR_VRF); //V20 and V21 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do these comments mean? Also variable names should be camelCase to be consistent with the rest of the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stale comments I forgot to remove. Fixed: 76d2d7b
|
||
cg->stopUsingRegister(VectorIndices); | ||
cg->stopUsingRegister(VectorSource); | ||
cg->stopUsingRegister(tmpVector); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be a space missing here in these 3 lines (alignment with brace).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed: 76d2d7b
for (int j = 1; j < arrayLen; j++) | ||
{ | ||
resultMask = ( (resultMask << 1) | 0x1); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not replace the whole loop with resultMask = (1 << arrayLen) - 1;
- double check this is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed here: 35f3057
For future reference, I am pasting the assembly for each implementation below: z13/ loop unrolling technique
z14CurrentIL:
z14NewIL:
I'll post the benchmark I used personally to collect numbers for these assembly sequences shortly. |
For future reference, I wanted to add some context as to why a new IL was suggested. Background: On the IBM Z architecture, bit positions inside a register are numbered left to right. Meanwhile, on a Little Endian Architecture like Intel's X86 bit positions in a register are numbered right to left. The two images below show this. As can be seen above, for IBM Z's Big Endian architecture the MSB is in bit position 0. Meanwhile for the Little Endian case, the LSB is in bit position 0. Bytes within a register are also numbered in this fashion. The 0th byte on Z register is the leftmost byte, while the 0th byte on an x86 register will be the rightmost byte. This leads into the first problem: If we wanted to extract the 0th bit position (as per the bit permute IL), the result would be different based on which architecture we are on. The second problem is that the To solve these two problems, a new IL was proposed ("BigEndianIL"): One where the indices in the array are subtracted from 63 (so the correct positions are selected) and the order of the array is reversed (so the 0th byte in the original array will specify the value to put in the LSB of the result register). |
Benchmark: |
The failure in the build doesn't seem related. Here's the failure message:
|
@dchopra001 whenever this is ready for final review please remove the WIP and we'll go through it again. |
@fjeremic It should be ready for final review now. I've already addressed all requested changes from the last review. Removing the WIP tag. |
/// | ||
/// More info on the measurements is available here: https://github.com/eclipse/omr/pull/2330#issuecomment-378380538 | ||
/// The differences between the generated code for the different | ||
/// implementations can be seen here: https://github.com/eclipse/omr/pull/2330#issuecomment-378387516 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doxygen does not do anything for local comments so the triple slash is unnecessary for this entire change. Can we revert to double slash?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
/// Use Z14's VBPERM instruction if possible. (Note: VBPERM supports permutation on arrays | ||
/// of upto size 16, and beats the performance the loop unrolling technique used above | ||
/// for arrays of size 5 and greater) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be z14 and up to. I would avoid referencing hardcoded constants (of previously declared variables) in comments such as "5". It makes refactoring easier should this constant ever change as comments do not need to be updated, i.e. it avoids eventually having stale comments which can be quite deceptive to future readers of this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've instead referred to the constant that holds the threshold value now. That should prevent stale comments from occurring.
arrayLen <= 16 ) | ||
{ | ||
char mask[16] = {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0}; | ||
char inverse[16] = {63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63,63, 63}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space missing between second last element.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// The result of the VBPERM op is now in resultReg's 48-63 bit locations. Extract the bits you want via a mask | ||
// and store the final result in resultReg. This is necessary because VBPERM operates on 16 bit positions at a time. | ||
// If the array specified by the bitPermute IL was smaller than 16, then invalid bits can be selected. | ||
int resultMask = (1 << arrayLen) - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resultMask
should be of type int32_t
to match that of arrayLen
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. Just a couple of minor things to address and we can merge this. Good work! |
@fjeremic Let me know when you're ready to merge. I'll do a squash + rebase before this gets merged. |
@dchopra001 latest changes look good. Please squash the commits and we'll merge. @genie-omr build all |
Enable support for and add an evaluator for the BitPermute IL for the Z code generator. Currently the IL is only supported for 64-bit. Signed-off-by: Dhruv Chopra <Dhruv.C.Chopra@ibm.com>
c7895fe
to
8be2d1a
Compare
@fjeremic This should be good to merge now. |
@genie-omr build all |
Implement the bitPermute Evaluator for z/codegen
Signed-off-by: Dhruv Chopra Dhruv.C.Chopra@ibm.com