[lld][LoongArch] GOT indirection to PC relative optimization #123743
base: users/ylzsx/r-tlsdesc-to-iele-relax
Conversation
@llvm/pr-subscribers-backend-loongarch @llvm/pr-subscribers-lld-elf

Author: Zhaoxin Yang (ylzsx)

Changes

In LoongArch, this optimization is only supported when relaxation is enabled.
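The rewrite this patch performs is the following (mirroring the comment it adds in lld/ELF/Arch/LoongArch.cpp; `$a0` and `sym` are just the example register and symbol from that comment, and `ld.w/d`/`addi.w/d` denote the 32-/64-bit variants):

```asm
# From (GOT indirection):
pcalau12i $a0, %got_pc_hi20(sym_got)
ld.w/d    $a0, $a0, %got_pc_lo12(sym_got)
# To (PC relative):
pcalau12i $a0, %pc_hi20(sym)
addi.w/d  $a0, $a0, %pc_lo12(sym)
```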
If the original code sequence can be relaxed into a single instruction `pcaddi`, this patch will not be taken (see #123566).

FIXME: Although the optimization has been performed, the GOT entries still exist, similarly to AArch64. Eliminating the entries may require additional marking in the common code.

Full diff: https://github.com/llvm/llvm-project/pull/123743.diff

2 Files Affected:
diff --git a/lld/ELF/Arch/LoongArch.cpp b/lld/ELF/Arch/LoongArch.cpp
index 5f49b23e8ffb1a..6ae45e109a6dec 100644
--- a/lld/ELF/Arch/LoongArch.cpp
+++ b/lld/ELF/Arch/LoongArch.cpp
@@ -47,6 +47,8 @@ class LoongArch final : public TargetInfo {
void tlsIeToLe(uint8_t *loc, const Relocation &rel, uint64_t val) const;
void tlsdescToIe(uint8_t *loc, const Relocation &rel, uint64_t val) const;
void tlsdescToLe(uint8_t *loc, const Relocation &rel, uint64_t val) const;
+ bool tryGotToPCRel(uint8_t *loc, const Relocation &rHi20,
+ const Relocation &rLo12, uint64_t secAddr) const;
};
} // end anonymous namespace
@@ -1150,6 +1152,58 @@ void LoongArch::tlsdescToLe(uint8_t *loc, const Relocation &rel,
}
}
+// Try GOT indirection to PC relative optimization when relaxation is enabled.
+// From:
+// * pcalau12i $a0, %got_pc_hi20(sym_got)
+// * ld.w/d $a0, $a0, %got_pc_lo12(sym_got)
+// To:
+// * pcalau12i $a0, %pc_hi20(sym)
+// * addi.w/d $a0, $a0, %pc_lo12(sym)
+//
+// FIXME: Although the optimization has been performed, the GOT entries still
+// exist, similarly to AArch64. Eliminating the entries may require
+// additional marking in the common code.
+bool LoongArch::tryGotToPCRel(uint8_t *loc, const Relocation &rHi20,
+ const Relocation &rLo12, uint64_t secAddr) const {
+ if (!rHi20.sym->isDefined() || rHi20.sym->isPreemptible ||
+ rHi20.sym->isGnuIFunc() ||
+ (ctx.arg.isPic && !cast<Defined>(*rHi20.sym).section))
+ return false;
+
+ Symbol &sym = *rHi20.sym;
+ uint64_t symLocal = sym.getVA(ctx) + rHi20.addend;
+ // Check if the address difference is within +/-2GB range.
+ // For simplicity, the range mentioned here is an approximate estimate and is
+ // not fully equivalent to the entire region that PC-relative addressing can
+ // cover.
+ int64_t pageOffset =
+ getLoongArchPage(symLocal) - getLoongArchPage(secAddr + rHi20.offset);
+ if (!isInt<20>(pageOffset >> 12))
+ return false;
+
+ Relocation newRHi20 = {RE_LOONGARCH_PAGE_PC, R_LARCH_PCALA_HI20, rHi20.offset,
+ rHi20.addend, &sym};
+ Relocation newRLo12 = {R_ABS, R_LARCH_PCALA_LO12, rLo12.offset, rLo12.addend,
+ &sym};
+
+ const uint32_t currInsn = read32le(loc);
+ const uint32_t nextInsn = read32le(loc + 4);
+  // Check that both instructions use the same register.
+ if (getD5(currInsn) != getJ5(nextInsn) || getJ5(nextInsn) != getD5(nextInsn))
+ return false;
+
+ uint64_t pageDelta =
+ getLoongArchPageDelta(symLocal, secAddr + rHi20.offset, rHi20.type);
+ // pcalau12i $a0, %pc_hi20
+ write32le(loc, insn(PCALAU12I, getD5(currInsn), 0, 0));
+ relocate(loc, newRHi20, pageDelta);
+ // addi.w/d $a0, $a0, %pc_lo12
+ write32le(loc + 4, insn(ctx.arg.is64 ? ADDI_D : ADDI_W, getD5(nextInsn),
+ getJ5(nextInsn), 0));
+ relocate(loc + 4, newRLo12, SignExtend64(symLocal, 64));
+ return true;
+}
+
// During TLSDESC GD_TO_IE, the converted code sequence always includes an
// instruction related to the Lo12 relocation (ld.[wd]). To obtain correct val
// in `getRelocTargetVA`, expr of this instruction should be adjusted to
@@ -1259,6 +1313,22 @@ void LoongArch::relocateAlloc(InputSectionBase &sec, uint8_t *buf) const {
tlsdescToLe(loc, rel, val);
}
continue;
+ case RE_LOONGARCH_GOT_PAGE_PC:
+ // In LoongArch, we try GOT indirection to PC relative optimization only
+ // when relaxation is enabled. This approach avoids determining whether
+ // relocation types are paired and whether the destination register of
+ // pcalau12i is only used by the immediately following instruction.
+ // Moreover, if the original code sequence can be relaxed to a single
+ // instruction `pcaddi`, the first instruction will be removed and it will
+ // not reach here.
+ if (isPairRelaxable(relocs, i) && rel.type == R_LARCH_GOT_PC_HI20 &&
+ relocs[i + 2].type == R_LARCH_GOT_PC_LO12 &&
+ tryGotToPCRel(loc, rel, relocs[i + 2], secAddr)) {
+ i = i + 3; // skip relocations R_LARCH_RELAX, R_LARCH_GOT_PC_LO12,
+ // R_LARCH_RELAX
+ continue;
+ }
+ break;
default:
break;
}
diff --git a/lld/test/ELF/loongarch-relax-pc-hi20-lo12.s b/lld/test/ELF/loongarch-relax-pc-hi20-lo12.s
index 760fe77d774e30..ae3b29e14fb3c1 100644
--- a/lld/test/ELF/loongarch-relax-pc-hi20-lo12.s
+++ b/lld/test/ELF/loongarch-relax-pc-hi20-lo12.s
@@ -30,24 +30,26 @@
## offset = 0x410000 - 0x10000: 0x400 pages, page offset 0
# NORELAX32-NEXT: 10000: pcalau12i $a0, 1024
# NORELAX32-NEXT: addi.w $a0, $a0, 0
+## Not relaxed; converted to PCRel.
# NORELAX32-NEXT: pcalau12i $a0, 1024
-# NORELAX32-NEXT: ld.w $a0, $a0, 4
+# NORELAX32-NEXT: addi.w $a0, $a0, 0
# NORELAX32-NEXT: pcalau12i $a0, 1024
# NORELAX32-NEXT: addi.w $a0, $a0, 0
# NORELAX32-NEXT: pcalau12i $a0, 1024
-# NORELAX32-NEXT: ld.w $a0, $a0, 4
+# NORELAX32-NEXT: addi.w $a0, $a0, 0
# NORELAX64-LABEL: <_start>:
## offset exceed range of pcaddi
## offset = 0x410000 - 0x10000: 0x400 pages, page offset 0
# NORELAX64-NEXT: 10000: pcalau12i $a0, 1024
# NORELAX64-NEXT: addi.d $a0, $a0, 0
+## Not relaxed; converted to PCRel.
# NORELAX64-NEXT: pcalau12i $a0, 1024
-# NORELAX64-NEXT: ld.d $a0, $a0, 8
+# NORELAX64-NEXT: addi.d $a0, $a0, 0
# NORELAX64-NEXT: pcalau12i $a0, 1024
# NORELAX64-NEXT: addi.d $a0, $a0, 0
# NORELAX64-NEXT: pcalau12i $a0, 1024
-# NORELAX64-NEXT: ld.d $a0, $a0, 8
+# NORELAX64-NEXT: addi.d $a0, $a0, 0
.section .text
.global _start
// additional marking in the common code.
bool LoongArch::tryGotToPCRel(uint8_t *loc, const Relocation &rHi20,
                              const Relocation &rLo12, uint64_t secAddr) const {
  if (!rHi20.sym->isDefined() || rHi20.sym->isPreemptible ||
We need symbol tests to test each condition here. aarch64-adrp-ldr-got-symbols.s has an example
Thanks. I have added this test in a previous patch (#123566).
lld/ELF/Arch/LoongArch.cpp
Outdated
// * pcalau12i $a0, %pc_hi20(sym)
// * addi.w/d $a0, $a0, %pc_lo12(sym)
//
// FIXME: Although the optimization has been performed, the GOT entries still
Delete FIXME. While the GOT entries are not eliminated, this might actually be desired to reduce code complexity.
Thanks, I will change it as follows:
Note: Although the optimization has been performed, the GOT entries still exist, similarly to AArch64. Eliminating the entries will increase code complexity.
Can you drop the trailing |
Force-pushed 30a9eb5 to e3fc1d6
Force-pushed e024b7c to 99a1e07
Force-pushed e3fc1d6 to 47d84d9
cc @xen0n
Support TLSDESC to initial-exec or local-exec optimizations. Introduce a new hook RE_LOONGARCH_RELAX_TLS_GD_TO_IE_PAGE_PC and use the existing R_RELAX_TLS_GD_TO_IE_ABS to support TLSDESC => IE, while using the existing R_RELAX_TLS_GD_TO_LE to support TLSDESC => LE. In the normal or medium code model, there are two forms of code sequences:
* pcalau12i $a0, %desc_pc_hi20(sym_desc)
* addi.d $a0, $a0, %desc_pc_lo12(sym_desc)
* ld.d $ra, $a0, %desc_ld(sym_desc)
* jirl $ra, $ra, %desc_call(sym_desc)
------
* pcaddi $a0, %desc_pcrel_20(sym_desc)
* ld.d $ra, $a0, %desc_ld(sym_desc)
* jirl $ra, $ra, %desc_call(sym_desc)
The code sequence obtained for IE is as follows:
* pcalau12i $a0, %ie_pc_hi20(sym_ie)
* ld.[wd] $a0, $a0, %ie_pc_lo12(sym_ie)
For simplicity, whether in tlsdescToIe or tlsdescToLe, we always convert the preceding instructions to NOPs, because both forms of code sequence (corresponding to the relocation combinations R_LARCH_TLS_DESC_PC_HI20+R_LARCH_TLS_DESC_PC_LO12 and R_LARCH_TLS_DESC_PCREL20_S2) then share the same process. FIXME: When relaxation is enabled, redundant NOPs can be removed. This will be implemented in a future patch. Note: The different forms of TLSDESC code sequences should not appear interleaved in the normal, medium or extreme code model; compilers do not generate such code and lld does not support it. This is thanks to the guard in PostRASchedulerList.cpp in llvm:
```
Calls are not scheduling boundaries before register allocation, but post-ra we don't gain anything by scheduling across calls since we don't need to worry about register pressure.
```
Add loongarch-relax-tlsdesc.s
Complement https://. When relaxation is enabled, remove redundant NOPs.
In LoongArch, this optimization is only supported when relaxation is enabled.
From:
* pcalau12i $a0, %got_pc_hi20(sym_got)
* ld.w/d $a0, $a0, %got_pc_lo12(sym_got)
To:
* pcalau12i $a0, %pc_hi20(sym)
* addi.w/d $a0, $a0, %pc_lo12(sym)
If the original code sequence can be relaxed into a single instruction `pcaddi`, this patch will not be taken (see https://). The implementation related to `got` is split into two locations because the `relax()` function is part of an iteration fixed-point algorithm. We should minimize it to achieve better linker performance.
FIXME: Although the optimization has been performed, the GOT entries still exist, similarly to AArch64. Eliminating the entries may require additional marking in the common code.
Force-pushed 99a1e07 to f74a55b
Force-pushed 47d84d9 to 2d92dc3
Force-pushed f74a55b to fd76622
In LoongArch, this optimization is only supported when relaxation is enabled.
From:
* pcalau12i $a0, %got_pc_hi20(sym_got)
* ld.w/d $a0, $a0, %got_pc_lo12(sym_got)
To:
* pcalau12i $a0, %pc_hi20(sym)
* addi.w/d $a0, $a0, %pc_lo12(sym)
If the original code sequence can be relaxed into a single instruction `pcaddi`, this patch will not be taken (see #123566).
The implementation related to `got` is split into two locations because the `relax()` function is part of an iteration fixed-point algorithm. We should minimize it to achieve better linker performance.
Note: Although the optimization has been performed, the GOT entries still exist, similarly to AArch64. Eliminating the entries will increase code complexity.
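For illustration, a minimal input sketch, assuming typical code generation (the `la.global` pseudo-instruction and the symbol names here are assumptions for this example, not taken from the PR): `la.global` expands to the GOT-indirect pair that this patch can rewrite, and assembling with relaxation enabled emits the paired R_LARCH_RELAX relocations the optimization requires.

```asm
.section .text
.global _start
_start:
  # la.global expands to roughly:
  #   pcalau12i $a0, %got_pc_hi20(sym)
  #   ld.d      $a0, $a0, %got_pc_lo12(sym)
  # If sym ends up non-preemptible and within roughly +/-2GB of this code
  # (but out of pcaddi range), lld rewrites the pair to:
  #   pcalau12i $a0, %pc_hi20(sym)
  #   addi.d    $a0, $a0, %pc_lo12(sym)
  la.global $a0, sym

.section .data
.global sym
sym:
  .word 42
```

The result can be inspected with llvm-objdump -d on the linked output, as the updated test above does.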