Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WebAssembly] Support multiple .init_array fragments when writing Wasm objects #111008

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

georgestagg
Copy link

For WebAssembly objects, only functions listed in the first symbol in the .init_array section are invoked.

Consider the C program:

#import <stdio.h>

void init1() { puts("init1"); }
void init2() { puts("init2"); }
void init3() { puts("init3"); }

typedef void (*f)(void);
__attribute__((section(".init_array"))) void (*p_init1)(void) = &init1;
__attribute__((section(".init_array"))) f p_init2[2] = { &init2, &init3 };

int main() {
	puts("main");
	return 0;
}

Compiling for Linux results in each function running before main(). But for WebAssembly (here via Emscripten) only the first function is invoked:

$ clang-20 -c test.c -o test.o
$ clang-20 test.o -o test        
$ ./test
init1
init2
init3
main

$ EM_LLVM_ROOT=/lib/llvm-20/bin emcc -c test.c -o test.o
$ EM_LLVM_ROOT=/lib/llvm-20/bin emcc test.o -o test.js
$ node test.js
init1
main

In previous LLVM versions (<=18, I think?) this also used to throw an error message: fatal error: error in backend: only one .init_array section fragment supported.

This makes things difficult for certain tools that rely on writing multiple objects in the .init_array section for life-before-main initialisation functions.


This PR makes two changes to WasmObjectWriter:

  1. The existing function that converts entries in the .init_array section into start functions, (adding them to an InitFuncs vector for writing out later), now loops through all the subsequent fragments of the section until there are none remaining.

This picks up any functions listed in later objects in the .init_array section, and they are now successfully invoked before main.

However, a related issue is revealed by changing the functions in the example above to read:

void init1() { puts("1"); }
void init2() { puts("2"); }
void init3() { puts("3"); }

Then, with our patched LLVM,

-> $ EM_LLVM_ROOT=/[...]/llvm/bin emcc test.c -o test.js
wasm-ld: error: [...]/test_0.o: invalid data symbol offset: `p_init2` (offset: 4 segment size: 2)
emcc: error: '/[...]/llvm/bin/wasm-ld -o test.wasm [...]

The symbols in the .init_array section are written to the WASM_SYMBOL_TABLE in the custom linking section, with kind WASM_SYMBOL_TYPE_DATA. But, these symbols do not reference content in the DATA section, and so the written offset makes no sense.

Is this a mistake? AFAICT I don't think there's an appropriate kind for these symbols to be written to the WASM_SYMBOL_TABLE subsection with. In fact, symbols for the functions referenced by the .init_array sections are independently written to a different WASM_INIT_FUNC subsection in any case.

Interestingly, you can also reveal this problem in an unpatched LLVM by writing a Wasm object file that has no data section but does supply an .init_array function:

void init1() {}
__attribute__((section(".init_array"))) void (*p_init1)(void) = &init1;
int main() { return 0; }
$ clang-20 bug.c -o bug
$ ./bug

$ EM_LLVM_ROOT=/lib/llvm-20/bin emcc bug.c -o bug.js
wasm-ld: error: /[...]/bug_0.o: invalid data segment index: 0

So, this leads me to the related second change:

  1. Do not write symbols that have been defined in the .init_array section to the WASM_SYMBOL_TABLE subsection of the linking custom section, because these are handled elsewhere and written to the WASM_INIT_FUNCS subsection instead.

With the two changes above, the modified test program above compiles with Emscripten and runs on Wasm with the expected behaviour:

$ EM_LLVM_ROOT=/[...]/llvm/bin emcc test.c -o test.js
$ node test.js                                                                      
1
2
3
main

Finally, I have tried to include a unit test. Hopefully it looks reasonable -- it's based on copying similar test files in that directory.

Copy link

github-actions bot commented Oct 3, 2024

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot llvmbot added backend:WebAssembly mc Machine (object) code labels Oct 3, 2024
@llvmbot
Copy link
Collaborator

llvmbot commented Oct 3, 2024

@llvm/pr-subscribers-backend-webassembly

@llvm/pr-subscribers-mc

Author: George Stagg (georgestagg)

Changes

For WebAssembly objects, only functions listed in the first symbol in the .init_array section are invoked.

Consider the C program:

#import &lt;stdio.h&gt;

void init1() { puts("init1"); }
void init2() { puts("init2"); }
void init3() { puts("init3"); }

typedef void (*f)(void);
__attribute__((section(".init_array"))) void (*p_init1)(void) = &amp;init1;
__attribute__((section(".init_array"))) f p_init2[2] = { &amp;init2, &amp;init3 };

int main() {
	puts("main");
	return 0;
}

Compiling for Linux results in each function running before main(). But for WebAssembly (here via Emscripten) only the first function is invoked:

$ clang-20 -c test.c -o test.o
$ clang-20 test.o -o test        
$ ./test
init1
init2
init3
main

$ EM_LLVM_ROOT=/lib/llvm-20/bin emcc -c test.c -o test.o
$ EM_LLVM_ROOT=/lib/llvm-20/bin emcc test.o -o test.js
$ node test.js
init1
main

In previous LLVM versions (<=18, I think?) this also used to throw an error message: fatal error: error in backend: only one .init_array section fragment supported.

This makes things difficult for certain tools that rely on writing multiple objects in the .init_array section for life-before-main initialisation functions.


This PR makes two changes to WasmObjectWriter:

  1. The existing function that converts entries in the .init_array section into start functions, (adding them to an InitFuncs vector for writing out later), now loops through all the subsequent fragments of the section until there are none remaining.

This picks up any functions listed in later objects in the .init_array section, and they are now successfully invoked before main.

However, a related issue is revealed by changing the functions in the example above to read:

void init1() { puts("1"); }
void init2() { puts("2"); }
void init3() { puts("3"); }

Then, with our patched LLVM,

-&gt; $ EM_LLVM_ROOT=/[...]/llvm/bin emcc test.c -o test.js
wasm-ld: error: [...]/test_0.o: invalid data symbol offset: `p_init2` (offset: 4 segment size: 2)
emcc: error: '/[...]/llvm/bin/wasm-ld -o test.wasm [...]

The symbols in the .init_array section are written to the WASM_SYMBOL_TABLE in the custom linking section, with kind WASM_SYMBOL_TYPE_DATA. But, these symbols do not reference content in the DATA section, and so the written offset makes no sense.

Is this a mistake? AFAICT I don't think there's an appropriate kind for these symbols to be written to the WASM_SYMBOL_TABLE subsection with. In fact, symbols for the functions referenced by the .init_array sections are independently written to a different WASM_INIT_FUNC subsection in any case.

Interestingly, you can also reveal this problem in an unpatched LLVM by writing a Wasm object file that has no data section but does supply an .init_array function:

void init1() {}
__attribute__((section(".init_array"))) void (*p_init1)(void) = &amp;init1;
int main() { return 0; }
$ clang-20 bug.c -o bug
$ ./bug

$ EM_LLVM_ROOT=/lib/llvm-20/bin emcc bug.c -o bug.js
wasm-ld: error: /[...]/bug_0.o: invalid data segment index: 0

So, this leads me to the related second change:

  1. Do not write symbols that have been defined in the .init_array section to the WASM_SYMBOL_TABLE subsection of the linking custom section, because these are handled elsewhere and written to the WASM_INIT_FUNCS subsection instead.

With the two changes above, the modified test program above compiles with Emscripten and runs on Wasm with the expected behaviour:

$ EM_LLVM_ROOT=/[...]/llvm/bin emcc test.c -o test.js
$ node test.js                                                                      
1
2
3
main

Finally, I have tried to include a unit test. Hopefully it looks reasonable -- it's based on copying similar test files in that directory.


Full diff: https://github.com/llvm/llvm-project/pull/111008.diff

2 Files Affected:

  • (modified) llvm/lib/MC/WasmObjectWriter.cpp (+52-42)
  • (added) llvm/test/MC/WebAssembly/init-array.ll (+56)
diff --git a/llvm/lib/MC/WasmObjectWriter.cpp b/llvm/lib/MC/WasmObjectWriter.cpp
index f25dc92fa235a2..034e058690744a 100644
--- a/llvm/lib/MC/WasmObjectWriter.cpp
+++ b/llvm/lib/MC/WasmObjectWriter.cpp
@@ -1769,6 +1769,11 @@ uint64_t WasmObjectWriter::writeOneObject(MCAssembler &Asm,
       WS.setIndex(InvalidIndex);
       continue;
     }
+    // Contents of .init_array sections are handled elsewhere.
+    if (WS.isDefined() &&
+        WS.getSection().getName().starts_with(".init_array")) {
+      continue;
+    }
     LLVM_DEBUG(dbgs() << "adding to symtab: " << WS << "\n");
 
     uint32_t Flags = 0;
@@ -1853,49 +1858,54 @@ uint64_t WasmObjectWriter::writeOneObject(MCAssembler &Asm,
     if (EmptyFrag.getKind() != MCFragment::FT_Data)
       report_fatal_error(".init_array section should be aligned");
 
-    const MCFragment &AlignFrag = *EmptyFrag.getNext();
-    if (AlignFrag.getKind() != MCFragment::FT_Align)
-      report_fatal_error(".init_array section should be aligned");
-    if (cast<MCAlignFragment>(AlignFrag).getAlignment() !=
-        Align(is64Bit() ? 8 : 4))
-      report_fatal_error(".init_array section should be aligned for pointers");
-
-    const MCFragment &Frag = *AlignFrag.getNext();
-    if (Frag.hasInstructions() || Frag.getKind() != MCFragment::FT_Data)
-      report_fatal_error("only data supported in .init_array section");
-
-    uint16_t Priority = UINT16_MAX;
-    unsigned PrefixLength = strlen(".init_array");
-    if (WS.getName().size() > PrefixLength) {
-      if (WS.getName()[PrefixLength] != '.')
+    const MCFragment *nextFrag = EmptyFrag.getNext();
+    while (nextFrag != nullptr) {
+      const MCFragment &AlignFrag = *nextFrag;
+      if (AlignFrag.getKind() != MCFragment::FT_Align)
+        report_fatal_error(".init_array section should be aligned");
+      if (cast<MCAlignFragment>(AlignFrag).getAlignment() !=
+          Align(is64Bit() ? 8 : 4))
         report_fatal_error(
-            ".init_array section priority should start with '.'");
-      if (WS.getName().substr(PrefixLength + 1).getAsInteger(10, Priority))
-        report_fatal_error("invalid .init_array section priority");
-    }
-    const auto &DataFrag = cast<MCDataFragment>(Frag);
-    const SmallVectorImpl<char> &Contents = DataFrag.getContents();
-    for (const uint8_t *
-             P = (const uint8_t *)Contents.data(),
-            *End = (const uint8_t *)Contents.data() + Contents.size();
-         P != End; ++P) {
-      if (*P != 0)
-        report_fatal_error("non-symbolic data in .init_array section");
-    }
-    for (const MCFixup &Fixup : DataFrag.getFixups()) {
-      assert(Fixup.getKind() ==
-             MCFixup::getKindForSize(is64Bit() ? 8 : 4, false));
-      const MCExpr *Expr = Fixup.getValue();
-      auto *SymRef = dyn_cast<MCSymbolRefExpr>(Expr);
-      if (!SymRef)
-        report_fatal_error("fixups in .init_array should be symbol references");
-      const auto &TargetSym = cast<const MCSymbolWasm>(SymRef->getSymbol());
-      if (TargetSym.getIndex() == InvalidIndex)
-        report_fatal_error("symbols in .init_array should exist in symtab");
-      if (!TargetSym.isFunction())
-        report_fatal_error("symbols in .init_array should be for functions");
-      InitFuncs.push_back(
-          std::make_pair(Priority, TargetSym.getIndex()));
+            ".init_array section should be aligned for pointers");
+
+      const MCFragment &Frag = *AlignFrag.getNext();
+      nextFrag = Frag.getNext();
+      if (Frag.hasInstructions() || Frag.getKind() != MCFragment::FT_Data)
+        report_fatal_error("only data supported in .init_array section");
+
+      uint16_t Priority = UINT16_MAX;
+      unsigned PrefixLength = strlen(".init_array");
+      if (WS.getName().size() > PrefixLength) {
+        if (WS.getName()[PrefixLength] != '.')
+          report_fatal_error(
+              ".init_array section priority should start with '.'");
+        if (WS.getName().substr(PrefixLength + 1).getAsInteger(10, Priority))
+          report_fatal_error("invalid .init_array section priority");
+      }
+      const auto &DataFrag = cast<MCDataFragment>(Frag);
+      const SmallVectorImpl<char> &Contents = DataFrag.getContents();
+      for (const uint8_t *
+               P = (const uint8_t *)Contents.data(),
+              *End = (const uint8_t *)Contents.data() + Contents.size();
+           P != End; ++P) {
+        if (*P != 0)
+          report_fatal_error("non-symbolic data in .init_array section");
+      }
+      for (const MCFixup &Fixup : DataFrag.getFixups()) {
+        assert(Fixup.getKind() ==
+               MCFixup::getKindForSize(is64Bit() ? 8 : 4, false));
+        const MCExpr *Expr = Fixup.getValue();
+        auto *SymRef = dyn_cast<MCSymbolRefExpr>(Expr);
+        if (!SymRef)
+          report_fatal_error(
+              "fixups in .init_array should be symbol references");
+        const auto &TargetSym = cast<const MCSymbolWasm>(SymRef->getSymbol());
+        if (TargetSym.getIndex() == InvalidIndex)
+          report_fatal_error("symbols in .init_array should exist in symtab");
+        if (!TargetSym.isFunction())
+          report_fatal_error("symbols in .init_array should be for functions");
+        InitFuncs.push_back(std::make_pair(Priority, TargetSym.getIndex()));
+      }
     }
   }
 
diff --git a/llvm/test/MC/WebAssembly/init-array.ll b/llvm/test/MC/WebAssembly/init-array.ll
new file mode 100644
index 00000000000000..d19ddcf10d5d9c
--- /dev/null
+++ b/llvm/test/MC/WebAssembly/init-array.ll
@@ -0,0 +1,56 @@
+; RUN: llc -mcpu=mvp -filetype=obj %s -o - | obj2yaml | FileCheck %s
+
+target triple = "wasm32-unknown-unknown"
+
+@p_init1 = hidden global ptr @init1, section ".init_array", align 4
+@p_init2 = hidden global ptr @init2, section ".init_array", align 4
+
+define hidden void @init1() #0 { ret void }
+define hidden void @init2() #0 { ret void }
+
+
+; CHECK:        - Type:            IMPORT
+; CHECK-NEXT:     Imports:
+; CHECK-NEXT:       - Module:          env
+; CHECK-NEXT:         Field:           __linear_memory
+; CHECK-NEXT:         Kind:            MEMORY
+; CHECK-NEXT:         Memory:
+; CHECK-NEXT:           Minimum:         0x0
+; CHECK-NEXT:       - Module:          env
+; CHECK-NEXT:         Field:           __indirect_function_table
+; CHECK-NEXT:         Kind:            TABLE
+; CHECK-NEXT:         Table:
+; CHECK-NEXT:           Index:           0
+; CHECK-NEXT:           ElemType:        FUNCREF
+; CHECK-NEXT:           Limits:
+; CHECK-NEXT:             Minimum:         0x0
+; CHECK-NEXT:   - Type:            FUNCTION
+; CHECK-NEXT:     FunctionTypes:   [ 0, 0 ]
+; CHECK-NEXT:   - Type:            CODE
+; CHECK-NEXT:     Functions:
+; CHECK-NEXT:       - Index:           0
+; CHECK-NEXT:         Locals:          []
+; CHECK-NEXT:         Body:            0B
+; CHECK-NEXT:       - Index:           1
+; CHECK-NEXT:         Locals:          []
+; CHECK-NEXT:         Body:            0B
+; CHECK-NEXT:    - Type:            CUSTOM
+; CHECK-NEXT:      Name:            linking
+; CHECK-NEXT:      Version:         2
+; CHECK-NEXT:      SymbolTable:
+; CHECK-NEXT:        - Index:           0
+; CHECK-NEXT:          Kind:            FUNCTION
+; CHECK-NEXT:          Name:            init1
+; CHECK-NEXT:          Flags:           [ VISIBILITY_HIDDEN ]
+; CHECK-NEXT:          Function:        0
+; CHECK-NEXT:        - Index:           1
+; CHECK-NEXT:          Kind:            FUNCTION
+; CHECK-NEXT:          Name:            init2
+; CHECK-NEXT:          Flags:           [ VISIBILITY_HIDDEN ]
+; CHECK-NEXT:          Function:        1
+; CHECK-NEXT:      InitFunctions:
+; CHECK-NEXT:        - Priority:        65535
+; CHECK-NEXT:          Symbol:          0
+; CHECK-NEXT:        - Priority:        65535
+; CHECK-NEXT:          Symbol:          1
+; CHECK-NEXT: ...

@hoodmane
Copy link

@sbc100 who should we ask to review this?

@hoodmane
Copy link

From git blame looks like @sbc100 would be the most appropriate reviewer.

@sbc100 sbc100 self-requested a review October 11, 2024 19:48
Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable. I suppose that C++ and rust always generate just a single fragment which is why we haven't seen issues with this before?

@@ -1769,6 +1769,11 @@ uint64_t WasmObjectWriter::writeOneObject(MCAssembler &Asm,
WS.setIndex(InvalidIndex);
continue;
}
// Contents of .init_array sections are handled elsewhere.
if (WS.isDefined() &&
WS.getSection().getName().starts_with(".init_array")) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it possible to define a symbol within this section? Can you give and example? It seems like it should be just a sequence of pointers, but I guess somehow you can have symbol that points into that sequence?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the following C code does so:

#import <stdio.h>

void init1() { puts("init1"); }
__attribute__((section(".init_array"))) void (*p_init[3])(void) = { &init1, &init1, &init1 };

int main() { return 0; }

Under (e.g.) Linux, a list of pointers is written to the .init_array section and a symbol p_init points to the sequence.

$ clang -c symbol.c -o symbol.o
$ obj2yaml symbol.o
[...]
Symbols:
[...]
  - Name:            p_init
    Type:            STT_OBJECT
    Section:         .init_array
    Binding:         STB_GLOBAL
    Size:            0x18

However, with a WebAssembly object this does not make as much sense. WasmObjectWriter.cpp converts the sequence of pointers in .init_array section into entries in the WASM_INIT_FUNC subsection of the custom linking section instead. There they are no longer written as a list of pointers, but indexes into a table.

So, this particular change makes it so that the p_init symbol is not written into the WebAssembly object file. In the current version a symbol is actually written, but it is broken in any case since it points to the incorrect section (leading to the invalid data segment index: 0 bug shown in the OP).

@sbc100
Copy link
Collaborator

sbc100 commented Oct 11, 2024

BTW viewing this diff with "ignore whitespace" helped a lot to see what was going on.

Regarding the test case, can it be written in assembly instead?

@georgestagg
Copy link
Author

I suppose that C++ and rust always generate just a single fragment which is why we haven't seen issues with this before?

I guess so, let me throw together a quick C++ example and see how it currently handles the constructors...

First, compile and test this C++ code:
#include <iostream>
using namespace std;

class Init1 {
  public: Init1() { cout << "Init1" << endl; }
};

class Init2 {
  public: Init2() { cout << "Init2" << endl; }
};

Init1 init1a;
Init1 init1b;
Init2 init2;

int main() {
  cout << "main" << endl;
  return 0;
}
$ em++ -S -emit-llvm -c test.cpp -o test.ll
$ em++ -c test.cpp -o test.o
$ em++ test.o -o test.js

$ node test.js
Init1
Init1
Init2
main

Let's make sure the WASM_INIT_FUNC subsection is populated in the object test.o:

$ /[...]/bin/obj2yaml test.o
[...]
  - Type:            CUSTOM
    Name:            linking
    Version:         2
    SymbolTable:
[...]
      - Index:           68
        Kind:            FUNCTION
        Name:            _GLOBAL__sub_I_test.cpp
        Flags:           [ BINDING_LOCAL ]
[...]
    InitFunctions:
      - Priority:        65535
        Symbol:          68
[...]

Yes, but just with a single function. I suppose the constructor functions have been coalesced into a single function. Let's take a look in test.ll:

@llvm.global_ctors = appending global [1 x { i32, ptr, ptr }] [{ i32, ptr, ptr } { i32 65535, ptr @_GLOBAL__sub_I_test.cpp, ptr null }]

[...]
; Function Attrs: noinline
define internal void @_GLOBAL__sub_I_test.cpp() #0 {
  call void @__cxx_global_var_init()
  call void @__cxx_global_var_init.1()
  call void @__cxx_global_var_init.2()
  ret void
}

Indeed, the global constructors are combined into a single function. Note that LLVM generates code using @llvm.global_ctors, rather than adding to the .init_array section, presumably because .init_array is not portable.

Yes, at least for simple C++.

BTW viewing this diff with "ignore whitespace" helped a lot to see what was going on.

Yes, apologies. The diff is smaller than it looks due to the change in indentation!

Regarding the test case, can it be written in assembly instead?

Unfortunately, I'm not sure how to express this in WebAssembly text format directly. There is an open issue about defining custom sections (WebAssembly/design#1153) in the text format and I'm not entirely sure how to express the test, since the bug occurs in the code that converts the .init_array pointers into the format required for the custom linking section.

I would be willing to change the test, but would need specific help in how to proceed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:WebAssembly mc Machine (object) code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants