Skip to content

[StrTable] Use string literal emission for intrinsics on non-MSVC platforms #124856

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Apr 13, 2025

Conversation

rnk
Copy link
Collaborator

@rnk rnk commented Jan 28, 2025

This mainly transitions the LLVM intrinsic string table from character emission to string literal emission, which I confirmed happens for me locally.

I moved the guts of StringToOffsetTable to a cpp file so I could move the EmitLongStrLiterals cl::opt global to a non-vague linkage home in the TableGen library. I had to add missing FormatVariadic.h includes to account for moving other includes to a cpp file.

This mainly transitions the LLVM intrinsic string table from character
emission to string literal emission.

Add missing FormatVariadic.h includes to account for moving the include
to a cpp file.
@llvmbot
Copy link
Member

llvmbot commented Jan 28, 2025

@llvm/pr-subscribers-tablegen

Author: Reid Kleckner (rnk)

Changes

This mainly transitions the LLVM intrinsic string table from character emission to string literal emission, which I confirmed happens for me locally.

I moved the guts of StringToOffsetTable to a cpp file so I could move the EmitLongStrLiterals cl::opt global to a non-vague linkage home in the TableGen library. I had to add missing FormatVariadic.h includes to account for moving other includes to a cpp file.


Full diff: https://github.com/llvm/llvm-project/pull/124856.diff

7 Files Affected:

  • (modified) llvm/include/llvm/TableGen/StringToOffsetTable.h (+3-101)
  • (modified) llvm/lib/TableGen/CMakeLists.txt (+1)
  • (added) llvm/lib/TableGen/StringToOffsetTable.cpp (+129)
  • (modified) llvm/utils/TableGen/AsmMatcherEmitter.cpp (+1)
  • (modified) llvm/utils/TableGen/Basic/TableGen.cpp (-9)
  • (modified) llvm/utils/TableGen/SDNodeInfoEmitter.cpp (+1)
  • (modified) llvm/utils/gn/secondary/llvm/lib/TableGen/BUILD.gn (+1)
diff --git a/llvm/include/llvm/TableGen/StringToOffsetTable.h b/llvm/include/llvm/TableGen/StringToOffsetTable.h
index e716411514bd63..21795644d4bd67 100644
--- a/llvm/include/llvm/TableGen/StringToOffsetTable.h
+++ b/llvm/include/llvm/TableGen/StringToOffsetTable.h
@@ -12,8 +12,6 @@
 #include "llvm/ADT/SmallString.h"
 #include "llvm/ADT/StringExtras.h"
 #include "llvm/ADT/StringMap.h"
-#include "llvm/Support/FormatVariadic.h"
-#include "llvm/Support/raw_ostream.h"
 #include <optional>
 
 namespace llvm {
@@ -36,17 +34,7 @@ class StringToOffsetTable {
   bool empty() const { return StringOffset.empty(); }
   size_t size() const { return AggregateString.size(); }
 
-  unsigned GetOrAddStringOffset(StringRef Str, bool appendZero = true) {
-    auto [II, Inserted] = StringOffset.insert({Str, size()});
-    if (Inserted) {
-      // Add the string to the aggregate if this is the first time found.
-      AggregateString.append(Str.begin(), Str.end());
-      if (appendZero)
-        AggregateString += '\0';
-    }
-
-    return II->second;
-  }
+  unsigned GetOrAddStringOffset(StringRef Str, bool appendZero = true);
 
   // Returns the offset of `Str` in the table if its preset, else return
   // std::nullopt.
@@ -69,96 +57,10 @@ class StringToOffsetTable {
   // `static` and `constexpr`. Both `Name` and (`Name` + "Storage") must be
   // valid identifiers to declare.
   void EmitStringTableDef(raw_ostream &OS, const Twine &Name,
-                          const Twine &Indent = "") const {
-    OS << formatv(R"(
-#ifdef __GNUC__
-#pragma GCC diagnostic push
-#pragma GCC diagnostic ignored "-Woverlength-strings"
-#endif
-{0}static constexpr char {1}Storage[] = )",
-                  Indent, Name);
-
-    // MSVC silently miscompiles string literals longer than 64k in some
-    // circumstances. When the string table is longer, emit it as an array of
-    // character literals.
-    bool UseChars = AggregateString.size() > (64 * 1024);
-    OS << (UseChars ? "{\n" : "\n");
-
-    llvm::ListSeparator LineSep(UseChars ? ",\n" : "\n");
-    llvm::SmallVector<StringRef> Strings(split(AggregateString, '\0'));
-    // We should always have an empty string at the start, and because these are
-    // null terminators rather than separators, we'll have one at the end as
-    // well. Skip the end one.
-    assert(Strings.front().empty() && "Expected empty initial string!");
-    assert(Strings.back().empty() &&
-           "Expected empty string at the end due to terminators!");
-    Strings.pop_back();
-    for (StringRef Str : Strings) {
-      OS << LineSep << Indent << "  ";
-      // If we can, just emit this as a string literal to be concatenated.
-      if (!UseChars) {
-        OS << "\"";
-        OS.write_escaped(Str);
-        OS << "\\0\"";
-        continue;
-      }
-
-      llvm::ListSeparator CharSep(", ");
-      for (char C : Str) {
-        OS << CharSep << "'";
-        OS.write_escaped(StringRef(&C, 1));
-        OS << "'";
-      }
-      OS << CharSep << "'\\0'";
-    }
-    OS << LineSep << Indent << (UseChars ? "};" : "  ;");
-
-    OS << formatv(R"(
-#ifdef __GNUC__
-#pragma GCC diagnostic pop
-#endif
-
-{0}static constexpr llvm::StringTable {1} =
-{0}    {1}Storage;
-)",
-                  Indent, Name);
-  }
+                          const Twine &Indent = "") const;
 
   // Emit the string as one single string.
-  void EmitString(raw_ostream &O) const {
-    // Escape the string.
-    SmallString<256> EscapedStr;
-    raw_svector_ostream(EscapedStr).write_escaped(AggregateString);
-
-    O << "    \"";
-    unsigned CharsPrinted = 0;
-    for (unsigned i = 0, e = EscapedStr.size(); i != e; ++i) {
-      if (CharsPrinted > 70) {
-        O << "\"\n    \"";
-        CharsPrinted = 0;
-      }
-      O << EscapedStr[i];
-      ++CharsPrinted;
-
-      // Print escape sequences all together.
-      if (EscapedStr[i] != '\\')
-        continue;
-
-      assert(i + 1 < EscapedStr.size() && "Incomplete escape sequence!");
-      if (isDigit(EscapedStr[i + 1])) {
-        assert(isDigit(EscapedStr[i + 2]) && isDigit(EscapedStr[i + 3]) &&
-               "Expected 3 digit octal escape!");
-        O << EscapedStr[++i];
-        O << EscapedStr[++i];
-        O << EscapedStr[++i];
-        CharsPrinted += 3;
-      } else {
-        O << EscapedStr[++i];
-        ++CharsPrinted;
-      }
-    }
-    O << "\"";
-  }
+  void EmitString(raw_ostream &O) const;
 };
 
 } // end namespace llvm
diff --git a/llvm/lib/TableGen/CMakeLists.txt b/llvm/lib/TableGen/CMakeLists.txt
index 84815c77369979..0f9284c8bb999d 100644
--- a/llvm/lib/TableGen/CMakeLists.txt
+++ b/llvm/lib/TableGen/CMakeLists.txt
@@ -7,6 +7,7 @@ add_llvm_component_library(LLVMTableGen
   Record.cpp
   SetTheory.cpp
   StringMatcher.cpp
+  StringToOffsetTable.cpp
   TableGenBackend.cpp
   TableGenBackendSkeleton.cpp
   TGLexer.cpp
diff --git a/llvm/lib/TableGen/StringToOffsetTable.cpp b/llvm/lib/TableGen/StringToOffsetTable.cpp
new file mode 100644
index 00000000000000..6de889e1a10ee3
--- /dev/null
+++ b/llvm/lib/TableGen/StringToOffsetTable.cpp
@@ -0,0 +1,129 @@
+//===- StringToOffsetTable.h - Emit a big concatenated string ---*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/TableGen/StringToOffsetTable.h"
+#include "llvm/Support/CommandLine.h"
+#include "llvm/Support/FormatVariadic.h"
+#include "llvm/Support/raw_ostream.h"
+
+using namespace llvm;
+
+namespace llvm {
+cl::opt<bool> EmitLongStrLiterals(
+    "long-string-literals",
+    cl::desc("when emitting large string tables, prefer string literals over "
+             "comma-separated char literals. This can be a readability and "
+             "compile-time performance win, but upsets some compilers"),
+    cl::Hidden, cl::init(true));
+} // end namespace llvm
+
+
+unsigned StringToOffsetTable::GetOrAddStringOffset(StringRef Str,
+                                                   bool appendZero) {
+  auto [II, Inserted] = StringOffset.insert({Str, size()});
+  if (Inserted) {
+    // Add the string to the aggregate if this is the first time found.
+    AggregateString.append(Str.begin(), Str.end());
+    if (appendZero)
+      AggregateString += '\0';
+  }
+
+  return II->second;
+}
+
+void StringToOffsetTable::EmitStringTableDef(raw_ostream &OS, const Twine &Name,
+                                             const Twine &Indent) const {
+  OS << formatv(R"(
+#ifdef __GNUC__
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Woverlength-strings"
+#endif
+{0}static constexpr char {1}Storage[] = )",
+                Indent, Name);
+
+  // MSVC silently miscompiles string literals longer than 64k in some
+  // circumstances. In this case, the build system sets EmitLongStrLiterals to
+  // false. When that option is false and the string table is longer than 64k,
+  // emit it as an array of character literals.
+  bool UseChars = !EmitLongStrLiterals && AggregateString.size() > (64 * 1024);
+  OS << (UseChars ? "{\n" : "\n");
+
+  llvm::ListSeparator LineSep(UseChars ? ",\n" : "\n");
+  llvm::SmallVector<StringRef> Strings(split(AggregateString, '\0'));
+  // We should always have an empty string at the start, and because these are
+  // null terminators rather than separators, we'll have one at the end as
+  // well. Skip the end one.
+  assert(Strings.front().empty() && "Expected empty initial string!");
+  assert(Strings.back().empty() &&
+         "Expected empty string at the end due to terminators!");
+  Strings.pop_back();
+  for (StringRef Str : Strings) {
+    OS << LineSep << Indent << "  ";
+    // If we can, just emit this as a string literal to be concatenated.
+    if (!UseChars) {
+      OS << "\"";
+      OS.write_escaped(Str);
+      OS << "\\0\"";
+      continue;
+    }
+
+    llvm::ListSeparator CharSep(", ");
+    for (char C : Str) {
+      OS << CharSep << "'";
+      OS.write_escaped(StringRef(&C, 1));
+      OS << "'";
+    }
+    OS << CharSep << "'\\0'";
+  }
+  OS << LineSep << Indent << (UseChars ? "};" : "  ;");
+
+  OS << formatv(R"(
+#ifdef __GNUC__
+#pragma GCC diagnostic pop
+#endif
+
+{0}static constexpr llvm::StringTable {1} =
+{0}    {1}Storage;
+)",
+                Indent, Name);
+}
+
+void StringToOffsetTable::EmitString(raw_ostream &O) const {
+  // Escape the string.
+  SmallString<256> EscapedStr;
+  raw_svector_ostream(EscapedStr).write_escaped(AggregateString);
+
+  O << "    \"";
+  unsigned CharsPrinted = 0;
+  for (unsigned i = 0, e = EscapedStr.size(); i != e; ++i) {
+    if (CharsPrinted > 70) {
+      O << "\"\n    \"";
+      CharsPrinted = 0;
+    }
+    O << EscapedStr[i];
+    ++CharsPrinted;
+
+    // Print escape sequences all together.
+    if (EscapedStr[i] != '\\')
+      continue;
+
+    assert(i + 1 < EscapedStr.size() && "Incomplete escape sequence!");
+    if (isDigit(EscapedStr[i + 1])) {
+      assert(isDigit(EscapedStr[i + 2]) && isDigit(EscapedStr[i + 3]) &&
+             "Expected 3 digit octal escape!");
+      O << EscapedStr[++i];
+      O << EscapedStr[++i];
+      O << EscapedStr[++i];
+      CharsPrinted += 3;
+    } else {
+      O << EscapedStr[++i];
+      ++CharsPrinted;
+    }
+  }
+  O << "\"";
+}
diff --git a/llvm/utils/TableGen/AsmMatcherEmitter.cpp b/llvm/utils/TableGen/AsmMatcherEmitter.cpp
index 7684387d80fe24..0f2e20eb37649b 100644
--- a/llvm/utils/TableGen/AsmMatcherEmitter.cpp
+++ b/llvm/utils/TableGen/AsmMatcherEmitter.cpp
@@ -110,6 +110,7 @@
 #include "llvm/Support/CommandLine.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/ErrorHandling.h"
+#include "llvm/Support/FormatVariadic.h"
 #include "llvm/TableGen/Error.h"
 #include "llvm/TableGen/Record.h"
 #include "llvm/TableGen/StringMatcher.h"
diff --git a/llvm/utils/TableGen/Basic/TableGen.cpp b/llvm/utils/TableGen/Basic/TableGen.cpp
index 80ac93f2b54fb6..edb7791500699f 100644
--- a/llvm/utils/TableGen/Basic/TableGen.cpp
+++ b/llvm/utils/TableGen/Basic/TableGen.cpp
@@ -26,15 +26,6 @@
 
 using namespace llvm;
 
-namespace llvm {
-cl::opt<bool> EmitLongStrLiterals(
-    "long-string-literals",
-    cl::desc("when emitting large string tables, prefer string literals over "
-             "comma-separated char literals. This can be a readability and "
-             "compile-time performance win, but upsets some compilers"),
-    cl::Hidden, cl::init(true));
-} // end namespace llvm
-
 static cl::OptionCategory PrintEnumsCat("Options for -print-enums");
 static cl::opt<std::string> Class("class",
                                   cl::desc("Print Enum list for this class"),
diff --git a/llvm/utils/TableGen/SDNodeInfoEmitter.cpp b/llvm/utils/TableGen/SDNodeInfoEmitter.cpp
index 63ee0deb871109..64f03dae83e7de 100644
--- a/llvm/utils/TableGen/SDNodeInfoEmitter.cpp
+++ b/llvm/utils/TableGen/SDNodeInfoEmitter.cpp
@@ -9,6 +9,7 @@
 #include "Basic/SequenceToOffsetTable.h"
 #include "Common/CodeGenDAGPatterns.h" // For SDNodeInfo.
 #include "llvm/Support/CommandLine.h"
+#include "llvm/Support/FormatVariadic.h"
 #include "llvm/TableGen/Error.h"
 #include "llvm/TableGen/StringToOffsetTable.h"
 #include "llvm/TableGen/TableGenBackend.h"
diff --git a/llvm/utils/gn/secondary/llvm/lib/TableGen/BUILD.gn b/llvm/utils/gn/secondary/llvm/lib/TableGen/BUILD.gn
index d90df7bc0e57a5..b40fdf154b01a2 100644
--- a/llvm/utils/gn/secondary/llvm/lib/TableGen/BUILD.gn
+++ b/llvm/utils/gn/secondary/llvm/lib/TableGen/BUILD.gn
@@ -10,6 +10,7 @@ static_library("TableGen") {
     "Record.cpp",
     "SetTheory.cpp",
     "StringMatcher.cpp",
+    "StringToOffsetTable.cpp",
     "TGLexer.cpp",
     "TGParser.cpp",
     "TGTimer.cpp",

Copy link
Contributor

@s-barannikov s-barannikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understood the point of moving the option from one cpp file to another.
It is still extern in the other file, SequenceToOffsetTable.h,
and could also be extern in StringToOffsetTable.h(?)

@@ -0,0 +1,129 @@
//===- StringToOffsetTable.cpp - Emit a big concatenated string -*- C++ -*-===//
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Note that the updated coding standard doesn't require this line to contain filename/description/c++.)

@rnk
Copy link
Collaborator Author

rnk commented Jan 29, 2025

Not sure I understood the point of moving the option from one cpp file to another. It is still extern in the other file, SequenceToOffsetTable.h, and could also be extern in StringToOffsetTable.h(?)

The other cpp file is in the llvm-tblgen binary, not the TableGen library used by other subproject tablegen binaries (clang-tblgen and mlir-tblgen), so referencing it this way would cause link errors. It needs to live in some lib/TableGen cpp file, so I made one rather than shoehorning it into an existing cpp file.

@s-barannikov
Copy link
Contributor

I see, thanks

Copy link
Contributor

@s-barannikov s-barannikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@chandlerc chandlerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll also need to update the Clang CMake to invoke clang-tblgen with this flag at least.

I'd very much like to wait until #120534 is merged before landing this because the string tables in LLVM itself, for an unknown reason, are not enough to trigger the miscompile for some of our MSVC users. We actually need the string tables in Clang as well. So until #120534 lands, you won't easily be able to tell if this PR regresses things. This is also why the failure to pass the flag in clang-tblgen likely has no effect (yet).

@llvmbot llvmbot added the cmake Build system in general and CMake in particular label Jan 30, 2025
@rnk
Copy link
Collaborator Author

rnk commented Jan 30, 2025

You'll also need to update the Clang CMake to invoke clang-tblgen with this flag at least.

That's true, I had already seen the logic to limit this option to the LLVM subproject and I forgot to remove that condition now that this flag is part of lib/TableGen and part of all tblgen binaries. That change is present now.

I'd very much like to wait until #120534 is merged before landing this because the string tables in LLVM itself, for an unknown reason, are not enough to trigger the miscompile for some of our MSVC users. We actually need the string tables in Clang as well. So until #120534 lands, you won't easily be able to tell if this PR regresses things. This is also why the failure to pass the flag in clang-tblgen likely has no effect (yet).

This can wait, but keep in mind that if you sequence it this way you are going to get character arrays, which are slower to parse, for those new large string tables in clang. I don't have data, but I do remember this being a bottleneck at some point in the past for LLVM intrinsic parsing, which is why I'm bothering to retain this optimization.

@rnk rnk merged commit 0a27c4e into llvm:main Apr 13, 2025
11 checks passed
@llvm-ci
Copy link
Collaborator

llvm-ci commented Apr 13, 2025

LLVM Buildbot has detected a new failure on builder lld-x86_64-win running on as-worker-93 while building llvm at step 7 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/146/builds/2701

Here is the relevant piece of the build log for the reference
Step 7 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM-Unit :: Support/./SupportTests.exe/82/95' FAILED ********************
Script(shard):
--
GTEST_OUTPUT=json:C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe-LLVM-Unit-23144-82-95.json GTEST_SHUFFLE=0 GTEST_TOTAL_SHARDS=95 GTEST_SHARD_INDEX=82 C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe
--

Script:
--
C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe --gtest_filter=ProgramEnvTest.CreateProcessLongPath
--
C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp(160): error: Expected equality of these values:
  0
  RC
    Which is: -2

C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp(163): error: fs::remove(Twine(LongPath)): did not return errc::success.
error number: 13
error message: permission denied



C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp:160
Expected equality of these values:
  0
  RC
    Which is: -2

C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp:163
fs::remove(Twine(LongPath)): did not return errc::success.
error number: 13
error message: permission denied




********************


@rnk
Copy link
Collaborator Author

rnk commented Apr 14, 2025

I checked the lld-x86_64-win bot logs and it looks like this test is flaky. It passed in subsequent runs and the same test failed in a previous build. I filed #135593 to track the flakiness.

mariusz-sikora-at-amd added a commit to mariusz-sikora-at-amd/llvm-dialects that referenced this pull request Apr 15, 2025
LLVM upstream change: llvm/llvm-project#124856
makes EmitLongStrLiterals cl::opt global
Flakebi pushed a commit to GPUOpen-Drivers/llvm-dialects that referenced this pull request Apr 15, 2025
LLVM upstream change: llvm/llvm-project#124856
makes EmitLongStrLiterals cl::opt global
var-const pushed a commit to ldionne/llvm-project that referenced this pull request Apr 17, 2025
…tforms (llvm#124856)

This mainly transitions the LLVM intrinsic string table from character
emission to string literal emission, which I confirmed happens for me
locally.

I moved the guts of StringToOffsetTable to a cpp file so I could move
the `EmitLongStrLiterals` cl::opt global to a non-vague linkage home in
the `TableGen` library. I had to add missing FormatVariadic.h includes
to account for moving other includes to a cpp file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cmake Build system in general and CMake in particular tablegen
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants