Skip to content

[MLIR] Fix duplicated attribute nodes in MLIR bytecode deserialization #151267

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

hankluo6
Copy link

Fixes #150163

MLIR bytecode does not preserve alias definitions, so each attribute encountered during deserialization is treated as a new one. This can generate duplicate DISubprogram nodes during deserialization.

The patch adds a StringMap cache that records attributes and fetches them when encountered again.

Copy link

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot llvmbot added mlir:core MLIR Core Infrastructure mlir labels Jul 30, 2025
@llvmbot
Copy link
Member

llvmbot commented Jul 30, 2025

@llvm/pr-subscribers-mlir-llvm

@llvm/pr-subscribers-mlir-core

Author: Hank (hankluo6)

Changes

Fixes #150163

MLIR bytecode does not preserve alias definitions, so each attribute encountered during deserialization is treated as a new one. This can generate duplicate DISubprogram nodes during deserialization.

The patch adds a StringMap cache that records attributes and fetches them when encountered again.


Full diff: https://github.com/llvm/llvm-project/pull/151267.diff

4 Files Affected:

  • (modified) mlir/include/mlir/AsmParser/AsmParser.h (+2-1)
  • (modified) mlir/lib/AsmParser/DialectSymbolParser.cpp (+23-2)
  • (modified) mlir/lib/AsmParser/ParserState.h (+3)
  • (modified) mlir/lib/Bytecode/Reader/BytecodeReader.cpp (+5-1)
diff --git a/mlir/include/mlir/AsmParser/AsmParser.h b/mlir/include/mlir/AsmParser/AsmParser.h
index 33daf7ca26f49..f39b3bd853a2a 100644
--- a/mlir/include/mlir/AsmParser/AsmParser.h
+++ b/mlir/include/mlir/AsmParser/AsmParser.h
@@ -53,7 +53,8 @@ parseAsmSourceFile(const llvm::SourceMgr &sourceMgr, Block *block,
 /// null terminated.
 Attribute parseAttribute(llvm::StringRef attrStr, MLIRContext *context,
                          Type type = {}, size_t *numRead = nullptr,
-                         bool isKnownNullTerminated = false);
+                         bool isKnownNullTerminated = false,
+                         llvm::StringMap<Attribute> *attributesCache = nullptr);
 
 /// This parses a single MLIR type to an MLIR context if it was valid. If not,
 /// an error diagnostic is emitted to the context.
diff --git a/mlir/lib/AsmParser/DialectSymbolParser.cpp b/mlir/lib/AsmParser/DialectSymbolParser.cpp
index 8b14e71118c3a..de8e3c1fc1e72 100644
--- a/mlir/lib/AsmParser/DialectSymbolParser.cpp
+++ b/mlir/lib/AsmParser/DialectSymbolParser.cpp
@@ -245,6 +245,14 @@ static Symbol parseExtendedSymbol(Parser &p, AsmParserState *asmState,
       return nullptr;
   }
 
+  if constexpr (std::is_same_v<Symbol, Attribute>) {
+    auto &cache = p.getState().symbols.attributesCache;
+
+    auto cacheIt = cache.find(symbolData);
+    if (cacheIt != cache.end()) {
+      return cacheIt->second;
+    }
+  }
   return createSymbol(dialectName, symbolData, loc);
 }
 
@@ -337,6 +345,7 @@ Type Parser::parseExtendedType() {
 template <typename T, typename ParserFn>
 static T parseSymbol(StringRef inputStr, MLIRContext *context,
                      size_t *numReadOut, bool isKnownNullTerminated,
+                     llvm::StringMap<Attribute> *attributesCache,
                      ParserFn &&parserFn) {
   // Set the buffer name to the string being parsed, so that it appears in error
   // diagnostics.
@@ -348,6 +357,9 @@ static T parseSymbol(StringRef inputStr, MLIRContext *context,
   SourceMgr sourceMgr;
   sourceMgr.AddNewSourceBuffer(std::move(memBuffer), SMLoc());
   SymbolState aliasState;
+  if (attributesCache)
+    aliasState.attributesCache = *attributesCache;
+
   ParserConfig config(context);
   ParserState state(sourceMgr, config, aliasState, /*asmState=*/nullptr,
                     /*codeCompleteContext=*/nullptr);
@@ -358,6 +370,13 @@ static T parseSymbol(StringRef inputStr, MLIRContext *context,
   if (!symbol)
     return T();
 
+  if constexpr (std::is_same_v<T, Attribute>) {
+    // Cache key is the symbol data without the dialect prefix.
+    StringRef cacheKey = inputStr.split('.').second;
+    if (attributesCache && !cacheKey.empty()) {
+      (*attributesCache)[cacheKey] = symbol;
+    }
+  }
   // Provide the number of bytes that were read.
   Token endTok = parser.getToken();
   size_t numRead =
@@ -374,13 +393,15 @@ static T parseSymbol(StringRef inputStr, MLIRContext *context,
 
 Attribute mlir::parseAttribute(StringRef attrStr, MLIRContext *context,
                                Type type, size_t *numRead,
-                               bool isKnownNullTerminated) {
+                               bool isKnownNullTerminated,
+                               llvm::StringMap<Attribute> *attributesCache) {
   return parseSymbol<Attribute>(
-      attrStr, context, numRead, isKnownNullTerminated,
+      attrStr, context, numRead, isKnownNullTerminated, attributesCache,
       [type](Parser &parser) { return parser.parseAttribute(type); });
 }
 Type mlir::parseType(StringRef typeStr, MLIRContext *context, size_t *numRead,
                      bool isKnownNullTerminated) {
   return parseSymbol<Type>(typeStr, context, numRead, isKnownNullTerminated,
+                           /*attributesCache=*/nullptr,
                            [](Parser &parser) { return parser.parseType(); });
 }
diff --git a/mlir/lib/AsmParser/ParserState.h b/mlir/lib/AsmParser/ParserState.h
index 159058a18fa4e..aa53032107cbf 100644
--- a/mlir/lib/AsmParser/ParserState.h
+++ b/mlir/lib/AsmParser/ParserState.h
@@ -40,6 +40,9 @@ struct SymbolState {
 
   /// A map from unique integer identifier to DistinctAttr.
   DenseMap<uint64_t, DistinctAttr> distinctAttributes;
+
+  /// A map from unique string identifier to Attribute.
+  llvm::StringMap<Attribute> attributesCache;
 };
 
 //===----------------------------------------------------------------------===//
diff --git a/mlir/lib/Bytecode/Reader/BytecodeReader.cpp b/mlir/lib/Bytecode/Reader/BytecodeReader.cpp
index 44458d010c6c8..0f97443433774 100644
--- a/mlir/lib/Bytecode/Reader/BytecodeReader.cpp
+++ b/mlir/lib/Bytecode/Reader/BytecodeReader.cpp
@@ -895,6 +895,10 @@ class AttrTypeReader {
   SmallVector<AttrEntry> attributes;
   SmallVector<TypeEntry> types;
 
+  /// The map of cached attributes, used to avoid re-parsing the same
+  /// attribute multiple times.
+  llvm::StringMap<Attribute> attributesCache;
+
   /// A location used for error emission.
   Location fileLoc;
 
@@ -1235,7 +1239,7 @@ LogicalResult AttrTypeReader::parseAsmEntry(T &result, EncodingReader &reader,
         ::parseType(asmStr, context, &numRead, /*isKnownNullTerminated=*/true);
   else
     result = ::parseAttribute(asmStr, context, Type(), &numRead,
-                              /*isKnownNullTerminated=*/true);
+                              /*isKnownNullTerminated=*/true, &attributesCache);
   if (!result)
     return failure();
 

@llvmbot
Copy link
Member

llvmbot commented Jul 30, 2025

@llvm/pr-subscribers-mlir

Author: Hank (hankluo6)

Changes

Fixes #150163

MLIR bytecode does not preserve alias definitions, so each attribute encountered during deserialization is treated as a new one. This can generate duplicate DISubprogram nodes during deserialization.

The patch adds a StringMap cache that records attributes and fetches them when encountered again.


Full diff: https://github.com/llvm/llvm-project/pull/151267.diff

4 Files Affected:

  • (modified) mlir/include/mlir/AsmParser/AsmParser.h (+2-1)
  • (modified) mlir/lib/AsmParser/DialectSymbolParser.cpp (+23-2)
  • (modified) mlir/lib/AsmParser/ParserState.h (+3)
  • (modified) mlir/lib/Bytecode/Reader/BytecodeReader.cpp (+5-1)
diff --git a/mlir/include/mlir/AsmParser/AsmParser.h b/mlir/include/mlir/AsmParser/AsmParser.h
index 33daf7ca26f49..f39b3bd853a2a 100644
--- a/mlir/include/mlir/AsmParser/AsmParser.h
+++ b/mlir/include/mlir/AsmParser/AsmParser.h
@@ -53,7 +53,8 @@ parseAsmSourceFile(const llvm::SourceMgr &sourceMgr, Block *block,
 /// null terminated.
 Attribute parseAttribute(llvm::StringRef attrStr, MLIRContext *context,
                          Type type = {}, size_t *numRead = nullptr,
-                         bool isKnownNullTerminated = false);
+                         bool isKnownNullTerminated = false,
+                         llvm::StringMap<Attribute> *attributesCache = nullptr);
 
 /// This parses a single MLIR type to an MLIR context if it was valid. If not,
 /// an error diagnostic is emitted to the context.
diff --git a/mlir/lib/AsmParser/DialectSymbolParser.cpp b/mlir/lib/AsmParser/DialectSymbolParser.cpp
index 8b14e71118c3a..de8e3c1fc1e72 100644
--- a/mlir/lib/AsmParser/DialectSymbolParser.cpp
+++ b/mlir/lib/AsmParser/DialectSymbolParser.cpp
@@ -245,6 +245,14 @@ static Symbol parseExtendedSymbol(Parser &p, AsmParserState *asmState,
       return nullptr;
   }
 
+  if constexpr (std::is_same_v<Symbol, Attribute>) {
+    auto &cache = p.getState().symbols.attributesCache;
+
+    auto cacheIt = cache.find(symbolData);
+    if (cacheIt != cache.end()) {
+      return cacheIt->second;
+    }
+  }
   return createSymbol(dialectName, symbolData, loc);
 }
 
@@ -337,6 +345,7 @@ Type Parser::parseExtendedType() {
 template <typename T, typename ParserFn>
 static T parseSymbol(StringRef inputStr, MLIRContext *context,
                      size_t *numReadOut, bool isKnownNullTerminated,
+                     llvm::StringMap<Attribute> *attributesCache,
                      ParserFn &&parserFn) {
   // Set the buffer name to the string being parsed, so that it appears in error
   // diagnostics.
@@ -348,6 +357,9 @@ static T parseSymbol(StringRef inputStr, MLIRContext *context,
   SourceMgr sourceMgr;
   sourceMgr.AddNewSourceBuffer(std::move(memBuffer), SMLoc());
   SymbolState aliasState;
+  if (attributesCache)
+    aliasState.attributesCache = *attributesCache;
+
   ParserConfig config(context);
   ParserState state(sourceMgr, config, aliasState, /*asmState=*/nullptr,
                     /*codeCompleteContext=*/nullptr);
@@ -358,6 +370,13 @@ static T parseSymbol(StringRef inputStr, MLIRContext *context,
   if (!symbol)
     return T();
 
+  if constexpr (std::is_same_v<T, Attribute>) {
+    // Cache key is the symbol data without the dialect prefix.
+    StringRef cacheKey = inputStr.split('.').second;
+    if (attributesCache && !cacheKey.empty()) {
+      (*attributesCache)[cacheKey] = symbol;
+    }
+  }
   // Provide the number of bytes that were read.
   Token endTok = parser.getToken();
   size_t numRead =
@@ -374,13 +393,15 @@ static T parseSymbol(StringRef inputStr, MLIRContext *context,
 
 Attribute mlir::parseAttribute(StringRef attrStr, MLIRContext *context,
                                Type type, size_t *numRead,
-                               bool isKnownNullTerminated) {
+                               bool isKnownNullTerminated,
+                               llvm::StringMap<Attribute> *attributesCache) {
   return parseSymbol<Attribute>(
-      attrStr, context, numRead, isKnownNullTerminated,
+      attrStr, context, numRead, isKnownNullTerminated, attributesCache,
       [type](Parser &parser) { return parser.parseAttribute(type); });
 }
 Type mlir::parseType(StringRef typeStr, MLIRContext *context, size_t *numRead,
                      bool isKnownNullTerminated) {
   return parseSymbol<Type>(typeStr, context, numRead, isKnownNullTerminated,
+                           /*attributesCache=*/nullptr,
                            [](Parser &parser) { return parser.parseType(); });
 }
diff --git a/mlir/lib/AsmParser/ParserState.h b/mlir/lib/AsmParser/ParserState.h
index 159058a18fa4e..aa53032107cbf 100644
--- a/mlir/lib/AsmParser/ParserState.h
+++ b/mlir/lib/AsmParser/ParserState.h
@@ -40,6 +40,9 @@ struct SymbolState {
 
   /// A map from unique integer identifier to DistinctAttr.
   DenseMap<uint64_t, DistinctAttr> distinctAttributes;
+
+  /// A map from unique string identifier to Attribute.
+  llvm::StringMap<Attribute> attributesCache;
 };
 
 //===----------------------------------------------------------------------===//
diff --git a/mlir/lib/Bytecode/Reader/BytecodeReader.cpp b/mlir/lib/Bytecode/Reader/BytecodeReader.cpp
index 44458d010c6c8..0f97443433774 100644
--- a/mlir/lib/Bytecode/Reader/BytecodeReader.cpp
+++ b/mlir/lib/Bytecode/Reader/BytecodeReader.cpp
@@ -895,6 +895,10 @@ class AttrTypeReader {
   SmallVector<AttrEntry> attributes;
   SmallVector<TypeEntry> types;
 
+  /// The map of cached attributes, used to avoid re-parsing the same
+  /// attribute multiple times.
+  llvm::StringMap<Attribute> attributesCache;
+
   /// A location used for error emission.
   Location fileLoc;
 
@@ -1235,7 +1239,7 @@ LogicalResult AttrTypeReader::parseAsmEntry(T &result, EncodingReader &reader,
         ::parseType(asmStr, context, &numRead, /*isKnownNullTerminated=*/true);
   else
     result = ::parseAttribute(asmStr, context, Type(), &numRead,
-                              /*isKnownNullTerminated=*/true);
+                              /*isKnownNullTerminated=*/true, &attributesCache);
   if (!result)
     return failure();
 

@hankluo6 hankluo6 changed the title Fix duplicated attribute nodes in MLIR bytecode deserialization [MLIR] Fix duplicated attribute nodes in MLIR bytecode deserialization Jul 30, 2025
@joker-eph
Copy link
Collaborator

Thanks for the fit!
Can you add a test please?

@hankluo6
Copy link
Author

Hi @joker-eph, thanks for reviewing! I've added a test.

#di_subprogram1 = #llvm.di_subprogram<recId = distinct[0]<>, id = distinct[2]<>, compileUnit = #di_compile_unit, scope = #di_file1, name = "main", file = #di_file1, line = 1, scopeLine = 1, subprogramFlags = "Definition|Optimized", type = #di_subroutine_type, retainedNodes = #di_local_variable>
#di_local_variable1 = #llvm.di_local_variable<scope = #di_subprogram1, name = "a", file = #di_file1, line = 2, type = #di_basic_type>

module attributes {dlti.dl_spec = #dlti.dl_spec<i64 = dense<64> : vector<2xi64>, !llvm.ptr = dense<64> : vector<4xi64>>, llvm.ident = "MLIR", llvm.target_triple = "x86_64-unknown-linux-gnu"} {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check this as a round-tripping to MLIR targeted kind of test (with minimal attribute to show the discrepancy) instead of involving a translation to LLVM IR and relying on the specific of DI.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a minimal test I believe:

// RUN: mlir-opt -emit-bytecode %s | mlir-opt --mlir-print-debuginfo | FileCheck %s

// CHECK: llvm.di_subprogram
// CHECK-NOT: llvm.di_subprogram
 #di_file = #llvm.di_file<"foo.c" in "/mlir/">
 #di_subprogram = #llvm.di_subprogram<recId = distinct[0]<>, isRecSelf = true>
 #di_basic_type = #llvm.di_basic_type<tag = DW_TAG_base_type, name = "int", sizeInBits = 32, encoding = DW_ATE_signed>
 #di_local_variable = #llvm.di_local_variable<scope = #di_subprogram, name = "a", file = #di_file, line = 2, type = #di_basic_type>

 module attributes {test.alias = #di_local_variable} {
 }loc(fused<#di_subprogram>[])

However I'm not sure I understand exactly what triggers the issue, is this specific to the implementation of the LLVM attributes? Can we reproduce this with one of the test dialect attributes and simplify this further?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check this as a round-tripping to MLIR targeted kind of test (with minimal attribute to show the discrepancy) instead of involving a translation to LLVM IR and relying on the specific of DI.

The issue is only reproducible during mlir-translate from MLIR bytecode to LLVM IR, because that’s the only place where the LLVM IR verifier checks that two DISubprogram attributes refer to the same object:

// The scopes for variables and !dbg attachments must agree.
DISubprogram *VarSP = getSubprogram(Var->getRawScope());
DISubprogram *LocSP = getSubprogram(Loc->getRawScope());
if (!VarSP || !LocSP)
return; // Broken scope chains are checked elsewhere.
CheckDI(VarSP == LocSP,
"mismatched subprogram between #dbg record variable and DILocation",
&DVR, BB, F, Var, Var->getScope()->getSubprogram(), Loc,
Loc->getScope()->getSubprogram(), BB, F);

mlir-opt will create multiple identical attribute objects with the same content, and since it doesn't check whether they're the exact same object, no error is triggered.

However I'm not sure I understand exactly what triggers the issue, is this specific to the implementation of the LLVM attributes? Can we reproduce this with one of the test dialect attributes and simplify this further?

This is a potential issue for attributes in general: when parsing MLIR bytecode, the parser creates separate attribute instances even if they are logically equal. That isn't a problem usually if we care only about the content, but when we expect the objects should be the same as in the above case, it can lead to issues. We can't reproduce it with other dialects using mlir-translate or mlir-opt since there is no such checking.

Since the problem only happens in mlir-translate when translating to LLVM IR, I’m not sure what the test should be. Should I put it under mlir/test/mlir-translate instead? Thanks!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is only reproducible during mlir-translate from MLIR bytecode to LLVM IR

I'm slightly confused by your claim here, because I provided above a test that fails before your patch, and passes afterward, using only mlir-opt

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You’re right—I misunderstood earlier. I thought we needed to trigger an assertion in the test.

We can test this with an mlir-opt round trip. I'll update the minimal test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it seems from the minimal test that an alias that is referenced from two places is necessary. I am not aware of attributes outside of LLVM dialect that has a distinct attribute and that prints as alias. There would be others in LLVM dialect such as the alias_scope attribute that could be used.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue only occurs when the distinct[…] is used inside a nested attribute (e.g., a di_subprogram that is itself nested in a di_local_variable). AttrTypeReader::attributes caches parsed attributes, but that cache is available only in the top-level attribute. Nested attributes are parsed directly from another attribute, which has no access to that cache, so it recreates the distinct record and a second node is produced.

To reproduce, I think we need an attribute that contains distinct attribute and can itself be another attribute's parameter.
I can further reduce the test to only include the test dialect. Something like this:

#attr_ugly = #test<attr_ugly begin distinct[0]<> end>
#attr_ugly1 = #test<attr_ugly begin #attr_ugly end>

module attributes {test.alias = #attr_ugly} {
} loc(fused<#attr_ugly1>[])

Would this be better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok that makes sense to me. I think the test attribute you suggested would be much nicer than using the self-recursive subprogram attribute (since the self-recursive stuff has its own complexities; producing something that is actually correct is not completely trivial there).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be nice! Thanks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated.

Copy link

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions cpp,h -- mlir/include/mlir/AsmParser/AsmParser.h mlir/lib/AsmParser/DialectSymbolParser.cpp mlir/lib/AsmParser/ParserState.h mlir/lib/Bytecode/Reader/BytecodeReader.cpp
View the diff from clang-format here.
diff --git a/mlir/lib/AsmParser/DialectSymbolParser.cpp b/mlir/lib/AsmParser/DialectSymbolParser.cpp
index bb31b018b..416d8eb5f 100644
--- a/mlir/lib/AsmParser/DialectSymbolParser.cpp
+++ b/mlir/lib/AsmParser/DialectSymbolParser.cpp
@@ -251,7 +251,7 @@ static Symbol parseExtendedSymbol(Parser &p, AsmParserState *asmState,
     // Skip cached attribute if it has type.
     if (cacheIt != cache.end() && !p.getToken().is(Token::colon))
       return cacheIt->second;
-    
+
     return cache[symbolData] = createSymbol(dialectName, symbolData, loc);
   }
   return createSymbol(dialectName, symbolData, loc);

@@ -0,0 +1,9 @@
// RUN: mlir-opt -emit-bytecode %s | mlir-opt --mlir-print-debuginfo | FileCheck %s

// CHECK: distinct[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// CHECK: distinct[0]
// Verify that the distinct attribute which is used transitively
// through two aliases does not end up duplicated when round-tripped
// through bytecode.
// CHECK: distinct[0]

Copy link
Collaborator

@joker-eph joker-eph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG, thanks!

@hankluo6
Copy link
Author

Thank you, @joker-eph. Could you please help merge this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
mlir:core MLIR Core Infrastructure mlir:llvm mlir
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[MLIR][LLVM] Recursive debug type not imported properly if input is bytecode
4 participants