Skip to content

[clangd] [C++20] [Modules] Add scanning cache #125988

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Feb 26, 2025

Conversation

ChuanqiXu9
Copy link
Member

Previously, everytime we want to get a source file declaring a specific module, we need to scan the whole projects again and again. The performance is super bad. This patch tries to improve this by introducing a simple cache.

@ChuanqiXu9 ChuanqiXu9 added clangd clang:modules C++20 modules and Clang Header Modules labels Feb 6, 2025
@ChuanqiXu9 ChuanqiXu9 requested a review from kadircet February 6, 2025 03:24
@llvmbot
Copy link
Member

llvmbot commented Feb 6, 2025

@llvm/pr-subscribers-clang-tools-extra
@llvm/pr-subscribers-clang-modules

@llvm/pr-subscribers-clangd

Author: Chuanqi Xu (ChuanqiXu9)

Changes

Previously, everytime we want to get a source file declaring a specific module, we need to scan the whole projects again and again. The performance is super bad. This patch tries to improve this by introducing a simple cache.


Full diff: https://github.com/llvm/llvm-project/pull/125988.diff

6 Files Affected:

  • (modified) clang-tools-extra/clangd/CMakeLists.txt (+1)
  • (modified) clang-tools-extra/clangd/ModulesBuilder.cpp (+10-5)
  • (modified) clang-tools-extra/clangd/ProjectModules.h (+2-1)
  • (added) clang-tools-extra/clangd/ProjectModulesCache.cpp (+56)
  • (added) clang-tools-extra/clangd/ProjectModulesCache.h (+46)
  • (modified) clang-tools-extra/clangd/ScanningProjectModules.cpp (+45-10)
diff --git a/clang-tools-extra/clangd/CMakeLists.txt b/clang-tools-extra/clangd/CMakeLists.txt
index 6f10afe4a5625f5..a0ff8483991bb09 100644
--- a/clang-tools-extra/clangd/CMakeLists.txt
+++ b/clang-tools-extra/clangd/CMakeLists.txt
@@ -99,6 +99,7 @@ add_clang_library(clangDaemon STATIC
   ModulesBuilder.cpp
   PathMapping.cpp
   Protocol.cpp
+  ProjectModulesCache.cpp
   Quality.cpp
   ParsedAST.cpp
   Preamble.cpp
diff --git a/clang-tools-extra/clangd/ModulesBuilder.cpp b/clang-tools-extra/clangd/ModulesBuilder.cpp
index bee31fe51555e02..9e8e66100aeca94 100644
--- a/clang-tools-extra/clangd/ModulesBuilder.cpp
+++ b/clang-tools-extra/clangd/ModulesBuilder.cpp
@@ -361,7 +361,8 @@ void ModuleFileCache::remove(StringRef ModuleName) {
 /// ModuleName in topological order. The \param ModuleName is guaranteed to
 /// be the last element in \param ModuleNames.
 llvm::SmallVector<StringRef> getAllRequiredModules(ProjectModules &MDB,
-                                                   StringRef ModuleName) {
+                                                   StringRef ModuleName,
+                                                   ProjectModulesCache &Cache) {
   llvm::SmallVector<llvm::StringRef> ModuleNames;
   llvm::StringSet<> ModuleNamesSet;
 
@@ -369,7 +370,7 @@ llvm::SmallVector<StringRef> getAllRequiredModules(ProjectModules &MDB,
     ModuleNamesSet.insert(ModuleName);
 
     for (StringRef RequiredModuleName :
-         MDB.getRequiredModules(MDB.getSourceForModuleName(ModuleName)))
+         MDB.getRequiredModules(MDB.getSourceForModuleName(Cache, ModuleName)))
       if (ModuleNamesSet.insert(RequiredModuleName).second)
         Visitor(RequiredModuleName, Visitor);
 
@@ -384,7 +385,9 @@ llvm::SmallVector<StringRef> getAllRequiredModules(ProjectModules &MDB,
 
 class ModulesBuilder::ModulesBuilderImpl {
 public:
-  ModulesBuilderImpl(const GlobalCompilationDatabase &CDB) : Cache(CDB) {}
+  ModulesBuilderImpl(const GlobalCompilationDatabase &CDB) : Cache(CDB) {
+    MDBCache = createProjectModulesCache();
+  }
 
   const GlobalCompilationDatabase &getCDB() const { return Cache.getCDB(); }
 
@@ -395,6 +398,7 @@ class ModulesBuilder::ModulesBuilderImpl {
 
 private:
   ModuleFileCache Cache;
+  std::unique_ptr<ProjectModulesCache> MDBCache;
 };
 
 llvm::Error ModulesBuilder::ModulesBuilderImpl::getOrBuildModuleFile(
@@ -403,7 +407,8 @@ llvm::Error ModulesBuilder::ModulesBuilderImpl::getOrBuildModuleFile(
   if (BuiltModuleFiles.isModuleUnitBuilt(ModuleName))
     return llvm::Error::success();
 
-  PathRef ModuleUnitFileName = MDB.getSourceForModuleName(ModuleName);
+  PathRef ModuleUnitFileName =
+      MDB.getSourceForModuleName(*MDBCache, ModuleName);
   /// It is possible that we're meeting third party modules (modules whose
   /// source are not in the project. e.g, the std module may be a third-party
   /// module for most project) or something wrong with the implementation of
@@ -416,7 +421,7 @@ llvm::Error ModulesBuilder::ModulesBuilderImpl::getOrBuildModuleFile(
         llvm::formatv("Don't get the module unit for module {0}", ModuleName));
 
   // Get Required modules in topological order.
-  auto ReqModuleNames = getAllRequiredModules(MDB, ModuleName);
+  auto ReqModuleNames = getAllRequiredModules(MDB, ModuleName, *MDBCache);
   for (llvm::StringRef ReqModuleName : ReqModuleNames) {
     if (BuiltModuleFiles.isModuleUnitBuilt(ModuleName))
       continue;
diff --git a/clang-tools-extra/clangd/ProjectModules.h b/clang-tools-extra/clangd/ProjectModules.h
index 48d52ac9deb893f..cdf54067cf0b9a7 100644
--- a/clang-tools-extra/clangd/ProjectModules.h
+++ b/clang-tools-extra/clangd/ProjectModules.h
@@ -9,6 +9,7 @@
 #ifndef LLVM_CLANG_TOOLS_EXTRA_CLANGD_PROJECTMODULES_H
 #define LLVM_CLANG_TOOLS_EXTRA_CLANGD_PROJECTMODULES_H
 
+#include "ProjectModulesCache.h"
 #include "support/Function.h"
 #include "support/Path.h"
 #include "support/ThreadsafeFS.h"
@@ -43,7 +44,7 @@ class ProjectModules {
 
   virtual std::vector<std::string> getRequiredModules(PathRef File) = 0;
   virtual PathRef
-  getSourceForModuleName(llvm::StringRef ModuleName,
+  getSourceForModuleName(ProjectModulesCache &Cache, llvm::StringRef ModuleName,
                          PathRef RequiredSrcFile = PathRef()) = 0;
 
   virtual void setCommandMangler(CommandMangler Mangler) {}
diff --git a/clang-tools-extra/clangd/ProjectModulesCache.cpp b/clang-tools-extra/clangd/ProjectModulesCache.cpp
new file mode 100644
index 000000000000000..9eb0723a7a13ede
--- /dev/null
+++ b/clang-tools-extra/clangd/ProjectModulesCache.cpp
@@ -0,0 +1,56 @@
+//===------------------ ProjectModulesCache.cpp ------------------*- C++-*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "ProjectModulesCache.h"
+#include "llvm/ADT/StringMap.h"
+#include <mutex>
+
+namespace clang::clangd {
+namespace {
+class SharedProjectModulesCache : public ProjectModulesCache {
+public:
+  std::optional<std::string>
+  getSourceForModuleName(llvm::StringRef ModuleName,
+                         PathRef RequiredSrcFile = PathRef()) override {
+    std::lock_guard<std::mutex> Lock(Mutex);
+
+    auto Iter = ModuleNameToSource.find(ModuleName);
+    if (Iter == ModuleNameToSource.end())
+      return std::nullopt;
+
+    return Iter->second;
+  }
+
+  void clearEntry(llvm::StringRef ModuleName,
+                  PathRef RequiredSrcFile = PathRef()) override {
+    std::lock_guard<std::mutex> Lock(Mutex);
+
+    auto Iter = ModuleNameToSource.find(ModuleName);
+    if (Iter == ModuleNameToSource.end())
+      return;
+
+    ModuleNameToSource.erase(Iter);
+  }
+
+  void setEntry(PathRef FilePath, llvm::StringRef ModuleName) override {
+    std::lock_guard<std::mutex> Lock(Mutex);
+
+    ModuleNameToSource[ModuleName] = FilePath;
+  }
+
+private:
+  std::mutex Mutex;
+  llvm::StringMap<std::string> ModuleNameToSource;
+};
+} // namespace
+
+std::unique_ptr<ProjectModulesCache> createProjectModulesCache() {
+  return std::make_unique<SharedProjectModulesCache>();
+}
+
+} // namespace clang::clangd
diff --git a/clang-tools-extra/clangd/ProjectModulesCache.h b/clang-tools-extra/clangd/ProjectModulesCache.h
new file mode 100644
index 000000000000000..8aee77119a80cbe
--- /dev/null
+++ b/clang-tools-extra/clangd/ProjectModulesCache.h
@@ -0,0 +1,46 @@
+//===------------------ ProjectModulesCache.h --------------------*- C++-*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_CLANG_TOOLS_EXTRA_CLANGD_PROJECTMODULESCACHE_H
+#define LLVM_CLANG_TOOLS_EXTRA_CLANGD_PROJECTMODULESCACHE_H
+
+#include "support/Path.h"
+#include <memory>
+
+namespace clang {
+namespace clangd {
+
+/// A cache for the module name to file name map in the project.
+/// The rationale is:
+///   (1) It is fast to get the module name to a file.
+///   (2) It may be slow to get a file with a specified module name.
+///   (3) The module name of files may not change drastically and frequently.
+///
+/// The cache itself is not responsible for the validness of cached result.
+/// Users of the cache should check it after getting the result and updating
+/// the cache if the result is invalid.
+class ProjectModulesCache {
+public:
+  virtual ~ProjectModulesCache() = default;
+
+  virtual std::optional<std::string>
+  getSourceForModuleName(llvm::StringRef ModuleName,
+                         PathRef RequiredSrcFile = PathRef()) = 0;
+
+  virtual void clearEntry(llvm::StringRef ModuleName,
+                          PathRef RequiredSrcFile = PathRef()) = 0;
+
+  virtual void setEntry(PathRef FilePath, llvm::StringRef ModuleName) = 0;
+};
+
+std::unique_ptr<ProjectModulesCache> createProjectModulesCache();
+
+} // namespace clangd
+} // namespace clang
+
+#endif
diff --git a/clang-tools-extra/clangd/ScanningProjectModules.cpp b/clang-tools-extra/clangd/ScanningProjectModules.cpp
index e4dc11c1c28958b..3b0cd96e01c7fff 100644
--- a/clang-tools-extra/clangd/ScanningProjectModules.cpp
+++ b/clang-tools-extra/clangd/ScanningProjectModules.cpp
@@ -49,7 +49,8 @@ class ModuleDependencyScanner {
 
   /// Scanning the single file specified by \param FilePath.
   std::optional<ModuleDependencyInfo>
-  scan(PathRef FilePath, const ProjectModules::CommandMangler &Mangler);
+  scan(PathRef FilePath, const ProjectModules::CommandMangler &Mangler,
+       ProjectModulesCache *Cache);
 
   /// Scanning every source file in the current project to get the
   /// <module-name> to <module-unit-source> map.
@@ -58,7 +59,8 @@ class ModuleDependencyScanner {
   /// a global module dependency scanner to monitor every file. Or we
   /// can simply require the build systems (or even the end users)
   /// to provide the map.
-  void globalScan(const ProjectModules::CommandMangler &Mangler);
+  void globalScan(const ProjectModules::CommandMangler &Mangler,
+                  ProjectModulesCache &Cache);
 
   /// Get the source file from the module name. Note that the language
   /// guarantees all the module names are unique in a valid program.
@@ -68,6 +70,12 @@ class ModuleDependencyScanner {
   /// declaring the same module.
   PathRef getSourceForModuleName(llvm::StringRef ModuleName) const;
 
+  /// Validate if source file \c Source declare a module with \c ModuleName.
+  /// If yes, return the source file path. Otherwise, return std::nullopt.
+  std::optional<PathRef>
+  validateSourceForModuleName(PathRef Source, llvm::StringRef ModuleName,
+                              const ProjectModules::CommandMangler &Mangler);
+
   /// Return the direct required modules. Indirect required modules are not
   /// included.
   std::vector<std::string>
@@ -83,15 +91,14 @@ class ModuleDependencyScanner {
 
   clang::tooling::dependencies::DependencyScanningService Service;
 
-  // TODO: Add a scanning cache.
-
   // Map module name to source file path.
   llvm::StringMap<std::string> ModuleNameToSource;
 };
 
 std::optional<ModuleDependencyScanner::ModuleDependencyInfo>
 ModuleDependencyScanner::scan(PathRef FilePath,
-                              const ProjectModules::CommandMangler &Mangler) {
+                              const ProjectModules::CommandMangler &Mangler,
+                              ProjectModulesCache *Cache) {
   auto Candidates = CDB->getCompileCommands(FilePath);
   if (Candidates.empty())
     return std::nullopt;
@@ -124,6 +131,9 @@ ModuleDependencyScanner::scan(PathRef FilePath,
   if (ScanningResult->Provides) {
     ModuleNameToSource[ScanningResult->Provides->ModuleName] = FilePath;
     Result.ModuleName = ScanningResult->Provides->ModuleName;
+
+    if (Cache)
+      Cache->setEntry(FilePath, ScanningResult->Provides->ModuleName);
   }
 
   for (auto &Required : ScanningResult->Requires)
@@ -133,9 +143,9 @@ ModuleDependencyScanner::scan(PathRef FilePath,
 }
 
 void ModuleDependencyScanner::globalScan(
-    const ProjectModules::CommandMangler &Mangler) {
+    const ProjectModules::CommandMangler &Mangler, ProjectModulesCache &Cache) {
   for (auto &File : CDB->getAllFiles())
-    scan(File, Mangler);
+    scan(File, Mangler, &Cache);
 
   GlobalScanned = true;
 }
@@ -155,12 +165,28 @@ PathRef ModuleDependencyScanner::getSourceForModuleName(
 
 std::vector<std::string> ModuleDependencyScanner::getRequiredModules(
     PathRef File, const ProjectModules::CommandMangler &Mangler) {
-  auto ScanningResult = scan(File, Mangler);
+  auto ScanningResult = scan(File, Mangler, /*Cache=*/nullptr);
   if (!ScanningResult)
     return {};
 
   return ScanningResult->RequiredModules;
 }
+
+std::optional<PathRef> ModuleDependencyScanner::validateSourceForModuleName(
+    PathRef Source, llvm::StringRef ModuleName,
+    const ProjectModules::CommandMangler &Mangler) {
+  auto ScanningResult = scan(Source, Mangler, /*Cache=*/nullptr);
+  if (!ScanningResult)
+    return std::nullopt;
+
+  // If the name matches, return the source file path from ModuleNameToSource
+  // cache instead of the input. Since the lifetime of the input may not be long
+  // enough.
+  if (ScanningResult->ModuleName == ModuleName)
+    return ModuleNameToSource[ModuleName];
+
+  return std::nullopt;
+}
 } // namespace
 
 /// TODO: The existing `ScanningAllProjectModules` is not efficient. See the
@@ -190,9 +216,18 @@ class ScanningAllProjectModules : public ProjectModules {
   /// RequiredSourceFile is not used intentionally. See the comments of
   /// ModuleDependencyScanner for detail.
   PathRef
-  getSourceForModuleName(llvm::StringRef ModuleName,
+  getSourceForModuleName(ProjectModulesCache &Cache, llvm::StringRef ModuleName,
                          PathRef RequiredSourceFile = PathRef()) override {
-    Scanner.globalScan(Mangler);
+    if (auto Source =
+            Cache.getSourceForModuleName(ModuleName, RequiredSourceFile)) {
+      if (std::optional<PathRef> Path =
+              Scanner.validateSourceForModuleName(*Source, ModuleName, Mangler))
+        return *Path;
+
+      Cache.clearEntry(ModuleName, RequiredSourceFile);
+    }
+
+    Scanner.globalScan(Mangler, Cache);
     return Scanner.getSourceForModuleName(ModuleName);
   }
 

Copy link
Member

@kadircet kadircet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of introducing a new concept of ProjectModulesCache and requiring each ProjectModules implementation to know about it and work with it; what about introducing a:

class CachingProjectModules : public ProjectModules {
public:
  CachingProjectModules(GlobalCompilationDatabase &CDB);
  // Implements rest of the operations with a cache overlay.
  ...;
  // Whenever cache is invalid/empty, queries `CDB` for underlying `ProjectModules`
  // and updates/builds a cache using it.
};

into ModulesBuilder.cpp and have an instance of this in ModulesBuilderImpl? That way we can centralize how the caching will work and won't need to maintain multiple implementations that play with it.

As for cache validation, i think it's fine to invalidate based on file contents and whenever there's a mismatch we can query CDB again.

WDYT?

@ChuanqiXu9
Copy link
Member Author

instead of introducing a new concept of ProjectModulesCache and requiring each ProjectModules implementation to know about it and work with it; what about introducing a:

class CachingProjectModules : public ProjectModules {
public:
  CachingProjectModules(GlobalCompilationDatabase &CDB);
  // Implements rest of the operations with a cache overlay.
  ...;
  // Whenever cache is invalid/empty, queries `CDB` for underlying `ProjectModules`
  // and updates/builds a cache using it.
};

into ModulesBuilder.cpp and have an instance of this in ModulesBuilderImpl? That way we can centralize how the caching will work and won't need to maintain multiple implementations that play with it.

As for cache validation, i think it's fine to invalidate based on file contents and whenever there's a mismatch we can query CDB again.

WDYT?

It sounds good but it has a small problem that:

  • Now the ScanningProjectModules are owned by different calls to ModulesBuilder::buildPrerequisiteModulesFor. So we don't need to care about thread safety in ScanningProjectModules. And if we make it the underlying ProjectModules, we need to care about the thead safety and then we would have a centralized scanner. It is not bad but it might be a bigger change.

@kadircet
Copy link
Member

  • Now the ScanningProjectModules are owned by different calls to ModulesBuilder::buildPrerequisiteModulesFor. So we don't need to care about thread safety in ScanningProjectModules. And if we make it the underlying ProjectModules, we need to care about the thead safety and then we would have a centralized scanner. It is not bad but it might be a bigger change.

Not sure I understand the argument here.
Please note that I am not suggesting to store a ScanningProjectModules (or any other ProjectModules) inside CachingProjectModules. We can't even if we wanted to, in theory any source file can belong to a different ProjectModules.
Hence we store the GlobalCompilationDatabase, so that we can access a relevant ProjectModules when the cache is invalid or we don't have an entry. Hence we're not sharing any ScanningProjectModules instances across threads.

As for thread-safety concerns in CachingProjectModules itself, I think we have the exact same requirements as this patch (https://github.com/llvm/llvm-project/pull/125988/files#diff-bfbef59d036ce319b5ed9b774f6ddb8dcd7efdba3384de03cb9a37f23ef678ba), basically underlying cache can be accessed concurrently, but rest should be outside the critical path.

Does that make sense? If not could you elaborate on your concern?

@ChuanqiXu9
Copy link
Member Author

  • Now the ScanningProjectModules are owned by different calls to ModulesBuilder::buildPrerequisiteModulesFor. So we don't need to care about thread safety in ScanningProjectModules. And if we make it the underlying ProjectModules, we need to care about the thead safety and then we would have a centralized scanner. It is not bad but it might be a bigger change.

Not sure I understand the argument here. Please note that I am not suggesting to store a ScanningProjectModules (or any other ProjectModules) inside CachingProjectModules. We can't even if we wanted to, in theory any source file can belong to a different ProjectModules. Hence we store the GlobalCompilationDatabase, so that we can access a relevant ProjectModules when the cache is invalid or we don't have an entry. Hence we're not sharing any ScanningProjectModules instances across threads.

As for thread-safety concerns in CachingProjectModules itself, I think we have the exact same requirements as this patch (https://github.com/llvm/llvm-project/pull/125988/files#diff-bfbef59d036ce319b5ed9b774f6ddb8dcd7efdba3384de03cb9a37f23ef678ba), basically underlying cache can be accessed concurrently, but rest should be outside the critical path.

Does that make sense? If not could you elaborate on your concern?

Yeah, I though you were saying to store the underlying ProjectModules. Now it is fine.

Copy link
Member

@kadircet kadircet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I think the biggest concern is about initial performance of this approach until cache warms up.

also can we have some tests? it should be possible to unittest by injecting a custom CDB into modules builder.

@ChuanqiXu9
Copy link
Member Author

Update: I asked the Scanner to not run again after global scanned. I felt this was an simple over sight. It doesn't make sense to ask a scanner to scan globally multiple times.

Copy link
Member

@kadircet kadircet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the delayed turn around, lgtm!

@ChuanqiXu9 ChuanqiXu9 merged commit ae839b0 into llvm:main Feb 26, 2025
8 checks passed
@ChuanqiXu9 ChuanqiXu9 added this to the LLVM 20.X Release milestone Feb 26, 2025
@ChuanqiXu9
Copy link
Member Author

I hope to backport this to 20.x. It shouldn't be riskful given the modules support in clangd is controlled by an "experiment" option.

@ChuanqiXu9
Copy link
Member Author

/cherry-pick ae839b0

@llvmbot
Copy link
Member

llvmbot commented Feb 26, 2025

/pull-request #128841

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:modules C++20 modules and Clang Header Modules clang-tools-extra clangd
Projects
Development

Successfully merging this pull request may close these issues.

3 participants