Skip to content

[C++20][Modules] The search complexity for header file info is too expensive with a lot of modules. #140860

Open
@mpark

Description

@mpark

Currently, for each header file #include or import, there is a look-up of this header file on each of the loaded PCMs. This results in a search complexity of O(N*M) where N is the # of includes, and M is the # of loaded PCMs. For large numbers, e.g. N = 10,000 and M = 1,000, this becomes very costly.

The synthetic test case I put together takes 14s user time, compared to 0.06s without modules.
In a real world scenario, the difference for us was 18s with modules and 3s without modules.

This problem was encountered previously inside ASTWriter (see #87848). It was solved there by only considering the local information rather than making the query out to ASTReader. I'm not sure that such a solution can be used in general. (CC @jansvoboda11)

Test Case

I wrote a bash script to generate the test case. It produces 1,000 empty header units, and 10,000 empty header files. The main file imports the 1,000 header units, then #includes 10,000 header files. We can measure the preprocessing time on this file. On my machine, this takes 14s, compared to 0.06s without modules.

The script for the synthetic repro:

#/usr/bin/env bash

CLANG=${CLANG:-clang++}
NUM_MODULES=${NUM_MODULES:-1000}
NUM_HEADERS=${NUM_HEADERS:-10000}

mkdir build && cd build

for i in $(seq 1 $NUM_MODULES); do > m${i}.h; done
seq 1 $NUM_MODULES | xargs -I {} -P $(nproc) bash -c "$CLANG -std=c++20 -fmodule-header m{}.h"

for i in $(seq 1 $NUM_HEADERS); do > h${i}.h; done

> a.cpp
for i in $(seq 1 $NUM_MODULES); do echo "import \"m${i}.h\";" >> a.cpp; done
for i in $(seq 1 $NUM_HEADERS); do echo "#include \"h${i}.h\"" >> a.cpp; done
echo "int main() {}" >> a.cpp

time $CLANG -std=c++20 -Wno-experimental-header-units -E a.cpp -o /dev/null \
    $(for i in $(seq 1 $NUM_MODULES); do echo -n "-fmodule-file=m${i}.pcm "; done)

Note: I am aware that -fmodule-file=<pcm> eagerly loads modules and that it's not encouraged. In reality we actually use the -fmodule-file=<name>=<pcm> form with header units.

Code Path

We more or less start here:

  // Do a standard file entry lookup.
  OptionalFileEntryRef FE = HeaderInfo.LookupFile(...);

and proceed through the following:

Here we encounter this:

ModuleMap::KnownHeader
HeaderSearch::findModuleForHeader(FileEntryRef File, bool AllowTextual,
                                  bool AllowExcluded) const {
  if (ExternalSource) {
    // Make sure the external source has handled header info about this file,
    // which includes whether the file is part of a module.
    (void)getExistingFileInfo(File);
  }
  return ModMap.findModuleForHeader(File, AllowTextual, AllowExcluded);
}

where there is an invocation to (void)getExistingFileInfo(File);, which calls ExternalSource->GetHeaderFileInfo(FE);.

We finally arrive at:

HeaderFileInfo ASTReader::GetHeaderFileInfo(FileEntryRef FE) {
  HeaderFileInfoVisitor Visitor(FE);
  ModuleMgr.visit(Visitor);
  if (std::optional<HeaderFileInfo> HFI = Visitor.getHeaderFileInfo())
      return *HFI;

  return HeaderFileInfo();
}

where ModuleMgr.visit(Visitor); will visit every currently loaded PCM.


Every PCM file contains a HEADER_SEARCH_TABLE record, which is an on-disk hash table. For header units, it has at least one entry in that table which is the header file itself. For named modules, we shouldn't need this, but it's still present today (see #78495). This means that every header unit has a HEADER_SEARCH_TABLE with at least one entry in it.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions