
Conversation

@mcourteaux mcourteaux commented Aug 14, 2025

Initial changes to address #8717.

This does not address the Module deep-copy behavior copying Buffer contents. This merely addresses the string processing and avoids some copies.

Fixes #8752


Drive-by optimization: change the signature of replace_all() so that the subject string is passed by value. This allows chaining replacements in this pattern:

std::string s = ...;
s = replace_all(std::move(s), ..., ...);
s = replace_all(std::move(s), ..., ...);
s = replace_all(std::move(s), ..., ...);
s = replace_all(std::move(s), ..., ...);

without making copies when the target string is not found in the subject string.

@mcourteaux mcourteaux added performance code_cleanup No functional changes. Reformatting, reorganizing, or refactoring existing code. labels Aug 14, 2025
@mcourteaux mcourteaux marked this pull request as ready for review August 14, 2025 13:58
asm_stream << line << "\n";
if (line.length() > 500) {
// Very long lines in the assembly are typically the _gpu_kernel_sources
// or other buffers (such as model weights) as a raw ASCII block in the
Member
I would say something more realistic like "static LUTs" here... model weights are so large, they should be ImageParams.

Contributor

@rtzam rtzam left a comment

Just tested locally on the big LLM test case. The speedup is from "practically non-terminating" to ~40 seconds for a 1 layer model. Huge improvements! Thanks for the change.

}
}

start = end + 1;
Member
The fact that start can ever point past the end of the string_view made me uneasy. I can't wait for std::views::split to be available (C++23).

Member

Here's my take on a C++23 version, for posterity.

void generate(std::string_view code, std::map<uint64_t, std::regex>& markers,
              std::map<uint64_t, int>& lnos) {
    static constexpr std::string_view marker_prefix = "%\"";
    std::size_t lno = 1;

    for (auto&& chunk : std::views::split(code, '\n')) {
        std::string_view line(chunk.begin(), chunk.end());
        if (line.contains(marker_prefix)) {
            std::erase_if(markers, [&](const auto& kv) {
                const auto& [node, regex] = kv;
                if (std::regex_search(line.begin(), line.end(), regex)) {
                    lnos[node] = lno;
                    return true;
                }
                return false;
            });
        }
        lno++;
    }
}

https://gcc.godbolt.org/z/W9se9Gn61

Contributor Author

Well, the loop condition should put you at ease. But indeed, a well-tested standard function would be nice.

Member

Well the loop condition should put you at ease.

It does 🙂

@mcourteaux
Contributor Author

Just tested locally on the big LLM test case. The speedup is from "practically non-terminating" to ~40 seconds for a 1 layer model. Huge improvements! Thanks for the change.

Do you know where the remaining 40 seconds are spent? More stuff to blame in StmtToHTML? I suspect loading the asm file can be slow (in load_asm_code()), but didn't immediately see a way to improve that while keeping the code readable.

@rtzam
Contributor

rtzam commented Aug 14, 2025

So I've re-profiled and the remaining runtime is:

  • ~20% load_asm_code (just copying and moving string data)
  • ~50% duplicating the IR
  • ~30% LLVM compilation

Nothing we can really do about LLVM here, which leaves the IR duplication ¯\_(ツ)_/¯

@mcourteaux
Contributor Author

mcourteaux commented Aug 14, 2025

Duplicating the IR refers to the Module clone, right? So there's nothing I want to do there. I just wonder if cloning the IR is actually needed, as IR is supposed to be immutable AFAIK... Perhaps just assigning the IR, to take a snapshot of what it looked like at the point it's considered the "conceptual version", might suffice.

20% of 40s is still 8s. Eight seconds for loading a text file... Let me try to optimize this either way. This is still a bit too ridiculous for my taste 😝

… HTML. Move-optimized replace_all: allow for no-copy execution when target string is not found.
@mcourteaux
Contributor Author

~20% load_asm_code (just copying and moving string data)

I pushed another patch for loading faster. This should pretty much eliminate the loading time. Curious to see what the result is if you feel like profiling it again. 😄

@rtzam
Contributor

rtzam commented Aug 14, 2025

I'm measuring a difference in the generation of HTML (~10% faster). Nice!!
The hottest path I'm observing is still Halide::PipelineHTMLInspector::load_asm_code() where pretty much all the time is spent in string's push_back() method.
Thanks again for all the work!!

@mcourteaux
Contributor Author

I'm looking at LLVM's libc++ implementation of std::string::append(iter, iter):

template <class _CharT, class _Traits, class _Allocator>
template <class _ForwardIterator>
basic_string<_CharT, _Traits, _Allocator>&
basic_string<_CharT, _Traits, _Allocator>::append(_ForwardIterator __first, _ForwardIterator __last) {
  size_type __sz  = size();
  size_type __cap = capacity();
  size_type __n   = static_cast<size_type>(std::distance(__first, __last));
  if (__n) {
    if (__string_is_trivial_iterator<_ForwardIterator>::value && !__addr_in_range(*__first)) {
      if (__cap - __sz < __n)
        __grow_by_without_replace(__cap, __sz + __n - __cap, __sz, __sz, 0);
      __annotate_increase(__n);
      auto __end = __copy_non_overlapping_range(__first, __last, std::__to_address(__get_pointer() + __sz));
      traits_type::assign(*__end, value_type());
      __set_size(__sz + __n);
    } else {
      const basic_string __temp(__first, __last, __alloc());  //// <<< HERE!
      append(__temp.data(), __temp.size());
    }
  }
  return *this;
}

What a waste... It allocates the __temp string; there is a redundant copy happening for no reason. The other overload, taking a const char* and a size_t, doesn't do this pointless copy. I'll change the call to use that one.

@mcourteaux mcourteaux removed the request for review from halidebuildbots August 15, 2025 08:23
@mcourteaux mcourteaux merged commit 8e965e7 into halide:main Aug 15, 2025
3 of 6 checks passed


Development

Successfully merging this pull request may close these issues.

StmtToHtml copies its input too frequently

3 participants