-
Notifications
You must be signed in to change notification settings - Fork 1.1k
fix(stmt-html): Fix embedded Buffer processing performance issue. #8748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
742b880 to
58cf410
Compare
src/StmtToHTML.cpp
Outdated
| asm_stream << line << "\n"; | ||
| if (line.length() > 500) { | ||
| // Very long lines in the assembly are typically the _gpu_kernel_sources | ||
| // or other buffers (such as model weights) as a raw ASCII block in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say something more realistic like "static LUTs" here... model weights are so large, they should be ImageParams.
rtzam
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just tested locally on the big LLM test case. The speedup is from "practically non-terminating" to ~40 seconds for a 1 layer model. Huge improvements! Thanks for the change.
| } | ||
| } | ||
|
|
||
| start = end + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that start can ever point past the end of the string_view made me uneasy. I can't wait for std::views::split to be available (c++23)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's my take on a C++23 version, for posterity.
void generate(std::string_view code, std::map<uint64_t, std::regex>& markers,
std::map<uint64_t, int>& lnos) {
static constexpr std::string_view marker_prefix = "%\"";
std::size_t lno = 1;
for (auto&& chunk : std::views::split(code, '\n')) {
std::string_view line(chunk.begin(), chunk.end());
if (line.contains(marker_prefix)) {
std::erase_if(markers, [&](const auto& kv) {
const auto& [node, regex] = kv;
if (std::regex_search(line.begin(), line.end(), regex)) {
lnos[node] = lno;
return true;
}
return false;
});
}
lno++;
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well the loop condition should put you at ease. But indeed, a well-tested standard function would be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well the loop condition should put you at ease.
It does 🙂
Do you know where the remaining 40 seconds are spent? More stuff to blame in StmtToHTML? I suspect loading the asm file can be slow (in |
|
So I've re-profiled and the remaining runtime is:
|
|
Duplicating the IR refers the Module clone, right? So nothing I want to do there. 20% of 40s is still 8s. Eight seconds for loading a text file... Let me try to optimize this either way. This is still a bit too ridiculous to my taste 😝 |
… HTML. Move-optimized replace_all: allow for no-copy execution when target string is not found.
I pushed another patch for loading faster. This should pretty much eliminate the loading time. Curious to see what the result is if you feel like profiling it again. 😄 |
|
I'm measuring a difference in the generation of HTML (~10% faster). Nice!! |
|
I'm looking at LLVM's libcxx implementation of basic_string<_CharT, _Traits, _Allocator>::append(_ForwardIterator __first, _ForwardIterator __last) {
size_type __sz = size();
size_type __cap = capacity();
size_type __n = static_cast<size_type>(std::distance(__first, __last));
if (__n) {
if (__string_is_trivial_iterator<_ForwardIterator>::value && !__addr_in_range(*__first)) {
if (__cap - __sz < __n)
__grow_by_without_replace(__cap, __sz + __n - __cap, __sz, __sz, 0);
__annotate_increase(__n);
auto __end = __copy_non_overlapping_range(__first, __last, std::__to_address(__get_pointer() + __sz));
traits_type::assign(*__end, value_type());
__set_size(__sz + __n);
} else {
const basic_string __temp(__first, __last, __alloc()); //// <<< HERE!
append(__temp.data(), __temp.size());
}
}
return *this;
}What a waste... It allocates the |
Initial changes to address #8717.
This does not address the Module deep-copy behavior copying Buffer contents. This merely addresses the string processing and avoids some copies.
Fixes #8752
Drive-by optimization: Change the signature of replace_all() to take the subject-string to be passed by value. This now allows for making replace-chains in this pattern:
Without making copies in case the target string is not found in the subject string.