Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update limit behavior to use incrementing index (fix last_by_index aggregate) #2964

Merged
merged 3 commits into from
Apr 4, 2025

Conversation

timbess
Copy link
Contributor

@timbess timbess commented Mar 28, 2025

This PR decouples primary keys from the limit functionality of perspective tables, allowing us to handle aggregates like "last by index" a bit better when users have a limit and want to rely on insert order to window their data.

Along the way I also fixed some memory safety issues where we had called new to allocate some memory and free'd it with free rather than delete. This wasn't breaking anything at the moment, but may have been relying on undefined behavior accidentally.

The new benchmarks also triggered a bug in our polling implementation where if a table was dirtied and deleted before the poll call, it would segfault the poll call because we hadn't removed it from the dirty map in the server, so I fixed that as well.

@timbess timbess force-pushed the feature/decouple-pkey-from-limit branch from 7fc2d4f to 0b83a2b Compare March 28, 2025 15:23
@texodus texodus added the enhancement Feature requests or improvements label Mar 28, 2025
@timbess timbess force-pushed the feature/decouple-pkey-from-limit branch 10 times, most recently from bd63e26 to da9fb69 Compare April 2, 2025 19:47
timbess added 2 commits April 2, 2025 15:47
Signed-off-by: Timothy Bess <timbessmail@gmail.com>
Signed-off-by: Timothy Bess <timbessmail@gmail.com>
@timbess timbess force-pushed the feature/decouple-pkey-from-limit branch 2 times, most recently from 394606b to fd271ee Compare April 3, 2025 15:11
@timbess timbess marked this pull request as ready for review April 3, 2025 16:35
…freeing it with "free" which I believe worked in our case, but is UB I believe.

Signed-off-by: Timothy Bess <timbessmail@gmail.com>
@timbess timbess force-pushed the feature/decouple-pkey-from-limit branch from fd271ee to b7a4879 Compare April 3, 2025 20:55
Copy link
Member

@texodus texodus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks good!

You forgot to mentions his PR also fixes CI (much appreciated!) ... which has been failing due to the GHA CI machines running out of HD space during the Python build.

Test coverage looks great and covers last_by_index aggregate behavior specifically.

Benchmarks for feature change to limit behavior wrt Table.update() are great too. This change does facially increase the allocation size with the new psp_old_key column (I think - there is a lot changed including the flatten behavior so I may be wrong here), but its great to see there are no observable impact within the bounds of our CPU time benchmarking. It is also a good reminder IMO we should implement heap benchmarks!

break;

switch (op) { FOREACH_T_OP(X) }
#undef X
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lmao

flattened_old_pkey_col->set_scalar(flattened_idx, old);
flattened_old_pkey_col->set_valid(flattened_idx, true);
LOG_DEBUG("DELETING OLD PKEY: " << old);
// erase(old);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💀

t_uindex m_bidx;
t_uindex m_eidx;
t_uindex m_begin_idx;
t_uindex m_edge_idx;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@@ -404,6 +418,9 @@ t_data_table::flatten_helper_1(FLATTENED_T flattened) const {
std::vector<t_index> edges;
edges.push_back(0);

// | pkey-1 | pkey-1 | pkey-2 | pkey-2
// ^ Is this an edge? I think so.
// |-----------------| <- This is a span between those edges.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

if (delete_encountered) {
d_pkey_col->push_back(
d_pkey_col->set_nth(
store_idx % limit,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we're applying limit to the flattened input port table. Is this is because the size of this table is later assumed to be lower bound of the new master table size (or similar)? I think this is fine but (especially compared to the current state of the limit feature), but it has reminded me we likely have some other limit related abnormalities in specific aggregates, e.g. AGGTYPE_HIGH_WATER_MARK, which never decreases and thus would calculate differently if a single input update step was large than the limit, and the overwritten range removed the new high water mark value (I think).

@@ -26,6 +26,7 @@ namespace perspective {
template <class FUNCTION>
void
parallel_for(int num_tasks, FUNCTION&& func) {
#undef PSP_PARALLEL_FOR
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

@@ -26,6 +26,9 @@ pub fn cmake_build() -> Result<Option<PathBuf>, std::io::Error> {
let profile = std::env::var("PROFILE").unwrap();
dst.always_configure(true);
dst.define("CMAKE_BUILD_TYPE", profile.as_str());
dst.define("ARROW_BUILD_EXAMPLES", "OFF");
dst.define("RAPIDJSON_BUILD_EXAMPLES", "OFF");
dst.define("ARROW_CXX_FLAGS_DEBUG", "-Wno-error");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! IMO we should move this into the CMakeLists.txt though so it applies to the wasm builds also (assuming it doesn't already).

if (!check_version_gte(metadata.version, "3.4.3")) {
// Bug with old versions of perspective segfault when you delete
// a table with pending updates.
await table.size();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, this is good to fix (fix mentioned in the PR summary)

return cmp(cmpval, 0);
} // namespace perspective
}
;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a clang_format error?

@texodus texodus changed the title Decouple Primary keys from limit of table. Update limit behavior to use incrementing index (fix last_by_index aggregate) Apr 4, 2025
@texodus texodus merged commit b0c0177 into finos:master Apr 4, 2025
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests or improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants