Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(time_offset): improved LIMIT-handling in advanced analytics #27934

Merged

Conversation

Antonio-RiveroMartnez
Copy link
Member

@Antonio-RiveroMartnez Antonio-RiveroMartnez commented Apr 8, 2024

SUMMARY

When applying a time_offset in Advanced Analytics, if the row_limit is too small, data will not be present when the join occurs, so this PR prevent such behavior by not using limit nor offset in the inner query (the shifted one).

For example this query shown in the following video has all the records for a period of 10+ years:

Screen.Recording.2024-04-08.at.11.04.50.AM.mov

We can see that the result of shifted queries not always include the same periods than the original query, so when the join gets applied, it won't match some records, thus giving you incorrect output.

Original query with limit and sorting:
Captura de pantalla 08 04 2024 a 11 05 13 AM

1 year ago query with limit and sorting:
Captura de pantalla 08 04 2024 a 11 05 50 AM

2 years ago query with limit and sorting:
Captura de pantalla 08 04 2024 a 11 06 08 AM

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Error:
In the following image some datapoints are missing in the shifted columns because they are not part of the result dataframe after sorting and limiting it like the original query.
error

Fix:
Now we can see that the query is able to find the records that belong to those offsets regardless, because it's joining using the complete data.
test

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

- Remove inner row_limit and row_offset when applying time_offset to prevent data inconsistency
@Antonio-RiveroMartnez
Copy link
Member Author

/testenv up

Copy link
Contributor

github-actions bot commented Apr 8, 2024

@Antonio-RiveroMartnez Ephemeral environment spinning up at http://34.215.85.155:8080. Credentials are admin/admin. Please allow several minutes for bootstrapping and startup.

- Use config value as row_limit
- Update test
@zhaoyongjie zhaoyongjie self-requested a review April 8, 2024 13:02
@Antonio-RiveroMartnez
Copy link
Member Author

/testenv up

Copy link
Contributor

github-actions bot commented Apr 8, 2024

@Antonio-RiveroMartnez Ephemeral environment spinning up at http://54.244.23.72:8080. Credentials are admin/admin. Please allow several minutes for bootstrapping and startup.

@john-bodley john-bodley self-requested a review April 8, 2024 16:53
@john-bodley
Copy link
Member

@Antonio-RiveroMartnez are there any open issues we can link to this? I'm somewhat surprised that no one has reported an issue with this in the past.

@Antonio-RiveroMartnez
Copy link
Member Author

Hey @john-bodley I just linked it in the PR description, thanks.

@mistercrunch mistercrunch changed the title fix(time_offset): Row Limit fix(time_offset): improved LIMIT-handling in advanced analytics Apr 11, 2024
Copy link
Member

@mistercrunch mistercrunch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes + tests LGTM. Let's follow up with a PR breaking this large method in the right places.

@@ -455,6 +455,13 @@ def processing_time_offsets( # pylint: disable=too-many-locals,too-many-stateme
for metric in metric_names
}

# When the original query has limit or offset we wont apply those
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow this method has gotten large! Thinking about refactors here and factoring out a processing_time_offset (a method for each time offset) might help a little in reading this, but probably breaking down further would be good too.

Doing the refactor in this PR as part of this PR would make it harder to spot the diff, so I'll approve this one, but hoping we can refactor some of these into more bite-sized methods. Maybe a goal could be to remove the disable=too-many-locals,too-many-statements above.

@Antonio-RiveroMartnez Antonio-RiveroMartnez merged commit 6844735 into apache:master Apr 11, 2024
28 checks passed
Copy link
Contributor

Ephemeral environment shutdown and build artifacts deleted.

@michael-s-molina michael-s-molina added the v4.0 Label added by the release manager to track PRs to be included in the 4.0 branch label Apr 12, 2024
EnxDev pushed a commit to EnxDev/superset that referenced this pull request Apr 15, 2024
michael-s-molina pushed a commit that referenced this pull request Apr 16, 2024
michael-s-molina pushed a commit that referenced this pull request Apr 16, 2024
michael-s-molina pushed a commit that referenced this pull request Apr 16, 2024
betodealmeida pushed a commit that referenced this pull request Apr 25, 2024
qleroy pushed a commit to qleroy/superset that referenced this pull request Apr 28, 2024
jzhao62 pushed a commit to jzhao62/superset that referenced this pull request May 16, 2024
@mistercrunch mistercrunch added 🍒 4.0.1 🍒 4.0.2 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels labels Jul 24, 2024
vinothkumar66 pushed a commit to vinothkumar66/superset that referenced this pull request Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/M v4.0 Label added by the release manager to track PRs to be included in the 4.0 branch 🍒 4.0.1 🍒 4.0.2 🚢 4.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants