Limit lists for duplicate merging of cases #9054
Investigated the failing query (no space left on disk). The problem is that this query makes a Case x Case cross join that probably results in more than 4 billion rows for 65k cases. Executed SQL query:
SELECT DISTINCT case0_.id AS col_0_0_,
case1_.id AS col_1_0_,
case0_.creationDate AS col_2_0_
FROM cases case0_
CROSS JOIN cases case1_
LEFT OUTER JOIN Person person2_ ON case0_.person_id=person2_.id
LEFT OUTER JOIN LOCATION location3_ ON person2_.address_id=location3_.id
LEFT OUTER JOIN Region region4_ ON case0_.responsibleRegion_id=region4_.id
LEFT OUTER JOIN Region region5_ ON case0_.region_id=region5_.id
LEFT OUTER JOIN Symptoms symptoms6_ ON case0_.symptoms_id=symptoms6_.id
LEFT OUTER JOIN users user7_ ON case0_.reportingUser_id=user7_.id
LEFT OUTER JOIN Facility facility8_ ON case0_.healthFacility_id=facility8_.id
LEFT OUTER JOIN Person person9_ ON case1_.person_id=person9_.id
LEFT OUTER JOIN Region region10_ ON case1_.responsibleRegion_id=region10_.id
LEFT OUTER JOIN Region region11_ ON case1_.region_id=region11_.id
LEFT OUTER JOIN Symptoms symptoms12_ ON case1_.symptoms_id=symptoms12_.id
WHERE case0_.deleted=FALSE
AND case1_.deleted=FALSE
AND case0_.creationDate>$1
AND case0_.creationDate<$2
AND case0_.deleted=$3
AND (unaccent(user7_.firstName) ILIKE unaccent($4)=TRUE
OR unaccent(user7_.lastName) ILIKE unaccent($5)=TRUE
OR unaccent(user7_.userName) ILIKE unaccent($6)=TRUE)
AND similarity(((person2_.firstName||$7)||person2_.lastName),((person9_.firstName||$8)||person9_.lastName))>0.65
AND case0_.disease=case1_.disease
AND region4_.id=region10_.id
AND (region5_.id IS NULL
OR region5_.id=region11_.id)
AND abs(date_part($9, case0_.reportDate)-date_part($10, case1_.reportDate))<=2592000
AND (person2_.sex IS NULL
OR person9_.sex IS NULL
OR person2_.sex=$11
OR person9_.sex=$12
OR person2_.sex=person9_.sex)
AND (person2_.birthdate_dd IS NULL
OR person2_.birthdate_mm IS NULL
OR person2_.birthdate_yyyy IS NULL
OR person9_.birthdate_dd IS NULL
OR person9_.birthdate_mm IS NULL
OR person9_.birthdate_yyyy IS NULL
OR person2_.birthdate_dd=person9_.birthdate_dd
AND person2_.birthdate_mm=person9_.birthdate_mm
AND person2_.birthdate_yyyy=person9_.birthdate_yyyy)
AND (symptoms6_.onsetDate IS NULL
OR symptoms12_.onsetDate IS NULL
OR abs(date_part($13, symptoms6_.onsetDate)-date_part($14, symptoms12_.onsetDate))<=2592000)
AND (case0_.creationDate<case1_.creationDate
OR case1_.creationDate<=$15
OR case1_.creationDate>=$16)
AND case0_.id<>case1_.id
ORDER BY case0_.creationDate DESC
LIMIT $17
Idea: Only run the query for a single case in the case0_ collection to analyse that query, and then test where the sweet spot is (probably fewer than ten cases in one query). Search for duplicates until the desired limit is reached or all cases were checked.
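The batching idea above could be sketched as follows. This is a minimal, hypothetical illustration using sqlite3 and a drastically simplified schema (only id, disease, person_name stand in for the real match criteria); it is not the actual SORMAS query, only the "run the pair query per small batch of case0_ ids and stop at the limit" pattern:

```python
import sqlite3

# Hypothetical, simplified schema; the real query also joins Person,
# Region, Symptoms, users, Facility and uses similarity()/unaccent().
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cases (id INTEGER PRIMARY KEY, disease TEXT,
                    person_name TEXT, deleted INTEGER DEFAULT 0);
INSERT INTO cases (disease, person_name) VALUES
  ('EVD', 'John Smith'), ('EVD', 'Jon Smith'),
  ('EVD', 'Maria Garcia'), ('CSM', 'Maria Garcia'),
  ('EVD', 'John Smith');
""")

def find_duplicates(conn, limit, batch_size=2):
    """Run the pair query per small batch of case0_ ids instead of one
    giant cross join; stop as soon as `limit` pairs were collected."""
    case_ids = [r[0] for r in conn.execute(
        "SELECT id FROM cases WHERE deleted = 0 ORDER BY id")]
    pairs = []
    for start in range(0, len(case_ids), batch_size):
        batch = case_ids[start:start + batch_size]
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            f"""SELECT c0.id, c1.id
                FROM cases c0 JOIN cases c1
                  ON c0.disease = c1.disease
                 AND c0.person_name = c1.person_name
                 AND c0.id < c1.id
                WHERE c0.id IN ({placeholders}) AND c1.deleted = 0""",
            batch).fetchall()
        pairs.extend(rows)
        if len(pairs) >= limit:
            break
    return pairs[:limit]

print(find_duplicates(conn, limit=2))  # → [(1, 5)]
```

As noted later in the thread, whether this actually pays off depends on how quickly the limit is reached; when matches are rare, many small queries are executed.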
Current status and some Explain Analyse links: Getting rid of the person comparison conditions (i.e. name, sex, birth date) and instead comparing by person ID basically makes the query execute instantly with the performance database from ~2 months ago (85,000 cases), because it drastically reduces the subset of data that needs to be filtered. However, this is not an option until we have a dedicated person duplicate merging, and we would then probably run into performance problems while implementing that.
Here's the Explain Analyse of the current state of the query: https://explain.dalibo.com/plan/PgR8
@MartinWahnschaffe and I had a closer look at it and identified the potential problem that the limit is only applied at the end of the query, i.e. the whole subset of data is considered beforehand, which is very problematic because we're doing a cross join. One reason could be that the query above adds some sorting in between. However, even getting rid of this sorting by adjusting the query according to the next Explain Analyse doesn't really improve the situation (even though it drastically reduces the size of the plan and definitely makes sense to implement anyway): https://explain.dalibo.com/plan/8oI
The proposal to only apply the whole query to a very small subset of cases is very likely not practical because it would result in a huge number of queries being executed in most scenarios where the limit is not reached quickly.
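To make the "limit is only applied at the end" problem concrete, here is a small, hypothetical Python sketch: the `abs(a - b) <= 1` predicate just stands in for the real duplicate criteria. A plan that materializes every pair before limiting does the full quadratic work, while a plan that could stop at the limit exits after a tiny fraction of the comparisons:

```python
def find_matches(cases, limit, early_stop):
    """Count how many candidate pairs the nested loop evaluates.
    early_stop=False mimics a plan that applies LIMIT only after the
    full cross join; early_stop=True stops once `limit` matches exist."""
    evaluated = 0
    matches = []
    for a in cases:
        for b in cases:
            evaluated += 1
            # Stand-in for the real match predicate (disease, name, ...).
            if a != b and abs(a - b) <= 1:
                matches.append((a, b))
                if early_stop and len(matches) >= limit:
                    return matches, evaluated
    return matches[:limit], evaluated

cases = list(range(100))
m_full, e_full = find_matches(cases, limit=5, early_stop=False)
m_early, e_early = find_matches(cases, limit=5, early_stop=True)
print(e_full, e_early)  # → 10000 204
```

Both variants return the same five pairs, but the early-stopping one evaluates 204 candidate pairs instead of 10,000; with 65k cases the gap between ~4 billion pairs and an early exit is what makes the plan shape matter so much.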
…e_limit #9054 case merge limit
Note: I have checked the duplicate detection on https://test-de.sormas.netzlink.com. Before the update it took 1 minute or more to open the view. Now it's just a few seconds, so there seems to be a positive tendency :-)
I am reopening the issue due to the fact that an error appears when filtering across directories and then deleting the entered information. I reported it in #11470, but it was observed that it might have been caused by this ticket.
…ull-check #9054 added missing null check
Validated on test.de, test.de1, test.de2 with the version 1.80.0-SNAPSHOT (a118b4a)
Problem Description
Data used for merging is currently only limited by the report date filter, but still has to go through large parts of the available data.
See #8697 (comment)
Proposed Change
Limit the CaseFacadeEjb.getCasesForDuplicateMerging method to fetch 100 entries.
When 100 entries are returned, display a warning below the grid in the UI: "The returned data has been limited to <placeholder> entries. You can adjust the creation date filter or add other filter criteria to drill down on this."
Test the performance with the database used in Analyse slow system behaviour when creating an event participant / trying a case merge [2] #8697. If the performance is still bad, provide an explain analyze via https://explain.dalibo.com/.
If the performance is acceptable (max 10 seconds waiting time), the currently available warning popup shown when entering the merge view can be removed (including Strings.infoMergeFiltersHint).
Acceptance Criteria
Implementation Details
Additional Information
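The fetch-and-warn step of the Proposed Change can be sketched with the common "fetch limit + 1" pattern: request one row more than the limit so the caller can tell whether the result was truncated and the warning must be shown. The real method is CaseFacadeEjb.getCasesForDuplicateMerging in Java; the function and schema below are hypothetical stand-ins using sqlite3:

```python
import sqlite3

MERGE_FETCH_LIMIT = 100  # limit proposed in this issue

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (id INTEGER PRIMARY KEY, creation_date TEXT)")
conn.executemany("INSERT INTO cases (creation_date) VALUES (?)",
                 [("2022-05-%02d" % (i % 28 + 1),) for i in range(250)])

def get_cases_for_duplicate_merging(conn, limit=MERGE_FETCH_LIMIT):
    """Fetch limit + 1 rows so the caller can tell whether the result
    was truncated without running a separate COUNT query."""
    rows = conn.execute(
        "SELECT id FROM cases ORDER BY creation_date DESC LIMIT ?",
        (limit + 1,)).fetchall()
    truncated = len(rows) > limit
    return rows[:limit], truncated

rows, truncated = get_cases_for_duplicate_merging(conn)
if truncated:
    print(f"The returned data has been limited to {MERGE_FETCH_LIMIT} entries. "
          "You can adjust the creation date filter or add other filter "
          "criteria to drill down on this.")
```

The advantage over `len(rows) == limit` as the truncation signal is that it avoids a false warning when the data set contains exactly `limit` cases.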