Limit lists for duplicate merging of cases #9054

StefanKock · 2022-05-03T12:34:12Z

Problem Description

Data used for merging is currently only limited based on the report date filter, but still has to go through big parts of the available data.
See #8697 (comment)

Proposed Change

Limit CaseFacadeEjb.getCasesForDuplicateMerging method to fetch 100 entries.
When 100 entries are returned display a warning below the grid in the UI: "The returned data has been limited to < placeholder > entries. You can adjust the creation date filter or add other filter criteria to drill down on this."
In addition show a drop-down that allows the user to define the limit: 100, 500, 1000, none
Re-evaluate the performance of this ~~with the database used in Analyse slow system behaviour when creating an event participant / trying a case merge [2] #8697~~. If the performance is still bad, provide an explain analyze via https://explain.dalibo.com/) If the performance is acceptable (max 10 seconds waiting time), the currently available warning popup shown when entering the merge view can be removed (including Strings.infoMergeFiltersHint)

Acceptance Criteria

Implementation Details

get rid of joins that are not needed (see Limit lists for duplicate merging of cases #9054 (comment))

LEFT OUTER JOIN LOCATION location3_ ON person2_.address_id=location3_.id
LEFT OUTER JOIN Region region4_ ON case0_.responsibleRegion_id=region4_.id
LEFT OUTER JOIN Region region5_ ON case0_.region_id=region5_.id
LEFT OUTER JOIN users user7_ ON case0_.reportingUser_id=user7_.id
LEFT OUTER JOIN Facility facility8_ ON case0_.healthFacility_id=facility8_.id
LEFT OUTER JOIN Region region10_ ON case1_.responsibleRegion_id=region10_.id
LEFT OUTER JOIN Region region11_ ON case1_.region_id=region11_.id

get rid of distinct
introduce index to cases sort by creation date DESC

Additional Information

Create issue to do the same for ContactFacadadeEjb.getContactsForDuplicateMerging Limit lists for duplicate merging of contacts [3] #11469
Create issue to filter by archived/active and to only compare with cases that have creation date within or before date range. Limit case duplicate merging comparison based on creation date and archived status [3] #11465
Create issue to provide guide on how to analyze & solve performance problems like these Update guide on how to analyze & solve performance problems #11467

After merging a duplicate, the list will be re-loaded, potentially causing a long delay. If we don't manage to improve the performance of the query itself we would at least have to optimize this aspect.

The text was updated successfully, but these errors were encountered:

StefanKock · 2022-09-02T08:51:29Z

Investigated the failing query (no space left on disk). The problem is that this query makes a Case x Case crossjoin that results probably in > 4 billion lines for 65k cases.

Executed sql query

SELECT DISTINCT case0_.id AS col_0_0_,
                case1_.id AS col_1_0_,
                case0_.creationDate AS col_2_0_
FROM cases case0_
CROSS JOIN cases case1_
LEFT OUTER JOIN Person person2_ ON case0_.person_id=person2_.id
LEFT OUTER JOIN LOCATION location3_ ON person2_.address_id=location3_.id
LEFT OUTER JOIN Region region4_ ON case0_.responsibleRegion_id=region4_.id
LEFT OUTER JOIN Region region5_ ON case0_.region_id=region5_.id
LEFT OUTER JOIN Symptoms symptoms6_ ON case0_.symptoms_id=symptoms6_.id
LEFT OUTER JOIN users user7_ ON case0_.reportingUser_id=user7_.id
LEFT OUTER JOIN Facility facility8_ ON case0_.healthFacility_id=facility8_.id
LEFT OUTER JOIN Person person9_ ON case1_.person_id=person9_.id
LEFT OUTER JOIN Region region10_ ON case1_.responsibleRegion_id=region10_.id
LEFT OUTER JOIN Region region11_ ON case1_.region_id=region11_.id
LEFT OUTER JOIN Symptoms symptoms12_ ON case1_.symptoms_id=symptoms12_.id
WHERE case0_.deleted=FALSE
  AND case1_.deleted=FALSE
  AND case0_.creationDate>$1
  AND case0_.creationDate<$2
  AND case0_.deleted=$3
  AND (unaccent(user7_.firstName) ILIKE unaccent($4)=TRUE
       OR unaccent(user7_.lastName) ILIKE unaccent($5)=TRUE
       OR unaccent(user7_.userName) ILIKE unaccent($6)=TRUE)
  AND similarity(((person2_.firstName||$7)||person2_.lastName),((person9_.firstName||$8)||person9_.lastName))>0.65
  AND case0_.disease=case1_.disease
  AND region4_.id=region10_.id
  AND (region5_.id IS NULL
       OR region5_.id=region11_.id)
  AND abs(date_part($9, case0_.reportDate)-date_part($10, case1_.reportDate))<=2592000
  AND (person2_.sex IS NULL
       OR person9_.sex IS NULL
       OR person2_.sex=$11
       OR person9_.sex=$12
       OR person2_.sex=person9_.sex)
  AND (person2_.birthdate_dd IS NULL
       OR person2_.birthdate_mm IS NULL
       OR person2_.birthdate_yyyy IS NULL
       OR person9_.birthdate_dd IS NULL
       OR person9_.birthdate_mm IS NULL
       OR person9_.birthdate_yyyy IS NULL
       OR person2_.birthdate_dd=person9_.birthdate_dd
       AND person2_.birthdate_mm=person9_.birthdate_mm
       AND person2_.birthdate_yyyy=person9_.birthdate_yyyy)
  AND (symptoms6_.onsetDate IS NULL
       OR symptoms12_.onsetDate IS NULL
       OR abs(date_part($13, symptoms6_.onsetDate)-date_part($14, symptoms12_.onsetDate))<=2592000)
  AND (case0_.creationDate<case1_.creationDate
       OR case1_.creationDate<=$15
       OR case1_.creationDate>=$16)
  AND case0_.id<>case1_.id
ORDER BY case0_.creationDate DESC
LIMIT $17

Idea: Only run the query for one case in case0_ collection to analyse that query, and then test where the sweet spot is (probably less than ten cases in one query). Search for duplicates until the desired limit is reached or all cases where checked.

MateStrysewske · 2022-09-14T11:36:53Z

Current status and some Explain Analyse links:

Getting rid of the person comparison conditions (i.e. name, sex, birth date) and instead comparing by person ID basically makes the query execute instantly with the performance database from ~ 2 months ago (85,000 cases) because it drastically reduces the subset of data that needs to be filtered. However, this is not an option until we have a dedicated person duplicate merging, and we would then probably run into performance problems while implementing that.

Here's the Explain Analyse of the current state of the query:

https://explain.dalibo.com/plan/PgR8

@MartinWahnschaffe and I had a closer look at it and have identified the potential problem that the limit is only applied at the end of the query, i.e. the whole subset of data is considered before, which is very problematic because we're doing a cross join. One reason could be that the query above adds some sorting in between. However, even getting rid of this sorting by adjusting the query according to the next Explain Analyse doesn't really improve this situation (even though it drastically reduces the size of the plan and definitely make sense to implement anyway):

https://explain.dalibo.com/plan/8oI

The proposal to only apply the whole query to a very small subset of cases is very likely not practical because it will result in a huge number of queries being executed in most scenarios where the limit is not reached quickly.

…e_limit #9054 case merge limit

MartinWahnschaffe · 2023-02-09T08:50:28Z

Note: I have checked the duplicate detection on https://test-de.sormas.netzlink.com. Before the update it took 1 minute or more to open the view. Now it's just a few seconds, so there seems to be a positive tendency :-)

adinaflorea9 · 2023-02-09T10:13:19Z

I am reopening the issue to the the fact that an error appears when filtering across directories and then deleting the inputted information. Reported it in #11470 but it was observed that it might have been caused by this ticket.

…ull-check #9054 added missing null check

abrudanancuta · 2023-02-13T12:46:28Z

Validated on test.de, test.de2. test.de1 with the version 1.80.0-SNAPSHOT (a118b4a)

StefanKock added contacts cases backend Affects the web backend change A change of an existing feature (ticket type) performance Issues that are meant to increase the performance of the application labels May 3, 2022

StefanKock mentioned this issue May 12, 2022

Improve loading times on Merge Duplicate Cases view [Timebox: 1d] #2227

Closed

MartinWahnschaffe changed the title ~~Limit lists for duplicate merging~~ Limit lists for duplicate merging or cases and contacts May 23, 2022

markusmann-vg changed the title ~~Limit lists for duplicate merging or cases and contacts~~ Limit lists for duplicate merging for cases and contacts May 25, 2022

markusmann-vg changed the title ~~Limit lists for duplicate merging for cases and contacts~~ Limit lists for duplicate merging for cases and contacts [2] May 25, 2022

sergiupacurariu self-assigned this Aug 22, 2022

sergiupacurariu removed their assignment Aug 26, 2022

MateStrysewske self-assigned this Sep 1, 2022

MateStrysewske mentioned this issue Sep 27, 2022

[Performance] Filtering out the cases in the Merging Duplicate Cases Directory shouldn't take that long #10460

Closed

StefanKock added the discuss Discussion about whether or not the issue should or can be implemented required label Oct 14, 2022

markusmann-vg removed the discuss Discussion about whether or not the issue should or can be implemented required label Jan 27, 2023

markusmann-vg unassigned MateStrysewske Jan 27, 2023

markusmann-vg changed the title ~~Limit lists for duplicate merging for cases and contacts [2]~~ INVESTIGATION: Limit lists for duplicate merging for cases and contacts Jan 27, 2023

MartinWahnschaffe added a commit that referenced this issue Feb 8, 2023

Merge pull request #11459 from hzi-braunschweig/change-9054_case_merg…

cf6272b

…e_limit #9054 case merge limit

StefanKock added this to the Iteration 2023-01 - 1.80.0 milestone Feb 9, 2023

StefanKock removed the contacts label Feb 9, 2023

MartinWahnschaffe mentioned this issue Feb 9, 2023

Update guide on how to analyze & solve performance problems #11467

Closed

5 tasks

MartinWahnschaffe changed the title ~~Limit lists for duplicate merging for cases~~ Limit lists for duplicate merging of cases Feb 9, 2023

StefanKock assigned adinaflorea9 Feb 9, 2023

MartinWahnschaffe mentioned this issue Feb 9, 2023

Limit lists for duplicate merging of contacts [3] #11469

Closed

7 tasks

abrudanancuta assigned abrudanancuta and unassigned adinaflorea9 Feb 9, 2023

adinaflorea9 reopened this Feb 9, 2023

MartinWahnschaffe added a commit that referenced this issue Feb 9, 2023

#9054 added missing null check

0386b83

MartinWahnschaffe mentioned this issue Feb 9, 2023

#9054 added missing null check #11472

Merged

MartinWahnschaffe closed this as completed in #11472 Feb 9, 2023

MartinWahnschaffe added a commit that referenced this issue Feb 9, 2023

Merge pull request #11472 from hzi-braunschweig/change-9054-missing-n…

943c563

…ull-check #9054 added missing null check

abrudanancuta added the qa-verified Issue has been tested and verified by QA label Feb 13, 2023

abrudanancuta mentioned this issue Mar 7, 2023

Add merge confirmation pop-up to contacts view and rephrase content [1] #11596

Closed

3 tasks

MateStrysewske mentioned this issue Mar 23, 2023

[Merge] Searching for name results in error message #11712

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Limit lists for duplicate merging of cases #9054

Limit lists for duplicate merging of cases #9054

StefanKock commented May 3, 2022 •

edited by MartinWahnschaffe

Loading

StefanKock commented Sep 2, 2022

MateStrysewske commented Sep 14, 2022

MartinWahnschaffe commented Feb 9, 2023

adinaflorea9 commented Feb 9, 2023

abrudanancuta commented Feb 13, 2023

Limit lists for duplicate merging of cases #9054

Limit lists for duplicate merging of cases #9054

Comments

StefanKock commented May 3, 2022 • edited by MartinWahnschaffe Loading

Problem Description

Proposed Change

Acceptance Criteria

Implementation Details

Additional Information

StefanKock commented Sep 2, 2022

MateStrysewske commented Sep 14, 2022

MartinWahnschaffe commented Feb 9, 2023

adinaflorea9 commented Feb 9, 2023

abrudanancuta commented Feb 13, 2023

StefanKock commented May 3, 2022 •

edited by MartinWahnschaffe

Loading