matching.py: Fix incorrect fuzzy amount loop bounds #210

ankurdave · 2023-11-01T23:28:28Z

The indexing logic in matching.py has an off-by-one bug that manifests as #202.

The loop currently starts its slice one position early, so it includes an element that sorts below the lower bound:

for sp in cur_matches[lower_bound-1 if lower_bound > 0 else 0:upper_bound]:
    # Verify that number is in fuzzy range.
    # Currently fails (issue #202).
    assert abs(sp.number - amount.number) <= self.fuzzy_match_amount

Removing that fixes the problem:

for sp in cur_matches[lower_bound:upper_bound]:
    # Verify that number is in fuzzy range.
    # No longer fails.
    assert abs(sp.number - amount.number) <= self.fuzzy_match_amount
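
The effect of the two slices can be reproduced in isolation. The following is a minimal sketch, not the code in matching.py: it assumes a plain sorted list of Decimal amounts (the real cur_matches holds tuples), and the names candidates, target, and fuzz are hypothetical stand-ins.

```python
import bisect
from decimal import Decimal as D

# Hypothetical sorted candidate amounts (assumption: bare numbers, not the
# tuples stored in cur_matches).
candidates = [D('-100.00'), D('5.00'), D('19.99'), D('20.00'), D('20.01'), D('35.00')]
target = D('20.00')
fuzz = D('0.01')  # stands in for self.fuzzy_match_amount

lower_bound = bisect.bisect_left(candidates, target - fuzz)
upper_bound = bisect.bisect_right(candidates, target + fuzz, lo=lower_bound)

# Old slice: starts one element early whenever lower_bound > 0.
buggy = candidates[lower_bound - 1 if lower_bound > 0 else 0:upper_bound]
# Fixed slice: exactly the half-open range [lower_bound, upper_bound).
fixed = candidates[lower_bound:upper_bound]

# Elements the old slice visits even though they are outside the fuzzy range.
out_of_range = [c for c in buggy if abs(c - target) > fuzz]
```

With these values, the old slice picks up the 5.00 entry, which is far outside the fuzzy range, while the fixed slice contains only 19.99, 20.00, and 20.01.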

This approach is an alternative to #206.

In addition to the loop bounds fix, this PR adds the following:

The above loop assertion.
test_nonmatch_fuzzy_amount_with_dates: a repro for #202, in which transactions are matched that shouldn't be. It previously failed and now passes.
test_match_fuzzy_amount_upper_bound: exercises the loop's upper bound handling. Injecting an off-by-one error in the upper bound by changing upper_bound to upper_bound-1 causes this test to fail.
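
To see why an upper-bound test of this kind catches the injected error, here is a rough sketch under simplifying assumptions. It is not the test added in this PR: fuzzy_slice and its upper_adjust parameter are hypothetical, and the candidates are bare Decimal amounts rather than the tuples used in matching.py.

```python
import bisect
from decimal import Decimal as D

def fuzzy_slice(candidates, target, fuzz, upper_adjust=0):
    """Return the candidates within fuzz of target.

    upper_adjust is a hypothetical knob used only to simulate an
    off-by-one error in the upper bound (pass -1 to inject it).
    """
    lower_bound = bisect.bisect_left(candidates, target - fuzz)
    upper_bound = bisect.bisect_right(candidates, target + fuzz, lo=lower_bound)
    return candidates[lower_bound:upper_bound + upper_adjust]

# A candidate that sits exactly on the upper edge of the fuzzy range.
candidates = [D('10.00'), D('20.01'), D('30.00')]

correct = fuzzy_slice(candidates, D('20.00'), D('0.01'))
injected = fuzzy_slice(candidates, D('20.00'), D('0.01'), upper_adjust=-1)
```

In this sketch the correct slice returns the 20.01 entry, while the slice with the injected error returns nothing, so a test that expects a match on the edge value fails.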

Fixes #202.

ankurdave · 2023-11-02T00:23:18Z

It seems like the tests are failing due to unstable ordering of the posting meta keys. I saw the same issue when running locally on master with Python 3.11.

ankurdave · 2023-11-02T01:45:56Z

I think the recent release of Beancount 2.3.6 broke the tests. Here's the diff between releases and the relevant PR: beancount/beancount#726

I'll submit an empty PR to confirm, then a PR to unbreak.

Zburatorul · 2023-11-13T06:27:42Z

Looks like it's still failing after the metadata sorting PR got merged.

Zburatorul · 2023-11-18T04:31:04Z

beancount_import/matching.py

@@ -443,7 +443,8 @@ def _get_matches(
        if cur_matches is not None:
            lower_bound = bisect.bisect_left(cur_matches, (lower, tuple(), None, None))
            upper_bound = bisect.bisect_right(cur_matches, (upper, (sys.maxsize,), None, None), lo=lower_bound)
-            for sp in cur_matches[lower_bound-1 if lower_bound > 0 else 0:upper_bound]:
+            for sp in cur_matches[lower_bound:upper_bound]:
+                assert abs(sp.number - amount.number) <= self.fuzzy_match_amount


Could this assert have an easy informative error message?

Added the following message:

assert abs(sp.number - amount.number) <= self.fuzzy_match_amount, (
    f'Bug in matching algorithm: {sp} is not within '
    + f'{self.fuzzy_match_amount} of {amount}')

Zburatorul · 2023-11-18T21:51:23Z

I forgot to ask, but do you understand why this off-by-one error slipped through in previous PRs? Is it easy to construct a test that exercises just that logic?

ankurdave · 2023-11-19T00:33:02Z

@Zburatorul:

> I forgot to ask, but do you understand why this off-by-one error slipped through in previous PRs? Is it easy to construct a test that exercises just that logic?

Good question; I was unclear about that as well. It turns out that the existing nonmatch test only exercises the case where the target posting is at the beginning of the list of candidates (i.e. lower_bound == 0), in which case the existing code happened to be correct.

That, in turn, is because the Expenses:FIXME posting has a large positive amount (100 USD), which is negated during candidate search to become a large negative number (-100).

The new test instead has an Expenses:FIXME posting with a large negative amount (-20.00 USD), so it exercises the opposite case.
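
The difference between the two cases can be sketched as follows. This is an illustration only: the candidate amounts are invented, the list holds bare Decimal values rather than the tuples in cur_matches, and slices is a hypothetical helper. Only the 100 and -20.00 posting amounts come from the tests discussed above.

```python
import bisect
from decimal import Decimal as D

def slices(candidates, target, fuzz):
    """Return (lower_bound, old_slice, fixed_slice) for a search target."""
    lower_bound = bisect.bisect_left(candidates, target - fuzz)
    upper_bound = bisect.bisect_right(candidates, target + fuzz, lo=lower_bound)
    old = candidates[lower_bound - 1 if lower_bound > 0 else 0:upper_bound]
    fixed = candidates[lower_bound:upper_bound]
    return lower_bound, old, fixed

# Hypothetical sorted candidates; none is within the fuzzy range of either target.
candidates = [D('-3.50'), D('50.00'), D('75.00')]
fuzz = D('0.01')

# Existing test: a 100 USD posting is negated to -100, which sorts before
# every candidate, so lower_bound == 0 and both slices agree.
lb_old_test, old_a, fixed_a = slices(candidates, -D('100.00'), fuzz)

# New test: a -20.00 USD posting is negated to 20, which sorts after the
# first candidate, so lower_bound > 0 and the old slice reaches back one element.
lb_new_test, old_b, fixed_b = slices(candidates, -D('-20.00'), fuzz)
```

Here the first search gives lower_bound == 0 and two empty slices, so the bug is invisible. The second gives lower_bound == 1: the fixed slice is still empty, but the old slice contains the -3.50 entry, a spurious candidate of the kind reported in #202.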

You can see this by adding the following print statements:

modified   beancount_import/matching.py
@@ -444,7 +444,22 @@ class PostingDatabase(object):
         if cur_matches is not None:
             lower_bound = bisect.bisect_left(cur_matches, (lower, tuple(), None, None))
             upper_bound = bisect.bisect_right(cur_matches, (upper, (sys.maxsize,), None, None), lo=lower_bound)
-            for sp in cur_matches[lower_bound:upper_bound]:
+
+            print()
+            print(f'__get_matches({account}, {date}, {amount}): '
+                  + f'considering {len(cur_matches)} date/currency matches')
+            printer = beancount.parser.printer.EntryPrinter()
+            for i in range(0, lower_bound-1 if lower_bound > 0 else 0):
+                print(f'  Not considering match below lower bound, at {i}: '
+                      + f'{printer(cur_matches[i].mp.posting)}')
+            for i in range(lower_bound-1 if lower_bound > 0 else 0, upper_bound):
+                print(f'  Considering match within bounds, at {i}: '
+                      + f'{printer(cur_matches[i].mp.posting)}')
+            for i in range(upper_bound, len(cur_matches)):
+                print(f'  Not considering match above upper bound, at {i}: '
+                      + f'{printer(cur_matches[i].mp.posting)}')
+
+            for sp in cur_matches[lower_bound-1 if lower_bound > 0 else 0:upper_bound]:
                 assert abs(sp.number - amount.number) <= self.fuzzy_match_amount, (
                     f'Bug in matching algorithm: {sp} is not within '
                     + f'{self.fuzzy_match_amount} of {amount}')

I simplified the new test and documented this in a comment.

ankurdave mentioned this pull request Nov 1, 2023

fix: #202 - missing checks during matching for number #206

Closed

ankurdave force-pushed the fix-issue-202-loop-bounds branch from e422bda to 5bb05bb Compare November 18, 2023 01:32

Zburatorul reviewed Nov 18, 2023

matching.py: Fix incorrect fuzzy amount loop bounds

273b35f

ankurdave force-pushed the fix-issue-202-loop-bounds branch from 5bb05bb to 273b35f Compare November 18, 2023 17:36

ankurdave requested a review from Zburatorul November 18, 2023 17:37

Clarify the cause of jbms#202: position within cur_matches list

a964adb

Zburatorul merged commit c831323 into jbms:master Nov 19, 2023
