@tammymendt

Additions:

  • Optimised the runtime of the match function by no longer sorting on every iteration over test_scores. Instead, ctrl_scores is sorted once, which speeds up the search for values close to the test score of the current iteration.
  • Added an optional with_replacement argument to the match function, making it possible to match without replacement.
  • The threshold argument of the match function is now also used by the 'min' or "nearest_neighbours" method to discard matches that are not similar enough. It plays the role of a "caliper".
  • Added a find_nearest_n function that encapsulates the logic for searching for nearest neighbours, along with a test for this function.
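
To illustrate the idea behind these changes, here is a minimal sketch of a sort-once nearest-neighbour search with a caliper and optional matching without replacement. It is not the code from this PR: the names nearest_n_sorted and match_sorted, their signatures, and the choice to return matched scores rather than row indices are all assumptions made for illustration.

```python
import bisect


def nearest_n_sorted(sorted_scores, value, n=1, threshold=None):
    """Return indices into sorted_scores of up to n scores closest to value.

    Hypothetical helper, not pymatch's find_nearest_n. sorted_scores must be
    in ascending order. If threshold is not None, it acts as a caliper:
    candidates farther than threshold from value are discarded.
    """
    pos = bisect.bisect_left(sorted_scores, value)
    lo, hi = pos - 1, pos  # closest candidates on either side of value
    picked = []
    while len(picked) < n and (lo >= 0 or hi < len(sorted_scores)):
        left = abs(value - sorted_scores[lo]) if lo >= 0 else float("inf")
        right = abs(sorted_scores[hi] - value) if hi < len(sorted_scores) else float("inf")
        if left <= right:
            idx, dist = lo, left
            lo -= 1
        else:
            idx, dist = hi, right
            hi += 1
        if threshold is not None and dist > threshold:
            break  # all remaining candidates are at least this far away
        picked.append(idx)
    return picked


def match_sorted(test_scores, ctrl_scores, n=1, threshold=None, with_replacement=True):
    """Match each test score to its nearest control scores (illustrative only)."""
    pool = sorted(ctrl_scores)  # sorted once, not once per test score
    matches = {}
    for i, score in enumerate(test_scores):
        idxs = nearest_n_sorted(pool, score, n, threshold)
        matches[i] = [pool[j] for j in idxs]
        if not with_replacement:
            # Remove used controls; delete from the back so indices stay valid.
            for j in sorted(idxs, reverse=True):
                del pool[j]
    return matches
```

With a small caliper and no replacement, a test score whose only close control has already been used ends up with no match at all, which is the kind of behaviour discussed further down in this thread.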

@tammymendt
Author

Hey @benmiroglio, could you please have a look at this PR?

@skjerns

skjerns commented May 17, 2020

Somehow this broke the matching when there are no exact matches.

from pymatch.Matcher import Matcher
import pandas as pd

cases_ages = [23, 21, 26, 25, 23, 44, 24, 22, 46, 26]
controls_ages = [34, 30, 24, 25, 25, 27, 30, 33, 53, 27, 26, 28, 23, 23, 28, 23, 24, 22, 23, 25]
# group label: 1 for cases, 0 for controls
cases_group = [1] * len(cases_ages)
# use len(controls_ages) here; len(cases_ages) would make zip() drop half of the controls
controls_group = [0] * len(controls_ages)

df_cases = pd.DataFrame(list(zip(cases_ages, cases_group)), columns=['age', 'group'])
df_controls = pd.DataFrame(list(zip(controls_ages, controls_group)), columns=['age', 'group'])

m = Matcher(df_cases , df_controls , yvar='group')
m.fit_scores(balance=True, nmodels=100)
m.match(method='min', nmatches=1, with_replacement=False)
print(m.matched_data)

Only exact matches are printed now, which should not be the case.

@tammymendt
Author

tammymendt commented May 18, 2020

@skjerns thanks for the feedback. This is caused by the threshold argument, which defaults to 0.001. If you set it to a larger value, you will also get matches that are not exact. For example, try
m.match(method='min', nmatches=1, with_replacement=False, threshold=0.1) and you will get non-exact matches.

However, you are right that the existing behavior should not break (that is, when using the 'min' matching method, the threshold should default to None). The only way I can think of to do this is something like https://stackoverflow.com/questions/14749328/how-to-check-whether-optional-function-parameter-is-set/58166804#58166804: check whether the threshold was set explicitly and, if it was not, pass None as the threshold to the find_nearest_n function. It's not very elegant, though. I would rather break the old behavior and bump the version, so that it is known that the old behavior is broken.
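
For reference, one common way to check whether an optional argument was passed explicitly is a sentinel default. The sketch below is only an illustration of that option: resolve_threshold, _UNSET and DEFAULT_THRESHOLD are hypothetical names, not part of pymatch, and it is not necessarily the exact approach from the linked answer.

```python
_UNSET = object()  # sentinel: distinguishes "not passed" from an explicit None

DEFAULT_THRESHOLD = 0.001  # the default value mentioned above


def resolve_threshold(method, threshold=_UNSET):
    """Return the threshold to use for the given matching method.

    Hypothetical helper: an explicitly passed threshold always wins
    (including an explicit None); otherwise 'min' falls back to None
    (no caliper) and any other method falls back to DEFAULT_THRESHOLD.
    """
    if threshold is not _UNSET:
        return threshold
    if method == "min":
        return None
    return DEFAULT_THRESHOLD
```

With this, a call that does not pass threshold keeps the old 'min' behavior, while threshold=0.1 still enables the caliper. The cost is the extra indirection I mentioned, which is why I would still lean towards the version bump.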
