Implementation of Kwon and Zou Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value ICML 2023 using pyDVL #426
Salut Bastien and thanks a lot for your first contribution! Data-OOB is an interesting approach and it's great that you chose to help out implementing it. 🙏🏽
I have left a few comments below about the main module. We can do the notebook in a second step if you want.
One general comment is that you will want to run the linter and type checker. Basically run tox -e linting and tox -e type-checking to look for errors. The first will rearrange module imports, run black and so on. The second will, well... run the type checker 😄. You will also want to run mkdocs to see how your documentation renders. See the file CONTRIBUTING.md for more info.
Finally, if you want to run the removal experiment faster in your notebook, you can use this code. There's a bit of boilerplate, but you can use as many cores as you have by setting max_workers.
from concurrent.futures import FIRST_COMPLETED, wait

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

from pydvl.utils.parallel import init_executor

# Assumes the notebook already defines `utility` and `config`, and has imported
# ValuationResult, compute_data_oob and compute_removal_score from pyDVL.


def removal_job(method: str, n_est=300, max_samples=0.95, progress=False):
    # Compute values with the chosen method, then score removal of best and worst points.
    if method == "random":
        values = ValuationResult.from_random(size=len(utility.data))
    else:
        values = compute_data_oob(
            u=utility, n_est=n_est, max_samples=max_samples, progress=progress
        )
    best_scores = compute_removal_score(
        u=utility,
        values=values,
        percentages=removal_percentages,
        remove_best=True,
    )
    best_scores["method_name"] = values.algorithm
    worst_scores = compute_removal_score(
        u=utility,
        values=values,
        percentages=removal_percentages,
        remove_best=False,
    )
    worst_scores["method_name"] = values.algorithm
    return best_scores, worst_scores


all_best_scores = []
all_worst_scores = []
removal_percentages = np.arange(0, 0.99, 0.01)
n_runs = 20
pending = set()

with init_executor(max_workers=24) as executor:
    # Submit one random baseline and one Data-OOB job per run.
    for i in range(n_runs):
        pending.add(executor.submit(removal_job, method="random"))
        pending.add(executor.submit(removal_job, method="data_oob"))

    pbar = tqdm(total=2 * n_runs, unit="%")
    while len(pending) > 0:
        pbar.n = 2 * n_runs - len(pending) + 1  # HACK
        pbar.refresh()
        # Collect whichever jobs have finished; the rest stay pending.
        completed, pending = wait(
            pending, timeout=config.wait_timeout, return_when=FIRST_COMPLETED
        )
        for future in completed:
            best_scores, worst_scores = future.result()
            all_best_scores.append(best_scores)
            all_worst_scores.append(worst_scores)

best_scores_df = pd.DataFrame(all_best_scores)
worst_scores_df = pd.DataFrame(all_worst_scores)
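The submit-then-wait loop in this snippet can be tried in isolation with only the standard library. The following is a minimal sketch, assuming `ThreadPoolExecutor` as a stand-in for pyDVL's `init_executor`; `fake_job` and the one-second timeout are hypothetical placeholders, not part of the PR.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def fake_job(run_id):
    # Hypothetical stand-in for removal_job: returns the run id and a dummy score.
    return run_id, run_id * run_id


n_jobs = 8
results = []
with ThreadPoolExecutor(max_workers=4) as executor:
    # Submit everything up front and keep the futures in a set.
    pending = {executor.submit(fake_job, i) for i in range(n_jobs)}
    while pending:
        # Block until at least one job finishes (or the timeout expires),
        # then split the futures into finished and still-pending sets.
        completed, pending = wait(
            pending, timeout=1.0, return_when=FIRST_COMPLETED
        )
        for future in completed:
            results.append(future.result())
        print(f"{len(results)}/{n_jobs} jobs done")
```

Counting collected results, rather than deriving progress from the size of the pending set, is one way to avoid the off-by-one adjustment marked as a hack in the snippet.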
Co-authored-by: Miguel de Benito Delgado <m.debenito.d@gmail.com>
…gression models; added more parameters for the bagging function; added a generic loss function parameter
There is still one problem raised by the type checker that I did not manage to resolve. I would be interested in knowing the solution.
I added some superficial comments. Looking good!
Hi @BastienZim, it looks good. It's almost ready; there are just a few things missing:
Co-authored-by: Anes Benmerzoug <Anes.Benmerzoug@gmail.com>
@BastienZim Thanks for your work! It looks good to me now.
@mdbenito I think it's ready to be merged. All that's left to do is to add a changelog entry and maybe a description for the oob module. What do you think?
Agreed that this is ready. We can polish any cosmetic details post-merge. Thanks, Bastien, for the contribution, looking forward to more! :)
Description
This PR adds the implementation of a data valuation method described in Kwon and Zou "Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value" published at ICML 2023.
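For readers unfamiliar with the idea, here is a minimal sketch of an out-of-bag data value, assuming the usual reading of the method: train weak learners on bootstrap samples and value each point by the average score it receives from the learners that did not see it. This is not the code added in this PR; the 1-nearest-neighbour weak learner, the toy 1-D dataset and the names `nearest_neighbor_predict` and `data_oob_values` are all illustrative assumptions.

```python
import random


def nearest_neighbor_predict(train, x):
    # 1-nearest-neighbour rule on (feature, label) pairs; an assumed weak learner.
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def data_oob_values(data, n_estimators=200, seed=0):
    # For each point, average the correctness of predictions made by the
    # bootstrap models for which that point was out-of-bag.
    rng = random.Random(seed)
    n = len(data)
    hits = [0.0] * n
    counts = [0] * n
    for _ in range(n_estimators):
        in_bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag_set = set(in_bag)
        train = [data[i] for i in in_bag]
        for i in range(n):
            if i in in_bag_set:
                continue  # only out-of-bag points are scored by this model
            x, y = data[i]
            hits[i] += 1.0 if nearest_neighbor_predict(train, x) == y else 0.0
            counts[i] += 1
    # Points that were never out-of-bag get NaN instead of a value.
    return [h / c if c else float("nan") for h, c in zip(hits, counts)]


# Toy data: two clusters, with the point at index 4 deliberately mislabeled.
data = [
    (0.0, 0), (0.1, 0), (0.2, 0), (0.3, 0), (0.25, 1),
    (1.0, 1), (1.1, 1), (1.2, 1), (1.3, 1),
]
values = data_oob_values(data)
for (x, y), v in zip(data, values):
    print(f"x={x:.2f} label={y} value={v:.2f}")
```

In this toy setup the mislabeled point should receive the lowest value, since models that did not train on it almost always predict the other label at its location.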
The notebook provided gives a comprehensive overview of the method through examples, visualizations and a point-removal evaluation.
No unit tests were added, as the notebook is testing the method. If that would be useful, I could write some.
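As a rough illustration of what a removal evaluation measures, the sketch below removes a fraction of the highest- or lowest-valued training points and re-scores a model on held-out data. It is an assumption-laden toy, not the notebook's code or pyDVL's `compute_removal_score`: the 1-nearest-neighbour model, the hand-picked `values` and the helper names `accuracy_1nn` and `removal_scores` are all hypothetical.

```python
def accuracy_1nn(train, test):
    # Accuracy of a 1-nearest-neighbour rule fitted on `train`, evaluated on `test`.
    correct = sum(
        1 for x, y in test
        if min(train, key=lambda p: abs(p[0] - x))[1] == y
    )
    return correct / len(test)


def removal_scores(train, test, values, percentages, remove_best):
    # Rank points by value, drop the top (or bottom) fraction, and re-score.
    order = sorted(range(len(train)), key=lambda i: values[i], reverse=remove_best)
    scores = {}
    for pct in percentages:
        n_removed = int(pct * len(train))
        kept = [train[i] for i in order[n_removed:]]
        scores[pct] = accuracy_1nn(kept, test)
    return scores


# Toy training set with two mislabeled points (indices 2 and 6).
train = [(0.0, 0), (0.1, 0), (0.2, 1), (0.3, 0),
         (1.0, 1), (1.1, 1), (1.2, 0), (1.3, 1)]
test = [(0.05, 0), (0.18, 0), (0.22, 0), (1.05, 1), (1.18, 1), (1.22, 1)]
# Hand-picked values for illustration: the mislabeled points get low values.
values = [0.9, 0.9, 0.1, 0.9, 0.9, 0.9, 0.1, 0.9]

worst_removed = removal_scores(train, test, values, [0.0, 0.25], remove_best=False)
best_removed = removal_scores(train, test, values, [0.0, 0.25], remove_best=True)
print("remove worst:", worst_removed)
print("remove best: ", best_removed)
```

If the values are informative, removing the lowest-valued points should raise the test score, while removing the highest-valued ones should lower it.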
Changes
Checklist
"tags": ["hide"] or "tags": ["hide-input"]