Implementation of Kwon and Zou Data-OOB: Out-of-bag Estimate as a Sim… #426

Merged
merged 23 commits into from
Sep 12, 2023

Conversation

BastienZim
Contributor

@BastienZim BastienZim commented Sep 5, 2023

…ple and Efficient Data Value ICML 2023 using pyDVL

Description

This PR adds the implementation of a data valuation method described in Kwon and Zou "Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value" published at ICML 2023.

The accompanying notebook gives a comprehensive overview of the method through examples, visualizations and a point-removal evaluation.

No unit tests were added, since the notebook already exercises the method. If they are considered useful, I can write some.

Changes

  • Created compute_data_oob, which computes the Data-OOB data values. The function is compatible with all sklearn estimators as weak learners; the estimator is provided through the utility's model parameter.
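
For readers unfamiliar with the method, here is a minimal sketch of the out-of-bag idea behind Data-OOB. It is not the code in this PR: the function names (data_oob_sketch, fit_nearest_centroid), the one-dimensional features and the nearest-centroid weak learner are all assumptions chosen to keep the example dependency-free. Each point's value is the fraction of bootstrap models, among those that did not see the point during training, that predict its label correctly.

```python
import random
from typing import Callable, List, Sequence


def fit_nearest_centroid(xs: Sequence[float], ys: Sequence[int]) -> Callable[[float], int]:
    """Hypothetical stand-in weak learner: predict the class with the closest mean."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    centroids = {label: sums[label] / counts[label] for label in sums}
    return lambda x: min(centroids, key=lambda label: abs(x - centroids[label]))


def data_oob_sketch(
    xs: Sequence[float], ys: Sequence[int], n_est: int = 200, seed: int = 0
) -> List[float]:
    """Average out-of-bag correctness of each point over n_est bootstrap models."""
    rng = random.Random(seed)
    n = len(xs)
    hits = [0] * n        # correct out-of-bag predictions per point
    oob_counts = [0] * n  # number of models for which the point was out of bag
    for _ in range(n_est):
        # Draw a bootstrap sample (with replacement) and fit one weak learner on it.
        bag = [rng.randrange(n) for _ in range(n)]
        in_bag = set(bag)
        predict = fit_nearest_centroid([xs[i] for i in bag], [ys[i] for i in bag])
        # Score the learner only on the points it has not seen.
        for i in range(n):
            if i not in in_bag:
                oob_counts[i] += 1
                hits[i] += int(predict(xs[i]) == ys[i])
    # Points that were never out of bag get NaN instead of a value.
    return [h / c if c else float("nan") for h, c in zip(hits, oob_counts)]
```

On a toy dataset with two well-separated clusters and a single mislabeled point, the mislabeled point should receive a much lower value than the clean ones, which is the behaviour the removal experiment in the notebook relies on.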

Checklist

  • Wrote Unit tests (if necessary)
  • Updated Documentation (if necessary)
  • Updated Changelog
  • If notebooks were added/changed, added boilerplate cells are tagged with "tags": ["hide"] or "tags": ["hide-input"]

@AnesBenmerzoug AnesBenmerzoug added this to the v0.8.0 milestone Sep 6, 2023
@mdbenito mdbenito assigned mdbenito and unassigned BastienZim Sep 6, 2023
@mdbenito mdbenito added the new-method Implementation of new algorithms for valuation or influence functions label Sep 6, 2023
Collaborator

@mdbenito mdbenito left a comment

Salut Bastien and thanks a lot for your first contribution! Data-OOB is an interesting approach and it's great that you chose to help out implementing it. 🙏🏽

I have left a few comments below about the main module. We can do the notebook in a second step if you want.

One general comment is that you will want to run the linter and the type checker: run tox -e linting and tox -e type-checking to look for errors. The first will rearrange module imports, run black and so on. The second will, well... run the type checker 😄. You will also want to run mkdocs to see how your documentation renders. See the file CONTRIBUTING.md for more info.

Finally, if you want to run the removal experiment faster in your notebook, you can use this code. There's a bit of boilerplate, but you can use as many cores as you have by setting max_workers.

from concurrent.futures import FIRST_COMPLETED, wait

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

from pydvl.utils.parallel import init_executor

# Assumes the notebook already defines `utility` and `config` (the parallel
# configuration object providing `wait_timeout`) and has imported
# ValuationResult, compute_data_oob and compute_removal_score.

removal_percentages = np.arange(0, 0.99, 0.01)
n_runs = 20


def removal_job(method: str, n_est=300, max_samples=0.95, progress=False):
    # Value the points either at random (baseline) or with Data-OOB.
    if method == "random":
        values = ValuationResult.from_random(size=len(utility.data))
    else:
        values = compute_data_oob(
            u=utility, n_est=n_est, max_samples=max_samples, progress=progress
        )

    # Scores obtained when removing the highest-valued points first...
    best_scores = compute_removal_score(
        u=utility,
        values=values,
        percentages=removal_percentages,
        remove_best=True,
    )
    best_scores["method_name"] = values.algorithm

    # ...and when removing the lowest-valued points first.
    worst_scores = compute_removal_score(
        u=utility,
        values=values,
        percentages=removal_percentages,
        remove_best=False,
    )
    worst_scores["method_name"] = values.algorithm

    return best_scores, worst_scores


all_best_scores = []
all_worst_scores = []
pending = set()

with init_executor(max_workers=24) as executor:
    # Queue one random-baseline job and one Data-OOB job per run.
    for _ in range(n_runs):
        pending.add(executor.submit(removal_job, method="random"))
        pending.add(executor.submit(removal_job, method="data_oob"))

    # Collect results as jobs finish; the bar advances by the number of
    # completed jobs (this replaces the manual `pbar.n` hack).
    with tqdm(total=2 * n_runs, unit="job") as pbar:
        while pending:
            completed, pending = wait(
                pending, timeout=config.wait_timeout, return_when=FIRST_COMPLETED
            )
            for future in completed:
                best_scores, worst_scores = future.result()
                all_best_scores.append(best_scores)
                all_worst_scores.append(worst_scores)
            pbar.update(len(completed))

best_scores_df = pd.DataFrame(all_best_scores)
worst_scores_df = pd.DataFrame(all_worst_scores)
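
The submit-then-wait pattern in the snippet above does not depend on pyDVL. As a rough illustration, the sketch below reproduces it with only the standard library. ThreadPoolExecutor stands in for init_executor, and job and run_all are hypothetical names with a trivial placeholder workload, so treat this as a sketch of the control flow rather than of the removal experiment itself.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import List


def job(run_id: int) -> int:
    """Placeholder workload; the real job would compute removal scores."""
    return run_id * run_id


def run_all(n_runs: int, max_workers: int = 4) -> List[int]:
    """Submit all jobs up front, then harvest results as they complete."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        pending = {executor.submit(job, i) for i in range(n_runs)}
        while pending:
            # wait() returns as soon as at least one future is done (or the
            # timeout expires) and hands back the still-pending set.
            completed, pending = wait(
                pending, timeout=1.0, return_when=FIRST_COMPLETED
            )
            for future in completed:
                results.append(future.result())
    return results
```

Results arrive in completion order, not submission order, which is why the notebook snippet stores the method name inside each result instead of relying on position.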

BastienZim and others added 3 commits September 7, 2023 11:22
Co-authored-by: Miguel de Benito Delgado <m.debenito.d@gmail.com>
Co-authored-by: Miguel de Benito Delgado <m.debenito.d@gmail.com>
Co-authored-by: Miguel de Benito Delgado <m.debenito.d@gmail.com>
BastienZim added 2 commits September 7, 2023 16:38
…gression models ; added more parameters for the bagging function ; added a generic loss function parameter
@BastienZim
Contributor Author

The type checker still reports one problem that I could not resolve. It concerns the return type of compute_data_oob:
Returning Any from function declared to return "ValuationResult[Any, Any]"

I would be interested in knowing the solution.
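
For context, mypy emits this message when its warn_return_any option is enabled and the returned expression has type Any, for example because it comes from an unannotated function. The thread does not say what caused it in this PR, so the sketch below only illustrates the general pattern and one common remedy: binding the value to an annotated local before returning it. Result, untyped_factory, compute_flagged and compute_clean are hypothetical names.

```python
class Result:
    """Hypothetical stand-in for the declared return type."""


def untyped_factory():
    # No annotations: mypy treats the value returned by this call as Any.
    return Result()


def compute_flagged() -> Result:
    # With warn_return_any enabled, mypy reports here:
    #   Returning Any from function declared to return "Result"
    return untyped_factory()


def compute_clean() -> Result:
    # Binding the value to an annotated local gives it a concrete static
    # type, so the return statement no longer returns Any.
    result: Result = untyped_factory()
    return result
```

Both functions behave identically at runtime; the difference is visible only to the type checker. typing.cast is an alternative when an extra local is undesirable.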

Collaborator

@mdbenito mdbenito left a comment

I added some superficial comments. Looking good!

BastienZim and others added 5 commits September 8, 2023 10:07
Co-authored-by: Miguel de Benito Delgado <m.debenito.d@gmail.com>
Co-authored-by: Miguel de Benito Delgado <m.debenito.d@gmail.com>
Co-authored-by: Miguel de Benito Delgado <m.debenito.d@gmail.com>
Co-authored-by: Miguel de Benito Delgado <m.debenito.d@gmail.com>
Co-authored-by: Miguel de Benito Delgado <m.debenito.d@gmail.com>
@AnesBenmerzoug
Collaborator

Hi @BastienZim, it looks good. It's almost ready; there are just a few things missing:

  • In order to render the notebook in the docs, you have to manually add a nav entry to the mkdocs.yml file under nav -> Data Valuation -> Examples.
  • In the notebook's introduction, you pasted the raw link to the publication. I think you should put that on the name of the paper just like you did in the docstring.
  • Most matplotlib plotting functions return objects and they are printed out in jupyter notebooks. You should add a semi-colon ; at the end of the last line of a plotting call e.g. ax.set_xlabel("Point rank"); or assign the result to _ e.g. _ = plt.plot(np.arange(len(oob_values.values)), oob_values.values)
  • We use tags on jupyter notebook cells to customize documentation rendering. We either hide an entire cell with the tag "hide" (e.g. import cells) or hide only the code with the tag "hide-input" (e.g. plotting cells, so that just the plot is shown). Could you please add that? Refer to the Data Utility Learning notebook for an example.
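
As a rough illustration of the tagging convention, the sketch below adds "hide" or "hide-input" to the metadata of code cells in a notebook loaded as a plain dictionary (for instance with json.load on the .ipynb file). The helper names pick_tag and tag_cells and the heuristic for spotting import and plotting cells are assumptions for this example, not part of pyDVL's tooling; tagging cells by hand in the Jupyter UI achieves the same result.

```python
from typing import Any, Dict, Optional


def pick_tag(source: str) -> Optional[str]:
    """Hypothetical heuristic: choose a tag from the cell's source code."""
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    if lines and all(line.startswith(("import ", "from ")) for line in lines):
        return "hide"        # pure import cell: hide it entirely
    if "plt." in source or ".plot(" in source:
        return "hide-input"  # plotting cell: keep the figure, hide the code
    return None


def tag_cells(notebook: Dict[str, Any]) -> Dict[str, Any]:
    """Add tags in place to the code cells of an nbformat-4 style dict."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be stored as a list of lines
            source = "".join(source)
        tag = pick_tag(source)
        if tag is None:
            continue
        tags = cell.setdefault("metadata", {}).setdefault("tags", [])
        if tag not in tags:
            tags.append(tag)
    return notebook
```

The tags end up under each cell's metadata as "tags": ["hide"] or "tags": ["hide-input"], matching the form quoted in the PR checklist.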

Co-authored-by: Anes Benmerzoug <Anes.Benmerzoug@gmail.com>
AnesBenmerzoug previously approved these changes Sep 11, 2023
Collaborator

@AnesBenmerzoug AnesBenmerzoug left a comment

@BastienZim Thanks for your work! It looks good to me now.
@mdbenito I think it's ready to be merged. All that's left to do is to add a changelog entry and maybe a description for the oob module. What do you think?

Co-authored-by: Anes Benmerzoug <Anes.Benmerzoug@gmail.com>
Collaborator

@mdbenito mdbenito left a comment

Agreed that this is ready. We can polish any cosmetic details later post-merge. Thanks Bastien for the contribution, looking forward to more! :)

@mdbenito mdbenito merged commit ca9591e into aai-institute:develop Sep 12, 2023
0 of 6 checks passed