Implementation of Kwon and Zou Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value ICML 2023 using pyDVL #426

BastienZim · 2023-09-05T15:42:41Z

Description

This PR adds the implementation of a data valuation method described in Kwon and Zou "Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value" published at ICML 2023.

The accompanying notebook gives an overview of the method through examples, visualizations and a point-removal evaluation.

No unit tests were added, since the notebook already exercises the method. If tests are considered useful, I can write some.

Changes

Added compute_data_oob, which computes the Data-OOB data values. The function is compatible with all sklearn estimators as weak learners, provided through the model parameter of the utility.
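For readers unfamiliar with the method, here is a minimal sketch of the out-of-bag idea as I understand it from the paper: train many weak learners on bootstrap samples and value each point by the average score it receives from the learners that did not see it. This is not the code in this PR. The 1-nearest-neighbour weak learner, the names fit_1nn and data_oob_values, and the toy data are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence


def fit_1nn(xs: Sequence[float], ys: Sequence[int]) -> Callable[[float], int]:
    """Hypothetical weak learner: 1-nearest-neighbour on 1-D features."""
    pairs = list(zip(xs, ys))

    def predict(x: float) -> int:
        # Label of the closest training point (first one wins on ties).
        return min(pairs, key=lambda p: abs(p[0] - x))[1]

    return predict


def data_oob_values(
    xs: Sequence[float], ys: Sequence[int], n_est: int = 200, seed: int = 0
) -> List[float]:
    """Average out-of-bag correctness of each point over n_est bootstraps."""
    rng = random.Random(seed)
    n = len(xs)
    score_sum = [0.0] * n
    oob_count = [0] * n
    for _ in range(n_est):
        # Bootstrap sample: n indices drawn with replacement.
        bag = [rng.randrange(n) for _ in range(n)]
        predict = fit_1nn([xs[i] for i in bag], [ys[i] for i in bag])
        in_bag = set(bag)
        for i in range(n):
            if i not in in_bag:  # point i is out-of-bag for this learner
                oob_count[i] += 1
                score_sum[i] += float(predict(xs[i]) == ys[i])
    # Points that were never out-of-bag get a value of 0.0 in this sketch.
    return [s / c if c else 0.0 for s, c in zip(score_sum, oob_count)]


# Toy data: two clusters, with the point at index 2 deliberately mislabeled.
xs = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 5.1, 5.2, 5.3, 5.4]
ys = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]
values = data_oob_values(xs, ys)
```

In this toy setting the mislabeled point should receive a much lower value than a cleanly labeled point inside the second cluster, which is the behaviour the removal experiments in the notebook rely on.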

Checklist

Wrote Unit tests (if necessary)
Updated Documentation (if necessary)
Updated Changelog
If notebooks were added/changed, added boilerplate cells are tagged with "tags": ["hide"] or "tags": ["hide-input"]

mdbenito

Salut Bastien and thanks a lot for your first contribution! Data-OOB is an interesting approach and it's great that you chose to help out implementing it. 🙏🏽

I have left a few comments below about the main module. We can do the notebook in a second step if you want.

One general comment is that you will want to run the linter and type checker. Basically run tox -e linting and tox -e type-checking to look for errors. The first will rearrange module imports, run black and so on. The second will, well... run the type checker 😄 . You will also want to run mkdocs to see how your documentation renders. See the file CONTRIBUTING.md for more info.

Finally, if you want to run the removal experiment faster in your notebook, you can use this code. There's a bit of boilerplate, but you can use as many cores as you have by setting max_workers.

from concurrent.futures import FIRST_COMPLETED, wait

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

from pydvl.utils.parallel import init_executor

# Assumes the notebook already defines `utility` and `config` (the parallel
# configuration providing `wait_timeout`) and has imported ValuationResult,
# compute_data_oob and compute_removal_score.

removal_percentages = np.arange(0, 0.99, 0.01)


def removal_job(method: str, n_est=300, max_samples=0.95, progress=False):
    """Compute values with `method`, then score removal of best and worst points."""
    if method == "random":
        values = ValuationResult.from_random(size=len(utility.data))
    else:
        values = compute_data_oob(
            u=utility, n_est=n_est, max_samples=max_samples, progress=progress
        )

    best_scores = compute_removal_score(
        u=utility,
        values=values,
        percentages=removal_percentages,
        remove_best=True,
    )
    best_scores["method_name"] = values.algorithm

    worst_scores = compute_removal_score(
        u=utility,
        values=values,
        percentages=removal_percentages,
        remove_best=False,
    )
    worst_scores["method_name"] = values.algorithm

    return best_scores, worst_scores


all_best_scores = []
all_worst_scores = []

n_runs = 20
pending = set()

with init_executor(max_workers=24) as executor:
    # Submit one random baseline job and one Data-OOB job per run.
    for _ in range(n_runs):
        pending.add(executor.submit(removal_job, method="random"))
        pending.add(executor.submit(removal_job, method="data_oob"))

    pbar = tqdm(total=2 * n_runs, unit="job")
    while pending:
        # Block until at least one job finishes or the timeout expires.
        completed, pending = wait(
            pending, timeout=config.wait_timeout, return_when=FIRST_COMPLETED
        )
        for future in completed:
            best_scores, worst_scores = future.result()
            all_best_scores.append(best_scores)
            all_worst_scores.append(worst_scores)
        pbar.update(len(completed))
    pbar.close()

best_scores_df = pd.DataFrame(all_best_scores)
worst_scores_df = pd.DataFrame(all_worst_scores)
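The submit-then-wait loop above can be tried in isolation with only the standard library. The sketch below swaps in ThreadPoolExecutor for pyDVL's init_executor and a trivial square function for the removal job; both substitutions, and the names run_all and square, are assumptions made so the example runs anywhere.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Iterable, List


def square(x: int) -> int:
    """Stand-in for an expensive job such as one removal experiment."""
    return x * x


def run_all(inputs: Iterable[int], max_workers: int = 4, timeout: float = 0.1) -> List[int]:
    """Submit every job up front, then collect results as they complete."""
    results: List[int] = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        pending = {executor.submit(square, x) for x in inputs}
        while pending:
            # `wait` returns (done, not_done); rebinding `pending` shrinks
            # the set until every future has been collected.
            done, pending = wait(pending, timeout=timeout, return_when=FIRST_COMPLETED)
            results.extend(future.result() for future in done)
    return results


results = run_all(range(10))
```

Results arrive in completion order rather than submission order, which is why the removal job above returns the method name alongside its scores.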

BastienZim · 2023-09-07T16:05:02Z

One problem raised by the type checker remains that I could not resolve.
It concerns the return type of compute_data_oob:
"Returning Any from function declared to return "ValuationResult[Any, Any]""

I would be interested in knowing the solution.
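For context, mypy reports an error of this form when its warn_return_any option is enabled and a function with a concrete declared return type returns an expression that mypy can only see as Any, for instance the result of an untyped call. The sketch below reproduces the situation with a hypothetical load_raw helper and shows two common remedies. Whether either one matches the actual cause in compute_data_oob is an assumption; the thread does not say.

```python
from typing import Any, Dict, cast


def load_raw() -> Any:
    """Hypothetical stand-in for a call whose result mypy sees as Any."""
    return {"a": 1.0, "b": 2.0}


def values_flagged() -> Dict[str, float]:
    # With warn_return_any enabled, mypy flags this return statement,
    # because the returned expression has type Any.
    return load_raw()


def values_annotated() -> Dict[str, float]:
    # Remedy 1: bind the result to an annotated variable first.
    result: Dict[str, float] = load_raw()
    return result


def values_cast() -> Dict[str, float]:
    # Remedy 2: assert the type explicitly with typing.cast.
    return cast(Dict[str, float], load_raw())
```

Both remedies only change what the type checker sees; neither performs any conversion or validation at runtime.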

mdbenito

I added some superficial comments. Looking good!

AnesBenmerzoug · 2023-09-08T09:15:00Z

Hi @BastienZim, it looks good. It's almost ready; just a few things are missing:

To render the notebook in the docs, you have to manually add a nav entry to the mkdocs.yml file under nav -> Data Valuation -> Examples.
In the notebook's introduction, you pasted the raw link to the publication. I think you should attach the link to the name of the paper, just like you did in the docstring.
Most matplotlib plotting functions return objects, and Jupyter notebooks print them out. You should either add a semicolon ; at the end of the last line of a plotting call, e.g. ax.set_xlabel("Point rank"); or assign the result to _, e.g. _ = plt.plot(np.arange(len(oob_values.values)), oob_values.values)
We use tags on Jupyter notebook cells to customize documentation rendering. The tag "hide" hides an entire cell (e.g. import cells), and the tag "hide-input" hides only the code of a cell (e.g. plotting cells). Could you please add those? Refer to the Data Utility Learning notebook for an example.
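Cell tags live in each cell's metadata inside the notebook's JSON, under the "tags" key, as the checklist in the PR description also shows. As a rough sketch of what adding them amounts to, the snippet below tags cells in a tiny in-memory notebook. The tag_cell helper and the notebook contents are made up for illustration; in practice the tags are usually set from the Jupyter interface.

```python
import json


def tag_cell(cell: dict, tag: str) -> None:
    """Add `tag` to a notebook cell's metadata without duplicating it."""
    tags = cell.setdefault("metadata", {}).setdefault("tags", [])
    if tag not in tags:
        tags.append(tag)


# A minimal notebook structure with one import cell and one plotting cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["import numpy as np"],
            "outputs": [],
            "execution_count": None,
        },
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["_ = plt.plot(values)"],
            "outputs": [],
            "execution_count": None,
        },
    ],
}

tag_cell(notebook["cells"][0], "hide")        # hide the whole import cell
tag_cell(notebook["cells"][1], "hide-input")  # hide only the plotting code
serialized = json.dumps(notebook, indent=1)
```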

AnesBenmerzoug

@BastienZim Thanks for your work! It looks good to me now.
@mdbenito I think it's ready to be merged. All that's left to do is to add a changelog entry and maybe a description for the oob module. What do you think?

mdbenito

Agreed that this is ready. We can polish any cosmetic details later post-merge. Thanks Bastien for the contribution, looking forward to more! :)

Implementation of Kwon and Zou Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value ICML 2023 using pyDVL

c6d71af

AnesBenmerzoug assigned BastienZim Sep 6, 2023

AnesBenmerzoug requested review from mdbenito and AnesBenmerzoug September 6, 2023 08:06

AnesBenmerzoug added this to the v0.8.0 milestone Sep 6, 2023

mdbenito assigned mdbenito and unassigned BastienZim Sep 6, 2023

mdbenito added the new-method label (Implementation of new algorithms for valuation or influence functions) Sep 6, 2023

mdbenito reviewed Sep 6, 2023

BastienZim and others added 3 commits September 7, 2023 11:22

Update src/pydvl/value/oob/oob.py

0d451e7

Co-authored-by: Miguel de Benito Delgado <m.debenito.d@gmail.com>

Update src/pydvl/value/oob/oob.py

1ab7867

Co-authored-by: Miguel de Benito Delgado <m.debenito.d@gmail.com>

Update src/pydvl/value/oob/oob.py

c6eb39c

Co-authored-by: Miguel de Benito Delgado <m.debenito.d@gmail.com>

AnesBenmerzoug reviewed Sep 7, 2023

BastienZim added 2 commits September 7, 2023 16:38

Converted doctrings to mkdocs and google style ; added support for regression models ; added more parameters for the bagging function ; added a generic loss function parameter

ed2db41

Linter + type check pass

e9dd612

mdbenito reviewed Sep 7, 2023

BastienZim and others added 5 commits September 8, 2023 10:07

Update src/pydvl/value/oob/oob.py

79ac3c6

Co-authored-by: Miguel de Benito Delgado <m.debenito.d@gmail.com>

Update src/pydvl/value/oob/oob.py

b99f56b

Co-authored-by: Miguel de Benito Delgado <m.debenito.d@gmail.com>

Update src/pydvl/value/oob/oob.py

99adcf1

Co-authored-by: Miguel de Benito Delgado <m.debenito.d@gmail.com>

Update src/pydvl/value/oob/oob.py

0a948ad

Co-authored-by: Miguel de Benito Delgado <m.debenito.d@gmail.com>

Update src/pydvl/value/oob/oob.py

6b8f113

Co-authored-by: Miguel de Benito Delgado <m.debenito.d@gmail.com>

AnesBenmerzoug reviewed Sep 8, 2023

BastienZim added 3 commits September 8, 2023 10:44

small type related changes

fce2661

following past commit

0a59f1d

Added - in neg_l2_distance to actually make it negative

568f8a6

AnesBenmerzoug reviewed Sep 8, 2023

AnesBenmerzoug reviewed Sep 8, 2023

Update src/pydvl/value/oob/oob.py

76d37b4

Co-authored-by: Anes Benmerzoug <Anes.Benmerzoug@gmail.com>

BastienZim and others added 4 commits September 8, 2023 14:20

Update src/pydvl/value/oob/oob.py

d8122ed

Co-authored-by: Anes Benmerzoug <Anes.Benmerzoug@gmail.com>

updated mkdocs file and added hide and hide-input to appropriate oob notebook cells

d0612dc

linter

ca56f03

remove useless codelines

e138d11

AnesBenmerzoug previously approved these changes Sep 11, 2023

lone parenthesis forgoten

364e505

BastienZim dismissed AnesBenmerzoug’s stale review via 364e505 September 11, 2023 15:00

parenthesis removed (with notebook saved this time)

2e9cf02

AnesBenmerzoug reviewed Sep 12, 2023

Update mkdocs.yml

5919b62

Co-authored-by: Anes Benmerzoug <Anes.Benmerzoug@gmail.com>

mdbenito approved these changes Sep 12, 2023

Merge branch 'develop' into develop

8c86480

mdbenito merged commit ca9591e into aai-institute:develop Sep 12, 2023
