TabPFNRegressor preprocessing fails on bigger datasets fix #255
Conversation
Pull Request Overview
This PR fixes an issue in the TabPFNRegressor preprocessing module where the number of quantiles could become excessively large on bigger datasets. The change caps the number of quantiles used by the various QuantileTransformer instances at 10,000 (see the sketch after the list below).
- Updated the "quantile_uni_coarse" and "quantile_norm_coarse" transformers to use `n_quantiles=min(max(num_examples // 10, 2), 10_000)`.
- Updated the "quantile_norm" transformer to use `n_quantiles=min(max(num_examples // 5, 2), 10_000)`.
- Updated the "quantile_uni_fine" and "quantile_norm_fine" transformers to use `n_quantiles=min(num_examples, 10_000)`.
Thanks so much for this change! Would you be able to add a test for this change, i.e. one that tests whether the preprocessing runs on datasets of more than 10,000 samples? Unfortunately we can't run the inference step, as it would be far too slow. The only way to test inference on large datasets would be to provide a tiny TabPFN checkpoint, a very small randomly initialized model, but that would be a project in itself.
@noahho I have added the test.
Pull Request Overview
This PR fixes an issue with the TabPFNRegressor preprocessing on larger datasets by capping the number of quantiles to 10,000 in several QuantileTransformer configurations.
- Updated quantile transformer settings in the preprocessing module to avoid excessive quantile calculations.
- Added a test case in tests/test_preprocessing.py to verify functionality on large datasets.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
File | Description
---|---
tests/test_preprocessing.py | Adds a new test to verify preprocessing functionality on large datasets.
src/tabpfn/model/preprocessing.py | Updates the `n_quantiles` parameter in various QuantileTransformer configurations by capping values at 10,000.
Comments suppressed due to low confidence (2)
tests/test_preprocessing.py:27
- Consider adding assertions to verify the shape and properties of the transformed output to increase test coverage beyond a simple non-null check (a sketch of such assertions follows these suppressed comments).
`assert result is not None`
src/tabpfn/model/preprocessing.py:722
- [nitpick] The 'quantile_uni_coarse' transformer now caps n_quantiles to 10,000, yet the 'quantile_uni' transformer remains uncapped. If this discrepancy is unintentional, consider applying the same cap or adding a clarifying comment.
`n_quantiles=min(max(num_examples // 10, 2), 10_000),`
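As an illustration of the first suppressed comment, here is a hypothetical test sketch with stronger assertions than a bare non-null check. It does not call TabPFN's internal preprocessing entry point (not shown in this conversation); a capped scikit-learn QuantileTransformer stands in for the preprocessing step, and the dataset size and assertion set are assumptions.

```python
# Hypothetical test: exercises the 10,000-quantile cap on a dataset larger
# than the cap and checks shape/finiteness of the output, not just non-null.
import numpy as np
from sklearn.preprocessing import QuantileTransformer


def test_quantile_transform_on_large_dataset():
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20_000, 5))  # more samples than the 10,000 cap

    transformer = QuantileTransformer(n_quantiles=min(X.shape[0], 10_000))
    result = transformer.fit_transform(X)

    assert result is not None
    assert result.shape == X.shape              # shape is preserved
    assert np.isfinite(result).all()            # no NaN/inf introduced
    assert transformer.n_quantiles_ <= 10_000   # the cap was respected
```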
Great, this looks really good. There seems to be a tiny ruff issue at this point. Do you know how to resolve it?
Oh, also something that Copilot just caught:
I will fix it now.
@noahho I ran `ruff check . --fix` but the ruff linting test is still failing.
The two open ones don't seem to be automatically fixable: an LLM will know how to fix number 2, and deleting a character at src/tabpfn/regressor.py:723:89 fixes number 1.
@noahho Please review.
Thanks a lot for continuing to work on this. It seems a few of the changes made for the linting weren't right (such as adding ""). I'll look into the PR and fix those things, if you'd like me to.
Yes, sure.
This PR fixes #169
Changes made: capped the number of quantiles with `n_quantiles=min(n_quantiles, 10_000)` in
TabPFN/src/tabpfn/model/preprocessing.py