
Conversation

@bejaeger
Contributor

@bejaeger bejaeger commented Jan 19, 2026

This PR makes no functional changes. It refactors the SequentialEncoder to prepare for further changes.
In particular:

  • Decouples projections from preprocessing steps.
  • Makes the preprocessing pipeline more explicit by passing a dict instead of tensors.

After this PR is merged, more work is needed to disentangle the preprocessing "state" from the model.
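The dict-based step interface described above can be sketched roughly as follows. The class names (`TorchPreprocessingPipeline`, `TorchPreprocessingStep`), the `_fit`/`_transform` split, and the `in_keys`/`out_keys` attributes come from this PR; the concrete signatures, the `StandardizeStep` example, and the `"main"` key are assumptions for illustration, not the repository's actual code.

```python
import torch
from torch import nn


class TorchPreprocessingStep(nn.Module):
    """A non-learnable preprocessing step that reads and writes named tensors."""

    def __init__(self, in_keys: list[str], out_keys: list[str]):
        super().__init__()
        self.in_keys = in_keys
        self.out_keys = out_keys

    def _fit(self, state: dict[str, torch.Tensor]) -> None:
        # Subclasses may compute statistics here, e.g. normalization constants.
        pass

    def _transform(self, state: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        raise NotImplementedError

    def forward(self, state: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        # Each step updates the shared state dict instead of passing tensor tuples,
        # which makes the data flow between steps explicit.
        state.update(self._transform(state))
        return state


class StandardizeStep(TorchPreprocessingStep):
    """Hypothetical example step: standardize the tensor stored under in_keys[0]."""

    def _transform(self, state: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        x = state[self.in_keys[0]]
        return {self.out_keys[0]: (x - x.mean()) / (x.std() + 1e-8)}


class TorchPreprocessingPipeline(nn.Module):
    """Runs a list of preprocessing steps over a shared state dict."""

    def __init__(self, steps: list[TorchPreprocessingStep]):
        super().__init__()
        self.steps = nn.ModuleList(steps)

    def forward(self, state: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        for step in self.steps:
            state = step(state)
        return state
```

Because every step only declares which keys it reads and writes, learnable projections can be pulled out of the pipeline entirely, which is the decoupling this PR introduces.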

@bejaeger
Contributor Author

This change is part of the following stack:

Change managed by git-spice.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is a great refactoring that significantly improves the encoder pipeline's architecture. The introduction of TorchPreprocessingPipeline and TorchPreprocessingStep with a clear _fit/_transform pattern and dictionary-based state management makes the code more robust, readable, and extensible. The decoupling of projections into separate embedder modules is also a good design choice. The tests have been diligently updated to reflect these changes. I've found a couple of areas for improvement, including a potential regression.

@bejaeger bejaeger added the no changelog needed PR does not require a changelog entry label Jan 19, 2026
@bejaeger bejaeger marked this pull request as ready for review January 19, 2026 15:18
@bejaeger bejaeger requested a review from a team as a code owner January 19, 2026 15:18
@bejaeger bejaeger requested review from alanprior and Copilot and removed request for a team January 19, 2026 15:18
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository-wide code reviews.

Contributor

Copilot AI left a comment


Pull request overview

This PR refactors the SequentialEncoder to improve code organization by decoupling projections from preprocessing steps and making the preprocessing pipeline more explicit through dictionary-based state passing instead of tensor arguments.

Changes:

  • Renamed SequentialEncoder to TorchPreprocessingPipeline and SeqEncStep to TorchPreprocessingStep to clarify that these components handle non-learnable preprocessing only
  • Introduced new embedder modules (LinearFeatureGroupEmbedder, MLPFeatureGroupEmbedder) to separate learnable projections from preprocessing
  • Updated all encoder steps to use dict[str, torch.Tensor] state dictionaries instead of tensor tuples for clearer data flow
  • Removed the seed parameter from InputNormalizationEncoderStep and related code in update_encoder_params
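The separation of learnable projections into embedder modules could look roughly like the sketch below. The class name `LinearFeatureGroupEmbedder` comes from the PR; the constructor parameters, the `"main"` state key, and the implementation are assumptions for illustration only.

```python
import torch
from torch import nn


class LinearFeatureGroupEmbedder(nn.Module):
    """Learnable projection from a preprocessed feature group to the model width.

    Kept separate from the (non-learnable) preprocessing pipeline, so the
    pipeline can be fit and serialized independently of the model parameters.
    """

    def __init__(self, num_features: int, emsize: int):
        super().__init__()
        self.projection = nn.Linear(num_features, emsize)

    def forward(self, state: dict[str, torch.Tensor]) -> torch.Tensor:
        # Consumes the output of the preprocessing pipeline's state dict.
        return self.projection(state["main"])
```

An MLP variant (`MLPFeatureGroupEmbedder` in the PR) would presumably swap the single `nn.Linear` for a small stack of linear layers with a nonlinearity.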

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| src/tabpfn/architectures/encoders/pipeline_interfaces.py | Refactored to use TorchPreprocessingPipeline and TorchPreprocessingStep with dict-based state passing |
| src/tabpfn/architectures/encoders/embedders.py | New file with LinearFeatureGroupEmbedder and MLPFeatureGroupEmbedder for learnable projections |
| src/tabpfn/architectures/encoders/steps/*.py | Updated all encoder steps to use dict-based state dictionaries and standardized in_keys/out_keys handling |
| src/tabpfn/architectures/encoders/__init__.py | Updated exports to reflect renamed classes and new embedders |
| src/tabpfn/architectures/base/transformer.py | Updated to use TorchPreprocessingPipeline with list-based step initialization |
| src/tabpfn/architectures/base/__init__.py | Updated encoder factory functions to use new class names and list-based initialization |
| src/tabpfn/utils.py | Removed seed parameter from update_encoder_params |
| src/tabpfn/base.py | Updated call to update_encoder_params to remove seed argument |
| tests/test_model_move_backwards_compatibility.py | Updated to check for TorchPreprocessingPipeline instead of InputEncoder |
| tests/test_model/test_seperate_train_inference.py | Updated to use new pipeline initialization syntax |
| tests/test_encoders/test_projections.py | Updated tests to use TorchPreprocessingPipeline and removed test_interface (moved to test_encoders.py) |
| tests/test_encoders/test_encoders.py | Added test_interface and test_feature_group_padding_and_reshape_step |
| tests/test_encoders/test_embedders.py | New test file for embedder modules |


Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@bejaeger bejaeger requested review from LeoGrin and alanprior January 19, 2026 16:00
Collaborator

@LeoGrin LeoGrin left a comment


LGTM thanks a lot!! The new names are much clearer :)

@bejaeger
Contributor Author

Thanks @alanprior and @LeoGrin !
I implemented several of your suggestions; a few of the remaining comments will be addressed in future PRs.

@bejaeger bejaeger requested a review from alanprior January 20, 2026 11:30
@alanprior
Contributor

> LGTM thanks a lot!! The new names are much clearer :)

+1 on this. Looks much much clearer and better. I'm learning a lot :)

@bejaeger bejaeger merged commit f44561f into main Jan 20, 2026
13 checks passed
@LeoGrin LeoGrin deleted the ben/refactor-sequential-encoder branch January 20, 2026 12:47
