
Conversation

@bejaeger
Contributor

@bejaeger bejaeger commented Jan 19, 2026

This PR makes no functional changes. It refactors the SequentialEncoder to prepare for further changes.
In particular:

  • Decouples projections from preprocessing steps.
  • Makes the preprocessing pipeline more explicit by passing a dict instead of tensors.

After this PR is merged, more work is needed to disentangle the preprocessing "state" from the model.
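The dict-based step interface described above can be sketched roughly as follows. The class names (`TorchPreprocessingPipeline`, `TorchPreprocessingStep`), the `_fit`/`_transform` split, and the `in_keys`/`out_keys` attributes come from this PR; the concrete signatures, the `StandardizeStep` example, and the `"main"` key are assumptions for illustration, not the repository's actual code.

```python
import torch
from torch import nn


class TorchPreprocessingStep(nn.Module):
    """A non-learnable preprocessing step that reads and writes named tensors."""

    def __init__(self, in_keys: list[str], out_keys: list[str]):
        super().__init__()
        self.in_keys = in_keys
        self.out_keys = out_keys

    def _fit(self, state: dict[str, torch.Tensor]) -> None:
        # Subclasses may compute statistics here, e.g. normalization constants.
        pass

    def _transform(self, state: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        raise NotImplementedError

    def forward(self, state: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        # Each step updates the shared state dict instead of passing tensor tuples,
        # which makes the data flow between steps explicit.
        state.update(self._transform(state))
        return state


class StandardizeStep(TorchPreprocessingStep):
    """Hypothetical example step: standardize the tensor stored under in_keys[0]."""

    def _transform(self, state: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        x = state[self.in_keys[0]]
        return {self.out_keys[0]: (x - x.mean()) / (x.std() + 1e-8)}


class TorchPreprocessingPipeline(nn.Module):
    """Runs a list of preprocessing steps over a shared state dict."""

    def __init__(self, steps: list[TorchPreprocessingStep]):
        super().__init__()
        self.steps = nn.ModuleList(steps)

    def forward(self, state: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        for step in self.steps:
            state = step(state)
        return state
```

Because every step only declares which keys it reads and writes, learnable projections can be pulled out of the pipeline entirely, which is the decoupling this PR introduces.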

@bejaeger
Contributor Author

This change is part of the following stack:

Change managed by git-spice.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is a great refactoring that significantly improves the encoder pipeline's architecture. The introduction of TorchPreprocessingPipeline and TorchPreprocessingStep with a clear _fit/_transform pattern and dictionary-based state management makes the code more robust, readable, and extensible. The decoupling of projections into separate embedder modules is also a good design choice. The tests have been diligently updated to reflect these changes. I've found a couple of areas for improvement, including a potential regression.

@bejaeger bejaeger added the no changelog needed PR does not require a changelog entry label Jan 19, 2026
@bejaeger bejaeger marked this pull request as ready for review January 19, 2026 15:18
@bejaeger bejaeger requested a review from a team as a code owner January 19, 2026 15:18
@bejaeger bejaeger requested review from alanprior and Copilot and removed request for a team January 19, 2026 15:18
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository-wide code reviews.

Contributor

Copilot AI left a comment


Pull request overview

This PR refactors the SequentialEncoder to improve code organization by decoupling projections from preprocessing steps and making the preprocessing pipeline more explicit through dictionary-based state passing instead of tensor arguments.

Changes:

  • Renamed SequentialEncoder to TorchPreprocessingPipeline and SeqEncStep to TorchPreprocessingStep to clarify that these components handle non-learnable preprocessing only
  • Introduced new embedder modules (LinearFeatureGroupEmbedder, MLPFeatureGroupEmbedder) to separate learnable projections from preprocessing
  • Updated all encoder steps to use dict[str, torch.Tensor] state dictionaries instead of tensor tuples for clearer data flow
  • Removed the seed parameter from InputNormalizationEncoderStep and related code in update_encoder_params
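The separation of learnable projections into embedder modules could look roughly like the sketch below. The class name `LinearFeatureGroupEmbedder` comes from the PR; the constructor parameters, the `"main"` state key, and the implementation are assumptions for illustration only.

```python
import torch
from torch import nn


class LinearFeatureGroupEmbedder(nn.Module):
    """Learnable projection from a preprocessed feature group to the model width.

    Kept separate from the (non-learnable) preprocessing pipeline, so the
    pipeline can be fit and serialized independently of the model parameters.
    """

    def __init__(self, num_features: int, emsize: int):
        super().__init__()
        self.projection = nn.Linear(num_features, emsize)

    def forward(self, state: dict[str, torch.Tensor]) -> torch.Tensor:
        # Consumes the output of the preprocessing pipeline's state dict.
        return self.projection(state["main"])
```

An MLP variant (`MLPFeatureGroupEmbedder` in the PR) would presumably swap the single `nn.Linear` for a small stack of linear layers with a nonlinearity.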

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| src/tabpfn/architectures/encoders/pipeline_interfaces.py | Refactored to use TorchPreprocessingPipeline and TorchPreprocessingStep with dict-based state passing |
| src/tabpfn/architectures/encoders/embedders.py | New file with LinearFeatureGroupEmbedder and MLPFeatureGroupEmbedder for learnable projections |
| src/tabpfn/architectures/encoders/steps/*.py | Updated all encoder steps to use dict-based state dictionaries and standardized in_keys/out_keys handling |
| src/tabpfn/architectures/encoders/__init__.py | Updated exports to reflect renamed classes and new embedders |
| src/tabpfn/architectures/base/transformer.py | Updated to use TorchPreprocessingPipeline with list-based step initialization |
| src/tabpfn/architectures/base/__init__.py | Updated encoder factory functions to use new class names and list-based initialization |
| src/tabpfn/utils.py | Removed seed parameter from update_encoder_params |
| src/tabpfn/base.py | Updated call to update_encoder_params to remove seed argument |
| tests/test_model_move_backwards_compatibility.py | Updated to check for TorchPreprocessingPipeline instead of InputEncoder |
| tests/test_model/test_seperate_train_inference.py | Updated to use new pipeline initialization syntax |
| tests/test_encoders/test_projections.py | Updated tests to use TorchPreprocessingPipeline and removed test_interface (moved to test_encoders.py) |
| tests/test_encoders/test_encoders.py | Added test_interface and test_feature_group_padding_and_reshape_step |
| tests/test_encoders/test_embedders.py | New test file for embedder modules |


Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@bejaeger bejaeger requested review from LeoGrin and alanprior January 19, 2026 16:00
Collaborator

@LeoGrin LeoGrin left a comment


LGTM thanks a lot!! The new names are much clearer :)

@bejaeger
Contributor Author

Thanks @alanprior and @LeoGrin !
I implemented several of your suggestions; a few of the remaining comments will be addressed in future PRs.

@bejaeger bejaeger requested a review from alanprior January 20, 2026 11:30
@alanprior
Contributor

> LGTM thanks a lot!! The new names are much clearer :)

+1 on this. Looks much much clearer and better. I'm learning a lot :)

@bejaeger bejaeger merged commit f44561f into main Jan 20, 2026
13 checks passed
@LeoGrin LeoGrin deleted the ben/refactor-sequential-encoder branch January 20, 2026 12:47
