RFC 0004 / RFC 0005: Refactor all remaining sampling logic from R / Python to C++ and standardize serialization#356
Open
andrewherren wants to merge 162 commits into
Open
RFC 0004 / RFC 0005: Refactor all remaining sampling logic from R / Python to C++ and standardize serialization#356andrewherren wants to merge 162 commits into
andrewherren wants to merge 162 commits into
Conversation
…le for probit BCF
…vation_weights param without _train
…ds work for supported models
…hon BART and BCF (to be refactored)
…n core) Core C++ (de)serialization of the samples-owned subtree (forests + parameter traces + intrinsic scalars), the shared source of truth that will let R and Python route serialization through one BARTSamples/BCFSamples object instead of each hand-rolling their own marshalling. - Forests under self-describing named keys, param traces under the "parameters" subfolder, intrinsic scalars top-level -- byte-identical to the current wire format (nlohmann sorts keys, so insertion order is irrelevant). - Presence inferred from JSON structure rather than the envelope booleans, so the samples (de)serialization is self-contained; the envelope (model_params, covariate preprocessor, schema_version) stays per-language. - Parameter traces serialized verbatim; tau_0 user-facing scaling and its multivariate ravel-order are reconciled at the postprocess/wiring boundary. - Random effects and cloglog cutpoints are hard-guarded for now (no silent drops); they still route through the per-language serializer until covered here. Verified by Json.BARTSamplesRoundTrip / Json.BCFSamplesRoundTrip (params, scalars, per-draw length invariants, byte-level forest equality, num_forests counter). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Merge(other) appends another chain's draws onto this one: forests deep-copied sample-by-sample, parameter traces concatenated, draw order preserved (this's draws, then other's). Usage for combined-load: FromJson chain 0, then Merge chains 1..N. Defensive guards reject mismatched structure (forest present in one chain but not the other), mismatched outcome standardization, mismatched treatment_dim (BCF), and rfx/cloglog (not yet routed through this path). The shared forest-merge logic lives in one inline helper, AppendForestContainerSamples, in container.h (where ForestContainer is declared and which both headers include) -- avoids duplicating it across bart.h/bcf.h and the ODR clash a per-header free function would cause when both are in one TU. Verified by Json.BARTSamplesMerge / Json.BCFSamplesMerge (draw counts add, forest sample counts add, merged forest's appended sample is a byte-identical deep copy, parameter traces concatenate in order). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ctor_rfc_0004 # Conflicts: # test/python/test_bart.py
Thin pybind wrappers that each own a unique_ptr<BARTSamples>/<BCFSamples> and forward to the core methods, doing only Python marshalling. This is the single external pointer the model will hold once it stops decomposing the samples object into separate forest containers + parameter arrays. Surface: num_samples/y_bar/y_std/treatment_dim, parameter-trace getters (-> numpy), has_*_forest, materialize_*_forest() (deep copy -> ForestContainerCpp, None if absent, for the deprecated direct forest accessor), from_json_string/to_json_string/ load_from_json/add_to_json, and merge -- all forwarding to BARTSamples/BCFSamples. Shared free helpers samples_vec_to_numpy / materialize_forest_container avoid duplicating marshalling across the two wrappers; JsonCpp gains a GetJson() accessor so the wrappers can read/write the samples subtree of a JsonCpp in place. Verified by test/python/test_samples_wrapper.py (5 tests): accessors match the model, parameter traces round-trip, materialized forests are byte-identical (DumpJsonString equality), to_json_string round-trips, and merge concatenates draws -- for both BART and BCF. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rts) from_components assembles the single owned samples object by deep-copying existing forest containers and parameter arrays (the #406 "construct from components" path). This lets the model build self._samples from the current sampler outputs without restructuring bart_sample_cpp -- the lowest-risk route to model-level single-owner; having the binding return the wrapper directly stays an optional later optimization. Also unified the forest deep-copy: materialize_forest_container now uses the shared copy_forest_container (sample-by-sample AddSample) instead of a JSON round-trip -- faster, with byte-identity preserved (existing materialize tests stay green). Added numpy_to_samples_vec helper for the optional parameter-array arguments. Verified by test_samples_wrapper from_components tests (BART + BCF): wrapper built from a fitted model's forest container(s) + param arrays matches the model's counts and scalars, and the deep-copied forests are byte-identical to the source. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
BARTModel now owns one self._samples (BARTSamplesCpp) holding the sampled forests + parameter traces. forest_container_mean/_variance, global_var_samples, leaf_scale_samples, and num_samples become properties backed by it: the forest accessors materialize-on-demand a deep-copied ForestContainer (cached, invalidated when _samples changes) for the deprecated direct-access path; the param getters return None when unsampled to preserve getattr(..., None) semantics. All four write-paths -- sample(), continue_sampling(), from_json (single), and from_json combined/multi-chain -- assemble self._samples via BARTSamplesCpp.from_components(...). predict/to_json/kernel/collapse ride along through the properties unchanged; rfx_container and cloglog_cutpoint_samples remain separate attributes (the wrapper guards those). FromComponents now takes an optional mean forest so variance-only BART models are supported. Validated: full Python suite (251 passed, 2 skipped) -- sampling, predict, continuation, collapse, single + multi-chain serialization fixtures, and the wrapper tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
BCFModel now owns one self._samples (BCFSamplesCpp) holding the sampled forests + parameter traces, mirroring the BART re-point. forest_container_mu/tau/variance, global_var_samples, leaf_scale_mu_samples, leaf_scale_tau_samples, and num_samples become properties backed by it (forests materialize-on-demand + cache; param getters return None when unsampled). tau_0_samples / b0_samples / b1_samples stay separate attributes (specialized 2D shape / scale; not routed through the wrapper yet). All four write-paths -- sample(), continue_sampling(), from_json (single), and from_json combined/multi-chain -- assemble self._samples via BCFSamplesCpp.from_components(...). predict/to_json/kernel ride along via the properties; rfx stays a separate attribute (the wrapper guards it). Incidentally fixes a pre-existing bug in the multi-chain combined load where the concatenated leaf-tau scale was assigned to the boolean flag self.sample_sigma2_leaf_tau instead of the samples array. Validated: full Python suite (251 passed, 2 skipped) -- sampling, predict, continuation, single + multi-chain serialization fixtures, and the wrapper tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…er foundation) R analog of the Python BARTSamplesCpp/BCFSamplesCpp wrappers, mirroring the existing ForestSamples R6 + cpp11 idiom: an external pointer to the core C++ BARTSamples/ BCFSamples lives in a field of an R6 class whose methods forward to thin cpp11 free functions. - src/R_samples.cpp: [[cpp11::register]] free functions over core BARTSamples/BCFSamples -- from_components, scalar + parameter-vector getters (-> cpp11::doubles), has_*_forest, materialize_*_forest (-> ForestContainer EXTPTR), merge. Nullable args via cpp11::sexp + (x == R_NilValue); required forests via external_pointer. - R/samples.R: BARTSamples / BCFSamples R6 classes (samples_ptr EXTPTR field), methods forwarding to the *_cpp functions; materialize_*_forest() returns a ForestSamples for the (deprecated) direct forest accessor. Internal (@nord) for now. - Makevars.in / Makevars.win.in: add R_samples.o (explicit object list); cpp11.cpp / cpp11.R regenerated via cpp11::cpp_register(). Verified by test/R/testthat/test-samples-container.R (BART + BCF, 20 assertions): counts/scalars, has_*, materialize (mean-forest prediction parity for BART), parameter round-trip, and merge. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…redict) Non-owning external_pointer getters (bart_samples_mean/variance_forest_ptr_cpp, bcf_samples_mu/tau/variance_forest_ptr_cpp) that alias the samples-owned forest containers without taking ownership (use_deleter=false, finalize=false), plus R6 methods exposing them. These let R predict read forests straight from the $samples object (read-through) instead of an eager materialized copy. Foundation for the bart.R/bcf.R model re-point. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
For a full overview of the design, see the RFC 0004 doc