Skip to content

RFC 0004 / RFC 0005: Refactor all remaining sampling logic from R / Python to C++ and standardize serialization#356

Open
andrewherren wants to merge 162 commits into
mainfrom
cpp_sampler_loop_refactor_rfc_0004
Open

RFC 0004 / RFC 0005: Refactor all remaining sampling logic from R / Python to C++ and standardize serialization#356
andrewherren wants to merge 162 commits into
mainfrom
cpp_sampler_loop_refactor_rfc_0004

Conversation

@andrewherren

Copy link
Copy Markdown
Collaborator

For a full overview of the design, see the RFC 0004 doc

@andrewherren andrewherren linked an issue Apr 9, 2026 that may be closed by this pull request
@andrewherren andrewherren changed the title RFC 0004: Refactor all remaining sampling logic from R / Python to C++ RFC 0004 / RFC 0005: Refactor all remaining sampling logic from R / Python to C++ and standardize serialization Jun 16, 2026
andrewherren and others added 27 commits June 15, 2026 23:39
…n core)

Core C++ (de)serialization of the samples-owned subtree (forests + parameter
traces + intrinsic scalars), the shared source of truth that will let R and
Python route serialization through one BARTSamples/BCFSamples object instead of
each hand-rolling their own marshalling.

- Forests under self-describing named keys, param traces under the "parameters"
  subfolder, intrinsic scalars top-level -- byte-identical to the current wire
  format (nlohmann sorts keys, so insertion order is irrelevant).
- Presence inferred from JSON structure rather than the envelope booleans, so the
  samples (de)serialization is self-contained; the envelope (model_params,
  covariate preprocessor, schema_version) stays per-language.
- Parameter traces serialized verbatim; tau_0 user-facing scaling and its
  multivariate ravel-order are reconciled at the postprocess/wiring boundary.
- Random effects and cloglog cutpoints are hard-guarded for now (no silent drops);
  they still route through the per-language serializer until covered here.

Verified by Json.BARTSamplesRoundTrip / Json.BCFSamplesRoundTrip (params, scalars,
per-draw length invariants, byte-level forest equality, num_forests counter).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Merge(other) appends another chain's draws onto this one: forests deep-copied
sample-by-sample, parameter traces concatenated, draw order preserved (this's
draws, then other's). Usage for combined-load: FromJson chain 0, then Merge
chains 1..N. Defensive guards reject mismatched structure (forest present in one
chain but not the other), mismatched outcome standardization, mismatched
treatment_dim (BCF), and rfx/cloglog (not yet routed through this path).

The shared forest-merge logic lives in one inline helper,
AppendForestContainerSamples, in container.h (where ForestContainer is declared
and which both headers include) -- avoids duplicating it across bart.h/bcf.h and
the ODR clash a per-header free function would cause when both are in one TU.

Verified by Json.BARTSamplesMerge / Json.BCFSamplesMerge (draw counts add, forest
sample counts add, merged forest's appended sample is a byte-identical deep copy,
parameter traces concatenate in order).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ctor_rfc_0004

# Conflicts:
#	test/python/test_bart.py
Thin pybind wrappers that each own a unique_ptr<BARTSamples>/<BCFSamples> and
forward to the core methods, doing only Python marshalling. This is the single
external pointer the model will hold once it stops decomposing the samples object
into separate forest containers + parameter arrays.

Surface: num_samples/y_bar/y_std/treatment_dim, parameter-trace getters (-> numpy),
has_*_forest, materialize_*_forest() (deep copy -> ForestContainerCpp, None if
absent, for the deprecated direct forest accessor), from_json_string/to_json_string/
load_from_json/add_to_json, and merge -- all forwarding to BARTSamples/BCFSamples.

Shared free helpers samples_vec_to_numpy / materialize_forest_container avoid
duplicating marshalling across the two wrappers; JsonCpp gains a GetJson() accessor
so the wrappers can read/write the samples subtree of a JsonCpp in place.

Verified by test/python/test_samples_wrapper.py (5 tests): accessors match the
model, parameter traces round-trip, materialized forests are byte-identical
(DumpJsonString equality), to_json_string round-trips, and merge concatenates
draws -- for both BART and BCF.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rts)

from_components assembles the single owned samples object by deep-copying existing
forest containers and parameter arrays (the #406 "construct from components" path).
This lets the model build self._samples from the current sampler outputs without
restructuring bart_sample_cpp -- the lowest-risk route to model-level single-owner;
having the binding return the wrapper directly stays an optional later optimization.

Also unified the forest deep-copy: materialize_forest_container now uses the shared
copy_forest_container (sample-by-sample AddSample) instead of a JSON round-trip --
faster, with byte-identity preserved (existing materialize tests stay green). Added
numpy_to_samples_vec helper for the optional parameter-array arguments.

Verified by test_samples_wrapper from_components tests (BART + BCF): wrapper built
from a fitted model's forest container(s) + param arrays matches the model's counts
and scalars, and the deep-copied forests are byte-identical to the source.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
BARTModel now owns one self._samples (BARTSamplesCpp) holding the sampled forests +
parameter traces. forest_container_mean/_variance, global_var_samples,
leaf_scale_samples, and num_samples become properties backed by it: the forest
accessors materialize-on-demand a deep-copied ForestContainer (cached, invalidated
when _samples changes) for the deprecated direct-access path; the param getters
return None when unsampled to preserve getattr(..., None) semantics.

All four write-paths -- sample(), continue_sampling(), from_json (single), and
from_json combined/multi-chain -- assemble self._samples via
BARTSamplesCpp.from_components(...). predict/to_json/kernel/collapse ride along
through the properties unchanged; rfx_container and cloglog_cutpoint_samples remain
separate attributes (the wrapper guards those). FromComponents now takes an optional
mean forest so variance-only BART models are supported.

Validated: full Python suite (251 passed, 2 skipped) -- sampling, predict,
continuation, collapse, single + multi-chain serialization fixtures, and the
wrapper tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
BCFModel now owns one self._samples (BCFSamplesCpp) holding the sampled forests +
parameter traces, mirroring the BART re-point. forest_container_mu/tau/variance,
global_var_samples, leaf_scale_mu_samples, leaf_scale_tau_samples, and num_samples
become properties backed by it (forests materialize-on-demand + cache; param getters
return None when unsampled). tau_0_samples / b0_samples / b1_samples stay separate
attributes (specialized 2D shape / scale; not routed through the wrapper yet).

All four write-paths -- sample(), continue_sampling(), from_json (single), and
from_json combined/multi-chain -- assemble self._samples via
BCFSamplesCpp.from_components(...). predict/to_json/kernel ride along via the
properties; rfx stays a separate attribute (the wrapper guards it).

Incidentally fixes a pre-existing bug in the multi-chain combined load where the
concatenated leaf-tau scale was assigned to the boolean flag self.sample_sigma2_leaf_tau
instead of the samples array.

Validated: full Python suite (251 passed, 2 skipped) -- sampling, predict,
continuation, single + multi-chain serialization fixtures, and the wrapper tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…er foundation)

R analog of the Python BARTSamplesCpp/BCFSamplesCpp wrappers, mirroring the existing
ForestSamples R6 + cpp11 idiom: an external pointer to the core C++ BARTSamples/
BCFSamples lives in a field of an R6 class whose methods forward to thin cpp11 free
functions.

- src/R_samples.cpp: [[cpp11::register]] free functions over core BARTSamples/BCFSamples
  -- from_components, scalar + parameter-vector getters (-> cpp11::doubles), has_*_forest,
  materialize_*_forest (-> ForestContainer EXTPTR), merge. Nullable args via
  cpp11::sexp + (x == R_NilValue); required forests via external_pointer.
- R/samples.R: BARTSamples / BCFSamples R6 classes (samples_ptr EXTPTR field), methods
  forwarding to the *_cpp functions; materialize_*_forest() returns a ForestSamples for
  the (deprecated) direct forest accessor. Internal (@nord) for now.
- Makevars.in / Makevars.win.in: add R_samples.o (explicit object list); cpp11.cpp /
  cpp11.R regenerated via cpp11::cpp_register().

Verified by test/R/testthat/test-samples-container.R (BART + BCF, 20 assertions):
counts/scalars, has_*, materialize (mean-forest prediction parity for BART), parameter
round-trip, and merge.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…redict)

Non-owning external_pointer getters (bart_samples_mean/variance_forest_ptr_cpp,
bcf_samples_mu/tau/variance_forest_ptr_cpp) that alias the samples-owned forest
containers without taking ownership (use_deleter=false, finalize=false), plus R6
methods exposing them. These let R predict read forests straight from the $samples
object (read-through) instead of an eager materialized copy. Foundation for the
bart.R/bcf.R model re-point.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[TRACKING] Tracking Issue for RFC 0004

1 participant