Conversation

@mexanick mexanick commented Sep 29, 2025

  • Relax requirements on the data shape (aggregate arbitrary arrays on axis 0)
  • Add option to aggregate statistics by time

closes #2745

@mexanick mexanick marked this pull request as ready for review September 30, 2025 13:43
@mexanick mexanick self-assigned this Sep 30, 2025
@mexanick (Contributor Author):

@matteobalbo FYI

@mexanick mexanick marked this pull request as draft October 2, 2025 09:38
ctao-sonarqube bot commented Oct 2, 2025

@mexanick mexanick marked this pull request as ready for review October 2, 2025 14:51
@TjarkMiener (Member) left a comment:

This already looks good to me, @mexanick! I'm attaching some comments and suggestions.

Comment on lines +44 to +60
default_value=None,
allow_none=True,
help=(
"Size of the chunk used for the computation of aggregated statistic values. "
"If None, use the entire table as one chunk. "
"For event-based chunking: number of events per chunk. "
"For time-based chunking: duration in seconds per chunk (integer)."
),
).tag(config=True)

chunking_mode = CaselessStrEnum(
["events", "time"],
default_value="events",
help=(
"Chunking strategy: 'events' for fixed event count, 'time' for time intervals. "
"When 'time', chunk_size and chunk_shift are interpreted as seconds."
),

Member:

I'd slightly prefer to have two config parameters, something like:

chunk_size = Int(
    default_value=None,
    allow_none=True,
    help=(
        "Event-based size of the chunk (number of events per chunk) "
        "used for the computation of aggregated statistic values."
    ),
).tag(config=True)

chunk_duration = AstroQuantity(
    default_value=None,
    allow_none=True,
    physical_type=astropy.units.physical.time,
    help=(
        "Time-based size of the chunk (a time Quantity giving the duration "
        "per chunk) used for the computation of aggregated statistic values."
    ),
).tag(config=True)

And then follow the logic: if both are None, use the entire table as one chunk; if both are set, raise an error; if exactly one is set, use that one.
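A minimal sketch of that resolution logic (the function name is hypothetical, not from the PR):

```python
def resolve_chunking(chunk_size, chunk_duration):
    """Pick the chunking strategy from two mutually exclusive options.

    Returns a (mode, value) tuple, where mode is one of
    "whole", "events", or "time".
    """
    if chunk_size is not None and chunk_duration is not None:
        # both set: ambiguous configuration, refuse to guess
        raise ValueError(
            "chunk_size and chunk_duration are mutually exclusive; "
            "set at most one of them"
        )
    if chunk_size is not None:
        return ("events", chunk_size)
    if chunk_duration is not None:
        return ("time", chunk_duration)
    # both None: use the entire table as a single chunk
    return ("whole", None)
```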

@mexanick (Contributor Author):

I kept it as is, with an Int chunk size, having only whole-second slicing in mind. However, it could indeed be updated to what you propose. Do you have a use case where a non-integer chunk (in terms of time) is expected?
Also, @maxnoe, @matteobalbo, from your (external) point of view, which is clearer?

@maxnoe (Member):

I think the more canonical option would be to have a Chunking component with two implementations, TimeChunking and SizeChunking, which can then carry the corresponding configuration options.
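A rough sketch of what such a hierarchy could look like (all names are hypothetical, and plain float timestamps stand in for astropy Quantities to keep the sketch self-contained):

```python
from abc import ABC, abstractmethod


class Chunking(ABC):
    """Base component: split a table into chunks along axis 0."""

    @abstractmethod
    def chunks(self, table):
        """Yield chunks of the input table."""


class SizeChunking(Chunking):
    """Fixed number of rows (events) per chunk."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size

    def chunks(self, table):
        for start in range(0, len(table), self.chunk_size):
            yield table[start : start + self.chunk_size]


class TimeChunking(Chunking):
    """Fixed time window per chunk, based on a timestamp column."""

    def __init__(self, chunk_duration, time_key="time"):
        self.chunk_duration = chunk_duration
        self.time_key = time_key

    def chunks(self, table):
        if not table:
            return
        # end of the current time window
        end = table[0][self.time_key] + self.chunk_duration
        current = []
        for row in table:
            # advance the window until it contains this row
            while row[self.time_key] >= end:
                if current:
                    yield current
                    current = []
                end += self.chunk_duration
            current.append(row)
        if current:
            yield current
```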

@matteobalbo commented Oct 6, 2025:

Possibly (I don't know if it makes sense, but...) a naive use case that would require a non-integer time chunk in the QualPipe would be if we finally decide to compute the trend of a given parameter during an Observation Block with an arbitrary number of bins (e.g. you want 7 data points during one OB, evenly spaced in time).

@mexanick (Contributor Author):

OK, I will create a new component as @maxnoe suggested. The time-based selection will then accept a Time in whatever format.

timestamps of shape (n_events, )
masked_elements_of_sample : ndarray, optional
boolean array of masked elements of shape (\*data_dimensions) that are not available for processing
chunk_shift : int, optional

Member:

this could be int or astropy.units.Quantity, optional

Comment on lines +105 to +106
self.chunking_mode = (
"events" # Force event-based chunking when using entire table

Member:

this forcing would be redundant with the config change suggested above

Comment on lines +273 to +274
# Check if chunk end exceeds our data range (with 0.1s tolerance for floating point)
if chunk_end > end_time + 0.1 * u.s:

Member:

I'd assume that this tolerance depends on your use case. I'd pass it as an argument to __call__().
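For illustration, a float-seconds sketch of how such a tolerance could enter the chunk loop (the names and the 0.1 s default are assumptions, not the PR's code; the real version would use astropy Quantities):

```python
def iter_time_chunks(times, chunk_duration, end_time, time_tolerance=0.1):
    """Yield (start, end) time windows of fixed duration.

    The last window is dropped if it overruns the data range by more
    than the tolerance. All values are in seconds.
    """
    start = times[0]
    while start < end_time:
        end = start + chunk_duration
        # allow a small overrun to absorb floating-point jitter
        if end > end_time + time_tolerance:
            break
        yield (start, end)
        start = end
```

With a nonzero tolerance, a window ending just past the data range is still kept; with a zero tolerance it is dropped.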

Comment on lines 67 to 68
chunk_shift=None,
col_name="image",

Member:

see below: I'd add time_tolerance = 0.1 * u.s to customise based on the use case.

elif self.chunking_mode == "time":
yield from self._get_time_chunks(table, chunk_shift, effective_chunk_size)

def _check_table_length(self, table, effective_chunk_size):

Member:

I'd consider moving this check into _get_event_chunks() and _get_time_chunks() and removing this extra function.

Chunks of the input table
"""
# If using entire table as one chunk, just yield the whole table
if self.chunk_size is None:

Member:

this would then be `if self.chunk_size is None and self.chunk_duration is None:`

should have the same units as the input data.
"""
for col in ("mean", "median", "std"):
if col in table.colnames:

Member:

is this check needed?

@mexanick mexanick marked this pull request as draft October 7, 2025 08:26
@abstractmethod
def compute_stats(self, images, masked_pixels_of_sample) -> StatisticsContainer:
def _add_result_columns(self, data, masked_elements_of_sample, results_dict):
"""


Suggested change: replace `"""` with `r"""` (a raw docstring, so the `\*` escape is preserved).

Successfully merging this pull request may close these issues.

Add "time interval" option feature to compute aggregated statistic with StatisticalAggregator
4 participants