Conversation

@mexanick mexanick commented Sep 29, 2025

  • Relax requirements on the data shape (aggregate arbitrary arrays on axis 0)
  • Add option to aggregate statistics by time

closes #2745

@mexanick mexanick marked this pull request as ready for review September 30, 2025 13:43
@mexanick mexanick self-assigned this Sep 30, 2025
@mexanick (Contributor Author):

@matteobalbo FYI

@mexanick mexanick marked this pull request as draft October 2, 2025 09:38
ctao-sonarqube bot commented Oct 2, 2025

@mexanick mexanick marked this pull request as ready for review October 2, 2025 14:51
@TjarkMiener (Member) left a comment:

This already looks good to me, @mexanick! I'm attaching some comments and suggestions.

Comment on lines +44 to +60
default_value=None,
allow_none=True,
help=(
"Size of the chunk used for the computation of aggregated statistic values. "
"If None, use the entire table as one chunk. "
"For event-based chunking: number of events per chunk. "
"For time-based chunking: duration in seconds per chunk (integer)."
),
).tag(config=True)

chunking_mode = CaselessStrEnum(
["events", "time"],
default_value="events",
help=(
"Chunking strategy: 'events' for fixed event count, 'time' for time intervals. "
"When 'time', chunk_size and chunk_shift are interpreted as seconds."
),

Member:

I'd slightly prefer to have two config parameters, something like:

chunk_size = Int(
    default_value=None,
    allow_none=True,
    help=(
        "Event-based size of the chunk (number of events per chunk) "
        "used for the computation of aggregated statistic values."
    ),
).tag(config=True)

chunk_duration = AstroQuantity(
    default_value=None,
    allow_none=True,
    physical_type=astropy.units.physical.time,
    help=(
        "Time-based size of the chunk (a time Quantity giving the duration "
        "per chunk) used for the computation of aggregated statistic values."
    ),
).tag(config=True)

And then follow the logic: if both are None, use the entire table as one chunk; if both are set, raise an error; if exactly one is set, use that one.
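A minimal sketch of that resolution logic (the function name is hypothetical, not from the PR):

```python
def resolve_chunking(chunk_size, chunk_duration):
    """Pick the chunking strategy from two mutually exclusive options.

    Returns a (mode, value) tuple, where mode is one of
    "whole", "events", or "time".
    """
    if chunk_size is not None and chunk_duration is not None:
        # both set: ambiguous configuration, refuse to guess
        raise ValueError(
            "chunk_size and chunk_duration are mutually exclusive; "
            "set at most one of them"
        )
    if chunk_size is not None:
        return ("events", chunk_size)
    if chunk_duration is not None:
        return ("time", chunk_duration)
    # both None: use the entire table as a single chunk
    return ("whole", None)
```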

@mexanick (Contributor Author):

I kept it as is, with an Int chunk size, having only whole-second slicing in mind. However, it could indeed be updated to what you propose. Do you have a use case where a non-integer chunk (in terms of time) is expected?
Also, @maxnoe, @matteobalbo, from your (external) point of view, which is clearer?

@maxnoe (Member):

I think the more canonical option would be to have a Chunking component with two implementations, TimeChunking and SizeChunking, which can then carry the corresponding configuration options.
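A rough sketch of what such a hierarchy could look like (all names are hypothetical, and plain float timestamps stand in for astropy Quantities to keep the sketch self-contained):

```python
from abc import ABC, abstractmethod


class Chunking(ABC):
    """Base component: split a table into chunks along axis 0."""

    @abstractmethod
    def chunks(self, table):
        """Yield chunks of the input table."""


class SizeChunking(Chunking):
    """Fixed number of rows (events) per chunk."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size

    def chunks(self, table):
        for start in range(0, len(table), self.chunk_size):
            yield table[start : start + self.chunk_size]


class TimeChunking(Chunking):
    """Fixed time window per chunk, based on a timestamp column."""

    def __init__(self, chunk_duration, time_key="time"):
        self.chunk_duration = chunk_duration
        self.time_key = time_key

    def chunks(self, table):
        if not table:
            return
        # end of the current time window
        end = table[0][self.time_key] + self.chunk_duration
        current = []
        for row in table:
            # advance the window until it contains this row
            while row[self.time_key] >= end:
                if current:
                    yield current
                    current = []
                end += self.chunk_duration
            current.append(row)
        if current:
            yield current
```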

@matteobalbo commented Oct 6, 2025:

Possibly (I don't know if it makes sense, but...) a naive use case that would require a non-integer time chunk in the QualPipe would be if we finally decide to compute the trend of a given parameter during an Observation Block with an arbitrary number of bins (e.g. you want 7 data points during one OB, evenly spaced in time).

@mexanick (Contributor Author):

OK, I will create a new component as @maxnoe suggested. The time-based selection will then accept a Time in whatever format.

timestamps of shape (n_events, )
masked_elements_of_sample : ndarray, optional
boolean array of masked elements of shape (\*data_dimensions) that are not available for processing
chunk_shift : int, optional

Member:

this could be int or astropy.units.Quantity, optional

Comment on lines +105 to +106
self.chunking_mode = (
"events" # Force event-based chunking when using entire table

Member:

this forcing would be redundant with the config change suggested above

Comment on lines +273 to +274
# Check if chunk end exceeds our data range (with 0.1s tolerance for floating point)
if chunk_end > end_time + 0.1 * u.s:

Member:

I'd assume that this tolerance depends on your use case. I'd pass it as an argument to __call__().
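For illustration, a float-seconds sketch of how such a tolerance could enter the chunk loop (the names and the 0.1 s default are assumptions, not the PR's code; the real version would use astropy Quantities):

```python
def iter_time_chunks(times, chunk_duration, end_time, time_tolerance=0.1):
    """Yield (start, end) time windows of fixed duration.

    The last window is dropped if it overruns the data range by more
    than the tolerance. All values are in seconds.
    """
    start = times[0]
    while start < end_time:
        end = start + chunk_duration
        # allow a small overrun to absorb floating-point jitter
        if end > end_time + time_tolerance:
            break
        yield (start, end)
        start = end
```

With a nonzero tolerance, a window ending just past the data range is still kept; with a zero tolerance it is dropped.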

Comment on lines 67 to 68
chunk_shift=None,
col_name="image",

Member:

see below: I'd add time_tolerance = 0.1 * u.s to customise based on the use case.

elif self.chunking_mode == "time":
yield from self._get_time_chunks(table, chunk_shift, effective_chunk_size)

def _check_table_length(self, table, effective_chunk_size):

Member:

I'd consider moving this check into _get_event_chunks() and _get_time_chunks() and removing this extra function.

Chunks of the input table
"""
# If using entire table as one chunk, just yield the whole table
if self.chunk_size is None:

Member:

this would then be `if self.chunk_size is None and self.chunk_duration is None:`

should have the same units as the input data.
"""
for col in ("mean", "median", "std"):
if col in table.colnames:

Member:

is this check needed?

@mexanick mexanick marked this pull request as draft October 7, 2025 08:26
@abstractmethod
def compute_stats(self, images, masked_pixels_of_sample) -> StatisticsContainer:
def _add_result_columns(self, data, masked_elements_of_sample, results_dict):
"""


Suggested change: replace `"""` with `r"""` (a raw docstring, so the `\*` escape is preserved).

Successfully merging this pull request may close these issues.

Add "time interval" option feature to compute aggregated statistic with StatisticalAggregator
4 participants