Improve statistics aggregators #2848
Conversation
@matteobalbo FYI
Looks already good to me, @mexanick! I'm attaching some comments and suggestions.
    default_value=None,
    allow_none=True,
    help=(
        "Size of the chunk used for the computation of aggregated statistic values. "
        "If None, use the entire table as one chunk. "
        "For event-based chunking: number of events per chunk. "
        "For time-based chunking: duration in seconds per chunk (integer)."
    ),
).tag(config=True)
chunking_mode = CaselessStrEnum(
    ["events", "time"],
    default_value="events",
    help=(
        "Chunking strategy: 'events' for fixed event count, 'time' for time intervals. "
        "When 'time', chunk_size and chunk_shift are interpreted as seconds."
    ),
)
I'd slightly prefer to have two config parameters, something like:
chunk_size = Int(
    default_value=None,
    allow_none=True,
    help=(
        "Event-based size of the chunk (number of events per chunk) "
        "used for the computation of aggregated statistic values."
    ),
).tag(config=True)

chunk_duration = AstroQuantity(
    default_value=None,
    allow_none=True,
    physical_type=astropy.units.physical.time,
    help=(
        "Time-based size of the chunk (duration in 'astropy.units.physical.time' per chunk) "
        "used for the computation of aggregated statistic values."
    ),
).tag(config=True)
And then follow the logic: if both are None, use the entire table as one chunk; if both are set, raise an error; if only one is set, use that one.
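The resolution logic described above could be sketched as a small helper; the function name and return convention here are hypothetical, not part of the PR:

```python
def resolve_chunking(chunk_size=None, chunk_duration=None):
    """Return ("table", None), ("events", n), or ("time", dt).

    chunk_size: number of events per chunk (int or None).
    chunk_duration: chunk length in seconds (or None).
    """
    if chunk_size is not None and chunk_duration is not None:
        # Both set: ambiguous configuration, refuse to guess.
        raise ValueError("Set only one of chunk_size or chunk_duration")
    if chunk_size is not None:
        return ("events", chunk_size)
    if chunk_duration is not None:
        return ("time", chunk_duration)
    # Both None: use the entire table as one chunk.
    return ("table", None)
```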
I kept it as is, with an Int chunk size, with only N-second slicing in mind. However, it could indeed be updated to what you propose. Do you have a use case where a non-integer chunk (in terms of time) is expected?
Also, @maxnoe , @matteobalbo from your (external) point of view, what's more clear?
I think the more canonical option would be to have a Chunking component with two implementations, TimeChunking and SizeChunking, which can then have the corresponding configuration options.
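The component split suggested here could look roughly like the following; the traitlets/Component plumbing is omitted and the method name `chunk_indices` is an assumption for illustration:

```python
from abc import ABC, abstractmethod


class Chunking(ABC):
    """Base for chunking strategies; yields index ranges into the event list."""

    @abstractmethod
    def chunk_indices(self, timestamps):
        """Yield (start, stop) index pairs."""


class SizeChunking(Chunking):
    """Fixed number of events per chunk."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size

    def chunk_indices(self, timestamps):
        n = len(timestamps)
        for start in range(0, n, self.chunk_size):
            yield start, min(start + self.chunk_size, n)


class TimeChunking(Chunking):
    """Fixed time span per chunk (timestamps and duration in seconds)."""

    def __init__(self, chunk_duration):
        self.chunk_duration = chunk_duration

    def chunk_indices(self, timestamps):
        n = len(timestamps)
        start = 0
        while start < n:
            t_end = timestamps[start] + self.chunk_duration
            stop = start
            while stop < n and timestamps[stop] < t_end:
                stop += 1
            yield start, stop
            start = stop
```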
Eventually (I don't know if it makes sense, but...) a naive use case that would require a non-integer time chunk in the QualPipe could be if we finally decide to compute the trend of a given parameter during an Observation Block in an arbitrary number of bins (e.g. you want to have 7 data points during one OB, evenly spaced in time).
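To illustrate this use case numerically (the 20-minute OB length is an assumed figure): splitting an observation block into 7 evenly spaced bins naturally produces a non-integer chunk duration.

```python
# Assumed OB length of 20 minutes; 7 evenly spaced trend points.
ob_duration_s = 1200.0
n_chunks = 7

chunk_duration_s = ob_duration_s / n_chunks  # fractional seconds per chunk
# Bin edges relative to the OB start, evenly spaced in time.
edges = [i * chunk_duration_s for i in range(n_chunks + 1)]
```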
Ok, I will do a new component as @maxnoe suggested. Then for the time-based selection it will take Time in whatever format.
    timestamps of shape (n_events, )
masked_elements_of_sample : ndarray, optional
    boolean array of masked elements of shape (\*data_dimensions) that are not available for processing
chunk_shift : int, optional
this could be int or astropy.units.Quantity, optional
self.chunking_mode = (
    "events"  # Force event-based chunking when using entire table
this force would be redundant with suggested config change above
# Check if chunk end exceeds our data range (with 0.1 s tolerance for floating point)
if chunk_end > end_time + 0.1 * u.s:
I'd assume that this tolerance depends on your use case. I'd pass it as an argument in the __call__().
chunk_shift=None,
col_name="image",
see below: I'd add time_tolerance = 0.1 * u.s to customise based on the use case.
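The parametrised tolerance suggested here could be sketched as below; the function name and the choice to work in plain seconds (rather than astropy Quantities) are simplifications for illustration:

```python
def chunk_exceeds_range(chunk_end, end_time, time_tolerance=0.1):
    """Return True if the chunk runs past the data range.

    All times in seconds. time_tolerance replaces the hard-coded
    0.1 s floating-point slack, so callers can tune it per use case.
    """
    return chunk_end > end_time + time_tolerance
```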
elif self.chunking_mode == "time":
    yield from self._get_time_chunks(table, chunk_shift, effective_chunk_size)

def _check_table_length(self, table, effective_chunk_size):
I'd consider moving this check into _get_event_chunks() and _get_time_chunks() and remove this extra function.
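Folding the length check into each generator, as suggested, might look like this; the standalone function shape and the error message are hypothetical, since the actual methods live on the aggregator class:

```python
def get_event_chunks(table_len, chunk_size):
    """Yield (start, stop) ranges, validating the table length inline
    instead of via a separate _check_table_length helper."""
    if table_len < chunk_size:
        raise ValueError(
            f"Table has {table_len} events, fewer than chunk_size={chunk_size}"
        )
    for start in range(0, table_len, chunk_size):
        yield start, min(start + chunk_size, table_len)
```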
    Chunks of the input table
"""
# If using entire table as one chunk, just yield the whole table
if self.chunk_size is None:
this would then be if self.chunk_size is None and self.chunk_duration is None:
    should have the same units as the input data.
"""
for col in ("mean", "median", "std"):
    if col in table.colnames:
is this check needed?
@abstractmethod
def compute_stats(self, images, masked_pixels_of_sample) -> StatisticsContainer:

def _add_result_columns(self, data, masked_elements_of_sample, results_dict):
    """
""" | |
r""" |
closes #2745