-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
customizable transform statistics #2059
Conversation
This is nice. |
These transform statistics and corpora loading log are only logged one time when begin(for load) or finish(for statistic) iterating the corpora. The log seems too frequently because: the same corpus are loaded in every producer process and iterating with stride/offset mechanism, this makes the same corpus loading I think we can disable other processes' logging and only let the first producer do the log. |
If we let only the first producer do the logging, we should have some way to aggregate the stats over all the processes, like the |
Gather stats across all producer process is a little bit complex since it requires inter-process communication or shared object and needs extra effort which may introduce additional complexity while not adding too many benefits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments.
Also, it could be nice to add some transforms to one of the integration tests, to allow a quick look at how stats logging looks like.
Largely reduced Data related loggings by:
|
Currently,
TransformStatistics
only tracks statistics of providing Transforms, we need to modify this class if we want to track statistics from customized Transforms created by ourselves, which is not convenient.In this PR, we made it possible to track statistics for customized transform.
Proposal detail
Currently, each corpus can be processed on-the-fly with a sequence of transforms. As there may have multiple corpora which may use a different combination of transforms, all transforms should only be instantiated once per process to save memory use, and each corpus has one
TransformPipe
object to perform its transform pipeline. As we want to know the statistics of transform for each corpus, we set aTransformStatistics
object as an attribute toTransformPipe
, and can be thus updated during the apply of the pipe.Before this PR, the
TransformStatistics
is hardcoded with the statistics needed for existingTransform
s, in this PR, we would like to make it a container to register and keep track of statistics we sent for flexibility.The expected behavior for this new
TransformStatistics
should be:To enable customizable statistics for different
Transform
, the object received byTransformStatistics
should have the elements to count and its corresponding update method. Therefore, we introduce theObservableStats
class. Any new transform statistics should inherit this class, and be instantiated to pass information toTransformStatistics
.Inherit
ObservableStats
for your customizeTransform
to track statisticsIf you have some statistics to track for you customize
Transform
, you need to extend theObservableStats
:__init__
update
method__str__
to change the default log message formatAs an example:
Passing
ObservableStats
object toTransformStatistics
to keep trackingWhen updating, what you need to do is passing a statistic object in current transform
apply
to theupdate
method ofTransformStatistics
object: