
customizable transform statistics #2059

Merged: 11 commits, May 7, 2021

Conversation

@Zenglinxiao (Contributor) commented on May 3, 2021:

Currently, TransformStatistics only tracks statistics for the Transforms provided with the library; if we want to track statistics from customized Transforms we create ourselves, we have to modify this class, which is not convenient.
In this PR, we make it possible to track statistics for customized transforms.

Proposal details

Currently, each corpus can be processed on the fly with a sequence of transforms. As there may be multiple corpora, each using a different combination of transforms, all transforms are instantiated only once per process to save memory, and each corpus has its own TransformPipe object to run its transform pipeline. Since we want to know the transform statistics for each corpus, we attach a TransformStatistics object to the TransformPipe as an attribute, so it can be updated while the pipe is applied.
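
To make this arrangement concrete, here is a minimal sketch of how a per-corpus pipe could thread its statistics object through each transform's apply; the attribute names (transforms, statistics) are assumptions for illustration, not the exact code of this PR:

class TransformPipe:
    """Hypothetical sketch: apply a sequence of transforms to one corpus."""

    def __init__(self, transforms, statistics):
        self.transforms = transforms    # Transform instances, shared per process
        self.statistics = statistics    # one TransformStatistics per corpus

    def apply(self, example, is_train=False, **kwargs):
        for transform in self.transforms:
            example = transform.apply(
                example, is_train=is_train, stats=self.statistics, **kwargs
            )
            if example is None:  # a transform may filter the example out
                break
        return example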

Before this PR, TransformStatistics was hardcoded with the statistics needed by the existing Transforms. In this PR, we would like to turn it into a container that registers and keeps track of whatever statistics it is sent, for flexibility.
The expected behavior of this new TransformStatistics should be (a minimal sketch follows the list):

  • Register the statistics to watch
  • Update a watched statistic when another statistic object with the same name is received
  • Reset once the statistics are reported, to enable a fresh count
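
As a rough illustration of those three behaviors, the container could be keyed by the name of each received statistic object (the ObservableStats objects described next). The method and attribute names below are illustrative assumptions, not the PR's exact API:

class TransformStatistics:
    def __init__(self):
        self.observables = {}  # statistic name -> ObservableStats instance

    def update(self, observable):
        """Register a new statistic, or merge it into the one already watched."""
        name = observable.name()
        if name in self.observables:
            self.observables[name].update(observable)
        else:
            self.observables[name] = observable

    def report(self):
        """Return a log message for all watched statistics, then reset them."""
        msg = "\n".join(str(obs) for obs in self.observables.values())
        self.observables = {}  # reset so the next reporting period starts fresh
        return msg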

To enable customizable statistics for different Transforms, the object received by TransformStatistics should carry the elements to count and a corresponding update method. Therefore, we introduce the ObservableStats class: any new transform statistic should inherit from this class and be instantiated to pass information to TransformStatistics.
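
For reference, such a base class only needs to expose a name and an update hook. The following is a hedged sketch; the actual class added by this PR may differ in details such as the name() fallback:

class ObservableStats:
    """Hypothetical sketch of a base class for observable transform statistics."""

    __slots__ = []

    def name(self) -> str:
        # Default to the concrete subclass name, e.g. "SubwordStats".
        return self.__class__.__name__

    def update(self, other: "ObservableStats"):
        raise NotImplementedError

    def __str__(self) -> str:
        return "{}({})".format(
            self.name(),
            ", ".join(
                "{}={}".format(name, getattr(self, name)) for name in self.__slots__
            ),
        )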

Inherit ObservableStats in your customized Transform to track statistics

If you have statistics to track for your customized Transform, you need to extend ObservableStats:

  1. Add the elements to keep track of in __init__
  2. Implement the update method
  3. (optional) Override __str__ to change the default log message format

As an example:

class SubwordStats(ObservableStats):
    """Running statistics for counting tokens before/after subword transform."""

    __slots__ = ["subwords", "words"]

    def __init__(self, subwords: int, words: int):
        self.subwords = subwords
        self.words = words

    def update(self, other: "SubwordStats"):
        self.subwords += other.subwords
        self.words += other.words

    def __str__(self) -> str:
        return "{}: {} -> {} tokens".format(
            self.name(), self.words, self.subwords
        )
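
For instance, assuming name() falls back to the class name as sketched earlier, merging two such objects behaves as follows (an illustrative check, not code from the PR):

stats = SubwordStats(subwords=12, words=8)
stats.update(SubwordStats(subwords=30, words=20))
print(stats)  # expected: "SubwordStats: 28 -> 42 tokens"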

Pass an ObservableStats object to TransformStatistics to keep tracking

When updating, all you need to do is pass a statistic object built during the current transform's apply to the update method of the TransformStatistics object:

class SubwordTransform(Transform):
    ...
    def apply(self, example, is_train=False, stats=None, **kwargs):
        ...
        if stats is not None:
            # stats is an instance of TransformStatistics within a corpus
            n_words = len(example['src']) + len(example['tgt'])
            n_subwords = len(src_out) + len(tgt_out)
            stats.update(SubwordStats(n_subwords, n_words))
        example['src'], example['tgt'] = src_out, tgt_out
        return example
    ...

@Zenglinxiao Zenglinxiao marked this pull request as draft May 3, 2021 15:40
@francoishernandez (Member) commented:

This is nice.
Maybe it would be good to reduce the verbosity of the stats logging as well. I don't remember the exact logic here, but these stats seem to be logged way too frequently. Maybe we should log them only at a fixed interval, e.g. like we perform validation every valid_steps steps.

@Zenglinxiao (Contributor, Author) commented:

> Maybe it would be good to reduce the verbosity of the stats logging as well. I don't remember the exact logic here, but these stats seem to be logged way too frequently. Maybe we should log them only at a fixed interval, e.g. like we perform validation every valid_steps steps.

These transform statistics and the corpus-loading logs are each emitted only once, when we begin iterating a corpus (for loading) or finish iterating it (for the statistics). The logging seems too frequent because the same corpus is loaded in every producer process and iterated with the stride/offset mechanism: each process only sees 1/world_size of the corpus, so it cycles through it (and logs) world_size times faster than a single process would, and each cycle is logged world_size times because of multiprocessing, resulting in roughly world_size * world_size times more frequent logging. For example, with world_size = 4, the same corpus-loading message can appear 16 times as often.

I think we can disable the other processes' logging and let only the first producer do the logging.

@francoishernandez (Member) commented:

If we let only the first producer do the logging, we should have some way to aggregate the stats over all the processes, like the all_gather_stats of onmt.utils.Statistics.

@Zenglinxiao Zenglinxiao marked this pull request as ready for review May 4, 2021 15:50
@Zenglinxiao (Contributor, Author) commented:

Gathering stats across all producer processes is a bit complex, since it requires inter-process communication or a shared object; that extra effort would add complexity without bringing much benefit.
Here, I add a --verbose option to the train.py script, so that only the first producer logs by default and the full log is available if we set --verbose.
The --skip_empty_level option is also used here: by default we log the name of the corpus being loaded rather than its path. We get the full corpus path if we set --skip_empty_level=error, or completely silent corpus loading if we set --skip_empty_level=silent. Since the use case is exactly the same as skip_empty_level, I reuse this option rather than creating another one.

@francoishernandez (Member) left a review comment:

A few comments.
Also, it could be nice to add some transforms to one of the integration tests, to allow a quick look at how the stats logging looks.

Review threads (outdated, resolved): onmt/inputters/corpus.py (×2), onmt/utils/distributed.py
@Zenglinxiao (Contributor, Author) commented:

Largely reduced the data-related logging by:

  1. Moving the corpus-loading info log into WeightedMixer, and logging only the number of times each corpus has been loaded so far when a new load is performed;
  2. Reporting data-related logs only from the first producer process by default, to reduce the logging volume; they can be reactivated for all producers by setting --verbose.

@francoishernandez francoishernandez merged commit d5d3c74 into OpenNMT:master May 7, 2021
@Zenglinxiao Zenglinxiao deleted the custom_stats branch May 7, 2021 16:00