
customizable transform statistics #2059

Merged: 11 commits, May 7, 2021

Conversation

@Zenglinxiao (Contributor) commented on May 3, 2021:

Currently, TransformStatistics only tracks statistics for the Transforms provided with the library; if we want to track statistics from customized Transforms we create ourselves, we have to modify this class, which is not convenient.
In this PR, we make it possible to track statistics for customized transforms.

Proposal details

Currently, each corpus can be processed on the fly with a sequence of transforms. As there may be multiple corpora, each using a different combination of transforms, all transforms are instantiated only once per process to save memory, and each corpus has its own TransformPipe object to run its transform pipeline. Since we want to know the transform statistics for each corpus, we attach a TransformStatistics object to the TransformPipe as an attribute, so it can be updated while the pipe is applied.
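
To make this arrangement concrete, here is a minimal sketch of how a per-corpus pipe could thread its statistics object through each transform's apply; the attribute names (transforms, statistics) are assumptions for illustration, not the exact code of this PR:

class TransformPipe:
    """Hypothetical sketch: apply a sequence of transforms to one corpus."""

    def __init__(self, transforms, statistics):
        self.transforms = transforms    # Transform instances, shared per process
        self.statistics = statistics    # one TransformStatistics per corpus

    def apply(self, example, is_train=False, **kwargs):
        for transform in self.transforms:
            example = transform.apply(
                example, is_train=is_train, stats=self.statistics, **kwargs
            )
            if example is None:  # a transform may filter the example out
                break
        return example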

Before this PR, TransformStatistics was hardcoded with the statistics needed by the existing Transforms. In this PR, we would like to turn it into a container that registers and keeps track of whatever statistics it is sent, for flexibility.
The expected behavior of this new TransformStatistics should be (a minimal sketch follows the list):

  • Register the statistics to watch
  • Update a watched statistic when another statistic object with the same name is received
  • Reset once the statistics are reported, to enable a fresh count
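
As a rough illustration of those three behaviors, the container could be keyed by the name of each received statistic object (the ObservableStats objects described next). The method and attribute names below are illustrative assumptions, not the PR's exact API:

class TransformStatistics:
    def __init__(self):
        self.observables = {}  # statistic name -> ObservableStats instance

    def update(self, observable):
        """Register a new statistic, or merge it into the one already watched."""
        name = observable.name()
        if name in self.observables:
            self.observables[name].update(observable)
        else:
            self.observables[name] = observable

    def report(self):
        """Return a log message for all watched statistics, then reset them."""
        msg = "\n".join(str(obs) for obs in self.observables.values())
        self.observables = {}  # reset so the next reporting period starts fresh
        return msg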

To enable customizable statistics for different Transforms, the object received by TransformStatistics should carry the elements to count and a corresponding update method. Therefore, we introduce the ObservableStats class: any new transform statistic should inherit from this class and be instantiated to pass information to TransformStatistics.
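
For reference, such a base class only needs to expose a name and an update hook. The following is a hedged sketch; the actual class added by this PR may differ in details such as the name() fallback:

class ObservableStats:
    """Hypothetical sketch of a base class for observable transform statistics."""

    __slots__ = []

    def name(self) -> str:
        # Default to the concrete subclass name, e.g. "SubwordStats".
        return self.__class__.__name__

    def update(self, other: "ObservableStats"):
        raise NotImplementedError

    def __str__(self) -> str:
        return "{}({})".format(
            self.name(),
            ", ".join(
                "{}={}".format(name, getattr(self, name)) for name in self.__slots__
            ),
        )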

Inherit ObservableStats in your customized Transform to track statistics

If you have statistics to track for your customized Transform, you need to extend ObservableStats:

  1. Add the elements to keep track of in __init__
  2. Implement the update method
  3. (optional) Override __str__ to change the default log message format

As an example:

class SubwordStats(ObservableStats):
    """Running statistics for counting tokens before/after subword transform."""

    __slots__ = ["subwords", "words"]

    def __init__(self, subwords: int, words: int):
        self.subwords = subwords
        self.words = words

    def update(self, other: "SubwordStats"):
        self.subwords += other.subwords
        self.words += other.words

    def __str__(self) -> str:
        return "{}: {} -> {} tokens".format(
            self.name(), self.words, self.subwords
        )
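
For instance, assuming name() falls back to the class name as sketched earlier, merging two such objects behaves as follows (an illustrative check, not code from the PR):

stats = SubwordStats(subwords=12, words=8)
stats.update(SubwordStats(subwords=30, words=20))
print(stats)  # expected: "SubwordStats: 28 -> 42 tokens"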

Pass an ObservableStats object to TransformStatistics to keep tracking

When updating, all you need to do is pass a statistic object built during the current transform's apply to the update method of the TransformStatistics object:

class SubwordTransform(Transform):
    ...
    def apply(self, example, is_train=False, stats=None, **kwargs):
        ...
        if stats is not None:
            # stats is an instance of TransformStatistics within a corpus
            n_words = len(example['src']) + len(example['tgt'])
            n_subwords = len(src_out) + len(tgt_out)
            stats.update(SubwordStats(n_subwords, n_words))
        example['src'], example['tgt'] = src_out, tgt_out
        return example
    ...

@Zenglinxiao Zenglinxiao marked this pull request as draft May 3, 2021 15:40
@francoishernandez (Member) commented:

This is nice.
Maybe it would be good to reduce the verbosity of the stats logging as well. I don't remember the exact logic here, but these stats seem to be logged way too frequently. Maybe we should log them only at a fixed interval, e.g. like we perform validation every valid_steps steps.

@Zenglinxiao (Contributor, Author) commented:

> Maybe it would be good to reduce the verbosity of the stats logging as well. I don't remember the exact logic here, but these stats seem to be logged way too frequently. Maybe we should log them only at a fixed interval, e.g. like we perform validation every valid_steps steps.

These transform statistics and the corpus-loading logs are each emitted only once, when we begin iterating a corpus (for loading) or finish iterating it (for the statistics). The logging seems too frequent because the same corpus is loaded in every producer process and iterated with the stride/offset mechanism: each process only sees 1/world_size of the corpus, so it cycles through it (and logs) world_size times faster than a single process would, and each cycle is logged world_size times because of multiprocessing, resulting in roughly world_size * world_size times more frequent logging. For example, with world_size = 4, the same corpus-loading message can appear 16 times as often.

I think we can disable the other processes' logging and let only the first producer do the logging.

@francoishernandez (Member) commented:

If we let only the first producer do the logging, we should have some way to aggregate the stats over all the processes, like the all_gather_stats of onmt.utils.Statistics.

@Zenglinxiao Zenglinxiao marked this pull request as ready for review May 4, 2021 15:50
@Zenglinxiao (Contributor, Author) commented:

Gathering stats across all producer processes is a bit complex, since it requires inter-process communication or a shared object; that extra effort would add complexity without bringing much benefit.
Here, I add a --verbose option to the train.py script, so that only the first producer logs by default and the full log is available if we set --verbose.
The --skip_empty_level option is also used here: by default we log the name of the corpus being loaded rather than its path. We get the full corpus path if we set --skip_empty_level=error, or completely silent corpus loading if we set --skip_empty_level=silent. Since the use case is exactly the same as skip_empty_level, I reuse this option rather than creating another one.

@francoishernandez (Member) left a review comment:

A few comments.
Also, it could be nice to add some transforms to one of the integration tests, to allow a quick look at how the stats logging looks.

Review threads (outdated, resolved): onmt/inputters/corpus.py (×2), onmt/utils/distributed.py
@Zenglinxiao (Contributor, Author) commented:

Largely reduced the data-related logging by:

  1. Moving the corpus-loading info log into WeightedMixer, and logging only the number of times each corpus has been loaded so far when a new load is performed;
  2. Reporting data-related logs only from the first producer process by default, to reduce the logging volume; they can be reactivated for all producers by setting --verbose.

@francoishernandez francoishernandez merged commit d5d3c74 into OpenNMT:master May 7, 2021
@Zenglinxiao Zenglinxiao deleted the custom_stats branch May 7, 2021 16:00