
translate.py ignores sentences in src when tgt not provided #1317

Closed
@fdalvi


When no tgt file is provided, translate.py silently stops after the first shard_size*shard_size sentences of src and ignores everything after them. The cause is in

OpenNMT-py/translate.py

Lines 19 to 20 in 9865443

tgt_shards = split_corpus(opt.tgt, opt.shard_size) \
    if opt.tgt is not None else [None]*opt.shard_size

Essentially, if no tgt is provided, tgt_shards becomes a list of shard_size None values, while src_shards is a generator that yields ceil(num_lines/shard_size) shards. Once the number of source shards exceeds shard_size, zip stops as soon as tgt_shards is exhausted, so the remaining elements of src_shards are never translated. An example might make this clearer:

shard_size=2
src:

a
b
c
d
e
f

tgt: None

In this case, the following shards are computed

OpenNMT-py/translate.py

Lines 18 to 21 in 9865443

src_shards = split_corpus(opt.src, opt.shard_size)
tgt_shards = split_corpus(opt.tgt, opt.shard_size) \
    if opt.tgt is not None else [None]*opt.shard_size
shard_pairs = zip(src_shards, tgt_shards)

src_shards: generator([[a,b],[c,d],[e,f]])
tgt_shards: [None, None]

shard_pairs: zip(src_shards, tgt_shards) ==> [ ([a,b], None), ([c,d], None) ]

[e,f] is completely ignored, since there is no corresponding element to zip in tgt_shards.
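
The truncation can be reproduced outside of translate.py with a few lines. This is only a sketch: `split_corpus_sketch` is a hypothetical stand-in that shards an in-memory list, assuming the real `split_corpus` yields consecutive chunks of at most shard_size lines.

```python
def split_corpus_sketch(lines, shard_size):
    """Hypothetical stand-in for split_corpus: yield chunks of at most shard_size lines."""
    for start in range(0, len(lines), shard_size):
        yield lines[start:start + shard_size]


shard_size = 2
src = ["a", "b", "c", "d", "e", "f"]

src_shards = split_corpus_sketch(src, shard_size)
# Mirrors the current fallback when no tgt is given: a list of length shard_size
tgt_shards = [None] * shard_size

# zip stops at the shorter input, so the third source shard is never paired
buggy_pairs = list(zip(src_shards, tgt_shards))
print(buggy_pairs)
```

With shard_size=2 the placeholder list has two entries, so only two of the three source shards are paired.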

The bug is that the placeholder tgt_shards should have num_shards entries rather than shard_size entries, but since we don't read the entire file up front, we don't know what num_shards is at this point.

A potential solution is to make tgt_shards an infinite generator of None when tgt is None. The zip will then be limited by the number of source shards, which is what we want.
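
One way the infinite None generator could look, using `itertools.repeat` from the standard library. This is a sketch under assumptions, not the actual patch: `split_corpus_fixed_sketch` and `make_shard_pairs` are hypothetical names, and the sharding helper again works on in-memory lists rather than files.

```python
from itertools import repeat


def split_corpus_fixed_sketch(lines, shard_size):
    """Hypothetical stand-in for split_corpus: yield chunks of at most shard_size lines."""
    for start in range(0, len(lines), shard_size):
        yield lines[start:start + shard_size]


def make_shard_pairs(src, tgt, shard_size):
    """Pair source shards with target shards, or with None when tgt is absent."""
    src_shards = split_corpus_fixed_sketch(src, shard_size)
    # repeat(None) never runs out, so zip is bounded by src_shards alone
    tgt_shards = (split_corpus_fixed_sketch(tgt, shard_size)
                  if tgt is not None else repeat(None))
    return zip(src_shards, tgt_shards)


fixed_pairs = list(make_shard_pairs(["a", "b", "c", "d", "e", "f"], None, 2))
print(fixed_pairs)
```

Because zip still stops at the shorter input, behaviour with a real tgt is unchanged; only the tgt-less case now covers every source shard.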

If this all makes sense, I can send in a PR. Happy to clarify and/or discuss something I might have missed!
