Description
translate.py ignores all sentences after processing shard_size*shard_size sentences in src (if no tgt file is provided). The reason for this lies in
Lines 19 to 20 in 9865443
tgt_shards = split_corpus(opt.tgt, opt.shard_size) \
    if opt.tgt is not None else [None]*opt.shard_size
Essentially, if no tgt is provided, tgt_shards becomes a list of shard_size Nones, while src_shards is a generator that yields roughly num_lines/shard_size shards (rounded up). When the number of source shards exceeds shard_size, the remaining source shards are ignored, because zip stops as soon as tgt_shards is exhausted. An example might make this clearer:
shard_size=2
src:
a
b
c
d
e
f
tgt: None
In this case, the following shards are computed
Lines 18 to 21 in 9865443
src_shards = split_corpus(opt.src, opt.shard_size)
tgt_shards = split_corpus(opt.tgt, opt.shard_size) \
    if opt.tgt is not None else [None]*opt.shard_size
shard_pairs = zip(src_shards, tgt_shards)
src_shards: generator([[a,b],[c,d],[e,f]])
tgt_shards: [None, None]
shard_pairs: zip(src_shards, tgt_shards) ==> [ ([a,b], None), ([c,d], None) ]
[e,f] is completely ignored, since there is no corresponding element to zip in tgt_shards.
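The truncation can be reproduced outside of translate.py with a minimal sketch. Note that the split_corpus below is a hypothetical stand-in that chunks an in-memory list; the real function reads from a file path, so this only mirrors the sharding behaviour described above.

```python
def split_corpus(lines, shard_size):
    """Hypothetical stand-in: yield consecutive chunks of at most shard_size lines."""
    for i in range(0, len(lines), shard_size):
        yield lines[i:i + shard_size]


shard_size = 2
src = ["a", "b", "c", "d", "e", "f"]

src_shards = split_corpus(src, shard_size)
# Same expression translate.py uses when no tgt is given.
tgt_shards = [None] * shard_size

# zip stops at the shorter iterable, so the third source shard is never paired.
shard_pairs = list(zip(src_shards, tgt_shards))
print(shard_pairs)  # [(['a', 'b'], None), (['c', 'd'], None)]
```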
The bug is that the length of tgt_shards should be num_shards, not shard_size, but since we don't read the entire file up front, we don't know what num_shards is at this point.
A potential solution is that when tgt is None, tgt_shards becomes an infinite None generator, in which case the zip will be limited by number of source shards, which is what we want.
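A sketch of that idea, assuming itertools.repeat(None) as the infinite None generator. The helper name make_shard_pairs and the list-based split_corpus are hypothetical, introduced only for illustration; the actual change would go into the tgt_shards expression in translate.py.

```python
from itertools import repeat


def split_corpus(lines, shard_size):
    """Hypothetical stand-in: yield consecutive chunks of at most shard_size lines."""
    for i in range(0, len(lines), shard_size):
        yield lines[i:i + shard_size]


def make_shard_pairs(src, tgt, shard_size):
    """Pair each source shard with a target shard, or with None if tgt is absent."""
    src_shards = split_corpus(src, shard_size)
    # repeat(None) never runs out, so zip is bounded by src_shards alone.
    tgt_shards = split_corpus(tgt, shard_size) if tgt is not None else repeat(None)
    return zip(src_shards, tgt_shards)


pairs = list(make_shard_pairs(["a", "b", "c", "d", "e", "f"], None, 2))
print(pairs)  # [(['a', 'b'], None), (['c', 'd'], None), (['e', 'f'], None)]
```

Because zip stops at its shortest argument, the infinite iterator is safe here: iteration ends when the source shards run out.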
If this all makes sense, I can send in a PR. Happy to clarify and/or discuss something I might have missed!