About filtering duplicate relations for VG dataset #129

Open
@coldmanck

It seems to me the following code snippet doesn't work as expected:

if self.filter_duplicate_rels:
    # Filter out dupes!
    assert self.split == 'train'
    old_size = relation.shape[0]
    all_rel_sets = defaultdict(list)
    for (o0, o1, r) in relation:
        all_rel_sets[(o0, o1)].append(r)
    relation = [(k[0], k[1], np.random.choice(v)) for k,v in all_rel_sets.items()]
    relation = np.array(relation, dtype=np.int32)

I had assumed that filtering out duplicate relations means removing exactly repeated relation triplets (i.e., the subject, object, and predicate are all the same). However, this snippet keeps only a single predicate for each object pair (predicates that occur more often have a higher chance of being chosen). This seems unreasonable to me, and it makes the following snippet redundant, since each object pair then appears at most once:

if relation_map[int(relation[i,0]), int(relation[i,1])] > 0:
    if (random.random() > 0.5):
        relation_map[int(relation[i,0]), int(relation[i,1])] = int(relation[i,2])
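
To make the behavior concrete, here is a minimal sketch that reproduces the filtering logic quoted above on toy data. The function name `filter_one_per_pair`, the explicit `rng` argument, and the example triplets are my own assumptions for illustration; they are not part of the repository.

```python
from collections import defaultdict

import numpy as np


def filter_one_per_pair(relation, rng):
    """Mimic the quoted filter: keep one randomly chosen predicate
    per (subject, object) pair. Hypothetical helper for illustration."""
    all_rel_sets = defaultdict(list)
    for o0, o1, r in relation:
        all_rel_sets[(int(o0), int(o1))].append(int(r))
    # Predicates that occur more often in the list are more likely to be drawn.
    kept = [(k[0], k[1], rng.choice(v)) for k, v in all_rel_sets.items()]
    return np.array(kept, dtype=np.int32)


# Toy triplets (subject, object, predicate): pair (0, 1) carries
# predicate 5 twice and predicate 7 once; pair (2, 3) carries predicate 4.
relation = np.array(
    [[0, 1, 5], [0, 1, 5], [0, 1, 7], [2, 3, 4]], dtype=np.int32
)

filtered = filter_one_per_pair(relation, np.random.default_rng(0))
pairs = [(int(s), int(o)) for s, o, _ in filtered]

print(filtered.shape)                 # one row per distinct object pair
print(len(pairs) == len(set(pairs)))  # no pair appears twice
```

On this toy input, either predicate 5 or predicate 7 is dropped for the pair (0, 1), and no pair survives more than once, so a later check for an already-filled pair entry would have nothing to act on.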

To accommodate multiple labels for each object pair, I think we have to change L148-L156 to the following:

if self.filter_duplicate_rels:
    # Filter out dupes!
    assert self.split == 'train'
    old_size = relation.shape[0]
    all_rel_sets = defaultdict(set)
    for (o0, o1, r) in relation:
        all_rel_sets[(o0, o1)].add(r)
    relation = [(k[0], k[1], v) for k, vs in all_rel_sets.items() for v in vs]
    relation = np.array(relation, dtype=np.int32)
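
As a sanity check of the proposed change, here is a small self-contained sketch run on toy data. The function name `filter_exact_duplicates` and the example triplets are hypothetical. The `np.unique(..., axis=0)` line is an alternative I am suggesting for comparison, not something taken from the repository; note that it returns the rows in sorted order.

```python
from collections import defaultdict

import numpy as np


def filter_exact_duplicates(relation):
    """Drop only exactly repeated (subject, object, predicate) triplets,
    keeping every distinct predicate for each object pair.
    Hypothetical helper for illustration."""
    all_rel_sets = defaultdict(set)
    for o0, o1, r in relation:
        all_rel_sets[(int(o0), int(o1))].add(int(r))
    # sorted() only makes the output order deterministic.
    kept = [(k[0], k[1], v) for k, vs in all_rel_sets.items() for v in sorted(vs)]
    return np.array(kept, dtype=np.int32)


# Toy triplets: (0, 1, 5) is repeated; (0, 1, 7) is a second label for the same pair.
relation = np.array(
    [[0, 1, 5], [0, 1, 5], [0, 1, 7], [2, 3, 4]], dtype=np.int32
)

deduped = filter_exact_duplicates(relation)

# Alternative: NumPy can deduplicate whole rows directly.
deduped_unique = np.unique(relation, axis=0)

print(deduped.tolist())
print(deduped_unique.tolist())
```

Both variants keep (0, 1, 5) and (0, 1, 7) while dropping the repeated (0, 1, 5), which is the multi-label behavior described above.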
