Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deduplication helpers for lexicon modification jobs #458

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

Icemole
Copy link
Collaborator

@Icemole Icemole commented Oct 27, 2023

This PR introduces some improvements to how the lemma duplication is handled in lexicon.conversion.LexiconUniqueOrthJob and lexicon.modification.MergeLexiconJob.

For the former, I added a flag that makes special lemmas go through the unique lemma creation. In my view this is the most relevant part of the PR.

For the latter, I added a flag to deduplicate any duplicated lemmas that might appear as a result of the merging. This change might be debatable since MergeLexiconJob combined with LexiconUniqueOrthJob has the same result. I can remove this flag if it's too redundant; let me know about this.

Finally, I also added a Lemma._equals method and overwrote == and != for the Lemma class, which I think might be useful in the long term (I actually used this in MergeLexiconJob).

lexicon/conversion.py Outdated Show resolved Hide resolved
lexicon/modification.py Outdated Show resolved Hide resolved
lib/lexicon.py Outdated Show resolved Hide resolved
lib/lexicon.py Outdated
@@ -104,6 +107,42 @@ def from_element(cls, e):
synt = None if not synt else synt[0]
return Lemma(orth, phon, synt, eval, special)

def _equals(self, other: Lemma, same_order: bool = True) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the default same_order=True here, but the only usage of this function actually uses same_order=False?

The default should reflect the most reasonable expected common usage, or if this might be ambiguous, there should not be a default.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, probably the most common usage of equality would be checking without explicit order. At least I think it that way. What would other reviewers expect from the default of this function?

)

def __eq__(self, other: Lemma) -> bool:
return self._equals(other, same_order=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you want same_order=False here?

It probably depends on the use case of __eq__ here (what actually is your use case?) but it feels kind of arbitrary.

Copy link
Collaborator Author

@Icemole Icemole Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be discussed, but in my view a lemma is equal to another even if it has different orth order, or different pronunciation order. This is why it's set to False here.

Edit: the use case is basically the one specified in MergeLexiconJob, but I think it's reasonable to think that a lemma with two pronunciations X, Y is the same lemma as any other lemma with the same pronunciations Y, X (assuming the rest of the attributes are the same as well), even if X and Y are not inserted in the exact same order.

I think not enforcing a strict ordering is the best way of comparing two lemmas, but I would understand that some users might want strict ordering comparison, which is why I added the same_order flag. Then, the user who wants to enforce strict ordering can directly call _equals.

Copy link
Member

@albertz albertz Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends on the use case. Or if you say, only same_order=False makes sense anyway, then why is this an option at all for _equals?

What actually is your use case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My use case is the following: in MergeLexiconJob I added functionality (present in this PR) that will remove "duplicated" lemmas. According to my definition, duplicated lemmas are those lemmas that have the same orthographies and pronunciations. Note that my definition implies no ordering, so I would be fine with same_order=False.

However, I wanted to give the freedom to let the user be able to decide whether the order in their orths and prons matters or not. Maybe some users might want to store the orths and prons of a lexicon in lexicographical order, and thus the order would matter. Maybe some other users already have a specific ordering, and therefore the addition of a lemma with a different order helps them notice that something's wrong with their pipeline.

Summarizing: same_order=False is enough for my needs/use case, but I added this flag since I was unsure of the other use cases for this function, since each user might have a particular use case.

If you think it won't be used and only ==/!= will be used as I have been using it (with same_order=False), let me know. I won't use this flag, and if keeping it or removing it will spark a debate, I'd rather just remove it now and let another user, who might find it more necessary, implement it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if you think there is no clear non-ambiguous definition of an equal-relation, thus having such flag makes sense, then I'm not sure if defining __eq__ makes sense here. Or usually __eq__ should always be most restrictive, i.e. use same_order=True.

Now __eq__ is very arbitrary, supporting exactly your case, but as you say yourself, there might be other cases where the equality should only be with same_order=True.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, __eq__ matches my use case but it doesn't cover the most restrictive use case.

What do others think? Is it too overkill to have same_order to test lemma equality? Should we just have the case of "two lemmas are equal whenever they contain the same data, regardless of their order"?

At every comment I'm more inclined to remove same_order and just check for what I said above, since I think no realistic use case would consider the ordering of data inside a lemma... At least I can't think of any specific use case for this...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, we should not add code/logic for cases which are only hypothetical and not used currently, but only for things we really are using currently. Thus only the logic same_order=False. And then __eq__ is also non-ambiguous.

But let's wait what other think about this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider how it is treated in RASR:

  • phon: order does not matter; all are hypothesized and best one is selected
  • orth:
    • training: order does not matter
    • recognition: the first orth is the one that is printed in the output (if no eval is given) and used for LM purposes (if no synt is given)
  • synt: order does matter, as it defines the order used for the LM
  • eval: order does matter, as it defines the order the tokens are printed for output

So for phon, synt, and eval it is (in my opinion and as far as my understanding goes) clear what to do. For orth, which is probably mostly what you are interested in, it is not immediately obvious if the order should be considered.

Icemole and others added 3 commits October 27, 2023 17:11
Co-authored-by: Albert Zeyer <zeyer@cs.rwth-aachen.de>
Co-authored-by: Albert Zeyer <zeyer@cs.rwth-aachen.de>
Co-authored-by: Albert Zeyer <zeyer@cs.rwth-aachen.de>
@@ -178,6 +184,10 @@ def run(self):
for lemma in lex.lemmata:
# sort by first orth entry
orth_key = lemma.orth[0] if lemma.orth else ""
if self.deduplicate_lemmata:
# don't add the lemma when there's already an equal lemma
if len(lemma_dict[orth_key]) > 0 and lemma == lemma_dict[orth_key][0]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize, this == uses your definition below. Maybe you should use is_equal(other, same_order=False) here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is the use case what I was referring to, I might not have made this clear.

Right now I'm even more inclined to remove the same_order flag altogether and just keep same_order=False as the default. I'm waiting for what the rest of the reviewers might have to say about this.

In general, thinking about use cases as you said, I couldn't think of any actual use case for same_order=True that wouldn't be covered by same_order=False, only hypothetical situations in which the user might want the same order, but I don't think that would have any sort of impact on the final lexicon.

@@ -119,11 +119,14 @@ def __init__(self, bliss_lexicon, merge_multi_orths_lemmata=False):
** having a synt <=> synt is not None
this could lead to INFORMATION LOSS if there are several
different synt token sequences in the to-be-merged lemmata
:param bool deduplicate_special_lemmata: if True, special lemmata will also be considered
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think there are possibly cases where same_order=True makes sense, maybe this should not just be a bool but sth like deduplicate_special_lemmata_type or deduplicate_special_lemmata_opts?

Copy link
Contributor

@JackTemaki JackTemaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to have the option of same_order in there, as there are some setups that just pick the first phon sequence, and then the order matters.

Thus _equals should actually be equals, so that it is not private but is supposed to be used outside.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants