Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add V2 version of the Corpus class #544

Merged
merged 15 commits into from
Oct 10, 2024
Merged

Add V2 version of the Corpus class #544

merged 15 commits into from
Oct 10, 2024

Conversation

Icemole
Copy link
Collaborator

@Icemole Icemole commented Sep 24, 2024

The class CorpusV2 allows O(1) access to any segment or recording.

It might be a naive implementation, but I think a concept similar to this should be enough for our needs of accessing any segment in O(subcorpus_depth), which is practically O(1) in our cases.

It's a draft so any feedback and changes will be appreciated.

Allows O(1) access to any segment or recording
Typing improvements, fixes on accessing segment/recording by full name
@Icemole Icemole changed the title [Draft] Add V2 version of the Corpus class Add V2 version of the Corpus class Sep 24, 2024
lib/corpus.py Outdated Show resolved Hide resolved
lib/corpus.py Outdated Show resolved Hide resolved
lib/corpus.py Outdated
Comment on lines 318 to 319
self.subcorpora: Dict[str, Corpus] = {}
self.recordings: Dict[str, RecordingV2] = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about we add the new Dict members under a different name and then add a property like

@property
def subcorpora(self):
    return list(self._subcorpora.values())

@subcorpora.setter
def subcorpora(self, values):
    for sc in values:
        self._subcorpora[sc.name] = sc

so that most previous code would also work with V2. Also many functions below would then not need to be re-implemented.

Copy link
Collaborator Author

@Icemole Icemole Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my view, we shouldn't restrain ourselves to using V1 code anymore, since it leads the user to keep using retrocompatible code instead of striving for optimal performance. If the user needs a Corpus, then they should build a Corpus, and should only build a CorpusV2 if they know what they're doing.

If you still think that it would be a nice addition, please let me know, we can debate it or I can simply support this.

Anyway, I expect that the CorpusV2 class makes sense only in very specific scenarios, such as when having to access segments in O(1). Otherwise, the Corpus class still makes sense as a repository of subcorpora, recordings and segments, since the functionality in V2 won't need to be used by many users.

Edit: note that the CorpusV2 has some memory redundancy from the dictionary indexed by name and the object in memory having the name as well. So it's a bit of a tradeoff, more memory for less access time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my view, we shouldn't restrain ourselves to using V1 code anymore

This would be another argument for not inheriting from Corpus. Maybe a Corpus-like implementation is not even necessary. As you said

I expect that the CorpusV2 class makes sense only in very specific scenarios,

Maybe for these scenarios a simple data structure or even just a

segment_map = { s.fullname(): s for s in some_corpus.all_segments() }

would be sufficient to fulfill their performance needs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes a lot of sense. This implementation was originally thought in order to get the orthography of a given segment in O(1), which was the issue we were facing. @michelwi do you think returning a segment map from full names to segments would do the trick for your specific use case (I don't know all the details)?

@albertz and the rest of the reviewers, what's your take on this? Do you think the functionality here is too redundant? Our use case could be covered by a small function (or two, if you want to get recordings in O(1) as well) in the Corpus class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the memory redundancy, I don't think there is really any problem. It will only have duplicated some of the pointers. You probably won't even notice this.

I haven't really checked too much, but if you can just put your extension into the Corpus class itself, or rewrite Corpus such that it stays compatible, I think I would prefer that over having a CorpusV2.

lib/corpus.py Outdated
Comment on lines 330 to 352
def get_segment_by_name(self, name: str) -> Segment:
"""
:return: the segment specified by its name
"""
for seg in self.segments():
if seg.name == name:
return seg
assert False, f"Segment '{name}' was not found in corpus"

def get_segment_by_full_name(self, name: str) -> Optional[Segment]:
"""
:return: the segment specified by its full name
"""
if name == "":
# Found nothing.
return None

if name in self.segments:
return self.segments[name]
else:
subcorpus_name = name.split("/")[0]
segment_name_from_subcorpus = name[len(f"{subcorpus_name}/"):]
return self.subcorpora[subcorpus_name].get_segment_by_full_name(segment_name_from_subcorpus)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not like that the functions get_segment_by_{full_,}name differ in the efficiency of the implementation and also in the behavior if a segment is not found.
Also the get_segment_by_full_name is recursive and then the name that is passed is not the full_name any more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should therefore get rid of get_segment_by_name as it currently is.

lib/corpus.py Outdated
Comment on lines 347 to 351
if name in self.segments:
return self.segments[name]
else:
subcorpus_name = name.split("/")[0]
segment_name_from_subcorpus = name[len(f"{subcorpus_name}/"):]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can already determine the location of the segment within the subcorpora/recordings by len(name.split("/")) so we could indeed loop over the name and use e.g.

L = len(name.split('/'))
active_element = self
for i,n in enumerate(name.split('/')):
    if L-i > 2:
        active_element = active_element.subcorpora[n]
    elif L-i > 1:
        active_element = active_element.recordings[n]
    else:
        return  active_element.segments[n]

to directly navigate to the segment.

But a recursive implementation is also fine (an maybe cleaner == nicer) if done nicely :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like better the recursive implementation, but let me know if you prefer the iterative one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iterative is always to be preferred for Python.

lib/corpus.py Outdated Show resolved Hide resolved
lib/corpus.py Outdated Show resolved Hide resolved
lib/corpus.py Outdated Show resolved Hide resolved
lib/corpus.py Outdated Show resolved Hide resolved
lib/corpus.py Outdated Show resolved Hide resolved
@Icemole
Copy link
Collaborator Author

Icemole commented Sep 24, 2024

As I stated in a comment above, note that the CorpusV2 has some memory redundancy from the dictionary indexed by name and the object in memory having the name as well. So it's a bit of a tradeoff, more memory for less access time.

I still expect the Corpus class to be useful as a repository of subcorpora, recordings and segments, for users who don't want fast access to segments and want to, for instance, filter all segments according to a criterion or iterate over all recordings (which are pretty common use cases, at least from my side).

get_segment_map allows for more explicit control to the user
@Icemole
Copy link
Collaborator Author

Icemole commented Sep 25, 2024

Copy-pasting from a comment above:

Maybe for these scenarios a simple data structure or even just a segment_map = { s.fullname(): s for s in some_corpus.all_segments() } would be sufficient to fulfill their performance needs.

Yes, that makes a lot of sense. This implementation was originally thought in order to get the orthography of a given segment in O(1), which was the issue we were facing. @michelwi do you think returning a segment map from full names to segments would do the trick for your specific use case (I don't know all the details)?

@albertz and the rest of the reviewers, what's your take on this? Do you think the functionality here is too redundant? Our use case could be covered by a small function (or two, if you want to get recordings in O(1) as well) in the Corpus class.

lib/corpus.py Outdated Show resolved Hide resolved
lib/corpus.py Outdated Show resolved Hide resolved
Co-authored-by: michelwi <michelwi@users.noreply.github.com>
@Icemole Icemole requested a review from michelwi October 4, 2024 16:03
@Icemole
Copy link
Collaborator Author

Icemole commented Oct 10, 2024

@albertz I cannot merge while you're requesting changes. Could you please review the PR whenever you have time? Thanks!

@Icemole Icemole merged commit 229d490 into main Oct 10, 2024
4 checks passed
@Icemole Icemole deleted the add-corpus-v2 branch October 10, 2024 13:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants