-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add V2 version of the Corpus class #544
Conversation
Allows O(1) access to any segment or recording
Typing improvements, fixes on accessing segment/recording by full name
lib/corpus.py
Outdated
self.subcorpora: Dict[str, Corpus] = {} | ||
self.recordings: Dict[str, RecordingV2] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about we add the new Dict members under a different name and then add a property like
@property
def subcorpora(self):
return list(self._subcorpora.values())
@subcorpora.setter
def subcorpora(self, values):
for sc in values:
self._subcorpora[sc.name] = sc
so that most previous code would also work with V2. Also many functions below would then not need to be re-implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my view, we shouldn't restrain ourselves to using V1 code anymore, since it leads the user to keep using retrocompatible code instead of striving for optimal performance. If the user needs a Corpus
, then they should build a Corpus
, and should only build a CorpusV2
if they know what they're doing.
If you still think that it would be a nice addition, please let me know, we can debate it or I can simply support this.
Anyway, I expect that the CorpusV2
class makes sense only in very specific scenarios, such as when having to access segments in O(1). Otherwise, the Corpus
class still makes sense as a repository of subcorpora, recordings and segments, since the functionality in V2
won't need to be used by many users.
Edit: note that the CorpusV2
has some memory redundancy from the dictionary indexed by name and the object in memory having the name as well. So it's a bit of a tradeoff, more memory for less access time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my view, we shouldn't restrain ourselves to using V1 code anymore
This would be another argument for not inheriting from Corpus. Maybe a Corpus-like implementation is not even necessary. As you said
I expect that the CorpusV2 class makes sense only in very specific scenarios,
Maybe for these scenarios a simple data structure or even just a
segment_map = { s.fullname(): s for s in some_corpus.all_segments() }
would be sufficient to fulfill their performance needs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that makes a lot of sense. This implementation was originally thought in order to get the orthography of a given segment in O(1), which was the issue we were facing. @michelwi do you think returning a segment map from full names to segments would do the trick for your specific use case (I don't know all the details)?
@albertz and the rest of the reviewers, what's your take on this? Do you think the functionality here is too redundant? Our use case could be covered by a small function (or two, if you want to get recordings in O(1) as well) in the Corpus
class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About the memory redundancy, I don't think there is really any problem. It will only have duplicated some of the pointers. You probably won't even notice this.
I haven't really checked too much, but if you can just put your extension into the Corpus
class itself, or rewrite Corpus
such that it stays compatible, I think I would prefer that over having a CorpusV2
.
lib/corpus.py
Outdated
def get_segment_by_name(self, name: str) -> Segment: | ||
""" | ||
:return: the segment specified by its name | ||
""" | ||
for seg in self.segments(): | ||
if seg.name == name: | ||
return seg | ||
assert False, f"Segment '{name}' was not found in corpus" | ||
|
||
def get_segment_by_full_name(self, name: str) -> Optional[Segment]: | ||
""" | ||
:return: the segment specified by its full name | ||
""" | ||
if name == "": | ||
# Found nothing. | ||
return None | ||
|
||
if name in self.segments: | ||
return self.segments[name] | ||
else: | ||
subcorpus_name = name.split("/")[0] | ||
segment_name_from_subcorpus = name[len(f"{subcorpus_name}/"):] | ||
return self.subcorpora[subcorpus_name].get_segment_by_full_name(segment_name_from_subcorpus) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not like that the functions get_segment_by_{full_,}name
differ in the efficiency of the implementation and also in the behavior if a segment is not found.
Also the get_segment_by_full_name
is recursive and then the name
that is passed is not the full_name any more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should therefore get rid of get_segment_by_name
as it currently is.
lib/corpus.py
Outdated
if name in self.segments: | ||
return self.segments[name] | ||
else: | ||
subcorpus_name = name.split("/")[0] | ||
segment_name_from_subcorpus = name[len(f"{subcorpus_name}/"):] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can already determine the location of the segment within the subcorpora/recordings by len(name.split("/"))
so we could indeed loop over the name and use e.g.
L = len(name.split('/'))
active_element = self
for i,n in enumerate(name.split('/')):
if L-i > 2:
active_element = active_element.subcorpora[n]
elif L-i > 1:
active_element = active_element.recordings[n]
else:
return active_element.segments[n]
to directly navigate to the segment.
But a recursive implementation is also fine (an maybe cleaner == nicer) if done nicely :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like better the recursive implementation, but let me know if you prefer the iterative one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iterative is always to be preferred for Python.
As I stated in a comment above, note that the I still expect the |
get_segment_map allows for more explicit control to the user
Copy-pasting from a comment above:
Yes, that makes a lot of sense. This implementation was originally thought in order to get the orthography of a given segment in O(1), which was the issue we were facing. @michelwi do you think returning a segment map from full names to segments would do the trick for your specific use case (I don't know all the details)? @albertz and the rest of the reviewers, what's your take on this? Do you think the functionality here is too redundant? Our use case could be covered by a small function (or two, if you want to get recordings in O(1) as well) in the Corpus class. |
Co-authored-by: michelwi <michelwi@users.noreply.github.com>
@albertz I cannot merge while you're requesting changes. Could you please review the PR whenever you have time? Thanks! |
The class
CorpusV2
allowsO(1)
access to any segment or recording.It might be a naive implementation, but I think a concept similar to this should be enough for our needs of accessing any segment in
O(subcorpus_depth)
, which is practicallyO(1)
in our cases.It's a draft so any feedback and changes will be appreciated.