Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add V2 version of the Corpus class #544
Add V2 version of the Corpus class #544
Changes from 1 commit
e12a275
5bd8bc7
f3188b3
e045629
824ffd1
92c646c
7b7f4b8
bfd6bbc
230585b
224b618
81582f4
a10e427
17c96b8
02722e0
120cc76
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about we add the new Dict members under a different name and then add a property like
so that most previous code would also work with V2. Also many functions below would then not need to be re-implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my view, we shouldn't restrain ourselves to using V1 code anymore, since it leads the user to keep using retrocompatible code instead of striving for optimal performance. If the user needs a
Corpus
, then they should build aCorpus
, and should only build aCorpusV2
if they know what they're doing.If you still think that it would be a nice addition, please let me know, we can debate it or I can simply support this.
Anyway, I expect that the
CorpusV2
class makes sense only in very specific scenarios, such as when having to access segments in O(1). Otherwise, theCorpus
class still makes sense as a repository of subcorpora, recordings and segments, since the functionality inV2
won't need to be used by many users.Edit: note that the
CorpusV2
has some memory redundancy from the dictionary indexed by name and the object in memory having the name as well. So it's a bit of a tradeoff, more memory for less access time.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be another argument for not inheriting from Corpus. Maybe a Corpus-like implementation is not even necessary. As you said
Maybe for these scenarios a simple data structure or even just a
would be sufficient to fulfill their performance needs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that makes a lot of sense. This implementation was originally thought in order to get the orthography of a given segment in O(1), which was the issue we were facing. @michelwi do you think returning a segment map from full names to segments would do the trick for your specific use case (I don't know all the details)?
@albertz and the rest of the reviewers, what's your take on this? Do you think the functionality here is too redundant? Our use case could be covered by a small function (or two, if you want to get recordings in O(1) as well) in the
Corpus
class.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About the memory redundancy, I don't think there is really any problem. It will only have duplicated some of the pointers. You probably won't even notice this.
I haven't really checked too much, but if you can just put your extension into the
Corpus
class itself, or rewriteCorpus
such that it stays compatible, I think I would prefer that over having aCorpusV2
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can already determine the location of the segment within the subcorpora/recordings by
len(name.split("/"))
so we could indeed loop over the name and use e.g.to directly navigate to the segment.
But a recursive implementation is also fine (an maybe cleaner == nicer) if done nicely :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like better the recursive implementation, but let me know if you prefer the iterative one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iterative is always to be preferred for Python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not like that the functions
get_segment_by_{full_,}name
differ in the efficiency of the implementation and also in the behavior if a segment is not found.Also the
get_segment_by_full_name
is recursive and then thename
that is passed is not the full_name any more.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should therefore get rid of
get_segment_by_name
as it currently is.