Add V2 version of the Corpus class #544

Icemole · 2024-09-24T13:48:46Z

The class CorpusV2 allows O(1) access to any segment or recording.

It might be a naive implementation, but I think a concept similar to this should be enough for our needs of accessing any segment in O(subcorpus_depth), which is practically O(1) in our cases.

It's a draft so any feedback and changes will be appreciated.

michelwi · 2024-09-24T14:11:40Z

lib/corpus.py

+        self.subcorpora: Dict[str, Corpus] = {}
+        self.recordings: Dict[str, RecordingV2] = {}


How about we add the new Dict members under a different name and then add a property like

    @property
    def subcorpora(self):
        return list(self._subcorpora.values())

    @subcorpora.setter
    def subcorpora(self, values):
        for sc in values:
            self._subcorpora[sc.name] = sc

so that most previous code would also work with V2. Also many functions below would then not need to be re-implemented.
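A minimal sketch of that idea, using a toy `Corpus` stand-in rather than the real class in lib/corpus.py. The private member name `_subcorpora` follows the suggestion above; everything else is an assumption made for illustration. Unlike the snippet above, the setter here replaces the stored entries instead of merging into them, to mimic plain list assignment.

```python
from typing import Dict, Iterable, List


class Corpus:
    """Toy stand-in: only the name and the subcorpora are modeled."""

    def __init__(self, name: str = "") -> None:
        self.name = name
        # Name-indexed storage, hidden behind the list-like property below.
        self._subcorpora: Dict[str, "Corpus"] = {}

    @property
    def subcorpora(self) -> List["Corpus"]:
        # Existing callers still receive a list, in insertion order.
        return list(self._subcorpora.values())

    @subcorpora.setter
    def subcorpora(self, values: Iterable["Corpus"]) -> None:
        # Assumption: assignment replaces the previous contents.
        self._subcorpora = {sc.name: sc for sc in values}


root = Corpus("root")
root.subcorpora = [Corpus("train"), Corpus("dev")]
names = [sc.name for sc in root.subcorpora]
dev = root._subcorpora["dev"]  # direct lookup by name
```

One caveat with this pattern: the getter returns a fresh list, so code that mutates it in place (for example `root.subcorpora.append(...)`) would silently have no effect on the stored dictionary.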

Icemole · 2024-09-24 (edited)

In my view, we shouldn't restrain ourselves to using V1 code anymore, since it leads the user to keep writing backward-compatible code instead of striving for optimal performance. If the user needs a Corpus, then they should build a Corpus, and should only build a CorpusV2 if they know what they're doing.

If you still think that it would be a nice addition, please let me know, we can debate it or I can simply support this.

Anyway, I expect that the CorpusV2 class makes sense only in very specific scenarios, such as when having to access segments in O(1). Otherwise, the Corpus class still makes sense as a repository of subcorpora, recordings and segments, since the functionality in V2 won't need to be used by many users.

Edit: note that the CorpusV2 has some memory redundancy from the dictionary indexed by name and the object in memory having the name as well. So it's a bit of a tradeoff, more memory for less access time.

michelwi · 2024-09-24

> In my view, we shouldn't restrain ourselves to using V1 code anymore

This would be another argument for not inheriting from Corpus. Maybe a Corpus-like implementation is not even necessary. As you said:

> I expect that the CorpusV2 class makes sense only in very specific scenarios,

Maybe for these scenarios a simple data structure or even just a

segment_map = { s.fullname(): s for s in some_corpus.all_segments() }

would be sufficient to fulfill their performance needs.
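To make the one-liner concrete, here is a small self-contained sketch. `Segment` and `ToyCorpus` are hypothetical stand-ins, not the classes from lib/corpus.py, and the `corpus/recording/segment` layout of the full name is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Segment:
    name: str
    orth: str
    recording_name: str
    corpus_name: str

    def fullname(self) -> str:
        # Assumed layout of a full name: corpus/recording/segment.
        return f"{self.corpus_name}/{self.recording_name}/{self.name}"


@dataclass
class ToyCorpus:
    segments: List[Segment] = field(default_factory=list)

    def all_segments(self) -> Iterator[Segment]:
        return iter(self.segments)


some_corpus = ToyCorpus(
    [
        Segment("seg1", "hello world", "rec1", "demo"),
        Segment("seg2", "good bye", "rec1", "demo"),
    ]
)

# Built once in O(number of segments); each later lookup is a dict access.
segment_map: Dict[str, Segment] = {s.fullname(): s for s in some_corpus.all_segments()}
orth = segment_map["demo/rec1/seg2"].orth
```

The map is a snapshot: segments added to the corpus afterwards would not appear in it unless it is rebuilt.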

Icemole · 2024-09-25

Yes, that makes a lot of sense. This implementation was originally designed to get the orthography of a given segment in O(1), which was the issue we were facing. @michelwi do you think returning a segment map from full names to segments would do the trick for your specific use case (I don't know all the details)?

@albertz and the rest of the reviewers, what's your take on this? Do you think the functionality here is too redundant? Our use case could be covered by a small function (or two, if you want to get recordings in O(1) as well) in the Corpus class.

albertz · 2024-09-25

About the memory redundancy, I don't think there is really any problem. It only duplicates some pointers. You probably won't even notice this.

I haven't really checked too much, but if you can just put your extension into the Corpus class itself, or rewrite Corpus such that it stays compatible, I think I would prefer that over having a CorpusV2.
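The point about memory redundancy can be checked with a toy example. The `Segment` class below is a hypothetical stand-in; the sketch only shows that a dict keyed by `seg.name` stores references to the existing string and segment objects rather than copies.

```python
class Segment:
    """Toy stand-in holding only a name."""

    def __init__(self, name: str) -> None:
        self.name = name


segments = [Segment(f"seg-{i}") for i in range(3)]
by_name = {seg.name: seg for seg in segments}

# Each key is the very same str object as the segment's name attribute,
# and each value is the very same Segment object held in the list.
keys_are_shared = all(key is seg.name for key, seg in by_name.items())
values_are_shared = all(by_name[seg.name] is seg for seg in segments)
```

Note that this holds when the key is an attribute that already exists. A map keyed by a full name that is assembled on the fly would allocate one new string per entry, which is still small compared to the segments themselves in most cases.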

michelwi · 2024-09-24T14:14:51Z

lib/corpus.py

+    def get_segment_by_name(self, name: str) -> Segment:
+        """
+        :return: the segment specified by its name
+        """
+        for seg in self.segments():
+            if seg.name == name:
+                return seg
+        assert False, f"Segment '{name}' was not found in corpus"
+
+    def get_segment_by_full_name(self, name: str) -> Optional[Segment]:
+        """
+        :return: the segment specified by its full name
+        """
+        if name == "":
+            # Found nothing.
+            return None
+
+        if name in self.segments:
+            return self.segments[name]
+        else:
+            subcorpus_name = name.split("/")[0]
+            segment_name_from_subcorpus = name[len(f"{subcorpus_name}/"):]
+            return self.subcorpora[subcorpus_name].get_segment_by_full_name(segment_name_from_subcorpus)


I do not like that the functions get_segment_by_{full_,}name differ in the efficiency of the implementation and also in the behavior if a segment is not found.
Also, get_segment_by_full_name is recursive, so the name passed down in the recursive calls is no longer the full name.

I think we should therefore get rid of get_segment_by_name as it currently is.

michelwi · 2024-09-24T14:26:01Z

lib/corpus.py

+        if name in self.segments:
+            return self.segments[name]
+        else:
+            subcorpus_name = name.split("/")[0]
+            segment_name_from_subcorpus = name[len(f"{subcorpus_name}/"):]


We can already determine the location of the segment within the subcorpora/recordings by len(name.split("/")), so we could indeed loop over the name and use e.g.

    L = len(name.split('/'))
    active_element = self
    for i, n in enumerate(name.split('/')):
        if L - i > 2:
            active_element = active_element.subcorpora[n]
        elif L - i > 1:
            active_element = active_element.recordings[n]
        else:
            return active_element.segments[n]

to directly navigate to the segment.

But a recursive implementation is also fine (and maybe cleaner == nicer) if done nicely :)

Icemole · 2024-09-24

I prefer the recursive implementation, but let me know if you prefer the iterative one.

albertz · 2024-09-24

Iterative is always to be preferred for Python.
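For reference, a runnable sketch of the iterative variant discussed in this thread. The three dataclasses are toy stand-ins with dict-based members, and the function name `get_segment_by_path` is hypothetical. It assumes the path is relative to the given corpus (zero or more subcorpus names, then a recording name, then a segment name) and returns None when any component is missing.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Segment:
    name: str


@dataclass
class Recording:
    name: str
    segments: Dict[str, Segment] = field(default_factory=dict)


@dataclass
class Corpus:
    name: str
    subcorpora: Dict[str, "Corpus"] = field(default_factory=dict)
    recordings: Dict[str, Recording] = field(default_factory=dict)


def get_segment_by_path(corpus: Corpus, path: str) -> Optional[Segment]:
    parts = path.split("/")
    if len(parts) < 2:
        # At least a recording name and a segment name are required.
        return None
    *subcorpus_names, recording_name, segment_name = parts

    current: Optional[Corpus] = corpus
    for sub_name in subcorpus_names:
        # Descend one subcorpus level per leading path component.
        current = current.subcorpora.get(sub_name)
        if current is None:
            return None

    recording = current.recordings.get(recording_name)
    if recording is None:
        return None
    return recording.segments.get(segment_name)


seg = Segment("seg")
s0 = Segment("s0")
sub = Corpus("sub", recordings={"rec": Recording("rec", {"seg": seg})})
root = Corpus("root", subcorpora={"sub": sub}, recordings={"r0": Recording("r0", {"s0": s0})})

found = get_segment_by_path(root, "sub/rec/seg")
```

The cost is one dict lookup per path component, i.e. O(subcorpus_depth), and there is no recursion, so the name passed in always stays the full path.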

Icemole · 2024-09-24T16:22:29Z

As I stated in a comment above, note that the CorpusV2 has some memory redundancy from the dictionary indexed by name and the object in memory having the name as well. So it's a bit of a tradeoff, more memory for less access time.

I still expect the Corpus class to be useful as a repository of subcorpora, recordings and segments, for users who don't want fast access to segments and want to, for instance, filter all segments according to a criterion or iterate over all recordings (which are pretty common use cases, at least from my side).

Icemole · 2024-09-25T08:29:07Z

Copy-pasting from a comment above:

> Maybe for these scenarios a simple data structure or even just a segment_map = { s.fullname(): s for s in some_corpus.all_segments() } would be sufficient to fulfill their performance needs.

Yes, that makes a lot of sense. This implementation was originally designed to get the orthography of a given segment in O(1), which was the issue we were facing. @michelwi do you think returning a segment map from full names to segments would do the trick for your specific use case (I don't know all the details)?

@albertz and the rest of the reviewers, what's your take on this? Do you think the functionality here is too redundant? Our use case could be covered by a small function (or two, if you want to get recordings in O(1) as well) in the Corpus class.

Icemole · 2024-10-10T09:00:47Z

@albertz I cannot merge while you're requesting changes. Could you please review the PR whenever you have time? Thanks!

Add V2 version of the Corpus class

e12a275

Allows O(1) access to any segment or recording

Icemole requested review from albertz, curufinwe, christophmluscher, JackTemaki and michelwi September 24, 2024 13:48

Icemole added 4 commits September 24, 2024 14:00

Several improvements

5bd8bc7

Typing improvements, fixes on accessing segment/recording by full name

Black

f3188b3

Fix indentation

e045629

Fix subcorpus access

824ffd1

Icemole changed the title ~~[Draft] Add V2 version of the Corpus class~~ Add V2 version of the Corpus class Sep 24, 2024

michelwi requested changes Sep 24, 2024

albertz requested changes Sep 24, 2024

Icemole added 4 commits September 24, 2024 15:50

Update load function to allow V2 classes to be loaded

92c646c

Improve get_segment_by_full_name function

7b7f4b8

Don't make V2 classes inherit from V1 classes

bfd6bbc

Fix subcorpus creation

230585b

Remove redundant get_segment_by_name, add get_segment_map

224b618

get_segment_map allows for more explicit control to the user

Icemole added 2 commits September 25, 2024 08:45

Fix function name

81582f4

Remove V2 implementation, add get_{recording,segment}_mapping

a10e427

albertz reviewed Sep 26, 2024

Icemole added 2 commits September 26, 2024 15:09

Remove any reference to V2

17c96b8

Improve some small details

02722e0

Icemole requested review from albertz and michelwi October 3, 2024 07:29

michelwi reviewed Oct 4, 2024

Fix access to recordings in nested subcorpora

120cc76

Co-authored-by: michelwi <michelwi@users.noreply.github.com>

Icemole requested a review from michelwi October 4, 2024 16:03

michelwi approved these changes Oct 8, 2024

albertz approved these changes Oct 10, 2024

Icemole merged commit 229d490 into main Oct 10, 2024
4 checks passed

Icemole deleted the add-corpus-v2 branch October 10, 2024 13:00
