Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add V2 version of the Corpus class #544

Merged
merged 15 commits into from
Oct 10, 2024
164 changes: 163 additions & 1 deletion lib/corpus.py
Original file line number Diff line number Diff line change
Expand Up @@ -163,7 +163,7 @@ def segments(self) -> Iterable[Segment]:
"""
for r in self.recordings:
yield from r.segments
for sc in self.subcorpora:
for sc in self.subcorpora:
Icemole marked this conversation as resolved.
Show resolved Hide resolved
yield from sc.segments()

def get_segment_by_name(self, name: str) -> Segment:
Expand Down Expand Up @@ -301,6 +301,154 @@ def _dump_internal(self, out: TextIO, indentation: str = ""):
out.write("%s</subcorpus>\n" % (indentation,))


class CorpusV2(Corpus):
Icemole marked this conversation as resolved.
Show resolved Hide resolved
"""
This class represents a corpus in the Bliss format. It is also used to represent subcorpora when the parent_corpus
attribute is set. Corpora with include statements can be read but are written back as a single file.

The difference with respect to :class:`i6_core.lib.corpus.Corpus` is that in this class we can access
any recording or segment in practically O(1).
"""

def __init__(self):
super().__init__()

self.parent_corpus: Optional[Corpus] = None
Icemole marked this conversation as resolved.
Show resolved Hide resolved

self.subcorpora: Dict[str, Corpus] = {}
self.recordings: Dict[str, RecordingV2] = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about we add the new Dict members under a different name and then add a property like

@property
def subcorpora(self):
    return list(self._subcorpora.values())

@subcorpora.setter
def subcorpora(self, values):
    for sc in values:
        self._subcorpora[sc.name] = sc

so that most previous code would also work with V2. Also many functions below would then not need to be re-implemented.

Copy link
Collaborator Author

@Icemole Icemole Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my view, we shouldn't restrain ourselves to using V1 code anymore, since it leads the user to keep using retrocompatible code instead of striving for optimal performance. If the user needs a Corpus, then they should build a Corpus, and should only build a CorpusV2 if they know what they're doing.

If you still think that it would be a nice addition, please let me know, we can debate it or I can simply support this.

Anyway, I expect that the CorpusV2 class makes sense only in very specific scenarios, such as when having to access segments in O(1). Otherwise, the Corpus class still makes sense as a repository of subcorpora, recordings and segments, since the functionality in V2 won't need to be used by many users.

Edit: note that the CorpusV2 has some memory redundancy from the dictionary indexed by name and the object in memory having the name as well. So it's a bit of a tradeoff, more memory for less access time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my view, we shouldn't restrain ourselves to using V1 code anymore

This would be another argument for not inheriting from Corpus. Maybe a Corpus-like implementation is not even necessary. As you said

I expect that the CorpusV2 class makes sense only in very specific scenarios,

Maybe for these scenarios a simple data structure or even just a

segment_map = { s.fullname(): s for s in some_corpus.all_segments() }

would be sufficient to fulfill their performance needs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes a lot of sense. This implementation was originally thought in order to get the orthography of a given segment in O(1), which was the issue we were facing. @michelwi do you think returning a segment map from full names to segments would do the trick for your specific use case (I don't know all the details)?

@albertz and the rest of the reviewers, what's your take on this? Do you think the functionality here is too redundant? Our use case could be covered by a small function (or two, if you want to get recordings in O(1) as well) in the Corpus class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the memory redundancy, I don't think there is really any problem. It will only have duplicated some of the pointers. You probably won't even notice this.

I haven't really checked too much, but if you can just put your extension into the Corpus class itself, or rewrite Corpus such that it stays compatible, I think I would prefer that over having a CorpusV2.

Icemole marked this conversation as resolved.
Show resolved Hide resolved

def segments(self) -> Iterable[Segment]:
"""
:return: an iterator over all segments within the corpus
"""
for r in self.recordings:
yield from r.segments.values()
for sc in self.subcorpora:
yield from sc.segments()

def get_segment_by_name(self, name: str) -> Segment:
"""
:return: the segment specified by its name
"""
for seg in self.segments():
if seg.name == name:
return seg
assert False, f"Segment '{name}' was not found in corpus"

def get_segment_by_full_name(self, name: str) -> Optional[Segment]:
"""
:return: the segment specified by its full name
"""
if name == "":
# Found nothing.
return None

if name in self.segments:
Icemole marked this conversation as resolved.
Show resolved Hide resolved
return self.segments[name]
else:
subcorpus_name = name.split("/")[0]
segment_name_from_subcorpus = name[len(f"{subcorpus_name}/"):]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can already determine the location of the segment within the subcorpora/recordings by len(name.split("/")) so we could indeed loop over the name and use e.g.

L = len(name.split('/'))
active_element = self
for i,n in enumerate(name.split('/')):
    if L-i > 2:
        active_element = active_element.subcorpora[n]
    elif L-i > 1:
        active_element = active_element.recordings[n]
    else:
        return  active_element.segments[n]

to directly navigate to the segment.

But a recursive implementation is also fine (an maybe cleaner == nicer) if done nicely :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like better the recursive implementation, but let me know if you prefer the iterative one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iterative is always to be preferred for Python.

return self.subcorpora[subcorpus_name].get_segment_by_full_name(segment_name_from_subcorpus)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not like that the functions get_segment_by_{full_,}name differ in the efficiency of the implementation and also in the behavior if a segment is not found.
Also the get_segment_by_full_name is recursive and then the name that is passed is not the full_name any more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should therefore get rid of get_segment_by_name as it currently is.


def get_recording_by_full_name(self, name: str) -> Optional[Segment]:
"""
:return: the segment specified by its full name
"""
if name == "":
# Found nothing.
return None

if name in self.segments:
return self.segments[name]
Icemole marked this conversation as resolved.
Show resolved Hide resolved
else:
subcorpus_name = name.split("/")[0]
segment_name_from_subcorpus = name[len(f"{subcorpus_name}/"):]
return self.subcorpora[subcorpus_name].get_segment_by_full_name(segment_name_from_subcorpus)

def all_recordings(self) -> Iterable[Recording]:
yield from self.recordings.values()
for sc in self.subcorpora.values():
yield from sc.all_recordings()

def all_speakers(self) -> Iterable[Speaker]:
yield from self.speakers.values()
for sc in self.subcorpora:
yield from sc.all_speakers()

def top_level_recordings(self) -> Iterable[Recording]:
yield from self.recordings.values()

def top_level_subcorpora(self) -> Iterable[Corpus]:
yield from self.subcorpora.values()

def top_level_speakers(self) -> Iterable[Speaker]:
yield from self.speakers.values()

def remove_recording(self, recording: Recording):
del self.recordings[recording.name]
for sc in self.subcorpora.values():
sc.remove_recording(recording)

def remove_recordings(self, recordings: List[Recording]):
for recording in recordings:
del self.recordings[recording.name]
for sc in self.subcorpora:
sc.remove_recordings(recordings)

def add_recording(self, recording: Recording):
assert isinstance(recording, Recording)
recording.corpus = self
self.recordings[recording.name] = recording

def add_subcorpus(self, corpus: Corpus):
assert isinstance(corpus, Corpus)
corpus.parent_corpus = self
self.subcorpora[corpus.name] = corpus

def add_speaker(self, speaker: Speaker):
assert isinstance(speaker, Speaker)
self.speakers[speaker.name] = speaker

def fullname(self) -> str:
if self.parent_corpus is not None:
return self.parent_corpus.fullname() + "/" + self.name
else:
return self.name

def filter_segments(self, filter_function: FilterFunction):
"""
filter all segments (including in subcorpora) using filter_function
:param filter_function: takes arguments corpus, recording and segment, returns True if segment should be kept
"""
for r in self.recordings.values():
r.segments = [s for s in r.segments.values() if filter_function(self, r, s)]
for sc in self.subcorpora:
sc.filter_segments(filter_function)

def _dump_internal(self, out: TextIO, indentation: str = ""):
if self.parent_corpus is None:
out.write('<corpus name="%s">\n' % self.name)
else:
out.write('%s<subcorpus name="%s">\n' % (indentation, self.name))

for s in self.speakers.values():
s.dump(out, indentation + " ")
if self.speaker_name is not None:
out.write('%s <speaker name="%s"/>\n' % (indentation, self.speaker_name))

for r in self.recordings.values():
r.dump(out, indentation + " ")

for sc in self.subcorpora.values():
sc._dump_internal(out, indentation + " ")

if self.parent_corpus is None:
out.write("</corpus>\n")
else:
out.write("%s</subcorpus>\n" % (indentation,))


class Recording(NamedEntity, CorpusSection):
def __init__(self):
super().__init__()
Expand Down Expand Up @@ -338,6 +486,20 @@ def add_segment(self, segment: Segment):
self.segments.append(segment)


class RecordingV2(Recording):
Icemole marked this conversation as resolved.
Show resolved Hide resolved
def __init__(self):
super().__init__()
self.audio: Optional[str] = None
self.corpus: Optional[Corpus] = None
self.segments: Dict[str, Segment] = {}

def add_segment(self, segment: Segment):
assert isinstance(segment, Segment)
segment.recording = self
self.segments[segment.name] = segment



class Segment(NamedEntity):
def __init__(self):
super().__init__()
Expand Down
Loading