Description
Description of Problem:
The new architecture will rely on the ability to fingerprint all data that passes through the training graph in order to determine what needs to be re-run when re-training.
We can start adding fingerprint capabilities to existing data types in preparation.
The scope of this issue is the core training data objects.
Overview of the Solution:
A fingerprint is a unique, stable and small representation of a piece of data, for example an MD5 hash for text. (Note: the Python builtin `hash()` is not stable across interpreter runs for strings!)
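As a minimal sketch of what "stable and small" could look like for text, the helper below hashes the UTF-8 bytes with MD5. The name `text_fingerprint` is made up for illustration and is not an existing Rasa function.

```python
import hashlib


def text_fingerprint(text: str) -> str:
    """Return a stable hex digest for a piece of text (hypothetical helper)."""
    # Encode explicitly so the result does not depend on platform defaults.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# The digest is always 32 hex characters, regardless of input length,
# and is identical in every process and on every machine.
print(text_fingerprint("hello"))
print(len(text_fingerprint("a much longer piece of training data " * 100)))

# By contrast, hash("hello") changes between interpreter runs unless
# PYTHONHASHSEED is fixed, because str hashing is randomized.
```

MD5 is used here only because it is the example named above; any stable digest from `hashlib` would serve the same purpose, since this is change detection and not security.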
Objects that are returned from graph nodes during training need to have a `fingerprint` method which returns a fingerprint for the object.
Below is a list of all the objects covered by this issue. Some may already have fingerprinting capability, so it is also part of this issue to confirm it works correctly, make sure there is a unit test, and make sure it is accessible from `.fingerprint()`.
rasa.shared.core.training_data.structures.StoryGraph
- This holds all the stories training data
- A `StoryGraph` is made up of `StoryStep`s, which in turn have `Checkpoint`s and `Event`s.
- However, `StoryGraph` already has a `fingerprint` method which uses the `YAMLStoryWriter` along with `rasa.shared.utils.io.deep_container_fingerprint`, which should be sufficient.
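The "serialize, then hash the container" idea can be illustrated roughly as follows. This is a stand-in under stated assumptions: `container_fingerprint` is a hypothetical helper, not the implementation of `rasa.shared.utils.io.deep_container_fingerprint`, and the YAML string simply plays the role of whatever the story writer would produce.

```python
import hashlib
import json
from typing import Any


def container_fingerprint(obj: Any) -> str:
    """Fingerprint a JSON-serializable container (hypothetical stand-in)."""
    # sort_keys makes dict insertion order irrelevant; fixed separators
    # keep whitespace from leaking into the digest.
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Assumed example of serialized stories; the content is illustrative only.
stories_as_yaml = (
    "stories:\n"
    "- story: greet\n"
    "  steps:\n"
    "  - intent: greet\n"
    "  - action: utter_greet\n"
)

a = container_fingerprint({"stories": stories_as_yaml, "version": 1})
b = container_fingerprint({"version": 1, "stories": stories_as_yaml})
assert a == b  # key order does not change the fingerprint
```

Any edit to the serialized stories changes the digest, which is the property the graph would need in order to decide whether to re-run a node.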
rasa.shared.core.generator.TrackerWithCachedStates
- Once the stories are featurized they become `TrackerWithCachedStates`.
- This contains:
  - `events: List[Event]`
  - `slots: Iterable[Slot]`
  - `_states_for_hashing: Deque[FrozenSet[Tuple[Text, FrozenSet[Tuple[Text, Tuple[Union[float, Text]]]]]]]`
- `_states_for_hashing` is how the past states are cached in the tracker. As it is made up of numbers and text in fairly standard Python structures it should be fine to fingerprint, although this functionality needs to be added.
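One thing to watch with `_states_for_hashing` is that frozensets have no guaranteed iteration order, so they need to be put into a canonical form before hashing. A possible sketch, assuming the nested type shown above (`canonicalize` and `states_fingerprint` are hypothetical names, not existing Rasa functions):

```python
import hashlib
import json
from collections import deque
from typing import Any


def canonicalize(obj: Any) -> Any:
    """Convert nested containers into a JSON-friendly, order-stable form."""
    if isinstance(obj, (frozenset, set)):
        # Set iteration order can differ between runs, so sort the items
        # by their JSON text, which is stable.
        items = [canonicalize(item) for item in obj]
        return {"set": sorted(items, key=json.dumps)}
    if isinstance(obj, (tuple, list, deque)):
        # Sequences keep their order: the order of past states matters.
        return [canonicalize(item) for item in obj]
    return obj  # text and numbers are left untouched


def states_fingerprint(states: Any) -> str:
    payload = json.dumps(canonicalize(states))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Two states with the same content, built in a different order.
state = frozenset([
    ("user", frozenset([("intent", ("greet",))])),
    ("slots", frozenset([("city", (1.0,))])),
])
same_state = frozenset([
    ("slots", frozenset([("city", (1.0,))])),
    ("user", frozenset([("intent", ("greet",))])),
])
other_state = frozenset([("user", frozenset([("intent", ("bye",))]))])

assert states_fingerprint(deque([state])) == states_fingerprint(deque([same_state]))
```

Wrapping sets in a `{"set": [...]}` marker keeps a frozenset distinguishable from a tuple with the same members, so the two cannot collide.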
rasa.shared.core.events.Event
- A single step in a tracker.
- There shouldn't be anything in any of the events that is hard to fingerprint.
- However, there are many different subclasses that all contain different things.
- Worst case scenario we could use `as_story_string`, which already gives a unique string representation for each event subclass and is a required method.
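The fallback described above could look roughly like this. The classes below are hypothetical stand-ins and not the real `rasa.shared.core.events` classes; the only assumption carried over from the issue is that every subclass provides `as_story_string`.

```python
import hashlib
from abc import ABC, abstractmethod


class DemoEvent(ABC):
    """Hypothetical stand-in for an event base class."""

    @abstractmethod
    def as_story_string(self) -> str:
        ...

    def fingerprint(self) -> str:
        # Prefix with the class name so that two subclasses which happen to
        # produce the same story string still get different fingerprints.
        text = f"{type(self).__name__}:{self.as_story_string()}"
        return hashlib.md5(text.encode("utf-8")).hexdigest()


class DemoUserEvent(DemoEvent):
    def __init__(self, intent: str) -> None:
        self.intent = intent

    def as_story_string(self) -> str:
        return self.intent


class DemoActionEvent(DemoEvent):
    def __init__(self, action_name: str) -> None:
        self.action_name = action_name

    def as_story_string(self) -> str:
        return self.action_name


assert DemoUserEvent("greet").fingerprint() == DemoUserEvent("greet").fingerprint()
assert DemoUserEvent("greet").fingerprint() != DemoActionEvent("greet").fingerprint()
```

Putting `fingerprint` on the base class means each subclass gets it for free, at the cost of relying on the string representation to capture everything that matters for training.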
rasa.shared.core.slots.Slot
- A piece of remembered information, held in the tracker.
- Similar to `Event`, as there are lots of subclasses which are all quite different.
- The required method `as_feature` could be used for the fingerprint.
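A rough sketch of an `as_feature`-based fingerprint is below. `DemoSlot` is a hypothetical class, and its presence-only featurization is an assumption made for the example, not a statement about how any particular Rasa slot type behaves.

```python
import hashlib
import json
from typing import Any, List


class DemoSlot:
    """Hypothetical stand-in for a slot with a presence-only feature."""

    def __init__(self, name: str, value: Any = None) -> None:
        self.name = name
        self.value = value

    def as_feature(self) -> List[float]:
        # Assumed featurization: 1.0 if the slot is filled, else 0.0.
        return [1.0 if self.value is not None else 0.0]

    def fingerprint(self) -> str:
        # The feature vector alone cannot tell two slots apart, so the
        # slot type and name are mixed in as well.
        payload = json.dumps(
            {
                "type": type(self).__name__,
                "name": self.name,
                "feature": self.as_feature(),
            },
            sort_keys=True,
        )
        return hashlib.md5(payload.encode("utf-8")).hexdigest()


filled = DemoSlot("city", "Berlin")
empty = DemoSlot("city")
assert filled.fingerprint() != empty.fingerprint()
assert filled.fingerprint() != DemoSlot("country", "Germany").fingerprint()
```

With a featurization like this one, two different values for the same slot give the same fingerprint. Whether that is acceptable depends on whether the raw value can influence training, which would need to be decided per slot type.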
rasa.shared.core.domain.Domain
- The domain holds a lot of stuff, but we don't need to go into exhaustive detail as it already has a `fingerprint` method.
Definition of Done:
- all listed objects have `.fingerprint()`.
- all fingerprints are tested.
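For the testing part, one property worth covering is stability across processes, since an in-process equality check would also pass for the builtin `hash()`. A possible sketch is below; it computes an MD5 digest in two fresh interpreters with different hash seeds. The snippet and function names are illustrative, and a real test would call the object's `.fingerprint()` instead.

```python
import os
import subprocess
import sys
from typing import Tuple

# Code run in a fresh interpreter: prints an MD5 digest and the builtin hash.
SNIPPET = (
    "import hashlib;"
    "print(hashlib.md5('greet'.encode('utf-8')).hexdigest(), hash('greet'))"
)


def run_with_seed(seed: str) -> Tuple[str, str]:
    """Run SNIPPET in a new process with the given PYTHONHASHSEED."""
    env = {**os.environ, "PYTHONHASHSEED": seed}
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    digest, builtin_hash = result.stdout.split()
    return digest, builtin_hash


def test_fingerprint_is_stable_across_processes() -> None:
    digest_1, builtin_1 = run_with_seed("1")
    digest_2, builtin_2 = run_with_seed("2")
    # The digest must not depend on the process or its hash seed.
    assert digest_1 == digest_2
    # The builtin hash does depend on it, which is why it cannot be used.
    assert builtin_1 != builtin_2
```

The function follows pytest naming so it could be dropped into a test module as-is; spawning interpreters is slow, so one such test per fingerprinting helper is probably enough.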