Add fingerprinting capability to Core training data classes. #9133

Closed
@joejuzl

Description

Description of Problem:
The new architecture will rely on the ability to fingerprint all data that passes through the training graph in order to determine what needs to be re-run when re-training.
We can start adding fingerprint capabilities to existing data types in preparation.

The scope of this issue is the Core training data objects.

Overview of the Solution:

A fingerprint is a unique, stable and small representation of a piece of data, for example an MD5 hash of a piece of text. (Note: the Python builtin hash() is not stable! For str and bytes it is salted per process unless PYTHONHASHSEED is set.)
Objects that are returned from graph nodes during training need to have a method fingerprint which returns a fingerprint for the object.
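
As a rough illustration of the idea above, a fingerprint method could hash the object's content with hashlib. This is only a sketch: ExampleTrainingData and text_fingerprint are hypothetical names used for illustration, not existing Rasa classes or functions.

```python
import hashlib


def text_fingerprint(text: str) -> str:
    """Return the MD5 hex digest of the UTF-8 encoded text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class ExampleTrainingData:
    """Hypothetical stand-in for an object returned from a graph node."""

    def __init__(self, text: str) -> None:
        self.text = text

    def fingerprint(self) -> str:
        # Depends only on the content, so equal content gives equal fingerprints.
        return text_fingerprint(self.text)


first = ExampleTrainingData("utter_greet")
second = ExampleTrainingData("utter_greet")
changed = ExampleTrainingData("utter_goodbye")

print(first.fingerprint() == second.fingerprint())   # True
print(first.fingerprint() == changed.fingerprint())  # False
```

Unlike hash(), the digest depends only on the bytes that are hashed, so it stays the same between training runs.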

Below is a list of all the objects covered by this issue. Some may already have fingerprinting capability, so it is also part of this issue to confirm it works correctly, make sure there is a unit test, and make sure it is accessible from .fingerprint().

  • rasa.shared.core.training_data.structures.StoryGraph
    • This holds all the stories training data
    • A StoryGraph is made up of StorySteps which in turn have Checkpoints and Events.
    • However, StoryGraph already has a fingerprint method, which uses the YAMLStoryWriter along with rasa.shared.utils.io.deep_container_fingerprint; this should be sufficient.
  • rasa.shared.core.generator.TrackerWithCachedStates
    • Once the stories are featurized they become TrackerWithCachedStates
    • This contains:
      • events: List[Event]
      • slots: Iterable[Slot]
      • _states_for_hashing: Deque[FrozenSet[Tuple[Text, FrozenSet[Tuple[Text, Tuple[Union[float, Text]]]]]]]
    • _states_for_hashing is how the past states are cached in the tracker. As it is made up of numbers and text in fairly standard python structures it should be fine to fingerprint, although this functionality needs to be added.
  • rasa.shared.core.events.Event
    • A single step in a tracker.
    • None of the events should contain anything that is hard to fingerprint.
    • However, there are many different subclasses, which all contain different things.
    • In the worst case, we could use as_story_string, which already gives a unique string representation for each event subclass and is a required method.
  • rasa.shared.core.slots.Slot
    • A piece of remembered information, held in the tracker.
    • Similar to Event as there are lots of subclasses which are all quite different.
    • The required method as_feature could be used for the fingerprint.
  • rasa.shared.core.domain.Domain
    • The domain holds a lot of stuff, but we don't need to go into exhaustive detail as it already has a fingerprint method.
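
For the nested structure of _states_for_hashing described above, one possible approach is to convert it into a canonical, ordered form before hashing. The sketch below is an assumption about how this could be done, not the existing Rasa implementation; states_fingerprint and _canonical are hypothetical helpers. Sorting matters because the iteration order of a frozenset containing strings can differ between processes.

```python
import hashlib
import json
from collections import deque
from typing import Any


def _sort_key(item: Any) -> str:
    """Deterministic sort key for already-canonicalized items."""
    return json.dumps(item, sort_keys=True)


def _canonical(obj: Any) -> Any:
    """Convert nested containers into a JSON-serializable, ordered form."""
    if isinstance(obj, (set, frozenset)):
        # Sets have no defined order, so sort their canonical items.
        return {"__set__": sorted((_canonical(i) for i in obj), key=_sort_key)}
    if isinstance(obj, (list, tuple, deque)):
        # Sequences keep their order, since the order is meaningful.
        return [_canonical(i) for i in obj]
    return obj  # text, numbers, booleans, None


def states_fingerprint(states: Any) -> str:
    """Return an MD5 hex digest of the canonical form of the states."""
    payload = json.dumps(_canonical(states), sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Two example states shaped like the entries of _states_for_hashing.
greet = frozenset({
    ("user", frozenset({("intent", ("greet",))})),
    ("slots", frozenset({("name", (1.0,))})),
})
bye = frozenset({("user", frozenset({("intent", ("goodbye",))}))})

print(states_fingerprint(deque([greet, bye])))
```

The order of the deque is kept, so the same states in a different sequence give a different fingerprint, while the order in which a frozenset happens to be iterated has no effect.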

Definition of Done:

  • all listed objects have .fingerprint().
  • all fingerprints are tested.
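
A test for stability could, for example, compute the same value in fresh Python processes started with different hash seeds. The following is a minimal sketch under that assumption; the helper and test names are hypothetical, and a real test would call .fingerprint() on the objects listed above instead of hashing a literal string.

```python
import os
import subprocess
import sys

MD5_SNIPPET = (
    "import hashlib;"
    "print(hashlib.md5('utter_greet'.encode('utf-8')).hexdigest())"
)
HASH_SNIPPET = "print(hash('utter_greet'))"


def run_in_fresh_process(snippet: str, hash_seed: str) -> str:
    """Run a snippet in a new interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def test_md5_fingerprint_is_stable_across_processes() -> None:
    assert run_in_fresh_process(MD5_SNIPPET, "1") == run_in_fresh_process(
        MD5_SNIPPET, "2"
    )


def test_builtin_hash_is_not_stable_across_processes() -> None:
    assert run_in_fresh_process(HASH_SNIPPET, "1") != run_in_fresh_process(
        HASH_SNIPPET, "2"
    )
```

The functions follow pytest naming, so they would be collected automatically if placed in a test module.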

Labels

  • area:rasa-oss 🎡 (Anything related to the open source Rasa framework)
  • area:rasa-oss/training-data (Issues focused around Rasa training data (stories, NLU, domain, etc.))
  • effort:atom-squad/2 (Label which is used by the Rasa Atom squad to do internal estimation of task sizes.)
  • feature:rasa-3.0/architecture
  • type:enhancement ✨ (Additions of new features or changes to existing ones, should be doable in a single PR)
