Description
[edit] Granularities are no longer required in FeatureRow or FeatureSpecs, as we have removed history from the serving store and the serving API. Thus there is also no requirement for them in the warehouse store. Additionally, the notion of granularity has proven confusing to end users. The history of the issue is kept below:
I'd like to discuss feature granularities.
What is granularity
Currently we have a fixed set of Feast granularities: {seconds, minutes, hours, days}.
It is not always obvious what the Feast granularity refers to.
In general a feature is associated with a few different timestamps throughout its lifecycle:
- the window duration of an aggregation (this is upstream to feast)
- the trigger frequency at which an event is emitted per key, likely irregular if emitted more than once per window (this is upstream of Feast)
- the ingestion event timestamp that Feast receives during ingestion, determined by the feature creator
- the storage event timestamp used to store and retrieve features in Feast, determined by Feast.
The storage event timestamp is derived by rounding the ingestion event timestamp down to the start of the granularity window, for all the features in a feature row. E.g. for a granularity of 1 hour, we round the ingestion timestamp down to the start of the enclosing hour.
For example, say we have a feature that is aggregated over 1 hour fixed windows and triggered every minute. Each minute an update of the 1 hour window aggregation is provided. We would naturally use a 1 hour granularity for this. The ingestion event timestamp should fall within the one hour window; the storage event timestamp would be the start of the window.
Another example: say we have a feature that is aggregated over a 10 minute sliding window, and triggered only once at the end of every window. In this case, the Feast granularity actually needs to be 1 minute, which can seem confusing.
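To make the rounding above concrete, here is a minimal Python sketch of deriving a storage event timestamp from an ingestion event timestamp. The granularity map and function name are illustrative only and do not correspond to Feast's actual code.

```python
from datetime import datetime, timedelta, timezone

# Illustrative granularity set; names do not match Feast's internal enum.
GRANULARITIES = {
    "SECOND": timedelta(seconds=1),
    "MINUTE": timedelta(minutes=1),
    "HOUR": timedelta(hours=1),
    "DAY": timedelta(days=1),
}

def storage_event_timestamp(ingestion_ts: datetime, granularity: str) -> datetime:
    """Round the ingestion event timestamp down to the start of its granularity window."""
    step = GRANULARITIES[granularity].total_seconds()
    epoch_seconds = ingestion_ts.timestamp()
    return datetime.fromtimestamp(epoch_seconds - (epoch_seconds % step), tz=timezone.utc)

ingested = datetime(2018, 7, 3, 10, 47, 12, tzinfo=timezone.utc)
print(storage_event_timestamp(ingested, "HOUR"))    # stored against 10:00:00
print(storage_event_timestamp(ingested, "MINUTE"))  # stored against 10:47:00
```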
Limitations of current approach
Feast rounds ingested timestamps to the granularity provided at feature creation. This seemed like a convenience, but it hinders the use of custom granularities and can cause confusion.
For example, the granularities are an enum and there is no 5 minute option. If we wanted to store and overwrite a key every five minutes, we would need to use a finer granularity and manually round the ingestion timestamps to 5 minute marks during feature creation, as sketched below.
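A short sketch of the workaround this forces on the feature creator, assuming they pre-round event timestamps themselves before sending rows to Feast (function name and interval are illustrative):

```python
from datetime import datetime, timedelta, timezone

def round_to_interval(ts: datetime, interval: timedelta) -> datetime:
    """Round a timestamp down to the nearest multiple of an arbitrary interval."""
    step = interval.total_seconds()
    epoch_seconds = ts.timestamp()
    return datetime.fromtimestamp(epoch_seconds - (epoch_seconds % step), tz=timezone.utc)

# The feature would be registered with the finer "minute" granularity, and every
# event timestamp pre-rounded to a 5 minute mark before ingestion.
event_ts = datetime(2018, 7, 3, 10, 47, 12, tzinfo=timezone.utc)
ingestion_ts = round_to_interval(event_ts, timedelta(minutes=5))  # 10:45:00
```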
Another example: let's say we have a feature called "product.day.sold". As it is updated throughout the day, it could represent the number of products sold on that day so far, or just as easily the number of products sold in the last 24 hours at the time it was updated. It could also represent the last 7 days of sold products as it stood on that particular day. Basically, the meaning of this feature is determined by how the feature was created. The feature granularity is not enough information, and could be misleading when feature creators are forced to work around its limitations.
I suggest that instead of attempting to handle granularities, we should require that any rounding of timestamps happens during feature creation, not within Feast, and simply store features against the event timestamp provided.
The problem of how to serve keys if we do not have a fixed granularity is not as bad as it sounds:
- firstly, it is only an issue at all when a feature is requested across a time range, not "latest", and "latest" is the most common request.
- secondly, our currently supported stores, BigTable and Redis, both support scans across key date ranges (Redis via our bucketing approach).
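For illustration, here is a hypothetical sketch of how time bucketing lets a plain key-value store answer a range request: keys embed a bucket prefix, and a time range expands into the small list of bucket keys to fetch. This is not Feast's actual key scheme, just the general idea.

```python
from datetime import datetime, timedelta, timezone
from typing import List

BUCKET = timedelta(hours=1)  # hypothetical bucket size

def bucket_keys(entity: str, key: str, start: datetime, end: datetime) -> List[str]:
    """Expand a time range into the bucketed store keys that need to be fetched."""
    step = BUCKET.total_seconds()
    first = int(start.timestamp() // step)
    last = int(end.timestamp() // step)
    return [f"{entity}:{key}:{bucket}" for bucket in range(first, last + 1)]

# A ~3 hour range over one entity key expands to three short point lookups / scans.
keys = bucket_keys(
    "customer", "1234",
    datetime(2018, 7, 3, 8, 0, tzinfo=timezone.utc),
    datetime(2018, 7, 3, 10, 30, tzinfo=timezone.utc),
)
```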
Another problem is how we prevent feature creators from polluting a key space with overly granular timestamps. We will still have this problem regardless, as a feature creator can always use the "seconds" granularity.
My proposal
- The storage event timestamp should be the same thing as ingestion event timestamp.
- We should drop granularity from FeatureRow and ignore it for ingestion and storage purposes.
- We should drop the requirement that granularity is part of the featureId. So instead of {entityName}.{granularity}.{featureName}, it should just be {entityName}.{featureName}.
- BigQuery tables, which are currently separated by granularity, should instead be separated by a feature's group.
We would be committing to a requirement that timely short scans across a key range are supported by all stores.
Benefits
- An easier to understand data model.
- Enables storing at custom granularities.
- Simplified code
What do people think?
Is there an issue with serving I have missed?