Skip to content

Handle BigQuery partitions when event_timestamp is not the partition column, deal with event_timestamp columns of DATE type #2530

Open
@rory-nia

Description

@rory-nia

Is your feature request related to a problem? Please describe.
At a high level, I'd like to be able to install Feast, describe my sources and FVs, and immediately start working to get historical data and materialize features. However, by not being partition-aware and by not accounting for non-timestamp date columns, a large number of our tables are not usable without modification.

This feature request is specific to the GCP provider and BigQuery offline store. It has two parts: 1) relating to respecting partition when constructing date ranges in SQL, 2) relating to source tables with "timestamp" column not of TIMESTAMP type.

  1. In any table were the partition column is not the same column as the event_timestamp, Feast will ignore partitions and do a full table scan on each query. For large tables, this causes BQ to fail due to "insufficient resources" - not a slot/quote issue, but having to do with needing to marshal too much transient data. A common example with BQ is when doing DAY partitioning using ingestion time, the magic column _PARTITIONTIME is the partition column. Each record will contain its own actual event_timestamp column, but this column is not the partition column.

  2. For tables with an "event_timestamp" that is not a TIMESTAMP type, the SQL can break when there's a type mismatch, e.g. when the column is a DATE type. A common use case is a DAY partitioned summary table that performs aggregations over the day or possibly just takes the final value for each entity/date. These tables have a DATE column on each record. In this case, the DATE column is usually the partition column, but it is not the correct format for the date conditions in the Feast template SQL. I would prefer not to have to create a new column or a view on each table just to make Feast work.

Describe the solution you'd like
I'd like to be able to specify information about the partition column such that Feast could formulate SQL optimized to my table definition taking advantage of partitioning. This information would probably need to include the partition column name, type, and partition type at a minimum. I believe optimized SQL would not be much different from the existing queries but would require, in cases where the partition column is not the same as the event_timestamp column, to add an additional range condition using the partition column.

I'd also like to be able to indicate that my "event_timestamp" column is not a TIMESTAMP type - and have Feast adjust the SQL accordingly. Specifically, I'd like to specify in the Feast BigQuerySource definition that the "event_timestamp" column is the partition column and is of DATE type and have Feast make the corresponding adjustments to the templatized SQL.

Describe alternatives you've considered
To deal with (1), the only alternatives I can think of are a) creating a view that changes the partition column to the event_timestamp column for every table, b) creating a duplicate of each table using the event_timestamp column.

To deal with (2), similarly, either a) create a view with a new timestamp typed column to use as the event_timestamp, b) add a new timestamp column to the existing tables.

All these solutions require a good amount of work and/or creating duplicate tables/views. I would really like Feast to just do the right thing after I create a Source.

Additional context
Happy to provide any further details if the above is not clear.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions