Substrait-based on demand feature views #3945

Closed
@tokoko

Is your feature request related to a problem? Please describe.
On demand feature views (odfvs) as currently implemented are very limited. The only way to define an odfv is through a Python function that takes a pandas DataFrame as input and returns another pandas DataFrame. This causes problems for both the offline and online interfaces:

  • Even the most scalable offline stores are forced to collect the whole dataset into a single pandas DataFrame in order to apply the odfv function. There is no way for offline stores to push the computation down into their own engines.
  • UDFs in odfvs are inherently bound to pandas and the Python runtime. Non-Python feature servers have to figure out how to run these functions when needed. The Java feature server, for example, has a separate Python transformation service for this reason alone, which is a subpar solution because the whole point of a Java feature server was to avoid the Python runtime in feature serving in the first place.

Describe the solution you'd like
Allow constructing odfvs as Substrait plans. Substrait is a protobuf-based serialization format for relational algebra operations. It is meant to be used as a cross-language and cross-engine format for sharing logical or physical execution plans. It has a number of producers (tools that can generate Substrait) and consumers (engines that can run Substrait) in different languages.

  • Different offline stores will be able to inspect Substrait plans and incorporate them into their own transformations. Even where that is impossible, the default implementation inside Feast that applies these functions will avoid pandas.
  • Most importantly, non-Python feature servers such as the Java feature server will be able to apply the functions without a separate Python component. The Apache Arrow Java implementation comes with Java bindings to the Acero query engine, which can consume Substrait plans. (https://arrow.apache.org/docs/java/substrait.html#executing-queries-using-substrait-plans)
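To make the producer/consumer idea concrete, here is a toy sketch using only the Python standard library. It is not Substrait: the JSON layout and the names `make_plan`, `apply_online` and `to_sql` are hypothetical and exist only for illustration. The point it shows is that once the transformation is plain serialized bytes rather than a Python function, one consumer can evaluate it row by row (as a feature server might) while another can translate it into its own engine's query (as an offline store might).

```python
import json


def make_plan() -> bytes:
    """Producer: serialize 'conv_rate + acc_rate' as bytes (toy format, NOT Substrait)."""
    plan = {
        "output": "conv_rate_plus_acc",
        "expr": {"op": "add", "args": [{"field": "conv_rate"}, {"field": "acc_rate"}]},
    }
    return json.dumps(plan).encode("utf-8")


def _eval(expr: dict, row: dict):
    """Evaluate the toy expression tree against a single row."""
    if "field" in expr:
        return row[expr["field"]]
    if expr["op"] == "add":
        left, right = (_eval(arg, row) for arg in expr["args"])
        return left + right
    raise ValueError(f"unsupported op: {expr['op']}")


def apply_online(plan_bytes: bytes, row: dict) -> dict:
    """Consumer 1: an online-style evaluator that needs no pandas."""
    plan = json.loads(plan_bytes)
    return {plan["output"]: _eval(plan["expr"], row)}


def _render(expr: dict) -> str:
    """Render the toy expression tree as a SQL expression."""
    if "field" in expr:
        return expr["field"]
    if expr["op"] == "add":
        return "(" + " + ".join(_render(arg) for arg in expr["args"]) + ")"
    raise ValueError(f"unsupported op: {expr['op']}")


def to_sql(plan_bytes: bytes, table: str) -> str:
    """Consumer 2: an offline-style engine that pushes the plan down as a query."""
    plan = json.loads(plan_bytes)
    return f"SELECT {_render(plan['expr'])} AS {plan['output']} FROM {table}"


plan = make_plan()
online_result = apply_online(plan, {"conv_rate": 0.25, "acc_rate": 0.5})
offline_query = to_sql(plan, "driver_stats")
```

A real implementation would replace the toy JSON with an actual Substrait plan and the two consumers with engines that understand Substrait.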

The example code in my PoC implementation looks something like this:

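from feast import Field, OnDemandFeatureView
from feast.types import Float64


def generate_substrait():
    import ibis
    from ibis_substrait.compiler.core import SubstraitCompiler

    compiler = SubstraitCompiler()

    # Unbound table describing the input columns of the odfv
    t = ibis.table([("conv_rate", "float"), ("acc_rate", "float")], "t")

    expr = t.select((t['conv_rate'] + t['acc_rate']).name('conv_rate_plus_acc_substrait'))

    # Serialize the compiled plan (a protobuf message) to bytes
    return compiler.compile(expr).SerializeToString()

# driver_stats_fv is an existing feature view defined elsewhere;
# substrait_plan is the new parameter proposed in this issue
substrait_odfv = OnDemandFeatureView(
    name='substrait_view',
    sources=[driver_stats_fv],
    schema=[
        Field(name="conv_rate_plus_acc_substrait", dtype=Float64)
    ],
    substrait_plan=generate_substrait()
)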

The Substrait plan object that Feast accepts is plain bytes, so it introduces no external dependency. I'm using ibis and ibis-substrait to generate the plan. Right now that's the most practical way to generate a Substrait plan in Python with a DataFrame-like API, but any other Substrait producer could be used instead.

Describe alternatives you've considered
An obvious alternative to Substrait is SQL-based odfvs, but using SQL has a number of important downsides:

  1. Because SQL dialects differ, it will be especially hard to ensure that SQL-based feature functions behave the same way across different offline store and online store implementations.
  2. The user is implicitly bound to their offline store and online store of choice, because the dialect used in the SQL strings has to match the offline store engine.
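As a small illustration of the dialect problem, the sketch below runs a feature expression through SQLite from the Python standard library. The table and column names (`driver_stats`, `trips`, `days`) are hypothetical. SQLite performs integer division on integer operands, whereas MySQL's `/` operator returns a decimal (3.5) for the same operands, so the same SQL string would yield different feature values depending on the engine that executes it.

```python
import sqlite3

# Hypothetical feature expression: average trips per day
FEATURE_SQL = "SELECT trips / days FROM driver_stats"


def run_feature_sql(sql: str):
    """Evaluate the feature SQL against a one-row in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE driver_stats (trips INTEGER, days INTEGER)")
        conn.execute("INSERT INTO driver_stats VALUES (7, 2)")
        return conn.execute(sql).fetchone()[0]
    finally:
        conn.close()


# SQLite truncates 7 / 2 to an integer; an engine with decimal division would not
result = run_feature_sql(FEATURE_SQL)
```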

Having said that, it probably makes sense to support both Substrait-based and SQL-based odfvs, because at the moment it might be easier to incorporate SQL-based logic inside offline store engines.
