Conversation

HaoXuAI
Collaborator

@HaoXuAI HaoXuAI commented Oct 5, 2025

What this PR does / why we need it:

Examples:

  ✅ Iceberg Example:

  from feast.table_format import IcebergFormat
  from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource

  # Create Iceberg format with snapshot option
  iceberg_format = IcebergFormat()
  iceberg_format.set_property("snapshot-id", "your_snapshot_id")

  spark_source = SparkSource(
      name="iceberg_features",
      path="my_catalog.my_database.my_table",  # Catalog table name
      table_format=iceberg_format
  )

  Generated Spark code:
  spark.read.format("iceberg").option("snapshot-id", "your_snapshot_id").load("my_catalog.my_database.my_table")
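
  The IcebergFormat/DeltaFormat/HudiFormat objects used in these examples can be thought of as a Spark reader format name plus a bag of string options collected via set_property. Below is a minimal sketch of that shape, assuming this is roughly how feast.table_format models it (the class names here are illustrative, not the actual implementation):

  # Minimal sketch (assumption, not the actual feast code): a table format
  # reduces to a Spark reader format name plus key/value reader options.
  from typing import Dict

  class SketchTableFormat:
      format_name: str = ""

      def __init__(self) -> None:
          self._properties: Dict[str, str] = {}

      def set_property(self, key: str, value: str) -> None:
          self._properties[key] = value

      def get_properties(self) -> Dict[str, str]:
          return dict(self._properties)

  class SketchIcebergFormat(SketchTableFormat):
      format_name = "iceberg"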

  ✅ Delta Example:

  from feast.table_format import DeltaFormat

  delta_format = DeltaFormat()
  delta_format.set_property("versionAsOf", "1")

  spark_source = SparkSource(
      name="delta_features",
      path="s3://bucket/delta-table",
      table_format=delta_format
  )

  Generated Spark code:
  spark.read.format("delta").option("versionAsOf", "1").load("s3://bucket/delta-table")
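
  Delta's Spark reader also supports time travel by timestamp; assuming the same set_property mechanism applies, that would look like this (hedged example, reusing the DeltaFormat import above):

  # Delta time travel by timestamp instead of version
  # ("timestampAsOf" is a standard Delta Lake reader option).
  delta_ts_format = DeltaFormat()
  delta_ts_format.set_property("timestampAsOf", "2023-10-10 12:00:00")

  Expected Spark read:
  spark.read.format("delta").option("timestampAsOf", "2023-10-10 12:00:00").load("s3://bucket/delta-table")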

  ✅ Hudi Example:

  from feast.table_format import HudiFormat

  hudi_format = HudiFormat()
  hudi_format.set_property("hoodie.datasource.query.type", "snapshot")
  hudi_format.set_property("as.of.instant", "20231010120000")

  spark_source = SparkSource(
      name="hudi_features",
      path="s3://bucket/hudi-table",
      table_format=hudi_format
  )

  Generated Spark code:
  spark.read.format("hudi")\
    .option("hoodie.datasource.query.type", "snapshot")\
    .option("as.of.instant", "20231010120000")\
    .load("s3://bucket/hudi-table")
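
  The pattern is the same across all three formats: the table format contributes the reader format name, and each property becomes an .option() call. A generic sketch of that mapping, reusing the attribute names from the sketch above (the actual SparkSource code may differ):

  # Hypothetical helper illustrating the mapping; not the actual SparkSource code.
  from pyspark.sql import DataFrame, SparkSession

  def read_with_table_format(spark: SparkSession, path: str, table_format) -> DataFrame:
      reader = spark.read.format(table_format.format_name)
      for key, value in table_format.get_properties().items():
          reader = reader.option(key, value)
      return reader.load(path)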

Which issue(s) this PR fixes:

Misc

Signed-off-by: HaoXuAI <sduxuhao@gmail.com>
@HaoXuAI HaoXuAI requested a review from a team as a code owner October 5, 2025 21:47
@HaoXuAI HaoXuAI changed the title from "feat: Support table format" to "feat: Support table format: Iceberg, Delta, and Hudi" Oct 5, 2025
Signed-off-by: HaoXuAI <sduxuhao@gmail.com>
string date_partition_column_format = 5;

// Table Format (e.g. iceberg, delta, etc)
string table_format = 6;
Collaborator Author

TODO: create a TableFormat proto and consolidate it with the FileFormat proto.
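
Purely as a sketch of that consolidation idea, the message could look something like the following (field names and numbers are illustrative, not a final design):

// Sketch only: one possible shape for a consolidated TableFormat message.
message TableFormat {
  // e.g. "iceberg", "delta", "hudi"
  string format = 1;

  // Reader options such as "snapshot-id" or "versionAsOf"
  map<string, string> properties = 2;
}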

@tokoko
Collaborator

tokoko commented Oct 6, 2025

+1 on the inclusion of all 3 formats. Still, I think we might be able to better design the data source side so that data source definitions don't tie the sources to specific offline stores. For example, right now FileSource already supports specifying delta as a format. I don't like that design either, because delta and especially iceberg might require more configuration than plain files, but the upside of that design was that the Source class wasn't tied to a single offline store and it is actually usable from both the dask and duckdb offline stores. Hiding formats behind SparkSource means that when a user stores a data source instance in the registry, only the Spark offline store can be used to retrieve the data.

I think we can have the best of both worlds if we instead add all of these formats as separate, independent data sources (DeltaSource, IcebergSource, HudiSource) and then simply teach the Spark engine (and in the future others) how to read them.
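
To make that alternative concrete, a standalone, engine-agnostic source could look roughly like the sketch below (IcebergSource does not exist in this PR; the shape is hypothetical):

# Sketch of an engine-agnostic Iceberg source (hypothetical; not part of this PR).
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class IcebergSourceSketch:
    name: str
    table: str                      # e.g. "my_catalog.my_database.my_table"
    snapshot_id: Optional[str] = None
    properties: Dict[str, str] = field(default_factory=dict)

# Each offline store (Spark today, dask/duckdb later) would translate this into
# its own read, e.g. spark.read.format("iceberg").load(source.table).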

query: The query to be executed in Spark.
path: The path to file data.
file_format: The format of the file data.
file_format: The underlying file format (parquet, avro, csv, json).


Why not consolidate now?

@franciscojavierarceo
Member

I think we can have the best of both worlds if we instead add all of these formats as separate, independent data sources (DeltaSource, IcebergSource, HudiSource) and then simply teach the Spark engine (and in the future others) how to read them.

+1

@HaoXuAI
Collaborator Author

HaoXuAI commented Oct 6, 2025

@franciscojavierarceo @tokoko Consolidating with FileFormat and adding new data sources could break backward compatibility, so I want to do it step by step.

@franciscojavierarceo
Member

That makes sense

@tokoko
Collaborator

tokoko commented Oct 7, 2025

@HaoXuAI Why would new data sources break backwards compatibility though?

@HaoXuAI
Collaborator Author

HaoXuAI commented Oct 9, 2025

@HaoXuAI Why would new data sources break backwards compatibility though?

There will be some proto changes. I'm not 100% sure whether there will be API changes exposed to users, but I think that might be the case.

Signed-off-by: HaoXuAI <sduxuhao@gmail.com>
@HaoXuAI
Collaborator Author

HaoXuAI commented Oct 10, 2025

@franciscojavierarceo @ntkathole mind taking a look?

Member

@franciscojavierarceo franciscojavierarceo left a comment


@HaoXuAI I don't see us actually using or testing the Spark Table, Iceberg, or Hudi formats outside of our definitions; can you add that?

Can you also add documentation that these formats are now supported?

Otherwise lgtm.
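
For reference, the kind of unit test being asked for might look roughly like this sketch (hypothetical: it assumes the format exposes its collected options via a get_properties() accessor, which may not match the real feast.table_format API):

# Rough sketch of a unit test for the option mapping (hypothetical accessor).
from feast.table_format import DeltaFormat

def test_delta_format_collects_reader_options():
    fmt = DeltaFormat()
    fmt.set_property("versionAsOf", "1")
    # however the collected properties are exposed in the real class:
    assert fmt.get_properties() == {"versionAsOf": "1"}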

@HaoXuAI HaoXuAI closed this Oct 14, 2025
@HaoXuAI HaoXuAI reopened this Oct 14, 2025
@HaoXuAI
Collaborator Author

HaoXuAI commented Oct 14, 2025

@HaoXuAI I don't see us actually using or testing the Spark Table, Iceberg, or Hudi formats outside of our definitions; can you add that?
Can you also add documentation that these formats are now supported?
Otherwise lgtm.

Gonna add the TableFormat proto in the next PR, and after that I'll add the docs. I think the tests will need to be changed as well.
