docs: Add type system docs and add details to data source docs (feast-dev#3108)

* Data source docs

Signed-off-by: Felix Wang <wangfelix98@gmail.com>

* Type system docs

Signed-off-by: Felix Wang <wangfelix98@gmail.com>

* Update data source docs

Signed-off-by: Felix Wang <wangfelix98@gmail.com>

* Update docs

Signed-off-by: Felix Wang <wangfelix98@gmail.com>

Signed-off-by: Felix Wang <wangfelix98@gmail.com>
felixwang9817 authored Aug 23, 2022
1 parent 83cf753 commit fad45ca
Showing 12 changed files with 114 additions and 3 deletions.
2 changes: 2 additions & 0 deletions docs/SUMMARY.md
@@ -62,7 +62,9 @@
## Reference

* [Codebase Structure](reference/codebase-structure.md)
* [Type System](reference/type-system.md)
* [Data sources](reference/data-sources/README.md)
* [Overview](reference/data-sources/overview.md)
* [File](reference/data-sources/file.md)
* [Snowflake](reference/data-sources/snowflake.md)
* [BigQuery](reference/data-sources/bigquery.md)
6 changes: 5 additions & 1 deletion docs/reference/data-sources/README.md
@@ -1,6 +1,10 @@
# Data sources

Please see [Data Source](../../getting-started/concepts/data-ingestion.md) for an explanation of data sources.
Please see [Data Source](../../getting-started/concepts/data-ingestion.md) for a conceptual explanation of data sources.

{% content-ref url="overview.md" %}
[overview.md](overview.md)
{% endcontent-ref %}

{% content-ref url="file.md" %}
[file.md](file.md)
5 changes: 5 additions & 0 deletions docs/reference/data-sources/bigquery.md
@@ -30,3 +30,8 @@ BigQuerySource(
```

The full set of configuration options is available [here](https://rtd.feast.dev/en/latest/index.html#feast.infra.offline_stores.bigquery_source.BigQuerySource).

## Supported Types

BigQuery data sources support all eight primitive types and their corresponding array types.
For a comparison against other batch data sources, please see [here](overview.md#functionality-matrix).
5 changes: 5 additions & 0 deletions docs/reference/data-sources/file.md
@@ -22,3 +22,8 @@ parquet_file_source = FileSource(
```

The full set of configuration options is available [here](https://rtd.feast.dev/en/latest/index.html#feast.infra.offline_stores.file_source.FileSource).

## Supported Types

File data sources support all eight primitive types and their corresponding array types.
For a comparison against other batch data sources, please see [here](overview.md#functionality-matrix).
31 changes: 31 additions & 0 deletions docs/reference/data-sources/overview.md
@@ -0,0 +1,31 @@
# Overview

## Functionality

In Feast, each batch data source is associated with a corresponding offline store.
For example, a `SnowflakeSource` can only be processed by the Snowflake offline store.
Beyond that, the primary difference between batch data sources is the set of supported types.
Feast has an internal type system, and aims to support eight primitive types (`bytes`, `string`, `int32`, `int64`, `float32`, `float64`, `bool`, and `timestamp`) along with the corresponding array types.
However, not every batch data source supports all of these types.

For more details on the Feast type system, see [here](../type-system.md).
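As a self-contained sketch (deliberately not using the Feast SDK), the eight primitive types and their array counterparts can be pictured as follows. The pairing of each type with a native Python representation is illustrative only, not the SDK's own definition:

```python
# Illustrative only: the eight Feast primitive types paired with the native
# Python values they typically carry. This is an assumed pairing for
# exposition; the authoritative definitions live in Feast's types.py.
FEAST_PRIMITIVE_TYPES = {
    "bytes": bytes,
    "string": str,
    "int32": int,
    "int64": int,
    "float32": float,
    "float64": float,
    "bool": bool,
    "timestamp": int,  # e.g. Unix epoch seconds
}

def array_of(primitive: str) -> str:
    """Each primitive type has a corresponding array type."""
    if primitive not in FEAST_PRIMITIVE_TYPES:
        raise ValueError(f"unknown primitive type: {primitive}")
    return f"array<{primitive}>"
```

For example, `array_of("int64")` yields the array counterpart of `int64`, giving sixteen types in total (eight primitives plus eight arrays).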

## Functionality Matrix

There are currently four core batch data source implementations: `FileSource`, `BigQuerySource`, `SnowflakeSource`, and `RedshiftSource`.
There are several additional implementations contributed by the Feast community (`PostgreSQLSource`, `SparkSource`, and `TrinoSource`), which are not guaranteed to be stable or to match the functionality of the core implementations.
Details for each specific data source can be found [here](README.md).

Below is a matrix indicating which data sources support which types.

| | File | BigQuery | Snowflake | Redshift | Postgres | Spark | Trino |
| :-------------------------------- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| `bytes` | yes | yes | yes | yes | yes | yes | yes |
| `string` | yes | yes | yes | yes | yes | yes | yes |
| `int32` | yes | yes | yes | yes | yes | yes | yes |
| `int64` | yes | yes | yes | yes | yes | yes | yes |
| `float32` | yes | yes | yes | yes | yes | yes | yes |
| `float64` | yes | yes | yes | yes | yes | yes | yes |
| `bool` | yes | yes | yes | yes | yes | yes | yes |
| `timestamp` | yes | yes | yes | yes | yes | yes | yes |
| array types | yes | yes | no | no | yes | yes | no |
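The matrix can also be read programmatically; the snippet below encodes the table above for quick lookup (illustrative only, and accurate only as of the matrix at the time of writing):

```python
# The functionality matrix above, encoded for lookup. Every listed source
# supports all eight primitive types; only array support varies.
ARRAY_SUPPORT = {
    "File": True, "BigQuery": True, "Snowflake": False,
    "Redshift": False, "Postgres": True, "Spark": True, "Trino": False,
}

PRIMITIVES = {"bytes", "string", "int32", "int64",
              "float32", "float64", "bool", "timestamp"}

def supports(source: str, feast_type: str) -> bool:
    """Return whether a batch data source supports a given Feast type."""
    if feast_type in PRIMITIVES:
        return True
    if feast_type == "array":
        return ARRAY_SUPPORT[source]
    raise ValueError(f"unknown type: {feast_type}")
```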
5 changes: 5 additions & 0 deletions docs/reference/data-sources/postgres.md
@@ -28,3 +28,8 @@ driver_stats_source = PostgreSQLSource(
```

The full set of configuration options is available [here](https://rtd.feast.dev/en/master/#feast.infra.offline_stores.contrib.postgres_offline_store.postgres_source.PostgreSQLSource).

## Supported Types

PostgreSQL data sources support all eight primitive types and their corresponding array types.
For a comparison against other batch data sources, please see [here](overview.md#functionality-matrix).
5 changes: 5 additions & 0 deletions docs/reference/data-sources/redshift.md
@@ -30,3 +30,8 @@ my_redshift_source = RedshiftSource(
```

The full set of configuration options is available [here](https://rtd.feast.dev/en/master/#feast.infra.offline_stores.redshift_source.RedshiftSource).

## Supported Types

Redshift data sources support all eight primitive types, but currently do not support array types.
For a comparison against other batch data sources, please see [here](overview.md#functionality-matrix).
5 changes: 5 additions & 0 deletions docs/reference/data-sources/snowflake.md
@@ -43,3 +43,8 @@ In particular, you can read more about quoted identifiers [here](https://docs.sno
{% endhint %}

The full set of configuration options is available [here](https://rtd.feast.dev/en/latest/index.html#feast.infra.offline_stores.snowflake_source.SnowflakeSource).

## Supported Types

Snowflake data sources support all eight primitive types, but currently do not support array types.
For a comparison against other batch data sources, please see [here](overview.md#functionality-matrix).
5 changes: 5 additions & 0 deletions docs/reference/data-sources/spark.md
@@ -52,3 +52,8 @@ my_spark_source = SparkSource(
```

The full set of configuration options is available [here](https://rtd.feast.dev/en/master/#feast.infra.offline_stores.contrib.spark_offline_store.spark_source.SparkSource).

## Supported Types

Spark data sources support all eight primitive types and their corresponding array types.
For a comparison against other batch data sources, please see [here](overview.md#functionality-matrix).
5 changes: 5 additions & 0 deletions docs/reference/data-sources/trino.md
@@ -27,3 +27,8 @@ driver_hourly_stats = TrinoSource(
```

The full set of configuration options is available [here](https://rtd.feast.dev/en/master/#trino-source).

## Supported Types

Trino data sources support all eight primitive types, but currently do not support array types.
For a comparison against other batch data sources, please see [here](overview.md#functionality-matrix).
2 changes: 0 additions & 2 deletions docs/reference/offline-stores/README.md
@@ -2,8 +2,6 @@

Please see [Offline Store](../../getting-started/architecture-and-components/offline-store.md) for a conceptual explanation of offline stores.

## Reference

{% content-ref url="overview.md" %}
[overview.md](overview.md)
{% endcontent-ref %}
41 changes: 41 additions & 0 deletions docs/reference/type-system.md
@@ -0,0 +1,41 @@
# Type System

## Motivation

Feast uses an internal type system to provide guarantees on training and serving data.
Feast currently supports eight primitive types - `INT32`, `INT64`, `FLOAT32`, `FLOAT64`, `STRING`, `BYTES`, `BOOL`, and `UNIX_TIMESTAMP` - and the corresponding array types.
Null values are not supported, although the `UNIX_TIMESTAMP` type is nullable.
The type system is controlled by [`Value.proto`](https://github.com/feast-dev/feast/blob/master/protos/feast/types/Value.proto) in protobuf and by [`types.py`](https://github.com/feast-dev/feast/blob/master/sdk/python/feast/types.py) in Python.
Type conversion logic can be found in [`type_map.py`](https://github.com/feast-dev/feast/blob/master/sdk/python/feast/type_map.py).

## Examples

### Feature inference

During `feast apply`, Feast runs schema inference on the data sources underlying feature views.
For example, if the `schema` parameter is not specified for a feature view, Feast will examine the schema of the underlying data source to determine the event timestamp column, feature columns, and entity columns.
Each of these columns must be associated with a Feast type, which requires conversion from the data source type system to the Feast type system.
* The feature inference logic calls `_infer_features_and_entities`.
* `_infer_features_and_entities` calls `source_datatype_to_feast_value_type`.
* `source_datatype_to_feast_value_type` calls the appropriate method in `type_map.py`. For example, if a `SnowflakeSource` is being examined, `snowflake_python_type_to_feast_value_type` from `type_map.py` will be called.
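The last step can be sketched as a simple lookup table. The mapping below is hypothetical and greatly simplified relative to the real tables in `type_map.py`:

```python
# Simplified sketch of a source-type -> Feast-type mapping, in the spirit of
# the per-source helpers in type_map.py. The entries below are illustrative,
# not the actual table used by Feast.
SNOWFLAKE_TO_FEAST = {
    "NUMBER": "INT64",
    "FLOAT": "FLOAT64",
    "TEXT": "STRING",
    "BINARY": "BYTES",
    "BOOLEAN": "BOOL",
    "TIMESTAMP_NTZ": "UNIX_TIMESTAMP",
}

def snowflake_type_to_feast_value_type(snowflake_type: str) -> str:
    """Convert an upstream column type name to a Feast value type name."""
    try:
        return SNOWFLAKE_TO_FEAST[snowflake_type.upper()]
    except KeyError:
        raise ValueError(f"unsupported Snowflake type: {snowflake_type}")
```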

### Materialization

Feast serves feature values as [`Value`](https://github.com/feast-dev/feast/blob/master/protos/feast/types/Value.proto) proto objects, which have a type corresponding to Feast types.
Thus Feast must materialize feature values into the online store as `Value` proto objects.
* The local materialization engine first pulls the latest historical features and converts them to a pyarrow table.
* Then it calls `_convert_arrow_to_proto` to convert the pyarrow table to proto format.
* This calls `python_values_to_proto_values` in `type_map.py` to perform the type conversion.
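The final conversion step can be sketched with plain dictionaries standing in for `Value` protos (illustrative only; the real code in `type_map.py` produces protobuf objects):

```python
# Sketch of a python_values_to_proto_values-style conversion: wrap native
# Python values in a tagged structure analogous to the Value proto's oneof
# field. Illustrative stand-in, not Feast's actual implementation.
def python_value_to_tagged_value(value):
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return {"bool_val": value}
    if isinstance(value, int):
        return {"int64_val": value}
    if isinstance(value, float):
        return {"double_val": value}
    if isinstance(value, str):
        return {"string_val": value}
    if isinstance(value, bytes):
        return {"bytes_val": value}
    raise TypeError(f"unsupported Python type: {type(value).__name__}")
```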

### Historical feature retrieval

The Feast type system is typically not necessary when retrieving historical features.
A call to `get_historical_features` will return a `RetrievalJob` object, which allows the user to export the results to one of several possible locations: a Pandas dataframe, a pyarrow table, a data lake (e.g. S3 or GCS), or the offline store (e.g. a Snowflake table).
In all of these cases, the type conversion is handled natively by the offline store.
For example, a BigQuery query exposes a `to_dataframe` method that will automatically convert the result to a dataframe, without requiring any conversions within Feast.

### Feature serving

As mentioned above in the section on [materialization](#materialization), Feast persists feature values into the online store as `Value` proto objects.
A call to `get_online_features` will return an `OnlineResponse` object, which wraps a collection of `Value` protos along with some metadata.
The `OnlineResponse` object can then be converted into a Python dictionary; this conversion calls `feast_value_type_to_python_type` from `type_map.py`, a utility that converts Feast internal types to native Python types.
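That unwrapping step can be sketched with a plain dictionary standing in for a `Value` proto (illustrative only; `driver_rating` is a hypothetical feature name, and the real utility operates on protobuf objects):

```python
# Sketch of a feast_value_type_to_python_type-style conversion: take a
# tagged value (standing in for a Value proto with one populated oneof
# field) and return the native Python value it carries.
def tagged_value_to_python(tagged: dict):
    if len(tagged) != 1:
        raise ValueError("expected exactly one populated field")
    (field, value), = tagged.items()
    return value

# Hypothetical online row for a single entity.
online_row = {"driver_rating": {"double_val": 4.7}}
features = {name: tagged_value_to_python(v) for name, v in online_row.items()}
```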
