Python UDF in Ingestion being used for feature validation #1234
Conversation
/test test-end-to-end
/test test-end-to-end-gcp
/test test-end-to-end-aws
Signed-off-by: Oleksii Moskalenko <moskalenko.alexey@gmail.com>
@@ -332,6 +332,7 @@ def _stream_ingestion_step(
         ],
         "Args": ["spark-submit", "--class", "feast.ingestion.IngestionJob"]
         + jars_args
+        + ["--conf", "spark.yarn.isPython=true"]
Is there any downside to doing this? Do you know why Spark doesn't always set it to true?
It's yarn-specific and, from what I found, it's mostly used to enable distribution of python-related files (like pyspark.zip) to yarn workers. It's set by spark-submit when the main file is a py-file, which is not the case for our IngestionJob.
Makes sense. Maybe add a comment there for future generations?
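To make the diff above concrete, here is a hedged sketch of how the step args end up assembled; `jars_args` and the bucket path are illustrative placeholders, not values from the PR.

```python
# Illustrative sketch of the EMR step args from the diff above.
# jars_args and the s3 path are hypothetical, not from the PR.
jars_args = ["--jars", "s3://example-bucket/ingestion-extra.jar"]

step_args = (
    ["spark-submit", "--class", "feast.ingestion.IngestionJob"]
    + jars_args
    # yarn-specific: makes YARN ship python support files (e.g. pyspark.zip)
    # to workers even though the main artifact is a jar, not a .py file
    + ["--conf", "spark.yarn.isPython=true"]
)
```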
spark/ingestion/src/main/scala/feast/ingestion/sources/bq/BigQueryReader.scala
@@ -45,19 +45,19 @@ lint-java:
	${MVN} --no-transfer-progress spotless:check

test-java:
-	${MVN} --no-transfer-progress test
+	${MVN} --no-transfer-progress -DskipITs=true test
why? (i'm not super familiar with our java test machinery, does it now skip some tests it didn't skip before?)
TL;DR: mvn test should run only unit tests; everything else should be skipped.
We have two toggles: skipITs (skip Integration Tests) and skipUTs (skip Unit Tests). We need them to run those test suites separately. Before, there was no need for skipITs, since in the maven pipeline test runs before verify (used for ITs). But in the spark part we have some additional steps (the generate-test-source phase) that are required only by integration tests and aren't needed by unit tests (see spark/ingestion/pom.xml). To skip those steps I added this flag here.
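A sketch of the two invocations implied by the Makefile change and the comment above; the skipUTs wiring for the integration-test run is an assumption based on this thread, and the exact behavior lives in spark/ingestion/pom.xml.

```python
# Sketch of the two separate test invocations described above.
# The skipUTs/verify pairing is an assumption from the discussion,
# not taken verbatim from the Makefile.
unit_test_cmd = ["mvn", "--no-transfer-progress", "-DskipITs=true", "test"]
integration_test_cmd = ["mvn", "--no-transfer-progress", "-DskipUTs=true", "verify"]
```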
from feast import Client, FeatureTable

GE_PACKED_ARCHIVE = "https://storage.googleapis.com/feast-jobs/spark/validation/pylibs-ge-%(platform)s.tar.gz"
How does this work from a distribution perspective? I worry that if this is not in any way tied to the Feast version, we can't upgrade GE without breaking older versions of the ingestion job.
Maybe we don't have to address this now, especially while this is contrib, but sometime down the road we probably need to pin the version of this tarball to a Feast version somehow.
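For context on the snippet above: the URL is a %-style template with a platform placeholder. A hedged sketch of how it might be resolved follows; the actual key the SDK substitutes is not shown in this thread, so the value below is hypothetical.

```python
import platform

GE_PACKED_ARCHIVE = (
    "https://storage.googleapis.com/feast-jobs/spark/validation/"
    "pylibs-ge-%(platform)s.tar.gz"
)

# Hypothetical resolution of the %(platform)s placeholder, e.g. "linux-x86_64".
# The real substitution used by the SDK is not visible in this thread.
archive_url = GE_PACKED_ARCHIVE % {
    "platform": f"{platform.system().lower()}-{platform.machine()}"
}
```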
I know that smells. But since the feature is experimental I think it's ok for now.
As an option for the future, we could put this archive inside the ingestion jar or docker image (jobservice).
Co-authored-by: Oleg Avdeev <oleg.v.avdeev@gmail.com> Signed-off-by: Oleksii Moskalenko <moskalenko.alexey@gmail.com>
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: khorshuheng, oavdeev, pyalex The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment
What this PR does / why we need it:
This PR introduces an experimental feature that enables using custom python code inside the ingestion job. See #1230 for motivation.
Technical details:
The ingestion job detects the _validationUDF label and uses the provided code for feature validation. The code is called right after reading data from the source and before writing it to the store. IngestionJob can optionally (set by flag) drop rows that do not pass validation.
Limitations:
Since the python pickling happens on the customer's machine and unpickling on the Spark worker, there might be pickle protocol incompatibilities if the customer's and worker's python versions differ. Ideally, to avoid that, the minor python versions should match (e.g. both 3.7). However, tests confirmed that a lower version on the SDK side is fine (3.6 on SDK, 3.7 on worker), but not the other way around.
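One way to reason about the version constraint above: newer pythons default to newer pickle protocols that older workers cannot read (protocol 5 appeared in python 3.8). A minimal, hypothetical guard on the SDK side could cap the protocol; the worker version assumed here is illustrative, not from the PR.

```python
import pickle

# Hypothetical guard: cap the pickle protocol at what the worker's python
# supports. Protocol 5 appeared in python 3.8, so a 3.7 worker needs <= 4.
WORKER_MAX_PROTOCOL = 4  # assumption about the worker runtime

def dumps_for_worker(obj):
    # Never exceed the worker's ceiling, even on a newer SDK-side python.
    proto = min(pickle.HIGHEST_PROTOCOL, WORKER_MAX_PROTOCOL)
    return pickle.dumps(obj, protocol=proto)

payload = dumps_for_worker({"feature": "valid"})
```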
Which issue(s) this PR fixes:
Fixes #1230
Does this PR introduce a user-facing change?: