Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Pandas codec decoding from numpy arrays #1751

Merged
merged 3 commits into from
May 16, 2024

Conversation

lhnwrk
Copy link
Contributor

@lhnwrk lhnwrk commented May 9, 2024

Pandas codec currently breaks mlflow runtime's schema enforcement (#1625). This is because numeric mlflow datatypes are assigned NumpyCodec content type by default, which decodes PandasCodec's explicit batch size into a column vector:

input = RequestInput(
	name="foo",
	shape=[3, 1],
	data=[1,2,3],
	datatype="INT64",
)
NumpyCodec.decode_input(input)  # np.array([[1], [2], [3]])

However, pd.Series expects one-dimensional data, and when the _decoded_payload is converted to a pd.Series here returns unexpected results:

# Expect pd.Series([1,2,3]), got pd.Series([(1,), (2,), (3,)])
payload = np.array([[1], [2], [3]])
payload = list(payload)
pd.Series(payload, dtype="int64")

This seems like a bug from pandas side, dtype="int64" should have thrown an error (and in fact it does for higher dimension, but for some reason is okay with [np.array([1]), np.array([2]), np.array([3])] and weirdly casts them into tuples), but either way mlserver should have dropped the trailing axis as pd.Series already implies column vector.

From this test, it seems like the conversion to list in the codec was to accommodate the use case of pd.Series of tensors:

# Expect decoded pd.Series([np.array([1,2,3])]) in test
RequestInput(
	name="a",
	data=[1, 2, 3],
	datatype="FP32",
	shape=[1, 3],
	parameters=Parameters(_decoded_payload=np.array([[1, 2, 3]])),
)

I'd argue in the context of Pandas codec this should be made explicit with shape [1, 1] and datatype BYTES for array elements. In fact, a test above does so, specifying the intended shape is a [2,1] column vector with datatype BYTES for list elements.

# Expect encoded ResponseOutput(name="bar", shape=[2, 1], data=[[1, 2, 3], [4, 5, 6]], datatype="BYTES") in test
pd.Series(data=[[1, 2, 3], [4, 5, 6]], name="bar")

This PR explicitly reshapes np.array payload to expected shapes before conversion to pd.Series.

Copy link
Contributor

@jesse-c jesse-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks for the detailed write-up! 🙇🏻

We're resolving some issues around testing and then I plan on getting this in.

@jesse-c
Copy link
Contributor

jesse-c commented May 15, 2024

We've gotten that fix in. @lhnwrk: could you please rebase this branch on top of master?

@lhnwrk
Copy link
Contributor Author

lhnwrk commented May 15, 2024

@jesse-c Rebased on master!

@jesse-c
Copy link
Contributor

jesse-c commented May 16, 2024

@lhnwrk: Thank you! I've had to re-approve it and run the checks, so they're going now. As soon as it passes, I'll merge it in.

@jesse-c jesse-c merged commit 4a1030b into SeldonIO:master May 16, 2024
25 checks passed
@jesse-c
Copy link
Contributor

jesse-c commented May 16, 2024

Merged in! Thank you again, @lhnwrk. 🙏🏻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants