Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Using HashMethod annotation for numpy arrays fails #5630

Closed
2 tasks done
demmerichs opened this issue Aug 3, 2024 · 6 comments · Fixed by flyteorg/flytekit#2649
Closed
2 tasks done

[BUG] Using HashMethod annotation for numpy arrays fails #5630

demmerichs opened this issue Aug 3, 2024 · 6 comments · Fixed by flyteorg/flytekit#2649
Labels
bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers

Comments

@demmerichs
Copy link

Describe the bug

Following your guide for caching of tasks based on non-literal inputs, but replacing the dataframe in the guide simply with an numpy array, leads to an error:

Error encountered while executing 'wf':
  Failed to convert outputs of task 'foo' at position 0:
  typing.Annotated[numpy.ndarray, <flytekit.core.hash.HashMethod object at 0xFFFFFFFFF>]'s metadata needs to be of type kwtypes.

I used the following hash function:

import hashlib
import numpy as np
def hash_array(arr: np.ndarray) -> str:
    return hashlib.sha1(arr.tobytes()).hexdigest()

and used it as

@task
def foo(  # noqa: F811
    a: int,
    b: str,  # noqa: F821
) -> Annotated[np.ndarray, HashMethod(hash_array)]:  # noqa: F821  # noqa: F821
    arr = np.array([a])  # noqa: F821
    return arr

Expected behavior

Successfully produce a caching of the bar task and a cache hit in subsequent runs.

Additional context to reproduce

I already stepped a bit through the flytekit code with a debugger and found its coming down to numpy arrays having a special transformer type logic in flytekit.

The hashing value is produced successfully internally, but a few code lines down the line hit the to_literal function of NumpyArrayTransformer, which seems to be unable to handle annotated types.

It seems dataframes from pandas don't get this special treatment.
Maybe you could also clarify here in this issue for me, why we need or want a special treatment for np arrays when it is still moving the data to disk and only stores a filepath as meta data of the output. Isn't that how all compound data structures are handled by default, e.g. offloading them to the disk and only storing a file pointer?

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@demmerichs demmerichs added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Aug 3, 2024
Copy link

welcome bot commented Aug 3, 2024

Thank you for opening your first issue here! 🛠

Copy link

runllm bot commented Aug 3, 2024

The issue you're experiencing with using HashMethod for numpy arrays is likely due to the special handling of numpy arrays in Flytekit's type system. According to the Flytekit documentation, the HashMethod annotation is used to control memoization behavior for non-literal inputs like pandas dataframes. However, numpy arrays have a specialized transformer logic in Flytekit, which might not fully support annotated types.

To address this, you might need to implement a custom transformer for numpy arrays that can handle the HashMethod annotation. This involves extending the StructuredDatasetEncoder and StructuredDatasetDecoder classes to include your custom hashing logic.

Would you like more details on how to implement a custom transformer for numpy arrays in Flytekit?

Sources:


This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and 👎 if it is not.

Mention @RunLLM to ask follow-up questions. See runllm.com for more!

@demmerichs
Copy link
Author

@RunLLM Sadly most of your comment is repeating my arguments. Regarding writing a custom transformer, I think that because flyte is already providing a special transformer for numpy arrays and logic for a hash method, this error should be resolved on fltye's side instead of asking users to fix it themselves with yet another custom transformer. But could you perhaps elaborate, why flyte has a customer transformer for numpy arrays in the first place?

Copy link

runllm bot commented Aug 3, 2024

Flyte has a custom transformer for numpy arrays to efficiently handle their serialization and deserialization, ensuring compatibility with Flyte's type system and storage mechanisms. This transformer, NumpyArrayTransformer, converts numpy arrays to a file format (e.g., .npy) and manages their storage and retrieval from remote locations.

The error you're encountering arises because the NumpyArrayTransformer does not currently support handling annotated types with HashMethod. This is a limitation in the current implementation.

Would you like more details on how to potentially modify the existing NumpyArrayTransformer to support HashMethod annotations?

Sources:


This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and 👎 if it is not.

Mention @RunLLM to ask follow-up questions. See runllm.com for more!

@demmerichs
Copy link
Author

@RunLLM How does flyte handle data structures with no specialized transformers like pandas dataframes compared to the customized handling of numpy arrays?

Copy link

runllm bot commented Aug 3, 2024

Flyte handles data structures without specialized transformers, like pandas dataframes, by using a more generic approach. For instance, pandas dataframes are typically serialized to a file format (e.g., Parquet) and stored in a remote location, with only the file path being stored as metadata. This is managed by the StructuredDataset class, which provides a flexible way to handle various data formats and storage backends.

In contrast, numpy arrays have a specialized transformer, NumpyArrayTransformer, which directly handles the serialization to .npy files and manages their storage and retrieval. This specialized handling ensures efficient processing and compatibility with Flyte's type system but currently lacks support for annotations like HashMethod.

Would you like more details on how to potentially modify the existing NumpyArrayTransformer to support HashMethod annotations?

Sources:


This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and 👎 if it is not.

Mention @RunLLM to ask follow-up questions. See runllm.com for more!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

1 participant