[BUG] Using HashMethod annotation for numpy arrays fails #5630

demmerichs · 2024-08-03T12:03:11Z

Describe the bug

Following your guide for caching of tasks based on non-literal inputs, but replacing the dataframe in the guide simply with an numpy array, leads to an error:

Error encountered while executing 'wf':
  Failed to convert outputs of task 'foo' at position 0:
  typing.Annotated[numpy.ndarray, <flytekit.core.hash.HashMethod object at 0xFFFFFFFFF>]'s metadata needs to be of type kwtypes.

I used the following hash function:

import hashlib
import numpy as np
def hash_array(arr: np.ndarray) -> str:
    return hashlib.sha1(arr.tobytes()).hexdigest()

and used it as

@task
def foo(  # noqa: F811
    a: int,
    b: str,  # noqa: F821
) -> Annotated[np.ndarray, HashMethod(hash_array)]:  # noqa: F821  # noqa: F821
    arr = np.array([a])  # noqa: F821
    return arr

Expected behavior

Successfully produce a caching of the bar task and a cache hit in subsequent runs.

Additional context to reproduce

I already stepped a bit through the flytekit code with a debugger and found its coming down to numpy arrays having a special transformer type logic in flytekit.

The hashing value is produced successfully internally, but a few code lines down the line hit the to_literal function of NumpyArrayTransformer, which seems to be unable to handle annotated types.

It seems dataframes from pandas don't get this special treatment.
Maybe you could also clarify here in this issue for me, why we need or want a special treatment for np arrays when it is still moving the data to disk and only stores a filepath as meta data of the output. Isn't that how all compound data structures are handled by default, e.g. offloading them to the disk and only storing a file pointer?

Screenshots

No response

Are you sure this issue hasn't been raised already?

Yes

Have you read the Code of Conduct?

Yes

The text was updated successfully, but these errors were encountered:

welcome · 2024-08-03T12:03:14Z

Thank you for opening your first issue here! 🛠

runllm · 2024-08-03T12:03:27Z

The issue you're experiencing with using HashMethod for numpy arrays is likely due to the special handling of numpy arrays in Flytekit's type system. According to the Flytekit documentation, the HashMethod annotation is used to control memoization behavior for non-literal inputs like pandas dataframes. However, numpy arrays have a specialized transformer logic in Flytekit, which might not fully support annotated types.

To address this, you might need to implement a custom transformer for numpy arrays that can handle the HashMethod annotation. This involves extending the StructuredDatasetEncoder and StructuredDatasetDecoder classes to include your custom hashing logic.

Would you like more details on how to implement a custom transformer for numpy arrays in Flytekit?

Sources:

This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and 👎 if it is not.

Mention @RunLLM to ask follow-up questions. See runllm.com for more!

demmerichs · 2024-08-03T12:09:36Z

@RunLLM Sadly most of your comment is repeating my arguments. Regarding writing a custom transformer, I think that because flyte is already providing a special transformer for numpy arrays and logic for a hash method, this error should be resolved on fltye's side instead of asking users to fix it themselves with yet another custom transformer. But could you perhaps elaborate, why flyte has a customer transformer for numpy arrays in the first place?

runllm · 2024-08-03T12:09:47Z

Flyte has a custom transformer for numpy arrays to efficiently handle their serialization and deserialization, ensuring compatibility with Flyte's type system and storage mechanisms. This transformer, NumpyArrayTransformer, converts numpy arrays to a file format (e.g., .npy) and manages their storage and retrieval from remote locations.

The error you're encountering arises because the NumpyArrayTransformer does not currently support handling annotated types with HashMethod. This is a limitation in the current implementation.

Would you like more details on how to potentially modify the existing NumpyArrayTransformer to support HashMethod annotations?

Sources:

This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and 👎 if it is not.

Mention @RunLLM to ask follow-up questions. See runllm.com for more!

demmerichs · 2024-08-03T12:11:32Z

@RunLLM How does flyte handle data structures with no specialized transformers like pandas dataframes compared to the customized handling of numpy arrays?

runllm · 2024-08-03T12:11:47Z

Flyte handles data structures without specialized transformers, like pandas dataframes, by using a more generic approach. For instance, pandas dataframes are typically serialized to a file format (e.g., Parquet) and stored in a remote location, with only the file path being stored as metadata. This is managed by the StructuredDataset class, which provides a flexible way to handle various data formats and storage backends.

In contrast, numpy arrays have a specialized transformer, NumpyArrayTransformer, which directly handles the serialization to .npy files and manages their storage and retrieval. This specialized handling ensures efficient processing and compatibility with Flyte's type system but currently lacks support for annotations like HashMethod.

Would you like more details on how to potentially modify the existing NumpyArrayTransformer to support HashMethod annotations?

Sources:

This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and 👎 if it is not.

Mention @RunLLM to ask follow-up questions. See runllm.com for more!

demmerichs added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Aug 3, 2024

github-project-automation bot added this to Flyte Issues/PRs maintenance Aug 3, 2024

github-project-automation bot moved this to Backlog in Flyte Issues/PRs maintenance Aug 3, 2024

demmerichs mentioned this issue Aug 3, 2024

allow a hash method to be present for numpy arrays flyteorg/flytekit#2649

Merged

3 tasks

wild-endeavor closed this as completed in flyteorg/flytekit#2649 Aug 6, 2024

github-project-automation bot moved this from Backlog to Done in Flyte Issues/PRs maintenance Aug 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Using HashMethod annotation for numpy arrays fails #5630

[BUG] Using HashMethod annotation for numpy arrays fails #5630

demmerichs commented Aug 3, 2024

welcome bot commented Aug 3, 2024

runllm bot commented Aug 3, 2024

demmerichs commented Aug 3, 2024

runllm bot commented Aug 3, 2024

demmerichs commented Aug 3, 2024

runllm bot commented Aug 3, 2024

[BUG] Using HashMethod annotation for numpy arrays fails #5630

[BUG] Using HashMethod annotation for numpy arrays fails #5630

Comments

demmerichs commented Aug 3, 2024

Describe the bug

Expected behavior

Additional context to reproduce

Screenshots

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

welcome bot commented Aug 3, 2024

runllm bot commented Aug 3, 2024

demmerichs commented Aug 3, 2024

runllm bot commented Aug 3, 2024

demmerichs commented Aug 3, 2024

runllm bot commented Aug 3, 2024