Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Workflow sandbox issues with Protobuf #688

Open
Gr1N opened this issue Nov 12, 2024 · 5 comments
Open

[Bug] Workflow sandbox issues with Protobuf #688

Gr1N opened this issue Nov 12, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@Gr1N
Copy link

Gr1N commented Nov 12, 2024

What are you really trying to do?

Hey!

I'm doing nothing special. I have one workflow and one activity. I use Protobuf messages as input parameters for both cases. I just expect to be able to execute the workflow and wait for it to be completed.

Describe the bug

The issue happens only in the Kubernetes environment.

Traceback (most recent call last):
  File \"/usr/local/lib/python3.11/site-packages/temporalio/converter.py\", line 436, in from_payload
    return google.protobuf.json_format.Parse(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py\", line 421, in Parse
    return ParseDict(js, message, ignore_unknown_fields, descriptor_pool,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py\", line 446, in ParseDict
    parser.ConvertMessage(js_dict, message, '')
  File \"/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py\", line 487, in ConvertMessage
    self._ConvertFieldValuePair(value, message, path)
  File \"/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py\", line 563, in _ConvertFieldValuePair
    if _IsMapEntry(field):
       ^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py\", line 151, in _IsMapEntry
    field.message_type.GetOptions().map_entry)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"<frozen importlib._bootstrap>\", line 1176, in _find_and_load
  File \"<frozen importlib._bootstrap>\", line 1130, in _find_and_load_unlocked
  File \"/usr/local/lib/python3.11/site-packages/temporalio/worker/workflow_sandbox/_importer.py\", line 393, in __getitem__
    return self.current[key]
           ~~~~~~~~~~~~^^^^^
KeyError: 'google.protobuf'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File \"/usr/local/lib/python3.11/site-packages/temporalio/converter.py\", line 306, in from_payloads
    values.append(converter.from_payload(payload, type_hint))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/temporalio/converter.py\", line 442, in from_payload
    raise RuntimeError(f\"Unknown Protobuf type {message_type}\") from err
RuntimeError: Unknown Protobuf type internal.path.to.protobuf.TaskRequest

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File \"/usr/local/lib/python3.11/site-packages/temporalio/worker/_workflow_instance.py\", line 1625, in _convert_payloads
    return self._payload_converter.from_payloads(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/temporalio/converter.py\", line 308, in from_payloads
    raise RuntimeError(
RuntimeError: Payload at index 0 with encoding json/protobuf could not be converted

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File \"/usr/local/lib/python3.11/site-packages/temporalio/worker/_workflow_instance.py\", line 364, in activate
    self._workflow_input = self._make_workflow_input(start_job)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/temporalio/worker/_workflow_instance.py\", line 896, in _make_workflow_input
    args = self._convert_payloads(start_job.arguments, arg_types)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/temporalio/worker/_workflow_instance.py\", line 1633, in _convert_payloads
    raise RuntimeError(\"Failed decoding arguments\") from err

Minimal Reproduction

Unfortunately, I don't have steps for reproduction. Everything works perfectly in my and my colleague's local environments.

I would love to hear ideas or proposals on what to look into to understand the root cause. The only thing I can state is that if I add the following import, everything works:

with workflow.unsafe.imports_passed_through():
    import google.protobuf

My other Protobuf imports (the result of code generation from Protobuf definitions) are already under unsafe context manager.

Environment/Versions

  • OS and processor: Linux
  • Temporal Version: 1.8.0
  • Yes, Docker and Kubernetes

Additional context

@Gr1N Gr1N added the bug Something isn't working label Nov 12, 2024
@cretz
Copy link
Member

cretz commented Nov 12, 2024

Unfortunately, I don't have steps for reproduction. Everything works perfectly in my and my colleague's local environments.

Ideally we can replicate so we can help debug the problem. It is interesting that the exact same Python code does different things in different environments. It may be a protobuf version issue or an order of operations issue, but a standalone replication may be needed for us to confirm the issue.

The only thing I can state is that if I add the following import, everything works:

Looking at the traceback, it's a bit of an interesting case where we catch KeyError because that's how we're checking a not-found protobuf, but this is actually having an import-level key error for importing google.protobuf. I wonder if it is lazily importing google.protobuf inside of GetOptions there somehow.

I can't really tell the problem without a replication. Can you confirm exact protobuf version (both library and protoc generator) and continually reduce the k8s-only form of replication until it's very small and standalone?

@Gr1N
Copy link
Author

Gr1N commented Nov 12, 2024

Here are my versions:

protobuf 4.25.3
libprotoc 3.19.6

Here is an example of a workflow file:

from temporalio import workflow

with workflow.unsafe.imports_passed_through():
    from temporalio.v1.lorem_pb2 import (
        LoremWorkflowRequest,
        LoremWorkflowResponse,
    )


@workflow.defn(name="lorem-workflow")
class LoremWorkflow:
    @workflow.run
    async def run(self, req: LoremWorkflowRequest) -> LoremWorkflowResponse:
        return LoremWorkflowResponse(text="i am response")

Could it be the issue with my Protobuf definitions and the fact that I use maps (but again, on local, all good)?

message LoremWorkflowRequest {
  map<string, bool> random = 1;
}

@cretz
Copy link
Member

cretz commented Nov 12, 2024

Here is an example of a workflow file:

To confirm, the exact same code from Python start to end fails in one of your environments but not another? May need to see the full standalone code (i.e. how you are registering/running worker). There may be an operations order issue.

Could it be the issue with my Protobuf definitions and the fact that I use maps (but again, on local, all good)?

I cannot know to be honest without replicating and debugging. The code you show above should work fine in every environment. I wonder if there are Python version differences or something else in the one environment where this happens that could help us figure out how to reliably replicate?

I think the main goal is to confirm the exact same small code that fails in one environment but works in another. Even something as simple as the order something is imported may affect it which is why exact code matters.

@Gr1N
Copy link
Author

Gr1N commented Nov 12, 2024

Yep, I understand that providing insights without a reproducible example is almost impossible.

However, I figured out how to make my workflow workable.

I had my Protobuf imports under unsafe context manager:

with workflow.unsafe.imports_passed_through():
    from temporalio.v1.lorem_pb2 import (
        LoremWorkflowRequest,
        LoremWorkflowResponse,
    )

I changed that to the following, and now I see no exceptions:

from temporalio.v1.lorem_pb2 import (
    LoremWorkflowRequest,
    LoremWorkflowResponse,
)

What is the right way?

@cretz
Copy link
Member

cretz commented Nov 12, 2024

What is the right way?

The first way with the imports passed through is usually better for models because with the second way you're re-importing the models on every single workflow run which costs memory and CPU. Having said that, we automatically mark any import beneath temporalio as pass through I believe, so a bit surprised there is a difference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants