[Bug] Workflow sandbox issues with Protobuf #688

Gr1N · 2024-11-12T16:48:34Z

What are you really trying to do?

Hey!

I'm doing nothing special. I have one workflow and one activity. I use Protobuf messages as input parameters for both cases. I just expect to be able to execute the workflow and wait for it to be completed.

Describe the bug

The issue happens only in the Kubernetes environment.

Traceback (most recent call last):
  File \"/usr/local/lib/python3.11/site-packages/temporalio/converter.py\", line 436, in from_payload
    return google.protobuf.json_format.Parse(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py\", line 421, in Parse
    return ParseDict(js, message, ignore_unknown_fields, descriptor_pool,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py\", line 446, in ParseDict
    parser.ConvertMessage(js_dict, message, '')
  File \"/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py\", line 487, in ConvertMessage
    self._ConvertFieldValuePair(value, message, path)
  File \"/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py\", line 563, in _ConvertFieldValuePair
    if _IsMapEntry(field):
       ^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py\", line 151, in _IsMapEntry
    field.message_type.GetOptions().map_entry)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"<frozen importlib._bootstrap>\", line 1176, in _find_and_load
  File \"<frozen importlib._bootstrap>\", line 1130, in _find_and_load_unlocked
  File \"/usr/local/lib/python3.11/site-packages/temporalio/worker/workflow_sandbox/_importer.py\", line 393, in __getitem__
    return self.current[key]
           ~~~~~~~~~~~~^^^^^
KeyError: 'google.protobuf'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File \"/usr/local/lib/python3.11/site-packages/temporalio/converter.py\", line 306, in from_payloads
    values.append(converter.from_payload(payload, type_hint))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/temporalio/converter.py\", line 442, in from_payload
    raise RuntimeError(f\"Unknown Protobuf type {message_type}\") from err
RuntimeError: Unknown Protobuf type internal.path.to.protobuf.TaskRequest

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File \"/usr/local/lib/python3.11/site-packages/temporalio/worker/_workflow_instance.py\", line 1625, in _convert_payloads
    return self._payload_converter.from_payloads(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/temporalio/converter.py\", line 308, in from_payloads
    raise RuntimeError(
RuntimeError: Payload at index 0 with encoding json/protobuf could not be converted

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File \"/usr/local/lib/python3.11/site-packages/temporalio/worker/_workflow_instance.py\", line 364, in activate
    self._workflow_input = self._make_workflow_input(start_job)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/temporalio/worker/_workflow_instance.py\", line 896, in _make_workflow_input
    args = self._convert_payloads(start_job.arguments, arg_types)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/temporalio/worker/_workflow_instance.py\", line 1633, in _convert_payloads
    raise RuntimeError(\"Failed decoding arguments\") from err

Minimal Reproduction

Unfortunately, I don't have steps for reproduction. Everything works perfectly in my and my colleague's local environments.

I would love to hear ideas or proposals on what to look into to understand the root cause. The only thing I can state is that if I add the following import, everything works:

with workflow.unsafe.imports_passed_through():
    import google.protobuf

My other Protobuf imports (the result of code generation from Protobuf definitions) are already under unsafe context manager.

Environment/Versions

OS and processor: Linux
Temporal Version: 1.8.0
Yes, Docker and Kubernetes

Additional context

The text was updated successfully, but these errors were encountered:

cretz · 2024-11-12T17:15:23Z

Unfortunately, I don't have steps for reproduction. Everything works perfectly in my and my colleague's local environments.

Ideally we can replicate so we can help debug the problem. It is interesting that the exact same Python code does different things in different environments. It may be a protobuf version issue or an order of operations issue, but a standalone replication may be needed for us to confirm the issue.

The only thing I can state is that if I add the following import, everything works:

Looking at the traceback, it's a bit of an interesting case where we catch KeyError because that's how we're checking a not-found protobuf, but this is actually having an import-level key error for importing google.protobuf. I wonder if it is lazily importing google.protobuf inside of GetOptions there somehow.

I can't really tell the problem without a replication. Can you confirm exact protobuf version (both library and protoc generator) and continually reduce the k8s-only form of replication until it's very small and standalone?

Gr1N · 2024-11-12T17:23:00Z

Here are my versions:

protobuf 4.25.3
libprotoc 3.19.6

Here is an example of a workflow file:

from temporalio import workflow

with workflow.unsafe.imports_passed_through():
    from temporalio.v1.lorem_pb2 import (
        LoremWorkflowRequest,
        LoremWorkflowResponse,
    )


@workflow.defn(name="lorem-workflow")
class LoremWorkflow:
    @workflow.run
    async def run(self, req: LoremWorkflowRequest) -> LoremWorkflowResponse:
        return LoremWorkflowResponse(text="i am response")

Could it be the issue with my Protobuf definitions and the fact that I use maps (but again, on local, all good)?

message LoremWorkflowRequest {
  map<string, bool> random = 1;
}

cretz · 2024-11-12T17:47:27Z

Here is an example of a workflow file:

To confirm, the exact same code from Python start to end fails in one of your environments but not another? May need to see the full standalone code (i.e. how you are registering/running worker). There may be an operations order issue.

Could it be the issue with my Protobuf definitions and the fact that I use maps (but again, on local, all good)?

I cannot know to be honest without replicating and debugging. The code you show above should work fine in every environment. I wonder if there are Python version differences or something else in the one environment where this happens that could help us figure out how to reliably replicate?

I think the main goal is to confirm the exact same small code that fails in one environment but works in another. Even something as simple as the order something is imported may affect it which is why exact code matters.

Gr1N · 2024-11-12T19:27:05Z

Yep, I understand that providing insights without a reproducible example is almost impossible.

However, I figured out how to make my workflow workable.

I had my Protobuf imports under unsafe context manager:

with workflow.unsafe.imports_passed_through():
    from temporalio.v1.lorem_pb2 import (
        LoremWorkflowRequest,
        LoremWorkflowResponse,
    )

I changed that to the following, and now I see no exceptions:

from temporalio.v1.lorem_pb2 import (
    LoremWorkflowRequest,
    LoremWorkflowResponse,
)

What is the right way?

cretz · 2024-11-12T20:14:21Z

What is the right way?

The first way with the imports passed through is usually better for models because with the second way you're re-importing the models on every single workflow run which costs memory and CPU. Having said that, we automatically mark any import beneath temporalio as pass through I believe, so a bit surprised there is a difference.

Gr1N added the bug Something isn't working label Nov 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Workflow sandbox issues with Protobuf #688

[Bug] Workflow sandbox issues with Protobuf #688

Gr1N commented Nov 12, 2024

cretz commented Nov 12, 2024 •

edited

Loading

Gr1N commented Nov 12, 2024

cretz commented Nov 12, 2024

Gr1N commented Nov 12, 2024

cretz commented Nov 12, 2024 •

edited

Loading

[Bug] Workflow sandbox issues with Protobuf #688

[Bug] Workflow sandbox issues with Protobuf #688

Comments

Gr1N commented Nov 12, 2024

What are you really trying to do?

Describe the bug

Minimal Reproduction

Environment/Versions

Additional context

cretz commented Nov 12, 2024 • edited Loading

Gr1N commented Nov 12, 2024

cretz commented Nov 12, 2024

Gr1N commented Nov 12, 2024

cretz commented Nov 12, 2024 • edited Loading

cretz commented Nov 12, 2024 •

edited

Loading

cretz commented Nov 12, 2024 •

edited

Loading