SSL Overflow Errors on distributed 2022.05 #6556

@hhuuggoo

Description

What happened:
I tried to send a ridiculously large numpy array to a dask cluster. This resulted in an OverflowError.

2022-06-09 20:24:23,931 - distributed.batched - ERROR - Error in batched write
Traceback (most recent call last):
  File "/srv/conda/envs/saturn/lib/python3.9/site-packages/distributed/batched.py", line 94, in _background_send
    nbytes = yield self.comm.write(
  File "/srv/conda/envs/saturn/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/srv/conda/envs/saturn/lib/python3.9/site-packages/distributed/comm/tcp.py", line 314, in write
    stream.write(b"")
  File "/srv/conda/envs/saturn/lib/python3.9/site-packages/tornado/iostream.py", line 544, in write
    self._handle_write()
  File "/srv/conda/envs/saturn/lib/python3.9/site-packages/tornado/iostream.py", line 1486, in _handle_write
    super()._handle_write()
  File "/srv/conda/envs/saturn/lib/python3.9/site-packages/tornado/iostream.py", line 971, in _handle_write
    num_bytes = self.write_to_fd(self._write_buffer.peek(size))
  File "/srv/conda/envs/saturn/lib/python3.9/site-packages/tornado/iostream.py", line 1568, in write_to_fd
    return self.socket.send(data)  # type: ignore
  File "/srv/conda/envs/saturn/lib/python3.9/ssl.py", line 1173, in send
    return self._sslobj.write(data)
OverflowError: string longer than 2147483647 bytes

What you expected to happen:

No errors.

Minimal Complete Verifiable Example:

import numpy as np
from dask.distributed import Client

# 300M float64 values is ~2.4 GB, comfortably over the 2147483647-byte limit above
arr = np.random.random(300_000_000)

c = Client(...)  # assuming you have a cluster that supports TLS
print(c.scheduler)

def func(x):
    return x.sum()

fut = c.submit(func, arr)
print('submitted')
result = fut.result()
print('done')

Anything else we need to know?:

There was a related fix in #5141; however, that fix works by passing frame_split_size, which gets handed down to distributed.protocol.core.dumps, and frame_split_size is only applied when msgpack encounters a type it does not understand. The numpy array has already been turned into a bytearray by the time it reaches dumps, so frame_split_size never kicks in and we end up with a single ~2.4 GB message.
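
To make "splitting" concrete: frame splitting just chunks an oversized buffer into pieces below a size limit before they are handed to the comm. A generic illustration (not distributed's actual implementation; split_frame and the 64 MiB default are made up for this example):

def split_frame(frame, max_size=64 * 2**20):
    """Chunk a large frame into pieces of at most max_size bytes."""
    view = memoryview(frame)
    return [view[i:i + max_size] for i in range(0, len(view), max_size)]

A ~2.4 GB bytearray split this way becomes roughly 36 chunks, none of which comes anywhere near the 2147483647-byte limit in the traceback above.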

Tornado actually has a fix on master that resolves this issue; however, it has not made it into the latest tornado release, which is by now quite old.

https://github.com/tornadoweb/tornado/blob/master/tornado/iostream.py#L1565

When I monkey patch write_to_fd, the problem is resolved.
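
For reference, a minimal sketch of such a patch, assuming the upstream approach of clipping each SSL write (the 1 GiB cap is an arbitrary value safely below OpenSSL's 2 GiB single-write limit, and Tornado's _handle_write loop already keeps calling write_to_fd with whatever remains in the write buffer):

from tornado.iostream import SSLIOStream

MAX_SSL_WRITE = 1 << 30  # 1 GiB, well under the 2147483647-byte OpenSSL limit

_original_write_to_fd = SSLIOStream.write_to_fd

def _patched_write_to_fd(self, data):
    # Clip each SSL write; _handle_write advances the buffer by the number
    # of bytes actually written and calls us again with the remainder.
    return _original_write_to_fd(self, data[:MAX_SSL_WRITE])

SSLIOStream.write_to_fd = _patched_write_to_fd

The patch has to be applied before the large message is written, e.g. right after importing distributed in the client process (and in the workers/scheduler if they send comparably large frames).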

Environment:

  • Dask version: 2022.5
  • Python version: 3.9
  • Operating System: Linux
  • Install method (conda, pip, source): pip

Labels: bug (Something is broken)
