Skip to content

Attempt to fix static image export hanging by flushing after printing JSON#149

Closed
adeak wants to merge 1 commit intoplotly:masterfrom
adeak:flush_after_json_printing
Closed

Attempt to fix static image export hanging by flushing after printing JSON#149
adeak wants to merge 1 commit intoplotly:masterfrom
adeak:flush_after_json_printing

Conversation

@adeak
Copy link

@adeak adeak commented Oct 18, 2022

This is a potentially "wishful thinking" attempt to fix (a subset of?) the "writing image stalls" problem with a dozen or so open issues. Unfortunately I haven't been able to test whether it fixes the issue, because I do not have the tooling at hand to build the library (assuming this indeed involves the 50 GB chromium compile). If there's a low-profile way to compile this patch I'd love to test it locally on my "known bad" setup.

My situation:

  1. A coworker and I have almost identical hardware, identical OS version (down to the exact Windows build) and the exact same conda environment. His conda is 4.8 and mine is 4.11, but other than this I couldn't pin down any differences between our setups. For the time being we're stuck with plotly 4.11.0.
  2. Any call to fig.write_image() will hang on my machine, and it will always succeed on coworker's machine. (So it's not the "sometimes hangs on the same system" case some people see.)
  3. I get a hang in any shell I can find: git-bash, cmd, powershell, even WSL. (So it's not the "doesn't work in WSL only" case some people see.)
  4. The hang always happens in the usual suspect self._proc.stdout.readline() line in _ensure_kaleido(), see e.g. traceback in Python 3.6 container image: First fig.to_image call is very slow (~ 5 minutes) #36 (comment)
  5. Running the subprocess' corresponding command directly in the shell gives the same output as Image generation hangs on windows 10 #110 (comment), where the stdout JSON payload even has a trailing newline as it should.
  6. Disabling mathjax doesn't help. Nor does disabling internet entirely. And the hang never ended during the 3-hour slot I let it run once. (So it's probably not the "kaleido waits for timeout" case some people see.)

Buffering?

The only thing I can suspect here is buffering. If for some reason the process is not line buffered (and kaleido runs the subprocess with a binary stdout stream, so I'm not even sure line buffering is a thing there), the readline() call might be stuck, waiting for the stdout stream to flush, which for some reason might not happen on some systems.

Why it might not be buffering:

  1. It's quite surprising if two essentially identical systems (those of coworker and I) have markedly different buffering behaviour.
  2. The newline seems to be printed fine, which normally should flush assuming line buffering (but, again, this is a binary stream for now).
  3. Editing the library to open the subprocess with text mode streams via universal_newlines=True doesn't make the problem go away. Manually calling self._proc.stdout.flush() doesn't make the problem go away. Accessing self._proc.stdout.raw which might be less buffered doesn't make the problem go away. Calling self._proc.stdout.read(70) or even .read(50) (since we expect 70 characters for the JSON) halts all the same.
  4. Trying to reproduce the issue running a small python script that prints a similar linefeed-terminated line (run through the same subprocess.Popen setup) doesn't stall.
  5. Editing the library to open the subprocess with bufsize=50 (say) which is shorter than the 70 bytes of the expected JSON doesn't make the problem go away (it still stalls).

Why it might be buffering:

  1. This would explain why the readline() call freezes in the subprocess call (JSON is stuck in the buffer).
  2. The JSON writer adds the linefeed to the string rather than calling std::endl which would flush.
  3. Frankly, I have no other idea and the manual flush in this patch seems harmless enough to try, considering that the mysterious issue has been around for a long time.

Unfortunately I couldn't find any exact information on what kind of buffering options are available on a given OS, and what defaults are. The only thing I could figure out is that both in my native Windows and in my WSL the io.DEFAULT_BUFFER_SIZE is 8192, so if the output is not line buffered then it would take a lot of output to get it flushed (assuming I even understand the mechanics here correctly).

@Lucas-C
Copy link
Contributor

Lucas-C commented Mar 8, 2023

Hi @adeak

I may be facing the same issue in py-pdf/fpdf2#714, and I'd love to try your patch.
Considering that the patch applies to kaleido/cc/utils.h, will it be taken into account if install Kaleido from your git branch?

pip install git+https://github.com/adeak/Kaleido.git@flush_after_json_printing

Or do I have to compile my own version of the kaleido binary executable?

@adeak
Copy link
Author

adeak commented Mar 8, 2023

That won't work @Lucas-C, because the package has no setup.py, and because it's probably not how it works in general.

Reading the README (especially the "disadvantages" section) sounded like compilation is a huge amount of effort. This is why I didn't try to do that... Even if this PR works I'd have to build kaleido twice: first to check that a local rebuild on its own doesn't make the issue go away, and then rebuild for the PR branch. And it's very likely that

  1. the attempted fix doesn't work, because the issue is not buffering after all, or
  2. the fix works but then the code gets stuck on the next print, somewhere down the line.

Considering that the library seems unmaintained in general, this is not a super encouraging situation for going on potential wild goose chases, alas. When I opened this PR I was hoping that someone with the project already has tooling and CPU hours to build this so we can test.

@Lucas-C
Copy link
Contributor

Lucas-C commented Mar 9, 2023

Thank you for the detailed answer @adeak!

I tried to compile it on my machine yesterday, following the steps in the .circleci/config.yml, win_scripts/fetch_chromium.ps1 & win_scripts/build_kaleido.ps1 files, but it took ages to download & setup all the required repos & tools, so I gave up in the end.

I probably won't dive into this any further.

Whoever reads this: have a nice day!

@Connor-Heindel
Copy link

I pulled the latest version and the fork into a local repo and went through all of the compilation steps, and installed the generated wheel file on my WSL2 setup, which is having the same issue, and unfortunately I had no luck with it. It was still hanging on image generation forever, and the error stream when interrupted was the same.

The next step I'd like to try would be pulling a newer version of Chromium, since this current Kaleido build uses one that is significantly older, but it requires re-tooling the majority of the build process, and won't be a fast process to find all of the changes that have to be made.

@gvwilson gvwilson self-assigned this Jul 26, 2024
@gvwilson
Copy link
Contributor

Thanks for your interest in Kaleido. We are currently working on an overhaul that might address your issue - we hope to have news in a few weeks and will post an update then. Thanks - @gvwilson

@adeak
Copy link
Author

adeak commented Jul 26, 2024

Sounds great, thanks for the update @gvwilson.

@gvwilson gvwilson removed their assignment Aug 3, 2024
@gvwilson gvwilson added community community contribution fix fixes something broken P2 needed for current cycle labels Aug 14, 2024
@gvwilson
Copy link
Contributor

@adeak my apologies that "a few weeks" has turned into a couple of months - we are close to releasing the new version, and if you'd be interested in helping us test it to make sure it addresses the problems you've encountered, please give me a shout (greg.wilson@plot.ly). Thanks very much, and my apologies once again.

@adeak
Copy link
Author

adeak commented Oct 22, 2024

Thanks for the ping @gvwilson. No worrries, I'm glad you are resurrecting the project. I've just sent an email.

@DLabyGmr
Copy link

Hello, sorry for bothering you and stepping into the conversation. I lack a lot of knowledge related to GitHub and programming in general, so I might make some incorrect statements in this message. However, I hope it can still be helpful.

Yesterday, I encountered this issue of "image export hanging," and the hang occurred in self._proc.stdout.readline(). I am using Windows 11 24H2 26100.3194 and kaleido-0.2.1 at this point, which reproduces the mentioned issue.

After searching online, some people recommended downgrading to version kaleido-0.1.0post1. I did so, and the issue was resolved. However, it bothered me, so I took some time to compare the two versions of the repository...

I saw some changes regarding the encoding, and I thought that might be the issue, but it didn’t help. The problem persisted. After this attempt, I concluded that the issue isn't in the executed Python code but must be in the "kaleido.exe" file.

I compared the kaleido.exe files from versions kaleido-0.2.1 and kaleido-0.1.0post1 and noticed a significant size difference (probably due to the inclusion of MathJax, which is likely unrelated to the issue). So, I decided to replace the kaleido.exe file of version 0.2.1 with the one from 0.1.0post1 (so I have kaleido-0.2.1 installed, but with the exe file from 0.1.0post1). The issue was gone; it worked, created the output image, and did it really quickly.

Next, while this may not be particularly useful, I tested using kaleido-0.1.0post1 with the executable file from 0.2.1 just to double-check that the exe file was the issue. The problem was reproduced in this case, with execution hanging at self._proc.stdout.readline().

I finally narrowed it down to the changes made between kaleido-0.1.0post1 and kaleido-0.2.0rc1 (since the issue is also present in 0.2.0rc1). Upon reviewing the differences between these two versions, I can see that there is no change in the source code that builds the executable: v0.1.0.post1...v0.2.0rc1. The only thing that caught my attention is the addition in config.yml of:

run:
    name: Install Debugging Tools for Windows SDK 10.0.19041.685
    command: choco install windows-sdk-10-version-2004-windbg

But perhaps you can find the issue with this information.

In conclusion, it seems that this issue is related to how kaleido.exe is compiled and not directly with the source code.

@ayjayt
Copy link
Collaborator

ayjayt commented Feb 24, 2025

Hi @DLabyGmr

You are correct about many issues. Those versions are four years old and are no longer a tenable strategy for kaleido.

The new versions of kaleido, which are currently only available on github while we wait for documetation/integration tests, will use an external browser. (Kaleido.exe integrated a browser). Unfortunately, the size of the browsers seem to grow without bound over time. So its a constant problem.

I'm going to close this pull request, as these old versions of kaleido are no longer supported. Please feel free to respond and I'll help in anyway that i can.

@ayjayt ayjayt closed this Feb 24, 2025
@adeak
Copy link
Author

adeak commented Feb 24, 2025

@DLabyGmr in the meantime (and if expecting users to have Chrome installed is not an option, then also in the longer term) kaleido 0.1.0/0.1.0.post1 work if you only need to save figures to image files, at least they've worked on all our systems where newer releases hang.

Edit: sorry, I forgot you already knew that. Keeping this comment for the benefit of anyone coming across this in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community community contribution fix fixes something broken P2 needed for current cycle

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants