
Update TensorFlow to 1.9 on Travis #10674

Merged (5 commits) on Aug 15, 2018

Conversation

@taehoonlee (Contributor) commented Jul 14, 2018

Summary

This PR updates TensorFlow to 1.9 on Travis.

Related Issues

PR Overview

  • This PR requires new unit tests [y/n] (make sure tests are included)
  • This PR requires to update the documentation [y/n] (make sure the docs are up-to-date)
  • This PR is backwards compatible [y/n]
  • This PR changes the current API [y/n] (all API changes need to be approved by fchollet)

@taehoonlee (Contributor, Author) commented

The time-out error with Python 3.6 still occurs...

@fchollet (Collaborator) commented

What can we do to debug it? It would be important for us to upgrade.

@fchollet (Collaborator) commented

Probably the best thing to do is to attempt to bisect the test codebase until we find which tests are linked to the timeout.

@Dref360 (Contributor) commented Jul 19, 2018

For convenience, here is the issue you opened: #10100

We could try running the Docker image locally:
https://docs.travis-ci.com/user/common-build-problems/#Troubleshooting-Locally-in-a-Docker-Image

As you said, it may be related to the RAM. I'll try to play around with jobs.
If it's just a memory issue, we could upgrade to a VM-based environment. We would gain 3.5 GB of RAM, and I think it's free because we are FOSS?

Info: https://docs.travis-ci.com/user/reference/overview/

NOTE: SUCCESS HERE DOESN'T MEAN IT WORKS. DON'T MERGE THIS.

@Dref360 (Contributor) commented Jul 22, 2018

This seems related to multiprocessing.Manager not being started.
I was able to get a stack trace here: https://travis-ci.com/Dref360/keras/jobs/135752354#L1764

All the tests that were blocked used fit_generator with a generator (which uses a Manager); a minimal repro sketch of this pattern follows below.

@fchollet, could we get someone from TF on this as well? This may be GIL-related.

EDIT: RAM could cause this as well, and it is probably one of the reasons. Still digging!
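
For reference, a minimal sketch of the kind of test that was blocked: a plain Python generator consumed through fit_generator with use_multiprocessing=True. It assumes the Keras 2.x API with the TensorFlow backend and is not taken from tests/test_multiprocessing.py.

```python
# Minimal, hedged repro sketch -- assumes Keras 2.x with the TensorFlow
# backend; it is not copied from the actual test files.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def data_gen(batch_size=16):
    # fit_generator expects an infinite generator.
    while True:
        x = np.random.random((batch_size, 8))
        y = np.random.randint(0, 2, (batch_size, 1))
        yield x, y

model = Sequential([Dense(1, activation='sigmoid', input_shape=(8,))])
model.compile(optimizer='sgd', loss='binary_crossentropy')

# With use_multiprocessing=True each worker is a separate OS process; the
# hang reported above happened while these processes were being started.
model.fit_generator(data_gen(),
                    steps_per_epoch=4,
                    epochs=1,
                    workers=2,
                    use_multiprocessing=True)
```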

@fchollet (Collaborator) commented

Thanks for the analysis @Dref360. I will see if anyone on the TF team can understand what's going on.

@Dref360 mentioned this pull request on Jul 23, 2018
@Dref360 (Contributor) commented Jul 26, 2018

Okay so after a lot of work, multiprocessing seems to be the cause.

Quick fix so that we can update: add --ignore=tests/keras/utils/data_utils_test.py --ignore=tests/test_multiprocessing.py to travis.yml.

Or create a new job that runs only those two test files?

I'm not sure why, but it requires a lot of memory to start processes for generators. We may want to update this code to a Pool (similar to Sequence); a rough sketch of the two patterns follows after the notes below.

Note:

  • The processes are correctly killed at the end
  • The issue is at startup
  • This is not a TF issue; TF >= 1.8 just requires more memory.
  • For some reason pytest-xdist creates a lot of idle threads?
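
A rough, hedged sketch of the two patterns contrasted above (the function names are made up, not from the Keras codebase): the generator path starts one dedicated process per worker and coordinates them through a Manager, while the Sequence path submits work to a Pool that is created once.

```python
# Hedged illustration of "process per worker + Manager" versus "Pool";
# make_batch and producer are made-up stand-ins, not Keras code.
import multiprocessing as mp

def make_batch(i):
    # Stand-in for producing one batch of data.
    return [i] * 4

def producer(queue, worker_id):
    # Stand-in for a worker that pulls batches and enqueues them.
    queue.put(make_batch(worker_id))

def process_per_worker(n_workers=2):
    # Generator-style pattern: dedicated processes plus a Manager queue.
    # Each Process.start() pays a full process startup cost, which is where
    # the memory spike at startup comes from.
    manager = mp.Manager()
    queue = manager.Queue()
    procs = [mp.Process(target=producer, args=(queue, w)) for w in range(n_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [queue.get() for _ in range(n_workers)]

def pool_based(n_workers=2):
    # Sequence-style pattern: a Pool created once, with work items submitted to it.
    with mp.Pool(processes=n_workers) as pool:
        return pool.map(make_batch, range(n_workers))

if __name__ == '__main__':
    print(process_per_worker())
    print(pool_based())
```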

@taehoonlee (Contributor, Author) commented

@Dref360, I agree with you. The multiprocessing tests seem to be the main cause of the insufficient-memory errors. The additional commit, which disables those tests, makes the CI pass on TF 1.9.

I have a question. I tested different backends on Python 2 and 3 with your script in #10756. Both Theano and CNTK showed stable memory usage, but TensorFlow showed quite random patterns; for example, TF 1.3 often seems to require more memory than TF 1.9. I want to understand clearly why there was no problem before and why the problem started occurring with TF 1.8.
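
This is not the script from #10756 (not reproduced here); it is only a minimal, standard-library way to get a comparable peak-memory number per backend, assuming the same tiny fit run is executed once for each backend.

```python
# Hedged sketch for comparing peak memory across backends; not the script
# from #10756. Run once per backend (e.g. KERAS_BACKEND=tensorflow, theano,
# cntk) and compare the printed numbers. On Linux, ru_maxrss is the peak
# resident set size in kilobytes.
import resource

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(64, activation='relu', input_shape=(32,)),
                    Dense(1)])
model.compile(optimizer='sgd', loss='mse')
model.fit(np.random.random((256, 32)), np.random.random((256, 1)),
          epochs=1, verbose=0)

print('peak RSS (kB):', resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
```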

@fchollet (Collaborator) commented Aug 9, 2018

@taehoonlee, should we merge this PR, or is there a risk that not testing test_multiprocessing with TensorFlow on CI would lead to coverage problems and potentially introduce TF bugs down the road?

@taehoonlee (Contributor, Author) commented

@fchollet, @Dref360, I did some more testing and found a simple trick.

In tests/test_multiprocessing.py, the only thing we need to do is set WORKERS to 2. Then we can run all the tests on TF 1.9.

The excessive memory usage on TF still exists, but it is out of the scope of Keras. If we focus only on Keras, there is no risk of leaving the multiprocessing tests untested.

@fchollet (Collaborator) left a review comment

That's great, thanks for finding a simple fix!

@Dref360 (Contributor) commented Oct 10, 2018

I did some experiments, and it seems that using 'spawn' instead of 'fork' solves our issue (a short sketch of the idea follows below).

  • This is only doable on Python 3.
  • This will only fix the OrderedEnqueuer, since generators are not picklable.
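
A short, hedged sketch of the idea (not the Keras implementation): request a 'spawn' context explicitly and hand a picklable, Sequence-like object to the pool. A plain generator could not be used this way because generators are not picklable.

```python
# Hedged sketch of 'spawn' instead of 'fork' (Python 3 only); the class and
# function names below are illustrative, not from Keras.
import multiprocessing as mp

class SquareSequence:
    """Picklable stand-in for a keras.utils.Sequence."""
    def __init__(self, n):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        return idx * idx

def fetch(args):
    seq, idx = args
    return seq[idx]

if __name__ == '__main__':
    ctx = mp.get_context('spawn')   # explicit start method instead of the Linux fork default
    seq = SquareSequence(8)
    # Everything sent to spawned workers must be picklable, which is why this
    # works for Sequence-style inputs but not for plain generators.
    with ctx.Pool(processes=2) as pool:
        print(pool.map(fetch, [(seq, i) for i in range(len(seq))]))
```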
