
Update TensorFlow to 1.9 on Travis #10674

Merged (5 commits) on Aug 15, 2018

Conversation

@taehoonlee (Contributor) commented Jul 14, 2018

Summary

This PR updates TensorFlow to 1.9 on Travis.

Related Issues

PR Overview

  • This PR requires new unit tests [y/n] (make sure tests are included)
  • This PR requires to update the documentation [y/n] (make sure the docs are up-to-date)
  • This PR is backwards compatible [y/n]
  • This PR changes the current API [y/n] (all API changes need to be approved by fchollet)

@taehoonlee (Contributor, Author) commented

The time-out error with Python 3.6 still occurs...

@fchollet (Collaborator) commented

What can we do to debug it? It would be important for us to upgrade.

@fchollet (Collaborator) commented

Probably the best thing to do is to attempt to bisect the test codebase until we find which tests are linked to the timeout.

@Dref360 (Contributor) commented Jul 19, 2018

For convenience, here is the issue you opened: #10100

We could try running the Docker image locally:
https://docs.travis-ci.com/user/common-build-problems/#Troubleshooting-Locally-in-a-Docker-Image

As you said, it may be related to the RAM. I'll try to play around with jobs.
If it's just a memory issue, we could upgrade to a VM-based environment. We would gain 3.5 GB of RAM, and I think it's free because we are FOSS?

Info: https://docs.travis-ci.com/user/reference/overview/

NOTE: SUCCESS HERE DOESN'T MEAN IT WORKS. DON'T MERGE THIS.

@Dref360 (Contributor) commented Jul 22, 2018

This seems related to multiprocessing.Manager not being started.
I was able to get a stack trace here: https://travis-ci.com/Dref360/keras/jobs/135752354#L1764

All the tests that were blocked used fit_generator with a generator (which uses a Manager); a minimal repro sketch of this pattern follows below.

@fchollet, could we get someone from TF on this as well? This may be GIL-related.

EDIT: RAM could cause this as well, and it is probably one of the reasons. Still digging!
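
For reference, a minimal sketch of the kind of test that was blocked: a plain Python generator consumed through fit_generator with use_multiprocessing=True. It assumes the Keras 2.x API with the TensorFlow backend and is not taken from tests/test_multiprocessing.py.

```python
# Minimal, hedged repro sketch -- assumes Keras 2.x with the TensorFlow
# backend; it is not copied from the actual test files.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def data_gen(batch_size=16):
    # fit_generator expects an infinite generator.
    while True:
        x = np.random.random((batch_size, 8))
        y = np.random.randint(0, 2, (batch_size, 1))
        yield x, y

model = Sequential([Dense(1, activation='sigmoid', input_shape=(8,))])
model.compile(optimizer='sgd', loss='binary_crossentropy')

# With use_multiprocessing=True each worker is a separate OS process; the
# hang reported above happened while these processes were being started.
model.fit_generator(data_gen(),
                    steps_per_epoch=4,
                    epochs=1,
                    workers=2,
                    use_multiprocessing=True)
```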

@fchollet (Collaborator) commented

Thanks for the analysis @Dref360. I will see if anyone on the TF team can understand what's going on.

@Dref360 mentioned this pull request on Jul 23, 2018
@Dref360 (Contributor) commented Jul 26, 2018

Okay so after a lot of work, multiprocessing seems to be the cause.

Quick fix so that we can update: add --ignore=tests/keras/utils/data_utils_test.py --ignore=tests/test_multiprocessing.py to travis.yml.

Or create a new job that runs only those two test files?

I'm not sure why, but it requires a lot of memory to start processes for generators. We may want to update this code to a Pool (similar to Sequence); a rough sketch of the two patterns follows after the notes below.

Note:

  • The processes are correctly killed at the end
  • The issue is at startup
  • This is not a TF issue; TF >= 1.8 just requires more memory.
  • For some reason pytest-xdist creates a lot of idle threads?
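
A rough, hedged sketch of the two patterns contrasted above (the function names are made up, not from the Keras codebase): the generator path starts one dedicated process per worker and coordinates them through a Manager, while the Sequence path submits work to a Pool that is created once.

```python
# Hedged illustration of "process per worker + Manager" versus "Pool";
# make_batch and producer are made-up stand-ins, not Keras code.
import multiprocessing as mp

def make_batch(i):
    # Stand-in for producing one batch of data.
    return [i] * 4

def producer(queue, worker_id):
    # Stand-in for a worker that pulls batches and enqueues them.
    queue.put(make_batch(worker_id))

def process_per_worker(n_workers=2):
    # Generator-style pattern: dedicated processes plus a Manager queue.
    # Each Process.start() pays a full process startup cost, which is where
    # the memory spike at startup comes from.
    manager = mp.Manager()
    queue = manager.Queue()
    procs = [mp.Process(target=producer, args=(queue, w)) for w in range(n_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [queue.get() for _ in range(n_workers)]

def pool_based(n_workers=2):
    # Sequence-style pattern: a Pool created once, with work items submitted to it.
    with mp.Pool(processes=n_workers) as pool:
        return pool.map(make_batch, range(n_workers))

if __name__ == '__main__':
    print(process_per_worker())
    print(pool_based())
```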

@taehoonlee (Contributor, Author) commented

@Dref360, I agree with you. The multiprocessing tests seem to be the main cause of the insufficient-memory errors. The additional commit, which disables those tests, makes the CI pass on TF 1.9.

I have a question. I tested different backends on Python 2 and 3 with your script in #10756. Both Theano and CNTK showed stable memory usage, but TensorFlow showed quite random patterns; for example, TF 1.3 often seems to require more memory than TF 1.9. I want to understand clearly why there was no problem before and why the problem started occurring with TF 1.8.
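
This is not the script from #10756 (not reproduced here); it is only a minimal, standard-library way to get a comparable peak-memory number per backend, assuming the same tiny fit run is executed once for each backend.

```python
# Hedged sketch for comparing peak memory across backends; not the script
# from #10756. Run once per backend (e.g. KERAS_BACKEND=tensorflow, theano,
# cntk) and compare the printed numbers. On Linux, ru_maxrss is the peak
# resident set size in kilobytes.
import resource

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(64, activation='relu', input_shape=(32,)),
                    Dense(1)])
model.compile(optimizer='sgd', loss='mse')
model.fit(np.random.random((256, 32)), np.random.random((256, 1)),
          epochs=1, verbose=0)

print('peak RSS (kB):', resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
```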

@fchollet (Collaborator) commented Aug 9, 2018

@taehoonlee, should we merge this PR, or is there a risk that not testing test_multiprocessing with TensorFlow on CI would lead to coverage problems and potentially introduce TF bugs down the road?

@taehoonlee (Contributor, Author) commented

@fchollet, @Dref360, I did some more testing and found a simple trick.

In tests/test_multiprocessing.py, the only thing we need to do is set WORKERS to 2. Then we can run all the tests on TF 1.9.

The excessive memory usage on TF still exists, but it is out of the scope of Keras. If we focus only on Keras, there is no risk of leaving the multiprocessing tests untested.

@fchollet (Collaborator) left a review comment

That's great, thanks for finding a simple fix!

@Dref360 (Contributor) commented Oct 10, 2018

I did some experiments, and it seems that using 'spawn' instead of 'fork' solves our issue (a short sketch of the idea follows below).

  • This is only doable on Python 3.
  • This will only fix the OrderedEnqueuer, since generators are not picklable.
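
A short, hedged sketch of the idea (not the Keras implementation): request a 'spawn' context explicitly and hand a picklable, Sequence-like object to the pool. A plain generator could not be used this way because generators are not picklable.

```python
# Hedged sketch of 'spawn' instead of 'fork' (Python 3 only); the class and
# function names below are illustrative, not from Keras.
import multiprocessing as mp

class SquareSequence:
    """Picklable stand-in for a keras.utils.Sequence."""
    def __init__(self, n):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        return idx * idx

def fetch(args):
    seq, idx = args
    return seq[idx]

if __name__ == '__main__':
    ctx = mp.get_context('spawn')   # explicit start method instead of the Linux fork default
    seq = SquareSequence(8)
    # Everything sent to spawned workers must be picklable, which is why this
    # works for Sequence-style inputs but not for plain generators.
    with ctx.Pool(processes=2) as pool:
        print(pool.map(fetch, [(seq, i) for i in range(len(seq))]))
```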
