Travis-CI testing #111

Merged
merged 40 commits into from Mar 31, 2018

Conversation


@ghost ghost commented Mar 28, 2018

This PR implements Travis CI for CICE. The configuration is similar to Icepack, and is based on GCC and open-mpi.

There are still a few issues that need to be worked out before I recommend merging this PR. The only tests that currently succeed are the build tests; the run tests all fail. Here is an excerpt from an example build log:

#------- 
#repo = https://github.com/anders-dc/CICE.git
#bran = 
#hash = 960aaadcf40762e984dd7a75ea36b96df8feef8b
#hshs = 960aaadcf4
#hshu = Anders Damsgaard <andersd@riseup.net>
#hshd = Wed Mar 28 13:20:19 2018 -0400
#date = 2018-03-28
#time = 17:24:42
#mach = travisCI
#user = travis
#vers = CICE 6.0.0.alpha
#------- 
#---
PASS travisCI_gnu_smoke_gx3_8x2_diag1_run5day build
FAIL travisCI_gnu_smoke_gx3_8x2_diag1_run5day run
#---
PASS travisCI_gnu_smoke_gx3_8x2_diag24_medium_run1year build
FAIL travisCI_gnu_smoke_gx3_8x2_diag24_medium_run1year run
#---
PASS travisCI_gnu_smoke_gx3_4x1_debug_diag1_run5day build
FAIL travisCI_gnu_smoke_gx3_4x1_debug_diag1_run5day run
#---
PASS travisCI_gnu_smoke_gx3_8x2_debug_diag1_run5day build
FAIL travisCI_gnu_smoke_gx3_8x2_debug_diag1_run5day run
#---
PASS travisCI_gnu_smoke_gx3_4x2_diag1_run5day build
FAIL travisCI_gnu_smoke_gx3_4x2_diag1_run5day run
#---
PASS travisCI_gnu_smoke_gx3_4x1_diag1_run5day_thread build
FAIL travisCI_gnu_smoke_gx3_4x1_diag1_run5day_thread run
#---
PASS travisCI_gnu_restart_gx3_8x1_diag1 build
PEND travisCI_gnu_restart_gx3_8x1_diag1 exact-restart
FAIL travisCI_gnu_restart_gx3_8x1_diag1 run-initial
#---
PASS travisCI_gnu_restart_gx3_4x2_debug build
PEND travisCI_gnu_restart_gx3_4x2_debug exact-restart
FAIL travisCI_gnu_restart_gx3_4x2_debug run-initial
#---
PASS travisCI_gnu_restart_gx3_8x2_diag1_pondcesm build
PEND travisCI_gnu_restart_gx3_8x2_diag1_pondcesm exact-restart
FAIL travisCI_gnu_restart_gx3_8x2_diag1_pondcesm run-initial
#---
PASS travisCI_gnu_restart_gx3_8x2_diag1_pondtopo build
PEND travisCI_gnu_restart_gx3_8x2_diag1_pondtopo exact-restart
FAIL travisCI_gnu_restart_gx3_8x2_diag1_pondtopo run-initial
#---
PASS travisCI_gnu_smoke_gx1_32x1_diag1_run5day_thread build
FAIL travisCI_gnu_smoke_gx1_32x1_diag1_run5day_thread run
#---
PASS travisCI_gnu_smoke_gx1_16x2_diag1_run5day build
FAIL travisCI_gnu_smoke_gx1_16x2_diag1_run5day run
#---
PASS travisCI_gnu_smoke_gx1_8x4_debug_run2day build
FAIL travisCI_gnu_smoke_gx1_8x4_debug_run2day run
#---
PASS travisCI_gnu_restart_gx1_32x1 build
PEND travisCI_gnu_restart_gx1_32x1 exact-restart
FAIL travisCI_gnu_restart_gx1_32x1 run-initial
#---
PASS travisCI_gnu_restart_gx1_13x2 build
PEND travisCI_gnu_restart_gx1_13x2 exact-restart
FAIL travisCI_gnu_restart_gx1_13x2 run-initial

15 of 36 tests PASSED
15 of 36 tests FAILED
6 of 36 tests PENDING
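The summary counts above can be reproduced from the PASS/FAIL/PEND lines themselves. The following is a minimal sketch, assuming the results log has the layout shown in this excerpt (status word first, `#` lines as separators); `tally_results` is a hypothetical helper, not part of the CICE scripts.

```python
from collections import Counter

STATUSES = ("PASS", "FAIL", "PEND")

def tally_results(lines):
    """Count PASS/FAIL/PEND entries, ignoring '#' separator and header lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields and fields[0] in STATUSES:
            counts[fields[0]] += 1
    return counts

# A few lines copied from the excerpt above.
sample = """\
#---
PASS travisCI_gnu_smoke_gx3_8x2_diag1_run5day build
FAIL travisCI_gnu_smoke_gx3_8x2_diag1_run5day run
#---
PASS travisCI_gnu_restart_gx3_8x1_diag1 build
PEND travisCI_gnu_restart_gx3_8x1_diag1 exact-restart
FAIL travisCI_gnu_restart_gx3_8x1_diag1 run-initial
""".splitlines()

counts = tally_results(sample)
total = sum(counts.values())
for status in STATUSES:
    print(f"{counts[status]} of {total} entries {status}")
```

Note that each line is one test phase (build, run, exact-restart), so the totals count phases rather than whole test cases.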

I set ICE_MACHINE_TPNODE = 4 in configuration/scripts/machines/env.travisCI, which makes the build steps succeed. However, Travis-CI does not support the resulting nprocs values during execution. Grepping the generated case scripts shows that nprocs takes values of 4, 8, 13, 16, or 32, which far exceeds the capabilities of Travis. I suggest designing tests that are suitable for Travis.

Furthermore, I had to remove -Wextra from the compiler flags (configuration/scripts/machines/Macros.travisCI), as Travis fails a build if the size of STDOUT/STDERR text exceeds 4 megabytes.


Developer(s): Anders Damsgaard, Princeton/NOAA-GFDL (github.com/anders-dc, adamsgaard.dk)

Are the code changes bit for bit, different at roundoff level, or more substantial? There are minor changes to the underlying code which shouldn't affect other uses.

Is the documentation being updated with this PR? (Y/N) No.

If not, does the documentation need to be updated separately? (Y/N) No.

@apcraig
Contributor

apcraig commented Mar 28, 2018

I see this in the build log for several tests
(abort_ice)ABORTED:
(abort_ice) error = ice: Input nprocs not same as system request
That suggests there is an inconsistency in the tasks/threads used for testing and those defined by the test and/or in namelist. What we need to do is make sure we're setting up tests that can be carried out by travisCI.

Does travisCI support MPI and/or openMP and if so, how many tasks and threads can we have?

@ghost
Author

ghost commented Mar 28, 2018

Whoops, I forgot to launch the tests with mpirun. I've fixed that in a143396 and ed04056. TravisCI does support MPI and OpenMP, but the virtual machines have only two cores. However, I think the question is whether we can oversubscribe the system with additional threads, which would presumably result in slower execution.

@apcraig
Contributor

apcraig commented Mar 28, 2018

It looks like the travisCI machine setup has 4 tasks per node and no way to request resources; we are just running interactively. How many resources do we get, just one node? That means we may have to develop a suite that uses no more than 4 tasks*threads for all tests.
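The tasks*threads limit can be checked mechanically against test names. This is only a sketch: it assumes the `NxM` field in a test name means N MPI tasks by M OpenMP threads, and both `pe_layout` and `fits` are hypothetical helpers rather than anything in the CICE scripts.

```python
import re

MAX_CORES = 4  # limit suggested in this thread: tasks * threads <= 4

def pe_layout(test_name):
    """Return (tasks, threads) from a name such as 'smoke_gx3_8x2_diag1'."""
    match = re.search(r"_(\d+)x(\d+)(?:_|$)", test_name)
    if match is None:
        raise ValueError(f"no TASKSxTHREADS field in {test_name!r}")
    return int(match.group(1)), int(match.group(2))

def fits(test_name, max_cores=MAX_CORES):
    """True if the test's tasks * threads stays within max_cores."""
    tasks, threads = pe_layout(test_name)
    return tasks * threads <= max_cores

# Names taken from the build log earlier in this thread.
for name in (
    "travisCI_gnu_smoke_gx3_8x2_diag1_run5day",
    "travisCI_gnu_smoke_gx3_4x1_diag1_run5day_thread",
    "travisCI_gnu_restart_gx1_32x1",
):
    tasks, threads = pe_layout(name)
    verdict = "ok" if fits(name) else "too large"
    print(f"{name}: {tasks} tasks x {threads} threads -> {verdict}")
```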

@ghost
Author

ghost commented Mar 28, 2018

Yes, I set it up for 4 tasks per node in order to be able to build. We get just one node with two cores, and I'm pretty sure these are not hyperthreaded. I agree, the best solution would be to have a test suite which is designed for this environment.

@apcraig
Contributor

apcraig commented Mar 28, 2018

We can do that. We'll need to set up a suite of tests that use fewer resources than we have currently defined. It would be nice to get access to more cores, though, so we can test a mix of task and thread counts with different decompositions. 8 or 16 would be great, for instance.

I was just looking to see what VIC is doing and it looks like they use travis for a bunch of unit tests, https://travis-ci.org/UW-Hydro/VIC, but I will try to ask them about whether they are able to test on higher pe counts.

@ghost
Author

ghost commented Mar 28, 2018

Sounds good, thanks!

Meanwhile, it looks like we are getting there. I ran into this error:

Fortran runtime error: Cannot open file '/home/travis/CICE_data/grid/gx3/grid_gx3.bin': No such file or directory

For Icepack, we used a wget call to fetch the external Icepack_data.tar.gz from a UCAR FTP server. Is there a similar archive for CICE?

EDIT: Nvm, just found the information in the wiki

@ghost
Author

ghost commented Mar 28, 2018

Excellent, thank you Tony. The new test suite seems to mostly succeed (raw log).

#------- 
#repo = https://github.com/anders-dc/CICE.git
#bran = 
#hash = 76c37bf34ea295fbd2ad889375696104e6e50c7e
#hshs = 76c37bf34e
#hshu = Anders Damsgaard <andersd@riseup.net>
#hshd = Wed Mar 28 18:04:22 2018 -0400
#date = 2018-03-28
#time = 22:08:40
#mach = travisCI
#user = travis
#vers = CICE 6.0.0.alpha
#------- 
#---
PASS travisCI_gnu_smoke_gx3_1x2_diag1_run5day build
PASS travisCI_gnu_smoke_gx3_1x2_diag1_run5day run
#---
PASS travisCI_gnu_smoke_gx3_2x1_debug_diag1_run5day build
FAIL travisCI_gnu_smoke_gx3_2x1_debug_diag1_run5day run
#---
PASS travisCI_gnu_smoke_gx3_1x2_debug_diag1_run5day build
FAIL travisCI_gnu_smoke_gx3_1x2_debug_diag1_run5day run
#---
PASS travisCI_gnu_smoke_gx3_1x1_diag1_run5day_thread build
FAIL travisCI_gnu_smoke_gx3_1x1_diag1_run5day_thread run
#---
PASS travisCI_gnu_smoke_gx3_2x1_diag1_run5day_thread build
PASS travisCI_gnu_smoke_gx3_2x1_diag1_run5day_thread run
FAIL travisCI_gnu_smoke_gx3_2x1_diag1_run5day_thread bfbcomp travisCI_gnu_smoke_gx3_1x2_diag1_run5day.travisCItest different-data
#---
PASS travisCI_gnu_restart_gx3_2x1_diag1 build
PASS travisCI_gnu_restart_gx3_2x1_diag1 run-initial
PASS travisCI_gnu_restart_gx3_2x1_diag1 run-restart
PASS travisCI_gnu_restart_gx3_2x1_diag1 exact-restart
#---
PASS travisCI_gnu_restart_gx3_1x2_diag1 build
PASS travisCI_gnu_restart_gx3_1x2_diag1 run-initial
PASS travisCI_gnu_restart_gx3_1x2_diag1 run-restart
PASS travisCI_gnu_restart_gx3_1x2_diag1 exact-restart
#---
PASS travisCI_gnu_restart_gx3_2x1_diag1_pondcesm build
PASS travisCI_gnu_restart_gx3_2x1_diag1_pondcesm run-initial
PASS travisCI_gnu_restart_gx3_2x1_diag1_pondcesm run-restart
PASS travisCI_gnu_restart_gx3_2x1_diag1_pondcesm exact-restart
#---
PASS travisCI_gnu_restart_gx3_2x1_diag1_pondtopo build
PASS travisCI_gnu_restart_gx3_2x1_diag1_pondtopo run-initial
PASS travisCI_gnu_restart_gx3_2x1_diag1_pondtopo run-restart
PASS travisCI_gnu_restart_gx3_2x1_diag1_pondtopo exact-restart

23 of 27 tests PASSED
4 of 27 tests FAILED
0 of 27 tests PENDING

Travis terminates the job while it loops through the run logs in after_failure, because of the excessive output.

@apcraig
Contributor

apcraig commented Mar 28, 2018

Can we try again, but instead of writing the entire log file at the end, can we just tail -100 each log file?
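A rough illustration of that idea, printing only the last 100 lines of each log instead of the whole file. This is a sketch under assumptions: the `*.log` pattern and the directory layout are placeholders, and `tail` and `print_log_tails` are hypothetical helpers, not what the Travis configuration in this PR actually runs.

```python
import tempfile
from collections import deque
from pathlib import Path

def tail(path, n=100):
    """Return the last n lines of a text file, reading it line by line."""
    with open(path, errors="replace") as handle:
        # A bounded deque keeps only the most recent n lines in memory.
        return list(deque(handle, maxlen=n))

def print_log_tails(log_dir, pattern="*.log", n=100):
    """Print the tail of every log under log_dir matching pattern (assumed naming)."""
    for log in sorted(Path(log_dir).glob(pattern)):
        print(f"=== {log} (last {n} lines) ===")
        print("".join(tail(log, n)), end="")

# Small demonstration on a throwaway 250-line log.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "run.log"
    demo.write_text("".join(f"line {i}\n" for i in range(250)))
    last = tail(demo, 100)
    print_log_tails(tmp, n=3)
```

Capping the output per log keeps the total well below the Travis log limit even when several runs fail.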

@ghost
Author

ghost commented Mar 28, 2018

Great, this is more informative. Here's the raw log.

EDIT: The runtime errors come from Icepack:

At line 783 of file /home/travis/build/anders-dc/CICE/icepack/columnphysics/icepack_zbgc.F90
Fortran runtime error: Array bound mismatch for dimension 1 of array 'kn_bac' (1/3)

@apcraig
Contributor

apcraig commented Mar 29, 2018

We've seen that error before; we probably just need to fix an interface call on the CICE side. There is a different error for the 1x1 case. I'll have to look at that one a little closer.

@apcraig
Contributor

apcraig commented Mar 29, 2018

Just FYI that I have duplicated these errors on another machine with the gnu compiler and am working on them. Hope to have an update soon.

@apcraig
Contributor

apcraig commented Mar 30, 2018

@anders-dc I just updated my travis branch again with several fixes.
https://github.com/apcraig/CICE/tree/travis
The specific commit is
apcraig@ad52ab4
I assume you can pull these updates into your branch and run another test? If you have problems with the pull, let me know. thanks!

update CICE to address test failures, several issues added
@ghost
Author

ghost commented Mar 30, 2018

Thanks @apcraig, here's the newest run.

@apcraig
Contributor

apcraig commented Mar 30, 2018

I'm watching it. We've already hit the log size limit and been going 20 minutes. We need to add an option to the scripts that doesn't write build output to the terminal. I will take care of that next. I've also had an idea that we should be reusing binaries if we can. That's not so easy to do with CICE because the decomposition is built into the build. But maybe we can for travis. Let me prototype that too and see if I can get something that works.

@apcraig
Contributor

apcraig commented Mar 30, 2018

It failed, but we can't tell why. I'll try to fix the length of the logging and propose another pull later today.

@ghost
Author

ghost commented Mar 30, 2018

I agree, although the raw log is still going. The -Wzerotrip compiler warnings (included in -Wall; "Warning: DO loop at (1) will be executed zero times") also contribute a lot to the log size.

@apcraig
Contributor

apcraig commented Mar 30, 2018

You're right Anders, we can see the raw log. Forgot about that. We're still getting a couple errors. I'll look into those too, but we're getting closer.

@ghost
Author

ghost commented Mar 30, 2018

Yes, we're getting there. Maybe it would be worth suppressing more compiler warnings. The main product of Travis is the boolean yes/no to whether the compilation and runtime tests for a commit are successful. Only rarely will somebody look into the log of a passed build.

@apcraig
Contributor

apcraig commented Mar 30, 2018

@anders-dc OK, there is another set of commits on the travis branch,
https://github.com/apcraig/CICE/tree/travis
specifically
apcraig@6be69e4

You should also add

setenv ICE_MACHINE_QUIETMODE true

to your env.travisCI_gnu file. That will stop the build output from being written to the terminal. If the build fails, it will do a tail -10 automatically on the build log file, so hopefully that will work for us. If not, we'll continue to tweak.
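The quiet-mode behavior described here (send build output to a log file, show only the last 10 lines on failure) can be sketched as follows. This is an illustration of the idea only, not the implementation in the CICE scripts; `quiet_run` and the demo command are hypothetical.

```python
import os
import subprocess
import sys
import tempfile
from collections import deque

def quiet_run(command, log_path, tail_lines=10):
    """Run command with all output sent to log_path; on failure print only the log tail."""
    with open(log_path, "w") as log:
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    if result.returncode != 0:
        with open(log_path, errors="replace") as log:
            sys.stdout.writelines(deque(log, maxlen=tail_lines))
    return result.returncode

# Demonstration: a stand-in "build" that prints 1000 warnings and then fails.
noisy_fail = [
    sys.executable,
    "-c",
    "import sys; [print('warning', i) for i in range(1000)]; sys.exit(2)",
]
with tempfile.TemporaryDirectory() as tmp:
    log_file = os.path.join(tmp, "build.log")
    status = quiet_run(noisy_fail, log_file)
    with open(log_file) as handle:
        logged_lines = sum(1 for _ in handle)
```

The full output still lands in the log file, so nothing is lost for later inspection; only the terminal output is capped.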

In addition to adding the quiet mode, I have also added a couple tests to the travis suite. I want to see what we get. I have not been able to duplicate the error on another machine. I even used the travisCI Macros file just to make sure it wasn't a small diff in the build settings. I am getting some errors with other compilers (pgi) in what seems to be the same point, but I can't be sure it's the same thing. I spent a few minutes looking at the pgi error but it's going to take a little more work to sort out. My plan is to add an issue.

What I propose is we run this next set of tests and see what we get. Then we should turn off, for now, the ones that are failing on travisCI. We can then push this to master and separately work on the outstanding issues. I think we've made some reasonable progress at this point.

anders-dc and others added 2 commits March 30, 2018 17:23
update travis suite and add quiet mode to scripts
@apcraig
Contributor

apcraig commented Mar 30, 2018

So, the latest test suite does more or less what I expected. We definitely have some reproducibility problems, and that's one issue we're not seeing on other platforms so far. That's even without OpenMP. There is work to do, but most of that needs to happen outside Travis. I propose the following changes to the travis_suite, change

smoke gx3 2x1 diag1,run5day smoke_gx3_1x1_diag1_run5day
smoke gx3 1x2 diag1,run5day
smoke gx3 1x1 diag1,run5day,thread smoke_gx3_1x2_diag1_run5day
smoke gx3 2x1 diag1,run5day,thread smoke_gx3_1x2_diag1_run5day

to

#smoke gx3 2x1 diag1,run5day smoke_gx3_1x1_diag1_run5day
smoke gx3 2x1 diag1,run5day
smoke gx3 1x2 diag1,run5day
#smoke gx3 1x1 diag1,run5day,thread smoke_gx3_1x2_diag1_run5day
#smoke gx3 2x1 diag1,run5day,thread smoke_gx3_1x2_diag1_run5day
smoke gx3 2x1 diag1,run5day,thread

Basically, we're turning off the 1x1 test that fails and turning off all the bfb compares for the other tests. Not ideal, but OK for now. @anders-dc, can you make that change and retest? If you prefer that I make the change on my branch, just let me know. Thanks!
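To make the effect of the proposed change concrete, the suite lines can be parsed into fields. The column meaning assumed here (test, grid, tasks x threads, comma-separated options, optional bfb-compare target) is inferred from the lines quoted in this thread, and `parse_suite` is a hypothetical helper, not the parser used by the CICE scripts.

```python
from typing import NamedTuple, Optional, Tuple

class SuiteEntry(NamedTuple):
    test: str
    grid: str
    pes: str
    options: Tuple[str, ...]
    bfbcomp: Optional[str]  # test to compare against bit-for-bit, if any

def parse_suite(text):
    """Parse suite lines, skipping blank lines and '#' comments."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if not 3 <= len(fields) <= 5:
            raise ValueError(f"unexpected suite line: {raw!r}")
        test, grid, pes = fields[:3]
        options = tuple(fields[3].split(",")) if len(fields) > 3 else ()
        bfbcomp = fields[4] if len(fields) == 5 else None
        entries.append(SuiteEntry(test, grid, pes, options, bfbcomp))
    return entries

# The proposed suite from the comment above.
proposed = """\
#smoke gx3 2x1 diag1,run5day smoke_gx3_1x1_diag1_run5day
smoke gx3 2x1 diag1,run5day
smoke gx3 1x2 diag1,run5day
#smoke gx3 1x1 diag1,run5day,thread smoke_gx3_1x2_diag1_run5day
#smoke gx3 2x1 diag1,run5day,thread smoke_gx3_1x2_diag1_run5day
smoke gx3 2x1 diag1,run5day,thread
"""

entries = parse_suite(proposed)
for entry in entries:
    print(entry.test, entry.grid, entry.pes, ",".join(entry.options), entry.bfbcomp)
```

With the commented lines skipped, three smoke tests remain active and none of them carries a bfb-compare target.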

@ghost
Author

ghost commented Mar 30, 2018

Hooray! The build took quite a long time to complete (33 mins), but passed. Thanks @apcraig!

@apcraig
Contributor

apcraig commented Mar 30, 2018

Great. I think we can execute the PR now. We should have @eclare108213 give a quick review too. There are some code mods. I may further reduce the test list or try to figure out a way for it to go a little faster. 30 minutes seems a little long for a "quick" status test.

@apcraig apcraig requested review from eclare108213 and apcraig March 30, 2018 23:25
@eclare108213 eclare108213 merged commit 536e2a6 into CICE-Consortium:master Mar 31, 2018
@ghost ghost deleted the travisCI branch March 31, 2018 16:30