build_on_raijin.sh fails: can't find NetCDF #24

Closed
aekiss opened this issue Aug 18, 2019 · 36 comments · Fixed by COSIMA/oasis3-mct#12
Labels
bug Something isn't working

Comments

@aekiss
Contributor

aekiss commented Aug 18, 2019

Here's what happens when I try to compile:

$ module list
Currently Loaded Modulefiles:
  1) pbs            2) dot            3) ncview/2.1.2   4) git/2.9.5
$ cd /short/v45/aek156/sources/new/libaccessom2/
$ ./build_on_raijin.sh
mpifort executable found:
Will assume system MPI implementation is sound. Remove mpifort from PATH to automatically configure MPI
-- Failed to find NetCDF interface for F90
CMake Error at /apps/CMake/3.6.2/share/cmake-3.6/Modules/FindPackageHandleStandardArgs.cmake:148 (message):
  Could NOT find NetCDF (missing: NETCDF_LIBRARIES NETCDF_INCLUDE_DIRS
  NETCDF_HAS_INTERFACES)
Call Stack (most recent call first):
  /apps/CMake/3.6.2/share/cmake-3.6/Modules/FindPackageHandleStandardArgs.cmake:388 (_FPHSA_FAILURE_MESSAGE)
  cmake/FindNetCDF.cmake:119 (find_package_handle_standard_args)
  CMakeLists.txt:38 (find_package)


-- Configuring incomplete, errors occurred!
See also "/short/v45/aek156/sources/new/libaccessom2/build/CMakeFiles/CMakeOutput.log".
/short/v45/aek156/sources/new/libaccessom2
aekiss added the bug label Aug 18, 2019
@aekiss
Contributor Author

aekiss commented Aug 18, 2019

ping @nichannah - any suggestions?

@aekiss
Contributor Author

aekiss commented Aug 19, 2019

for some reason cmake isn't picking up netcdf/4.3.2 despite having module load netcdf/4.3.2 in build_on_raijin.sh.

But if I revert 020b141 and use netcdf/4.4.1.1 instead, it compiles fine and the exe appears to be properly linked:

ldd build/bin/yatm.exe | grep netcdf
	libnetcdff.so.6 => /apps/netcdf/4.4.1.1/lib/libnetcdff.so.6 (0x00007f5ca9f1c000)
	libnetcdf.so.11 => /apps/netcdf/4.4.1.1/lib/libnetcdf.so.11 (0x00007f5ca6b49000)

Is something broken with the netcdf/4.3.2 library installation on raijin?

@nichannah
Contributor

nichannah commented Aug 22, 2019

This appears to be a system problem:

(base) [nah599@raijin5 libaccessom2]$ nf-config --prefix
/apps/netcdf/4.3.2/GNU
(base) [nah599@raijin5 libaccessom2]$ module unload netcdf
(base) [nah599@raijin5 libaccessom2]$ module load netcdf/4.4.1.1
(base) [nah599@raijin5 libaccessom2]$ nf-config --prefix
/apps/netcdf/4.4.1.1
(base) [nah599@raijin5 libaccessom2]$ ls /apps/netcdf/4.3.2/GNU
ls: cannot access /apps/netcdf/4.3.2/GNU: No such file or directory
(base) [nah599@raijin5 libaccessom2]$ ls /apps/netcdf/4.3.2
bin  include  lib  share

I have emailed help.

Perhaps we should move everything to 4.4.1.1?
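
The check above (ask nf-config for its prefix, then see whether that directory exists) can be sketched as a small shell function. This is only an illustration: check_prefix is a hypothetical helper name, not part of NetCDF or the NCI environment, and as noted later in this thread, Raijin's custom layout means existence tests like this can be misleading there.

```shell
# check_prefix DIR: report whether a prefix directory (for example the
# output of `nf-config --prefix`) actually exists on disk.
# check_prefix is a hypothetical helper name used for illustration.
check_prefix() {
    if [ -d "$1" ]; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}
```

With a netcdf module loaded, it could be called as check_prefix "$(nf-config --prefix)".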

@aekiss
Contributor Author

aekiss commented Aug 22, 2019

Sure, if 4.4.1.1 works just as well as anything else.
@marshallward, @aidanheerdegen, @russfiedler - do you see any problems with netcdf 4.4.1.1?

@marshallward
Contributor

marshallward commented Aug 22, 2019

I had an issue with the 4.4.1.1 install at some stage, but perhaps whatever it was has been fixed.

Is there any reason to not use a 4.6 release?

@aekiss
Contributor Author

aekiss commented Aug 23, 2019

good question.

Has anyone tried 4.6.1? (and what's the difference between netcdf/4.6.1 and netcdf/4.6.1p?)

There's also 4.6.3 and 4.7.0 but apparently not on NCI yet.

@marshallward
Contributor

4.6.1p is using parallel netcdf (pHDF5, MPI-IO, etc), no need to use that one.

@aekiss
Contributor Author

aekiss commented Aug 26, 2019

I gather from this that we should avoid using netcdf/4.4.* as it fails on floating point errors: http://cosima.org.au/index.php/2018/06/12/technical-working-group-meeting-june-2018/

@aekiss
Contributor Author

aekiss commented Aug 27, 2019

Link to helpdesk discussion: https://track.nci.org.au/servicedesk/customer/portal/5/HELP-163255

@benmenadue

You can't rely on e.g. nf-config and similar on Raijin as we use a custom layout to support the multi-compiler and multi-MPI installations we use. This also means that testing for the existence of files will probably fail because they're actually in compiler-specific subdirectories.

Instead, we have compiler and linker wrappers that add include and library paths as needed based on what modules are loaded. In the case of NetCDF, the only thing you need to do is to load the module and add the needed -l arguments at link time -- -lnetcdf for the C interface and -lnetcdff for the Fortran interface. You don't need (nor should you include) any -I or -L arguments -- this will be taken care of by the wrappers.

This works very well with autoconf scripts as they just attempt to use the library first (which will always work on Raijin). If that fails, then they'll invoke extra logic to try to work around it (e.g. add extra -I and -L flags).
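
A minimal sketch of the "just try to use the library first" approach described above: link a trivial program with only -l flags and see whether it succeeds. The function name can_link and the use of $CC to select the compiler are assumptions for illustration; the thread does not name the wrapper commands on Raijin, so set CC to whichever wrapped compiler your loaded modules provide.

```shell
# can_link LIB...: try to link a trivial C program against the given
# libraries using only -l flags (no -I or -L), roughly the way an
# autoconf check would. The compiler comes from $CC (default: cc).
# can_link is a hypothetical helper name used for illustration.
can_link() {
    cl_tmp=$(mktemp -d) || return 2
    printf 'int main(void) { return 0; }\n' > "$cl_tmp/conftest.c"
    cl_libs=""
    for cl_lib in "$@"; do
        cl_libs="$cl_libs -l$cl_lib"
    done
    # $cl_libs is left unquoted on purpose so each -l flag is its own word.
    if ${CC:-cc} "$cl_tmp/conftest.c" -o "$cl_tmp/conftest" $cl_libs >/dev/null 2>&1; then
        cl_rc=0
    else
        cl_rc=1
    fi
    rm -rf "$cl_tmp"
    return "$cl_rc"
}
```

For example, after module load netcdf/4.3.2, running can_link netcdf && echo "NetCDF links without -I or -L" would confirm that the library is found from the module alone. This only checks that the linker can locate the library, not that any particular NetCDF symbol resolves.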

aekiss added a commit that referenced this issue Aug 27, 2019
aekiss added a commit that referenced this issue Aug 27, 2019
@aekiss
Contributor Author

aekiss commented Aug 27, 2019

Thanks @benmenadue, I think I understand now. I've just made a PR that fixes the issue.
@nichannah how does this look?

@benmenadue

@aekiss #27 looks like it should work, although it's Raijin-specific -- e.g. on your desktop machine you'll probably want to still use the FindNetCDF. Perhaps add a CMake flag that enables / disables using that logic so that you can specifically disable it on systems that use modules?

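# on the command line:
cmake -DFIND_NETCDF=OFF ...

# in CMakeLists.txt:
option(FIND_NETCDF "Use FindNetCDF to generate NetCDF configuration" ON)
if(FIND_NETCDF)
  find_package(NetCDF REQUIRED)
else()
  # rely on the compiler wrappers; only the library name is needed
  set(NETCDF_LIBRARIES netcdff)
endif()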

@aekiss
Contributor Author

aekiss commented Aug 27, 2019

The compiler messages indicate that oasis is being compiled with -I/apps/netcdf/4.2.1.1/include, but yatm.exe gets the right netcdf version:

ldd build/bin/yatm.exe | grep netcdf
	libnetcdff.so.5 => /apps/netcdf/4.3.2/lib/Intel/libnetcdff.so.5 (0x00007f74df9a7000)
	libnetcdf.so.7 => /apps/netcdf/4.3.2/lib/libnetcdf.so.7 (0x00007f74de014000)

@aekiss
Contributor Author

aekiss commented Aug 27, 2019

Thanks @benmenadue, that looks like a good idea

@benmenadue

@aekiss That will give you undefined behaviour -- the -I argument will make it use the headers and libraries from that version, overriding what you get via the module. Essentially, you're compiling against one version but linking against another.
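
One way to spot this kind of compile/link mismatch is to pull the NetCDF version out of both the compiler command lines and the ldd output and compare them. The sketch below assumes the /apps/netcdf/<version>/ path layout seen in this thread; netcdf_versions is a hypothetical helper name.

```shell
# netcdf_versions: read text on stdin and print each distinct version
# that appears in a path of the form /apps/netcdf/<version>/...
# The /apps/netcdf prefix is taken from the paths quoted in this thread.
netcdf_versions() {
    grep -oE '/apps/netcdf/[0-9][0-9.]*' | sed 's|.*/||' | sort -u
}
```

For example, with the build output saved to a file (say build.log, a hypothetical name), compare netcdf_versions < build.log with ldd build/bin/yatm.exe | netcdf_versions. If the two lists differ, the code is being compiled against one version and linked against another.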

@aekiss
Contributor Author

aekiss commented Aug 27, 2019

Hmm, that doesn't sound at all good!
@nichannah can you see how to fix this?

@benmenadue

@aekiss I'm not sure why you would be getting a reference to 4.2.1.1; I can't find that version string anywhere in this repo. My guess is you've got some left-over CMake cache or an unclean environment.

@aekiss
Contributor Author

aekiss commented Aug 27, 2019

thanks @benmenadue but I still get -I/apps/netcdf/4.2.1.1/include with a clean clone.
The issue is in https://github.com/COSIMA/oasis3-mct which is cloned as part of the cmake build.
But that source also has no mention of 4.2.1.1.

@aekiss
Contributor Author

aekiss commented Aug 27, 2019

and my LD_LIBRARY_PATH is empty

@benmenadue

@aekiss That's odd... I just tried it and it looks fine to me, for example

mpifort -convert big_endian -i4 -r8 -O3 -g -traceback -fno-alias -ip -align all -fpe0 -assume buffered_io -check noarg_temp_created -I/short/z00/bjm900/help/oasis/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/Linux/build/lib/psmile.MPI1 -I/short/z00/bjm900/help/oasis/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/Linux/build/lib/pio -I/short/z00/bjm900/help/oasis/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/Linux/build/lib/mct -Duse_netCDF -Duse_comm_MPI1 -DTREAT_OVERLAY -I/apps/netcdf/4.4.1.1/include -c   /short/z00/bjm900/help/oasis/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/lib/scrip/src/kinds_mod.f

(the -I is coming from oasis3-mct's configure, but is correct and so harmless). This is from doing this:

git clone 'https://github.com/COSIMA/libaccessom2.git'
curl -L 'https://patch-diff.githubusercontent.com/raw/COSIMA/libaccessom2/pull/27.patch' | patch -p1
./build_on_raijin.sh

(and I had no modules loaded before this).

Do you have any module load commands in your shell initialisation files?

@aekiss
Contributor Author

aekiss commented Aug 27, 2019

my .profile loads these:

module load dot
module load ncview
module load git
module load conda/analysis3
module list
Currently Loaded Modulefiles:
  1) pbs                                        4) git/2.9.5
  2) dot                                        5) conda/analysis3-19.04(default:analysis3)
  3) ncview/2.1.2

@benmenadue

It may be unrelated, but that conda module pollutes your environment really badly. It replaces all of the system tools with its own, e.g.:

17:10 bjm900@raijin7 ~ > env | grep gcc
GCC_NM=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.04/bin/x86_64-conda_cos6-linux-gnu-gcc-nm
GCC_RANLIB=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.04/bin/x86_64-conda_cos6-linux-gnu-gcc-ranlib
GCC=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.04/bin/x86_64-conda_cos6-linux-gnu-gcc
GCC_AR=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.04/bin/x86_64-conda_cos6-linux-gnu-gcc-ar

and makes changes to the global compiler and linker flags:

17:12 bjm900@raijin7 ~ > env | grep FLAGS
LDFLAGS=-Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections
CPPFLAGS=-DNDEBUG -D_FORTIFY_SOURCE=2 -O2
DEBUG_CPPFLAGS=-D_DEBUG -D_FORTIFY_SOURCE=2 -Og
CFLAGS=-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe
DEBUG_CFLAGS=-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-all -fno-plt -Og -g -Wall -Wextra -fvar-tracking-assignments -ffunction-sections -pipe

This is one of the reasons we strongly recommend not using conda-like packaging systems.

Try with all of those module load and module use commands commented out so that you get a clean environment.
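
A rough way to see how much of the environment a conda module has taken over is to list the variables whose values point into a conda path. This is only a sketch: conda_env_vars is a hypothetical helper name, and matching on the substring "conda" is a heuristic assumption, not something NCI provides.

```shell
# conda_env_vars: read `env` output on stdin and print the names of
# variables whose values mention "conda" (e.g. .../miniconda3/envs/...).
# Values that span several lines are not handled by this sketch.
conda_env_vars() {
    grep -E '^[A-Za-z_][A-Za-z0-9_]*=.*conda' | cut -d= -f1 | sort
}
```

Running env | conda_env_vars before and after commenting out the module load lines shows whether the environment is actually clean.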

@benmenadue

Ah -- you have this in your .login file:

module load netcdf

which will pick up the default version of 4.2.1.1. Perhaps something in oasis3-mct is stepping through (t)csh and picking it up here?

@aekiss
Contributor Author

aekiss commented Aug 27, 2019

Odd. I've commented out every module load that looks relevant in .login and .profile but still have -I/apps/netcdf/4.2.1.1/include.
build_on_raijin.sh also includes a module purge so won't that deal with this issue?

@benmenadue

It should, unless it's being picked up again by another script running. Do you mind if I try it as you to see if I can find where it's coming from?

@aekiss
Contributor Author

aekiss commented Aug 28, 2019

be my guest - thanks!

@aekiss
Contributor Author

aekiss commented Aug 28, 2019

do you need me to do anything or are you already able to log in as me?

@benmenadue

Sorry, didn't get a chance to look at this this afternoon, but I can sudo to you and test it out.

@aekiss
Contributor Author

aekiss commented Aug 28, 2019

ok thanks :-)

@aekiss
Contributor Author

aekiss commented Sep 2, 2019

from https://opus.nci.org.au/display/Help/Gadi%3A+NCI%27s+New+Supercomputer
"Only the latest versions of third-party software packages will be built and installed on the new HPC system."
so I guess we should get NCI to install netcdf 4.7.0 and build with that?

@aekiss
Contributor Author

aekiss commented Sep 11, 2019

I still have this issue, despite taking out every potentially relevant module load from my .login, .profile, .rashrc and .bashrc.

When I do

git clone https://github.com/COSIMA/libaccessom2.git
cd libaccessom2
git checkout 24-netcdf-not-found
./build_on_raijin.sh

the compiler messages include -I/apps/netcdf/4.2.1.1/include despite netcdf/4.3.2 being specified in build_on_raijin.sh.

This differs from what I get with ldd build/bin/yatm.exe | grep netcdf:

	libnetcdff.so.5 => /apps/netcdf/4.3.2/lib/Intel/libnetcdff.so.5 (0x00007f51fddcc000)
	libnetcdf.so.7 => /apps/netcdf/4.3.2/lib/libnetcdf.so.7 (0x00007f51fc439000)

@benmenadue if you have a spare moment to try this logged in as me that would be great.

@aidanheerdegen - could you try doing the above to see whether the problem is isolated to me?

@benmenadue

@aekiss oasis3-mct-prefix/src/oasis3-mct/util/make_dir/config.nci has an un-versioned module load netcdf that will be picking up the default version.

module purge
module load intel-fc/17.0.1.132
module load intel-cc/17.0.1.132
module load netcdf
module load openmpi/1.10.2

This might be the source of that. But if so, I'm not sure why it wasn't happening for me as well.
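
Un-versioned module load lines like the one above are easy to miss, so it may help to grep for them across the startup files and build configs. A minimal sketch follows; unversioned_module_loads is a hypothetical helper name, and it only catches lines that load a single module with no /version suffix.

```shell
# unversioned_module_loads FILE...: print file:line:text for every
# `module load` line that names a single module without a /version
# suffix, i.e. one that will pick up the system default version.
unversioned_module_loads() {
    grep -HnE '^[[:space:]]*module[[:space:]]+load[[:space:]]+[^/[:space:]]+[[:space:]]*$' "$@"
}
```

For example: unversioned_module_loads ~/.login ~/.profile oasis3-mct-prefix/src/oasis3-mct/util/make_dir/config.nci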

@aekiss
Contributor Author

aekiss commented Sep 11, 2019

thanks @benmenadue, well spotted! When I do module load netcdf I get netcdf/4.2.1.1, so that is probably the culprit.

aekiss added a commit to COSIMA/oasis3-mct that referenced this issue Sep 11, 2019
aekiss added a commit that referenced this issue Sep 11, 2019
@aekiss
Contributor Author

aekiss commented Sep 11, 2019

Yep, I can confirm that this fixes it. Thanks again @benmenadue

@aekiss
Contributor Author

aekiss commented Sep 20, 2019

reopening - also need to deal with PR #27 which should probably be made more portable as per Ben's suggestion #24 (comment)

aekiss reopened this Sep 20, 2019
@nichannah
Contributor

I have merged #32 which includes Ben's suggestion.

nichannah pushed a commit to COSIMA/oasis3-mct that referenced this issue Jan 19, 2021