Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

updates to drfini, ufbtab, ufbmex #625

Merged
merged 3 commits into from
Oct 11, 2024
Merged

updates to drfini, ufbtab, ufbmex #625

merged 3 commits into from
Oct 11, 2024

Conversation

jbathegit
Copy link
Collaborator

@jbathegit jbathegit commented Oct 11, 2024

Fixes #623
including new outtest13 test code.

Also found and fixed a similar do loop issue in ufbtab and added a new test to expand ufbtab code coverage.

Also simplified some coding in calls to x48 and x84.

@jbathegit
Copy link
Collaborator Author

Hmm OK, an error that I've never seen before is now popping up with gcov in the test step of the developer check.

This happens after all of the tests themselves have run successfully, beginning at about line 7937 of the test step output. If you scroll all the way to the bottom of the test step output, it mentions that there's a bug in the gcov tool, but if you click on the URL it takes you to a stale "not found" page.

snippet1

I could try adding the suggested option to the gcov command in the developer.yml file, but if you scroll up there are other diagnostic messages as well that are concerning, which leads me to think that it's now just giving up and excluding some of the test codes altogether from its coverage analysis, and which is kind of the whole point here.

snippet2

@edwardhartnett @AlexanderRichert-NOAA @aerorahul @jack-woollen have you ever seen anything like this before? I did try rerunning the developer check to see if it was maybe just some one-time glitch with gcov, but no dice.

@AlexanderRichert-NOAA
Copy link
Contributor

No, I don't think I've run into that. My initial thought would be to either see if there's been a change in the gcovr version, or see if there's some part of the modified code that's tripping it up (doesn't look it though, since the line in question isn't new).

@jbathegit
Copy link
Collaborator Author

jbathegit commented Oct 11, 2024

I did some more digging and found that there's a block of code in subroutine makestab that's now being hit more than 2^32 times:

      do k=1,maxjl
        knt(k) = 0
      enddo

As far as I can tell this is legitimate code for initializing an internal storage array when building a jump/link table. But with all of the testing we're now doing it's apparently being hit so many times across all of the various test programs that it's blowing past the largest number that an internal variable can hold inside of the gcovr python code, and which in turn throws the error. This is a known issue in gcovr, and I've now adopted their suggested solution of adding an option to gcovr so that it spits out a warning (once :-) and no longer crashes.

@jack-woollen, you're more of an expert than me on the innards of subroutine makestab, so I'm going to defer to you as to whether that above block of code really needs to initialize all maxjl members of the knt array every time through the loop inside that subroutine. If there's maybe some way for the overall logic to still work without continuously looping through the entire array so many times during each call to the subroutine (maybe it only really needs to initialize a smaller subsection of the array to 0, rather than all maxjl of the members?), then maybe that's a worthwhile update at some later date for purposes of efficiency to cut down overall run time? Just let me know - thanks!

@jbathegit It looks like the knt array clearing may be able to move out in front of the loop it's currently in.

! Set up expansion segments for type 'SUB', 'DRP', and 'DRS' nodes.
newn = 0
do k=1,maxjl
knt(k) = 0
enddo
do n=1,ntab
and so on

It passed all the tests when I tried that out.

Copy link
Contributor

@jack-woollen jack-woollen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbathegit I'll comment that the original loop looked like the following, which I think accounts for the changing nval(lun). I like your update better than the original, but I'm not sure where the bug came in!

C COMFORM THE TEMPLATES TO THE DELAYED REPLICATION FACTORS
C --------------------------------------------------------

  M = 0
  N = 0

10 DO N=N+1,NVAL(LUN)
NODE = INV(N,LUN)
IF(ITP(NODE).EQ.1 .AND. TAG(NODE).EQ.DRFTAG) THEN
M = M+1
CALL USRTPL(LUN,N,MDRF(M))
GOTO 10
ENDIF
ENDDO

@jbathegit
Copy link
Collaborator Author

jbathegit commented Oct 11, 2024

Interesting, it looks like that particular change came in sometime between v12.0.0 and v12.1.0, and honestly that was likely my own doing since I was refactoring a lot during that time, including trying to eliminate GOTO statements where possible.

However, what's even more interesting is that the aforementioned IASI encoding logic doesn't work correctly with either v12.0.0 or v12.1.0. But it does work correctly with v11.5.0 and v11.7.0, and which suggests that whatever the real bug was in this particular case may have been introduced sometime during the interval between v11.7.0 and v12.0.0. So that's a bit weird, but in any case thanks for the feedback!

@jbathegit Which code is failing? I think we should find the real bug.

@jbathegit jbathegit merged commit 2d9b1a6 into develop Oct 11, 2024
10 checks passed
@jbathegit jbathegit deleted the jba_delreps branch October 11, 2024 20:02
@jack-woollen
Copy link
Contributor

@jbathegit It looks like the knt array clearing may be able to move out in front of the loop it's currently in.

! Set up expansion segments for type 'SUB', 'DRP', and 'DRS' nodes.
newn = 0
do k=1,maxjl
knt(k) = 0
enddo
do n=1,ntab
and so on

It passed all the tests when I tried that out.

@jbathegit
Copy link
Collaborator Author

Thanks @jack-woollen , I'll look into setting up a PR for that next week.

In the meantime, my merge commit for this branch is failing its automated checkout for the Linux and developer CIs (but not for MacOS or Intel), and even though all of the CIs ran without incident for the latest commit within this PR. The error seems to be Python-related in the install-deps step:

snippet3

@AlexanderRichert-NOAA @edwardhartnett @aerorahul any ideas what may be going on here? Again, it's odd that all of the same CI tests ran fine just a few hours ago within this PR, but now some of them are failing during the merge commit to the develop branch!?

@AlexanderRichert-NOAA
Copy link
Contributor

Looks like a bug with the apt python installations. The ubuntu-latest runner just switched to 24.04, which will require some modifications anyway (24.04 doesn't have gcc 11), so for now you could revert to ubuntu-22.04 and I can look into it more this coming week.

@jbathegit
Copy link
Collaborator Author

Thanks for taking a look at this @AlexanderRichert-NOAA

I presume your short-term suggestion means to change:

runs-on: ubuntu-latest

to:

runs-on: ubuntu-22.04

in the developer.yml, Spack.yml, Linux.yml and Intel.yml files? I could do that, but I'm also going to be away at a meeting during the next couple of days anyway, so it might also make sense to just leave it alone for now and give you some time to explore the other modifications you were suggesting. So I'll just sit tight for the time being and check back with you on Friday.

In the meantime, what's really interesting to me is that the Intel checks passed anyway (even with the above line) and it was only the developer, Spack and Linux checks that failed during the merge commit. Furthermore, all of the CI checks passed on this PR barely 2 hours earlier at around 2:00pm EDT, but then suddenly the developer, Spack and Linux ones didn't work anymore at around 4:00pm EDT when I did the merge commit. So I guess that means they must have switched the runner sometime during that 2-hour window.

@jbathegit
Copy link
Collaborator Author

Hi @AlexanderRichert-NOAA just checking to see if you've had any luck or made any progress with this? I also saw where you recently opened a separate issue #627, which I presume is related.

Please let me know - thanks!

@AlexanderRichert-NOAA
Copy link
Contributor

Yes, there turn out to be a couple issues but all fixable. Before I put in a PR-- the ubuntu-22.04 runner provides gcc's 10-12, whereas ubuntu-latest (24.04) provides 12-14. Do you have any preference toward testing those earlier versions, or are you okay with just testing within the more recent set? We can do all ubuntu-latest, all ubuntu-22.04, or a mix (eventually we'll want to test with gcc 14 once we figure out why that's failing...).

@jbathegit
Copy link
Collaborator Author

I'm not particularly partial to either option. But all things being equal, I'd presume it makes more sense to always use ubuntu-latest if/whenever we can.

That said, on WCOSS2 it looks like the only available gcc versions are 10.2.0, 10.3,0, 11.2.0 and 12.1.0. So I'm not sure how much help I could be in testing out gcc versions 13-14, short of just pushing up PRs in trial-and-error mode and seeing what breaks in the runners.

Bottom line, if you think it's better to stick with ubuntu-22.04 or a mix for the time being, then I'm fine with that instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fix bug in drfini for multiple replication factors
3 participants