Skip to content

Issue/1226 filename parsing - some path logic.#1228

Merged
WardBrian merged 27 commits intodevelopfrom
issue/1226-filename-parsing
Jan 4, 2024
Merged

Issue/1226 filename parsing - some path logic.#1228
WardBrian merged 27 commits intodevelopfrom
issue/1226-filename-parsing

Conversation

@mitzimorris
Copy link
Member

Submisison Checklist

  • Run tests: ./runCmdStanTests.py src/test
  • Declare copyright holder and open-source license: see below

Summary:

Make parsing of filename suffixes more robust - adds logic to handle filepath separators plus some edge cases.

Intended Effect:

Correct weirdness reported in #1226

How to Verify:

Unit tests

Side Effects:

N/A

Documentation:

N/A

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Columbia University

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@mitzimorris mitzimorris requested a review from WardBrian January 2, 2024 05:07
@mitzimorris
Copy link
Member Author

@WardBrian or @dylex - could you please check on why this failed?

@WardBrian
Copy link
Member

I think it is because you checked in a change to the stan submodule which seems to point to a non-existent commit

Copy link
Member

@WardBrian WardBrian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling. How difficult do you think it would be to also fix #1213?

@mitzimorris
Copy link
Member Author

How difficult do you think it would be to also fix #1213?

not difficult at all - will add to this PR.

@mitzimorris
Copy link
Member Author

checks are failing because performance tests specify output files with suffix ".tmp".
submitted PR to cmdstan perf tests repo: stan-dev/performance-tests-cmdstan#62

@WardBrian - will this work?

@bob-carpenter
Copy link
Member

Could someone please point me to or write doc for how this is supposed to behave? This PR just says "Make parsing of filename suffixes more robust" and the issue #1226 just lists a bunch of quirky behavior without a spec for what should happen in general.

The problem starts with the name of the argument, output file=foo/.bar/baz, because we use this argument to produce an output directory and output base file name prefix. If we just took those two things as arguments, we wouldn't need any string processing at all and it'd be clear what's going on to the user.

To allow the current behavior, I would suggest (a) splitting directory off, and (b) operating on the file to remove the final period and anything following it.

The code needs to check that the string-based path is reasonable. That is, we can't just have foo/. as input as that leaves no base file. We can't have * in the path, etc. Something needs to validate these are truly paths.

If it didn't break backward compatibility, I would say just slap another .csv onto the end of foo/bar.csv and be done with it.

@mitzimorris
Copy link
Member Author

made all suggested changes, added fix for #1213 - ready for re-review.

@mitzimorris
Copy link
Member Author

Could someone please point me to or write doc for how this is supposed to behave? This PR just says "Make parsing of filename suffixes more robust" and the issue #1226 just lists a bunch of quirky behavior without a spec for what should happen in general.

  • CmdStan output filename behavoirs
    - if running a single chain, the CSV output filename is whatever the file argument specifies; however, if the filename doesn't have suffix csv this will be added.
    - if running multiple chains, then the chain id is appended to the end of the filename, e.g. output_1.csv
    - for Pathfinder, there's a "save-single-paths" option, this creates a set of per-chain CSV samples which include both tag "_path" as well as chain id - e.g., PSIS sample: output.csv, plus per-chain output_path_1.csv, etc.
    - for Pathfinder, the diagnostic file is JSON format, for the sampler the diagnostic file is CSV format.

The CmdStan output argument allows 2 sub-arguments:

  • file - this is the CSV file of draws from the sampler
  • diagnostic file - added specifically for the sampler diagnostics and gradients in CSV format

This design (sic) is problematic because:

  • multi-chain single-process runs need to use the specified output file names as templates and hack in chain id information
  • it was never stipulated that output filenames should end in ".csv"
  • Pathfinder can output both the PSIS sample and the inidividual pathfinder samples.
  • the diagnostic file from Pathfinder is JSON, not CSV.

W/R/T creating good output filenames:

  • this PR fixes the problem that directory names which contain '.' were being parsed as suffixes.
  • this PR tries to address problems of specified filenames that are actually just a filepath (per reported quirkiness)
  • output file suffixes should be either ".csv" or ".json", according to the output file type.

The problem starts with the name of the argument, output file=foo/.bar/baz, because we use this argument to produce an output directory and output base file name prefix. If we just took those two things as arguments, we wouldn't need any string processing at all and it'd be clear what's going on to the user.

absolutely. CmdStanPy only has argument output_dir and it handles the logic of naming all the different kinds of output files with unique and consistent names.

To allow the current behavior, I would suggest (a) splitting directory off, and (b) operating on the file to remove the final period and anything following it.

That's pretty much the way that things are re-imlemented here. The problem is that C++14 doesn't have good filepath parsing routines and the Boost filesystem library is not header-only. However, C++17 will have a filesystem library, so eventually this will be easier to support.

The code needs to check that the string-based path is reasonable. That is, we can't just have foo/. as input as that leaves no base file. We can't have * in the path, etc. Something needs to validate these are truly paths.

Agreed.

If it didn't break backward compatibility, I would say just slap another .csv onto the end of foo/bar.csv and be done with it.

Backwards compatibility is how we got here. We just fixed an annoying feature of the downstream performance tests which created output files that ended in ".tmp" not ".csv".

@WardBrian
Copy link
Member

If we just took those two things as arguments, we wouldn't need any string processing at all and it'd be clear what's going on to the user.

I think that even this would have some subtle issues, like if the 'base file prefix' also contained a path separator.

I think the fundamental flaw is we should never have set it up like this in the first place. IMO, when multiple files are required, we should have required the user to give us a comma separated list of the correct length and removed any guesswork about "if I specify X as the argument, the actual files created on disk will have the name..."

To allow the current behavior, I would suggest (a) splitting directory off, and (b) operating on the file to remove the final period and anything following it.

I think this is reasonable. We are also within our rights to say "besides for CSV outputs, we will force the correct extension". So the bug reported in #1213 would yield foo.bar_path_1.baz as a CSV file, but foo.bar_path_1.json as the diagnostic file. That would maintain backward compatibility (by allowing csv outputs to still have arbitrary extensions) but be a bit more sane going forward with new formats

Copy link
Member

@WardBrian WardBrian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small things, should be last review. Thanks!

Copy link
Member

@WardBrian WardBrian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for tackling!

@WardBrian WardBrian merged commit 5d29f37 into develop Jan 4, 2024
@WardBrian WardBrian deleted the issue/1226-filename-parsing branch January 4, 2024 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

More output file name quirks pathfinder: output files don't respect multiple periods, clobber outputs

5 participants