Issue/1226 filename parsing - some path logic. by mitzimorris · Pull Request #1228 · stan-dev/cmdstan

mitzimorris · 2024-01-02T05:06:56Z

Submission Checklist

Run tests: ./runCmdStanTests.py src/test
Declare copyright holder and open-source license: see below

Summary:

Make parsing of filename suffixes more robust - adds logic to handle filepath separators plus some edge cases.

Intended Effect:

Correct weirdness reported in #1226

How to Verify:

Unit tests

Side Effects:

N/A

Documentation:

N/A

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Columbia University

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

mitzimorris · 2024-01-02T15:53:54Z

@WardBrian or @dylex - could you please check on why this failed?

WardBrian · 2024-01-02T16:02:44Z

I think it is because you checked in a change to the stan submodule which seems to point to a non-existent commit

WardBrian

Thanks for tackling. How difficult do you think it would be to also fix #1213?

src/cmdstan/command_helper.hpp

src/test/interface/command_helper_test.cpp

src/cmdstan/command_helper.hpp

stan

mitzimorris · 2024-01-02T17:28:55Z

> How difficult do you think it would be to also fix #1213?

not difficult at all - will add to this PR.

mitzimorris · 2024-01-02T19:20:38Z

checks are failing because performance tests specify output files with suffix ".tmp".
submitted PR to cmdstan perf tests repo: stan-dev/performance-tests-cmdstan#62

@WardBrian - will this work?

src/cmdstan/command_helper.hpp

bob-carpenter · 2024-01-02T21:57:22Z

Could someone please point me to or write doc for how this is supposed to behave? This PR just says "Make parsing of filename suffixes more robust" and the issue #1226 just lists a bunch of quirky behavior without a spec for what should happen in general.

The problem starts with the name of the argument, output file=foo/.bar/baz, because we use this argument to produce an output directory and output base file name prefix. If we just took those two things as arguments, we wouldn't need any string processing at all and it'd be clear what's going on to the user.

To allow the current behavior, I would suggest (a) splitting directory off, and (b) operating on the file to remove the final period and anything following it.

The code needs to check that the string-based path is reasonable. That is, we can't just have foo/. as input as that leaves no base file. We can't have * in the path, etc. Something needs to validate these are truly paths.

If it didn't break backward compatibility, I would say just slap another .csv onto the end of foo/bar.csv and be done with it.
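Steps (a) and (b) above could be sketched roughly as follows, using only C++14 string operations. This is an illustration under assumptions, not the code in this PR: the helper names `split_dir` and `strip_suffix` are hypothetical, and treating both `/` and `\` as separators is a choice made for the sketch.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: split "dir/base.ext" into {"dir/", "base.ext"}.
// Assumption: both '/' and '\\' count as path separators.
std::pair<std::string, std::string> split_dir(const std::string& path) {
  const std::size_t sep = path.find_last_of("/\\");
  if (sep == std::string::npos) {
    return std::make_pair(std::string(), path);  // no directory part
  }
  return std::make_pair(path.substr(0, sep + 1), path.substr(sep + 1));
}

// Hypothetical helper: drop the final period and everything after it.
// A name that starts with its only period (e.g. ".hidden") is left alone,
// because stripping it would leave no base file name.
std::string strip_suffix(const std::string& filename) {
  const std::size_t dot = filename.rfind('.');
  if (dot == std::string::npos || dot == 0) {
    return filename;
  }
  return filename.substr(0, dot);
}
```

Because the directory is split off first, a period inside a directory name such as foo/.bar/ can never be mistaken for the start of a suffix.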

mitzimorris · 2024-01-02T22:20:44Z

made all suggested changes, added fix for #1213 - ready for re-review.

mitzimorris · 2024-01-02T23:30:52Z

> Could someone please point me to or write doc for how this is supposed to behave? This PR just says "Make parsing of filename suffixes more robust" and the issue #1226 just lists a bunch of quirky behavior without a spec for what should happen in general.

CmdStan output filename behaviors
- if running a single chain, the CSV output filename is whatever the file argument specifies; however, if the filename doesn't have suffix csv this will be added.
- if running multiple chains, then the chain id is appended to the end of the filename, e.g. output_1.csv
- for Pathfinder, there's a "save-single-paths" option, this creates a set of per-chain CSV samples which include both tag "_path" as well as chain id - e.g., PSIS sample: output.csv, plus per-chain output_path_1.csv, etc.
- for Pathfinder, the diagnostic file is JSON format, for the sampler the diagnostic file is CSV format.
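The CSV naming rules in the list above can be sketched as a small function. This is a simplified illustration, not CmdStan's implementation: `ends_with` and `csv_names` are hypothetical names, and the sketch assumes that any argument not ending in ".csv" is a bare base name to which ".csv" is appended.

```cpp
#include <string>
#include <vector>

// True if `name` ends with `suffix`.
bool ends_with(const std::string& name, const std::string& suffix) {
  return name.size() >= suffix.size()
         && name.compare(name.size() - suffix.size(), suffix.size(), suffix)
                == 0;
}

// Hypothetical helper: CSV file names for `num_chains` chains.
// `tag` is "" for the sampler or "_path" for Pathfinder's per-path files.
std::vector<std::string> csv_names(const std::string& file, int num_chains,
                                   const std::string& tag = "") {
  const std::string csv = ".csv";
  // Assumption: anything not ending in ".csv" is treated as a base name.
  const std::string base
      = ends_with(file, csv) ? file.substr(0, file.size() - csv.size()) : file;
  std::vector<std::string> names;
  if (num_chains == 1 && tag.empty()) {
    names.push_back(base + csv);  // single chain: no chain id in the name
    return names;
  }
  for (int id = 1; id <= num_chains; ++id) {
    names.push_back(base + tag + "_" + std::to_string(id) + csv);
  }
  return names;
}
```

For example, `csv_names("output.csv", 2)` gives output_1.csv and output_2.csv, while passing the tag "_path" gives output_path_1.csv and output_path_2.csv alongside the combined output.csv.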

The CmdStan output argument allows 2 sub-arguments:

- file: this is the CSV file of draws from the sampler
- diagnostic file: added specifically for the sampler diagnostics and gradients in CSV format

This design (sic) is problematic because:

- multi-chain single-process runs need to use the specified output file names as templates and hack in chain id information
- it was never stipulated that output filenames should end in ".csv"
- Pathfinder can output both the PSIS sample and the individual pathfinder samples.
- the diagnostic file from Pathfinder is JSON, not CSV.

W/R/T creating good output filenames:

- this PR fixes the problem that directory names which contain '.' were being parsed as suffixes.
- this PR tries to address problems of specified filenames that are actually just a filepath (per reported quirkiness)
- output file suffixes should be either ".csv" or ".json", according to the output file type.

> The problem starts with the name of the argument, output file=foo/.bar/baz, because we use this argument to produce an output directory and output base file name prefix. If we just took those two things as arguments, we wouldn't need any string processing at all and it'd be clear what's going on to the user.

absolutely. CmdStanPy only has argument output_dir and it handles the logic of naming all the different kinds of output files with unique and consistent names.

> To allow the current behavior, I would suggest (a) splitting directory off, and (b) operating on the file to remove the final period and anything following it.

That's pretty much the way that things are re-implemented here. The problem is that C++14 doesn't have good filepath parsing routines and the Boost filesystem library is not header-only. However, C++17 has a filesystem library, so eventually this will be easier to support.
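For comparison, here is roughly what the same split-and-strip step could look like once C++17's std::filesystem is available. It is only a sketch and is not part of this PR; the function name `path_without_suffix` is hypothetical.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: return the path with the file name's final
// extension removed, e.g. "foo/.bar/baz.csv" -> "foo/.bar/baz".
std::string path_without_suffix(const std::string& file) {
  const fs::path p(file);
  // parent_path() is the directory part; stem() is the file name
  // without its final extension.
  return (p.parent_path() / p.stem()).generic_string();
}
```

Here the library does the separator handling, so a directory name containing a period (foo.d/out) is never parsed as a suffix.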

> The code needs to check that the string-based path is reasonable. That is, we can't just have foo/. as input as that leaves no base file. We can't have * in the path, etc. Something needs to validate these are truly paths.

Agreed.
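A minimal validation sketch along those lines might look like this. The specific rules (reject an empty string, a `*` anywhere, and a base name that is empty, "." or "..") are assumptions taken from the examples in the quoted comment, and `is_plausible_output_path` is a hypothetical name rather than a function in CmdStan.

```cpp
#include <string>

// Hypothetical check: reject inputs that leave no usable base file name
// or that contain a wildcard character.
bool is_plausible_output_path(const std::string& path) {
  if (path.empty() || path.find('*') != std::string::npos) {
    return false;
  }
  const std::size_t sep = path.find_last_of("/\\");
  const std::string base
      = (sep == std::string::npos) ? path : path.substr(sep + 1);
  // "foo/" leaves an empty base; "foo/." and "foo/.." name directories.
  return !base.empty() && base != "." && base != "..";
}
```

A real check would likely need more rules (platform-specific reserved characters, for instance); this only covers the cases mentioned in the thread.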

> If it didn't break backward compatibility, I would say just slap another .csv onto the end of foo/bar.csv and be done with it.

Backwards compatibility is how we got here. We just fixed an annoying feature of the downstream performance tests which created output files that ended in ".tmp" not ".csv".

WardBrian · 2024-01-03T14:42:17Z

> If we just took those two things as arguments, we wouldn't need any string processing at all and it'd be clear what's going on to the user.

I think that even this would have some subtle issues, like if the 'base file prefix' also contained a path separator.

I think the fundamental flaw is we should never have set it up like this in the first place. IMO, when multiple files are required, we should have required the user to give us a comma-separated list of the correct length and removed any guesswork about "if I specify X as the argument, the actual files created on disk will have the name..."
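To make the comma-separated alternative concrete, a sketch of such an interface could look like the following. This is purely hypothetical (CmdStan does not work this way, as the comment itself notes), and `parse_file_list` is an invented name.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical alternative interface: the user names every output file
// explicitly, and a wrong count is an error rather than a guess.
std::vector<std::string> parse_file_list(const std::string& arg,
                                         std::size_t num_chains) {
  std::vector<std::string> files;
  std::stringstream ss(arg);
  std::string item;
  while (std::getline(ss, item, ',')) {
    files.push_back(item);
  }
  if (files.size() != num_chains) {
    throw std::invalid_argument("expected " + std::to_string(num_chains)
                                + " file names, got "
                                + std::to_string(files.size()));
  }
  return files;
}
```

With this design the names on disk are exactly the names the user typed, so no suffix parsing is needed at all.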

> To allow the current behavior, I would suggest (a) splitting directory off, and (b) operating on the file to remove the final period and anything following it.

I think this is reasonable. We are also within our rights to say "except for CSV outputs, we will force the correct extension". So the bug reported in #1213 would yield foo.bar_path_1.baz as a CSV file, but foo.bar_path_1.json as the diagnostic file. That would maintain backward compatibility (by allowing CSV outputs to still have arbitrary extensions) but be a bit more sane going forward with new formats.
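The rule proposed here (keep whatever suffix the user gave for CSV, force ".json" for the diagnostic file) could be sketched like this. The helper name `path_output_names` is hypothetical, and the sketch assumes the user's file argument in the #1213 case was foo.bar.baz.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: build the per-path CSV name and JSON diagnostic
// name from one file argument. The CSV keeps the user's suffix (".csv"
// if none was given); the JSON suffix is always forced.
std::pair<std::string, std::string> path_output_names(const std::string& file,
                                                      int path_id) {
  const std::size_t sep = file.find_last_of("/\\");
  const std::size_t dot = file.rfind('.');
  const std::size_t base_start = (sep == std::string::npos) ? 0 : sep + 1;
  // Only a '.' strictly inside the base name starts a suffix; a '.' in a
  // directory name or at the start of the base name does not.
  const bool has_suffix = dot != std::string::npos && dot > base_start;
  const std::string stem = has_suffix ? file.substr(0, dot) : file;
  const std::string csv_suffix = has_suffix ? file.substr(dot) : ".csv";
  const std::string tag = "_path_" + std::to_string(path_id);
  return std::make_pair(stem + tag + csv_suffix, stem + tag + ".json");
}
```

Under that assumption, `path_output_names("foo.bar.baz", 1)` returns foo.bar_path_1.baz for the CSV and foo.bar_path_1.json for the diagnostic file, matching the names given above.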

WardBrian

A few small things, should be last review. Thanks!

src/cmdstan/command.hpp

src/test/interface/pathfinder_test.cpp

WardBrian

Thanks again for tackling!

mitzimorris added 10 commits September 21, 2023 18:02

Merge branch 'develop' of https://github.com/stan-dev/cmdstan into develop

8104299

Merge branch 'develop' of https://github.com/stan-dev/cmdstan into develop

9704f3d

Merge branch 'develop' of https://github.com/stan-dev/cmdstan into develop

135bd2a

Merge branch 'develop' of https://github.com/stan-dev/cmdstan into develop

52cc518

Merge branch 'develop' of https://github.com/stan-dev/cmdstan into develop

257b0b1

Merge branch 'develop' of https://github.com/stan-dev/cmdstan into develop

41b7fb0

Merge branch 'develop' of https://github.com/stan-dev/cmdstan into develop

9572042

better filename parsing, first unit test

4d8caa6

added unit tests for all cases in issue

810fe93

logic cleanup

2988c88

mitzimorris requested a review from WardBrian January 2, 2024 05:07

WardBrian requested changes Jan 2, 2024

mitzimorris and others added 3 commits January 2, 2024 12:50

fix issue 1213

34d1451

fix?

f6b9809

[Jenkins] auto-formatting by clang-format version 10.0.0-4ubuntu1

4d0dfb1

mitzimorris and others added 4 commits January 2, 2024 14:21

Merge branch 'issue/1226-filename-parsing' of https://github.com/stan-dev/cmdstan into issue/1226-filename-parsing

abb37f0

changes per code review

6211071

changes per code review

42ad2c2

[Jenkins] auto-formatting by clang-format version 10.0.0-4ubuntu1

54355f1

This was linked to issues Jan 2, 2024

pathfinder: output files don't respect multiple periods, clobber outputs #1213

Closed

More output file name quirks #1226

Closed

WardBrian reviewed Jan 2, 2024

mitzimorris and others added 3 commits January 2, 2024 17:15

changes per code review

d344bd1

merge fix

14963d0

[Jenkins] auto-formatting by clang-format version 10.0.0-4ubuntu1

e4e7f0c

Merge branch 'issue/1226-filename-parsing' of https://github.com/stan-dev/cmdstan into issue/1226-filename-parsing

10e8cfc

mitzimorris and others added 5 commits January 3, 2024 16:06

changes per code review; cleanup for edge cases

db77e05

Merge commit 'd35e6ec8820494ec48e7542062a0a1897104d0d9' into HEAD

a71cbaf

[Jenkins] auto-formatting by clang-format version 10.0.0-4ubuntu1

9e2e599

Merge branch 'issue/1226-filename-parsing' of https://github.com/stan-dev/cmdstan into issue/1226-filename-parsing

ce89901

merge fix

d38a01c

WardBrian requested changes Jan 3, 2024

changes per code review

c7f3ce6

WardBrian approved these changes Jan 4, 2024

WardBrian merged commit 5d29f37 into develop Jan 4, 2024

WardBrian deleted the issue/1226-filename-parsing branch January 4, 2024 14:48

This was referenced Jan 17, 2024

bernoulli example fails to run on latest release #1238

Closed

Check string is nonempty before indexing #1239

Merged
