-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor path function handling #642
Conversation
for more information, see https://pre-commit.ci
Codecov Report
@@ Coverage Diff @@
## master #642 +/- ##
==========================================
- Coverage 78.42% 78.34% -0.08%
==========================================
Files 65 65
Lines 7119 7090 -29
Branches 1564 1435 -129
==========================================
- Hits 5583 5555 -28
- Misses 1226 1227 +1
+ Partials 310 308 -2
Continue to review full report at Codecov.
|
signac/contrib/import_export.py
Outdated
# Check that the user-specified path generates a unique mapping | ||
link_check = {j.workspace(): path_function(j) for j in jobs} | ||
if len(set(link_check.values())) != len(link_check): | ||
print("Error caught inside _make_path_function") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still printing errors while thinking about what to do
signac/contrib/linked_view.py
Outdated
@@ -99,19 +99,26 @@ def create_linked_view(project, prefix=None, job_ids=None, index=None, path=None | |||
raise RuntimeError(err_msg) | |||
|
|||
path_function = _make_path_function(jobs, path) | |||
|
|||
links = {} | |||
incomplete_links = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old code generated the link mapping in a way that we could not identify if it was 1-1.
The old code put the user specified path as the key in the dictionary. Everywhere else we put the actual job path as the key of the dictionary.
I changed the 1 place in the code that was affected by swapping this dictionary around.
I renamed links
to incomplete_links
to see both ways of doing things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should raise an error here generally, and also remove incomplete_links
in favor of links
I added the error in I think it will be least disruptive to leave the old code as is in |
How shall I test this? |
signac/contrib/import_export.py
Outdated
link_check = {j.workspace(): path_function(j) for j in jobs} | ||
if len(set(link_check.values())) != len(link_check): | ||
raise RuntimeError( | ||
"Links generated with the given path function are not 1-1. The easiest " | ||
"way to fix this is to add the job id to the path specification." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest the following changes:
- Raise the error for the first failure case rather than evaluating the paths for all jobs before erroring out
- Use a
set
for O(1) lookup and insertion - Show the path causing a failure in the error message.
- Use a
ValueError
since it's the user-provided path value (optional, feel free to keepRuntimeError
)
link_check = {j.workspace(): path_function(j) for j in jobs} | |
if len(set(link_check.values())) != len(link_check): | |
raise RuntimeError( | |
"Links generated with the given path function are not 1-1. The easiest " | |
"way to fix this is to add the job id to the path specification." | |
) | |
links = set() | |
for job in jobs: | |
job_path = path_function(job) | |
if job_path in links: | |
raise ValueError( | |
f"Multiple jobs resolve to the same path '{job_path}'. " | |
"The easiest way to fix this is to add the job id to the " | |
"path specification." | |
) | |
links.add(job_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for using a set, it doesn't look like the dict key matters here. I prefer constructing the entire set at once though (via something like links = set(path_function(job) for job in jobs)
) rather than erroring on the first duplicate. The main reason is that the error message in the suggestion is potentially misleading: it indicates that there is only a single conflict, when in fact there may (most likely there will) be multiple. Even worse, if you have a heterogeneous schema the paths may be underspecified for different reasons. If we really want to provide the most useful diagnostic (but it's a bit more work and I'd understand if you wanted to skip), we should use the for loop, use a flag to indicate that duplicates were found, then log all duplicates using the logger so that the user can inspect the problem. Barring that, I think that constructing the set in one shot and not showing a specific path is probably more appropriate here.
As a bonus, the single-pass approach will be faster in the normal case (no failure). The generator-based construction will be faster than a for loop, you won't need to do a containment check on every add, and the final test is a simple length check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @bdice @vyasr !
I combined both approaches. Thanks for the advice to use set()
.
I like being able to see colliding paths, but not by default, in case there are 1000s of them. I put them in the debug log.
I am wondering if it would be faster for most cases as @vyasr points out to first do the simple length check and then only iterate through all jobs if that fails? This would result in duplicated code but could be faster for huge workspaces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vyasr I'm a little unsure about what preference you're expressing in your statement:
Barring that, I think that constructing the set in one shot and not showing a specific path is probably more appropriate here.
Now that the "most useful diagnostic" is implemented with a for loop, would you prefer that to the simpler error (constructed with a set generator expression) that does not show a specific path? I think you make good points here and the simpler case aligns with what is done for exports:
signac/signac/contrib/import_export.py
Lines 328 to 333 in 60646fd
# Determine export path for each job. | |
paths = {job.workspace(): path_function(job) for job in jobs} | |
# Check whether the mapped paths are unique. | |
if len(set(paths.values())) != len(paths): | |
raise RuntimeError("Paths generated with given path function are not unique!") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bdice sorry for the confusion, let me clarify. My ordered prefs were:
- Check if lengths mismatch, report all unmatched paths in the debug log via
logger.debug(...)
and then raise an exception - Check if lengths mismatch and raise an error without specifying paths
- Loop through and find all mismatches, then raise an error with all of them if any exist
As far as I can tell, what is currently implemented is option (2), which is the simpler case that you mention. I do think that the best option is to start with a length check first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that what's released is option 2. The code I added is option 1 except that it does not do the simple length check first.
I do think that the best option is to start with a length check first.
I think you're agreeing with what I was asking about here:
to first do the simple length check and then only iterate through all jobs if that fails
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I am agreeing with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a brief pass of review and have some high-level suggestions.
raise ValueError( | ||
"The path argument must either be `None`, `False`, or of type `str`." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This disagrees with the docstring that says it accepts a callable
. Based on my suggestion in this comment, we should probably let this function pass through a callable and validate all possible path function values with _check_path_function
at the end before returning.
signac/contrib/import_export.py
Outdated
if callable(path): | ||
path_function = path | ||
_check_path_function(jobs, path_spec=path, path_function=path_function) | ||
else: | ||
path_function = _make_path_function(jobs, path) | ||
# path_function is checked inside _make_path_function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like _make_path_function
can be used in place of the logic here. We should be able to avoid having two paths (if callable(path)
in the snippet above) that independently call _check_path_function
, since _make_path_function
can take a callable. Might need to refactor _make_path_function
to validate at the end. See other comment.
if callable(path): | |
path_function = path | |
_check_path_function(jobs, path_spec=path, path_function=path_function) | |
else: | |
path_function = _make_path_function(jobs, path) | |
# path_function is checked inside _make_path_function | |
path_function = _make_path_function(jobs, path) |
signac/contrib/import_export.py
Outdated
else: | ||
path_function = _make_path_function(jobs, path) | ||
# path_function is checked inside _make_path_function | ||
|
||
# Determine export path for each job. | ||
paths = {job.workspace(): path_function(job) for job in jobs} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're already computing these paths when validating for uniqueness. Should that function return the generated paths to avoid doing the same computation twice? This might require some additional refactoring.
from .schema import _build_job_statepoint_index | ||
|
||
if len(jobs) <= 1: | ||
# The lambda must (optionally) take a format spec argument to match the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"must (optionally)" seems inconsistent. It's either required or it's not. (I'm not sure what this does and would need to do further inspection to make a more helpful suggestion - let me know if you need additional help here.)
class _SchemaPathEvaluationError(RuntimeError): | ||
"""Raised for errors in schema path evaluation.""" | ||
|
||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be a public class (no underscore) defined in signac/contrib/errors.py
, because it can be raised to the user.
logger.debug(f"Generated path '{job_path}' is not unique.") | ||
else: | ||
links.add(job_path) | ||
raise RuntimeError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a SchemaPathEvaluationError
?
I agree that the import_export code needs refactoring. I'm going to split this PR into one for the bug fix for the release and another for the refactor. @bdice was okay with this in a chat offline |
Given the large number of reviews, I am removing myself. |
I'm following @b-butler and removing myself. |
@cbkerr what's the plan here? |
I've extracted enough to be a bug fix into #666. It should be easier to review without moving lots of code around. |
@cbkerr awesome. If you want to merge master into this branch and fix conflicts, I would be fine with repurposing this PR for the refactor. The conversation here is illuminating and worth preserving next to the final set of changes IMO. |
@cbkerr do you need help with next steps on this PR? |
I'll get back to this after APS in mid March. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
@cbkerr do you still plan on getting back to this? |
@cbkerr do you still think this is work worth doing, and just not something that you'll get around to, or is this something we can skip altogether? If you think it's worthwhile, could you write up an issue describing what should be done in case someone wants to pick this up in the future? |
This PR switched from a bug fix to a refactor. See #666 for the bug fix only.
Original Title: Don't generate views with underspecified path provided by user
Signac has unexpected behavior when generating a view if the user specifies a custom path that doesn't uniquely specifiy jobs.
I would expect signac to link to all jobs matching the description in the view folder.
What @Nipuli-Gunaratne and I found was that signac just picks one job.
Description
We check that the mapping of source-> link is 1-1 in other parts of
import_export.py
but don't check user's provided path function.Two posssible solutions I see are:
In this draft PR, I find the places in the code for adding the error checking code in two places it might fit:
import_export.py::_make_path_function
and inlinked_view.py::create_linked_view
.I think it works best in
_make_path_function
.Motivation and Context
If you make this test project
and generate a view with a user-specified custom path that does not uniquely identify jobs
signac view test_error "a/{a}"
Signac just picks one of the jobs to link.
a=1 gets job 8aacdb17187e6acf2b175d4aa08d7213 (b=2) and not 386b19932c82f3f9749dd6611e846293 (b=1)
a=2 gets job 5e4d14d82c320bafb2f1286fe486d1f8 (b=1) and not d48f81ad571306570e2eb9fe7920cd3c (b=2)
Fix that we'd suggest to users OR try to do automatically:
Remake the path specification as "a/{a}/id/{id}"
Types of Changes
1The change breaks (or has the potential to break) existing functionality.
Checklist:
If necessary: