Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor path function handling #642

Closed
wants to merge 37 commits into from
Closed

Conversation

cbkerr
Copy link
Member

@cbkerr cbkerr commented Nov 3, 2021

This PR switched from a bug fix to a refactor. See #666 for the bug fix only.

Original Title: Don't generate views with underspecified path provided by user

Signac has unexpected behavior when generating a view if the user specifies a custom path that doesn't uniquely specifiy jobs.

I would expect signac to link to all jobs matching the description in the view folder.

What @Nipuli-Gunaratne and I found was that signac just picks one job.

Description

We check that the mapping of source-> link is 1-1 in other parts of import_export.py but don't check user's provided path function.

Two posssible solutions I see are:

  1. Generate an error and exit. Suggest the fix in the error message.
  2. Try to fix the problem by adding jobs ids to the path.

In this draft PR, I find the places in the code for adding the error checking code in two places it might fit: import_export.py::_make_path_function and in linked_view.py::create_linked_view.

I think it works best in _make_path_function.

Motivation and Context

If you make this test project

#init.py
import signac

project = signac.init_project('Test-view')

jobs = [dict(a=1,b=1),
        dict(a=1,b=2),
        dict(a=2,b=1),
        dict(a=2,b=2)
        ]


for j in jobs:
    job = project.open_job(j)
    job.init()
    print(j, job.id)

and generate a view with a user-specified custom path that does not uniquely identify jobs

signac view test_error "a/{a}"

Signac just picks one of the jobs to link.
a=1 gets job 8aacdb17187e6acf2b175d4aa08d7213 (b=2) and not 386b19932c82f3f9749dd6611e846293 (b=1)
a=2 gets job 5e4d14d82c320bafb2f1286fe486d1f8 (b=1) and not d48f81ad571306570e2eb9fe7920cd3c (b=2)

Fix that we'd suggest to users OR try to do automatically:

Remake the path specification as "a/{a}/id/{id}"

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Breaking change1

1The change breaks (or has the potential to break) existing functionality.

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.

@codecov
Copy link

codecov bot commented Nov 3, 2021

Codecov Report

Merging #642 (0cda86d) into master (44f59ee) will decrease coverage by 0.07%.
The diff coverage is 75.90%.

❗ Current head 0cda86d differs from pull request most recent head 0837127. Consider uploading reports for the commit 0837127 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #642      +/-   ##
==========================================
- Coverage   78.42%   78.34%   -0.08%     
==========================================
  Files          65       65              
  Lines        7119     7090      -29     
  Branches     1564     1435     -129     
==========================================
- Hits         5583     5555      -28     
- Misses       1226     1227       +1     
+ Partials      310      308       -2     
Impacted Files Coverage Δ
signac/contrib/utility.py 58.99% <74.02%> (+7.14%) ⬆️
signac/__main__.py 71.89% <100.00%> (ø)
signac/contrib/import_export.py 79.60% <100.00%> (+1.29%) ⬆️
signac/contrib/linked_view.py 84.02% <100.00%> (+0.11%) ⬆️
signac/contrib/filesystems.py 51.00% <0.00%> (-3.21%) ⬇️
signac/common/connection.py 38.27% <0.00%> (-0.76%) ⬇️
signac/common/crypt.py 63.26% <0.00%> (-0.74%) ⬇️
signac/core/json.py 86.20% <0.00%> (-0.46%) ⬇️
signac/common/host.py 40.00% <0.00%> (-0.43%) ⬇️
signac/core/jsondict.py 46.99% <0.00%> (-0.29%) ⬇️
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 44f59ee...0837127. Read the comment docs.

# Check that the user-specified path generates a unique mapping
link_check = {j.workspace(): path_function(j) for j in jobs}
if len(set(link_check.values())) != len(link_check):
print("Error caught inside _make_path_function")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still printing errors while thinking about what to do

@@ -99,19 +99,26 @@ def create_linked_view(project, prefix=None, job_ids=None, index=None, path=None
raise RuntimeError(err_msg)

path_function = _make_path_function(jobs, path)

links = {}
incomplete_links = {}
Copy link
Member Author

@cbkerr cbkerr Nov 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code generated the link mapping in a way that we could not identify if it was 1-1.

The old code put the user specified path as the key in the dictionary. Everywhere else we put the actual job path as the key of the dictionary.

I changed the 1 place in the code that was affected by swapping this dictionary around.

I renamed links to incomplete_links to see both ways of doing things

Copy link
Member

@b-butler b-butler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should raise an error here generally, and also remove incomplete_links in favor of links

@cbkerr
Copy link
Member Author

cbkerr commented Nov 3, 2021

I think we should raise an error here generally, and also remove incomplete_links in favor of links

I added the error in import_export.py but as I was trying to use the inverted dictionary for links, I realized that the existing _update_view function needs the link as the key.

I think it will be least disruptive to leave the old code as is in create_linked_view

@cbkerr cbkerr marked this pull request as ready for review November 4, 2021 18:25
@cbkerr cbkerr requested review from a team as code owners November 4, 2021 18:25
@cbkerr cbkerr requested review from lyrivera and removed request for a team November 4, 2021 18:25
@cbkerr
Copy link
Member Author

cbkerr commented Nov 4, 2021

How shall I test this?

@cbkerr cbkerr requested a review from b-butler November 4, 2021 18:25
@bdice bdice removed the request for review from a team November 5, 2021 00:10
Comment on lines 245 to 250
link_check = {j.workspace(): path_function(j) for j in jobs}
if len(set(link_check.values())) != len(link_check):
raise RuntimeError(
"Links generated with the given path function are not 1-1. The easiest "
"way to fix this is to add the job id to the path specification."
)
Copy link
Member

@bdice bdice Nov 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest the following changes:

  • Raise the error for the first failure case rather than evaluating the paths for all jobs before erroring out
  • Use a set for O(1) lookup and insertion
  • Show the path causing a failure in the error message.
  • Use a ValueError since it's the user-provided path value (optional, feel free to keep RuntimeError)
Suggested change
link_check = {j.workspace(): path_function(j) for j in jobs}
if len(set(link_check.values())) != len(link_check):
raise RuntimeError(
"Links generated with the given path function are not 1-1. The easiest "
"way to fix this is to add the job id to the path specification."
)
links = set()
for job in jobs:
job_path = path_function(job)
if job_path in links:
raise ValueError(
f"Multiple jobs resolve to the same path '{job_path}'. "
"The easiest way to fix this is to add the job id to the "
"path specification."
)
links.add(job_path)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for using a set, it doesn't look like the dict key matters here. I prefer constructing the entire set at once though (via something like links = set(path_function(job) for job in jobs)) rather than erroring on the first duplicate. The main reason is that the error message in the suggestion is potentially misleading: it indicates that there is only a single conflict, when in fact there may (most likely there will) be multiple. Even worse, if you have a heterogeneous schema the paths may be underspecified for different reasons. If we really want to provide the most useful diagnostic (but it's a bit more work and I'd understand if you wanted to skip), we should use the for loop, use a flag to indicate that duplicates were found, then log all duplicates using the logger so that the user can inspect the problem. Barring that, I think that constructing the set in one shot and not showing a specific path is probably more appropriate here.

As a bonus, the single-pass approach will be faster in the normal case (no failure). The generator-based construction will be faster than a for loop, you won't need to do a containment check on every add, and the final test is a simple length check

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bdice @vyasr !
I combined both approaches. Thanks for the advice to use set().

I like being able to see colliding paths, but not by default, in case there are 1000s of them. I put them in the debug log.

I am wondering if it would be faster for most cases as @vyasr points out to first do the simple length check and then only iterate through all jobs if that fails? This would result in duplicated code but could be faster for huge workspaces?

Copy link
Member

@bdice bdice Nov 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vyasr I'm a little unsure about what preference you're expressing in your statement:

Barring that, I think that constructing the set in one shot and not showing a specific path is probably more appropriate here.

Now that the "most useful diagnostic" is implemented with a for loop, would you prefer that to the simpler error (constructed with a set generator expression) that does not show a specific path? I think you make good points here and the simpler case aligns with what is done for exports:

# Determine export path for each job.
paths = {job.workspace(): path_function(job) for job in jobs}
# Check whether the mapped paths are unique.
if len(set(paths.values())) != len(paths):
raise RuntimeError("Paths generated with given path function are not unique!")

Copy link
Contributor

@vyasr vyasr Nov 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bdice sorry for the confusion, let me clarify. My ordered prefs were:

  1. Check if lengths mismatch, report all unmatched paths in the debug log via logger.debug(...) and then raise an exception
  2. Check if lengths mismatch and raise an error without specifying paths
  3. Loop through and find all mismatches, then raise an error with all of them if any exist

As far as I can tell, what is currently implemented is option (2), which is the simpler case that you mention. I do think that the best option is to start with a length check first.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that what's released is option 2. The code I added is option 1 except that it does not do the simple length check first.

I do think that the best option is to start with a length check first.

I think you're agreeing with what I was asking about here:

to first do the simple length check and then only iterate through all jobs if that fails

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am agreeing with that.

@bdice bdice added this to the v1.8.0 milestone Dec 10, 2021
@bdice bdice added the bug Something isn't working label Dec 10, 2021
Copy link
Member

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a brief pass of review and have some high-level suggestions.

Comment on lines +731 to +733
raise ValueError(
"The path argument must either be `None`, `False`, or of type `str`."
)
Copy link
Member

@bdice bdice Dec 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This disagrees with the docstring that says it accepts a callable. Based on my suggestion in this comment, we should probably let this function pass through a callable and validate all possible path function values with _check_path_function at the end before returning.

Comment on lines 89 to 94
if callable(path):
path_function = path
_check_path_function(jobs, path_spec=path, path_function=path_function)
else:
path_function = _make_path_function(jobs, path)
# path_function is checked inside _make_path_function
Copy link
Member

@bdice bdice Dec 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like _make_path_function can be used in place of the logic here. We should be able to avoid having two paths (if callable(path) in the snippet above) that independently call _check_path_function, since _make_path_function can take a callable. Might need to refactor _make_path_function to validate at the end. See other comment.

Suggested change
if callable(path):
path_function = path
_check_path_function(jobs, path_spec=path, path_function=path_function)
else:
path_function = _make_path_function(jobs, path)
# path_function is checked inside _make_path_function
path_function = _make_path_function(jobs, path)

else:
path_function = _make_path_function(jobs, path)
# path_function is checked inside _make_path_function

# Determine export path for each job.
paths = {job.workspace(): path_function(job) for job in jobs}
Copy link
Member

@bdice bdice Dec 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're already computing these paths when validating for uniqueness. Should that function return the generated paths to avoid doing the same computation twice? This might require some additional refactoring.

from .schema import _build_job_statepoint_index

if len(jobs) <= 1:
# The lambda must (optionally) take a format spec argument to match the
Copy link
Member

@bdice bdice Dec 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"must (optionally)" seems inconsistent. It's either required or it's not. (I'm not sure what this does and would need to do further inspection to make a more helpful suggestion - let me know if you need additional help here.)

Comment on lines +592 to +595
class _SchemaPathEvaluationError(RuntimeError):
"""Raised for errors in schema path evaluation."""

pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be a public class (no underscore) defined in signac/contrib/errors.py, because it can be raised to the user.

logger.debug(f"Generated path '{job_path}' is not unique.")
else:
links.add(job_path)
raise RuntimeError(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a SchemaPathEvaluationError?

@cbkerr
Copy link
Member Author

cbkerr commented Dec 14, 2021

I agree that the import_export code needs refactoring. I'm going to split this PR into one for the bug fix for the release and another for the refactor. @bdice was okay with this in a chat offline

@b-butler
Copy link
Member

b-butler commented Jan 3, 2022

Given the large number of reviews, I am removing myself.

@b-butler b-butler removed their request for review January 3, 2022 15:00
@lyrivera
Copy link

lyrivera commented Jan 4, 2022

I'm following @b-butler and removing myself.

@lyrivera lyrivera removed their request for review January 4, 2022 15:27
@vyasr
Copy link
Contributor

vyasr commented Jan 18, 2022

@cbkerr what's the plan here?

@cbkerr
Copy link
Member Author

cbkerr commented Jan 21, 2022

I've extracted enough to be a bug fix into #666. It should be easier to review without moving lots of code around.
I'll rename this PR or make an issue to focus on the refactor.

@vyasr
Copy link
Contributor

vyasr commented Jan 25, 2022

@cbkerr awesome. If you want to merge master into this branch and fix conflicts, I would be fine with repurposing this PR for the refactor. The conversation here is illuminating and worth preserving next to the final set of changes IMO.

@cbkerr cbkerr changed the title Don't generate views with underspecified path provided by user Refactor path function handling Jan 26, 2022
@vyasr
Copy link
Contributor

vyasr commented Feb 21, 2022

@cbkerr do you need help with next steps on this PR?

@cbkerr
Copy link
Member Author

cbkerr commented Feb 22, 2022

I'll get back to this after APS in mid March.

@cbkerr cbkerr added refactor Code refactoring and removed bug Something isn't working labels Feb 22, 2022
@stale
Copy link

stale bot commented Jun 12, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 12, 2022
@b-butler
Copy link
Member

b-butler commented Jul 8, 2022

@cbkerr do you still plan on getting back to this?

@stale stale bot removed the stale label Jul 8, 2022
@vyasr
Copy link
Contributor

vyasr commented Sep 18, 2022

@cbkerr checking in here again, do you still plan to tackle this? Since the bug fix was merged in #666 and this is just a refactor, can the changes be retargeted onto next for version 2.0 instead of 1.8?

@cbkerr cbkerr closed this Sep 18, 2022
@vyasr
Copy link
Contributor

vyasr commented Sep 19, 2022

@cbkerr do you still think this is work worth doing, and just not something that you'll get around to, or is this something we can skip altogether? If you think it's worthwhile, could you write up an issue describing what should be done in case someone wants to pick this up in the future?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactor Code refactoring
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants