Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect Schema for different type of Project to Sync #280

Merged
merged 16 commits into from
Feb 10, 2020

Conversation

vishav1771
Copy link
Contributor

@vishav1771 vishav1771 commented Jan 29, 2020

Description

Changed the detect_schema function to compare on the bases of data types instead of values.

Motivation and Context

Fixes #230

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Breaking change1

1The change breaks (or has the potential to break) existing functionality.

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.

Example for a changelog entry: Fix issue with launching rockets to the moon (#101, #212).

@vishav1771
Copy link
Contributor Author

vishav1771 commented Jan 29, 2020

Hi @csadorf @vyasr, I have added tests to show the new behavior of sync function with different types of projects. Is this behavior fine or need to change anything?

@vyasr
Copy link
Contributor

vyasr commented Jan 29, 2020

@vishav1771 it looks like your tests are failing because the docstring you added to your test has an extra trailing quote (four instead of three). I'll look at the tests now.

@vishav1771
Copy link
Contributor Author

@vyasr I have not done any changes in the detect_schema yet. I just wanted to confirm the intended behavior of sync of function (as shown by the test ) is correct or not?

@vyasr
Copy link
Contributor

vyasr commented Jan 29, 2020

Yes, you've absolutely done the right thing. To make sure I understand these tests:

  • test_mixed_same_data_type will fail with current signac, but we believe it should pass.
  • test_mixed_diff_data_type will fail with current signac and we believe it should fail
  • test_mixed_multiple_data_type will fail with current signac, and you think it should succeed in one direction but not the other (the direction where one project's key-values are a strict superset of the other).

Is that interpretation correct? One more I might add is where the types don't match but the values match exactly e.g. float(1) and int(1) as the values. It would also be good to test a case where one has a superset of the other's keys, i.e. both projects have a={1, 2}, but one project also has jobs with b=1.

Reading these tests makes me think that we are perhaps too stringent with our tests of invalid schema. The existence of the check_schema flag perhaps makes that acceptable, however it's not clear to me what a user of this would expect in all of these cases. For instance, I might naively only expect synchronization to fail if two jobs with the same statepoint fail to synchronize, i.e. I'm wondering if maybe the expected default for most users is simply check_schema=False. I would like to get some feedback from more users/developers on how they would expect synchronization to work. @csadorf would you be OK with polling a few others before making a decision on the "correct" behavior?

@csadorf
Copy link
Contributor

csadorf commented Jan 30, 2020

Reading these tests makes me think that we are perhaps too stringent with our tests of invalid schema.

I agree. The initial intention when I added the schema check was to prevent cases where one would apply a migration to one project data space (project A) and then sync it with a diverged copy (project B), e.g., one where a key was renamed, and now one would end up with a bunch of additional jobs that were originally supposed to be the same.

So here is my proposal, I think that we should keep the current behavior where sync really only syncs when the schema is exactly the same. However, we should clarify in the error message that this does not meany anything bad if you intend to sync two data space that are similar, but not identical in terms of schema. We could consider defusing the warning message and to use a different option to override just that check that sounds less menacing then -f/--force, something like -m/--merge or so.

When I look at the original issue, the error message that we see is very confusing:

Keys found only in the source schema: dt, totalSteps, k_bending, period, phi, seed, a, reversalRate, totalMonomers, f_p
Keys found only in the destination schema: dt, totalSteps, k_bending, period, phi, seed, a, reversalRate, totalMonomers, f_p

That just doesn't make any sense since it claims that certain keys were only found in the source or destination schema, but they are actually identical.

This is what needs to be addressed.

@vyasr
Copy link
Contributor

vyasr commented Jan 30, 2020

I think that's a reasonable approach for now. If we go this route of not changing how schema checking works, long term I would advocate that for 2.0 we change the default behavior to check_schema=False. I think that is the behavior people are actually expecting. As long as the per-job synchronization is robust with respect to conflicts, the worst thing people can do is end up with duplicate jobs that can be easily removed or migrated correctly using the signac shell interface.

@vishav1771
Copy link
Contributor Author

@vyasr So, Only need to change the warning message and a different option to override for now?

@vyasr
Copy link
Contributor

vyasr commented Jan 30, 2020

Yes, I think that for now we would just want to add a --no-check-schema option and then change the text of the error.

Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update this PR based on master and apply discussed modifications to this PR.

@vishav1771 vishav1771 marked this pull request as ready for review January 31, 2020 21:51
@vishav1771 vishav1771 requested review from a team as code owners January 31, 2020 21:51
@vishav1771 vishav1771 requested review from Tobias-Dwyer, csadorf and vyasr and removed request for a team January 31, 2020 21:51
@codecov
Copy link

codecov bot commented Feb 3, 2020

Codecov Report

Merging #280 into master will increase coverage by 0.05%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #280      +/-   ##
==========================================
+ Coverage   64.81%   64.86%   +0.05%     
==========================================
  Files          40       40              
  Lines        5644     5644              
==========================================
+ Hits         3658     3661       +3     
+ Misses       1986     1983       -3
Impacted Files Coverage Δ
signac/contrib/project.py 90.54% <0%> (+0.42%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c845639...074723f. Read the comment docs.

@vishav1771 vishav1771 requested a review from csadorf February 3, 2020 11:08
@vishav1771 vishav1771 requested review from csadorf and removed request for Tobias-Dwyer February 3, 2020 19:00
Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Approved pending the application of the minor suggestions that I added.

@vishav1771 One final question, have you verified that this will solve the specific issue for the data space posted in #230 ?

@vishav1771
Copy link
Contributor Author

@vishav1771 One final question, have you verified that this will solve the specific issue for the data space posted in #230 ?

Yes, I have verified it with statepoints given in #230 and few other cases. However, while testing I came across a test case shown :

project1:
{
 job1:{
    'a' : 0,
    'b' : 0,
  }
 job2:{
    'a' : 1,
    'b' : 1,
  }
}

project2:
{
 job1:{
    'a' : 0,
    'b' : 1,
  }
 job2:{
    'a' : 1,
    'b' : 0,
  }
}

They will generate the same schema and will pass check_schema. I don't think that this is the intended behaviour. Can you please verify?

@csadorf
Copy link
Contributor

csadorf commented Feb 4, 2020

@vishav1771 One final question, have you verified that this will solve the specific issue for the data space posted in #230 ?

Yes, I have verified it with statepoints given in #230 and few other cases.

Great, thank you.

They will generate the same schema and will pass check_schema. I don't think that this is the intended behaviour. Can you please verify?

While the two projects don't have the identical set of jobs, their implicit schema is indeed the same so IMO it is intended behavior that they would sync without problem. Otherwise we would need to redefine that merging means that one syncs two projects that have the exact same set of jobs. I don't think that's what we want. @vyasr Thoughts?

@vyasr
Copy link
Contributor

vyasr commented Feb 4, 2020

@vishav1771 One final question, have you verified that this will solve the specific issue for the data space posted in #230 ?
They will generate the same schema and will pass check_schema. I don't think that this is the intended behaviour. Can you please verify?

While the two projects don't have the identical set of jobs, their implicit schema is indeed the same so IMO it is intended behavior that they would sync without problem. Otherwise we would need to redefine that merging means that one syncs two projects that have the exact same set of jobs. I don't think that's what we want. @vyasr Thoughts?

I agree. I think this question is revisiting the motivation for the original requirement that schema had to be exactly identical, which was to avoid e.g. renaming or adding keys to schema and then trying to sync. If the keys and the types of the values corresponding to each key are the same between two schema, any jobs that exist in the source but not the destination must be truly distinct statepoints, so I think this behavior is the correct one.

@csadorf
Copy link
Contributor

csadorf commented Feb 4, 2020

@vyasr Great, please merge at your discretion.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a typo, I'm going to apply the suggestion myself then I think it's good to go.

@vyasr vyasr self-requested a review February 4, 2020 17:26
@vyasr
Copy link
Contributor

vyasr commented Feb 4, 2020

@vishav1771 I don't have push permissions to your repo. Can you commit my suggestion directly? I will merge after that.

Also, in general it is helpful for us if you complete the checklist when you make the pull request. That lets us track how far you've gotten.

@vishav1771
Copy link
Contributor Author

@vishav1771 I don't have push permissions to your repo. Can you commit my suggestion directly? I will merge after that.

Sure

Also, in general it is helpful for us if you complete the checklist when you make the pull request. That lets us track how far you've gotten.

Sorry about that. I'll keep in mind for next time

@csadorf
Copy link
Contributor

csadorf commented Feb 10, 2020

@vyasr This is ready, isn't it?

@vyasr
Copy link
Contributor

vyasr commented Feb 10, 2020

Yes, thanks for the ping.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@vyasr vyasr merged commit 2e73c78 into glotzerlab:master Feb 10, 2020
@bdice bdice added this to the v1.4.0 milestone Feb 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API command sync doesn't identify the schema correctly
4 participants