Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor/reorganize root dir #517

Closed
wants to merge 8 commits into from
Closed

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Feb 26, 2021

Description

Moves the signac shell history and the state point cache into a .signac directory. Adds some standardized functionality for performing this migration to simplify the transition for other files, and transparently performs this migration whenever old files are detected via an automated process so that the transition has no effect on users.

Motivation and Context

Partial solution for #197. We could close it if we make a decision on how to handle signac_statepoints.json in this PR.

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Breaking change1

1The change breaks (or has the potential to break) existing functionality.

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.

@vyasr vyasr requested review from a team as code owners February 26, 2021 21:51
@vyasr vyasr requested review from mikemhenry and jennyfothergill and removed request for a team February 26, 2021 21:51
@codecov
Copy link

codecov bot commented Feb 26, 2021

Codecov Report

Merging #517 (f17d158) into master (7795dad) will decrease coverage by 0.00%.
The diff coverage is 70.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #517      +/-   ##
==========================================
- Coverage   78.16%   78.16%   -0.01%     
==========================================
  Files          65       65              
  Lines        7030     7043      +13     
  Branches     1310     1312       +2     
==========================================
+ Hits         5495     5505      +10     
- Misses       1230     1232       +2     
- Partials      305      306       +1     
Impacted Files Coverage Δ
signac/contrib/project.py 84.78% <64.70%> (-0.30%) ⬇️
signac/__main__.py 72.06% <100.00%> (+0.17%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7795dad...f17d158. Read the comment docs.

@@ -1982,13 +1984,38 @@ def _add(_id):
else:
logger.debug("In-memory cache is up to date.")

# This function should be removed in signac 2.0.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't remove this in 2.0 without breaking backwards compatibility with 1.x. Should this migration be handled as a "schema migration" like #253 instead of a signac 2.0 migration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how we should handle this. Maybe something else worth discussing in the meeting today.

I think that migration of signac.rc to .signac/config is possibly something to handle via explicit schema migration, because that is user-facing right now, but all of the other files are purely internal so making the change transparently makes more sense to me. I agree that we can't remove this in signac 2.0, but I don't know how long we would continue to include this code. Do we retain it indefinitely? Remove in 3.0?

Furthermore, if we treated this like a schema migration, how long would signac support older schema versions? We would probably have to require migrations before users could use newer signac versions because we might make assumptions about the workspace that may not be valid for older data spaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bdice @csadorf to expand on this, since it was apparently decided at the last developer meeting we are choosing to leave signac.rc at the project root rather than moving to .signac/config, all the files moving are purely internal and not user-facing. The disappearance of any of these files will not invalidate the data space or change the way a user interacts with it. I would not consider this a schema migration, but rather a reshuffling of our internals. In fact, I'm not even sure users need to be aware of this change if signac.rc is not moving. Therefore, I don't think we should require any user input to perform this migration, which is what the migration API is based on.

Regarding when we remove this function, I would be fine removing it in 2.0 if we are not moving signac.rc. The worst case scenario is that someone doesn't touch some signac project between now and version 2.0, and then they have to rerun signac update-cache. If someone goes this long without touching a project, I don't think the loss of signac shell histories is a problem either, and an errant .signac_shell_history file persisting at the project root forever is probably not something anyone would ever notice. We could also choose to keep this function for longer if we'd like, but in the long term the main issue I could see with this is people accessing signac data spaces that are archived to read-only directories, in which case the file move would fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vyasr @csadorf

Therefore, I don't think we should require any user input to perform this migration, which is what the migration API is based on.

The current approach in this PR of assuming that no project will be negatively affected by moving/changing/deleting files without user confirmation just seems like it's not a good idea. 😕 For example, this PR might not be safe if someone ran parallel jobs and triggered automatic file movements with race conditions. This is a migration as far as I can tell, and it is what we're trying to safeguard with the work @csadorf did for interactive migrations. I believe we need to make this a manual (and optional) migration process for the user.

Because of the user burden that a manual migration incurs, we could wait until finalizing any other potential schema changes (e.g. changes necessary for TOML support or other signac 2.0 plans) to introduce the migration formally. (Side note: that includes the proposed workspace directory changes, if we need to prompt the user to create a symlink workspace pointing to the current project config's defined workspace path.)

Alternatively, we can just make this PR the definition of "schema version 2" and apply future changes in schema versions 3, 4, etc. As long as the migrations are well-defined, it should be easy to make them apply sequentially. signac 2.0 could have schema version 10 (composed of separate changes to cache files, workspaces, configs, etc.), and that would be perfectly fine with me. I think I like this idea best.

If we want to treat signac like a mature database application, I think that safeguards like user-controlled migrations that involve metadata changes or file movements are expected and necessary. I understand that the stakes are low in this case but we shouldn't skip over using the infrastructure that is designed specifically for this purpose. That said, I don't mean for my vote to be an overriding one if other @glotzerlab/signac-committers agree to the approach in this PR (or are willing to accept it as a trade-off of safety / convenience / etc.).

Regarding when we remove this function, I would be fine removing it in 2.0 if we are not moving signac.rc. The worst case scenario is that someone doesn't touch some signac project between now and version 2.0, and then they have to rerun signac update-cache. If someone goes this long without touching a project, I don't think the loss of signac shell histories is a problem either, and an errant .signac_shell_history file persisting at the project root forever is probably not something anyone would ever notice. We could also choose to keep this function for longer if we'd like, but in the long term the main issue I could see with this is people accessing signac data spaces that are archived to read-only directories, in which case the file move would fail.

Timeliness of access (pre-1.7, pre-2.0, post-2.0) shouldn't govern the outcome here. Schema versions should.

Copy link
Contributor Author

@vyasr vyasr Mar 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are good points. I'll try and address them:

  • While valid, I don't think the parallel-safety argument should have any impact on whether or not this is a schema migration. We can always use a lock to make the file move safe if necessary.
  • I agree that the assumption that this will always be safe may be problematic. If we're concerned about that, then we may have to make this an opt-in process in some fashion.
  • I think part of what has confused me during this discussion is that signac is using overloaded terminology. I've never spent much time with the schema migration code, so I hadn't thought about it until you mentioned it, and I think the naming put me in the wrong frame of mind. To your point about treating signac as a mature db, I don't actually think that comparison applies here because this problem is quite different from what databases face. Changes in signac's data model are not schema migrations in the sense of a database. A schema migration describes changes in a user's chosen layout of data stored using signac as we describe in this recipe. What we're doing here is more like a "database migration", analogous to MongoDB changing how it stores indexes internally between versions. True databases have the benefit of storing their data internally and therefore controlling the layout. To my knowledge (I could be wrong about this), to migrate data stored in an older version of MongoDB server, the solution is just to upgrade the MongoDB server software itself. There may also be other tools built to support this (for example this tool for MySQL supports migrating databases between newer versions of MySQL), but in general I don't think databases face the same problem we do where the database software can be updated without also updating the underlying data storage format. I don't know if there's a suitable renaming of schema_version that would avoid this problem, though, nor have I thought at all about the backwards compatibility implications of such a change.
  • The main reason that I'm advocating to make this transparent is because all the files being moved are internal (again, analogous to something like an internal SQL index). That kind of change seems like it should be entirely transparent, but I'm open to being convinced otherwise.
  • I think that the existing schema migration code is probably appropriate to use for migrating the workspace That's not 100% pertinent to this discussion, but worth ironing out here since we'll have to do that soon enough. We could help out users there by checking whether the workspace_dir key is set in signac.rc, and if so, raising an exception indicating that the user should run the migration unless workspace_dir = workspace.
  • This change also needs to account for how it impacts signac-flow. signac-flow has no concept of schema version, nor should it because it's not a database, but signac also has no knowledge of signac-flow. What happens if a user runs a manual schema migration with signac on a project with bundled jobs submitted, then tries to check status? Conversely, what happens if a user upgrades signac-flow and .bundles gets moved to .signac/bundles, but they're still using an older version of flow. Does signac-flow need to have a way to patch into the migration? Does the migration logic already provide hooks for this (not to my knowledge)? Does that become a separate migration process?

Copy link
Contributor Author

@vyasr vyasr Mar 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One follow-up on my last bullet point above. The fact that signac-flow is susceptible to this kind of problem suggests to me that flow should probably be storing its internal files independently of signac (e.g. in .flow). This kind of interdependency is problematic; if for some reason in the future we decide to rename .signac to .party_parrot, signac shouldn't have to worry about preserving the directory because signac-flow (or another signac-adjacent package) might be using it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I consider the internal layout of our files part of the "data schema". So yes, there are two different levels of "schemas", but I consider the file layout part of it.
  2. Applying these changes automatically is indeed problematic, so whether we use the framework or not, we should be extremely careful about that, and probably not a good idea.
  3. Data schema migrations are not always user-facing. For example, AiiDA applies migrations all the time and to my knowledge none of those are "user-facing". I am not even sure what that would mean in the context of signac. The only guarantees we make is that our API works, there are no guarantees about the internal structure of a state point file for instance.
  4. I think it is perfectly fine to keep all migrations more or less indefinitely as long as the costs of code maintenance is very low. Once there is a maintenance cost we can require that users install older versions for parts of the migration. For example, it might require signac 2.x to go from 0->8 and then signac 3.x from 8->$current. In this way only very old projects would need this kind of extra step, but the migration would still be possible.
  5. I disagree that "flow" has no data schema and thus does not need any migrations. It absolutely has a data schema in the form of how we store our internal data structures (both on the file system directly and using signac core).
  6. Unfortunately that is something that we did not fully consider when we introduced the migration framework. I am not sure what the best way to deal with that. One solution would be to simply copy the migration framework to flow and maintain flow-related migrations in flow. This PR is effectively introducing an ad-hoc migration, but without the framework.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This discussion is sufficient to convince me that we should not perform this change automatically. That being the case, we might as well use the migration tooling. In that case I'd probably want to do what @bdice suggested and make a new branch off of master in which to collect all the relevant change (the ones in this PR, workspace, maybe TOML), then make the schema version change there once we're happy with all the changes. Maintaining migrations perpetually seems fine to me, I don't see much overhead arising there, at least not for a while (Python 4 etc).

I disagree that the only guarantee we make is that our API works (or at least, I disagree that that is all we aspire to). IMO one of the goals of signac is to store data in a layout that is well-defined and can be parsed even without signac to maximize data sharing and lifetime. A major strength of our model is that signac_statepoint.json clearly identifies the data in a directory. To the extent that I would not consider changing that (e.g. by renaming the file, changing its format, etc, or by a more systematic generalization of our data model as we planned in the original signac 2.0 prototype) I would consider that part of what we promise, or at least something we should aspire to offer.

I agree that data schema migrations are not always user-facing. One way to think about this in the context of AiiDA is that AiiDA is a database built on top of a database i.e. it stores data within another database according to a well-defined internal schema, and that schema can change. Therefore, client code must also account for this. I hadn't considered signac this way since we market signac as a "database on the filesystem", but now I agree that effectively applies with the files on the filesystem as our "underlying database".

From that perspective, flow also has an internal schema. Other workflow managers must have the same problem. AiiDA gets around this by having the workflow and data in the same package, so a single update/schema migration can solve this. What does Fireworks do? They are probably the closest point of comparison for signac-flow as an independent workflow tool. Do they have a procedure for what happens if their internal MongoDB storage layout changes? I agree that it seems like using the migration tooling seems like the right approach for flow, perhaps it just needs to be generalized enough that any tool built on signac can reuse it safely to avoid duplication (maybe that's already largely the case, I'm not sure).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't considered signac this way since we market signac as a "database on the filesystem", but now I agree that effectively applies with the files on the filesystem as our "underlying database".

Yes! This is how I think about it as well. I am happy with the proposal to combine all schema changes into one PR and handle flow separately.

Copy link
Collaborator

@mikemhenry mikemhenry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! We have an open question on how long we should support this automatic migartion, but I don't think that needs to be resolved before we add it.

@bdice
Copy link
Member

bdice commented Mar 1, 2021

LGTM! We have an open question on how long we should support this automatic migartion, but I don't think that needs to be resolved before we add it.

If we decide to increment the schema version, that needs to be decided before this is merged, because this change is automatically applied to user projects.

@mikemhenry
Copy link
Collaborator

Any idea why the missing f-string placeholder wasn't caught by CI? f-string is missing placeholders should be something flake8 catches.

@csadorf
Copy link
Contributor

csadorf commented Mar 3, 2021

I'm a bit confused as to why we are not using the migration scheme for this. That's exactly what it was designed for isn't it?

@vyasr
Copy link
Contributor Author

vyasr commented Mar 7, 2021

@csadorf see this thread.

@vyasr
Copy link
Contributor Author

vyasr commented Mar 7, 2021

Any idea why the missing f-string placeholder wasn't caught by CI? f-string is missing placeholders should be something flake8 catches.

@mikemhenry I think that error catches the opposite problem, i.e. an f-string without any placeholders would raise that error: print(f"foo") should just be print("foo") because there's nothing to format. The opposite case of a normal string containing braces cannot be an error because it's valid to have braces in strings. You could imagine a more sophisticated checker that determines whether any token between braces contained in a string is actually a valid identifier as opposed to just being a random string that has no corresponding variable in the current scope, but I think that is too sophisticated for flake8 to catch.

@csadorf
Copy link
Contributor

csadorf commented Mar 8, 2021

@csadorf see this thread.

@vyasr I responded directly there.

@csadorf
Copy link
Contributor

csadorf commented Mar 9, 2021

With respect to the config file, I believe that we had agreed upon placing it in .signac/config according to this comment: #197 (comment)

However, that was not respected in our last dev discussion. I am still in favor of placing it in .signac/ any reason why we shouldn't?

@bdice
Copy link
Member

bdice commented Mar 9, 2021

With respect to the config file, I believe that we had agreed upon placing it in .signac/config according to this comment: #197 (comment)

However, that was not respected in our last dev discussion. I am still in favor of placing it in .signac/ any reason why we shouldn't?

@csadorf A few reasons were suggested. These are paraphrases of what we discussed, I may be missing some details. Tagging @mikemhenry @atravitz because I think they shared thoughts on this too?

  1. If .signac does not contain the project config (only caches and histories) then the hidden folder can be deleted (or not copied/archived because someone missed a “include hidden files” flag) without much negative impact.
  2. Keeping signac.rc visible (not hidden) is important for pedagogy and teaching the structure to new users. Nothing else in .signac is important for understanding the project.
  3. Consistency with current user expectations. The switching cost doesn’t pay off here.

This wasn’t mentioned but I find it relevant: the parallels from .signac to Git / .git are imperfect. For example, .gitmodules (which controls git submodules) lives in the root directory, under source control, rather than in .git, because it is meant to be read/written/committed by the user unlike Git’s internal caches. I view the project config in a somewhat similar way (though the hidden attribute differs).

@vyasr
Copy link
Contributor Author

vyasr commented Mar 9, 2021

With respect to the config file, I believe that we had agreed upon placing it in .signac/config according to this comment: #197 (comment)
However, that was not respected in our last dev discussion. I am still in favor of placing it in .signac/ any reason why we shouldn't?

@csadorf A few reasons were suggested. These are paraphrases of what we discussed, I may be missing some details. Tagging @mikemhenry @atravitz because I think they shared thoughts on this too?

1. If `.signac` does not contain the project config (only caches and histories) then the hidden folder can be deleted (or not copied/archived because someone missed a “include hidden files” flag) without much negative impact.

2. Keeping `signac.rc` visible (not hidden) is important for pedagogy and teaching the structure to new users. Nothing else in `.signac` is important for understanding the project.

3. Consistency with current user expectations. The switching cost doesn’t pay off here.

This wasn’t mentioned but I find it relevant: the parallels from .signac to Git / .git are imperfect. For example, .gitmodules (which controls git submodules) lives in the root directory, under source control, rather than in .git, because it is meant to be read/written/committed by the user unlike Git’s internal caches. I view the project config in a somewhat similar way (though the hidden attribute differs).

I am also still in favor of moving signac.rc, but I assumed from the attendance at the last dev meeting (looking at the dev meeting notes) that all major stakeholders other than myself were present and had agreed to this decision, so I was fine moving forward. Since both @csadorf and @b-butler have expressed support for moving the file in the past, I would like to have at least one discussion at dev meeting (or another short dedicated meeting) with all committers present to finalize this (and perhaps vote on the final decision since our current decision process seems rather ad hoc for such a significant change). I also don't think it's pressing, because there are a number of additional issues that need to be ironed out for the new schema version anyway before we have to move on this decision one way or the other.

Having said that, I think points 1-3 are valid. I disagree with the additional comment, because I'm actually in favor of only exposing modification of the config via our config API rather than supporting direct modification (also suggested by @b-butler here). I think that the move from a file that can be created/modified by the user to a stronger push for using signac init is a good thing, especially because we're making the workspace_dir immutable.

I'm going to close this PR and open a new meta-issue to document what I think need to be next steps forward, since I think this PR will need to be largely rewritten anyway to use the migration API.

@vyasr vyasr closed this Mar 9, 2021
@vyasr vyasr mentioned this pull request Mar 9, 2021
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants