-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor/reorganize root dir #517
Conversation
Codecov Report
@@ Coverage Diff @@
## master #517 +/- ##
==========================================
- Coverage 78.16% 78.16% -0.01%
==========================================
Files 65 65
Lines 7030 7043 +13
Branches 1310 1312 +2
==========================================
+ Hits 5495 5505 +10
- Misses 1230 1232 +2
- Partials 305 306 +1
Continue to review full report at Codecov.
|
@@ -1982,13 +1984,38 @@ def _add(_id): | |||
else: | |||
logger.debug("In-memory cache is up to date.") | |||
|
|||
# This function should be removed in signac 2.0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't remove this in 2.0 without breaking backwards compatibility with 1.x. Should this migration be handled as a "schema migration" like #253 instead of a signac 2.0 migration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how we should handle this. Maybe something else worth discussing in the meeting today.
I think that migration of signac.rc
to .signac/config
is possibly something to handle via explicit schema migration, because that is user-facing right now, but all of the other files are purely internal so making the change transparently makes more sense to me. I agree that we can't remove this in signac 2.0, but I don't know how long we would continue to include this code. Do we retain it indefinitely? Remove in 3.0?
Furthermore, if we treated this like a schema migration, how long would signac support older schema versions? We would probably have to require migrations before users could use newer signac versions because we might make assumptions about the workspace that may not be valid for older data spaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bdice @csadorf to expand on this, since it was apparently decided at the last developer meeting we are choosing to leave signac.rc
at the project root rather than moving to .signac/config
, all the files moving are purely internal and not user-facing. The disappearance of any of these files will not invalidate the data space or change the way a user interacts with it. I would not consider this a schema migration, but rather a reshuffling of our internals. In fact, I'm not even sure users need to be aware of this change if signac.rc
is not moving. Therefore, I don't think we should require any user input to perform this migration, which is what the migration API is based on.
Regarding when we remove this function, I would be fine removing it in 2.0 if we are not moving signac.rc
. The worst case scenario is that someone doesn't touch some signac project between now and version 2.0, and then they have to rerun signac update-cache
. If someone goes this long without touching a project, I don't think the loss of signac shell histories is a problem either, and an errant .signac_shell_history
file persisting at the project root forever is probably not something anyone would ever notice. We could also choose to keep this function for longer if we'd like, but in the long term the main issue I could see with this is people accessing signac data spaces that are archived to read-only directories, in which case the file move would fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Therefore, I don't think we should require any user input to perform this migration, which is what the migration API is based on.
The current approach in this PR of assuming that no project will be negatively affected by moving/changing/deleting files without user confirmation just seems like it's not a good idea. 😕 For example, this PR might not be safe if someone ran parallel jobs and triggered automatic file movements with race conditions. This is a migration as far as I can tell, and it is what we're trying to safeguard with the work @csadorf did for interactive migrations. I believe we need to make this a manual (and optional) migration process for the user.
Because of the user burden that a manual migration incurs, we could wait until finalizing any other potential schema changes (e.g. changes necessary for TOML support or other signac 2.0 plans) to introduce the migration formally. (Side note: that includes the proposed workspace directory changes, if we need to prompt the user to create a symlink workspace
pointing to the current project config's defined workspace path.)
Alternatively, we can just make this PR the definition of "schema version 2" and apply future changes in schema versions 3, 4, etc. As long as the migrations are well-defined, it should be easy to make them apply sequentially. signac 2.0 could have schema version 10 (composed of separate changes to cache files, workspaces, configs, etc.), and that would be perfectly fine with me. I think I like this idea best.
If we want to treat signac like a mature database application, I think that safeguards like user-controlled migrations that involve metadata changes or file movements are expected and necessary. I understand that the stakes are low in this case but we shouldn't skip over using the infrastructure that is designed specifically for this purpose. That said, I don't mean for my vote to be an overriding one if other @glotzerlab/signac-committers agree to the approach in this PR (or are willing to accept it as a trade-off of safety / convenience / etc.).
Regarding when we remove this function, I would be fine removing it in 2.0 if we are not moving signac.rc. The worst case scenario is that someone doesn't touch some signac project between now and version 2.0, and then they have to rerun signac update-cache. If someone goes this long without touching a project, I don't think the loss of signac shell histories is a problem either, and an errant .signac_shell_history file persisting at the project root forever is probably not something anyone would ever notice. We could also choose to keep this function for longer if we'd like, but in the long term the main issue I could see with this is people accessing signac data spaces that are archived to read-only directories, in which case the file move would fail.
Timeliness of access (pre-1.7, pre-2.0, post-2.0) shouldn't govern the outcome here. Schema versions should.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are good points. I'll try and address them:
- While valid, I don't think the parallel-safety argument should have any impact on whether or not this is a schema migration. We can always use a lock to make the file move safe if necessary.
- I agree that the assumption that this will always be safe may be problematic. If we're concerned about that, then we may have to make this an opt-in process in some fashion.
- I think part of what has confused me during this discussion is that
signac
is using overloaded terminology. I've never spent much time with the schema migration code, so I hadn't thought about it until you mentioned it, and I think the naming put me in the wrong frame of mind. To your point about treating signac as a mature db, I don't actually think that comparison applies here because this problem is quite different from what databases face. Changes in signac's data model are not schema migrations in the sense of a database. A schema migration describes changes in a user's chosen layout of data stored using signac as we describe in this recipe. What we're doing here is more like a "database migration", analogous to MongoDB changing how it stores indexes internally between versions. True databases have the benefit of storing their data internally and therefore controlling the layout. To my knowledge (I could be wrong about this), to migrate data stored in an older version of MongoDB server, the solution is just to upgrade the MongoDB server software itself. There may also be other tools built to support this (for example this tool for MySQL supports migrating databases between newer versions of MySQL), but in general I don't think databases face the same problem we do where the database software can be updated without also updating the underlying data storage format. I don't know if there's a suitable renaming ofschema_version
that would avoid this problem, though, nor have I thought at all about the backwards compatibility implications of such a change. - The main reason that I'm advocating to make this transparent is because all the files being moved are internal (again, analogous to something like an internal SQL index). That kind of change seems like it should be entirely transparent, but I'm open to being convinced otherwise.
- I think that the existing schema migration code is probably appropriate to use for migrating the
workspace
That's not 100% pertinent to this discussion, but worth ironing out here since we'll have to do that soon enough. We could help out users there by checking whether theworkspace_dir
key is set insignac.rc
, and if so, raising an exception indicating that the user should run the migration unlessworkspace_dir = workspace
. - This change also needs to account for how it impacts
signac-flow
.signac-flow
has no concept of schema version, nor should it because it's not a database, butsignac
also has no knowledge ofsignac-flow
. What happens if a user runs a manual schema migration withsignac
on a project with bundled jobs submitted, then tries to check status? Conversely, what happens if a user upgradessignac-flow
and.bundles
gets moved to.signac/bundles
, but they're still using an older version of flow. Doessignac-flow
need to have a way to patch into the migration? Does the migration logic already provide hooks for this (not to my knowledge)? Does that become a separate migration process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One follow-up on my last bullet point above. The fact that signac-flow
is susceptible to this kind of problem suggests to me that flow
should probably be storing its internal files independently of signac
(e.g. in .flow
). This kind of interdependency is problematic; if for some reason in the future we decide to rename .signac
to .party_parrot
, signac
shouldn't have to worry about preserving the directory because signac-flow
(or another signac
-adjacent package) might be using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I consider the internal layout of our files part of the "data schema". So yes, there are two different levels of "schemas", but I consider the file layout part of it.
- Applying these changes automatically is indeed problematic, so whether we use the framework or not, we should be extremely careful about that, and probably not a good idea.
- Data schema migrations are not always user-facing. For example, AiiDA applies migrations all the time and to my knowledge none of those are "user-facing". I am not even sure what that would mean in the context of signac. The only guarantees we make is that our API works, there are no guarantees about the internal structure of a state point file for instance.
- I think it is perfectly fine to keep all migrations more or less indefinitely as long as the costs of code maintenance is very low. Once there is a maintenance cost we can require that users install older versions for parts of the migration. For example, it might require signac 2.x to go from 0->8 and then signac 3.x from 8->$current. In this way only very old projects would need this kind of extra step, but the migration would still be possible.
- I disagree that "flow" has no data schema and thus does not need any migrations. It absolutely has a data schema in the form of how we store our internal data structures (both on the file system directly and using signac core).
- Unfortunately that is something that we did not fully consider when we introduced the migration framework. I am not sure what the best way to deal with that. One solution would be to simply copy the migration framework to flow and maintain flow-related migrations in flow. This PR is effectively introducing an ad-hoc migration, but without the framework.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This discussion is sufficient to convince me that we should not perform this change automatically. That being the case, we might as well use the migration tooling. In that case I'd probably want to do what @bdice suggested and make a new branch off of master in which to collect all the relevant change (the ones in this PR, workspace, maybe TOML), then make the schema version change there once we're happy with all the changes. Maintaining migrations perpetually seems fine to me, I don't see much overhead arising there, at least not for a while (Python 4 etc).
I disagree that the only guarantee we make is that our API works (or at least, I disagree that that is all we aspire to). IMO one of the goals of signac is to store data in a layout that is well-defined and can be parsed even without signac to maximize data sharing and lifetime. A major strength of our model is that signac_statepoint.json
clearly identifies the data in a directory. To the extent that I would not consider changing that (e.g. by renaming the file, changing its format, etc, or by a more systematic generalization of our data model as we planned in the original signac 2.0 prototype) I would consider that part of what we promise, or at least something we should aspire to offer.
I agree that data schema migrations are not always user-facing. One way to think about this in the context of AiiDA is that AiiDA is a database built on top of a database i.e. it stores data within another database according to a well-defined internal schema, and that schema can change. Therefore, client code must also account for this. I hadn't considered signac this way since we market signac as a "database on the filesystem", but now I agree that effectively applies with the files on the filesystem as our "underlying database".
From that perspective, flow also has an internal schema. Other workflow managers must have the same problem. AiiDA gets around this by having the workflow and data in the same package, so a single update/schema migration can solve this. What does Fireworks do? They are probably the closest point of comparison for signac-flow as an independent workflow tool. Do they have a procedure for what happens if their internal MongoDB storage layout changes? I agree that it seems like using the migration tooling seems like the right approach for flow, perhaps it just needs to be generalized enough that any tool built on signac can reuse it safely to avoid duplication (maybe that's already largely the case, I'm not sure).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't considered signac this way since we market signac as a "database on the filesystem", but now I agree that effectively applies with the files on the filesystem as our "underlying database".
Yes! This is how I think about it as well. I am happy with the proposal to combine all schema changes into one PR and handle flow separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! We have an open question on how long we should support this automatic migartion, but I don't think that needs to be resolved before we add it.
If we decide to increment the schema version, that needs to be decided before this is merged, because this change is automatically applied to user projects. |
Any idea why the missing f-string placeholder wasn't caught by CI? |
I'm a bit confused as to why we are not using the migration scheme for this. That's exactly what it was designed for isn't it? |
@csadorf see this thread. |
@mikemhenry I think that error catches the opposite problem, i.e. an f-string without any placeholders would raise that error: |
@vyasr I responded directly there. |
With respect to the config file, I believe that we had agreed upon placing it in However, that was not respected in our last dev discussion. I am still in favor of placing it in |
@csadorf A few reasons were suggested. These are paraphrases of what we discussed, I may be missing some details. Tagging @mikemhenry @atravitz because I think they shared thoughts on this too?
This wasn’t mentioned but I find it relevant: the parallels from |
I am also still in favor of moving Having said that, I think points 1-3 are valid. I disagree with the additional comment, because I'm actually in favor of only exposing modification of the config via our config API rather than supporting direct modification (also suggested by @b-butler here). I think that the move from a file that can be created/modified by the user to a stronger push for using I'm going to close this PR and open a new meta-issue to document what I think need to be next steps forward, since I think this PR will need to be largely rewritten anyway to use the migration API. |
Description
Moves the signac shell history and the state point cache into a
.signac
directory. Adds some standardized functionality for performing this migration to simplify the transition for other files, and transparently performs this migration whenever old files are detected via an automated process so that the transition has no effect on users.Motivation and Context
Partial solution for #197. We could close it if we make a decision on how to handle
signac_statepoints.json
in this PR.Types of Changes
1The change breaks (or has the potential to break) existing functionality.
Checklist:
If necessary: