Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor/reorganize root dir #517

Closed
wants to merge 8 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 4 additions & 1 deletion signac/__main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -1138,7 +1138,10 @@ def jobs():
else: # interactive
if READLINE:
if "PyPy" not in platform.python_implementation():
fn_hist = project.fn(".signac_shell_history")
# Attempt migration. To be removed in signac 2.0.
old_fn_hist = ".signac_shell_history"
fn_hist = ".signac/signac_shell_history"
project._migrate_internal_file(old_fn_hist, fn_hist)
try:
readline.read_history_file(fn_hist)
readline.set_history_length(1000)
Expand Down
43 changes: 37 additions & 6 deletions signac/contrib/project.py
Original file line number Diff line number Diff line change
Expand Up @@ -221,7 +221,9 @@ class Project:
FN_STATEPOINTS = "signac_statepoints.json"
"The default filename to read from and write state points to."

FN_CACHE = ".signac_sp_cache.json.gz"
_OLD_FN_CACHE = ".signac_sp_cache.json.gz"

FN_CACHE = ".signac/signac_sp_cache.json.gz"
"The default filename for the state point cache file."

_use_pandas_for_html_repr = True # toggle use of pandas for html repr
Expand Down Expand Up @@ -1982,13 +1984,38 @@ def _add(_id):
else:
logger.debug("In-memory cache is up to date.")

# This function should be removed in signac 2.0.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't remove this in 2.0 without breaking backwards compatibility with 1.x. Should this migration be handled as a "schema migration" like #253 instead of a signac 2.0 migration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how we should handle this. Maybe something else worth discussing in the meeting today.

I think that migration of signac.rc to .signac/config is possibly something to handle via explicit schema migration, because that is user-facing right now, but all of the other files are purely internal so making the change transparently makes more sense to me. I agree that we can't remove this in signac 2.0, but I don't know how long we would continue to include this code. Do we retain it indefinitely? Remove in 3.0?

Furthermore, if we treated this like a schema migration, how long would signac support older schema versions? We would probably have to require migrations before users could use newer signac versions because we might make assumptions about the workspace that may not be valid for older data spaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bdice @csadorf to expand on this, since it was apparently decided at the last developer meeting we are choosing to leave signac.rc at the project root rather than moving to .signac/config, all the files moving are purely internal and not user-facing. The disappearance of any of these files will not invalidate the data space or change the way a user interacts with it. I would not consider this a schema migration, but rather a reshuffling of our internals. In fact, I'm not even sure users need to be aware of this change if signac.rc is not moving. Therefore, I don't think we should require any user input to perform this migration, which is what the migration API is based on.

Regarding when we remove this function, I would be fine removing it in 2.0 if we are not moving signac.rc. The worst case scenario is that someone doesn't touch some signac project between now and version 2.0, and then they have to rerun signac update-cache. If someone goes this long without touching a project, I don't think the loss of signac shell histories is a problem either, and an errant .signac_shell_history file persisting at the project root forever is probably not something anyone would ever notice. We could also choose to keep this function for longer if we'd like, but in the long term the main issue I could see with this is people accessing signac data spaces that are archived to read-only directories, in which case the file move would fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vyasr @csadorf

Therefore, I don't think we should require any user input to perform this migration, which is what the migration API is based on.

The current approach in this PR of assuming that no project will be negatively affected by moving/changing/deleting files without user confirmation just seems like it's not a good idea. 😕 For example, this PR might not be safe if someone ran parallel jobs and triggered automatic file movements with race conditions. This is a migration as far as I can tell, and it is what we're trying to safeguard with the work @csadorf did for interactive migrations. I believe we need to make this a manual (and optional) migration process for the user.

Because of the user burden that a manual migration incurs, we could wait until finalizing any other potential schema changes (e.g. changes necessary for TOML support or other signac 2.0 plans) to introduce the migration formally. (Side note: that includes the proposed workspace directory changes, if we need to prompt the user to create a symlink workspace pointing to the current project config's defined workspace path.)

Alternatively, we can just make this PR the definition of "schema version 2" and apply future changes in schema versions 3, 4, etc. As long as the migrations are well-defined, it should be easy to make them apply sequentially. signac 2.0 could have schema version 10 (composed of separate changes to cache files, workspaces, configs, etc.), and that would be perfectly fine with me. I think I like this idea best.

If we want to treat signac like a mature database application, I think that safeguards like user-controlled migrations that involve metadata changes or file movements are expected and necessary. I understand that the stakes are low in this case but we shouldn't skip over using the infrastructure that is designed specifically for this purpose. That said, I don't mean for my vote to be an overriding one if other @glotzerlab/signac-committers agree to the approach in this PR (or are willing to accept it as a trade-off of safety / convenience / etc.).

Regarding when we remove this function, I would be fine removing it in 2.0 if we are not moving signac.rc. The worst case scenario is that someone doesn't touch some signac project between now and version 2.0, and then they have to rerun signac update-cache. If someone goes this long without touching a project, I don't think the loss of signac shell histories is a problem either, and an errant .signac_shell_history file persisting at the project root forever is probably not something anyone would ever notice. We could also choose to keep this function for longer if we'd like, but in the long term the main issue I could see with this is people accessing signac data spaces that are archived to read-only directories, in which case the file move would fail.

Timeliness of access (pre-1.7, pre-2.0, post-2.0) shouldn't govern the outcome here. Schema versions should.

Copy link
Contributor Author

@vyasr vyasr Mar 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are good points. I'll try and address them:

  • While valid, I don't think the parallel-safety argument should have any impact on whether or not this is a schema migration. We can always use a lock to make the file move safe if necessary.
  • I agree that the assumption that this will always be safe may be problematic. If we're concerned about that, then we may have to make this an opt-in process in some fashion.
  • I think part of what has confused me during this discussion is that signac is using overloaded terminology. I've never spent much time with the schema migration code, so I hadn't thought about it until you mentioned it, and I think the naming put me in the wrong frame of mind. To your point about treating signac as a mature db, I don't actually think that comparison applies here because this problem is quite different from what databases face. Changes in signac's data model are not schema migrations in the sense of a database. A schema migration describes changes in a user's chosen layout of data stored using signac as we describe in this recipe. What we're doing here is more like a "database migration", analogous to MongoDB changing how it stores indexes internally between versions. True databases have the benefit of storing their data internally and therefore controlling the layout. To my knowledge (I could be wrong about this), to migrate data stored in an older version of MongoDB server, the solution is just to upgrade the MongoDB server software itself. There may also be other tools built to support this (for example this tool for MySQL supports migrating databases between newer versions of MySQL), but in general I don't think databases face the same problem we do where the database software can be updated without also updating the underlying data storage format. I don't know if there's a suitable renaming of schema_version that would avoid this problem, though, nor have I thought at all about the backwards compatibility implications of such a change.
  • The main reason that I'm advocating to make this transparent is because all the files being moved are internal (again, analogous to something like an internal SQL index). That kind of change seems like it should be entirely transparent, but I'm open to being convinced otherwise.
  • I think that the existing schema migration code is probably appropriate to use for migrating the workspace That's not 100% pertinent to this discussion, but worth ironing out here since we'll have to do that soon enough. We could help out users there by checking whether the workspace_dir key is set in signac.rc, and if so, raising an exception indicating that the user should run the migration unless workspace_dir = workspace.
  • This change also needs to account for how it impacts signac-flow. signac-flow has no concept of schema version, nor should it because it's not a database, but signac also has no knowledge of signac-flow. What happens if a user runs a manual schema migration with signac on a project with bundled jobs submitted, then tries to check status? Conversely, what happens if a user upgrades signac-flow and .bundles gets moved to .signac/bundles, but they're still using an older version of flow. Does signac-flow need to have a way to patch into the migration? Does the migration logic already provide hooks for this (not to my knowledge)? Does that become a separate migration process?

Copy link
Contributor Author

@vyasr vyasr Mar 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One follow-up on my last bullet point above. The fact that signac-flow is susceptible to this kind of problem suggests to me that flow should probably be storing its internal files independently of signac (e.g. in .flow). This kind of interdependency is problematic; if for some reason in the future we decide to rename .signac to .party_parrot, signac shouldn't have to worry about preserving the directory because signac-flow (or another signac-adjacent package) might be using it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I consider the internal layout of our files part of the "data schema". So yes, there are two different levels of "schemas", but I consider the file layout part of it.
  2. Applying these changes automatically is indeed problematic, so whether we use the framework or not, we should be extremely careful about that, and probably not a good idea.
  3. Data schema migrations are not always user-facing. For example, AiiDA applies migrations all the time and to my knowledge none of those are "user-facing". I am not even sure what that would mean in the context of signac. The only guarantees we make is that our API works, there are no guarantees about the internal structure of a state point file for instance.
  4. I think it is perfectly fine to keep all migrations more or less indefinitely as long as the costs of code maintenance is very low. Once there is a maintenance cost we can require that users install older versions for parts of the migration. For example, it might require signac 2.x to go from 0->8 and then signac 3.x from 8->$current. In this way only very old projects would need this kind of extra step, but the migration would still be possible.
  5. I disagree that "flow" has no data schema and thus does not need any migrations. It absolutely has a data schema in the form of how we store our internal data structures (both on the file system directly and using signac core).
  6. Unfortunately that is something that we did not fully consider when we introduced the migration framework. I am not sure what the best way to deal with that. One solution would be to simply copy the migration framework to flow and maintain flow-related migrations in flow. This PR is effectively introducing an ad-hoc migration, but without the framework.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This discussion is sufficient to convince me that we should not perform this change automatically. That being the case, we might as well use the migration tooling. In that case I'd probably want to do what @bdice suggested and make a new branch off of master in which to collect all the relevant change (the ones in this PR, workspace, maybe TOML), then make the schema version change there once we're happy with all the changes. Maintaining migrations perpetually seems fine to me, I don't see much overhead arising there, at least not for a while (Python 4 etc).

I disagree that the only guarantee we make is that our API works (or at least, I disagree that that is all we aspire to). IMO one of the goals of signac is to store data in a layout that is well-defined and can be parsed even without signac to maximize data sharing and lifetime. A major strength of our model is that signac_statepoint.json clearly identifies the data in a directory. To the extent that I would not consider changing that (e.g. by renaming the file, changing its format, etc, or by a more systematic generalization of our data model as we planned in the original signac 2.0 prototype) I would consider that part of what we promise, or at least something we should aspire to offer.

I agree that data schema migrations are not always user-facing. One way to think about this in the context of AiiDA is that AiiDA is a database built on top of a database i.e. it stores data within another database according to a well-defined internal schema, and that schema can change. Therefore, client code must also account for this. I hadn't considered signac this way since we market signac as a "database on the filesystem", but now I agree that effectively applies with the files on the filesystem as our "underlying database".

From that perspective, flow also has an internal schema. Other workflow managers must have the same problem. AiiDA gets around this by having the workflow and data in the same package, so a single update/schema migration can solve this. What does Fireworks do? They are probably the closest point of comparison for signac-flow as an independent workflow tool. Do they have a procedure for what happens if their internal MongoDB storage layout changes? I agree that it seems like using the migration tooling seems like the right approach for flow, perhaps it just needs to be generalized enough that any tool built on signac can reuse it safely to avoid duplication (maybe that's already largely the case, I'm not sure).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't considered signac this way since we market signac as a "database on the filesystem", but now I agree that effectively applies with the files on the filesystem as our "underlying database".

Yes! This is how I think about it as well. I am happy with the proposal to combine all schema changes into one PR and handle flow separately.

def _migrate_internal_file(self, old_file, new_file):
"""Migrate files based on the 1.x names to the new 2.x schema.

Various files used internally by signac are placed at the project root in
signac 1.x but will be placed in a .signac directory in 2.x. This function
provides a standardized migration protocol.

Parameters
----------
old_file : str
The original filename.
new_file : str
The new filename.

"""
old_fn = self.fn(old_file)
new_fn = self.fn(new_file)

if os.path.exists(old_fn):
_mkdir_p(os.path.dirname(new_fn))
os.rename(old_fn, new_fn)
logger.info(f"Moved {old_fn}->{new_fn} for signac 2.0 schema migration.")

def _remove_persistent_cache_file(self):
"""Remove the persistent cache file (if it exists)."""
try:
os.remove(self.fn(self.FN_CACHE))
except OSError as error:
if error.errno != errno.ENOENT:
raise error
for fn in (self.FN_CACHE, self._OLD_FN_CACHE):
try:
os.remove(self.fn(fn))
except OSError as error:
if error.errno != errno.ENOENT:
raise error

def update_cache(self):
"""Update the persistent state point cache.
Expand All @@ -2005,10 +2032,12 @@ def update_cache(self):
cache = self._read_cache()
cached_ids = set(self._sp_cache)
self._update_in_memory_cache()

if cache is None or set(cache) != cached_ids:
fn_cache = self.fn(self.FN_CACHE)
fn_cache_tmp = fn_cache + "~"
try:
_mkdir_p(self.fn(".signac"))
with gzip.open(fn_cache_tmp, "wb") as cachefile:
cachefile.write(json.dumps(self._sp_cache).encode())
except OSError: # clean-up
Expand All @@ -2028,6 +2057,8 @@ def update_cache(self):
def _read_cache(self):
"""Read the persistent state point cache (if available)."""
logger.debug("Reading cache...")
# To be removed in signac 2.0.
self._migrate_internal_file(self._OLD_FN_CACHE, self.FN_CACHE)
start = time.time()
try:
with gzip.open(self.fn(self.FN_CACHE), "rb") as cachefile:
Expand Down