Concurrency issues with state.json can lead to inadvertently changed states #2361

Open
nelson-liu opened this issue Jun 5, 2020 · 0 comments
Labels
backend bug p3 Do it some day.

Comments

@nelson-liu
Collaborator

Describe the bug
It seems like the CLI's session data in state.json is being modified outside of the current session, but I can't figure out where or by what. At a high level, I have a script that does cl work <something>, and then a bunch of cl runs. While it's in the middle of those cl runs, the state.json file somehow gets modified, and as a result some of the runs fail because they can't find bundles that are supposed to be in the worksheet they're in (the current worksheet for the session was mysteriously changed). As far as I've investigated, I think it has something to do with authentication: whenever state.json is modified unintentionally, the token expiry is pushed back.

As an experiment last night, I deleted state.json and walked away from my computer. When I came back, state.json had been recreated, even though there were no running cl processes that I know of (this is on the NLP cluster, so the process could be on any of the head or worker nodes).
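One way this symptom could arise is a stale read-modify-write: a second process reads the whole file, the script changes its worksheet, and the second process later writes back its old copy with a refreshed token. This is only a hypothesis, not something confirmed here. The sketch below simulates it sequentially; the helper names, the simplified state layout, and the placeholder values (`0xOLD`, `0xNEW`, the expiry numbers) are all made up for illustration and are not CodaLab's code.

```python
import json
import os
import tempfile


def load_state(path):
    """Read the whole JSON state file into a dict."""
    with open(path) as f:
        return json.load(f)


def save_state(path, state):
    """Overwrite the whole JSON state file with `state`."""
    with open(path, "w") as f:
        json.dump(state, f)


def demo_stale_write(path):
    # Initial file: session "23215" points at some worksheet (placeholder value).
    save_state(path, {
        "auth": {"expires_at": 100.0},
        "sessions": {"23215": {"worksheet_uuid": "0xOLD"}},
    })

    # Hypothetical process A (e.g. on another node) reads the whole file.
    stale = load_state(path)

    # Process B (the script) does the equivalent of `cl work` for its session.
    fresh = load_state(path)
    fresh["sessions"]["23215"]["worksheet_uuid"] = "0xNEW"
    save_state(path, fresh)

    # Process A refreshes its token and writes back its whole, stale copy.
    stale["auth"]["expires_at"] = 200.0
    save_state(path, stale)

    return load_state(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        final = demo_stale_write(os.path.join(tmp, "state.json"))
        print(final["sessions"]["23215"]["worksheet_uuid"])
        print(final["auth"]["expires_at"])
```

In this simulation the final file has the later token expiry but the earlier worksheet, which is the same combination described above. Whether the real CLI rewrites the whole file in this way is an assumption.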

More details:

I have a script that, broadly speaking, does a cl work <some worksheet name>, and then cl runs hundreds of bundles. This can take quite some time (~30 minutes). In my cl runs, I refer to bundles by name (not by uuid).

Sometimes, my cl runs fail with a NotFoundError: CodaLab is trying to run the bundle, but it can't resolve the name to a uuid. If I Ctrl-C and cat ~/.codalab/state.json, I see that the worksheet UUID assigned to my session has been modified! Throughout this whole process, I'm printing out the session ID, and it doesn't change between cl invocations, as expected (I've also tried setting CODALAB_SESSION, to no avail). Doing some extra logging, I noticed the following:

My script will cl info a bundle, and it works. Below, I describe the output of the logs.

I print out the state path it's reading from. This is the right state path:

Reading state_path /sailhome/nfliu/.codalab/state.json

I print out the read state. This looks fine. I added an extra hostname field to the sessions value, since I suspect something is going on with other nodes sharing the same CODALAB_HOME...

State {'auth': {'http://codalab.stanford.edu': {'token_info': {'access_token': 'PGbYhAMOGZYKo2cmyUBPZ4bgqYt6Vf', 'expires_at': 1591337015.2242806, 'refresh_token': 'Ns4pFceba0HLz8MLVBCN7acqZW1ozW', 'scope': 'default', 'token_type': 'Bearer'}, 'username': 'nfliu'}}, 'last_check_version_datetime': '2020-06-05 05:03:35', 'sessions': {'23215': {'address': 'http://codalab.stanford.edu', 'hostname': 'scdt.stanford.edu', 'worksheet_uuid': '0x09d026dfee1646b0a8439587af6752a6'}}}
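The session keys in the state above look like process IDs, and several nodes may share one CODALAB_HOME. A minimal sketch of why a PID-only key could collide across hosts, and how a hostname-qualified key would avoid that, might look like the following. It assumes that the session name defaults to the parent PID unless CODALAB_SESSION is set. Both functions and the second hostname are hypothetical and not taken from the CodaLab source.

```python
def pid_session_name(environ, parent_pid):
    """Assumed default: CODALAB_SESSION if set, otherwise the parent PID."""
    return environ.get("CODALAB_SESSION", str(parent_pid))


def host_qualified_session_name(environ, parent_pid, hostname):
    """Hypothetical alternative: qualify the PID with the hostname."""
    explicit = environ.get("CODALAB_SESSION")
    if explicit is not None:
        return explicit
    return f"{hostname}:{parent_pid}"


if __name__ == "__main__":
    # Two shells on different hosts that happen to share a PID.
    hosts = ["scdt.stanford.edu", "other-node.example"]  # second name is made up
    print({pid_session_name({}, 23215) for _ in hosts})
    print({host_qualified_session_name({}, 23215, h) for h in hosts})
```

Since setting CODALAB_SESSION explicitly did not make the problem go away, a key collision alone probably does not explain everything. In practice the hostname would come from something like `socket.gethostname()`.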

I print out the intermediate variables when cl info runs parse_client_worksheet_uuid. This all looks fine, and it returns the right worksheet UUID as expected by the state.

parse_client_worksheet_uuid#spec None
parse_client_worksheet_uuid#worksheet_util.CURRENT_WORKSHEET .
parse_client_worksheet_uuid#Empty spec, returning current worksheet
session#name 23215
{'address': 'http://codalab.stanford.edu', 'hostname': 'scdt.stanford.edu', 'worksheet_uuid': '0x09d026dfee1646b0a8439587af6752a6'}

I also log the returned worksheet UUID, and it's correct.

get_current_worksheet_uuid from session 0x09d026dfee1646b0a8439587af6752a6
get_current_worksheet_uuid 0x09d026dfee1646b0a8439587af6752a6
parse_client_worksheet_uuid#retval 0x09d026dfee1646b0a8439587af6752a6

However, immediately after, I try to do cl info again on a different bundle in the same worksheet, and things fail. It isn't able to find the bundle. The log output is sort of surprising:

It reads the state from the same file, but for some reason the state has been modified. Now it has dozens of session entries, and I'm not sure where they all came from. Notably, these entries don't have the hostname field, so at least they weren't written by my current session running cl info.

Also, note that the expires_at on the auth token changes. However, I added a logging statement to the _cache_token function that

Reading state_path /sailhome/nfliu/.codalab/state.json
State {'auth': {'http://codalab.stanford.edu': {'token_info': {'access_token': '7vgrZa651Tum2qdSuypxlOHvWZtFfb', 'expires_at': 1591338738.799319, 'refresh_token': '18JuckW1gAUjAYDW2IJHoje2UZMw8J', 'scope': 'default', 'token_type': 'Bearer'}, 'username': 'nfliu'}, 'https://worksheets.codalab.org': {'token_info': {'access_token': 'y1W0g8x3phprj4cs0BODGrSHMMygkT', 'expires_at': 1591162591.4494832, 'refresh_token': 'FvGmYaONC9g9pWC9npOPhpDsLsxkEA', 'scope': 'default', 'token_type': 'Bearer'}, 'username': 'nfliu'}}, 'last_check_version_datetime': '2020-06-05 04:48:06', 'sessions': {'10098': {'address': 'https://worksheets.codalab.org', 'worksheet_uuid': ''}, ... <elided other sessions> ..., '23215': {'address': 'http://codalab.stanford.edu', 'worksheet_uuid': '0x6f70bd37400345c3933a17f413de87cc'}, ... <elided other sessions> ...}}
parse_client_worksheet_uuid#spec None
parse_client_worksheet_uuid#worksheet_util.CURRENT_WORKSHEET .
parse_client_worksheet_uuid#Empty spec, returning current worksheet
session#name 23215
{'address': 'http://codalab.stanford.edu', 'worksheet_uuid': '0x6f70bd37400345c3933a17f413de87cc'}
get_current_worksheet_uuid from session 0x6f70bd37400345c3933a17f413de87cc
get_current_worksheet_uuid 0x6f70bd37400345c3933a17f413de87cc
parse_client_worksheet_uuid#retval 0x6f70bd37400345c3933a17f413de87cc
NotFoundError: bundle spec run-rasor-synthetic_cloze_worddropout.50000-train-sampled_hps_d61e287707c6c2f3029a3da3721863c1 doesn't match any bundles
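To make the difference between the two logged states concrete, here is a small sketch that compares them. The dicts are trimmed copies of the values printed above, with the tokens and the elided sessions left out, and `diff_sessions` is a helper written for this illustration only.

```python
def diff_sessions(before, after):
    """Return (added, removed, changed) session names between two states."""
    b = before.get("sessions", {})
    a = after.get("sessions", {})
    added = sorted(set(a) - set(b))
    removed = sorted(set(b) - set(a))
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return added, removed, changed


ADDR = "http://codalab.stanford.edu"

# First logged state (trimmed).
before = {
    "auth": {ADDR: {"token_info": {"expires_at": 1591337015.2242806}}},
    "sessions": {
        "23215": {
            "address": ADDR,
            "hostname": "scdt.stanford.edu",
            "worksheet_uuid": "0x09d026dfee1646b0a8439587af6752a6",
        },
    },
}

# Second logged state (trimmed; the elided sessions are not reproduced).
after = {
    "auth": {ADDR: {"token_info": {"expires_at": 1591338738.799319}}},
    "sessions": {
        "10098": {"address": "https://worksheets.codalab.org", "worksheet_uuid": ""},
        "23215": {
            "address": ADDR,
            "worksheet_uuid": "0x6f70bd37400345c3933a17f413de87cc",
        },
    },
}

added, removed, changed = diff_sessions(before, after)
expiry_shift = (
    after["auth"][ADDR]["token_info"]["expires_at"]
    - before["auth"][ADDR]["token_info"]["expires_at"]
)

if __name__ == "__main__":
    print("added:", added)
    print("removed:", removed)
    print("changed:", changed)
    print("expiry shift (s):", round(expiry_shift, 3))
```

On these trimmed inputs, session 23215 is reported as changed (different worksheet UUID, no hostname field), and the token expiry moves later by roughly 1724 seconds, a little under half an hour.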

To Reproduce
I'm working on a minimal reproducible example, but I think it's somewhat machine/setup specific...
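Until there is a proper reproduction, one low-tech aid is a watcher that records when the file's contents change, so the timestamps can be correlated with whatever was running on the cluster at that moment. The sketch below polls a path and logs each change along with the local hostname. It is a debugging helper written for this illustration, not part of cl, and it shows when a change became visible on this host, not which process made it.

```python
import hashlib
import os
import socket
import tempfile
import time


def fingerprint(path):
    """Return a SHA-256 hex digest of the file, or None if it does not exist."""
    try:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()
    except FileNotFoundError:
        return None


def watch(path, interval=1.0, max_polls=None, log=print):
    """Poll `path`, log every content change, and return the number of changes."""
    last = fingerprint(path)
    changes = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        polls += 1
        current = fingerprint(path)
        if current != last:
            changes += 1
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log(f"{stamp} {socket.gethostname()} {path} changed")
            last = current
    return changes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "state.json")
        with open(path, "w") as f:
            f.write("{}")
        # Nothing modifies the file here, so no change is logged.
        print(watch(path, interval=0.1, max_polls=3))
```

For the real file, this would be called as `watch(os.path.expanduser("~/.codalab/state.json"))` and left running. A deleted and later recreated file also shows up as a change, because the fingerprint goes from None to a digest.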


3 participants