Describe the bug
It seems like the session data in the CLI's state.json is being modified outside of the current session, but I can't figure out where or by what. At a high level, I have a script that does cl work <something> and then a bunch of cl run commands. While it's in the middle of those runs, the state.json file somehow gets modified, and as a result some of the runs fail because they can't find bundles that are supposed to be in their worksheet (the current worksheet for the session was mysteriously changed). As far as I've investigated, I think it has something to do with authentication: whenever state.json is modified unintentionally, the token expiry is pushed back.
As an experiment last night, I deleted state.json and walked away from my computer. When I came back, state.json had been recreated, even though there were no running cl processes that I know of (this is on the NLP cluster, so the process could be on any of the head or worker nodes).
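To narrow down when the file reappears or changes, a small polling watcher could be left running on one node. The following is a minimal sketch, not part of CodaLab: the helper names (`fingerprint`, `describe_change`, `watch`) are hypothetical, and it only reports when the file changed, not which host or process changed it.

```python
import hashlib
import os
import time
from typing import Optional, Tuple

# (mtime, size, sha256 hex digest), or None when the file is absent.
Fingerprint = Optional[Tuple[float, int, str]]


def fingerprint(path: str) -> Fingerprint:
    """Return a fingerprint of the file at `path`, or None if it is missing."""
    try:
        st = os.stat(path)
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
    except FileNotFoundError:
        return None
    return (st.st_mtime, st.st_size, digest)


def describe_change(before: Fingerprint, after: Fingerprint) -> Optional[str]:
    """Classify the difference between two fingerprints (None if identical)."""
    if before == after:
        return None
    if before is None:
        return "created"
    if after is None:
        return "deleted"
    if before[2] != after[2]:
        return "content changed"
    return "touched"  # same bytes, different mtime


def watch(path: str, interval: float = 1.0, max_polls: Optional[int] = None) -> None:
    """Poll `path` and print a timestamped line whenever it changes."""
    last = fingerprint(path)
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        current = fingerprint(path)
        change = describe_change(last, current)
        if change is not None:
            print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {path}: {change}")
        last = current
        polls += 1
```

Running `watch(os.path.expanduser("~/.codalab/state.json"))` overnight would at least give timestamps to correlate with cluster job logs. On a shared filesystem the reported mtime may lag behind the actual write.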
More details:
I have a script that, broadly speaking, does a cl work <some worksheet name> and then issues hundreds of cl run commands. This can take quite some time (~30 minutes). In my cl run commands, I refer to bundles by name (not by UUID).
Sometimes, my cl run commands fail with a NotFoundError: CodaLab is trying to run the bundle, but it can't resolve the name to a UUID. If I Ctrl-C and cat ~/.codalab/state.json, I see that the worksheet UUID assigned to my session has been modified! Throughout this whole process, I'm printing out the session ID, and as expected it doesn't change between cl invocations (I've also tried setting CODALAB_SESSION, to no avail). Doing some extra logging, I noticed the following:
My script will cl info a bundle, and it works. Below, I describe the output of the logs.
I print out the state path it's reading from; this is the right state path.
I print out the read state. This looks fine. I added an extra hostname field to the sessions value, since I suspect something is going on with other nodes sharing the same CODALAB_HOME...
I print out the intermediate variables when cl info runs parse_client_worksheet_uuid. This all looks fine, and it returns the right worksheet UUID as expected by the state.
I make sure to log the returned worksheet ID; it's correct.
get_current_worksheet_uuid from session 0x09d026dfee1646b0a8439587af6752a6
get_current_worksheet_uuid 0x09d026dfee1646b0a8439587af6752a6
parse_client_worksheet_uuid#retval 0x09d026dfee1646b0a8439587af6752a6
However, immediately after, I try to do cl info again on a different bundle in the same worksheet, and things fail. It isn't able to find the bundle. The log output is sort of surprising:
It reads the state from the same file, but for some reason, the state has been modified. Now, it has dozens of entries. I'm not sure where they all came from. It's interesting to note that these entries don't have the hostname field, so at least they weren't generated by my current session running cl info.
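To see exactly which entries appeared between two reads, the two logged states can be diffed. This is a rough sketch under assumptions: it assumes the state is a JSON object with a top-level "sessions" mapping from session name to a dict, the "worksheet_uuid" key in the usage example is a placeholder, and `diff_sessions` is a hypothetical helper, not a CodaLab function.

```python
from typing import Any, Dict, List


def diff_sessions(
    before: Dict[str, Any],
    after: Dict[str, Any],
    tag: str = "hostname",
) -> Dict[str, List[str]]:
    """Compare the 'sessions' mapping of two state snapshots.

    Returns the session names that were added, removed, or changed, plus the
    added ones that lack the debugging `tag` field (the extra hostname field
    described above).
    """
    old = before.get("sessions", {})
    new = after.get("sessions", {})
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    untagged = sorted(
        k for k in added if not (isinstance(new[k], dict) and tag in new[k])
    )
    return {
        "added": added,
        "removed": removed,
        "changed": changed,
        "untagged_added": untagged,
    }
```

For example, with placeholder values:

```python
before = {"sessions": {"s1": {"worksheet_uuid": "0xaaa", "hostname": "head1"}}}
after = {
    "sessions": {
        "s1": {"worksheet_uuid": "0xbbb", "hostname": "head1"},
        "s2": {"worksheet_uuid": "0xccc"},
    }
}
report = diff_sessions(before, after)
```

Here `report["changed"]` lists the session whose worksheet moved, and `report["untagged_added"]` lists new entries written by some process without the hostname patch.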
Also, note that the expires_at on the auth token changes. However, I added a logging statement to the _cache_token function that
To Reproduce
I'm working on a minimal reproducible example, but I think it's somewhat machine/setup specific...