Slurm: different workers on the same VM don't share the same cache #3710

Closed
epicfaace opened this issue Jul 14, 2021 · 17 comments · Fixed by #3861
@epicfaace (Member)
For the Slurm worker manager.

@epicfaace added the goal-performance and p1 (Do it in the next two weeks) labels on Jul 14, 2021

@epicfaace (Member Author) commented Jul 16, 2021

Another solution:

  • Have one dependency manager per directory (usually just one running per VM); workers communicate with it through an API.

Solution we agreed on (a rough schema sketch is below):

  • Multiple dependency managers use a sqlite database as a lockfile, stored in the shared "cached dependencies" folder:
    • store a reference count per dependency (the number of bundles currently using it) -- once it reaches 0, it's OK to delete the dependency
    • store whether a dependency is currently being downloaded, so the same dependency isn't downloaded multiple times (note: we can also store a last-used timestamp to keep track of dead workers)
    • store whether a dependency is currently being deleted, so the same dependency isn't deleted multiple times
  • Update the dependency manager to store worker-state and runs locally, but store cached dependencies in a shared location.
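
A minimal sketch of what that shared sqlite state could look like, assuming we track each dependency's reference count and in-progress download/delete flags in one table. Everything here (the table name, columns, and the try_start_download helper) is hypothetical and only illustrates the idea; it is not the actual CodaLab dependency manager:

import sqlite3
import time

# Hypothetical schema for the shared dependency-state database; names are
# illustrative only, not the actual CodaLab implementation.
SCHEMA = """
CREATE TABLE IF NOT EXISTS dependencies (
    dependency_key  TEXT PRIMARY KEY,            -- e.g. parent bundle UUID + subpath
    reference_count INTEGER NOT NULL DEFAULT 0,  -- number of bundles currently using it
    downloading     INTEGER NOT NULL DEFAULT 0,  -- 1 while some worker is downloading it
    deleting        INTEGER NOT NULL DEFAULT 0,  -- 1 while some worker is deleting it
    last_used       REAL                         -- unix timestamp, for detecting dead workers
)
"""


def try_start_download(db_path: str, dependency_key: str) -> bool:
    """Atomically claim the right to download a dependency.

    Returns True if this worker should perform the download, False if another
    worker is already downloading it (or it is being deleted).
    """
    # isolation_level=None disables the sqlite3 module's implicit transactions,
    # so we control the transaction explicitly. BEGIN IMMEDIATE takes the
    # database write lock up front, serializing this read-modify-write section
    # across processes.
    conn = sqlite3.connect(db_path, timeout=30, isolation_level=None)
    try:
        conn.execute(SCHEMA)
        conn.execute("BEGIN IMMEDIATE")
        row = conn.execute(
            "SELECT downloading, deleting FROM dependencies WHERE dependency_key = ?",
            (dependency_key,),
        ).fetchone()
        if row is None:
            conn.execute(
                "INSERT INTO dependencies (dependency_key, downloading, last_used) VALUES (?, 1, ?)",
                (dependency_key, time.time()),
            )
            conn.execute("COMMIT")
            return True
        downloading, deleting = row
        if downloading or deleting:
            conn.execute("ROLLBACK")
            return False
        conn.execute(
            "UPDATE dependencies SET downloading = 1, last_used = ? WHERE dependency_key = ?",
            (time.time(), dependency_key),
        )
        conn.execute("COMMIT")
        return True
    finally:
        conn.close()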

@epicfaace (Member Author)

About locking / sqlite:

https://www.sqlite.org/lockingv3.html

#2366

@epicfaace (Member Author)

Storage for GKE:
Objective: Share a cache among VMs; each VM has one worker at a time.

  • dependencies - stored in the PersistentVolume (common to all workers)
  • worker-state.json for each worker - stored locally on the VM
  • runs - stored locally on the VM (with #3716 (Support model checkpoints), we could eventually store them on the PersistentVolume)

Storage for Slurm:
Objective: Share a cache among workers on a single VM; each VM has multiple workers at a time (see the directory sketch below).

  • dependencies - stored on the VM (shared between workers)
  • worker-state.json for each worker - stored on the VM in a worker-specific directory
  • runs - stored on the VM in a worker-specific directory
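
A minimal sketch of that Slurm layout, assuming one shared dependency cache per VM and one private directory per worker. The directory and file names here are hypothetical, not the actual worker implementation:

from pathlib import Path

# Hypothetical Slurm layout: one shared dependency cache per VM, plus a
# private directory per worker for its state file and runs.
def worker_paths(vm_scratch_dir: str, worker_id: str) -> dict:
    root = Path(vm_scratch_dir)
    shared_dependencies = root / "dependencies"    # shared by every worker on this VM
    worker_dir = root / "workers" / worker_id      # private to this worker
    return {
        "dependencies": shared_dependencies,
        "dependencies_state": shared_dependencies / "dependencies-state.sqlite",  # shared lock/state DB
        "worker_state": worker_dir / "worker-state.json",
        "runs": worker_dir / "runs",
    }

if __name__ == "__main__":
    # Two workers on the same VM share the dependency cache but not their runs.
    a = worker_paths("/scratch/codalab", "worker-0")
    b = worker_paths("/scratch/codalab", "worker-1")
    assert a["dependencies"] == b["dependencies"]
    assert a["runs"] != b["runs"]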

@teetone (Collaborator) commented Oct 13, 2021

A couple of issues with FileLock:

  1. The basic concurrency test fails on Windows, so the test is disabled upstream: Soft filelock sometimes deadlocks on PyPy3 Windows (tox-dev/filelock#101)
  2. The stress test I created fails over NFS (on the cluster) with the following error:
============================= test session starts ==============================
platform linux -- Python 3.6.12, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /sailhome/tonyhlee/codalab/py-filelock, configfile: tox.ini
collected 16 items

tests/test_filelock_codalab.py FF..............                          [100%]

=================================== FAILURES ===================================
___________ test_threaded_lock_different_lock_obj_many[SoftFileLock] ___________

self = <ExThread(g0_t4, stopped 139857540142848)>

    def run(self) -> None:
        try:
>           super().run()

tests/test_filelock_codalab.py:201: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <ExThread(g0_t4, stopped 139857540142848)>

    def run(self):
        """Method representing the thread's activity.
    
        You may override this method in a subclass. The standard run() method
        invokes the callable object passed to the object's constructor as the
        target argument, if any, with sequential and keyword arguments taken
        from the args and kwargs arguments, respectively.
    
        """
        try:
            if self._target:
>               self._target(*self._args, **self._kwargs)

/u/nlp/anaconda/main/anaconda3/envs/tonyhlee_codalab/lib/python3.6/threading.py:864: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def t() -> None:
        for _ in range(10_000):
>           with lock:

tests/test_filelock_codalab.py:33: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <filelock._soft.SoftFileLock object at 0x7f332568af98>

    def __enter__(self) -> "BaseFileLock":
        """
        Acquire the lock.
    
        :return: the lock object
        """
>       self.acquire()

src/filelock/_api.py:199: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <filelock._soft.SoftFileLock object at 0x7f332568af98>, timeout = -1.0
poll_intervall = 0.05

    def acquire(self, timeout: Optional[float] = None, poll_intervall: float = 0.05) -> AcquireReturnProxy:
        """
        Try to acquire the file lock.
    
        :param timeout: maximum wait time for acquiring the lock, ``None`` means use the default :attr:`~timeout` is and
         if ``timeout < 0``, there is no timeout and this method will block until the lock could be acquired
        :param poll_intervall: interval of trying to acquire the lock file
        :raises Timeout: if fails to acquire lock within the timeout period
        :return: a context object that will unlock the file when the context is exited
    
        .. code-block:: python
    
            # You can use this method in the context manager (recommended)
            with lock.acquire():
                pass
    
            # Or use an equivalent try-finally construct:
            lock.acquire()
            try:
                pass
            finally:
                lock.release()
    
        .. versionchanged:: 2.0.0
    
            This method returns now a *proxy* object instead of *self*,
            so that it can be used in a with statement without side effects.
    
    
        """
        # Use the default timeout, if no timeout is provided.
        if timeout is None:
            timeout = self.timeout
    
        # Increment the number right at the beginning. We can still undo it, if something fails.
        with self._thread_lock:
            self._lock_counter += 1
    
        lock_id = id(self)
        lock_filename = self._lock_file
        start_time = time.time()
        try:
            while True:
                with self._thread_lock:
                    if not self.is_locked:
                        _LOGGER.debug("Attempting to acquire lock %s on %s", lock_id, lock_filename)
>                       self._acquire()

src/filelock/_api.py:155: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <filelock._soft.SoftFileLock object at 0x7f332568af98>

    def _acquire(self) -> None:
>       raise_on_exist_ro_file(self._lock_file)

src/filelock/_soft.py:13: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

filename = '/sailhome/tonyhlee/codalab/codalab.txt.lock'

    def raise_on_exist_ro_file(filename: str) -> None:
        try:
            file_stat = os.stat(filename)  # use stat to do exists + can write to check without race condition
        except OSError:
            return None  # swallow does not exist or other errors
    
        if file_stat.st_mtime != 0:  # if os.stat returns but modification is zero that's an invalid os.stat - ignore it
            if not (file_stat.st_mode & stat.S_IWUSR):
>               raise PermissionError(f"Permission denied: {filename!r}")
E               PermissionError: Permission denied: '/sailhome/tonyhlee/codalab/codalab.txt.lock'

src/filelock/_util.py:13: PermissionError

The test fails intermittently with a few locks, but fails all the time with many locks.

@teetone (Collaborator) commented Oct 13, 2021

The stress test:

import sys
import threading
from pathlib import Path
from typing import Callable, Optional, Type

import pytest
from filelock import BaseFileLock, SoftFileLock


class ExThread(threading.Thread):
    # Minimal stand-in for the ExThread helper from the filelock test suite:
    # it captures an exception raised in run() and re-raises it on join().
    def __init__(self, target: Callable[[], None], name: str) -> None:
        super().__init__(target=target, name=name)
        self.ex: Optional[BaseException] = None

    def run(self) -> None:
        try:
            super().run()
        except BaseException as e:
            self.ex = e

    def join(self, timeout: Optional[float] = None) -> None:
        super().join(timeout=timeout)
        if self.ex is not None:
            raise self.ex


@pytest.mark.parametrize("lock_type", [SoftFileLock])
@pytest.mark.skipif(hasattr(sys, "pypy_version_info") and sys.platform == "win32", reason="deadlocks randomly")
def test_threaded_lock_different_lock_obj_many(lock_type: Type[BaseFileLock], tmp_path: Path) -> None:
    # Runs multiple threads, which acquire the same lock file with different FileLock objects.
    # While one thread holds the lock, no lock object other than its own may report is_locked.
    number_of_locks = 100
    number_of_groups = 3
    # TODO: change to a path over NFS
    lock_path = tmp_path / "a"

    def create_run(lock_id):
        lock = locks[lock_id]

        def t() -> None:
            for _ in range(10_000):
                with lock:
                    assert lock.is_locked
                    for i, lock_to_check in enumerate(locks):
                        if i == lock_id:
                            continue
                        assert not lock_to_check.is_locked
                    assert lock.is_locked

        return t

    locks = [lock_type(str(lock_path)) for _ in range(number_of_locks)]
    all_threads = [
        [ExThread(create_run(lock_id), f"g{group}_t{lock_id}") for lock_id in range(number_of_locks)]
        for group in range(number_of_groups)
    ]

    for threads in all_threads:
        for thread in threads:
            thread.start()

    for threads in all_threads:
        for thread in threads:
            thread.join()

    for lock in locks:
        assert not lock.is_locked

@epicfaace (Member Author)

stack trace

@epicfaace (Member Author)

@teetone can you print the stack trace that you got, and can you confirm that you ran it with the assert not lock_to_check.is_locked check in place?

@teetone (Collaborator) commented Oct 14, 2021

@epicfaace I updated the comment above with the full stack trace, and I verified that it fails with the assert not check in place; I had copied the wrong file earlier.

@epicfaace (Member Author) commented Oct 20, 2021

Next steps (by 10/27):

  • Compare how long it takes to read/write a JSON file normally vs. how long it takes to read/write the same file under a lock while 50 different workers are trying to access it (a rough benchmark sketch is below).
  • Finish implementing the PR.
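
A rough sketch of that comparison, using the filelock package for the locked case. The file paths, worker count, and iteration count are made-up values for illustration; this is not the benchmark that was actually run:

import json
import time
from multiprocessing import Pool

from filelock import FileLock

# Hypothetical paths and sizes, just for illustration.
STATE_PATH = "/tmp/worker-state-benchmark.json"
LOCK_PATH = STATE_PATH + ".lock"
ITERATIONS = 100
NUM_WORKERS = 50


def read_write_plain(_worker_id: int) -> float:
    # Read and rewrite the JSON file with no locking. Concurrent writers can
    # interleave, so a read may see a truncated file; we tolerate that here just
    # to keep the timing loop going (this race is exactly what the lock prevents).
    start = time.time()
    for _ in range(ITERATIONS):
        try:
            with open(STATE_PATH) as f:
                state = json.load(f)
        except (json.JSONDecodeError, FileNotFoundError):
            state = {"counter": 0}
        state["counter"] = state.get("counter", 0) + 1
        with open(STATE_PATH, "w") as f:
            json.dump(state, f)
    return time.time() - start


def read_write_locked(_worker_id: int) -> float:
    # The same read/modify/write cycle, but serialized through a file lock.
    lock = FileLock(LOCK_PATH)
    start = time.time()
    for _ in range(ITERATIONS):
        with lock:
            with open(STATE_PATH) as f:
                state = json.load(f)
            state["counter"] = state.get("counter", 0) + 1
            with open(STATE_PATH, "w") as f:
                json.dump(state, f)
    return time.time() - start


if __name__ == "__main__":
    with open(STATE_PATH, "w") as f:
        json.dump({"counter": 0}, f)

    for name, fn in [("plain", read_write_plain), ("locked", read_write_locked)]:
        with Pool(NUM_WORKERS) as pool:  # 50 "workers" hammering the same file
            durations = pool.map(fn, range(NUM_WORKERS))
        print(f"{name}: mean {sum(durations) / len(durations):.2f}s per worker")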

@epicfaace (Member Author)

Latest design to solve this issue is in this doc: https://docs.google.com/document/d/13_vV7CGngek_I6BWmD7wNNjsN5TMunR8qVIGtfJYzP0/edit

@pranavjain (Contributor)

Currently testing and debugging with Ashwin.
Tentatively, putting up a PR by Feb 11.

@epicfaace (Member Author) commented Feb 5, 2022 via email

Tony, are there any new issues since the code fix we added on Wednesday?

-- Ashwin Ramaswami

@teetone (Collaborator) commented Feb 5, 2022

It looks promising. I'm still testing at the moment.

@pranavjain (Contributor)

@teetone Please update the ETA

@pranavjain (Contributor)

@teetone is there any update on this one? Can you please update the ETA?

@pranavjain (Contributor)

Merged this one. Will be testing on CS324 once @epicfaace deploys it on that instance.

The mergify bot closed this as completed in #3861 on Apr 4, 2022.