
Error: failed to collect prefix records during concurrent execution on HPC (SLURM + Lustre) #5476

@fmerinocasallo

Description

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pixi, using pixi --version.

Reproducible example

  1. Set up a Pixi project on a shared network drive (Lustre).
  2. Submit multiple SLURM jobs (e.g., 10+) that run pixi install --frozen simultaneously in the same project directory (a minimal job-array sketch is shown after this list).
  3. Some jobs succeed, while others fail with the `File was modified during parsing` error.
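
For reference, a job-array script along these lines triggers the collision for me; the project path, job name, and array size below are placeholders rather than my actual setup:

#!/bin/bash
#SBATCH --job-name=pixi-race
#SBATCH --array=1-14          # 14 concurrent tasks, mirroring the failing batch
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# All array tasks share the same project directory on the Lustre mount.
cd /lustre/project/pipeline || exit 1

# Every task validates/installs the same prefix at roughly the same time.
pixi install --frozen

# ... per-sample pipeline steps would follow here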

Commands I ran and their output:

pixi install
Error:   × failed to collect prefix records from '/***/.pixi/envs/default'
  ╰─▶ File was modified during parsing
  help: try `pixi clean` to reset the environment and run the command again

pixi.toml/pyproject.toml file that reproduces my issue:

[workspace]
channels = ["conda-forge", "bioconda", "biobakery"]
platforms = ["linux-64"]

[system-requirements]
libc = { family = "glibc", version = "2.17" }

[dependencies]
pyega3 = ">=5.2.0,<6"
r-base = ">=4.4,<4.5"
humann = "4.0.0a1.*"
metaphlan = "4.1.1.*"
# ... (other dependencies)

pixi info output:

System
------------
       Pixi version: 0.63.2
        TLS backend: rustls
           Platform: linux-64
   Virtual packages: __unix=0=0
                   : __linux=3.10.0=0
                   : __glibc=2.17=0
                   : __archspec=1=skylake_avx512
          Cache dir: /***/.cache/rattler/cache
       Auth storage: /***/.rattler/credentials.json
   Config locations: No config files found

Global
------------
            Bin dir: /***/.pixi/bin
    Environment dir: /***/.pixi/envs
       Manifest dir: /***/.pixi/manifests/pixi-global.toml

Workspace
------------
               Name: ***
            Version: 0.1.0
      Manifest file: /***/pixi.toml
       Last updated: 19-01-2026 07:27:29

Environments
------------
        Environment: default
           Features: default
           Channels: conda-forge, bioconda, biobakery
   Dependency count: 13
       Dependencies: pyega3, r-base, humann, metaphlan
   Target platforms: linux-64
    Prefix location: /***/.pixi/envs/default
System requirements: libc = { family = "glibc", version = "2.17" }

Issue description

I am encountering an intermittent race condition when running multiple instances of a pipeline via SLURM on an HPC cluster.

Even when using pixi install --frozen to prevent environment modifications, concurrent tasks on a shared filesystem (Lustre) sometimes collide when parsing prefix records.

This behavior is non-deterministic:

  • In a batch of 14 concurrent samples, 6 failed with this error while the rest succeeded.
  • Re-running those 6 failed samples later (with fewer processes competing) worked without issue.
  • Conversely, a separate batch of 88 samples, processed with a maximum concurrency of 18 tasks (the CPU quota limit for my user), finished successfully on the first try.

This suggests that the error is not strictly tied to the number of concurrent tasks, but rather to a transient race condition during the initial environment validation on the shared filesystem.
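Because the failure is transient, a retry wrapper with a jittered back-off (sketch below; retry count and sleep interval are arbitrary) might mask it on my side, although it obviously does not address the underlying race:

ok=0
for attempt in 1 2 3; do
    if pixi install --frozen; then
        ok=1
        break
    fi
    echo "pixi install failed (attempt $attempt), retrying..." >&2
    sleep $((RANDOM % 30 + 5))   # jittered back-off so tasks de-synchronize
done
[ "$ok" -eq 1 ] || exit 1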

I’m curious whether pixi currently implements a locking mechanism for these read/verify operations, or whether this is an area where the high latency of Lustre might be causing unexpected behavior.


Regardless of this issue, I would like to thank the maintainers for their incredible work on Pixi; it has significantly improved our bioinformatics workflows! 💙

Expected behavior

Pixi should ideally either use a file-locking mechanism (such as flock) to ensure that only one process reads or modifies the environment prefix records at a time, or handle concurrent read/verify operations gracefully instead of crashing when a file is accessed by multiple processes.
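
As a possible user-side stop-gap, the install step could be serialized across jobs with the flock CLI, along these lines (the lock-file path is a placeholder, and this only coordinates across nodes if the Lustre filesystem is mounted with flock support):

# Serialize the environment check/install across concurrent jobs with an
# advisory lock; waits up to 10 minutes for the lock before giving up.
flock --wait 600 /lustre/project/pipeline/.pixi-install.lock \
    pixi install --frozen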
