Patch/pq memory opt #141

Merged
singjc merged 16 commits into PyProphet:master from singjc:patch/pq_memory_opt
May 18, 2025

singjc commented May 18, 2025

  • Update the Parquet export to use DuckDB views and COPY, so data is streamed to disk with lower memory use.
  • Add an option to split the OSW Parquet export into a directory with separate precursor and transition Parquet files.
  • Update scoring, IPF, and context-level inference to accept a directory of split Parquet files as input.

Example

## Original OSW with 50 runs
merged.osw (275G)

## Conversion to split parquet
pyprophet export-parquet --in merged.osw --out merged.pq --scoring_format --split_transition_data

merged.pq (150G)
|-- precursors_features.parquet (7.7G)
|-- transition_features.parquet (143G)

## Scoring (pass the directory with split parquet)
pyprophet score --level ms1ms2 --in merged.pq
pyprophet score --level transition --in merged.pq

## IPF
pyprophet ipf --in merged.pq

## Context level scoring
pyprophet peptide --in merged.pq --context global
pyprophet protein --in merged.pq --context global
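
The export step in the example above relies on DuckDB views and COPY. The following is a minimal sketch of that idea, not the code from this PR: the helper names (`build_export_sql`, `export_split_parquet`) are hypothetical, and the views simply select everything from the OSW `FEATURE` and `FEATURE_TRANSITION` tables, whereas the real export presumably joins several tables. It assumes the `duckdb` Python package and its `sqlite` extension are available.

```python
from pathlib import Path


def build_export_sql(osw_path: str, out_dir: str) -> list[str]:
    """Build DuckDB statements for a split export (illustrative schema only).

    Paths are interpolated directly into SQL, so this sketch assumes they
    contain no single quotes.
    """
    out = Path(out_dir)
    precursor_path = (out / "precursors_features.parquet").as_posix()
    transition_path = (out / "transition_features.parquet").as_posix()
    return [
        # OSW files are SQLite databases; DuckDB reads them via its sqlite extension.
        "INSTALL sqlite",
        "LOAD sqlite",
        f"ATTACH '{osw_path}' AS osw (TYPE sqlite)",
        # Views are lazy: no rows are read until a COPY scans them.
        "CREATE VIEW precursor_view AS SELECT * FROM osw.FEATURE",
        "CREATE VIEW transition_view AS SELECT * FROM osw.FEATURE_TRANSITION",
        # COPY writes the query result to Parquet without building a DataFrame in Python.
        f"COPY (SELECT * FROM precursor_view) TO '{precursor_path}' (FORMAT PARQUET)",
        f"COPY (SELECT * FROM transition_view) TO '{transition_path}' (FORMAT PARQUET)",
    ]


def export_split_parquet(osw_path: str, out_dir: str) -> None:
    """Run the statements above against an in-memory DuckDB connection."""
    import duckdb  # third-party; imported lazily so the SQL builder has no dependencies

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    con = duckdb.connect()
    try:
        for statement in build_export_sql(osw_path, out_dir):
            con.execute(statement)
    finally:
        con.close()
```

Keeping the SQL construction separate from execution makes the statements easy to inspect or log before a long-running export is started.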

Enhancements to Parquet File Handling:

  • Added a utility function, is_valid_split_parquet_dir, that validates a split Parquet directory by checking that both required files (precursors_features.parquet and transition_features.parquet) exist and are valid Parquet files. (pyprophet/data_handling.py, R139-R161)
  • Introduced new methods read_pyp_parquet_dir_peakgroup_precursor and read_pyp_parquet_dir_transition to read data from split Parquet directories. These methods support precursor-level and transition-level data processing. (pyprophet/ipf.py [1] [2])
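
As an illustration of the directory check described above, here is a dependency-free sketch. It is not the PR's `is_valid_split_parquet_dir`; the function names below are hypothetical, and validity is approximated by testing for the 4-byte `PAR1` magic that unencrypted Parquet files carry at both ends, which is weaker than actually parsing the footer with a Parquet reader.

```python
import os
from pathlib import Path

PARQUET_MAGIC = b"PAR1"
REQUIRED_FILES = ("precursors_features.parquet", "transition_features.parquet")


def has_parquet_magic(path: Path) -> bool:
    """Cheap check: the file starts and ends with the Parquet magic bytes."""
    if not path.is_file() or path.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with path.open("rb") as fh:
        head = fh.read(len(PARQUET_MAGIC))
        fh.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = fh.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def looks_like_split_parquet_dir(directory) -> bool:
    """True if `directory` holds both expected files and each passes the magic check."""
    d = Path(directory)
    return d.is_dir() and all(has_parquet_magic(d / name) for name in REQUIRED_FILES)
```

A check like this only reads eight bytes per file, so it stays fast even when the transition file is over 100G.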

Export Functionality Improvements:

  • Added a --split_transition_data option to the export_parquet command, allowing users to export data into separate Parquet files for precursors and transitions. This provides better organization and compatibility with downstream tools. (pyprophet/main.py, R571-R672)
  • Updated the Parquet file export logic to handle existing SCORE_IPF_ columns by dropping them before writing new scores, ensuring data consistency. (pyprophet/ipf.py, R1048-R1090)
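
The drop-then-write handling of SCORE_IPF_ columns can be sketched in plain Python. This is an assumption-laden illustration, not the code in pyprophet/ipf.py: the table is modeled as a dict of column name to values, and the function name and the example column names (`SCORE_IPF_QVALUE`, `SCORE_IPF_PEP`) are hypothetical.

```python
def replace_ipf_scores(table: dict, new_scores: dict, prefix: str = "SCORE_IPF_") -> dict:
    """Return a copy of `table` whose old `prefix` columns are replaced by `new_scores`."""
    unexpected = [name for name in new_scores if not name.startswith(prefix)]
    if unexpected:
        raise ValueError(f"new score columns must start with {prefix!r}: {unexpected}")
    # Drop every stale score column first, so results from an earlier run
    # cannot survive next to the new ones.
    kept = {name: values for name, values in table.items() if not name.startswith(prefix)}
    kept.update(new_scores)
    return kept


old = {
    "FEATURE_ID": [1, 2],
    "SCORE_IPF_QVALUE": [0.5, 0.6],
    "SCORE_IPF_PEP": [0.1, 0.2],
}
new = replace_ipf_scores(old, {"SCORE_IPF_QVALUE": [0.01, 0.02]})
```

Dropping by prefix rather than by exact name means a column produced by a previous run but absent from the new scores is removed instead of being left behind with outdated values.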

singjc merged commit 2d16ab0 into PyProphet:master on May 18, 2025