Development #8

Merged

vuductung merged 11 commits into main from development
Mar 26, 2025
Conversation

@vuductung
Collaborator

No description provided.

…uctures and expected outputs for mean distance calculations and handle mismatched data lengths in correlation tests.
…n ms2_identifiers while ensuring unique rows in test_data.
…lations and data point counting methods. Refactor correlation matrix computation for efficiency and clarity. Update feature engineering pipelines to include new features and improve logging.
…ngineering. Introduce new methods for MS1-MS2 correlation calculations, data synchronization, and improved logging. Update type hints for better clarity and maintainability. Streamline data processing methods to handle dictionaries of DataFrames more effectively.
@@ -51,27 +51,48 @@ def _calculate_variance_distance(data):



The correlation matrix should have 1.0 along the diagonal to represent the perfect correlation of each fragment with itself. Currently, the diagonal elements are left as zeros, which is incorrect for a correlation matrix.

@njit
def _nan_correlation_matrix(data: np.ndarray) -> np.ndarray:
    """
    Calculate the correlation matrix of the data.

    Parameters:
        data: numpy array of shape (n_fragments, n_samples)
            The data to calculate the correlation matrix of.

    Returns:
        numpy array of shape (n_fragments, n_fragments)
        containing the correlation matrix of the data.
    """
    n = len(data)
    correlation_matrix = np.zeros((n, n))

    for i in range(n):
        correlation_matrix[i, i] = 1.0  # Set diagonal to 1.0 (perfect self-correlation)
        for j in range(i + 1, n):  # Only compute upper triangle
            mask = np.isfinite(data[i]) & np.isfinite(data[j])
            if np.sum(mask) > 1:
                xi = data[i][mask]
                xj = data[j][mask]

@@ -139,112 +195,169 @@ def _clean_data(self, data, level):


The method drops columns but, in the dictionary case, does not store the remaining columns in the not_dropped_columns attribute. Storing them keeps the behavior consistent with the DataFrame case and matches the log message.

def _drop_columns(self, data):
        if isinstance(data, dict):
            result = {}
            for k in data.keys():
                result[k] = data[k].drop(columns=ColumnConfig.IDENTIFIERS, errors="ignore")
                
            self.not_dropped_columns = list(next(iter(result.values())).columns)
            return result

        elif isinstance(data, pd.DataFrame):
            logger.info(
                "Columns that are not dropped can be found under attribute 'not_dropped_columns'."
            )
            self.not_dropped_columns = list(data.drop(columns=ColumnConfig.IDENTIFIERS, errors="ignore").columns)

@@ -98,7 +119,7 @@ def _nan_correlation_w_ref(data_ref_pairs: tuple[np.ndarray, np.ndarray]) -> np.
ref = ref[0]


The return value shape comment is incorrect. Based on the implementation, the method returns an array of shape (n_fragments,) not (n_samples, n_features) as currently documented.

def _nan_correlation_w_ref(data_ref_pairs: tuple[np.ndarray, np.ndarray]) -> np.ndarray:
    """
    Calculate the correlation between each row of data and the reference row.

    Parameters:
        data_ref_pairs: tuple[np.ndarray, np.ndarray]
            A tuple containing two numpy arrays:
            - data: numpy array of shape (n_fragments, n_samples)
            - ref: numpy array of shape (n_samples,) or (1, n_samples)

    Returns:
        numpy array of shape (n_fragments,)
        containing the correlation between each row of data and the reference row.
    """

    data, ref = data_ref_pairs

    if ref.ndim > 1 and ref.shape[0] == 1:
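A minimal standalone sketch of the behavior the corrected docstring describes (plain NumPy with np.corrcoef for brevity, not the project's njit implementation): each row of data is correlated with the reference, yielding an array of shape (n_fragments,).

```python
import numpy as np

def nan_correlation_w_ref(data: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Correlate each row of `data` with `ref`, ignoring non-finite entries."""
    ref = np.ravel(ref)  # accept shape (n_samples,) or (1, n_samples)
    out = np.zeros(data.shape[0])
    for i in range(data.shape[0]):
        mask = np.isfinite(data[i]) & np.isfinite(ref)
        if mask.sum() > 1:
            xi, xr = data[i][mask], ref[mask]
            if xi.std() > 0 and xr.std() > 0:
                out[i] = np.corrcoef(xi, xr)[0, 1]
    return out
```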

def calculate_ms1_ms2_corr(
self,
ms1_data_extracted: list[np.ndarray],
ms2_data_extracted: list[np.ndarray],


Added a check to ensure the lengths of ms1_data_extracted and ms2_data_extracted match before zipping them. This will provide a more helpful error message rather than potential silent errors or less informative error messages later in the process.

def calculate_ms1_ms2_corr(
        self,
        ms1_data_extracted: list[np.ndarray],
        ms2_data_extracted: list[np.ndarray],
    ) -> pd.DataFrame:
        """
        Calculate the correlation between the ms1 and ms2 data.

        Parameters:
            ms1_data_extracted: list[np.ndarray]
                The ms1 data extracted from the precursor data.
            ms2_data_extracted: list[np.ndarray]
                The ms2 data extracted from the precursor data.
        Returns:
            ms1_ms2_corr_data: pd.DataFrame
                The ms1-ms2 correlation data.
        """
        
        if len(ms1_data_extracted) != len(ms2_data_extracted):
            raise ValueError("MS1 and MS2 data must have the same number of elements")

        zipped_data = zip(ms2_data_extracted, ms1_data_extracted)
        ms1_ms2_corr = self.parallel_process(
            inputs=zipped_data, method=_nan_correlation_w_ref, n_jobs=10
        )
        ms1_ms2_corr = pd.DataFrame(np.vstack(ms1_ms2_corr))

@@ -77,10 +77,10 @@ def setUp(self):
]


Added debugging output to the test to provide more information if the test fails. This will make it easier to diagnose issues with the _calculate_mean_distance function.

def test_calculate_mean_distance(self):
        data = np.array([[1, 2], [2, 4]])
        expected_output = np.array([[0.5, 1.0], [0.5, 1.0]])

        result = self.helper._calculate_mean_distance(data)
        
        # Print output for debugging if test fails
        if not np.array_equal(result, expected_output):
            print(f"Expected: {expected_output}, Got: {result}")
            
        np.testing.assert_array_equal(result, expected_output)

)
return [total]
return [ranks]


Added a proper docstring to the method, including Parameters and Returns sections, to improve code documentation and make the method's purpose and usage clearer.

def feature_engineering_pipeline_for_height_and_ms1_intensity(self, data):
        """
        Feature engineering pipeline for height and MS1 intensity data.
        
        Parameters
        ----------
        data : array-like
            The input data for feature engineering.
            
        Returns
        -------
        list
            List of engineered features.
        """
        logger.info("Calculating additional features.")
        mean_corr, std_corr = zip(
            *self._feat_eng.parallel_process(
                data, self._feat_eng.feature_engineering, func="corr", n_jobs=10
            )
        )
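The `zip(*...)` idiom in the pipeline above transposes a list of per-group (mean, std) result pairs into two parallel sequences. A minimal illustration with made-up values (not the project's actual outputs):

```python
# hypothetical per-group results, shaped like what parallel_process would return
results = [(0.52, 0.11), (0.87, 0.05), (0.33, 0.20)]

# zip(*...) transposes the list of pairs into one tuple of means and one of stds
mean_corr, std_corr = zip(*results)
```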

@@ -109,26 +108,83 @@ def process(


Added a proper docstring to the _feat_eng_preprocess_pipeline method, including Parameters and Returns sections, to improve code documentation.

def _feat_eng_preprocess_pipeline(
        self,
        data: dict[str, pd.DataFrame],
        level: str,
    ) -> tuple[list[torch.Tensor], list[torch.Tensor]]:
        """
        Preprocess pipeline for feature engineering.
        
        Parameters
        ----------
        data : dict[str, pd.DataFrame]
            Dictionary containing MS1 and MS2 data.
        level : str
            Level to extract data from (e.g., 'pg' for protein group).
            
        Returns
        -------
        tuple[list[torch.Tensor], list[torch.Tensor]]
            Tuple containing standardized feature data and intensity data.
        """
        logger.info(
            "Processing MS1 data: %d features found - %s",
            len(set(data["ms1"].keys())),
            ", ".join(data["ms1"].keys()),
        )
        logger.info(
            "Processing MS2 data: %d features found - %s",
            len(set(data["ms2"].keys())),
            ", ".join(data["ms2"].keys()),
        )

self, data: pd.DataFrame, level: str = "pg", return_index: bool = False
) -> list:
) -> Union[List[np.ndarray], Tuple[List[np.ndarray]]]:
"""


Updated the return type hint to more accurately reflect the function's behavior when return_index is True. The function returns a tuple of two lists in that case, not a list of tuples as implied by the original return hint.

def extract_data_from_level(
        self, data: pd.DataFrame, level: str = "pg", return_index: bool = False
    ) -> Union[List[np.ndarray], Tuple[List[np.ndarray], List[pd.MultiIndex]]]:
        """
        Create a list of np.array based on the level to group by.
        The data is grouped by the level and the values are
        extracted and stored in a list.

        Parameters
        ----------
        data : pd.DataFrame
            The data to be processed.
        level : str
            The level to group by. Either "pg" for protein group
            or "mod_seq_charge_hash" for precursor.
        return_index : bool, optional
            If True, the index is also returned. The index is used
            to identify the protein group or precursor the data
            belongs to.

        Returns
        -------
        Union[List[np.ndarray], Tuple[List[np.ndarray], List[pd.MultiIndex]]]
            A list of np.array with the extracted values, and optionally a list of indices.
        """

from selectlfq.dataloader import DataLoader


class TestModel(unittest.TestCase):


Added comments to clarify the structure of the test data, which makes it easier to understand the dimensions and purpose of the arrays in the feature_layer test data.

def setUp(self):
        # Initialize test data with realistic shapes and values
        self.intensity_layer = [np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])]
        self.feature_layer = [
            np.array(
                [
                    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],  # Sample 1, features for 2 fragments
                    [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]],  # Sample 2, features for 2 fragments
                ]
            )
        ]
        # Create a dataloader with our test data
        self.dataloader = DataLoader(
            feature_layer=self.feature_layer, intensity_layer=self.intensity_layer
        )
        # Initialize the model with our dataloader
        self.model = Model(dataloader=self.dataloader)

@github-actions

Overall, the codebase has been improved with better error handling, documentation, and optimizations. Key improvements include:

  1. Optimized correlation matrix calculation in _nan_correlation_matrix by only computing half of the matrix and filling the other half (symmetric matrix).

  2. Added an MS1-MS2 correlation feature calculation pipeline.

  3. Added proper type hints throughout the code.

  4. Improved error handling, particularly in methods that process MS1 and MS2 data together.

  5. Better data synchronization between MS1 and MS2 data.

Suggestions for further improvement:

  1. The code would benefit from more consistent error handling - some methods use specific error messages while others use generic ones.

  2. The _clean_data and other preprocessing methods directly modify the input data. They should be changed to create copies to avoid unintended side effects.

  3. The correlation matrix calculation could be further optimized by using vectorized operations instead of loops where possible.

  4. Test coverage could be expanded to include more edge cases, especially around data processing with missing or invalid values.
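Regarding suggestion 3, one possible vectorized direction (a sketch, not the project's code) is to lean on pandas, whose DataFrame.corr computes pairwise Pearson correlation while ignoring NaNs per column pair. Note that it yields NaN rather than 0.0 where too few points or zero variance exist, so a fill step is needed to match the loop-based behavior:

```python
import numpy as np
import pandas as pd

def nan_correlation_matrix_pandas(data: np.ndarray) -> np.ndarray:
    # rows are fragments, columns are samples; transpose so each fragment is a column
    corr = pd.DataFrame(data.T).corr(min_periods=2).to_numpy()
    corr = np.nan_to_num(corr)   # replace NaN (too few points / zero variance) with 0.0
    np.fill_diagonal(corr, 1.0)  # restore the perfect self-correlation diagonal
    return corr
```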

@github-actions

The following feedback could not be added to specific lines, but still contains valuable information:

{"change_id": "5", "file_name": "./selectlfq/preprocessing.py", "start_line": 1114, "end_line": 1179, "proposed_code": "def sync_ms1_and_ms2_data(self, data: dict[pd.DataFrame]) -> dict[pd.DataFrame]:\n        \"\"\"\n        Sync the ms1 and ms2 data by the precursor index. The data is\n        sorted by the ion column, so that the data is in the same order\n        for both ms1 and ms2.\n\n        Parameters\n        ----------\n        data : dict[str, pd.DataFrame]\n            The data to sync.\n        Returns\n        -------\n        dict[str, pd.DataFrame]\n            The synced data.\n        \"\"\"\n        if self.ms2_identifiers is None:\n            raise ValueError(\n                \"Identifiers are not defined, load fragment data first to retrieve precursor index\"\n            )\n        else:\n            # Create a copy to avoid modifying the original data\n            result = {}\n            \n            # merge the identifiers\n            for key in data.keys():\n                result[key] = self.ms2_identifiers.merge(\n                    data[key], on=ColumnConfig.PRECURSOR_IDENTIFIERS, how=\"left\"\n                )\n            \n            # sort the data by the ion column\n            for key in result.keys():\n                result[key] = self._sort_by_list(\n                    data=result[key],\n                    col=\"ion\",\n                    reindexed_list=self.ms2_identifiers[\"ion\"].tolist(),\n                )\n                if (\n                    self.ms2_identifiers[\"ion\"].values != result[key][\"ion\"].values\n                ).any():\n                    consistencies = (\n                        self.ms2_identifiers[\"ion\"].values == result[key][\"ion\"].values\n                    )\n                    indices_of_inconsistencies = np.where(~consistencies)[0]\n                    logger.warning(\n                        \"Inconsistent ions for data %s at %s\",\n                        key,\n            
            indices_of_inconsistencies,\n                    )\n\n            return result", "comment": "The method modifies the input data directly, which can lead to unexpected side effects. It's better to create a copy of the data and modify that instead. Added a `result` dictionary to store the modified data."}
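The non-mutating pattern this feedback recommends can be illustrated with a minimal sketch (the "precursor" column here is a hypothetical stand-in for ColumnConfig.PRECURSOR_IDENTIFIERS): build a fresh dict from the merge results so the caller's data is never modified in place.

```python
import pandas as pd

# hypothetical identifiers and input data
identifiers = pd.DataFrame({"precursor": [1, 2], "ion": ["b2", "y3"]})
data = {"height": pd.DataFrame({"precursor": [2, 1], "val": [10.0, 20.0]})}

# build a new dict instead of assigning back into `data`, leaving the input untouched
result = {
    key: identifiers.merge(df, on="precursor", how="left")
    for key, df in data.items()
}
```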


Collaborator

@mschwoer mschwoer left a comment


tl;dr ;-)

Comment on lines 88 to 95
            else:
                correlation_matrix[i, j] = 0.0  # No correlation if standard deviation is zero
                correlation_matrix[j, i] = 0.0
        else:
            correlation_matrix[i, j] = 0.0  # Not enough data points
            correlation_matrix[j, i] = 0.0
Collaborator

similar to a few days ago: initialize with 0.0 and save the two elses
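This suggestion can be sketched as follows (using np.corrcoef for brevity rather than the numba-compatible arithmetic in the PR): since np.zeros already fills the matrix with 0.0, both else branches become redundant.

```python
import numpy as np

def nan_correlation_matrix_sketch(data: np.ndarray) -> np.ndarray:
    n = len(data)
    corr = np.zeros((n, n))      # 0.0 everywhere by default: no else branches needed
    np.fill_diagonal(corr, 1.0)  # perfect self-correlation on the diagonal
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only; mirror into the lower half
            mask = np.isfinite(data[i]) & np.isfinite(data[j])
            if mask.sum() > 1:
                xi, xj = data[i][mask], data[j][mask]
                if xi.std() > 0 and xj.std() > 0:
                    corr[i, j] = corr[j, i] = np.corrcoef(xi, xj)[0, 1]
    return corr
```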

…clarity and maintainability of correlation calculations.
@vuductung vuductung merged commit efeadd5 into main Mar 26, 2025
0 of 6 checks passed