Development #8

Merged

vuductung merged 11 commits into main from development
Mar 26, 2025
Conversation

@vuductung
Collaborator

No description provided.

…uctures and expected outputs for mean distance calculations and handle mismatched data lengths in correlation tests.
…n ms2_identifiers while ensuring unique rows in test_data.
…lations and data point counting methods. Refactor correlation matrix computation for efficiency and clarity. Update feature engineering pipelines to include new features and improve logging.
…ngineering. Introduce new methods for MS1-MS2 correlation calculations, data synchronization, and improved logging. Update type hints for better clarity and maintainability. Streamline data processing methods to handle dictionaries of DataFrames more effectively.
@@ -51,27 +51,48 @@ def _calculate_variance_distance(data):



The correlation matrix should have 1.0 along the diagonal to represent the perfect correlation of each fragment with itself. Currently, the diagonal elements are left as zeros, which is incorrect for a correlation matrix.

@njit
def _nan_correlation_matrix(data: np.ndarray) -> np.ndarray:
    """
    Calculate the correlation matrix of the data.

    Parameters:
        data: numpy array of shape (n_fragments, n_samples)
            The data to calculate the correlation matrix of.

    Returns:
        numpy array of shape (n_fragments, n_fragments)
        containing the correlation matrix of the data.
    """
    n = len(data)
    correlation_matrix = np.zeros((n, n))

    for i in range(n):
        correlation_matrix[i, i] = 1.0  # Set diagonal to 1.0 (perfect self-correlation)
        for j in range(i + 1, n):  # Only compute upper triangle
            mask = np.isfinite(data[i]) & np.isfinite(data[j])
            if np.sum(mask) > 1:
                xi = data[i][mask]
                xj = data[j][mask]

@@ -139,112 +195,169 @@ def _clean_data(self, data, level):


The method drops columns but, in the dictionary case, does not store the remaining columns in the not_dropped_columns attribute. Storing them keeps the behavior consistent with the DataFrame case and matches the log message.

def _drop_columns(self, data):
        if isinstance(data, dict):
            result = {}
            for k in data.keys():
                result[k] = data[k].drop(columns=ColumnConfig.IDENTIFIERS, errors="ignore")
                
            self.not_dropped_columns = list(next(iter(result.values())).columns)
            return result

        elif isinstance(data, pd.DataFrame):
            logger.info(
                "Columns that are not dropped can be found under attribute 'not_dropped_columns'."
            )
            self.not_dropped_columns = list(data.drop(columns=ColumnConfig.IDENTIFIERS, errors="ignore").columns)

@@ -98,7 +119,7 @@ def _nan_correlation_w_ref(data_ref_pairs: tuple[np.ndarray, np.ndarray]) -> np.
ref = ref[0]


The return value shape comment is incorrect. Based on the implementation, the method returns an array of shape (n_fragments,) not (n_samples, n_features) as currently documented.

def _nan_correlation_w_ref(data_ref_pairs: tuple[np.ndarray, np.ndarray]) -> np.ndarray:
    """
    Calculate the correlation between each row of data and the reference row.

    Parameters:
        data_ref_pairs: tuple[np.ndarray, np.ndarray]
            A tuple containing two numpy arrays:
            - data: numpy array of shape (n_fragments, n_samples)
            - ref: numpy array of shape (n_samples,) or (1, n_samples)

    Returns:
        numpy array of shape (n_fragments,)
        containing the correlation between each row of data and the reference row.
    """

    data, ref = data_ref_pairs

    if ref.ndim > 1 and ref.shape[0] == 1:
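A minimal standalone sketch of the behavior the corrected docstring describes (plain NumPy with np.corrcoef for brevity, not the project's njit implementation): each row of data is correlated with the reference, yielding an array of shape (n_fragments,).

```python
import numpy as np

def nan_correlation_w_ref(data: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Correlate each row of `data` with `ref`, ignoring non-finite entries."""
    ref = np.ravel(ref)  # accept shape (n_samples,) or (1, n_samples)
    out = np.zeros(data.shape[0])
    for i in range(data.shape[0]):
        mask = np.isfinite(data[i]) & np.isfinite(ref)
        if mask.sum() > 1:
            xi, xr = data[i][mask], ref[mask]
            if xi.std() > 0 and xr.std() > 0:
                out[i] = np.corrcoef(xi, xr)[0, 1]
    return out
```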

def calculate_ms1_ms2_corr(
self,
ms1_data_extracted: list[np.ndarray],
ms2_data_extracted: list[np.ndarray],


Added a check to ensure the lengths of ms1_data_extracted and ms2_data_extracted match before zipping them. This will provide a more helpful error message rather than potential silent errors or less informative error messages later in the process.

def calculate_ms1_ms2_corr(
        self,
        ms1_data_extracted: list[np.ndarray],
        ms2_data_extracted: list[np.ndarray],
    ) -> pd.DataFrame:
        """
        Calculate the correlation between the ms1 and ms2 data.

        Parameters:
            ms1_data_extracted: list[np.ndarray]
                The ms1 data extracted from the precursor data.
            ms2_data_extracted: list[np.ndarray]
                The ms2 data extracted from the precursor data.
        Returns:
            ms1_ms2_corr_data: pd.DataFrame
                The ms1-ms2 correlation data.
        """
        
        if len(ms1_data_extracted) != len(ms2_data_extracted):
            raise ValueError("MS1 and MS2 data must have the same number of elements")

        zipped_data = zip(ms2_data_extracted, ms1_data_extracted)
        ms1_ms2_corr = self.parallel_process(
            inputs=zipped_data, method=_nan_correlation_w_ref, n_jobs=10
        )
        ms1_ms2_corr = pd.DataFrame(np.vstack(ms1_ms2_corr))

@@ -77,10 +77,10 @@ def setUp(self):
]


Added debugging output to the test to provide more information if the test fails. This will make it easier to diagnose issues with the _calculate_mean_distance function.

def test_calculate_mean_distance(self):
        data = np.array([[1, 2], [2, 4]])
        expected_output = np.array([[0.5, 1.0], [0.5, 1.0]])

        result = self.helper._calculate_mean_distance(data)
        
        # Print output for debugging if test fails
        if not np.array_equal(result, expected_output):
            print(f"Expected: {expected_output}, Got: {result}")
            
        np.testing.assert_array_equal(result, expected_output)

)
return [total]
return [ranks]


Added a proper docstring to the method, including Parameters and Returns sections, to improve code documentation and make the method's purpose and usage clearer.

def feature_engineering_pipeline_for_height_and_ms1_intensity(self, data):
        """
        Feature engineering pipeline for height and MS1 intensity data.
        
        Parameters
        ----------
        data : array-like
            The input data for feature engineering.
            
        Returns
        -------
        list
            List of engineered features.
        """
        logger.info("Calculating additional features.")
        mean_corr, std_corr = zip(
            *self._feat_eng.parallel_process(
                data, self._feat_eng.feature_engineering, func="corr", n_jobs=10
            )
        )
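The `zip(*...)` idiom in the pipeline above transposes a list of per-group (mean, std) result pairs into two parallel sequences. A minimal illustration with made-up values (not the project's actual outputs):

```python
# hypothetical per-group results, shaped like what parallel_process would return
results = [(0.52, 0.11), (0.87, 0.05), (0.33, 0.20)]

# zip(*...) transposes the list of pairs into one tuple of means and one of stds
mean_corr, std_corr = zip(*results)
```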

@@ -109,26 +108,83 @@ def process(


Added a proper docstring to the _feat_eng_preprocess_pipeline method, including Parameters and Returns sections, to improve code documentation.

def _feat_eng_preprocess_pipeline(
        self,
        data: dict[str, pd.DataFrame],
        level: str,
    ) -> tuple[list[torch.Tensor], list[torch.Tensor]]:
        """
        Preprocess pipeline for feature engineering.
        
        Parameters
        ----------
        data : dict[str, pd.DataFrame]
            Dictionary containing MS1 and MS2 data.
        level : str
            Level to extract data from (e.g., 'pg' for protein group).
            
        Returns
        -------
        tuple[list[torch.Tensor], list[torch.Tensor]]
            Tuple containing standardized feature data and intensity data.
        """
        logger.info(
            "Processing MS1 data: %d features found - %s",
            len(set(data["ms1"].keys())),
            ", ".join(data["ms1"].keys()),
        )
        logger.info(
            "Processing MS2 data: %d features found - %s",
            len(set(data["ms2"].keys())),
            ", ".join(data["ms2"].keys()),
        )

self, data: pd.DataFrame, level: str = "pg", return_index: bool = False
) -> list:
) -> Union[List[np.ndarray], Tuple[List[np.ndarray]]]:
"""


Updated the return type hint to more accurately reflect the function's behavior when return_index is True. The function returns a tuple of two lists in that case, not a list of tuples as implied by the original return hint.

def extract_data_from_level(
        self, data: pd.DataFrame, level: str = "pg", return_index: bool = False
    ) -> Union[List[np.ndarray], Tuple[List[np.ndarray], List[pd.MultiIndex]]]:
        """
        Create a list of np.array based on the level to group by.
        The data is grouped by the level and the values are
        extracted and stored in a list.

        Parameters
        ----------
        data : pd.DataFrame
            The data to be processed.
        level : str
            The level to group by. Either "pg" for protein group
            or "mod_seq_charge_hash" for precursor.
        return_index : bool, optional
            If True, the index is also returned. The index is used
            to identify the protein group or precursor the data
            belongs to.

        Returns
        -------
        Union[List[np.ndarray], Tuple[List[np.ndarray], List[pd.MultiIndex]]]
            A list of np.array with the extracted values, and optionally a list of indices.
        """

from selectlfq.dataloader import DataLoader


class TestModel(unittest.TestCase):


Added comments to clarify the structure of the test data, which makes it easier to understand the dimensions and purpose of the arrays in the feature_layer test data.

def setUp(self):
        # Initialize test data with realistic shapes and values
        self.intensity_layer = [np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])]
        self.feature_layer = [
            np.array(
                [
                    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],  # Sample 1, features for 2 fragments
                    [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]],  # Sample 2, features for 2 fragments
                ]
            )
        ]
        # Create a dataloader with our test data
        self.dataloader = DataLoader(
            feature_layer=self.feature_layer, intensity_layer=self.intensity_layer
        )
        # Initialize the model with our dataloader
        self.model = Model(dataloader=self.dataloader)

@github-actions

Overall, the codebase has been improved with better error handling, documentation, and optimizations. Key improvements include:

  1. Optimized correlation matrix calculation in _nan_correlation_matrix by only computing half of the matrix and filling the other half (symmetric matrix).

  2. Added an MS1-MS2 correlation feature calculation pipeline.

  3. Added proper type hints throughout the code.

  4. Improved error handling, particularly in methods that process MS1 and MS2 data together.

  5. Better data synchronization between MS1 and MS2 data.

Suggestions for further improvement:

  1. The code would benefit from more consistent error handling - some methods use specific error messages while others use generic ones.

  2. The _clean_data and other preprocessing methods directly modify the input data. They should be changed to create copies to avoid unintended side effects.

  3. The correlation matrix calculation could be further optimized by using vectorized operations instead of loops where possible.

  4. Test coverage could be expanded to include more edge cases, especially around data processing with missing or invalid values.
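Regarding suggestion 3, one possible vectorized direction (a sketch, not the project's code) is to lean on pandas, whose DataFrame.corr computes pairwise Pearson correlation while ignoring NaNs per column pair. Note that it yields NaN rather than 0.0 where too few points or zero variance exist, so a fill step is needed to match the loop-based behavior:

```python
import numpy as np
import pandas as pd

def nan_correlation_matrix_pandas(data: np.ndarray) -> np.ndarray:
    # rows are fragments, columns are samples; transpose so each fragment is a column
    corr = pd.DataFrame(data.T).corr(min_periods=2).to_numpy()
    corr = np.nan_to_num(corr)   # replace NaN (too few points / zero variance) with 0.0
    np.fill_diagonal(corr, 1.0)  # restore the perfect self-correlation diagonal
    return corr
```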

@github-actions

The following feedback could not be added to specific lines, but still contains valuable information:

{"change_id": "5", "file_name": "./selectlfq/preprocessing.py", "start_line": 1114, "end_line": 1179, "proposed_code": "def sync_ms1_and_ms2_data(self, data: dict[pd.DataFrame]) -> dict[pd.DataFrame]:\n        \"\"\"\n        Sync the ms1 and ms2 data by the precursor index. The data is\n        sorted by the ion column, so that the data is in the same order\n        for both ms1 and ms2.\n\n        Parameters\n        ----------\n        data : dict[str, pd.DataFrame]\n            The data to sync.\n        Returns\n        -------\n        dict[str, pd.DataFrame]\n            The synced data.\n        \"\"\"\n        if self.ms2_identifiers is None:\n            raise ValueError(\n                \"Identifiers are not defined, load fragment data first to retrieve precursor index\"\n            )\n        else:\n            # Create a copy to avoid modifying the original data\n            result = {}\n            \n            # merge the identifiers\n            for key in data.keys():\n                result[key] = self.ms2_identifiers.merge(\n                    data[key], on=ColumnConfig.PRECURSOR_IDENTIFIERS, how=\"left\"\n                )\n            \n            # sort the data by the ion column\n            for key in result.keys():\n                result[key] = self._sort_by_list(\n                    data=result[key],\n                    col=\"ion\",\n                    reindexed_list=self.ms2_identifiers[\"ion\"].tolist(),\n                )\n                if (\n                    self.ms2_identifiers[\"ion\"].values != result[key][\"ion\"].values\n                ).any():\n                    consistencies = (\n                        self.ms2_identifiers[\"ion\"].values == result[key][\"ion\"].values\n                    )\n                    indices_of_inconsistencies = np.where(~consistencies)[0]\n                    logger.warning(\n                        \"Inconsistent ions for data %s at %s\",\n                        key,\n            
            indices_of_inconsistencies,\n                    )\n\n            return result", "comment": "The method modifies the input data directly, which can lead to unexpected side effects. It's better to create a copy of the data and modify that instead. Added a `result` dictionary to store the modified data."}
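The non-mutating pattern this feedback recommends can be illustrated with a minimal sketch (the "precursor" column here is a hypothetical stand-in for ColumnConfig.PRECURSOR_IDENTIFIERS): build a fresh dict from the merge results so the caller's data is never modified in place.

```python
import pandas as pd

# hypothetical identifiers and input data
identifiers = pd.DataFrame({"precursor": [1, 2], "ion": ["b2", "y3"]})
data = {"height": pd.DataFrame({"precursor": [2, 1], "val": [10.0, 20.0]})}

# build a new dict instead of assigning back into `data`, leaving the input untouched
result = {
    key: identifiers.merge(df, on="precursor", how="left")
    for key, df in data.items()
}
```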


Collaborator

@mschwoer mschwoer left a comment


tl;dr ;-)

Comment on lines 88 to 95
            else:
                correlation_matrix[i, j] = 0.0  # No correlation if standard deviation is zero
                correlation_matrix[j, i] = 0.0
        else:
            correlation_matrix[i, j] = 0.0  # Not enough data points
            correlation_matrix[j, i] = 0.0
Collaborator

similar to a few days ago: initialize with 0.0 and save the two elses
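This suggestion can be sketched as follows (using np.corrcoef for brevity rather than the numba-compatible arithmetic in the PR): since np.zeros already fills the matrix with 0.0, both else branches become redundant.

```python
import numpy as np

def nan_correlation_matrix_sketch(data: np.ndarray) -> np.ndarray:
    n = len(data)
    corr = np.zeros((n, n))      # 0.0 everywhere by default: no else branches needed
    np.fill_diagonal(corr, 1.0)  # perfect self-correlation on the diagonal
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only; mirror into the lower half
            mask = np.isfinite(data[i]) & np.isfinite(data[j])
            if mask.sum() > 1:
                xi, xj = data[i][mask], data[j][mask]
                if xi.std() > 0 and xj.std() > 0:
                    corr[i, j] = corr[j, i] = np.corrcoef(xi, xj)[0, 1]
    return corr
```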

…clarity and maintainability of correlation calculations.
@vuductung vuductung merged commit efeadd5 into main Mar 26, 2025
0 of 6 checks passed