
Development#5

Merged
vuductung merged 23 commits into main from development
Mar 19, 2025
Conversation

@vuductung
Collaborator

This pull request includes several changes to improve the configuration, data loading, and preprocessing in the selectlfq package. The most important changes include the introduction of a ModelConfig class, updates to data configuration constants, enhancements to the data loading process, and refinements in the preprocessing pipeline.

Configuration Updates:

  • Introduced ModelConfig class with constants for engineered and removed features in selectlfq/config.py.
  • Updated CONFIG dictionary to include comments and changed the verbose parameter to True in selectlfq/config.py.

Data Configuration Constants:

  • Modified DataConfig to comment out unused MS2 feature names and added new lists for log2 transform and alignment features in selectlfq/constants.py.
  • Added PRECURSOR_IDENTIFIERS to ColumnConfig in selectlfq/constants.py.

Data Loading Enhancements:

  • Added load_features method to Loader class and updated existing methods to handle default directories and logging in selectlfq/loader.py.
  • Removed redundant code for syncing MS1 and MS2 data in selectlfq/loader.py.

Preprocessing Refinements:

  • Updated preprocessing pipeline to use shared state and refined methods for data extraction and column handling in selectlfq/preprocessing.py.

Miscellaneous Changes:

  • Removed unused imports and methods across various files to clean up the codebase.

- Update correlation matrix computation to only calculate off-diagonal elements
- Simplify initialization of the correlation matrix with NaN values
- Remove redundant NaN assignments within the correlation calculation loop
- Introduce additional features including cycle_fwhm, mz_observed, mz_library, and mz_calibrated
- Expand error metrics with mean_ms2_mass_error and top_3_ms2_mass_error
- Include top3_frame_correlation and top3_b_ion_correlation for improved correlation analysis
- Comment out unused correlation features for clarity
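The correlation-matrix changes above (NaN initialization, off-diagonal-only computation, sparsity weighting) can be sketched in plain NumPy. This is a condensed, non-numba illustration of the idea, not the repository's exact @njit implementation:

```python
import numpy as np

def nan_corr_offdiag(data: np.ndarray) -> np.ndarray:
    """Pairwise correlation over rows, skipping NaNs and the diagonal.

    Off-diagonal entries are scaled by the fraction of jointly finite
    values ("sparsity"); the diagonal stays NaN by construction, so no
    separate NaN assignment is needed inside the loop.
    """
    n = data.shape[0]
    corr = np.full((n, n), np.nan)  # initialize everything with NaN
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # only compute off-diagonal elements
            mask = np.isfinite(data[i]) & np.isfinite(data[j])
            if mask.sum() > 1:  # need at least two shared data points
                xi, xj = data[i][mask], data[j][mask]
                si, sj = xi.std(), xj.std()
                if si > 0 and sj > 0:
                    cov = np.mean((xi - xi.mean()) * (xj - xj.mean()))
                    corr[i, j] = (cov / (si * sj)) * mask.mean()
    return corr
```

Because the matrix starts out all-NaN, entries with too few shared finite values or zero variance simply stay NaN, which matches the simplified initialization described in the commit.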
- Rename "intensity" to "ms2_intensity" in LOG2_TRANSFORM_FEATURES and ALIGN_FEATURES to avoid confusion between MS1 and MS2 intensity.
- Change occurrences of "intensity" to "ms2_intensity" and "ms1_intensity" to improve clarity and maintain consistency across data processing.
- Replace occurrences of "intensity" with "ms2_intensity" in the PreprocessingPipeline class for consistency with recent changes in data handling.
- Adjust column dropping and reordering logic to align with the updated feature names.
- Add "height" to LOG2_TRANSFORM_FEATURES and ALIGN_FEATURES for enhanced feature representation and consistency in data processing.
- Update column dropping logic to utilize ColumnConfig.IDENTIFIERS instead of hardcoded categorical features for better maintainability.
- Enhance logging message formatting for clarity.
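The LOG2_TRANSFORM_FEATURES handling described above can be sketched as follows. The list contents and the zero-to-NaN guard are assumptions for illustration; only the feature names mentioned in the commits are taken from the PR:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for DataConfig.LOG2_TRANSFORM_FEATURES
LOG2_TRANSFORM_FEATURES = ["ms2_intensity", "ms1_intensity", "height"]

def log2_transform(data: dict) -> dict:
    """Log2-transform only the frames named in LOG2_TRANSFORM_FEATURES.

    Zeros are mapped to NaN first so the transform never produces -inf;
    frames not in the list pass through unchanged.
    """
    out = {}
    for name, df in data.items():
        if name in LOG2_TRANSFORM_FEATURES:
            out[name] = np.log2(df.replace(0, np.nan))
        else:
            out[name] = df
    return out
```

Keying the transform on explicit feature names like "ms2_intensity" (rather than a generic "intensity") is what makes the MS1/MS2 distinction unambiguous downstream.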
- Modify load_fragment_data_files and load_precursor_file methods to accept an optional directory argument, defaulting to the current working directory if not provided.
- Move _extract_data out of the _prepare_data method for more flexible data handling.
- Update _prepare_data method to directly use the log-transformed data, enhancing readability and maintainability.
- Introduce SharedState class to encapsulate state information shared between multiple classes.
- Create a singleton instance of SharedState for easy access in other modules.
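The module-level singleton pattern these commits describe can be condensed to the sketch below (attribute set abbreviated; the full class in selectlfq/shared_state.py carries more attributes):

```python
# shared_state.py -- one instance created at import time
class SharedState:
    """Container for state shared between pipeline components."""

    def __init__(self):
        self.sorted_columns = None      # ordered sample column names
        self.level_information = None   # protein-group index labels
        self.lin_scaled_data = None     # linear-scale intensity data

# Singleton: Python caches modules, so every
# `from shared_state import shared_state` sees this same object.
shared_state = SharedState()
```

A write from the preprocessing step (e.g. setting sorted_columns) is therefore visible to the model step without passing the value through intermediate function signatures; the trade-off, raised later in this review thread, is that the data flow becomes implicit.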
…gement

- Replace local variable for sorted_columns with shared_state.sorted_columns to enable cross-module access.
- Store level information in shared_state for improved data extraction consistency.
… representation

- Update the return value of the ensemble prediction method to include a DataFrame with normalized predictions, leveraging shared_state for column names and index information.
- This change improves data organization and consistency across modules.
…d data accuracy

- Modify the _shift_prediction method to accept shared_state.lin_scaled_data, ensuring predictions are accurately transformed back to the original scale.
- Update the model initialization to include a PreprocessingPipeline instance for better data handling.
- Modify the assignment of extracted intensity data to use shared_state.lin_scaled_data, ensuring consistency in data handling across modules.
…ment

- Introduce lin_scaled_data attribute in the SharedState class to facilitate better handling of scaled data across modules, ensuring consistency in data processing.
- Introduce ModelConfig class to encapsulate model configuration parameters, enhancing organization and readability.
- Update the number of engineered features from 2 to 3 and adjust input size calculation accordingly.
- Consolidate configuration parameters into a structured CONFIG dictionary for better maintainability and clarity.
…management

- Remove unused feature names from MS2_FEATURE_NAMES for clarity.
- Update LOG2_TRANSFORM_FEATURES and ALIGN_FEATURES to include additional features for improved data processing.
- Introduce PRECURSOR_IDENTIFIERS in ColumnConfig to expand identifier options for better data handling.
- Introduce load_features method to load both precursor and fragment features from specified directories.
- Add logging statements to track the reading and processing of MS1 and MS2 features for better debugging.
- Update feature naming conventions to include prefixes for clarity.
- Remove unused methods to streamline the Loader class.
…a tracking

- Introduce ms2_identifiers and ms1_identifiers attributes to the SharedState class to facilitate better tracking of identifiers across modules.
- Add extracted_ms2_identifiers and extracted_ms1_identifiers attributes to improve data management and consistency in identifier handling.
…ctionality

- Add docstring to the get_logger function for better documentation.
- Ensure consistent logging format with a specified date format.
- Remove commented-out code for cleaner implementation.
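A minimal sketch of a get_logger along the lines described (the exact format strings are assumptions, not the repository's values):

```python
import logging

def get_logger(name: str = "selectlfq") -> logging.Logger:
    """Return a logger with a consistent message and date format."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeat calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter(
                fmt="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
                datefmt="%Y-%m-%d %H:%M:%S",
            )
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

The handler guard matters because logging.getLogger returns the same object per name; without it, each call would attach another handler and every message would print multiple times.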
@github-actions

The following feedback could not be added to specific lines, but still contains valuable information:

I've carefully reviewed the changes to the codebase and identified several important improvements and potential issues. Here's my analysis:

[{
  "change_id": "-1",
  "comment": "The code changes reorganize the configuration structure, improve data handling, and enhance feature engineering. The main changes include: (1) Refactoring the config.py to use a ModelConfig class, (2) Adding a SharedState class for global state management, (3) Restructuring feature handling in constants.py, (4) Improving the MS2/MS1 feature integration, and (5) Enhancing the model prediction output format. While these changes improve organization and modularity, there are several issues with error handling, performance optimizations, and code structure that need addressing."
}]

[{
  "change_id": 1,
  "file_name": "./selectlfq/config.py",
  "start_line": 1,
  "end_line": 46,
  "proposed_code": "import torch.nn as nn\nfrom selectlfq.constants import DataConfig\nfrom selectlfq.ms1_features import FeatureConfig\n\n\nclass ModelConfig:\n    \"\"\"Configuration settings for the SelectLFQ model.\n    \n    This class contains all configuration parameters for the model,\n    including feature counts, network architecture, and training parameters.\n    \"\"\"\n    # Constants\n    no_of_engineered_features = 3\n    no_of_removed_features = 1\n    input_size = (\n        len(FeatureConfig.DEFAULT_FEATURES)\n        + len(DataConfig.MS2_FEATURE_NAMES)\n        + no_of_engineered_features\n        - no_of_removed_features\n    )\n\n    # Configuration dictionary\n    CONFIG = {\n        \"criterion_params\": {\n            \"alpha\": 0.7,\n            \"epsilon\": 1e-8,\n            \"kind\": \"WVL\",\n            \"lambda1\": 0.0,\n        },\n        \"model_params\": {\n            \"input_size\": input_size,\n            \"hidden_sizes\": [input_size] * 3,\n            \"dropout_rate\": None,\n            \"activation\": nn.ReLU(),\n            \"init\": \"uniform\",\n            \"batch_norm\": True,\n            \"normalize\": False,\n            \"output_activation\": \"sigmoid\",\n        },\n        \"optmizer_params\": {\n            \"lr\": 5e-3,\n        },\n        \"fit_params\": {\n            \"epochs\": 40,\n            \"batch_size\": 64,\n            \"shuffle\": False,\n            \"train_size\": 200,\n            \"verbose\": True,\n        },\n    }",
  "comment": "Added docstring to ModelConfig class to explain its purpose and contents. This improves code readability and makes the purpose of the class clearer to other developers."
}]

[{
  "change_id": 2,
  "file_name": "./selectlfq/var_model.py",
  "start_line": 343,
  "end_line": 352,
  "proposed_code": "    def predict(\n        self,\n        dataloader: DataLoader,\n        cutoff: float,\n        threshold: int = 12,\n        no_const: int = 3000,\n    ) -> pd.DataFrame:\n        \"\"\"\n        Prediction of intensity data. Using the quality scores to filter the data,\n        based on a cutoff value.\n\n        Parameters:\n        ----------\n            dataloader (DataLoader): Dataloader.\n                Input data.\n            cutoff (float): Cutoff value.\n                cutoff value to remove low quality scores\n            threshold (int): Threshold value.\n                minimum number of valid values in a sample\n            no_const (int): Number of values to infer systematic\n                biases for sample wise normalization\n\n        Returns:\n        ----------\n            pd.DataFrame: Final predicted intensity values with rows indexed by protein groups\n                and columns representing samples.\n        \"\"\"\n\n        self.quality_score = self._predict_quality_scores(dataloader)",
  "comment": "Updated the docstring for the predict method to clarify that it returns a pandas DataFrame specifically, which better reflects the actual return type after the code changes."
}]

[{
  "change_id": 3,
  "file_name": "./selectlfq/shared_state.py",
  "start_line": 1,
  "end_line": 15,
  "proposed_code": "class SharedState:\n    \"\"\"Container for state shared between multiple classes\n    \n    This class stores data that needs to be accessed across different components\n    of the SelectLFQ pipeline. It helps avoid passing the same data between\n    multiple classes and functions.\n    \n    Attributes:\n        level_information: List of protein group identifiers\n        sorted_columns: Ordered column names for the dataset\n        lin_scaled_data: Linear-scale intensity data for shifting predictions\n        ms2_identifiers: MS2 level identifiers\n        ms1_identifiers: MS1 level identifiers\n        extracted_ms2_identifiers: Extracted MS2 identifiers after processing\n        extracted_ms1_identifiers: Extracted MS1 identifiers after processing\n    \"\"\"\n\n    def __init__(self):\n        self.level_information = None\n        self.sorted_columns = None\n        self.lin_scaled_data = None\n        self.ms2_identifiers = None\n        self.ms1_identifiers = None\n        self.extracted_ms2_identifiers = None\n        self.extracted_ms1_identifiers = None\n\n\n# Create a single instance to be imported by other modules\nshared_state = SharedState()",
  "comment": "Added comprehensive docstring to the SharedState class to explain its purpose and attributes. This improves code maintainability by clarifying the role of this global state container."
}]

[{
  "change_id": 4,
  "file_name": "./selectlfq/loader.py",
  "start_line": 72,
  "end_line": 83,
  "proposed_code": "    def load_features(self, output_folder: str) -> dict:\n        \"\"\"\n        Load all feature data from a given directory.\n        \n        Parameters\n        ----------\n        output_folder : str\n            The directory where the data is stored.\n            \n        Returns\n        -------\n        dict\n            Dictionary containing MS1 and MS2 features with keys 'ms1' and 'ms2'.\n        \"\"\"\n        precursor_features = self.load_precursor_file(output_folder)\n        fragment_features = self.load_fragment_data_files(output_folder)\n\n        return {\n            \"ms1\": precursor_features,\n            \"ms2\": fragment_features,\n        }",
  "comment": "Added type annotation and comprehensive docstring to the load_features method, which was missing. This improves code clarity and provides better information about the method's purpose and return value."
}]

[{
  "change_id": 5,
  "file_name": "./selectlfq/var_model.py",
  "start_line": 343,
  "end_line": 359,
  "proposed_code": "    def _shift_prediction(\n        self, prediction: np.ndarray, lin_scaled_data: np.ndarray\n    ) -> np.ndarray:\n        \"\"\"Shift log2-transformed predictions back to original intensity scale.\n        \n        Parameters\n        ----------\n        prediction : np.ndarray\n            Log2-transformed prediction matrix\n        lin_scaled_data : np.ndarray\n            Linear-scale intensity data used for calculating shift factors\n            \n        Returns\n        -------\n        np.ndarray\n            Shifted prediction values in log2 scale\n        \"\"\"\n        # transfrom back to original scale for shifting\n        transf_prediction = 2**prediction",
  "comment": "Added docstring to the _shift_prediction method to explain its purpose, parameters, and return value. This improves code maintainability and makes the method's purpose clearer."
}]

[{
  "change_id": 6,
  "file_name": "./selectlfq/preprocessing.py",
  "start_line": 265,
  "end_line": 278,
  "proposed_code": "    def _extract_data(self, data, level):\n        \"\"\"Extract data from the specified level and store level information in shared state.\n        \n        Parameters\n        ----------\n        data : dict\n            Dictionary of dataframes to extract from\n        level : str\n            The grouping level to extract by (e.g., 'pg')\n            \n        Returns\n        -------\n        dict\n            Dictionary with extracted data arrays\n        \"\"\"\n        logger.info(f\"Extracting data from level: {level}.\")\n\n        shared_state.level_information = list(\n            data[\"ms2_intensity\"].groupby(level).groups.keys()\n        )\n\n        for k in data.keys():\n            data[k] = self._prep.extract_data_from_level(\n                data[k],\n                level=level,\n            )\n\n        return data",
  "comment": "Added docstring to the _extract_data method to explain its parameters, return value, and its side effect of updating shared_state. This makes the behavior of the method more transparent and easier to understand."
}]

[{
  "change_id": 7,
  "file_name": "./selectlfq/featureengineering.py",
  "start_line": 53,
  "end_line": 74,
  "proposed_code": "@njit\ndef _nan_correlation_matrix(data):\n    \"\"\"Calculate correlation matrix handling NaN values efficiently.\n    \n    Parameters\n    ----------\n    data : np.ndarray\n        Input data array where rows are features\n        \n    Returns\n    -------\n    np.ndarray\n        Correlation matrix with NaN values for diagonal elements\n    \"\"\"\n    n = len(data)\n    correlation_matrix = np.full((n, n), np.nan)  # Initialize with NaN\n\n    for i in range(n):\n        for j in range(n):\n            if i != j:  # Only compute for off-diagonal elements\n                mask = np.isfinite(data[i]) & np.isfinite(data[j])\n                if np.sum(mask) > 1:  # Ensure there are at least two data points\n                    xi = data[i][mask]\n                    xj = data[j][mask]\n                    std_dev_i = np.std(xi)\n                    std_dev_j = np.std(xj)\n\n                    if (std_dev_i > 0) and (std_dev_j > 0):\n                        mean_i = np.mean(xi)\n                        mean_j = np.mean(xj)\n                        sparsity = np.mean(mask)\n                        covariance = np.mean((xi - mean_i) * (xj - mean_j))\n                        corr = covariance / (std_dev_i * std_dev_j)\n                        correlation_matrix[i, j] = corr * sparsity\n\n    return correlation_matrix",
  "comment": "Added docstring to the _nan_correlation_matrix function to explain its parameters and return value. The optimization to only compute non-diagonal elements improves efficiency, as the diagonal is always NaN."
}]

[{
  "change_id": 8,
  "file_name": "./selectlfq/var_model.py",
  "start_line": 360,
  "end_line": 384,
  "proposed_code": "        # Convert to linear scale for prediction to apply appropriate scaling factors\n        transf_prediction = 2**prediction\n\n        # Calculate sum of predicted intensities per protein\n        sum_per_protein = np.nansum(transf_prediction, axis=1)\n        \n        # Create filter masks based on prediction quality\n        pred_filt_masks = [np.isfinite(x) for x in self.filtered_intensity_layers_after_prediction]\n        \n        # Calculate sum of original intensities using the same filters\n        sum_int_unaligned = np.array(\n            [\n                np.sum(x[m]) if np.any(m) else 1.0  # Use 1.0 as fallback to avoid division by zero\n                for x, m in zip(\n                    lin_scaled_data,\n                    pred_filt_masks,\n                )\n            ]\n        )\n        \n        # Calculate multiplier factor to correct intensity scale (original/predicted ratio)\n        multiplier = np.expand_dims(sum_int_unaligned / (sum_per_protein + 1e-10), axis=1)\n\n        # Apply multiplier and convert back to log2 scale\n        shifted_prediction = np.log2(transf_prediction * multiplier)\n\n        return shifted_prediction",
  "comment": "Improved the implementation of the prediction shifting calculation to include proper error handling. Added checks to avoid division by zero and detailed comments explaining each step of the calculation. This makes the code more robust and easier to understand."
}]

[{
  "change_id": 9,
  "file_name": "./selectlfq/loader.py",
  "start_line": 139,
  "end_line": 159,
  "proposed_code": "        fragment_data_dict = {}\n\n        # If no directory is provided, use the current directory\n        if not directory:\n            directory = os.getcwd()\n            logger.info(\"No directory provided, using current directory: %s\", directory)\n\n        directory = os.path.join(directory, feature_folder)\n        if not os.path.exists(directory):\n            raise FileNotFoundError(f\"Feature directory not found: {directory}\")\n\n        for feature_name in DataConfig.MS2_FEATURE_NAMES:\n            file_name = feature_name + \".csv\"\n            file_path = os.path.join(directory, file_name)\n            \n            if not os.path.exists(file_path):\n                logger.warning(\"Feature file not found: %s\", file_path)\n                continue\n                \n            logger.info(\"Reading MS2 feature: %s\", feature_name)\n            df = pd.read_csv(file_path, index_col=0)\n            feature_name = \"ms2_\" + feature_name\n\n            fragment_data_dict[feature_name] = df",
  "comment": "Added proper error handling for file operations in the load_fragment_data_files method. The code now checks if directories and files exist and provides informative error messages, which will make debugging easier for users."
}]

[{
  "change_id": 10,
  "file_name": "./selectlfq/var_model.py",
  "start_line": 401,
  "end_line": 412,
  "proposed_code": "    def _get_intensity_layer_filt(\n        self,\n        dataloader: DataLoader,\n        cutoff: float,\n        quality_score: np.ndarray,\n        threshold: int = 12,\n    ) -> list:\n        \"\"\"Filter intensity layers based on quality scores.\n        \n        Parameters\n        ----------\n        dataloader : DataLoader\n            DataLoader containing intensity data\n        cutoff : float\n            Quality score threshold for filtering\n        quality_score : np.ndarray\n            Predicted quality scores to use for filtering\n        threshold : int, default=12\n            Minimum number of measurements required per sample\n            \n        Returns\n        -------\n        list\n            List of filtered intensity layers\n        \"\"\"\n        self.filtered_intensity_layers_after_prediction = [\n            self._filter(i, q, cutoff, threshold)\n            for i, q in zip(dataloader.intensity_layer, quality_score)\n        ]\n        return self


[diff context: the nested "for j in range(n):" loop in _nan_correlation_matrix, selectlfq/featureengineering.py]
Collaborator

if i==j: continue

saves a level of indent ;-)
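The suggested flattening can be illustrated with a runnable sketch (the compute helper is hypothetical):

```python
n = 3
pairs = []

def compute(i, j):
    """Stand-in for the per-pair correlation work."""
    pairs.append((i, j))

# Early `continue` replaces `if i != j:` and keeps the
# loop body one indentation level flatter.
for i in range(n):
    for j in range(n):
        if i == j:
            continue  # skip the diagonal
        compute(i, j)
```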

@@ -0,0 +1,15 @@
class SharedState:
Collaborator

not sure if this is a good pattern .. could the modules not just pass the data around?

@@ -1,41 +1,45 @@
import torch.nn as nn
Collaborator

question on the PR description: this is AI-generated I guess?
if so, please prompt it to be more concise there.. it should be one level above the actual code changes ;-)

if not: please do not invest so much time in that :-D

Collaborator Author

ok will do!

@vuductung vuductung merged commit 327e515 into main Mar 19, 2025
1 of 7 checks passed