Conversation
Commits:

- …uctures and expected outputs for mean distance calculations and handle mismatched data lengths in correlation tests.
- …_feature_df methods in TestPreprocessing
- … method in TestPreprocessingPipeline
- …n ms2_identifiers while ensuring unique rows in test_data.
- …lations and data point counting methods. Refactor correlation matrix computation for efficiency and clarity. Update feature engineering pipelines to include new features and improve logging.
- …amline feature selection process.
- …ngineering. Introduce new methods for MS1-MS2 correlation calculations, data synchronization, and improved logging. Update type hints for better clarity and maintainability. Streamline data processing methods to handle dictionaries of DataFrames more effectively.
- …to improve code clarity.
@@ -51,27 +51,48 @@ def _calculate_variance_distance(data):
The correlation matrix should have 1.0 along the diagonal to represent the perfect correlation of each fragment with itself. Currently, the diagonal elements are left as zeros, which is incorrect for a correlation matrix.
```python
@njit
def _nan_correlation_matrix(data: np.ndarray) -> np.ndarray:
    """
    Calculate the correlation matrix of the data.

    Parameters:
        data: numpy array of shape (n_fragments, n_samples)
            The data to calculate the correlation matrix of.

    Returns:
        numpy array of shape (n_fragments, n_fragments)
            containing the correlation matrix of the data.
    """
    n = len(data)
    correlation_matrix = np.zeros((n, n))
    for i in range(n):
        correlation_matrix[i, i] = 1.0  # Set diagonal to 1.0 (perfect self-correlation)
        for j in range(i + 1, n):  # Only compute upper triangle
            mask = np.isfinite(data[i]) & np.isfinite(data[j])
            if np.sum(mask) > 1:
                xi = data[i][mask]
                xj = data[j][mask]
```

@@ -139,112 +195,169 @@ def _clean_data(self, data, level):
The method is dropping columns but not storing the remaining columns in the `not_dropped_columns` attribute in the dictionary case. Storing them would maintain consistency with the behavior in the DataFrame case and match the log message.
```python
def _drop_columns(self, data):
    if isinstance(data, dict):
        result = {
            k: v.drop(columns=ColumnConfig.IDENTIFIERS, errors="ignore")
            for k, v in data.items()
        }
        # Store the surviving columns, mirroring the DataFrame branch below
        self.not_dropped_columns = list(next(iter(result.values())).columns)
        return result
    elif isinstance(data, pd.DataFrame):
        logger.info(
            "Columns that are not dropped can be found under attribute 'not_dropped_columns'."
        )
        result = data.drop(columns=ColumnConfig.IDENTIFIERS, errors="ignore")
        self.not_dropped_columns = list(result.columns)
        return result
```
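Worth noting for the snippet above: `errors="ignore"` makes `DataFrame.drop` skip any listed column that does not exist instead of raising a `KeyError`, which is what lets the same identifier list be applied to frames with differing columns. A minimal standalone sketch (column names invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]})

# "missing" is not a column of df; errors="ignore" skips it instead of raising
result = df.drop(columns=["id", "missing"], errors="ignore")
print(list(result.columns))  # → ['value']
```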
The return value shape comment is incorrect. Based on the implementation, the method returns an array of shape (n_fragments,) not (n_samples, n_features) as currently documented.
```python
def _nan_correlation_w_ref(data_ref_pairs: tuple[np.ndarray, np.ndarray]) -> np.ndarray:
    """
    Calculate the correlation between each row of data and the reference row.

    Parameters:
        data_ref_pairs: tuple[np.ndarray, np.ndarray]
            A tuple containing two numpy arrays:
            - data: numpy array of shape (n_fragments, n_samples)
            - ref: numpy array of shape (n_samples,) or (1, n_samples)

    Returns:
        numpy array of shape (n_fragments,)
            containing the correlation between each row of data and the reference row.
    """
    data, ref = data_ref_pairs
    if ref.ndim > 1 and ref.shape[0] == 1:
        ref = ref[0]
```
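A quick way to convince yourself of the corrected shape: a tiny pure-NumPy stand-in for the masked correlation (a sketch under the shapes given in the docstring, not the project's implementation) returns one value per fragment row:

```python
import numpy as np

def corr_with_ref(data: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Correlate each row of data with ref, ignoring non-finite entries."""
    out = np.full(len(data), np.nan)
    for i, row in enumerate(data):
        mask = np.isfinite(row) & np.isfinite(ref)
        if mask.sum() > 1:
            out[i] = np.corrcoef(row[mask], ref[mask])[0, 1]
    return out

data = np.array([[1.0, 2.0, np.nan, 4.0], [4.0, 3.0, 2.0, 1.0]])
ref = np.array([1.0, 2.0, 3.0, 4.0])
result = corr_with_ref(data, ref)
print(result.shape)  # → (2,), i.e. (n_fragments,)
```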
Added a check to ensure the lengths of ms1_data_extracted and ms2_data_extracted match before zipping them. This will provide a more helpful error message rather than potential silent errors or less informative error messages later in the process.
```python
def calculate_ms1_ms2_corr(
    self,
    ms1_data_extracted: list[np.ndarray],
    ms2_data_extracted: list[np.ndarray],
) -> pd.DataFrame:
    """
    Calculate the correlation between the ms1 and ms2 data.

    Parameters:
        ms1_data_extracted: list[np.ndarray]
            The ms1 data extracted from the precursor data.
        ms2_data_extracted: list[np.ndarray]
            The ms2 data extracted from the precursor data.

    Returns:
        ms1_ms2_corr_data: pd.DataFrame
            The ms1-ms2 correlation data.
    """
    if len(ms1_data_extracted) != len(ms2_data_extracted):
        raise ValueError("MS1 and MS2 data must have the same number of elements")
    zipped_data = zip(ms2_data_extracted, ms1_data_extracted)
    ms1_ms2_corr = self.parallel_process(
        inputs=zipped_data, method=_nan_correlation_w_ref, n_jobs=10
    )
    ms1_ms2_corr = pd.DataFrame(np.vstack(ms1_ms2_corr))
```

@@ -77,10 +77,10 @@ def setUp(self):
Added debugging output to the test to provide more information if the test fails. This will make it easier to diagnose issues with the _calculate_mean_distance function.
```python
def test_calculate_mean_distance(self):
    data = np.array([[1, 2], [2, 4]])
    expected_output = np.array([[0.5, 1.0], [0.5, 1.0]])
    result = self.helper._calculate_mean_distance(data)
    # Print output for debugging if the test fails
    if not np.array_equal(result, expected_output):
        print(f"Expected: {expected_output}, Got: {result}")
    np.testing.assert_array_equal(result, expected_output)
```
Added a proper docstring to the method, including Parameters and Returns sections, to improve code documentation and make the method's purpose and usage clearer.
```python
def feature_engineering_pipeline_for_height_and_ms1_intensity(self, data):
    """
    Feature engineering pipeline for height and MS1 intensity data.

    Parameters
    ----------
    data : array-like
        The input data for feature engineering.

    Returns
    -------
    list
        List of engineered features.
    """
    logger.info("Calculating additional features.")
    mean_corr, std_corr = zip(
        *self._feat_eng.parallel_process(
            data, self._feat_eng.feature_engineering, func="corr", n_jobs=10
        )
    )
```

@@ -109,26 +108,83 @@ def process(
Added a proper docstring to the _feat_eng_preprocess_pipeline method, including Parameters and Returns sections, to improve code documentation.
```python
def _feat_eng_preprocess_pipeline(
    self,
    data: dict[str, pd.DataFrame],
    level: str,
) -> tuple[list[torch.Tensor], list[torch.Tensor]]:
    """
    Preprocess pipeline for feature engineering.

    Parameters
    ----------
    data : dict[str, pd.DataFrame]
        Dictionary containing MS1 and MS2 data.
    level : str
        Level to extract data from (e.g., 'pg' for protein group).

    Returns
    -------
    tuple[list[torch.Tensor], list[torch.Tensor]]
        Tuple containing standardized feature data and intensity data.
    """
    logger.info(
        "Processing MS1 data: %d features found - %s",
        len(data["ms1"]),  # dict keys are already unique; no set() needed
        ", ".join(data["ms1"].keys()),
    )
    logger.info(
        "Processing MS2 data: %d features found - %s",
        len(data["ms2"]),
        ", ".join(data["ms2"].keys()),
    )
```

```diff
     self, data: pd.DataFrame, level: str = "pg", return_index: bool = False
-) -> list:
+) -> Union[List[np.ndarray], Tuple[List[np.ndarray]]]:
     """
```
Updated the return type hint to more accurately reflect the function's behavior when return_index is True. The function returns a tuple of two lists in that case, not a list of tuples as implied by the original return hint.
```python
def extract_data_from_level(
    self, data: pd.DataFrame, level: str = "pg", return_index: bool = False
) -> Union[List[np.ndarray], Tuple[List[np.ndarray], List[pd.MultiIndex]]]:
    """
    Create a list of np.array based on the level to group by.

    The data is grouped by the level and the values are
    extracted and stored in a list.

    Parameters
    ----------
    data : pd.DataFrame
        The data to be processed.
    level : str
        The level to group by. Either "pg" for protein group
        or "mod_seq_charge_hash" for precursor.
    return_index : bool, optional
        If True, the index is also returned. The index is used
        to identify the protein group or precursor the data
        belongs to.

    Returns
    -------
    Union[List[np.ndarray], Tuple[List[np.ndarray], List[pd.MultiIndex]]]
        A list of np.array with the extracted values, and optionally a list of indices.
    """
```
from selectlfq.dataloader import DataLoader

class TestModel(unittest.TestCase):
Added comments to clarify the structure of the test data, which makes it easier to understand the dimensions and purpose of the arrays in the feature_layer test data.
```python
def setUp(self):
    # Initialize test data with realistic shapes and values
    self.intensity_layer = [np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])]
    self.feature_layer = [
        np.array(
            [
                [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],  # Sample 1, features for 2 fragments
                [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]],  # Sample 2, features for 2 fragments
            ]
        )
    ]
    # Create a dataloader with our test data
    self.dataloader = DataLoader(
        feature_layer=self.feature_layer, intensity_layer=self.intensity_layer
    )
    # Initialize the model with our dataloader
    self.model = Model(dataloader=self.dataloader)
```
Overall, the codebase has been improved with better error handling, documentation, and optimizations.

The following feedback could not be added to specific lines, but still contains valuable information:
selectlfq/featureengineering.py (Outdated)

```python
                else:
                    correlation_matrix[i, j] = (
                        0.0  # No correlation if standard deviation is zero
                    )
                    correlation_matrix[j, i] = 0.0
            else:
                correlation_matrix[i, j] = 0.0  # Not enough data points
                correlation_matrix[j, i] = 0.0
```
Similar to a few days ago: initialize with 0.0 and save the two `else` branches.
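The reviewer's point, sketched as plain NumPy (without `@njit`) to show the idea: since the matrix is pre-filled with zeros, both `else` branches are redundant and can simply be removed. This is a hypothetical condensed version, not the project's exact code:

```python
import numpy as np

def nan_correlation_matrix(data: np.ndarray) -> np.ndarray:
    n = len(data)
    # Zero initialisation already encodes "no correlation" for the
    # zero-variance and insufficient-data cases, so no else branches are needed
    corr = np.zeros((n, n))
    for i in range(n):
        corr[i, i] = 1.0  # perfect self-correlation on the diagonal
        for j in range(i + 1, n):
            mask = np.isfinite(data[i]) & np.isfinite(data[j])
            if np.sum(mask) > 1:
                xi, xj = data[i][mask], data[j][mask]
                si, sj = np.std(xi), np.std(xj)
                if si > 0.0 and sj > 0.0:
                    c = np.mean((xi - xi.mean()) * (xj - xj.mean())) / (si * sj)
                    corr[i, j] = corr[j, i] = c
    return corr

# Rows 0 and 1 are perfectly correlated; row 2 has zero variance
x = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 1.0, 1.0]])
print(nan_correlation_matrix(x))
```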
…clarity and maintainability of correlation calculations.
No description provided.