Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High number of fitError when working with high level of missingness (scp) #63

Open
leopoldguyot opened this issue Aug 26, 2024 · 2 comments
Assignees

Comments

@leopoldguyot
Copy link

The Issue

When working with msqrob2, particularly with the msqrobLm function on single-cell data, a significant number of features produce a fitError object. This issue arise from the high presence of missing data. Upon investigation, it became apparent that these non-estimated features were those where one of the reference levels of the factor variable involved in the formula had no observations. This creates a problem due to how msqrob2 handles data subsetting before modeling. Specifically, msqrob2 subsets the design matrix by removing all rows and columns without observations for the current feature. However, it cannot subset out the reference level since it is "buried" within the model formulation. Consequently, the resulting model becomes rank-deficient, rendering it impossible to fit with methods like MASS::rlm.

To understand how other tools handle this, we examined limma and found that it does not manage these special cases. Instead, limma returns all models, even if they are rank-deficient. This means that, for features with no observations in the reference level, the reference level changes during modeling. This shift leads to confusing and potentially misleading interpretations of the associated coefficients.

The Fix

To address the issue of rank deficiency, we adopted a different approach to subset the model matrix. Instead of subsetting columns of the design matrix based on the missingness of observations, we propose subsetting the columns by selecting the first x columns according to the sorting of the columns based on their pivoting rank, where x represents the rank of the model. To determine x and the pivoting rank of the columns, we perform a QR decomposition. This approach allows us to fit features that lack observations in their reference levels.

However, the coefficients of these problematic features may still pose challenges because the interpretation may change for some parameters associated with shifts in their reference levels. To manage this, we introduced a new slot in the params slot of the StatModel object called "referencePresent". This new slot contains information for each parameter, indicating whether its reference level changed during modeling.

This information is then utilized by other functions in the package, such as getContrast and varContrast. These functions now include the option to return only the contrasts involving parameters whose reference levels did not change. This update is also reflected in higher-level functions like hypothesisTest or topFeatures, which can now be used to return only the feature contrasts that do not involve these reference changes. If users still wish to obtain contrasts for these problematic cases, a new column indicating whether the reference has changed will be included in the output table of the results.

@cvanderaa
Copy link
Collaborator

Léopold made important progress on this issue, which is currently available in branch:issue63 (see also #64).

  • Note to myself: I still need to pull the branch and see if I can break the code with additional unit tests 😈😉

@cvanderaa cvanderaa self-assigned this Sep 3, 2024
@cvanderaa
Copy link
Collaborator

Also pasting a relevant section from Léopold's report:

This [PR] is not perfect as it remains limited; specifically, it does not apply when ridge regression is used. In addition, the solution could be further optimized, as it currently demands significant resources. An initial optimization effort was made, but further refinement is needed to deliver a solution ready for deployment to users.

Hence we still need to:

  • Extend to ridge estimation
  • Optimize for computational efficiency (speed)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants