Skip to content

Implement Principal Component Analysis (PCA) module#1086

Open
JAi-SATHVIK wants to merge 90 commits intofortran-lang:masterfrom
JAi-SATHVIK:master
Open

Implement Principal Component Analysis (PCA) module#1086
JAi-SATHVIK wants to merge 90 commits intofortran-lang:masterfrom
JAi-SATHVIK:master

Conversation

@JAi-SATHVIK
Copy link
Contributor

@JAi-SATHVIK JAi-SATHVIK commented Jan 6, 2026

@JAi-SATHVIK JAi-SATHVIK marked this pull request as draft January 6, 2026 02:37
@codecov
Copy link

codecov bot commented Jan 6, 2026

Codecov Report

❌ Patch coverage is 0% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.41%. Comparing base (e43b5b9) to head (cb908bf).
⚠️ Report is 39 commits behind head on master.

Files with missing lines Patch % Lines
example/stats/example_pca.f90 0.00% 26 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1086      +/-   ##
==========================================
- Coverage   68.55%   68.41%   -0.14%     
==========================================
  Files         396      397       +1     
  Lines       12746    12772      +26     
  Branches     1376     1377       +1     
==========================================
  Hits         8738     8738              
- Misses       4008     4034      +26     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jalvesz jalvesz linked an issue Jan 6, 2026 that may be closed by this pull request
@jalvesz
Copy link
Contributor

jalvesz commented Jan 7, 2026

@JAi-SATHVIK the build fails happen because in the CMakeLists.txt for the stats module you need to add something like target_link_libraries(stats PUBLIC blas lapack)

Also, in my previous comment regarding the kinds support, what I meant was: please enable all kinds, stdlib provids the backends. Linking against optimized libraries is optional and would increase performance for simple and double precision. But all kinds are supported.

@JAi-SATHVIK
Copy link
Contributor Author

Thank you @jalvesz , for the clarification! I've made the following updates:

CMakeLists.txt: The target_link_libraries(stats PUBLIC blas lapack) is already in place in src/stats/CMakeLists.txt which links the stats module against stdlib's internal BLAS/LAPACK targets.

Documentation: Updated the inline comments in both
stdlib_stats_pca.fypp and stdlib_stats.fypp to accurately reflect that:
All real kinds (sp, dp, xdp, qp) are supported by stdlib's internal BLAS/LAPACK backends.

@jalvesz
Copy link
Contributor

jalvesz commented Jan 7, 2026

New fails are happening because of the dependency on the sorting module. This exposes an issue with the modularization of the library. The sorting module being at the root of src is not visible by the stats module within its own independent folder... we might need to reconsider next steps.

One idea would be to make this library (pca) not a submodule of stats but a module in itself at the root of src such that it can use easily any other module such as stats, blas/lapack and sorting. Or there might be another approach to think about

Cc @jvdp1 @perazz

@jvdp1
Copy link
Member

jvdp1 commented Jan 7, 2026

New fails are happening because of the dependency on the sorting module. This exposes an issue with the modularization of the library. The sorting module being at the root of src is not visible by the stats module within its own independent folder...

In this case, add ../stdlib_sorting.fypp in the CMakeLists.txt should be enough (similarly to stdlib_string_type.fypp already present in CMakeLists.txt)

we might need to reconsider next steps.

However, I agree with that. When I was working on #1081, I started to get "faked" circular dependencies.

One idea would be to make this library (pca) not a submodule of stats but a module in itself at the root of src such that it can use easily any other module such as stats, blas/lapack and sorting. Or there might be another approach to think about

If the CMake file is correctly written, a submodule "pca" should not be a problem.
However, in terms of efficiency for the stats module (but also for other modules, like stdlib_linalg), it might be good to use modules instead of submodules: as fpm compiles only what is needed, if a user only needs mean, then the blas and lapack modules are currently not compiled (if I am correct). However, if the pca procedures are added as submodules of the stats module, then the blas and lapack will be compiled, even if only the procedure mean is used by the user. Similar "issues" can easily happen for stdlib_linalg.

@JAi-SATHVIK
Copy link
Contributor Author

Thank you @jalvesz @jvdp1 for the insights.

Regarding the immediate build failure, I will try adding ../stdlib_sorting.fypp to the src/stats/CMakeLists.txt as suggested to resolve the visibility issue with the sorting module.

Regarding the structural change: I'm open to moving PCA to a standalone module in src/ if that aligns better with the library's goals for modularity and reducing compilation overhead. Should I proceed with the CMake fix first to verify the current logic, or would you prefer I start refactoring it into its own module now?

@jalvesz
Copy link
Contributor

jalvesz commented Jan 8, 2026

I'll suggest to go step by step: first try to fix "as is", then let's continue the discussion on what would be the best strategy.

@JAi-SATHVIK
Copy link
Contributor Author

I accidentally closed the PR I’ve reopened it.

@JAi-SATHVIK JAi-SATHVIK reopened this Jan 24, 2026
@JAi-SATHVIK JAi-SATHVIK requested a review from loiseaujc January 24, 2026 19:14
@JAi-SATHVIK
Copy link
Contributor Author

Hi @jalvesz @jvdp1 @loiseaujc ,
This PCA implementation is now updated according to the latest review comments I have addressed the suggestions about eigenvalue ordering, avoiding unnecessary symmetrization, replacing scalar loops with array operations If you notice any remaining issues with style, performance, or CI configuration that would block merging, please let me know and I’ll be happy to fix them.
Thankyou,

@jvdp1
Copy link
Member

jvdp1 commented Jan 26, 2026

Thank you @JAi-SATHVIK . Overall it looks pretty good. I will have a closer look later. Could you update the specs, please?

allocate(c(p, p), source=zero)
call syrk('U', 'T', p, n, alpha, x_centered, n, beta, c, p)

! Fill lower triangle from upper triangle (syrk only fills upper)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment no longer applies.


! Fill lower triangle from upper triangle (syrk only fills upper)
allocate(lambda(p))
allocate(vectors(p, p))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can coalesce these two lines into

allocate(lambda(p), vectors(p, p))

This is essentially a matter of personal preferences. No big deal.

where (lambda(1:m) > 0.0_${k1}$)
singular_values(1:m) = sqrt(lambda(1:m) * real(n-1, ${k1}$))
elsewhere
singular_values(1:m) = 0.0_${k1}$
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since singular_values is initialized to zero on line 76, I don't think you need the elsewhere statement. I might be wrong though.


! Calculate mean along dimension 1 (column means) using stdlib mean
allocate(mu(p))
mu = mean(x, 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can get rid of the allocate statement. mu = mean(x, 1) will directly allocate mu and fill it with mean(x, 1).

call center_data_${k1}$(x, mu)
call pca_svd_driver_${k1}$(x, n, p, components, singular_values, err0)
else
allocate(x_centered, source=x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can keep it simpler with just x_centered = x.

end if

case ("eig", "cov")
allocate(x_centered, source=x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing, x_centered = x is just as readable.

@JAi-SATHVIK JAi-SATHVIK requested a review from loiseaujc January 27, 2026 16:28
@JAi-SATHVIK
Copy link
Contributor Author

Hi, @loiseaujc

These are the changes made based on review feedback:

  • Removed outdated comment about filling lower triangle (no longer applicable with upper_a=.true.)
  • Combined separate allocate statements into single calls
  • Simplified allocate(x_centered, source=x)x_centered = x (auto-allocation)
  • Removed redundant allocate(mu(p)) before assignment
  • Removed unused variables i, j from pca_eigh_driver

Some improvements:

  • Added input validation for empty arrays (n < 1 or p < 1)
  • Made method string comparison case-insensitive ("SVD", "svd", "Svd" all work)

@JAi-SATHVIK
Copy link
Contributor Author

@jalvesz @jvdp1 @loiseaujc
Issue:
qp matmul tests failed because the numerical tolerance was too strict for different multiplication orders.

Shall I fix the issue by Increasing test tolerance multipliers to allow for minor rounding differences in high-precision calculations?

@loiseaujc
Copy link
Contributor

The fail in the intrinsics test is something I've encountered quite a few times lately as well. I think it would deserve a proper PR but I ain't exactly sure how to fix. The easiest way would be replace the precision check using a tolerance epsilon(1.0_${k}$) to sqrt(epsilon(1.0_${k}$) but it is far less stringent. Not entirely sure how to proceed there.

@JAi-SATHVIK
Copy link
Contributor Author

The fail in the intrinsics test is something I've encountered quite a few times lately as well. I think it would deserve a proper PR but I ain't exactly sure how to fix. The easiest way would be replace the precision check using a tolerance epsilon(1.0_${k}$) to sqrt(epsilon(1.0_${k}$) but it is far less stringent. Not entirely sure how to proceed there.

I think we can increase test tolerance multipliers in test\intrinsics\test_intrinsics.fypp right?

The numbers being multiplied are random sums. The result magnitude is roughly 40-50, not 1.0.
A tolerance of 2000 * epsilon is still extremely strict compared to sqrt(epsilon(1.0_${k}$).
If this works I can raise a separate PR for this.

@loiseaujc
Copy link
Contributor

I don't quite like sqrt(epsilon(1.0_${rk}$)) either (although it is a simple solution). I dislike even more artificially increasing the constant. A better way would be to look at error estimates for matrix-matrix multiplication in floating points. I would have to double check some of my old lecture notes but it would basically involve the product of the Frobenius norms of the different factors. A simple remedy would thus to normalize all the matrices to have a Frobenius norm of one. I'll create a local branch on my fork and test this to see if I get rid of the failure.

@jalvesz
Copy link
Contributor

jalvesz commented Feb 3, 2026

Hi @JAi-SATHVIK, this is close, could you please adapt the test file to conform to the usage of test-drive please. This means a separation between the program unit and a module containing the collection of tests. Please check how the tests of other modules in the library are built before.

Copy link
Contributor

@loiseaujc loiseaujc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more minor details.

@JAi-SATHVIK
Copy link
Contributor Author

JAi-SATHVIK commented Feb 6, 2026

Thankyou, @jalvesz @loiseaujc for the suggestions I Have made the changes.

  • Refactored test_pca.fypp separating the test module from the program runner.
  • Optimized stdlib_stats_pca.fypp by condensing where blocks and merging multiple min calls into single statements.
  • Integrated stdlib_ascii:to_lower for clearer more robust case-insensitive string handling in method selection.

@JAi-SATHVIK JAi-SATHVIK requested a review from loiseaujc February 6, 2026 08:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement Principal Component Analysis (PCA) module

5 participants