Implement Principal Component Analysis (PCA) module#1086
Implement Principal Component Analysis (PCA) module#1086JAi-SATHVIK wants to merge 90 commits intofortran-lang:masterfrom
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1086 +/- ##
==========================================
- Coverage 68.55% 68.41% -0.14%
==========================================
Files 396 397 +1
Lines 12746 12772 +26
Branches 1376 1377 +1
==========================================
Hits 8738 8738
- Misses 4008 4034 +26 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@JAi-SATHVIK the build fails happen because in the CMakeLists.txt for the stats module you need to add something like target_link_libraries(stats PUBLIC blas lapack) Also, in my previous comment regarding the kinds support, what I meant was: please enable all kinds, stdlib provids the backends. Linking against optimized libraries is optional and would increase performance for simple and double precision. But all kinds are supported. |
|
Thank you @jalvesz , for the clarification! I've made the following updates: CMakeLists.txt: The Documentation: Updated the inline comments in both |
|
New fails are happening because of the dependency on the sorting module. This exposes an issue with the modularization of the library. The sorting module being at the root of src is not visible by the stats module within its own independent folder... we might need to reconsider next steps. One idea would be to make this library (pca) not a submodule of stats but a module in itself at the root of src such that it can use easily any other module such as stats, blas/lapack and sorting. Or there might be another approach to think about |
In this case, add
However, I agree with that. When I was working on #1081, I started to get "faked" circular dependencies.
If the CMake file is correctly written, a submodule "pca" should not be a problem. |
|
Thank you @jalvesz @jvdp1 for the insights. Regarding the immediate build failure, I will try adding ../stdlib_sorting.fypp to the src/stats/CMakeLists.txt as suggested to resolve the visibility issue with the sorting module. Regarding the structural change: I'm open to moving PCA to a standalone module in src/ if that aligns better with the library's goals for modularity and reducing compilation overhead. Should I proceed with the CMake fix first to verify the current logic, or would you prefer I start refactoring it into its own module now? |
|
I'll suggest to go step by step: first try to fix "as is", then let's continue the discussion on what would be the best strategy. |
|
I accidentally closed the PR I’ve reopened it. |
|
Hi @jalvesz @jvdp1 @loiseaujc , |
|
Thank you @JAi-SATHVIK . Overall it looks pretty good. I will have a closer look later. Could you update the specs, please? |
src/stats/stdlib_stats_pca.fypp
Outdated
| allocate(c(p, p), source=zero) | ||
| call syrk('U', 'T', p, n, alpha, x_centered, n, beta, c, p) | ||
|
|
||
| ! Fill lower triangle from upper triangle (syrk only fills upper) |
There was a problem hiding this comment.
This comment no longer applies.
src/stats/stdlib_stats_pca.fypp
Outdated
|
|
||
| ! Fill lower triangle from upper triangle (syrk only fills upper) | ||
| allocate(lambda(p)) | ||
| allocate(vectors(p, p)) |
There was a problem hiding this comment.
You can coalesce these two lines into
allocate(lambda(p), vectors(p, p))This is essentially a matter of personal preferences. No big deal.
src/stats/stdlib_stats_pca.fypp
Outdated
| where (lambda(1:m) > 0.0_${k1}$) | ||
| singular_values(1:m) = sqrt(lambda(1:m) * real(n-1, ${k1}$)) | ||
| elsewhere | ||
| singular_values(1:m) = 0.0_${k1}$ |
There was a problem hiding this comment.
Since singular_values is initialized to zero on line 76, I don't think you need the elsewhere statement. I might be wrong though.
|
|
||
| ! Calculate mean along dimension 1 (column means) using stdlib mean | ||
| allocate(mu(p)) | ||
| mu = mean(x, 1) |
There was a problem hiding this comment.
I think you can get rid of the allocate statement. mu = mean(x, 1) will directly allocate mu and fill it with mean(x, 1).
src/stats/stdlib_stats_pca.fypp
Outdated
| call center_data_${k1}$(x, mu) | ||
| call pca_svd_driver_${k1}$(x, n, p, components, singular_values, err0) | ||
| else | ||
| allocate(x_centered, source=x) |
There was a problem hiding this comment.
You can keep it simpler with just x_centered = x.
src/stats/stdlib_stats_pca.fypp
Outdated
| end if | ||
|
|
||
| case ("eig", "cov") | ||
| allocate(x_centered, source=x) |
There was a problem hiding this comment.
Same thing, x_centered = x is just as readable.
|
Hi, @loiseaujc These are the changes made based on review feedback:
Some improvements:
|
|
@jalvesz @jvdp1 @loiseaujc Shall I fix the issue by Increasing test tolerance multipliers to allow for minor rounding differences in high-precision calculations? |
|
The fail in the |
I think we can increase test tolerance multipliers in The numbers being multiplied are random sums. The result magnitude is roughly 40-50, not 1.0. |
|
I don't quite like |
|
Hi @JAi-SATHVIK, this is close, could you please adapt the test file to conform to the usage of test-drive please. This means a separation between the program unit and a module containing the collection of tests. Please check how the tests of other modules in the library are built before. |
loiseaujc
left a comment
There was a problem hiding this comment.
Some more minor details.
|
Thankyou, @jalvesz @loiseaujc for the suggestions I Have made the changes.
|
#1080