
First draft of the new n-dimensional arrays + NB use case #108


Merged
2 commits merged into v0.5-wip on Oct 13, 2022

Conversation

VolodymyrOrlov
Collaborator

This PR will eventually solve #85. The goal is to have an abstract layer that can be implemented by a concrete library and that enables all the traditional functionality provided by n-dimensional arrays: array manipulation, transformation, basic linear algebra routines, views, etc.

The abstract layer I am proposing here enables us to:

  • have arrays that can hold multiple types, including strings
  • implement more efficient ML algorithms that minimize memory copy operations
  • have ML algorithms that declare more granular requirements on input and output types, e.g. we can declare an input type for an algorithm that only needs a simple non-mutable array without the typical linear algebra routines.

As a bonus, this PR introduces a row-major dense matrix that might come in handy in some cases.
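To make the proposal concrete, here is a minimal sketch of what such an abstract layer might look like (all names here are illustrative, not the actual smartcore API): a read-only view trait that is generic over the element type, plus a row-major dense matrix implementing it. Because the trait is generic, an implementor can hold integers or strings, not just floats.

```rust
use std::fmt::Debug;

// Hypothetical read-only array interface; element type is generic.
pub trait ArrayView<T: Debug + Clone> {
    fn get(&self, pos: (usize, usize)) -> &T;
    fn shape(&self) -> (usize, usize);
}

// A row-major dense matrix implementing the read-only view.
pub struct DenseRowMajor<T> {
    values: Vec<T>,
    ncols: usize,
}

impl<T: Debug + Clone> DenseRowMajor<T> {
    pub fn new(values: Vec<T>, ncols: usize) -> Self {
        assert_eq!(values.len() % ncols, 0, "values must fill whole rows");
        DenseRowMajor { values, ncols }
    }
}

impl<T: Debug + Clone> ArrayView<T> for DenseRowMajor<T> {
    fn get(&self, (row, col): (usize, usize)) -> &T {
        // Row-major layout: element (row, col) lives at row * ncols + col.
        &self.values[row * self.ncols + col]
    }
    fn shape(&self) -> (usize, usize) {
        (self.values.len() / self.ncols, self.ncols)
    }
}
```

An algorithm written against `ArrayView` never needs to know whether the backing store is this dense matrix, a view into one, or a third-party type.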

@VolodymyrOrlov VolodymyrOrlov requested a review from morenol May 31, 2021 19:31
@VolodymyrOrlov VolodymyrOrlov marked this pull request as draft May 31, 2021 19:33
@VolodymyrOrlov VolodymyrOrlov linked an issue May 31, 2021 that may be closed by this pull request
@VolodymyrOrlov
Collaborator Author

@morenol there are multiple problems with this PR, like printlns, lack of documentation, etc., and I plan to fix these within the next 3-4 weeks. But I thought I'd share this PR with you so that you can take a look at the proposed changes earlier. Besides, I've decided to try the new abstract layer on NB first.

@codecov-commenter

codecov-commenter commented May 31, 2021

Codecov Report

Merging #108 (e99bd70) into v0.5-wip (e99bd70) will not change coverage.
The diff coverage is n/a.

❗ Current head e99bd70 differs from pull request most recent head 55d2c16. Consider uploading reports for the commit 55d2c16 to get more accurate results

@@            Coverage Diff            @@
##           v0.5-wip     #108   +/-   ##
=========================================
  Coverage     38.72%   38.72%           
=========================================
  Files            86       86           
  Lines          7031     7031           
=========================================
  Hits           2723     2723           
  Misses         4308     4308           


@morenol morenol left a comment

@VolodymyrOrlov this looks nice.

Since we are rethinking the way the arrays are represented, I think we should consider the use of const generics.

//! ]);
//! let y = vec![0., 0., 1., 1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 0.];
//! let y: Vec<u32> = vec![0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0];
Collaborator

It is nice to be able to use normal integers for this

Collaborator Author

The problem is that you make a call to .to_usize() in fn log_likelihood and fn fit, and this function will fail if used on negative integer values. Why would you like to support integers?
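A quick illustration of the failure mode described above (the helper name is hypothetical): converting a negative integer label to `usize` cannot succeed. `num_traits`' `ToPrimitive::to_usize` returns `None` in that situation; here std's `TryFrom` shows the same idea without an external dependency.

```rust
use std::convert::TryFrom;

// Hypothetical helper: map a class label to an index into count tables.
// Negative labels have no valid usize representation, so this yields None,
// mirroring what .to_usize() does in log_likelihood / fit.
fn label_to_index(y: i32) -> Option<usize> {
    usize::try_from(y).ok()
}
```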

Collaborator

I am sorry, I was trying to say that it is good to be able to use types other than floats (the constraint that we currently have in the main branch).

Collaborator Author

I see! I thought you wanted to make sure we can run CategoricalNB on integer target variables. Yes, that's the plan! Floats are great, but we do need to support other types as well. That's one reason why I decided to try the new layer on Naive Bayes first, since NB methods do not apply to floating-point data most of the time.

@VolodymyrOrlov
Collaborator Author

VolodymyrOrlov commented Jun 1, 2021

@VolodymyrOrlov this looks nice.

Since we are rethinking the way that the arrays are represented I think that we should consider the usage of const generics.

Any concrete ideas on how we can use const generics to improve the proposed abstract layer? For example, const generics could be used to define the number of dimensions in S

@morenol
Collaborator

morenol commented Jun 1, 2021

I think I edited your comment instead of posting a response; here it is:

Any concrete ideas how we can use const generics to improve the proposed abstract layer? For example, const generics can be used to define number of dimensions in S

I am not sure how we can use them in the abstract layer. I was thinking of the structs defined in linalg::dense::. We could create a special type for statically sized arrays. But maybe it is better to keep it simple and use nalgebra and ndarray when we want that kind of optimization
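For illustration, a minimal sketch of the statically sized array idea using const generics (names are hypothetical, not smartcore code): the shape becomes part of the type, so shape mismatches are caught at compile time rather than at runtime.

```rust
// Rows and columns are const generic parameters of the type itself.
pub struct StaticMatrix<T, const R: usize, const C: usize> {
    data: [[T; C]; R],
}

impl<T: Copy + Default, const R: usize, const C: usize> StaticMatrix<T, R, C> {
    pub fn new(data: [[T; C]; R]) -> Self {
        StaticMatrix { data }
    }

    // Shape is known at compile time, no runtime bookkeeping needed.
    pub const fn shape(&self) -> (usize, usize) {
        (R, C)
    }

    // Transpose flips the const parameters in the result type.
    pub fn transpose(&self) -> StaticMatrix<T, C, R> {
        let mut out = [[T::default(); R]; C];
        for r in 0..R {
            for c in 0..C {
                out[c][r] = self.data[r][c];
            }
        }
        StaticMatrix { data: out }
    }
}
```

This is essentially what nalgebra's statically sized matrices do, which supports the point above: unless smartcore wants that optimization itself, delegating to nalgebra or ndarray keeps the abstract layer simple.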

@Mec-iS
Collaborator

Mec-iS commented Jul 3, 2021

Great work Volodymyr! This looks really promising.
Just a few notes to check whether I understood the new abstraction and to help devs dig in:

  • we are going to have new trait groups to define types for array/vector elements: Num/Number and FloatNumber (I like it!), built directly on num_traits;
  • a lot is going on in linalg/base.rs:
    • new smartcore Array-related type declarations provide a unified interface, including the ones for optimised strided reads/writes;
    • this interface is the toolbox to be leveraged by higher-level types, in particular the new implementations in linalg/dense/;
  • we now have "view" types that, I assume, provide segmentation of the underlying structs, so the same instantiated struct can be reused and copying avoided;
  • supercool the "Decomposable"s for struct decomposition.

Will help with documentation as soon as it is relatively stable.

@Steboss check it out!

@VolodymyrOrlov
Collaborator Author

Great work Volodymyr! This looks really promising. […]

That's right! The goal is to build a rich interface for 1- and 2-dimensional arrays that can be extended by anyone later. On the one hand, we can make ML algorithms more efficient by relying on a concrete library like nalgebra or ndarray. On the other hand, investing time into an abstract layer that facilitates integration with third-party libraries gives us a lot of flexibility and saves time later, if we find it necessary to bring in some other library in the future.
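A self-contained sketch of the "granular requirements" idea from the PR description (trait and function names are hypothetical): an algorithm that only needs read access is bound to a minimal read-only trait, so any backing store implementing that trait qualifies, with no linear algebra routines required.

```rust
// Hypothetical minimal read-only interface over a 1-dimensional array.
pub trait ReadOnly1D {
    fn get(&self, i: usize) -> f64;
    fn len(&self) -> usize;
}

// A plain Vec<f64> already satisfies the interface.
impl ReadOnly1D for Vec<f64> {
    fn get(&self, i: usize) -> f64 {
        self[i]
    }
    fn len(&self) -> usize {
        Vec::len(self)
    }
}

// The algorithm declares only the access it actually needs: any type
// with read access works, whether it is a Vec, a view, or a matrix row.
pub fn mean<A: ReadOnly1D>(x: &A) -> f64 {
    (0..x.len()).map(|i| x.get(i)).sum::<f64>() / x.len() as f64
}
```

Swapping the backing store for an nalgebra or ndarray type would then only require one small trait impl, which is where the time savings mentioned above come from.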

@Mec-iS
Collaborator

Mec-iS commented Sep 20, 2022

@morenol please take a look, there are still missing implementations but the code is at least cleaned up.

@Mec-iS
Collaborator

Mec-iS commented Sep 20, 2022

  • conflicts with development should now be solved
  • run the linter, try to compile
  • there are some types that are not correct, in particular in src/linear/ridge_regression.rs and src/linear/logistic_regression.rs
  • tests need to be double-checked against the new typed structures and implementations; possibly write new ones
  • there is a problem with distance metrics
  • ... (probably hundreds of things more) 💯

@Mec-iS Mec-iS requested review from morenol and Mec-iS September 20, 2022 17:24
@Mec-iS Mec-iS added the "help wanted" label Sep 21, 2022
@Mec-iS
Collaborator

Mec-iS commented Sep 21, 2022

please fetch this branch and run cargo build --lib, maybe there are some errors for which the solution is evident to you people.

@morenol
Collaborator

morenol commented Sep 21, 2022

please fetch this branch and run cargo build --lib, maybe there are some errors for which the solution is evident to you people.

I can take a look later today

@Mec-iS
Collaborator

Mec-iS commented Sep 21, 2022

sure, when you have time. If we can fix the static analyzer (in src/ensemble) and the compiler errors, maybe we can move on to running tests 👍🏼
Also, I don't understand how the implementation of RandomForestClassifier.samples should work, as in this branch the property has been removed.

@Mec-iS
Collaborator

Mec-iS commented Sep 22, 2022

@montanalow please take a look at the ensemble and linear modules, as you are more familiar with those procedures. The objective is to make things compile and then move on to making the tests pass. Also, I saw some conflicts when pulling the very latest changes.

@Mec-iS
Collaborator

Mec-iS commented Sep 22, 2022

great job @morenol
Can you tell me why we are missing definitions for ElasticNetSearchParameters, LassoSearchParameters, and LinearRegressionSearchParameters, which are instead present in development?
I will try a merge; another hundred conflicts to solve

@morenol
Collaborator

morenol commented Sep 22, 2022

ElasticNetSearchParameters

not completely sure; we probably need to split this into several PRs once we have all the conflicts solved, in order to make it easier to review

@montanalow
Collaborator

I've got most of the merge conflicts sorted out in my personal branch, but we need to apply many of the updates in this branch to the new search params, and apply the rng changes to this branch. I think I understand it well enough to get it done in an hour or so.

@Mec-iS
Collaborator

Mec-iS commented Sep 22, 2022

@montanalow great! I was splitting the changes into multiple branches, but if you can do it in one pass that would be awesome.

@montanalow
Collaborator

@montanalow great! I was splitting the changes into multiple branches but if you can do in one pass that would be awesome.

#171 is up and should resolve all merge conflicts in this branch, although it's a fairly large change set...

@Mec-iS
Collaborator

Mec-iS commented Sep 23, 2022

The library now compiles but there are still a lot of runtime errors and tests to fix.

I don't yet have a clear understanding of what the substitute for DenseMatrix is going to be. I thought it would be Array2, but how should it be instantiated? See, for example, this test. Also, predict_oob is still broken, as I don't know how to implement it without the samples property that has been removed (it is probably supposed to use array views?).

I am still not sure we can merge this all at once; I will probably start by opening a PR to merge num.rs so we can do it gradually.

Mec-iS added a commit that referenced this pull request Sep 23, 2022
@Mec-iS
Collaborator

Mec-iS commented Sep 26, 2022

I am seeing discrepancies between the metrics::recall module in development and the one in this branch. Things are getting too entangled; please move over and follow up in #173 for a more stepwise implementation.

Mec-iS added a commit that referenced this pull request Oct 6, 2022
@Mec-iS
Collaborator

Mec-iS commented Oct 11, 2022

please help, it is quite easy: just follow the implementations in the other ported modules. cc: @morenol @montanalow @titoeb

Missing modules:

  • preprocessing
  • readers
  • svm

@Mec-iS Mec-iS marked this pull request as ready for review October 13, 2022 18:39
@Mec-iS Mec-iS changed the base branch from development to v0.5-wip October 13, 2022 18:40
@Mec-iS Mec-iS merged commit 7723bcd into v0.5-wip Oct 13, 2022

@Mec-iS Mec-iS deleted the linalg-refactoring branch October 13, 2022 18:43
@Mec-iS Mec-iS mentioned this pull request Oct 13, 2022
Labels
help wanted (Extra attention is needed)
Development

Successfully merging this pull request may close these issues.

Refactor linalg module
5 participants