Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataset improvements #74

Merged
merged 6 commits into from
Jan 5, 2021
Merged

Conversation

Sauro98
Copy link
Member

@Sauro98 Sauro98 commented Dec 20, 2020

Hi, I'm opening the PR a little early because I'd like to ask some questions. This PR wants to make the changes proposed in #70.

  • struct Dataset has been renamed DatasetBase
  • All crates have been updated according to the renaming of Dataset into DatasetBase
  • New types have been added:
    • Dataset: owned Array2 as records, owned Array1 as targets. The current implementation allows to write Dataset<f24,f24> or Dataset<f24,usize> and so on. Is this in line with the objective of the proposed changes?
    • DatasetView: view of an Array2 as records, view of an Array1 as targets. The current implementation allows to write DatasetView<f24,f24> or DatasetView<f24,usize> and so on. Calling view() on a Dataset produces a DatasetView.
  • Functions inside sub-crate datasets have been changed to return an instance of Dataset.

There's still k-folding missing but I figured that it would be better to have an intermediate check. I also added tests to ensure the existence of the implementations of the required methods for the two new types and they also provide a look at how the two types can be used

src/dataset/mod.rs Outdated Show resolved Hide resolved
src/dataset/impl_dataset.rs Outdated Show resolved Hide resolved
@codecov-io
Copy link

codecov-io commented Dec 20, 2020

Codecov Report

Merging #74 (f6572a1) into master (ab55fb9) will increase coverage by 1.73%.
The diff coverage is 69.02%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #74      +/-   ##
==========================================
+ Coverage   49.44%   51.17%   +1.73%     
==========================================
  Files          57       57              
  Lines        3762     3875     +113     
==========================================
+ Hits         1860     1983     +123     
+ Misses       1902     1892      -10     
Impacted Files Coverage Δ
linfa-clustering/src/appx_dbscan/algorithm.rs 28.57% <ø> (ø)
linfa-clustering/src/dbscan/algorithm.rs 50.00% <ø> (ø)
linfa-reduction/src/pca/algorithms.rs 0.00% <0.00%> (ø)
src/dataset/impl_records.rs 0.00% <ø> (ø)
src/dataset/impl_dataset.rs 36.48% <42.52%> (+24.31%) ⬆️
linfa-bayes/src/gaussian_nb.rs 57.03% <50.00%> (ø)
linfa-svm/src/classification.rs 64.05% <50.00%> (+1.72%) ⬆️
linfa-svm/src/regression.rs 44.59% <50.00%> (+7.09%) ⬆️
src/metrics_classification.rs 49.21% <60.00%> (-0.79%) ⬇️
linfa-kernel/src/lib.rs 24.73% <66.66%> (-0.27%) ⬇️
... and 24 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ab55fb9...f6572a1. Read the comment docs.

@bytesnake
Copy link
Member

bytesnake commented Dec 21, 2020 via email

@bytesnake
Copy link
Member

bytesnake commented Dec 21, 2020 via email

@Sauro98
Copy link
Member Author

Sauro98 commented Dec 29, 2020

Hi, I've added k-folding for Dataset and DatasetView types, implemented the changes mentioned in the previous comments and I've added a shy attempt at documentation. There are some questions I would like to ask:

  • Is that style/content of documentation alright? If so, or even if you have some suggestions, I would gladly write some more for the impl_dataset.rs file
  • I see that in general the methods in this file don't check for correctness of input parameters, and so I left the k parameter for k-folding unchecked. Is this alright? Or else, should methods instead return a Result catching any error with the inputs?

Of course, I'm here for any comments about the changes already implemented. Happy holidays ☃️

@bytesnake
Copy link
Member

Hi, I've added k-folding for Dataset and DatasetView types, implemented the changes mentioned in the previous comments and I've added a shy attempt at documentation. There are some questions I would like to ask:

* Is that style/content of documentation alright? If so, or even if you have some suggestions, I would gladly write some more for the `impl_dataset.rs` file

They are looking good, I think that documentation should really be improved. You can take this blogpost as a guide how to document functions: https://deterministic.space/machine-readable-inline-markdown-code-cocumentation.html

* I see that in general the methods in this file don't check for correctness of input parameters, and so I left the `k` parameter for k-folding unchecked. Is this alright? Or else, should methods instead return a `Result` catching any error with the inputs?

yes we should, but only once we have proper error handling with thiserror

Of course, I'm here for any comments about the changes already implemented. Happy holidays snowman_with_snow

thanks :)

Copy link
Member

@bytesnake bytesnake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for

  • switching to DatasetBase
  • extending testing and documentation
  • introducing k-folding

datasets/src/lib.rs Show resolved Hide resolved
linfa-clustering/benches/gaussian_mixture.rs Outdated Show resolved Hide resolved
src/dataset/impl_dataset.rs Outdated Show resolved Hide resolved
src/dataset/impl_dataset.rs Outdated Show resolved Hide resolved
src/dataset/impl_dataset.rs Outdated Show resolved Hide resolved
src/dataset/mod.rs Show resolved Hide resolved
@bytesnake
Copy link
Member

I merged the naive bayes PR, can you rebase?

@bytesnake
Copy link
Member

one idea which would improve readability:
add a field to DatasetBase called features: Vec<String> to give each feature a name. This could be used in printing correlation or decision tree etc.
But this would probably go into a follow up PR as these are already enough changes

Copy link
Member Author

@Sauro98 Sauro98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did not mean to start a review

src/dataset/impl_dataset.rs Outdated Show resolved Hide resolved
@bytesnake bytesnake merged commit 21dd579 into rust-ml:master Jan 5, 2021
@Sauro98 Sauro98 deleted the dataset_improvements branch January 6, 2021 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants