-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
knn classification #310
knn classification #310
Conversation
cleaned up, added docs and code examples, adjusted tests
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## developer #310 +/- ##
=============================================
+ Coverage 47.00% 47.16% +0.15%
=============================================
Files 148 149 +1
Lines 16458 16567 +109
Branches 2219 2230 +11
=============================================
+ Hits 7736 7813 +77
- Misses 8052 8077 +25
- Partials 670 677 +7 ☔ View full report in Codecov by Sentry. |
(x : 'a) | ||
: 'l option = | ||
|
||
if Seq.isEmpty points || Seq.length points <> Seq.length labels || k <= 0 then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to pass on the consumer the need to put both points and labels in a single sequence of tuples?
It would make the interface more conforming to the array implementation and save two calls to Seq.length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many thanks for the valuable feedback!
Indeed, I also believe it would be clearer to have a single sequence of tuples, and the array version uses the type 'LabeledPoint' (which is nothing more than a more descriptive tuple).
The reason for this outcome, was the motivation by the existing Python versions, see e.g. here (Step 2)
which uses the seperate arguments, as well as the need to construct the tuple seq in case you started with seperate data (ofc. the converse argument also holds).
That being said, I am happy to change it as suggested. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @smoothdeveloper that it would be beneficial if both points and labels are passed as single sequence. This reduces user errors by accidently resorting one of the collections. If you could make this small adjustment, I'll be happy to merge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
i replaced the seperated params by a tupled sequence.
also i removed the tupe LabeledPoint
by just tuples, because it seemed as just overhead
|
||
open FSharp.Stats.DistanceMetrics | ||
|
||
module Array = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, it would be better to put RequireQualifiedAccess
attribute on the Array
and Seq
modules, as it is possible to open KNN.Array/Seq otherwise.
I'm assuming:
- those members aren't destined for raw usage from client code
- the attribute on KNN is to discourage ambiguous usage of
predict
to surface in client code - we are encouraging the usage of
Classifier
type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, thanks!
So indeed I added the RequireQualifiedAccess
to prevent misuse of the Classifier
and predict
functions.
Moving the RequireQualifiedAccess
to Array
and Seq
would leave the Classifier
'unprotected' (currently it needs to constructued with KNN.Classifier
) and I could move it to Array
too or some new module or simply rename it to KnnClassifier
.
The intention of the Classifier
versus the Array.predict
/ Seq.predict
versions was a more Python style version more functional style use. Moreover the Classifier
provides convencience methods to construct the data in the required format, and to run single and multiple predictions.
Ultimately I am not sure of what the best versions are in terms of best 'user experience'.
Happy for your thoughts and feedback!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the option to have this convenience layer! I think it is okay to have the requiredAccess added to KNN even if one could open KNN.Array. It allows to have Classifier
protected and prevent confusion
member this.fit(lps : LabeledPoint<'a, 'l> array) = | ||
this.labeledPoints <- lps | ||
|
||
member this.fit(points : 'a array, labels : 'l array) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggesting adding a comment that explains it will fail if both arrays aren't of same length
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, will add it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great to have XML comments here 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added XML comments
|
||
open FSharp.Stats.DistanceMetrics | ||
|
||
module Array = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the option to have this convenience layer! I think it is okay to have the requiredAccess added to KNN even if one could open KNN.Array. It allows to have Classifier
protected and prevent confusion
(x : 'a) | ||
: 'l option = | ||
|
||
if Seq.isEmpty points || Seq.length points <> Seq.length labels || k <= 0 then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @smoothdeveloper that it would be beneficial if both points and labels are passed as single sequence. This reduces user errors by accidently resorting one of the collections. If you could make this small adjustment, I'll be happy to merge
member this.fit(lps : LabeledPoint<'a, 'l> array) = | ||
this.labeledPoints <- lps | ||
|
||
member this.fit(points : 'a array, labels : 'l array) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great to have XML comments here 👍
Thanks for this awsome and well prepared addition @s-weil! |
Thank you for contributing to FSharp.Stats. Please take the time to tell us a bit more about your PR.
Closes 300
Please list the changes introduced in this PR
Description
The KNN algorithm is implemented in different versions [see
ML\Unsupervised\KNN.fs
], namelyCode documentation, examples and unit tests are provided.
NOTE: further specialisations of the algorithm for improved performance exist and versions are possible (such as a vector version). Moreover one could consider the feature of parallelized predictions (may only make sense in case of a big point cloud).