New Centroid-based Classifier #5103
Conversation
Comparison with master using leave-one-out cross validation of samples with ambiguous extensions:
Note that while it generally does better (in particular in cases like Assembly or C/C++/Objective-C), there are also some new errors. Cross-validation diff:
```diff
--- master.list	2020-11-14 15:17:21.414471682 +0100
+++ new-cbs-classifier.list 2020-12-05 16:03:10.484627319 +0100
@@ -16,2 +16,2 @@
-AsciiDoc/list.asc BAD(Public Key)
-Assembly/3D_PRG.I GOOD
+AsciiDoc/list.asc GOOD
+Assembly/3D_PRG.I BAD(Motorola 68K Assembly)
@@ -21,4 +21,4 @@
-Assembly/fp_sqr32_160_comba.inc BAD(C++)
-Assembly/lib.inc BAD(Motorola 68K Assembly)
-Assembly/macros.inc BAD(NASL)
-BitBake/gstreamer-libav.bb BAD(BlitzBasic)
+Assembly/fp_sqr32_160_comba.inc GOOD
+Assembly/lib.inc GOOD
+Assembly/macros.inc GOOD
+BitBake/gstreamer-libav.bb GOOD
@@ -34 +34 @@
-C++/16F88.h GOOD
+C++/16F88.h BAD(C)
@@ -36 +36 @@
-C/array.h BAD(Objective-C)
+C/array.h GOOD
@@ -39 +39 @@
-C#/AssemblyInfo.cs BAD(Smalltalk)
+C#/AssemblyInfo.cs GOOD
@@ -44 +44 @@
-C/bootstrap.h BAD(Objective-C)
+C/bootstrap.h GOOD
@@ -48 +48 @@
-C++/ClasspathVMSystemProperties.inc BAD(Pawn)
+C++/ClasspathVMSystemProperties.inc GOOD
@@ -60 +60 @@
-C/exception.zep.h GOOD
+C/exception.zep.h BAD(C++)
@@ -64 +64 @@
-C/GLKMatrix4.h BAD(Objective-C)
+C/GLKMatrix4.h GOOD
@@ -66 +66 @@
-C/hello.h BAD(C++)
+C/hello.h GOOD
@@ -73 +73 @@
-C/jni_layer.h BAD(C++)
+C/jni_layer.h GOOD
@@ -76,2 +76,2 @@
-C++/Memory16F88.h GOOD
-C++/metrics.h BAD(C)
+C++/Memory16F88.h BAD(C)
+C++/metrics.h GOOD
@@ -81 +81 @@
-C/ntru_encrypt.h BAD(C++)
+C/ntru_encrypt.h GOOD
@@ -114 +114 @@
-C++/rpc.h BAD(C)
+C++/rpc.h GOOD
@@ -118 +118 @@
-C/syscalldefs.h BAD(Objective-C)
+C/syscalldefs.h BAD(C++)
@@ -136 +136 @@
-DTrace/trace_futexes.d BAD(D)
+DTrace/trace_futexes.d GOOD
@@ -156 +156 @@
-Filterscript/colormatrix.fs BAD(GLSL)
+Filterscript/colormatrix.fs GOOD
@@ -166 +166 @@
-Formatted/long_seq.for BAD(Forth)
+Formatted/long_seq.for GOOD
@@ -191 +191 @@
-Genie/Hello.gs GOOD
+Genie/Hello.gs BAD(Gosu)
@@ -193 +193 @@
-Genie/web.gs BAD(JavaScript)
+Genie/web.gs GOOD
@@ -202 +202 @@
-Gosu/Ronin.gs GOOD
+Gosu/Ronin.gs BAD(JavaScript)
@@ -241 +241 @@
-HTML/tailDel.inc BAD(C++)
+HTML/tailDel.inc BAD(Assembly)
@@ -256 +256 @@
-INI/ms.properties BAD(Java Properties)
+INI/ms.properties GOOD
@@ -287 +287 @@
-Logos/string1.x GOOD
+Logos/string1.x BAD(RPC)
@@ -289 +289 @@
-Lua/wsapi.fcgi BAD(Perl)
+Lua/wsapi.fcgi BAD(Ruby)
@@ -291,2 +291,2 @@
-M4Sugar/list.m4 GOOD
-Makefile/foo.o.d GOOD
+M4Sugar/list.m4 BAD(M4)
+Makefile/foo.o.d BAD(UNK)
@@ -294 +294 @@
-Mathematica/HeyexImport.m BAD(Objective-C)
+Mathematica/HeyexImport.m GOOD
@@ -298 +298 @@
-Mathematica/PacletInfo.m BAD(Mercury)
+Mathematica/PacletInfo.m GOOD
@@ -329 +329 @@
-MATLAB/normalize.m BAD(M)
+MATLAB/normalize.m GOOD
@@ -381 +381 @@
-MQL5/Regex.mqh GOOD
+MQL5/Regex.mqh BAD(MQL4)
@@ -385 +385 @@
-M/ZDIOUT1.m BAD(Mercury)
+M/ZDIOUT1.m GOOD
@@ -408 +408 @@
-NewLisp/log-to-database.lisp BAD(Common Lisp)
+NewLisp/log-to-database.lisp GOOD
@@ -429 +429 @@
-Objective-C/Siesta.h BAD(C)
+Objective-C/Siesta.h GOOD
@@ -459,2 +459,2 @@
-Pawn/Check.inc BAD(PHP)
-Pawn/fixed.inc BAD(SourcePawn)
+Pawn/Check.inc GOOD
+Pawn/fixed.inc GOOD
@@ -466 +466 @@
-Perl/exception_handler.pl BAD(Raku)
+Perl/exception_handler.pl GOOD
@@ -472 +472 @@
-Perl/perl-test.t BAD(Raku)
+Perl/perl-test.t GOOD
@@ -479 +479 @@
-Perl/use5.pl BAD(Raku)
+Perl/use5.pl GOOD
@@ -484 +484 @@
-PHP/exception.zep.php BAD(Hack)
+PHP/exception.zep.php GOOD
@@ -490 +490 @@
-PHP/root.php BAD(Hack)
+PHP/root.php GOOD
@@ -506 +506 @@
-Proguard/proguard-rules.pro GOOD
+Proguard/proguard-rules.pro BAD(INI)
@@ -531,3 +531,3 @@
-q/tq.q GOOD
-Raku/01-dash-uppercase-i.t BAD(Perl)
-Raku/01-parse.t BAD(Perl)
+q/tq.q BAD(HiveQL)
+Raku/01-dash-uppercase-i.t GOOD
+Raku/01-parse.t GOOD
@@ -536 +536 @@
-Raku/A.pm BAD(Perl)
+Raku/A.pm GOOD
@@ -539,2 +539,2 @@
-Raku/calendar.t BAD(Perl)
-Raku/ContainsUnicode.pm BAD(Perl)
+Raku/calendar.t GOOD
+Raku/ContainsUnicode.pm GOOD
@@ -543 +543 @@
-Raku/hash.t GOOD
+Raku/hash.t BAD(Perl)
@@ -558 +558 @@
-Rebol/booters.r GOOD
+Rebol/booters.r BAD(R)
@@ -567 +567 @@
-RPC/rusers.x BAD(Logos)
+RPC/rusers.x GOOD
@@ -575 +575 @@
-RUNOFF/longlib.rno BAD(Roff)
+RUNOFF/longlib.rno GOOD
@@ -583 +583 @@
-SaltStack/top.sls GOOD
+SaltStack/top.sls BAD(Scheme)
@@ -595 +595 @@
-Smalltalk/categories.st GOOD
+Smalltalk/categories.st BAD(UNK)
@@ -599,2 +599,2 @@
-Smalltalk/scriptWithPragma.st BAD(HTML)
-Smalltalk/smallMethod.st GOOD
+Smalltalk/scriptWithPragma.st GOOD
+Smalltalk/smallMethod.st BAD(UNK)
@@ -603 +603 @@
-SourcePawn/mfile.inc BAD(C++)
+SourcePawn/mfile.inc BAD(Pawn)
@@ -608 +608 @@
-SuperCollider/WarpPreset.sc GOOD
+SuperCollider/WarpPreset.sc BAD(Scala)
@@ -623,3 +623,3 @@
-Text/LIDARLite.ncl GOOD
-Text/min-help.ncl BAD(NCL)
-Text/rmMonAnnCycLLT-help.ncl BAD(NCL)
+Text/LIDARLite.ncl BAD(XML)
+Text/min-help.ncl GOOD
+Text/rmMonAnnCycLLT-help.ncl GOOD
@@ -649 +649 @@
-Verilog/sha-256-functions.v BAD(V)
+Verilog/sha-256-functions.v GOOD
@@ -679 +679 @@
-XML/water.tsx BAD(TSX)
+XML/water.tsx GOOD
```

Combined with other improvements for the tokenizer, the number of errors goes down to 22.
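For context, a verdict list like the one above comes out of a leave-one-out loop; here is a minimal sketch of that procedure (a hypothetical harness, not Linguist's actual cross-validation tooling):

```ruby
# Hypothetical leave-one-out harness: for each ambiguous-extension sample,
# train on everything else and check what the held-out file is classified as.
# `samples` maps "Language/filename" to that sample's extracted features;
# `train` and `classify` are passed in as callables so the same loop can
# drive either the Bayesian or the centroid classifier.
def leave_one_out(samples, train:, classify:)
  samples.map do |name, features|
    expected  = name.split("/").first
    model     = train.call(samples.reject { |other, _| other == name })
    predicted = classify.call(model, features)
    verdict   = predicted == expected ? "GOOD" : "BAD(#{predicted})"
    "#{name} #{verdict}"
  end
end
```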
> # Public: Finalize training.
Don't you think `graduate!` would be a better name for this method? 😉
Makes sense 😉
Actually, I'm not so fond of `train!` and `finalize_train!`. I did it this way to minimize API changes, but it may make sense to replace both with a new `train!` (or `fit!`) that takes an enumerable of all training samples and does the whole thing, without an intermediate state where the classifier object is not valid.
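For illustration, a sketch of what that single-call shape could look like (`fit!` is hypothetical; it assumes the `train!(db, language, tokens)` / `finalize_train!(db)` class-method style discussed in this PR):

```ruby
# Hypothetical one-shot trainer: consumes an enumerable of
# [language, tokens] pairs and returns a finalized database, so callers
# never see the half-trained intermediate state.
module Linguist
  class Classifier
    def self.fit!(samples)
      db = {}
      samples.each { |language, tokens| train!(db, language, tokens) }
      finalize_train!(db)
      db
    end
  end
end
```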
How's it rank in terms of speed? Compared with the current Bayesian classifier, I mean.
On my machine, generating the …

```ruby
require 'linguist'

blob = Linguist::FileBlob.new("lib/linguist/heuristics.rb")
languages = [Linguist::Language['Ruby'], Linguist::Language['Python']]
(1..5000).each do
  Linguist::Classifier.call(blob, languages)
end
```

This is not a very serious benchmark for the classification part, since it's just one example file. Actually, this is less of an improvement than what I was expecting for this change. I think I have too many local branches 🤣
I'm going to hold off looking into this until the new year. As you stated, this is quite a big change which I want to thoroughly review, test, and run through a few peeps involved with the GitHub.com backend.
Makes sense. When reviewing classifier accuracy, you might be interested in my staging branch, which includes this new classifier, tokenizer improvements and cross-validation tooling.
I've started playing with the changes in this PR and from a performance perspective, this is looking pretty impressive: it seems to be about 3-4 times quicker on my MBP when checking individual files. Before I get too deep into the weeds, I think it might be a better idea to get all the dependent PRs merged so we can make a like-for-like comparison between the Bayesian and Centroid classifiers without having to consider other pending changes. It'll also make it easier for me to switch between gems when testing from the GitHub app side of things.
@smola Any chance you could weigh in on #2988? Running the same sample through the Centroid-based classifier yields the same results as our current Bayesian classifier:

```
λ linguist (new-cbs-classifier): bundle exec bin/github-linguist ~/Downloads/ChatKeys-LF.toc
ChatKeys-LF.toc: 9 lines (9 sloc)
type: Text
mime type: text/plain
language: World of Warcraft Addon Data

λ linguist (new-cbs-classifier): bundle exec bin/github-linguist ~/Downloads/ChatKeys-CR.toc
ChatKeys-CR.toc: 9 lines (9 sloc)
type: Text
mime type: text/plain
language: TeX
```

I still haven't the foggiest as to why, how, or where CR line-endings are screwing with Linguist. Hoping you might have more idea than I… 😅
It doesn't if you use the full staging branch, which pulls in the tokeniser changes:

```
[~/t/t/smola-linguist staging]$ bundle exec bin/github-linguist ~/tmp/trash/nlcr/ChatKeys-CR.toc
ChatKeys-CR.toc: 9 lines (9 sloc)
type: Text
mime type: text/plain
language: World of Warcraft Addon Data
[~/t/t/smola-linguist staging]$ bundle exec bin/github-linguist ~/tmp/trash/nlcr/ChatKeys-LF.toc
ChatKeys-LF.toc: 9 lines (9 sloc)
type: Text
mime type: text/plain
language: World of Warcraft Addon Data
[~/t/t/smola-linguist staging]$
```

☝️ This is why I stopped testing this PR directly, as I was seeing the same thing, including the failing identification tests in this PR.
I knew this had to be my fault… 😁 Thanks for the clarification. 😅
I'm currently trying to split and prepare PRs from my staging branch in a way that makes sense. So bear with me 😉
The test failure here is because our bogus file with invalid unicode chars at https://github.com/github/linguist/blob/test/attributes/test/fixtures/ba%EF%BF%BDr/file_%C3%A3.pl is now being detected as Raku instead of Perl. That's not really an issue, as we don't really care what the language is for this test; we only care that a language is detected. Changing all instances of Perl to Raku in the test should get it passing again:

https://github.com/github/linguist/blob/2d76af2adb2b4103ef9ef4cf6dd79a34bc2a841d/test/test_repository.rb#L77-L80
I've gone ahead and pushed the test fix so I can update this PR and start testing it.
This is looking good and I've not hit any obvious problems with my testing over the past few days. I do have a few nits and questions which I've outlined inline.
Any updates on this one?
@lildude I'll try to get back to this in the next few days. Sorry for the slow response. 😞
LGTM! 👍
@Alhadis dismissed their stale review via 9d922e3
Uhm… that wasn't supposed to happen for a merge affecting files untouched by this PR. Reapproving this to err on the side of caution…
YOLO 🤣
This is quite a big change and it probably needs some more polishing before merging.
Summary
This PR replaces the current Bayesian classifier with a Centroid-Based classifier.
This is not state-of-the-art machine learning, but it gives pretty good results given our current constraints on training dataset size (see #3745).
Benefits

* Scores are in the [0.0, 1.0] range, which are more intuitive and are comparable between samples.
* Much faster.

How does it work?
Training

* A fixed vocabulary is set to all tokens that appear in at least 2 samples.
* All out-of-vocabulary tokens are discarded.
* For every token, we set its Inverse Class Frequency (ICF) to `log(ct / cf) + 1`, where `ct` is the total number of classes and `cf` is the number of classes where the token occurs.
* Each sample is converted to a vector of `tf * icf` for every token in the vocabulary. `tf` is `1 + log(freq)`, where `freq` is the number of occurrences of the token in the given sample.
* Samples are L2-normalized.
* For each class (language), we compute the centroid of all its training samples by averaging them and L2-normalizing the result (see the sketch after this list).
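A plain-Ruby sketch of the training steps above (illustrative only, not the PR's actual implementation; `samples` maps each language name to an array of token lists, one list per training sample):

```ruby
# Illustrative re-implementation of the training recipe described above.
def train_centroids(samples)
  # Vocabulary: tokens appearing in at least 2 samples; everything else
  # is out-of-vocabulary and discarded.
  sample_freq = Hash.new(0)
  samples.each_value do |docs|
    docs.each { |doc| doc.uniq.each { |tok| sample_freq[tok] += 1 } }
  end
  vocab = sample_freq.select { |_, n| n >= 2 }.keys

  # ICF = log(ct / cf) + 1, with ct = number of classes and
  # cf = number of classes in which the token occurs.
  ct  = samples.size.to_f
  icf = vocab.to_h do |tok|
    cf = samples.count { |_, docs| docs.any? { |doc| doc.include?(tok) } }
    [tok, Math.log(ct / cf) + 1]
  end

  # Vectorize each sample as tf * icf (tf = 1 + log(freq)), L2-normalize,
  # then average per language and L2-normalize again to get the centroid.
  samples.to_h do |language, docs|
    vectors = docs.map do |doc|
      freq = doc.tally
      l2_normalize(vocab.map { |t| freq[t] ? (1 + Math.log(freq[t])) * icf[t] : 0.0 })
    end
    centroid = vectors.transpose.map { |col| col.sum / vectors.size }
    [language, l2_normalize(centroid)]
  end
end

def l2_normalize(vec)
  norm = Math.sqrt(vec.sum { |x| x * x })
  norm.zero? ? vec : vec.map { |x| x / norm }
end
```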
Classification

* For a new sample, we get the L2-normalized vector with `tf * icf` terms for every known token, then classify the sample using the nearest centroid. Cosine similarity is used as the similarity measure for this (sketched below).
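And the matching classification step, continuing the sketch above. Because all vectors are L2-normalized, cosine similarity reduces to a plain dot product, which is also why scores land in the [0.0, 1.0] range for non-negative `tf * icf` vectors:

```ruby
# Continuing the sketch: score a new sample against every centroid and
# pick the best. Cosine similarity of two L2-normalized vectors is just
# their dot product.
def classify(centroids, vocab, icf, tokens)
  freq = tokens.tally
  vec = l2_normalize(vocab.map { |t| freq[t] ? (1 + Math.log(freq[t])) * icf[t] : 0.0 })
  centroids
    .map { |language, centroid| [language, centroid.zip(vec).sum { |a, b| a * b }] }
    .max_by { |_, score| score }
end
```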
Backwards Compatibility

* `samples.json` changed considerably.
* `#finalize_train!` MUST be called after the last `#train!` call (see the sketch below).
* `tokens_probability`, `token_probability`, `language_probability` are gone.
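A minimal sketch of that call order (assuming `train!` and `finalize_train!` take the database hash as their first argument; `training_samples` is a hypothetical enumerable of `[language, tokens]` pairs):

```ruby
# Minimal training-session sketch. Until finalize_train! runs, the
# database is in an intermediate state and must not be used for
# classification.
require 'linguist'

db = {}
training_samples.each do |language, tokens| # training_samples is assumed
  Linguist::Classifier.train!(db, language, tokens)
end
Linguist::Classifier.finalize_train!(db) # MUST come after the last train!
```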
Future Improvements

I've been playing with methods to learn class-level weights, such as something similar to (Liu, 2017). This seems promising, but I'm not sure the results are good enough with the given size of the training data.
Also, this classifier performs even better when combined with tokenizer improvements such as #5060 and #5061 (with a few more coming up).
The structure of `samples.json` is not so easy to diff now. It could make sense to add a script in the `script` directory for convenient diff'ing of `samples.json` files (a sketch follows at the end).

Checklist

Checklist does not apply

Subsequent @lildude's changes:

* … (`samples/R/hello-r.R` & `test/fixtures/AngelScript/ClassDef.as`) and added a new improved R sample (`2.R`).
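On the diff'ing point above, a minimal sketch of what such a `script/diff-samples` helper could look like (the script name is hypothetical; it only assumes `samples.json` is ordinary nested JSON):

```ruby
#!/usr/bin/env ruby
# Hypothetical script/diff-samples: pretty-print two samples.json files
# with recursively sorted keys, then diff the normalized output.
# Usage: script/diff-samples old/samples.json new/samples.json
require 'json'
require 'tempfile'

def sort_keys(node)
  case node
  when Hash  then node.sort.to_h { |key, value| [key, sort_keys(value)] }
  when Array then node.map { |value| sort_keys(value) }
  else            node
  end
end

normalized = ARGV.map do |path|
  file = Tempfile.new(File.basename(path))
  file.write(JSON.pretty_generate(sort_keys(JSON.parse(File.read(path)))))
  file.tap(&:flush)
end

system('diff', '-u', normalized[0].path, normalized[1].path)
```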