
New Centroid-based Classifier #5103

Merged — 18 commits merged into github-linguist:master from new-cbs-classifier on Aug 29, 2024

Conversation

@smola (Contributor) commented Dec 5, 2020

This is quite a big change and it probably needs some more polishing before merging.

Summary

This PR replaces the current Bayesian classifier with a Centroid-Based classifier.

This is not state-of-the-art machine learning, but it gives pretty good results given our current constraints on training dataset size (see #3745).

Benefits

  • Accuracy is much higher. There's more to test here, but I think the error rate should drop to about half.
  • It produces scores in the [0.0, 1.0] range, which are more intuitive and are comparable between samples.
  • For empty files or files without known tokens, it produces a 0.0 score, instead of a very small one. This means the classifier is able to produce no results instead of arbitrary results in that case.
  • At classification time, there is no multiplication of small numbers and no log-probabilities. Coupled with the fact that unknown tokens are discarded, this makes classification much faster.
  • Discarding unknown tokens also produces more stable results.
  • L2-normalization of samples avoids bias towards large samples in the training data.
  • Using logarithmic term frequency instead of absolute counts means that we are not so biased towards repetitive tokens such as parentheses or braces. See Misclassified C as C++ in data.table project #5028 and Add a C sample defining a large initialized array #5015.

How does it work?

Training

  • The vocabulary is fixed to all tokens that appear in at least 2 samples.
  • All out-of-vocabulary tokens are discarded.
  • For every token, we set its Inverse Class Frequency (ICF) to
    log(ct / cf) + 1 where ct is the total number of classes and cf is
    the number of classes where the token occurs.
  • Each sample is converted to a vector of tf * icf for every token in
    the vocabulary. tf is 1 + log(freq), where freq is the
    number of occurrences of the token in the given sample.
  • Samples are L2-normalized.
  • For each class (language), we compute the centroid of all its training
    samples by averaging them and L2-normalizing the result.
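
A minimal Ruby sketch of this training procedure (hypothetical helper and variable names for illustration only; this is not the actual Linguist::Classifier API):

# Hypothetical sketch: build per-language centroids from tokenized samples.
# samples_by_language maps a language name to an array of token arrays.
def train_centroids(samples_by_language)
  # Vocabulary: tokens that appear in at least 2 samples.
  sample_counts = Hash.new(0)
  samples_by_language.each_value do |samples|
    samples.each { |tokens| tokens.uniq.each { |t| sample_counts[t] += 1 } }
  end
  vocab = sample_counts.select { |_, c| c >= 2 }.keys

  # Inverse Class Frequency: log(ct / cf) + 1.
  ct = samples_by_language.size.to_f
  icf = vocab.to_h do |t|
    cf = samples_by_language.count { |_, samples| samples.any? { |s| s.include?(t) } }
    [t, Math.log(ct / cf) + 1]
  end

  # Per-sample tf * icf vectors (tf = 1 + log(freq)), L2-normalized,
  # then averaged and re-normalized into one centroid per language.
  centroids = {}
  samples_by_language.each do |language, samples|
    vectors = samples.map do |tokens|
      freqs = tokens.tally
      l2_normalize(vocab.map { |t| freqs[t] ? (1 + Math.log(freqs[t])) * icf[t] : 0.0 })
    end
    mean = vectors.transpose.map { |column| column.sum / vectors.size }
    centroids[language] = l2_normalize(mean)
  end

  [vocab, icf, centroids]
end

def l2_normalize(vec)
  norm = Math.sqrt(vec.sum { |x| x * x })
  norm.zero? ? vec : vec.map { |x| x / norm }
end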

Classification

  • For a new sample, we get the L2-normalized vector with tf * icf
    terms for every known token, then classify the sample using the nearest
    centroid. Cosine similarity is used as similarity measure for this.
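
And a matching classification sketch, reusing the hypothetical helpers from the training sketch above:

# Hypothetical sketch: score a new sample against each language centroid.
def classify(tokens, vocab, icf, centroids)
  freqs = tokens.tally
  vec = l2_normalize(vocab.map { |t| freqs[t] ? (1 + Math.log(freqs[t])) * icf[t] : 0.0 })
  # Both vectors are L2-normalized, so cosine similarity is just a dot product;
  # with non-negative tf * icf terms the score always lands in [0.0, 1.0].
  scores = centroids.transform_values { |c| c.zip(vec).sum { |a, b| a * b } }
  scores.max_by { |_, score| score }  # => [language, score]; the score is 0.0 if no known tokens are present
end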

Backwards Compatibility

  • The structure of samples.json changed considerably.
  • During training, #finalize_train! MUST be called after the last #train! call (see the sketch after this list).
  • The internal (but publicly exposed) methods tokens_probability, token_probability, language_probability are gone.
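
For illustration, a hedged sketch of the new training call sequence; the exact argument names and sample objects are assumed here, not taken from this PR:

# Assumed usage: train on each sample, then finalize once at the end.
db = {}
samples.each do |sample|
  Linguist::Classifier.train!(db, sample.language, sample.tokens)
end
Linguist::Classifier.finalize_train!(db)  # MUST follow the last #train! call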

Future Improvements

I've been playing with methods to learn class-level weights, such as something similar to (Liu, 2017). This seems promising but I'm not sure the results are good enough with the given size of the training data.

Also, this classifier performs even better when combined with tokenizer improvements such as #5060 and #5061 (with a few more coming up).

The structure of samples.json is not so easy to diff now. It could make sense to add a script in the script directory for convenient diff'ing of samples.json files.

Checklist

Checklist does not apply


Subsequent changes by @lildude:

  • Removed two terrible samples (samples/R/hello-r.R & test/fixtures/AngelScript/ClassDef.as) and added a new improved R sample (2.R):

@smola (Contributor Author) commented Dec 5, 2020

Comparison with master using leave-one-out cross validation of samples with ambiguous extensions:

  • Total errors in master: 58
  • Total errors in this branch: 32

Note that while it generally does better, particularly in some cases like Assembly or C/C++/Objective-C, it also introduces some new errors.

cross-validation diff

--- master.list	2020-11-14 15:17:21.414471682 +0100
+++ new-cbs-classifier.list	2020-12-05 16:03:10.484627319 +0100
@@ -16,2 +16,2 @@
-AsciiDoc/list.asc BAD(Public Key)
-Assembly/3D_PRG.I GOOD
+AsciiDoc/list.asc GOOD
+Assembly/3D_PRG.I BAD(Motorola 68K Assembly)
@@ -21,4 +21,4 @@
-Assembly/fp_sqr32_160_comba.inc BAD(C++)
-Assembly/lib.inc BAD(Motorola 68K Assembly)
-Assembly/macros.inc BAD(NASL)
-BitBake/gstreamer-libav.bb BAD(BlitzBasic)
+Assembly/fp_sqr32_160_comba.inc GOOD
+Assembly/lib.inc GOOD
+Assembly/macros.inc GOOD
+BitBake/gstreamer-libav.bb GOOD
@@ -34 +34 @@
-C++/16F88.h GOOD
+C++/16F88.h BAD(C)
@@ -36 +36 @@
-C/array.h BAD(Objective-C)
+C/array.h GOOD
@@ -39 +39 @@
-C#/AssemblyInfo.cs BAD(Smalltalk)
+C#/AssemblyInfo.cs GOOD
@@ -44 +44 @@
-C/bootstrap.h BAD(Objective-C)
+C/bootstrap.h GOOD
@@ -48 +48 @@
-C++/ClasspathVMSystemProperties.inc BAD(Pawn)
+C++/ClasspathVMSystemProperties.inc GOOD
@@ -60 +60 @@
-C/exception.zep.h GOOD
+C/exception.zep.h BAD(C++)
@@ -64 +64 @@
-C/GLKMatrix4.h BAD(Objective-C)
+C/GLKMatrix4.h GOOD
@@ -66 +66 @@
-C/hello.h BAD(C++)
+C/hello.h GOOD
@@ -73 +73 @@
-C/jni_layer.h BAD(C++)
+C/jni_layer.h GOOD
@@ -76,2 +76,2 @@
-C++/Memory16F88.h GOOD
-C++/metrics.h BAD(C)
+C++/Memory16F88.h BAD(C)
+C++/metrics.h GOOD
@@ -81 +81 @@
-C/ntru_encrypt.h BAD(C++)
+C/ntru_encrypt.h GOOD
@@ -114 +114 @@
-C++/rpc.h BAD(C)
+C++/rpc.h GOOD
@@ -118 +118 @@
-C/syscalldefs.h BAD(Objective-C)
+C/syscalldefs.h BAD(C++)
@@ -136 +136 @@
-DTrace/trace_futexes.d BAD(D)
+DTrace/trace_futexes.d GOOD
@@ -156 +156 @@
-Filterscript/colormatrix.fs BAD(GLSL)
+Filterscript/colormatrix.fs GOOD
@@ -166 +166 @@
-Formatted/long_seq.for BAD(Forth)
+Formatted/long_seq.for GOOD
@@ -191 +191 @@
-Genie/Hello.gs GOOD
+Genie/Hello.gs BAD(Gosu)
@@ -193 +193 @@
-Genie/web.gs BAD(JavaScript)
+Genie/web.gs GOOD
@@ -202 +202 @@
-Gosu/Ronin.gs GOOD
+Gosu/Ronin.gs BAD(JavaScript)
@@ -241 +241 @@
-HTML/tailDel.inc BAD(C++)
+HTML/tailDel.inc BAD(Assembly)
@@ -256 +256 @@
-INI/ms.properties BAD(Java Properties)
+INI/ms.properties GOOD
@@ -287 +287 @@
-Logos/string1.x GOOD
+Logos/string1.x BAD(RPC)
@@ -289 +289 @@
-Lua/wsapi.fcgi BAD(Perl)
+Lua/wsapi.fcgi BAD(Ruby)
@@ -291,2 +291,2 @@
-M4Sugar/list.m4 GOOD
-Makefile/foo.o.d GOOD
+M4Sugar/list.m4 BAD(M4)
+Makefile/foo.o.d BAD(UNK)
@@ -294 +294 @@
-Mathematica/HeyexImport.m BAD(Objective-C)
+Mathematica/HeyexImport.m GOOD
@@ -298 +298 @@
-Mathematica/PacletInfo.m BAD(Mercury)
+Mathematica/PacletInfo.m GOOD
@@ -329 +329 @@
-MATLAB/normalize.m BAD(M)
+MATLAB/normalize.m GOOD
@@ -381 +381 @@
-MQL5/Regex.mqh GOOD
+MQL5/Regex.mqh BAD(MQL4)
@@ -385 +385 @@
-M/ZDIOUT1.m BAD(Mercury)
+M/ZDIOUT1.m GOOD
@@ -408 +408 @@
-NewLisp/log-to-database.lisp BAD(Common Lisp)
+NewLisp/log-to-database.lisp GOOD
@@ -429 +429 @@
-Objective-C/Siesta.h BAD(C)
+Objective-C/Siesta.h GOOD
@@ -459,2 +459,2 @@
-Pawn/Check.inc BAD(PHP)
-Pawn/fixed.inc BAD(SourcePawn)
+Pawn/Check.inc GOOD
+Pawn/fixed.inc GOOD
@@ -466 +466 @@
-Perl/exception_handler.pl BAD(Raku)
+Perl/exception_handler.pl GOOD
@@ -472 +472 @@
-Perl/perl-test.t BAD(Raku)
+Perl/perl-test.t GOOD
@@ -479 +479 @@
-Perl/use5.pl BAD(Raku)
+Perl/use5.pl GOOD
@@ -484 +484 @@
-PHP/exception.zep.php BAD(Hack)
+PHP/exception.zep.php GOOD
@@ -490 +490 @@
-PHP/root.php BAD(Hack)
+PHP/root.php GOOD
@@ -506 +506 @@
-Proguard/proguard-rules.pro GOOD
+Proguard/proguard-rules.pro BAD(INI)
@@ -531,3 +531,3 @@
-q/tq.q GOOD
-Raku/01-dash-uppercase-i.t BAD(Perl)
-Raku/01-parse.t BAD(Perl)
+q/tq.q BAD(HiveQL)
+Raku/01-dash-uppercase-i.t GOOD
+Raku/01-parse.t GOOD
@@ -536 +536 @@
-Raku/A.pm BAD(Perl)
+Raku/A.pm GOOD
@@ -539,2 +539,2 @@
-Raku/calendar.t BAD(Perl)
-Raku/ContainsUnicode.pm BAD(Perl)
+Raku/calendar.t GOOD
+Raku/ContainsUnicode.pm GOOD
@@ -543 +543 @@
-Raku/hash.t GOOD
+Raku/hash.t BAD(Perl)
@@ -558 +558 @@
-Rebol/booters.r GOOD
+Rebol/booters.r BAD(R)
@@ -567 +567 @@
-RPC/rusers.x BAD(Logos)
+RPC/rusers.x GOOD
@@ -575 +575 @@
-RUNOFF/longlib.rno BAD(Roff)
+RUNOFF/longlib.rno GOOD
@@ -583 +583 @@
-SaltStack/top.sls GOOD
+SaltStack/top.sls BAD(Scheme)
@@ -595 +595 @@
-Smalltalk/categories.st GOOD
+Smalltalk/categories.st BAD(UNK)
@@ -599,2 +599,2 @@
-Smalltalk/scriptWithPragma.st BAD(HTML)
-Smalltalk/smallMethod.st GOOD
+Smalltalk/scriptWithPragma.st GOOD
+Smalltalk/smallMethod.st BAD(UNK)
@@ -603 +603 @@
-SourcePawn/mfile.inc BAD(C++)
+SourcePawn/mfile.inc BAD(Pawn)
@@ -608 +608 @@
-SuperCollider/WarpPreset.sc GOOD
+SuperCollider/WarpPreset.sc BAD(Scala)
@@ -623,3 +623,3 @@
-Text/LIDARLite.ncl GOOD
-Text/min-help.ncl BAD(NCL)
-Text/rmMonAnnCycLLT-help.ncl BAD(NCL)
+Text/LIDARLite.ncl BAD(XML)
+Text/min-help.ncl GOOD
+Text/rmMonAnnCycLLT-help.ncl GOOD
@@ -649 +649 @@
-Verilog/sha-256-functions.v BAD(V)
+Verilog/sha-256-functions.v GOOD
@@ -679 +679 @@
-XML/water.tsx BAD(TSX)
+XML/water.tsx GOOD

Combined with other improvements to the tokenizer, the number of errors goes down to 22.


nil
end

# Public: Finalize training.
Collaborator commented on this snippet (lib/linguist/classifier.rb):
Don't you think graduate! would be a better name for this method? 😉

@smola (Contributor Author) replied:
Makes sense 😉

Actually, I'm not so fond of train! and finalize_train!. I did it this way to minimize API changes, but it may make sense to replace both with a new train! (or fit!) that takes an enumerable of all training samples and does the whole thing, without an intermediate state where the classifier object is not valid.

lib/linguist/classifier.rb — inline review thread (outdated, resolved)
@Alhadis (Collaborator) commented Dec 5, 2020

How's it rank in terms of speed? Compared with the current Bayesian classifier, I mean.

@smola (Contributor Author) commented Dec 5, 2020

@Alhadis How's it rank in terms of speed? Compared with the current Bayesian classifier, I mean.

On my machine, generating the samples.json goes from ~6.5s in master to ~3.9s. For the following script it goes from ~2.7s to ~2.6s:

require 'linguist'
blob = Linguist::FileBlob.new("lib/linguist/heuristics.rb")
languages = [Linguist::Language['Ruby'], Linguist::Language['Python']]
(1..5000).each do
  Linguist::Classifier.call(blob, languages)
end

This is not a very serious benchmark for the classification part, since it's just one example file.

Actually, this is less of an improvement than what I was expecting for this change. I think I have too many local branches 🤣

@lildude (Member) commented Dec 11, 2020

I'm going to hold off looking into this until the new year. As you stated, this is quite a big change which I want to thoroughly review, test and run through a few peeps involved with the GitHub.com backend.

@lildude self-assigned this on Dec 11, 2020
@smola (Contributor Author) commented Dec 11, 2020

@lildude I'm going to hold off looking into this until the new year. As you stated, this is quite a big change which I want to thoroughly review, test and run through a few peeps involved with the GitHub.com backend.

Makes sense. When reviewing classifier accuracy, you might be interested in my staging branch which includes this new classifier, tokenizer improvements and cross-validation tooling.

@lildude requested a review from a team as a code owner on January 4, 2021
@lildude (Member) commented Jan 6, 2021

I've started playing with the changes in this PR and from a performance perspective, this is looking pretty impressive and seems to be about 3-4 times quicker on my MBP when checking individual files.

Before I get too deep into the weeds, I think it might be a better idea to get all the dependent PRs merged so we can make a like-for-like comparison between the Bayesian and Centroid classifiers without having to consider other pending changes. It'll also make it easier for me to switch between gems when testing from the GitHub app side of things.

@Alhadis (Collaborator) commented Jan 8, 2021

@smola Any chance you could weigh in on #2988? Running the same sample through the Centroid-based classifier yields the same results as our current Bayesian classifier:

λ linguist (new-cbs-classifier): bundle exec bin/github-linguist ~/Downloads/ChatKeys-LF.toc
ChatKeys-LF.toc: 9 lines (9 sloc)
  type:      Text
  mime type: text/plain
  language:  World of Warcraft Addon Data
λ linguist (new-cbs-classifier): bundle exec bin/github-linguist ~/Downloads/ChatKeys-CR.toc
ChatKeys-CR.toc: 9 lines (9 sloc)
  type:      Text
  mime type: text/plain
  language:  TeX

I still haven't the foggiest as to why, how, or where CR line-endings are screwing with Linguist. Hoping you might have more idea than I… 😅

@lildude (Member) commented Jan 8, 2021

Running the same sample through the Centroid-based classifier yields the same results as our current Bayesian classifier:

It doesn't if you use the full staging branch which pulls in the tokeniser changes:

[~/t/t/smola-linguist staging]$ bundle exec bin/github-linguist ~/tmp/trash/nlcr/ChatKeys-CR.toc
ChatKeys-CR.toc: 9 lines (9 sloc)
  type:      Text
  mime type: text/plain
  language:  World of Warcraft Addon Data
[~/t/t/smola-linguist staging]$ bundle exec bin/github-linguist ~/tmp/trash/nlcr/ChatKeys-LF.toc
ChatKeys-LF.toc: 9 lines (9 sloc)
  type:      Text
  mime type: text/plain
  language:  World of Warcraft Addon Data
[~/t/t/smola-linguist staging]$

Also, this classifier performs even better when combined with tokenizer improvements such as #5060 and #5061 (with a few more coming up).

☝️ is why I stopped testing this PR directly as I was seeing the same thing, including the failing identification tests in this PR.

@Alhadis (Collaborator) commented Jan 8, 2021

It doesn't if you use the full staging branch which pulls in the tokeniser changes:

I knew this had to be my fault… 😁 Thanks for the clarification. 😅

@smola (Contributor Author) commented Jan 12, 2021

I'm currently trying to split and prepare PRs from my staging branch in a way that makes sense. So bear with me 😉

@smola force-pushed the new-cbs-classifier branch 3 times, most recently from 1ab03b7 to f4cf832 on February 13, 2021
@lildude (Member) commented Mar 8, 2021

The test failure here is because our bogus file with invalid unicode chars at https://github.com/github/linguist/blob/test/attributes/test/fixtures/ba%EF%BF%BDr/file_%C3%A3.pl is now being detected as Raku instead of Perl.

That's not really an issue as we don't really care what the language is for this test. We only care that a language is detected.

Changing all instances of Perl to Raku in the test should get this passing again. https://github.com/github/linguist/blob/2d76af2adb2b4103ef9ef4cf6dd79a34bc2a841d/test/test_repository.rb#L77-L80

@lildude (Member) commented Mar 19, 2021

I've gone ahead and pushed the test fix so I can update this PR and start testing it.

@lildude (Member) left a comment:

This is looking good and I've not hit any obvious problems with my testing over the past few days. I do have a few nits and questions which I've outlined inline.

Inline review threads (all resolved) on: lib/linguist/samples.rb, test/test_classifier.rb, lib/linguist/classifier.rb, test/test_blob.rb, test/test_file_blob.rb
@lildude removed their assignment on May 26, 2021
smola and others added 4 commits June 18, 2021 17:29
(Commit message repeats the Training and Classification description from the PR body above.)
Co-authored-by: Colin Seymour <colin@github.com>
@lildude requested a review from a team as a code owner on September 8, 2023
@Randl commented Jun 8, 2024

Any updates on this one?

@lildude requested a review from Alhadis on August 6, 2024
@lildude (Member) commented Aug 6, 2024

Right... after a really long time of ignoring this PR, I think I've addressed the concerns I originally raised and I'm happy with this now.

@Alhadis @smola time to dig out those rusty memories and let's see if we can put this to bed.

@Alhadis (Collaborator) commented Aug 9, 2024

time to dig out those rusty memories

@lildude I'll try to get back to this in the next few days. Sorry for the slow response. 😞

@Alhadis previously approved these changes Aug 14, 2024
@Alhadis (Collaborator) left a comment:

LGTM! 👍 :shipit:

@Alhadis dismissed their stale review via 9d922e3

@Alhadis (Collaborator) left a comment:

Uhm… that wasn't supposed to happen for a merge affecting files untouched by this PR. Reapproving this to err on the side of caution…

@lildude (Member) left a comment:

YOLO 🤣

@lildude added this pull request to the merge queue Aug 29, 2024
Merged via the queue into github-linguist:master with commit f7c7baa Aug 29, 2024
5 checks passed