New Centroid-based Classifier #5103
Conversation
Comparison with master using leave-one-out cross validation of samples with ambiguous extensions:
Note that while it generally does better (in particular in cases like Assembly or C/C++/Objective-C), there are also some new errors. Cross-validation diff:
```diff
--- master.list	2020-11-14 15:17:21.414471682 +0100
+++ new-cbs-classifier.list 2020-12-05 16:03:10.484627319 +0100
@@ -16,2 +16,2 @@
-AsciiDoc/list.asc BAD(Public Key)
-Assembly/3D_PRG.I GOOD
+AsciiDoc/list.asc GOOD
+Assembly/3D_PRG.I BAD(Motorola 68K Assembly)
@@ -21,4 +21,4 @@
-Assembly/fp_sqr32_160_comba.inc BAD(C++)
-Assembly/lib.inc BAD(Motorola 68K Assembly)
-Assembly/macros.inc BAD(NASL)
-BitBake/gstreamer-libav.bb BAD(BlitzBasic)
+Assembly/fp_sqr32_160_comba.inc GOOD
+Assembly/lib.inc GOOD
+Assembly/macros.inc GOOD
+BitBake/gstreamer-libav.bb GOOD
@@ -34 +34 @@
-C++/16F88.h GOOD
+C++/16F88.h BAD(C)
@@ -36 +36 @@
-C/array.h BAD(Objective-C)
+C/array.h GOOD
@@ -39 +39 @@
-C#/AssemblyInfo.cs BAD(Smalltalk)
+C#/AssemblyInfo.cs GOOD
@@ -44 +44 @@
-C/bootstrap.h BAD(Objective-C)
+C/bootstrap.h GOOD
@@ -48 +48 @@
-C++/ClasspathVMSystemProperties.inc BAD(Pawn)
+C++/ClasspathVMSystemProperties.inc GOOD
@@ -60 +60 @@
-C/exception.zep.h GOOD
+C/exception.zep.h BAD(C++)
@@ -64 +64 @@
-C/GLKMatrix4.h BAD(Objective-C)
+C/GLKMatrix4.h GOOD
@@ -66 +66 @@
-C/hello.h BAD(C++)
+C/hello.h GOOD
@@ -73 +73 @@
-C/jni_layer.h BAD(C++)
+C/jni_layer.h GOOD
@@ -76,2 +76,2 @@
-C++/Memory16F88.h GOOD
-C++/metrics.h BAD(C)
+C++/Memory16F88.h BAD(C)
+C++/metrics.h GOOD
@@ -81 +81 @@
-C/ntru_encrypt.h BAD(C++)
+C/ntru_encrypt.h GOOD
@@ -114 +114 @@
-C++/rpc.h BAD(C)
+C++/rpc.h GOOD
@@ -118 +118 @@
-C/syscalldefs.h BAD(Objective-C)
+C/syscalldefs.h BAD(C++)
@@ -136 +136 @@
-DTrace/trace_futexes.d BAD(D)
+DTrace/trace_futexes.d GOOD
@@ -156 +156 @@
-Filterscript/colormatrix.fs BAD(GLSL)
+Filterscript/colormatrix.fs GOOD
@@ -166 +166 @@
-Formatted/long_seq.for BAD(Forth)
+Formatted/long_seq.for GOOD
@@ -191 +191 @@
-Genie/Hello.gs GOOD
+Genie/Hello.gs BAD(Gosu)
@@ -193 +193 @@
-Genie/web.gs BAD(JavaScript)
+Genie/web.gs GOOD
@@ -202 +202 @@
-Gosu/Ronin.gs GOOD
+Gosu/Ronin.gs BAD(JavaScript)
@@ -241 +241 @@
-HTML/tailDel.inc BAD(C++)
+HTML/tailDel.inc BAD(Assembly)
@@ -256 +256 @@
-INI/ms.properties BAD(Java Properties)
+INI/ms.properties GOOD
@@ -287 +287 @@
-Logos/string1.x GOOD
+Logos/string1.x BAD(RPC)
@@ -289 +289 @@
-Lua/wsapi.fcgi BAD(Perl)
+Lua/wsapi.fcgi BAD(Ruby)
@@ -291,2 +291,2 @@
-M4Sugar/list.m4 GOOD
-Makefile/foo.o.d GOOD
+M4Sugar/list.m4 BAD(M4)
+Makefile/foo.o.d BAD(UNK)
@@ -294 +294 @@
-Mathematica/HeyexImport.m BAD(Objective-C)
+Mathematica/HeyexImport.m GOOD
@@ -298 +298 @@
-Mathematica/PacletInfo.m BAD(Mercury)
+Mathematica/PacletInfo.m GOOD
@@ -329 +329 @@
-MATLAB/normalize.m BAD(M)
+MATLAB/normalize.m GOOD
@@ -381 +381 @@
-MQL5/Regex.mqh GOOD
+MQL5/Regex.mqh BAD(MQL4)
@@ -385 +385 @@
-M/ZDIOUT1.m BAD(Mercury)
+M/ZDIOUT1.m GOOD
@@ -408 +408 @@
-NewLisp/log-to-database.lisp BAD(Common Lisp)
+NewLisp/log-to-database.lisp GOOD
@@ -429 +429 @@
-Objective-C/Siesta.h BAD(C)
+Objective-C/Siesta.h GOOD
@@ -459,2 +459,2 @@
-Pawn/Check.inc BAD(PHP)
-Pawn/fixed.inc BAD(SourcePawn)
+Pawn/Check.inc GOOD
+Pawn/fixed.inc GOOD
@@ -466 +466 @@
-Perl/exception_handler.pl BAD(Raku)
+Perl/exception_handler.pl GOOD
@@ -472 +472 @@
-Perl/perl-test.t BAD(Raku)
+Perl/perl-test.t GOOD
@@ -479 +479 @@
-Perl/use5.pl BAD(Raku)
+Perl/use5.pl GOOD
@@ -484 +484 @@
-PHP/exception.zep.php BAD(Hack)
+PHP/exception.zep.php GOOD
@@ -490 +490 @@
-PHP/root.php BAD(Hack)
+PHP/root.php GOOD
@@ -506 +506 @@
-Proguard/proguard-rules.pro GOOD
+Proguard/proguard-rules.pro BAD(INI)
@@ -531,3 +531,3 @@
-q/tq.q GOOD
-Raku/01-dash-uppercase-i.t BAD(Perl)
-Raku/01-parse.t BAD(Perl)
+q/tq.q BAD(HiveQL)
+Raku/01-dash-uppercase-i.t GOOD
+Raku/01-parse.t GOOD
@@ -536 +536 @@
-Raku/A.pm BAD(Perl)
+Raku/A.pm GOOD
@@ -539,2 +539,2 @@
-Raku/calendar.t BAD(Perl)
-Raku/ContainsUnicode.pm BAD(Perl)
+Raku/calendar.t GOOD
+Raku/ContainsUnicode.pm GOOD
@@ -543 +543 @@
-Raku/hash.t GOOD
+Raku/hash.t BAD(Perl)
@@ -558 +558 @@
-Rebol/booters.r GOOD
+Rebol/booters.r BAD(R)
@@ -567 +567 @@
-RPC/rusers.x BAD(Logos)
+RPC/rusers.x GOOD
@@ -575 +575 @@
-RUNOFF/longlib.rno BAD(Roff)
+RUNOFF/longlib.rno GOOD
@@ -583 +583 @@
-SaltStack/top.sls GOOD
+SaltStack/top.sls BAD(Scheme)
@@ -595 +595 @@
-Smalltalk/categories.st GOOD
+Smalltalk/categories.st BAD(UNK)
@@ -599,2 +599,2 @@
-Smalltalk/scriptWithPragma.st BAD(HTML)
-Smalltalk/smallMethod.st GOOD
+Smalltalk/scriptWithPragma.st GOOD
+Smalltalk/smallMethod.st BAD(UNK)
@@ -603 +603 @@
-SourcePawn/mfile.inc BAD(C++)
+SourcePawn/mfile.inc BAD(Pawn)
@@ -608 +608 @@
-SuperCollider/WarpPreset.sc GOOD
+SuperCollider/WarpPreset.sc BAD(Scala)
@@ -623,3 +623,3 @@
-Text/LIDARLite.ncl GOOD
-Text/min-help.ncl BAD(NCL)
-Text/rmMonAnnCycLLT-help.ncl BAD(NCL)
+Text/LIDARLite.ncl BAD(XML)
+Text/min-help.ncl GOOD
+Text/rmMonAnnCycLLT-help.ncl GOOD
@@ -649 +649 @@
-Verilog/sha-256-functions.v BAD(V)
+Verilog/sha-256-functions.v GOOD
@@ -679 +679 @@
-XML/water.tsx BAD(TSX)
+XML/water.tsx GOOD
```

Combined with other improvements for the tokenizer, the number of errors goes down to 22.
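For context, a verdict list like the one above comes out of a leave-one-out loop; here is a minimal sketch of that procedure (a hypothetical harness, not Linguist's actual cross-validation tooling):

```ruby
# Hypothetical leave-one-out harness: for each ambiguous-extension sample,
# train on everything else and check what the held-out file is classified as.
# `samples` maps "Language/filename" to that sample's extracted features;
# `train` and `classify` are passed in as callables so the same loop can
# drive either the Bayesian or the centroid classifier.
def leave_one_out(samples, train:, classify:)
  samples.map do |name, features|
    expected  = name.split("/").first
    model     = train.call(samples.reject { |other, _| other == name })
    predicted = classify.call(model, features)
    verdict   = predicted == expected ? "GOOD" : "BAD(#{predicted})"
    "#{name} #{verdict}"
  end
end
```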
> # Public: Finalize training.
Don't you think `graduate!` would be a better name for this method? 😉
Makes sense 😉
Actually, I'm not so fond of `train!` and `finalize_train!`. I did it this way to minimize API changes, but it may make sense to replace both with a new `train!` (or `fit!`) that takes an enumerable of all training samples and does the whole thing, without an intermediate state where the classifier object is not valid.
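For illustration, a sketch of what that single-call shape could look like (`fit!` is hypothetical; it assumes the `train!(db, language, tokens)` / `finalize_train!(db)` class-method style discussed in this PR):

```ruby
# Hypothetical one-shot trainer: consumes an enumerable of
# [language, tokens] pairs and returns a finalized database, so callers
# never see the half-trained intermediate state.
module Linguist
  class Classifier
    def self.fit!(samples)
      db = {}
      samples.each { |language, tokens| train!(db, language, tokens) }
      finalize_train!(db)
      db
    end
  end
end
```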
How's it rank in terms of speed? Compared with the current Bayesian classifier, I mean.
On my machine, generating the …

```ruby
require 'linguist'

blob = Linguist::FileBlob.new("lib/linguist/heuristics.rb")
languages = [Linguist::Language['Ruby'], Linguist::Language['Python']]
(1..5000).each do
  Linguist::Classifier.call(blob, languages)
end
```

This is not a very serious benchmark for the classification part, since it's just one example file. Actually, this is less of an improvement than what I was expecting for this change. I think I have too many local branches 🤣
I'm going to hold off looking into this until the new year. As you stated, this is quite a big change which I want to thoroughly review, test, and run through a few peeps involved with the GitHub.com backend.
Makes sense. When reviewing classifier accuracy, you might be interested in my staging branch, which includes this new classifier, tokenizer improvements and cross-validation tooling.
I've started playing with the changes in this PR and from a performance perspective, this is looking pretty impressive: it seems to be about 3-4 times quicker on my MBP when checking individual files. Before I get too deep into the weeds, I think it might be a better idea to get all the dependent PRs merged so we can make a like-for-like comparison between the Bayesian and Centroid classifiers without having to consider other pending changes. It'll also make it easier for me to switch between gems when testing from the GitHub app side of things.
@smola Any chance you could weigh in on #2988? Running the same sample through the Centroid-based classifier yields the same results as our current Bayesian classifier:

```
λ linguist (new-cbs-classifier): bundle exec bin/github-linguist ~/Downloads/ChatKeys-LF.toc
ChatKeys-LF.toc: 9 lines (9 sloc)
type: Text
mime type: text/plain
language: World of Warcraft Addon Data

λ linguist (new-cbs-classifier): bundle exec bin/github-linguist ~/Downloads/ChatKeys-CR.toc
ChatKeys-CR.toc: 9 lines (9 sloc)
type: Text
mime type: text/plain
language: TeX
```

I still haven't the foggiest as to why, how, or where CR line-endings are screwing with Linguist. Hoping you might have more idea than I… 😅
It doesn't if you use the full staging branch, which pulls in the tokeniser changes:

```
[~/t/t/smola-linguist staging]$ bundle exec bin/github-linguist ~/tmp/trash/nlcr/ChatKeys-CR.toc
ChatKeys-CR.toc: 9 lines (9 sloc)
type: Text
mime type: text/plain
language: World of Warcraft Addon Data
[~/t/t/smola-linguist staging]$ bundle exec bin/github-linguist ~/tmp/trash/nlcr/ChatKeys-LF.toc
ChatKeys-LF.toc: 9 lines (9 sloc)
type: Text
mime type: text/plain
language: World of Warcraft Addon Data
[~/t/t/smola-linguist staging]$
```

☝️ This is why I stopped testing this PR directly, as I was seeing the same thing, including the failing identification tests in this PR.
I knew this had to be my fault… 😁 Thanks for the clarification. 😅
I'm currently trying to split and prepare PRs from my staging branch in a way that makes sense. So bear with me 😉
The test failure here is because our bogus file with invalid unicode chars at https://github.com/github/linguist/blob/test/attributes/test/fixtures/ba%EF%BF%BDr/file_%C3%A3.pl is now being detected as Raku instead of Perl. That's not really an issue, as we don't really care what the language is for this test; we only care that a language is detected. Changing all instances of Perl to Raku in the test should get it passing again:

https://github.com/github/linguist/blob/2d76af2adb2b4103ef9ef4cf6dd79a34bc2a841d/test/test_repository.rb#L77-L80
I've gone ahead and pushed the test fix so I can update this PR and start testing it.
This is looking good and I've not hit any obvious problems with my testing over the past few days. I do have a few nits and questions which I've outlined inline.
Any updates on this one?
@lildude I'll try to get back to this in the next few days. Sorry for the slow response. 😞
LGTM! 👍
@Alhadis dismissed their stale review via 9d922e3
Uhm… that wasn't supposed to happen for a merge affecting files untouched by this PR. Reapproving this to err on the side of caution…
YOLO 🤣
This is quite a big change and it probably needs some more polishing before merging.
Summary
This PR replaces the current Bayesian classifier with a Centroid-Based classifier.
This is not state-of-the-art machine learning, but it gives pretty good results given our current constraints on training dataset size (see #3745).
Benefits

* Scores are in the [0.0, 1.0] range, which are more intuitive and are comparable between samples.
* Much faster.

How does it work?
Training

* A fixed vocabulary is set to all tokens that appear in at least 2 samples.
* All out-of-vocabulary tokens are discarded.
* For every token, we set its Inverse Class Frequency (ICF) to `log(ct / cf) + 1`, where `ct` is the total number of classes and `cf` is the number of classes where the token occurs.
* Each sample is converted to a vector of `tf * icf` for every token in the vocabulary. `tf` is `1 + log(freq)`, where `freq` is the number of occurrences of the token in the given sample.
* Samples are L2-normalized.
* For each class (language), we compute the centroid of all its training samples by averaging them and L2-normalizing the result (see the sketch after this list).
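A plain-Ruby sketch of the training steps above (illustrative only, not the PR's actual implementation; `samples` maps each language name to an array of token lists, one list per training sample):

```ruby
# Illustrative re-implementation of the training recipe described above.
def train_centroids(samples)
  # Vocabulary: tokens appearing in at least 2 samples; everything else
  # is out-of-vocabulary and discarded.
  sample_freq = Hash.new(0)
  samples.each_value do |docs|
    docs.each { |doc| doc.uniq.each { |tok| sample_freq[tok] += 1 } }
  end
  vocab = sample_freq.select { |_, n| n >= 2 }.keys

  # ICF = log(ct / cf) + 1, with ct = number of classes and
  # cf = number of classes in which the token occurs.
  ct  = samples.size.to_f
  icf = vocab.to_h do |tok|
    cf = samples.count { |_, docs| docs.any? { |doc| doc.include?(tok) } }
    [tok, Math.log(ct / cf) + 1]
  end

  # Vectorize each sample as tf * icf (tf = 1 + log(freq)), L2-normalize,
  # then average per language and L2-normalize again to get the centroid.
  samples.to_h do |language, docs|
    vectors = docs.map do |doc|
      freq = doc.tally
      l2_normalize(vocab.map { |t| freq[t] ? (1 + Math.log(freq[t])) * icf[t] : 0.0 })
    end
    centroid = vectors.transpose.map { |col| col.sum / vectors.size }
    [language, l2_normalize(centroid)]
  end
end

def l2_normalize(vec)
  norm = Math.sqrt(vec.sum { |x| x * x })
  norm.zero? ? vec : vec.map { |x| x / norm }
end
```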
Classification

* For a new sample, we get the L2-normalized vector with `tf * icf` terms for every known token, then classify the sample using the nearest centroid. Cosine similarity is used as the similarity measure for this (sketched below).
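And the matching classification step, continuing the sketch above. Because all vectors are L2-normalized, cosine similarity reduces to a plain dot product, which is also why scores land in the [0.0, 1.0] range for non-negative `tf * icf` vectors:

```ruby
# Continuing the sketch: score a new sample against every centroid and
# pick the best. Cosine similarity of two L2-normalized vectors is just
# their dot product.
def classify(centroids, vocab, icf, tokens)
  freq = tokens.tally
  vec = l2_normalize(vocab.map { |t| freq[t] ? (1 + Math.log(freq[t])) * icf[t] : 0.0 })
  centroids
    .map { |language, centroid| [language, centroid.zip(vec).sum { |a, b| a * b }] }
    .max_by { |_, score| score }
end
```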
Backwards Compatibility

* `samples.json` changed considerably.
* `#finalize_train!` MUST be called after the last `#train!` call (see the sketch below).
* `tokens_probability`, `token_probability`, `language_probability` are gone.
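A minimal sketch of that call order (assuming `train!` and `finalize_train!` take the database hash as their first argument; `training_samples` is a hypothetical enumerable of `[language, tokens]` pairs):

```ruby
# Minimal training-session sketch. Until finalize_train! runs, the
# database is in an intermediate state and must not be used for
# classification.
require 'linguist'

db = {}
training_samples.each do |language, tokens| # training_samples is assumed
  Linguist::Classifier.train!(db, language, tokens)
end
Linguist::Classifier.finalize_train!(db) # MUST come after the last train!
```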
Future Improvements

I've been playing with methods to learn class-level weights, such as something similar to (Liu, 2017). This seems promising, but I'm not sure the results are good enough with the given size of the training data.
Also, this classifier performs even better when combined with tokenizer improvements such as #5060 and #5061 (with a few more coming up).
The structure of `samples.json` is not so easy to diff now. It could make sense to add a script in the `script` directory for convenient diff'ing of `samples.json` files (a sketch follows at the end).

Checklist

Checklist does not apply

Subsequent @lildude's changes:

* … (`samples/R/hello-r.R` & `test/fixtures/AngelScript/ClassDef.as`) and added a new improved R sample (`2.R`).
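On the diff'ing point above, a minimal sketch of what such a `script/diff-samples` helper could look like (the script name is hypothetical; it only assumes `samples.json` is ordinary nested JSON):

```ruby
#!/usr/bin/env ruby
# Hypothetical script/diff-samples: pretty-print two samples.json files
# with recursively sorted keys, then diff the normalized output.
# Usage: script/diff-samples old/samples.json new/samples.json
require 'json'
require 'tempfile'

def sort_keys(node)
  case node
  when Hash  then node.sort.to_h { |key, value| [key, sort_keys(value)] }
  when Array then node.map { |value| sort_keys(value) }
  else            node
  end
end

normalized = ARGV.map do |path|
  file = Tempfile.new(File.basename(path))
  file.write(JSON.pretty_generate(sort_keys(JSON.parse(File.read(path)))))
  file.tap(&:flush)
end

system('diff', '-u', normalized[0].path, normalized[1].path)
```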