Proposal: 0 relevancy by default #2826

@joshgoebel

The great relevancy cleanup:

  • 1c
  • abnf
  • accesslog
  • actionscript
  • ada
  • angelscript
  • apache
  • applescript
  • arcade
  • arduino
  • armasm
  • asciidoc
  • aspectj
  • autohotkey
  • autoit
  • avrasm
  • awk
  • axapta
  • bash
  • basic
  • bnf
  • brainfuck
  • c
  • cal
  • capnproto
  • ceylon
  • clean
  • clojure-repl
  • clojure
  • cmake
  • coffeescript
  • coq
  • cos
  • cpp
  • crmsh
  • crystal
  • csharp
  • csp
  • css
  • d
  • dart
  • delphi
  • diff
  • django
  • dns
  • dockerfile
  • dos
  • dsconfig
  • dts
  • dust
  • ebnf
  • elixir
  • elm
  • erb
  • erlang-repl
  • erlang
  • excel
  • fix
  • flix
  • fortran
  • fsharp
  • gams
  • gauss
  • gcode
  • gherkin
  • glsl
  • gml
  • go
  • golo
  • gradle
  • graphql
  • groovy
  • haml
  • handlebars
  • haskell
  • haxe
  • hsp
  • http
  • hy
  • inform7
  • ini
  • irpf90
  • isbl
  • java
  • javascript
  • jboss-cli
  • json
  • julia-repl
  • julia
  • kotlin
  • lasso
  • latex
  • ldif
  • leaf
  • less
  • lisp
  • livecodeserver
  • livescript
  • llvm
  • lsl
  • lua
  • makefile
  • markdown
  • mathematica
  • matlab
  • maxima
  • mel
  • mercury
  • mipsasm
  • mizar
  • mojolicious
  • monkey
  • moonscript
  • n1ql
  • nestedtext
  • nginx
  • nim
  • nix
  • node-repl
  • nsis
  • objectivec
  • ocaml
  • openscad
  • oxygene
  • parser3
  • perl
  • pf
  • pgsql
  • php-template
  • php
  • plaintext
  • pony
  • powershell
  • processing
  • profile
  • prolog
  • properties
  • protobuf
  • puppet
  • purebasic
  • python-repl
  • python
  • q
  • qml
  • r
  • reasonml
  • rib
  • roboconf
  • routeros
  • rsl
  • ruby
  • ruleslanguage
  • rust
  • sas
  • scala
  • scheme
  • scilab
  • scss
  • shell
  • smali
  • smalltalk
  • sml
  • sqf
  • sql
  • stan
  • stata
  • step21
  • stylus
  • subunit
  • swift
  • taggerscript
  • tap
  • tcl
  • thrift
  • tp
  • twig
  • typescript
  • vala
  • vbnet
  • vbscript-html
  • vbscript
  • verilog
  • vhdl
  • vim
  • wasm
  • wren
  • x86asm
  • xl
  • xml
  • xquery
  • yaml
  • zephir

Original issue:

Is your request related to a specific problem you're having?

Relevance (and hence auto-detect) is all over the map because every mode receives 1 relevance by default. Just because something should be highlighted (or parsed) does not mean it should be relevant to auto-detect. Proper auto-detection [currently] requires VERY careful curation of the modes at a high level (across many languages)... meaning if one language claims relevance for a specific syntactic structure, then every other language that ALSO includes that structure must also claim relevance... otherwise one language just wins by "default".

Our recent support for operators is one example... as things stand now, operators can't be given default relevance. IE, if we start by adding operators (and relevance) to a few languages, then those languages will now always "win" on every snippet of code doing lots of math (operators), because they are getting points for operators whereas every other language (which presumably shares many of the same operators) is not getting any points.

This "balance" typically works with strings, comments, and such things because we provide MODE helpers for these that enforce relevance consistency across grammars.

The solution you'd prefer / feature you'd like to see added...

I'd like us to consider having modes receive 0 relevance by default, not 1, so that all relevance is opt-in rather than opt-out. Grammars should try very hard to only claim relevance for things that are truly relevant. This would result in more thought being put into relevance and remove a lot of the relevance:0 dance we currently have to do.
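To make the operator example concrete, here is a minimal sketch of the difference between opt-out and opt-in relevance. It is a toy model, not highlight.js internals: the `score` function, the `defaultRelevance` parameter, and the two made-up grammars are all assumptions for illustration.

```javascript
// Toy model of relevance scoring (hypothetical, NOT the real highlight.js engine).
// A "mode" here is just { name, pattern, relevance? }.
function score(modes, code, defaultRelevance) {
  let total = 0;
  for (const mode of modes) {
    const hits = (code.match(mode.pattern) || []).length;
    // A mode that does not state its relevance falls back to the default.
    const relevance = mode.relevance ?? defaultRelevance;
    total += hits * relevance;
  }
  return total;
}

const mathSnippet = "a = b + c * d - e";

// Grammar A highlights operators; grammar B has no operator rule at all.
const grammarA = [{ name: "operator", pattern: /[+\-*=]/g }];
const grammarB = [];

// Default of 1 (today): A collects points just for having the rule.
const optOut = {
  a: score(grammarA, mathSnippet, 1),
  b: score(grammarB, mathSnippet, 1),
};

// Default of 0 (proposed): neither grammar scores unless it opts in.
const optIn = {
  a: score(grammarA, mathSnippet, 0),
  b: score(grammarB, mathSnippet, 0),
};

console.log(optOut, optIn);
```

With a default of 1, grammar A pulls ahead purely because it has an operator rule; with a default of 0 the two stay tied until someone deliberately writes a `relevance` value.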

Note I'm talking specifically about modes here; keywords would still retain their 1 by default... (see thoughts on keywords below). Right now it's too easy to accidentally add relevance with a complex ruleset... One quick example: beginKeywords... because this is BOTH a mode and a keywords key, any keyword matched with beginKeywords now counts DOUBLE. I've been fixing this on a one-off basis, but even if we make no other changes here it's likely that should change. But it's just one example of how easy it is to "accidentally" add relevance.

Suddenly any language trying to more nicely parse something like function blah() (which in one form or another is common in MANY languages) gets double points whereas all the languages without an explicit rule only get single points. Unfair.
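The double counting described above can be sketched roughly like this. Again this is a toy model and not the real engine: `scoreFunctionKeyword` and the three grammar objects are hypothetical, and only the `keywords`, `beginKeywords`, `end` and `relevance` key names are borrowed from how grammars are written.

```javascript
// Toy model (hypothetical): each occurrence of `function` scores 1 as a keyword,
// plus the relevance of any mode that begins on that keyword (default 1).
function scoreFunctionKeyword(grammar, code) {
  const hits = (code.match(/\bfunction\b/g) || []).length;
  const keywordPoints = grammar.keywords.includes("function") ? 1 : 0;
  const modePoints = grammar.modes
    .filter((mode) => mode.beginKeywords === "function")
    .reduce((sum, mode) => sum + (mode.relevance ?? 1), 0);
  return hits * (keywordPoints + modePoints);
}

const snippet = "function blah() {}\nfunction other() {}";

// Simple grammar: `function` is only a keyword.
const plain = { keywords: ["function"], modes: [] };

// Nicer grammar: same keyword, plus a mode that parses the declaration.
const detailed = {
  keywords: ["function"],
  modes: [{ beginKeywords: "function", end: /\)/ }],
};

// The one-off fix: the mode explicitly gives up its relevance.
const fixed = {
  keywords: ["function"],
  modes: [{ beginKeywords: "function", end: /\)/, relevance: 0 }],
};

const plainScore = scoreFunctionKeyword(plain, snippet);
const detailedScore = scoreFunctionKeyword(detailed, snippet);
const fixedScore = scoreFunctionKeyword(fixed, snippet);

console.log({ plainScore, detailedScore, fixedScore });
```

In this model the nicer grammar scores twice what the simple one does for the same two declarations, and only the explicit `relevance: 0` brings it back level.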

It's much harder to accidentally add relevance like this with keywords because of how explicit keyword relevance is.

Any alternative solutions you considered...

I've often wondered if there should be a max count on how many times a rule can score relevancy. Ok, so we see int which is a keyword in language X, that tells us something - for sure, but if we see int 1000 times, does that really mean your language is 1000x more likely to be X?
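As a rough sketch of what a cap might look like (purely illustrative; `keywordScore` and the `maxHitsPerKeyword` parameter are made-up names, and each keyword is assumed to be worth 1 point per hit):

```javascript
// Hypothetical capped scoring: each keyword contributes at most
// `maxHitsPerKeyword` points, no matter how often it appears.
function keywordScore(keywords, code, maxHitsPerKeyword = Infinity) {
  const counts = new Map();
  for (const word of code.match(/\w+/g) || []) {
    if (keywords.has(word)) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  let total = 0;
  for (const count of counts.values()) {
    total += Math.min(count, maxHitsPerKeyword);
  }
  return total;
}

// 1000 lines that each use `int` once.
const repetitive = Array(1000).fill("int x;").join("\n");
const languageX = new Set(["int"]);

const uncapped = keywordScore(languageX, repetitive);
const capped = keywordScore(languageX, repetitive, 5);

console.log({ uncapped, capped });
```

What the cap should actually be (and whether it should be per rule or per keyword) is an open question; 5 here is just a placeholder.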

Perhaps we shouldn't be looking at overall relevance scores but rather how "widely" the scores are spread... (this would require research I think)... IE, it should matter a lot more that your code includes 100 different keywords from X than just a single keyword 100 times in a row... (which really might not be X at all)

It's possible these two approaches would pair well together also. Keywords "naturally" balance to a degree because every language has a list (it's usually the one thing even very simple grammars get right)... so if both Basic and Pascal have "for" then neither gets an advantage even if the code contains for 1000 times.
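One very simple way to measure "spread" would be to count distinct keywords rather than total hits. This is only a sketch of the idea, not a researched metric; `keywordBreadth` and the keyword list for X are invented for the example.

```javascript
// Hypothetical breadth metric: how many DIFFERENT keywords of a language
// appear in the code, ignoring how often each one repeats.
function keywordBreadth(keywords, code) {
  const seen = new Set();
  for (const word of code.match(/\w+/g) || []) {
    if (keywords.has(word)) {
      seen.add(word);
    }
  }
  return seen.size;
}

const keywordsOfX = new Set(["int", "float", "return", "while", "if"]);

// One keyword repeated 100 times vs. five different keywords used once each.
const oneKeywordRepeated = Array(100).fill("int").join(" ");
const manyKeywords = "int a; float b; while (a) { if (b) return; }";

const narrow = keywordBreadth(keywordsOfX, oneKeywordRepeated);
const wide = keywordBreadth(keywordsOfX, manyKeywords);

console.log({ narrow, wide });
```

Under a raw hit count the repeated `int` would win 100 to 5; under this breadth count the varied snippet wins 5 to 1.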
