File Extension as Aliases #2371
Of course you wouldn't actually ever do it over and over. If performance mattered at all you'd only do the auto-detect a single time and then cache that result for future use. That said if you can trust the extensions, that could be much faster. Of course historically we don't care about extensions since we aren't dealing with them, we're only dealing with source. My suggestion here is that we make an actual So something like:
// plaintext.js
{
  aliases: ['text'],
  extensions: ['txt'],
  ...
} |
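A minimal sketch of how a separate extensions field could be consumed, assuming grammar definitions shaped like the one above. The `extensions` key and the `buildExtensionIndex` helper are hypothetical names used for illustration; neither is an existing highlight.js API.

```javascript
// Hypothetical grammar metadata: `extensions` is the proposed field,
// not an existing highlight.js key. The entries are illustrative.
const languages = {
  plaintext: { aliases: ['text'], extensions: ['txt'] },
  javascript: { aliases: ['js', 'jsx'], extensions: ['js', 'mjs', 'cjs'] },
};

// Build a lookup from file extension to the language names that claim it.
function buildExtensionIndex(defs) {
  const index = new Map();
  for (const [name, def] of Object.entries(defs)) {
    for (const ext of def.extensions || []) {
      const key = ext.toLowerCase();
      if (!index.has(key)) index.set(key, []);
      index.get(key).push(name);
    }
  }
  return index;
}

const extensionIndex = buildExtensionIndex(languages);
```

Keeping the index separate from `aliases` means an alias such as `text` never shows up as a file extension, which is the separation being proposed.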
@isagalaev Do you have any historical context on what the actual original intention was here - that might guide us or add to the discussion? I'm pretty sure this wasn't the original intent because we're sorely lacking some obvious extensions (like htaccess, as mentioned)... yet for many languages it sure appears we are already doing this. @egor-rogov Any thoughts? |
The advantage of keeping them separate:
The only real downside I see is that we need to go over existing data and split out the extensions from the aliases, and this would likely for the short-term mean we needed to behaviorally treat them as the same... Although it does lead to the question should extension SOMETIMES also be an alias? like One thing I'd love to know is a rough count of how many extensions we already have as aliases. If it's already quite high, perhaps we just roll with it... but if it's pretty low, then I think this is worth a moment of thought. |
I'm afraid it's not that hypothetical. For example, Of course we can think about how to make it easier, but I'm afraid it's not straightforward. We can, for example, provide the separate list of extensions (as you suggested), but allow different languages to have same extensions etc., and let the application make the final decision. |
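As a rough illustration of "let the application make the final decision", the sketch below returns every candidate for an extension and delegates the choice to a callback. The table contents, `resolveLanguage`, and the `choose` callback are assumptions made for this example, not part of highlight.js.

```javascript
// Illustrative data only: several languages can legitimately share an extension.
const candidatesByExtension = new Map([
  ['h', ['c', 'cpp', 'objectivec']],
  ['m', ['objectivec', 'matlab']],
  ['py', ['python']],
]);

// Default policy: only pick a language when the extension is unambiguous.
function pickIfUnambiguous(candidates) {
  return candidates.length === 1 ? candidates[0] : null;
}

// Return the candidates for an extension plus the application's choice.
// `choose` is an application-supplied callback (hypothetical).
function resolveLanguage(ext, table, choose = pickIfUnambiguous) {
  const candidates = table.get(ext.toLowerCase()) || [];
  return { candidates, language: choose(candidates) };
}
```

An application that knows its repository is mostly C++ could pass a `choose` callback that prefers `cpp` whenever it appears among the candidates.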
Good point.
This feels right at first glance. I think aliases like "js" and "rb" are pretty common (I use them all the time on Github) but I think the KEY here might be that they are used because they are SHORTCUTS, not because they are extensions. So one wouldn't write I guess right now it's all a bit muddled, which is why my mind instantly thought about creating the separation, but I'm afraid then it will be hard to "prove" something is or isn't an alias... extensions are pretty well defined though.
Sounds reasonable but how does that work when the block is And if we're TRULY going to add extensions, does that mean we need a |
And we have ridiculous things like:
Which seem entirely unnecessary as aliases... I don't think they are extensions either. |
Then you have categories also like assembly... |
In a quick review it seems the ship might have already sailed on adding extensions to aliases... so I'm leaning towards approving things (like the PR to add Honestly though I wonder if anyone wants to put in the work to split the existing aliases out into extensions... I'm not too excited about doing it... perhaps we just soldier on with aliases until it becomes a larger problem? Right now someone who wanted to load up a bunch of conflicting aliases would have to deal with it by hand, or simply not rely on the aliases to work since really the last language loaded would be the "winner"... |
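To make the "last language loaded would be the winner" point concrete, here is a simplified model of alias registration. It is a sketch of the behavior described above, not highlight.js internals, and the alias lists are illustrative.

```javascript
// Simplified model (not highlight.js internals): each alias points at exactly
// one language, so a later registration overwrites an earlier one.
const aliasTable = new Map();
const conflicts = [];

function registerAliases(languageName, aliases) {
  for (const alias of aliases) {
    const previous = aliasTable.get(alias);
    if (previous !== undefined && previous !== languageName) {
      // Record the clash so it can be reported instead of silently lost.
      conflicts.push({ alias, previous, winner: languageName });
    }
    aliasTable.set(alias, languageName);
  }
}

// Illustrative alias lists; the second call takes over the shared alias.
registerAliases('objectivec', ['m', 'mm', 'objc']);
registerAliases('matlab', ['m']);
```

Collecting `conflicts` is one way an application could notice clashing aliases rather than discovering them by hand.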
Extension-based autodetection looks reasonable.
Looks like something useful to me, but surely not the first priority. |
Agree. |
This starts to smell a little like shebang lines though (just another way to detect/categorize)... and I don't think you were super encouraging of that as a core feature. What would make this different? Well, maybe it's a little different since we already seem to do it via aliases. :-) |
Well, "extensions feature" doesn't change the way HLJS works. It's just the matter of narrowing down the list of languages for autodetection and passing it to the existing API. On the other hand, shebang is inside the code, and it is grammar that we use to deal with the code. It doesn't look right to teach HLJS to look into the code using means other than the grammar. (It's okay for the application to sniff the code to be highlighted, find shebang, parse it somehow, and pass the language to the existing HLJS API, though.) It's just how I feel it, of course. Perhaps I'm wrong. |
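A sketch of that narrowing step, assuming the application owns a Map from extension to language names. `highlightByExtension` and `extensionToLanguages` are hypothetical; the only library call used is `hljs.highlightAuto(code, languageSubset)`.

```javascript
// `hljs` is passed in so the sketch does not depend on how the library is loaded.
// `extensionToLanguages` is a hypothetical, application-owned Map of
// extension -> array of language names.
function highlightByExtension(hljs, code, filename, extensionToLanguages) {
  const dot = filename.lastIndexOf('.');
  const ext = dot === -1 ? '' : filename.slice(dot + 1).toLowerCase();
  const subset = extensionToLanguages.get(ext);
  // With no known extension, fall back to autodetection over all registered languages.
  return subset ? hljs.highlightAuto(code, subset) : hljs.highlightAuto(code);
}
```

Nothing in the grammar or the core changes here: the extension only decides which subset is handed to the existing autodetection call.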
I think you could potentially say the same for shebang data... really aren't we talking about whether a grammar can host data that we don't use DIRECTLY, but rather plugins or the source application could use indirectly to help correctly categorize a particular file/snippet? If we are ok hosting extension data, BUT we don't use them directly, then why not host shebang lines... or any other "per-language" meta-data that might prove useful in general? And if we're not such a repository, then perhaps we shouldn't host extension data at all? Say "that is external to us, we only look at code".... You could even argue shebang analysis is more in-scope than extensions... since shebang is part of the code itself... where-as extensions (and filenames in general) exist completely outside that sphere. :-) |
It would be great to know if anyone is currently categorizing snippets by extension alone like in the use case mentioned in the first post here. |
Hmm. If we're talking about storing some metadata... you almost convinced me. I think we shall return to the discussion later in more detail. |
At least it is standardized. Some web server configurations use file extensions as a reference to automatically determine MIME types for the browsers. I think this is pretty standard as the configuration data will be served for any browser, anywhere. And so file extensions can be used as language categories too. |
But the standard you're talking about there is extension to mime type mapping... that doesn't directly help us since we don't have a list of canonical extensions or a list of mime types. If you're merely saying it's helpful to be able to map from an extension to knowing what the file is, there is no disagreement on that point. :-) I think the long-term question here is HOW we should encode that data... whether to continue using alias or to split the data out into its own field. |
ObservableHQ uses highlight.js. I made an Observable notebook that lets people include code snippets in their Observables by referencing their URLs and parsing the URL contents into Markdown. Since the user supplies a URL, I automatically get the file extension with it, and I've been using this to figure out what kind of tag to put in the generated markdown block, which is the alias. The problem is that as it stands, I have to hard-code a bunch of file extensions and their aliases, and that seems like the wrong way to do it. If there were an independent mapping between the language full names and the extensions, this would help. |
@anwarhahjjeffersongeorge Why don't you share your list of mappings just so we see what that looks like. |
@yyyc514 no problem. it isn't anything fancy:
codeexts: { // these are the code extensions I'm dealing with
  javascript: ['mjs', 'ejs', 'js', 'jscad'],
  typescript: ['ts'],
  openscad: ['scad'],
  processing: ['pde'],
  arduino: ['ino'],
  c: ['c', 'h'],
  cpp: ['cpp', 'hpp', 'cxx', 'hxx'],
  bash: ['sh'],
  python: ['py'],
}
For my use case, I just do something like
let codetypename = ''
for (let key in codeexts) {
  if (codeexts[key].includes(ext)) {
    codetypename = key
    break
  }
}
But I can see this might not work for the whole highlight.js library since I have to add custom ones, like |
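One possible variation on the lookup above: invert the table once so each query is a single Map access instead of a scan. The `languageByExt` and `languageForExtension` names are made up for this sketch; the mapping itself is copied from the comment.

```javascript
// Mapping copied from the comment above (language name -> file extensions).
const codeexts = {
  javascript: ['mjs', 'ejs', 'js', 'jscad'],
  typescript: ['ts'],
  openscad: ['scad'],
  processing: ['pde'],
  arduino: ['ino'],
  c: ['c', 'h'],
  cpp: ['cpp', 'hpp', 'cxx', 'hxx'],
  bash: ['sh'],
  python: ['py'],
};

// Invert it once: extension -> language name.
const languageByExt = new Map();
for (const [language, exts] of Object.entries(codeexts)) {
  for (const ext of exts) {
    // Keep the first language that claims an extension, which matches
    // the loop-and-break behavior of the original snippet.
    if (!languageByExt.has(ext)) languageByExt.set(ext, language);
  }
}

// Returns '' for unknown extensions, like the original `codetypename` default.
function languageForExtension(ext) {
  return languageByExt.get(ext) || '';
}
```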
I think for now we'll keep adding these aliases unless someone wanted to make a PR and do all the work of splitting them out. Not a high priority for me. If we did split them out I think originally we'd still have to merge extensions and aliases for the same behavior as we have now (to not break anything). So it would really just be a data enhancement to the library to let people work with extensions/query them, etc. if they wanted to. |
Pretty sure we already have most of those as aliases. |
@taufik-nurrohman Any chance you want to do this work and make a PR? :-) |
@yyyc514 It requires editing the core then. I could do it, but I would need to be very careful.
Hotfix: scope.aliases = scope.aliases.concat(scope.extensions || []); |
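A slightly more defensive take on that one-liner, as a sketch only. Here `scope` stands for a language definition object and `extensions` is the proposed (hypothetical) field; the helper name is invented for this example.

```javascript
// Merge the proposed `extensions` list into `aliases` so existing alias-based
// lookups keep working. `scope` is a language definition object.
function mergeExtensionsIntoAliases(scope) {
  const merged = [...(scope.aliases || []), ...(scope.extensions || [])];
  // De-duplicate so an extension that is already an alias is not listed twice.
  scope.aliases = [...new Set(merged)];
  return scope;
}
```

Unlike the plain `concat`, this also tolerates definitions that have no `aliases` array at all.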
Essentially, but I was really asking about the work of going thru all the 285 files and trying to make sense of extensions for all of them. :-) Writing the one-liner is the easy part. :-) |
This type of work would pair well with: |
No one seems to be really pushing for this. Closing due to inactivity. |
Would like to see this kind of request (#2523) added so that I can make a plugin related to this outside of highlight.js core. |
@taufik-nurrohman If you wanted to whip up a PR for #2523 it'd probably be pretty simple... |
I would suggest adding common file extensions for every language to its aliases, as that makes things easier for developers who want to detect the language from the file extension without depending on the built-in language detection. For example, apache should have an alias htaccess too.
Imagine someone makes a git repository viewer application and then uses highlight.js to color the code syntax in their files. Detecting the language automatically from file extensions is more robust than reading the file contents over and over using the available language packages until the most relevant match is found.
This also opens up various possibilities to load language packages asynchronously based on file extensions, so the amount of data transferred will be much smaller considering that JavaScript works on the client side.
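A rough sketch of the asynchronous loading idea, under the assumption that the application keeps a table from extension to a loader function. `extensionOf`, `loadLanguageFor`, and the `loaders` Map are hypothetical; in practice each loader would typically wrap a dynamic `import()` of a language package.

```javascript
// Extract a lowercase extension from a path. Dotfiles such as ".htaccess"
// are treated as having the extension "htaccess".
function extensionOf(filename) {
  const base = filename.split('/').pop();
  const dot = base.lastIndexOf('.');
  return dot === -1 ? '' : base.slice(dot + 1).toLowerCase();
}

// `loaders` is a hypothetical Map of extension -> function returning a Promise
// for the language module. Resolves to null when no loader is known.
async function loadLanguageFor(filename, loaders) {
  const load = loaders.get(extensionOf(filename));
  return load ? load() : null;
}
```

Only the language package matching the file's extension is requested, so a viewer showing one file does not have to download every grammar.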