language c is unknown #87
Comments
I would have to see an example file to know why this is. Is it possible to post one here, or at least create a file which replicates this? At the moment it is inclusive... If you look in the classifier database.json file, however, you can see an array called keywords, which I plan to fill with the most common keywords as a way of guessing the language when several languages match the same extension. The logic should live in here
|
Any word on this? |
I'm out of office until the 19th of April and can't deliver anything meaningful before that date, sorry. |
No problem. |
Managed to replicate using the Golang repository (https://github.com/golang/go.git). Search for http://localhost:8080/?q=assert&repo=golang and /misc/cgo/life/c-life.c is reported as unknown. |
Replicated. The root cause is that there are multiple entries in the classifier which match the C extension. When this is true, the logic is meant to guess which one is correct based on the contents of the file. However, the keywords are missing from the classifier database. For example,
To resolve this I need to fix the database. To do so I will run against the searchcode.com database to pull back the top 100 or so keywords for each language and use them to populate the classifier database. |
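For anyone following along, here is a minimal sketch of the keyword idea described above. It is not the project's actual code: the entry layout, the function names and the keyword lists are all assumptions made up for illustration. It shows why a conflicting extension comes back as unknown when the keywords arrays are empty, and how populated keywords let the file contents decide.

```python
# Hypothetical classifier entries. The "extensions" and "keywords" field
# names and every value below are illustrative assumptions, not the real
# contents of database.json.
EMPTY_DB = [
    {"language": "C", "extensions": ["c", "h"], "keywords": []},
    {"language": "C++", "extensions": ["c", "cpp", "h"], "keywords": []},
]

FILLED_DB = [
    {"language": "C", "extensions": ["c", "h"],
     "keywords": ["printf", "malloc", "typedef", "#include"]},
    {"language": "C++", "extensions": ["c", "cpp", "h"],
     "keywords": ["namespace", "template", "std::", "cout"]},
]


def count_keyword_hits(content, keywords):
    """Count how many of the keywords occur at least once in the content."""
    return sum(1 for keyword in keywords if keyword in content)


def guess_language(filename, content, database):
    """Guess a language from the extension, using keywords on a conflict."""
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    matches = [entry for entry in database if extension in entry["extensions"]]
    if len(matches) == 1:
        # No conflict, so the file contents are never inspected.
        return matches[0]["language"]
    best, best_score = None, 0
    for entry in matches:
        score = count_keyword_hits(content, entry["keywords"])
        if score > best_score:
            best, best_score = entry, score
    # With no matches, or no keyword hits at all, nothing can be decided.
    return best["language"] if best else "Unknown"


C_HELLO = '#include <stdio.h>\nint main(void) { printf("Hello\\n"); return 0; }\n'

print(guess_language("c-life.c", C_HELLO, EMPTY_DB))
print(guess_language("c-life.c", C_HELLO, FILLED_DB))
```

With the empty keyword lists the two entries cannot be told apart, which matches the behaviour reported in this issue. Once the lists are filled in, the same file scores higher for C than for C++.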
Proof of concept fix added. Works as expected. Need to build out the database correctly to resolve this properly. |
Great 👍 I wonder if it is worth keeping keywords that appear in all flavours, or just flavour-specific ones for a performance gain. |
It only falls back to that logic when it finds a conflict, so it shouldn't be much of a problem in terms of performance. What I need to do is decide what to do when there are multiple matches and the keywords aren't able to split them. I will probably have it default to the first option. I have tried it out with the following keywords for C and C++, which work for my sample Hello World test.
|
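A rough sketch of the tie handling discussed in the comment above, again under assumptions: the function name, the (language, keywords) tuples and the keyword values are hypothetical and only illustrate the "default to the first option" rule. The contents are scanned only when more than one entry matches, which is why the performance cost should stay small.

```python
def pick_language(matches, content):
    """Pick one language from the entries that matched an extension.

    `matches` is a list of (language, keywords) tuples in database order.
    This layout is an assumption for the sketch, not the project's format.
    """
    if not matches:
        return "Unknown"
    if len(matches) == 1:
        # No conflict: skip the keyword scan entirely.
        return matches[0][0]
    scores = [
        sum(1 for keyword in keywords if keyword in content)
        for _, keywords in matches
    ]
    # list.index returns the first position holding the best score, so
    # when the keywords cannot split the candidates the earliest entry wins.
    return matches[scores.index(max(scores))][0]


# Illustrative keyword lists, not the ones from the actual commit.
CANDIDATES = [
    ("C", ["printf", "malloc"]),
    ("C++", ["std::", "cout"]),
]

print(pick_language(CANDIDATES, 'std::cout << "Hello World";'))
print(pick_language(CANDIDATES, "int x = 0;"))
```

The second call has no keyword hits for either candidate, so it falls back to the first entry, C. Whether first-in-database is the right default is a design choice; it is predictable, but it depends on the order of the entries.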
OK, not totally resolved yet, but the following commit resolves it for C and C++. I will continue to build out the database (it literally takes days to build the keywords due to how much data is being crunched) and move this to resolved when done. |
Confirmed fixed against Go. Going to close this one out. |
I have a bunch of files with nothing special in the name and "c" as an extension.
All are reported as unknown, even though the classifier has four records for this extension.
By the way, are extension definitions inclusive? Can we assign a single extension to multiple languages so that a file is covered by every language filter that meets this condition?