
modules/spellfiles: init #3143


Draft: wants to merge 2 commits into main from spellfiles

Conversation

@GaetanLepage (Member) commented Apr 3, 2025

Motivation

Nix(vim) is all about offering the best out-of-the-box experience.
Hence, why not provide the spellfiles on demand?
This PR implements the new spellfiles top-level option, which allows users to decide which spellfiles should be automatically fetched from the Vim FTP server.

Fixes #1919

Under the hood, this option populates extraFiles."spell/en.utf-8.spl" (for example) with files fetched from https://ftp.nluug.nl/pub/vim/runtime/spell/ using pkgs.fetchurl.

Nixvim's vendored spellfile hashes

I added an extra script to the update-scripts folder which pre-fetches the hashes of all spell files hosted on the official FTP server.
Hence, while the spellfiles are downloaded at build time when users deploy their configurations, they won't need to provide the expected hashes manually: this information will already be available in the nixvim repo (in generated/spellfiles.nix).
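
For illustration, the vendored data could look roughly like this (a sketch; the actual layout of generated/spellfiles.nix may differ, and the en.utf-8.spl hash is the one quoted later in this thread):

# generated/spellfiles.nix (sketch; the actual layout may differ)
{
  "en.utf-8.spl" = "sha256-/sq9yUm2o50ywImfolReqyXmPy7QozxK0VEUJjhNMHA=";
  "en.utf-8.sug" = "sha256-..."; # placeholder, not a real hash
  # ... one entry per .spl/.sug file listed on the FTP server
}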

Note

One could argue that vendoring the hashes in Nixvim and hoping they will still be valid when a user builds their config is a bad idea. Indeed, if a file changes upstream, users could hit a hash-mismatch error without any fallback to override it.
My answer to this is that those files are apparently very stable (last updated in 2019), so it is not a huge risk.

Module options (cfg = config.spellfiles)

This behavior is opt-in (cfg.enable defaults to false).

By default, the module automatically infers the list of spellfiles LANG.ENCODING.[spl,sug] to install (see the sketch after this list):

  • LANG: Obtained from the following sources, in decreasing priority:
    • cfg.languages
    • opts.spelllang
    • ["en"] (default)
  • ENCODING: Obtained from the following sources, in decreasing priority:
    • opts.encoding
    • "utf-8" (default)
  • .sug files: Actual spellfiles are .spl files and are always fetched. However, some LANG.ENCODING combos have a "suggestion" file .sug available. They are fetched if cfg.includeSuggestions is true (defaults to false).
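
For example, a configuration relying on this inference could look like the following sketch (option names are as described in this PR; exact defaults may differ):

opts.spelllang = "en_us,fr";
opts.encoding = "utf-8"; # optional; utf-8 is the default anyway

spellfiles = {
  enable = true;             # opt-in
  includeSuggestions = true; # also fetch the .sug files when available
  # languages is left at its default: inferred from opts.spelllang
};

# Resulting spellfiles: en.utf-8.spl, fr.utf-8.spl,
# plus their .sug counterparts where available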

Besides, users can also explicitly specify additional spellfiles to fetch:

spellfiles.spellfiles = [
  "en.utf-8.spl"
  "en.utf-8.sug"
  "fr.utf-8.spl"
];

Note: At first, I made cfg.spellfiles and cfg.languages mutually exclusive, but it is actually better to allow both of them to be set concurrently.

Design questions

  • cfg.languages or cfg.spelllangs?
  • Do we need to provide a cfg.encoding option?

Resources

Credits

The idea comes from a very smart suggestion by @mirkolenz.

@GaetanLepage GaetanLepage force-pushed the spellfiles branch 3 times, most recently from 659e056 to 2c168d6 Compare April 3, 2025 16:57
@khaneliman (Contributor)

I really like this. I had been playing around with spellfiles at some point and was surprised we didn't have a better baked-in solution for it. Thanks!

@mirkolenz (Contributor)

Wow, this new module looks really nice! Regarding your questions:

cfg.languages or cfg.spelllangs?

I think cfg.languages is more intuitive here.

Do we need to provide a cfg.encoding option?

It could even be useful to have cfg.encodings to support multiple values here (e.g., users may want to install different encodings if they work in different environments).

@GaetanLepage (Member, Author)

It could even be useful to have cfg.encodings to support multiple values here (e.g., users may want to install different encodings if they work in different environments).

Well, I don't know... At this point, users who want to customize both languages and encodings can simply list the spellfiles they want explicitly (with the spellfiles.spellfiles option), right?

Also, I am curious why a user would want an encoding other than utf-8. Are there common situations where you would need to fall back to latin1?

@mirkolenz (Contributor)

Well, I don't know... At this point, users who want to customize both languages and encodings can simply list the spellfiles they want explicitly (with the spellfiles.spellfiles option), right?

Yeah that makes sense!

Also, I am curious why a user would want an encoding other than utf-8. Are there common situations where you would need to fall back to latin1?

I think that mostly happens when dealing with legacy systems or when opening old files (e.g., LaTeX files from like ten years ago were often saved as latin1 due to encoding issues with utf-8).

@MattSturgeon (Member) left a comment:

This looks great! I've left some feedback below.

My only major concern is the way we're pinning URLs and hashes; I think this would too easily get out of sync and result in "old" configs being un-buildable.

Comment on lines 22 to 23
# Extract only .spl and .sug file links
filenames=$(echo "$html" | grep -oP '(?<=href=")[^"]+\.(spl|sug)' | sort -u)
Member:

I'd rather do this with a syntax-aware tool like pandoc, but this is fine for now if it works reliably.

I think a pandoc impl would look very similar to our current docs/fix-links util


hash=$(nix hash convert --to sri --hash-algo sha256 "$sha256")

# Ugly hardcoding of the `%40` -> `@` substitution
Member:

Is there not a URI decode utility we could use to do this properly? Even if we have to add another runtime input?

@MattSturgeon (Member) commented Apr 3, 2025:

There's some different approaches here: https://stackoverflow.com/questions/6250698/how-to-decode-url-encoded-string-in-shell

One answer suggests using python's urllib:

python3 -c "import sys; from urllib.parse import unquote; print(unquote(sys.stdin.read()));"

Although, if we're depending on python it's probably tempting to write the whole script in python instead of bash?

Some bash-based approaches are replacing all instances of '%' with '\x'. IDK if that correctly handles %% though.

This one with printf is nice too.

Member Author:

I tried many bashisms, and none worked. Also, it was very hard to handle the escaping, as this bash script is defined in a .nix file...

Again, I would question the relevance of over-engineering a script whose input changes once every decade.

Member Author:

This one with printf is nice too.

Well, I would very much like to avoid adding this much complexity. I would rather have the script fail completely in 2 years than add this.

Member:

Yeah, I was just listing various options. I'd suggest using python's urllib as the simplest and most robust approach, without too much over-engineering.

# Remove region from language id ("en_us" -> "en")
(map (lang: lib.head (builtins.match "^([^_]+)" lang)))
];
languages = if cfg.languages == "auto" then languagesFromSpelllang else cfg.languages;
Member:

Fancy idea: could we move this to cfg.languages apply function?

That way users can read the value to figure out what languages their "auto" definition actually ended up producing.

Format normalisation could also be done there.

Also makes me wonder if the option should be a list type? 🤔 And maybe auto should be a separate bool option?

Member Author:

Fancy idea: could we move this to cfg.languages apply function?

That way users can read the value to figure out what languages their "auto" definition actually ended up producing.

Good idea! Done.

Format normalisation could also be done there.

What do you mean? I think this is what I have done.

Also makes me wonder if the option should be a list type? 🤔 And maybe auto should be a separate bool option?

I don't really see a benefit there...

@MattSturgeon (Member) commented Apr 3, 2025:

I don't really see a benefit there...

The benefit of a list-type is definition merging, and just generally having a more structural config.

If the option is a list-type, then I can define [ "en" ] in one module, define [ "fr" ] in another module, and it will be merged as [ "en" "fr" ]. With a str-type option, merging is not supported.
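
As a tiny illustration (a sketch, assuming languages were a lib.types.listOf lib.types.str option):

# module-a.nix
{ spellfiles.languages = [ "en" ]; }

# module-b.nix
{ spellfiles.languages = [ "fr" ]; }

# After module evaluation: config.spellfiles.languages == [ "en" "fr" ]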

Member:

Or maybe you meant the separate auto-option? I think the benefit of that separation is simplifying the types and the impl, making things more obvious to end-users.

If you had a separate auto-toggle, you wouldn't need any if value == "auto"; in fact, you wouldn't need an apply at all: you could just have an extra definition that sets config.spellfiles.languages = lib.pipe ...
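
For concreteness, a rough sketch of that shape (hypothetical spellfiles.auto option, not this PR's actual implementation; it assumes opts.spelllang is set as a comma-separated string):

{ lib, config, ... }:
let
  cfg = config.spellfiles;
in
{
  config = lib.mkIf cfg.auto {
    # An extra list-type definition, merged with whatever the user sets explicitly
    spellfiles.languages = lib.pipe config.opts.spelllang [
      (lib.splitString ",")
      # Strip the region: "en_us" -> "en"
      (map (lang: lib.head (lib.splitString "_" lang)))
      lib.unique
    ];
  };
}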

Comment on lines +146 to +163
Maps "en.utf-8.spl" to:
{
name = "spell/en.utf-8.spl";

value.source = pkgs.fetchurl {
url = "https://ftp.nluug.nl/pub/vim/runtime/spell/en.utf-8.spl";
hash = "sha256-/sq9yUm2o50ywImfolReqyXmPy7QozxK0VEUJjhNMHA=";
};
}
Member:

I'm a little concerned about our pinning technique.

Firstly, I assume that spell files are occasionally modified, and don't offer a versioned/pinned mirror?

This leads to my issue with the assumption that we can generate a table of file hashes without running into issues.

For example:
  • A user rolls back to an old nixvim revision: now the hash is outdated.
  • Nixvim is slow to update: same outcome...
  • Someone is bisecting their config.
  • We are bisecting nixvim (which has a test for spell files).

I think we either need to download the actual spell files in our generate script, or find a way to have versioned URLs.

@GaetanLepage (Member, Author) commented Apr 3, 2025:

Also, I thought about downloading the spellfiles, but they are binary files and some of them weigh several dozen GB. EDIT: MB* of course ^^

@GaetanLepage GaetanLepage force-pushed the spellfiles branch 2 times, most recently from 4f994e0 to 4d03d27 Compare April 3, 2025 21:26
@GaetanLepage (Member, Author) commented Apr 3, 2025

Thanks for the review @MattSturgeon!

Of course, I have thought about this issue of "outdated hashes" (see the PR description).
While this is a very valid concern (and could break users' configurations in all the ways you have mentioned), I still think this design is valid.

--> Indeed, the spellfiles haven't been modified in the past 6+ years.

This doesn't mean that they won't get modified again at some point, but I would consider them to be very stable.
I know that this feels wrong or flaky, but when you think about it ~90% of Nixvim's code (if not more) is tracking waaaay more unstable upstream specifications (plugins, nixpkgs tooling and packages). Actually, even Neovim's own API is significantly more unstable than those spellfiles.

@MattSturgeon (Member)

Of course, I have thought about this issue of "outdated hashes" (see the PR description).

Oops, I missed that.

Indeed, the spellfiles haven't been modified in the past 6+ years.

That is reassuring, if nothing else 😀

This doesn't mean that they won't get modified again at some point, but I would consider them to be very stable.

For me the issue is that when they inevitably do change, all configs that had been using the spellfile module will suddenly break without having been changed. This means the module is technically non-deterministic.

It's the same reason why you would specify a package src using a commit or a tag, rather than a branch ref, in nixpkgs. We want old revisions of nixpkgs to continue building indefinitely.

I know that this feels wrong or flaky, but when you think about it ~90% of Nixvim's code (if not more) is tracking waaaay more unstable upstream specifications (plugins, nixpkgs tooling and packages).

I think the key difference is that all those other things are handled by our flake.lock; e.g. nvim api changes are handled by having a specific revision of nvim pinned (via pinning nixpkgs)


All that said: I don't see a viable alternative, short of finding or hosting our own mirror. Which I don't want to do, either.

It's great that the files seem to have been stable for 6 years [1], but it isn't great that they seemingly don't offer a stable/versioned URL we can use.

I did see their README has instructions for building spell files from source. That may be a better alternative, especially if it can be done in such a way that the nixpkgs and/or nix-community binary cache has a cached build of the spellfile packages.

Footnotes

  1. This also makes me question whether this is the most up-to-date mirror...

@GaetanLepage (Member, Author)

  1. This also makes me question whether this is the most up-to-date mirror...

It is the official one used by upstream to download spellfiles: https://github.com/vim/vim/blob/2a6be835122d906c6ec10682c2a771b25c87c611/runtime/autoload/spellfile.vim#L9.

I think the key difference is that all those other things are handled by our flake.lock; e.g. nvim api changes are handled by having a specific revision of nvim pinned (via pinning nixpkgs)

I agree...

Alternatives

  • We could expose the spellfiles in nixpkgs and have them cached by Hydra (pkgs.[neo]vim.passthru.spellfiles). That would be the best alternative I think. Although, (apart from building them from source) we would have the same URL-pinning issue there...
  • Otherwise, why not expose them as one of our flake's outputs (packages)? That way, they would at least be cached on the nix-community cachix cache. Maybe this wouldn't be reproducible either, though.
  • Finally, we could expose a cfg.spellfilesSources option allowing users to specify the url and hash for each spellfile (a sketch follows below). This would also let users use custom mirrors and would prevent anyone from being stuck with a hash-mismatch error.
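
For the last alternative, a user-facing override could look like this (sketch only; cfg.spellfilesSources is a hypothetical option name and the hash is a placeholder):

spellfiles.spellfilesSources."fr.utf-8.spl" = {
  url = "https://my-mirror.example.org/vim/runtime/spell/fr.utf-8.spl";
  hash = "sha256-..."; # user-provided hash for the mirrored file
};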

@MattSturgeon (Member)

  • We could expose the spellfiles in nixpkgs and have them cached by Hydra (pkgs.[neo]vim.passthru.spellfiles). That would be the best alternative I think. Although, (apart from building them from source) we would have the same URL-pinning issue there...

Vim spellfiles are apparently built from libreoffice dictionaries, which I assume are already packaged. If we can reproduce the from-source instructions as a derivation, it should be possible to package the vim spell files as derivations with the nixpkgs libreoffice dicts as nativeBuildInputs

Build steps

This involves downloading the files from the libreoffice github, applying a
patch, and running Vim to generate the .spl file.

To do this all in one go use the Aap program. It's simple to install, it only requires Python.

Before generating spell files, verify your system has the required locale
support. Source the check_locales.vim script to find out. If something is
missing, see LOCALE below.

You can also do it manually:

  1. Fetch the right spell file from:
    https://github.com/LibreOffice/dictionaries

  2. Unzip the archive:

    unzip LL_RR.zip
  3. Apply the patch:

    patch < LL_RR.diff
  4. If the language has multiple regions do the above for each region. E.g.,
    for English there are five regions: US, CA, AU, NZ and GB.

  5. Run Vim and execute :mkspell. Make sure you do this with the correct
    locale, that influences the upper/lower case letters and word characters.
    On Unix it's something like:

    env LANG=en_US.UTF-8 vim
    mkspell! en en_US en_AU en_CA en_GB en_NZ
  6. Repeat step 5 for other locales. For English you could generate a spell
    file for latin1, utf-8 and ASCII. ASCII only makes sense for languages
    that have very few words with non-ASCII letters.

LOCALE

For proper spell file generation the required locale must be installed. Otherwise Vim doesn't know what are letters and upper-lower case differences. Modern systems use UTF-8, but we also generate spell files for 8-bit locales
for users with older systems.

On Ubuntu the default is to only support locales for your own language. To
add others you need to do this:

sudo vim /var/lib/locales/supported.d/local
# Add needed lines from /usr/share/i18n/SUPPORTED
sudo dpkg-reconfigure locales

When using the check_locales.vim script, you need to exit Vim and restart it
to pickup the newly installed locales.

I guess we'd have to vendor the patch files within nixpkgs, unless there's a safe url from which they can be fetched (are they in a git repo?).

The thing I'm most unsure of is enabling all required locales within the nix builder sandbox. I guess we'd start by running check_locales.vim in the build sandbox and printing the output to $out?
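
For what it's worth, here is a very rough, untested sketch of such a derivation following the README steps quoted above. The pinned revision, dictionary paths and mkspell arguments are assumptions, and the LL_RR.diff patches are left out:

{ lib, stdenvNoCC, fetchFromGitHub, vim, glibcLocales }:

stdenvNoCC.mkDerivation {
  pname = "vim-spellfile-en-utf8";
  version = "0-unstable";

  src = fetchFromGitHub {
    owner = "LibreOffice";
    repo = "dictionaries";
    rev = "..."; # pin a specific commit
    hash = lib.fakeHash; # replace after the first build
  };

  nativeBuildInputs = [ vim ];

  # Make the required locale available inside the build sandbox
  LOCALE_ARCHIVE = "${glibcLocales}/lib/locale/locale-archive";
  LANG = "en_US.UTF-8";

  # NOTE: the LL_RR.diff patches mentioned in the README would have to be
  # vendored (or fetched) and applied to the .aff/.dic files before this step.
  buildPhase = ''
    runHook preBuild
    # Run :mkspell! non-interactively; the dictionary paths and arguments are
    # assumptions based on the repo layout and may need adjusting.
    vim -u NONE -i NONE -n -e -s \
      -c "set encoding=utf-8" \
      -c "mkspell! en en/en_US en/en_GB" \
      -c "qall!"
    runHook postBuild
  '';

  installPhase = ''
    runHook preInstall
    install -Dm644 en.utf-8.spl $out/spell/en.utf-8.spl
    runHook postInstall
  '';
}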

@GaetanLepage GaetanLepage marked this pull request as draft April 6, 2025 19:16
Successfully merging this pull request may close these issues:

  • Cannot find word list "fr.utf-8.spl" or "fr.ascii.spl"