
modules/spellfiles: init #3143


Draft: wants to merge 2 commits into main from spellfiles

Conversation

@GaetanLepage (Member) commented Apr 3, 2025

Motivation

Nix(vim) is all about offering the best out-of-the-box experience.
Hence, why not provide the spellfiles on demand?
This PR implements the new spellfiles top-level option, which allows users to decide which spellfiles should be automatically fetched from the Vim FTP server.

Fixes #1919

Under the hood, this option populates extraFiles."spell/en.utf-8.spl" (for example) with files fetched from https://ftp.nluug.nl/pub/vim/runtime/spell/ using pkgs.fetchurl.

Nixvim's vendored spellfile hashes

I added an extra script to the update-scripts folder which pre-fetches the hashes of all spell files hosted on the official FTP server.
Hence, while the spellfiles are downloaded at build time when users deploy their configurations, they won't need to provide the expected hashes manually: this information will already be available in the nixvim repo (in generated/spellfiles.nix).
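
For illustration, the vendored data could look roughly like this (a sketch; the actual layout of generated/spellfiles.nix may differ, and the en.utf-8.spl hash is the one quoted later in this thread):

# generated/spellfiles.nix (sketch; the actual layout may differ)
{
  "en.utf-8.spl" = "sha256-/sq9yUm2o50ywImfolReqyXmPy7QozxK0VEUJjhNMHA=";
  "en.utf-8.sug" = "sha256-..."; # placeholder, not a real hash
  # ... one entry per .spl/.sug file listed on the FTP server
}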

Note

One could argue that vendoring the hashes in Nixvim and hoping they will still be valid when a user builds their config is a bad idea. Indeed, if a file changes upstream, users could hit a hash-mismatch error without any fallback to override it.
My answer to this is that those files are apparently very stable (last updated in 2019), so it is not a huge risk.

Module options (cfg = config.spellfiles)

This behavior is opt-in (cfg.enable defaults to false).

By default, the module automatically infers the list of spellfiles LANG.ENCODING.[spl,sug] to install (see the sketch after this list):

  • LANG: Obtained from the following sources, in decreasing priority:
    • cfg.languages
    • opts.spelllang
    • ["en"] (default)
  • ENCODING: Obtained from the following sources, in decreasing priority:
    • opts.encoding
    • "utf-8" (default)
  • .sug files: Actual spellfiles are .spl files and are always fetched. However, some LANG.ENCODING combos have a "suggestion" file .sug available. They are fetched if cfg.includeSuggestions is true (defaults to false).
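
For example, a configuration relying on this inference could look like the following sketch (option names are as described in this PR; exact defaults may differ):

opts.spelllang = "en_us,fr";
opts.encoding = "utf-8"; # optional; utf-8 is the default anyway

spellfiles = {
  enable = true;             # opt-in
  includeSuggestions = true; # also fetch the .sug files when available
  # languages is left at its default: inferred from opts.spelllang
};

# Resulting spellfiles: en.utf-8.spl, fr.utf-8.spl,
# plus their .sug counterparts where available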

Besides, users can also explicitly specify additional spellfiles to fetch:

spellfiles.spellfiles = [
  "en.utf-8.spl"
  "en.utf-8.sug"
  "fr.utf-8.spl"
];

Note: At first, I made cfg.spellfiles and cfg.languages mutually exclusive, but it is actually better to allow both of them to be set concurrently.

Design questions

  • cfg.languages or cfg.spelllangs?
  • Do we need to provide a cfg.encoding option?

Resources

Credits

The idea comes from a very smart suggestion by @mirkolenz.

@GaetanLepage GaetanLepage force-pushed the spellfiles branch 3 times, most recently from 659e056 to 2c168d6 Compare April 3, 2025 16:57
@khaneliman (Contributor)

I really like this. I had been playing around with spellfiles at some point and was surprised we didn't have a better baked-in solution for it. Thanks!

@mirkolenz (Contributor)

Wow, this new module looks really nice! Regarding your questions:

cfg.languages or cfg.spelllangs?

I think cfg.languages is more intuitive here.

Do we need to provide a cfg.encoding option?

It could even be useful to have cfg.encodings to support multiple values here (e.g., users may want to install different encodings if they work in different environments).

@GaetanLepage (Member, Author)

It could even be useful to have cfg.encodings to support multiple values here (e.g., users may want to install different encodings if they work in different environments).

Well, I don't know... At this point, users who want to customize both languages and encodings can simply list the spellfiles they want explicitly (with the spellfiles.spellfiles option), right?

Also, I am curious why a user would want an encoding other than utf-8. Are there common situations where you would need to fall back to latin1?

@mirkolenz (Contributor)

Well, I don't know... At this point, users who want to customize both languages and encodings can simply list the spellfiles they want explicitly (with the spellfiles.spellfiles option), right?

Yeah that makes sense!

Also, I am curious why a user would want an encoding other than utf-8. Are there common situations where you would need to fall back to latin1?

I think that mostly happens when dealing with legacy systems or when opening old files (e.g., LaTeX files from like ten years ago were often saved as latin1 due to encoding issues with utf-8).

@MattSturgeon (Member) left a comment:

This looks great! I've left some feedback below.

My only major concern is the way we're pinning URLs and hashes; I think this would too easily get out of sync and result in "old" configs being un-buildable.

Comment on lines 22 to 23
# Extract only .spl and .sug file links
filenames=$(echo "$html" | grep -oP '(?<=href=")[^"]+\.(spl|sug)' | sort -u)
Member:

I'd rather do this with a syntax-aware tool like pandoc, but this is fine for now if it works reliably.

I think a pandoc impl would look very similar to our current docs/fix-links util


hash=$(nix hash convert --to sri --hash-algo sha256 "$sha256")

# Ugly hardcoding of the `%40` -> `@` substitution
Member:

Is there not a URI decode utility we could use to do this properly? Even if we have to add another runtime input?

@MattSturgeon (Member) commented Apr 3, 2025:

There's some different approaches here: https://stackoverflow.com/questions/6250698/how-to-decode-url-encoded-string-in-shell

One answer suggests using python's urllib:

python3 -c "import sys; from urllib.parse import unquote; print(unquote(sys.stdin.read()));"

Although, if we're depending on python it's probably tempting to write the whole script in python instead of bash?

Some bash-based approaches are replacing all instances of '%' with '\x'. IDK if that correctly handles %% though.

This one with printf is nice too.

Member Author:

I tried many bashisms, and none worked. Also, it was very hard to handle the escaping, as this bash script is defined in a .nix file...

Again, I would question the relevance of over-engineering a script whose input changes once every decade.

Member Author:

This one with printf is nice too.

Well, I would very much like to avoid adding this much complexity. I would rather have the script fail completely in 2 years than add this.

Member:

Yeah, I was just listing various options. I'd suggest using python's urllib as the simplest and most robust approach, without too much over-engineering.

# Remove region from language id ("en_us" -> "en")
(map (lang: lib.head (builtins.match "^([^_]+)" lang)))
];
languages = if cfg.languages == "auto" then languagesFromSpelllang else cfg.languages;
Member:

Fancy idea: could we move this to cfg.languages apply function?

That way users can read the value to figure out what languages their "auto" definition actually ended up producing.

Format normalisation could also be done there.

Also makes me wonder if the option should be a list type? 🤔 And maybe auto should be a separate bool option?

Member Author:

Fancy idea: could we move this to cfg.languages apply function?

That way users can read the value to figure out what languages their "auto" definition actually ended up producing.

Good idea! Done.

Format normalisation could also be done there.

What do you mean? I think this is what I have done.

Also makes me wonder if the option should be a list type? 🤔 And maybe auto should be a separate bool option?

I don't really see a benefit there...

@MattSturgeon (Member) commented Apr 3, 2025:

I don't really see a benefit there...

The benefit of a list-type is definition merging, and just generally having a more structural config.

If the option is a list-type, then I can define [ "en" ] in one module, define [ "fr" ] in another module, and it will be merged as [ "en" "fr" ]. With a str-type option, merging is not supported.
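
As a tiny illustration (a sketch, assuming languages were a lib.types.listOf lib.types.str option):

# module-a.nix
{ spellfiles.languages = [ "en" ]; }

# module-b.nix
{ spellfiles.languages = [ "fr" ]; }

# After module evaluation: config.spellfiles.languages == [ "en" "fr" ]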

Member:

Or maybe you meant the separate auto-option? I think the benefit of that separation is simplifying the types and the impl, making things more obvious to end-users.

If you had a separate auto-toggle, you wouldn't need any if value == "auto"; in fact, you wouldn't need an apply at all: you could just have an extra definition that sets config.spellfiles.languages = lib.pipe ...
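
For concreteness, a rough sketch of that shape (hypothetical spellfiles.auto option, not this PR's actual implementation; it assumes opts.spelllang is set as a comma-separated string):

{ lib, config, ... }:
let
  cfg = config.spellfiles;
in
{
  config = lib.mkIf cfg.auto {
    # An extra list-type definition, merged with whatever the user sets explicitly
    spellfiles.languages = lib.pipe config.opts.spelllang [
      (lib.splitString ",")
      # Strip the region: "en_us" -> "en"
      (map (lang: lib.head (lib.splitString "_" lang)))
      lib.unique
    ];
  };
}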

Comment on lines +146 to +163
Maps "en.utf-8.spl" to:
{
name = "spell/en.utf-8.spl";

value.source = pkgs.fetchurl {
url = "https://ftp.nluug.nl/pub/vim/runtime/spell/en.utf-8.spl";
hash = "sha256-/sq9yUm2o50ywImfolReqyXmPy7QozxK0VEUJjhNMHA=";
};
}
Member:

I'm a little concerned about our pinning technique.

Firstly, I assume that spell files are occasionally modified, and don't offer a versioned/pinned mirror?

This leads to my issue with the assumption that we can generate a table of file hashes without running into issues.

For example:
  • A user rolls back to an old nixvim revision: now the hash is outdated.
  • Nixvim is slow to update: same outcome...
  • Someone is bisecting their config.
  • We are bisecting nixvim (which has a test for spell files).

I think we either need to download the actual spell files in our generate script, or find a way to have versioned URLs.

@GaetanLepage (Member, Author) commented Apr 3, 2025:

Also, I thought about downloading the spellfiles, but they are binary files and some of them weigh several dozen GB. EDIT: MB* of course ^^

@GaetanLepage GaetanLepage force-pushed the spellfiles branch 2 times, most recently from 4f994e0 to 4d03d27 Compare April 3, 2025 21:26
@GaetanLepage (Member, Author) commented Apr 3, 2025

Thanks for the review @MattSturgeon!

Of course, I have thought about this issue of "outdated hashes" (see the PR description).
While this is a very valid concern (and could break users' configurations in all the ways you have mentioned), I still think this design is valid.

--> Indeed, the spellfiles haven't been modified in the past 6+ years.

This doesn't mean that they won't get modified again at some point, but I would consider them to be very stable.
I know that this feels wrong or flaky, but when you think about it ~90% of Nixvim's code (if not more) is tracking waaaay more unstable upstream specifications (plugins, nixpkgs tooling and packages). Actually, even Neovim's own API is significantly more unstable than those spellfiles.

@MattSturgeon (Member)

Of course, I have thought about this issue of "outdated hashes" (see the PR description).

Oops, I missed that.

Indeed, the spellfiles haven't been modified in the past 6+ years.

That is reassuring, if nothing else 😀

This doesn't mean that they won't get modified again at some point, but I would consider them to be very stable.

For me the issue is that when they inevitably do change, all configs that had been using the spellfile module will suddenly break without having been changed. This means the module is technically non-deterministic.

It's the same reason why you would specify a package src using a commit or a tag, rather than a branch ref, in nixpkgs. We want old revisions of nixpkgs to continue building indefinitely.

I know that this feels wrong or flaky, but when you think about it ~90% of Nixvim's code (if not more) is tracking waaaay more unstable upstream specifications (plugins, nixpkgs tooling and packages).

I think the key difference is that all those other things are handled by our flake.lock; e.g. nvim api changes are handled by having a specific revision of nvim pinned (via pinning nixpkgs)


All that said: I don't see a viable alternative, short of finding or hosting our own mirror. Which I don't want to do, either.

It's great that the files seem to have been stable for 6 years [1], but it isn't great that they seemingly don't offer a stable/versioned URL we can use.

I did see their README has instructions for building spell files from source. That may be a better alternative, especially if it can be done in such a way that the nixpkgs and/or nix-community binary cache has a cached build of the spellfile packages.

Footnotes

  1. This also makes me question whether this is the most up-to-date mirror...

@GaetanLepage (Member, Author)

  1. This also makes me question whether this is the most up-to-date mirror...

It is the official one used by upstream to download spellfiles: https://github.com/vim/vim/blob/2a6be835122d906c6ec10682c2a771b25c87c611/runtime/autoload/spellfile.vim#L9.

I think the key difference is that all those other things are handled by our flake.lock; e.g. nvim api changes are handled by having a specific revision of nvim pinned (via pinning nixpkgs)

I agree...

Alternatives

  • We could expose the spellfiles in nixpkgs and have them cached by Hydra (pkgs.[neo]vim.passthru.spellfiles). That would be the best alternative I think. Although, (apart from building them from source) we would have the same URL-pinning issue there...
  • Otherwise, why not expose them as one of our flake's outputs (packages)? That way, they would at least be cached on the nix-community cachix cache. Maybe this wouldn't be reproducible either, though.
  • Finally, we could expose a cfg.spellfilesSources option allowing users to specify the url and hash for each spellfile (a sketch follows below). This would also let users use custom mirrors and would prevent anyone from being stuck with a hash-mismatch error.
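
For the last alternative, a user-facing override could look like this (sketch only; cfg.spellfilesSources is a hypothetical option name and the hash is a placeholder):

spellfiles.spellfilesSources."fr.utf-8.spl" = {
  url = "https://my-mirror.example.org/vim/runtime/spell/fr.utf-8.spl";
  hash = "sha256-..."; # user-provided hash for the mirrored file
};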

@MattSturgeon (Member)

  • We could expose the spellfiles in nixpkgs and have them cached by Hydra (pkgs.[neo]vim.passthru.spellfiles). That would be the best alternative I think. Although, (apart from building them from source) we would have the same URL-pinning issue there...

Vim spellfiles are apparently built from libreoffice dictionaries, which I assume are already packaged. If we can reproduce the from-source instructions as a derivation, it should be possible to package the vim spell files as derivations with the nixpkgs libreoffice dicts as nativeBuildInputs

Build steps

This involves downloading the files from the libreoffice github, applying a
patch, and running Vim to generate the .spl file.

To do this all in one go use the Aap program. It's simple to install, it only requires Python.

Before generating spell files, verify your system has the required locale
support. Source the check_locales.vim script to find out. If something is
missing, see LOCALE below.

You can also do it manually:

  1. Fetch the right spell file from:
    https://github.com/LibreOffice/dictionaries

  2. Unzip the archive:

    unzip LL_RR.zip
  3. Apply the patch:

    patch < LL_RR.diff
  4. If the language has multiple regions do the above for each region. E.g.,
    for English there are five regions: US, CA, AU, NZ and GB.

  5. Run Vim and execute :mkspell. Make sure you do this with the correct
    locale, that influences the upper/lower case letters and word characters.
    On Unix it's something like:

    env LANG=en_US.UTF-8 vim
    mkspell! en en_US en_AU en_CA en_GB en_NZ
  6. Repeat step 5 for other locales. For English you could generate a spell
    file for latin1, utf-8 and ASCII. ASCII only makes sense for languages
    that have very few words with non-ASCII letters.

LOCALE

For proper spell file generation the required locale must be installed. Otherwise Vim doesn't know what are letters and upper-lower case differences. Modern systems use UTF-8, but we also generate spell files for 8-bit locales
for users with older systems.

On Ubuntu the default is to only support locales for your own language. To
add others you need to do this:

sudo vim /var/lib/locales/supported.d/local
# Add needed lines from /usr/share/i18n/SUPPORTED
sudo dpkg-reconfigure locales

When using the check_locales.vim script, you need to exit Vim and restart it
to pickup the newly installed locales.

I guess we'd have to vendor the patch files within nixpkgs, unless there's a safe url from which they can be fetched (are they in a git repo?).

The thing I'm most unsure of is enabling all required locales within the nix builder sandbox. I guess we'd start by running check_locales.vim in the build sandbox and printing the output to $out?
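
For what it's worth, here is a very rough, untested sketch of such a derivation following the README steps quoted above. The pinned revision, dictionary paths and mkspell arguments are assumptions, and the LL_RR.diff patches are left out:

{ lib, stdenvNoCC, fetchFromGitHub, vim, glibcLocales }:

stdenvNoCC.mkDerivation {
  pname = "vim-spellfile-en-utf8";
  version = "0-unstable";

  src = fetchFromGitHub {
    owner = "LibreOffice";
    repo = "dictionaries";
    rev = "..."; # pin a specific commit
    hash = lib.fakeHash; # replace after the first build
  };

  nativeBuildInputs = [ vim ];

  # Make the required locale available inside the build sandbox
  LOCALE_ARCHIVE = "${glibcLocales}/lib/locale/locale-archive";
  LANG = "en_US.UTF-8";

  # NOTE: the LL_RR.diff patches mentioned in the README would have to be
  # vendored (or fetched) and applied to the .aff/.dic files before this step.
  buildPhase = ''
    runHook preBuild
    # Run :mkspell! non-interactively; the dictionary paths and arguments are
    # assumptions based on the repo layout and may need adjusting.
    vim -u NONE -i NONE -n -e -s \
      -c "set encoding=utf-8" \
      -c "mkspell! en en/en_US en/en_GB" \
      -c "qall!"
    runHook postBuild
  '';

  installPhase = ''
    runHook preInstall
    install -Dm644 en.utf-8.spl $out/spell/en.utf-8.spl
    runHook postInstall
  '';
}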

@GaetanLepage GaetanLepage marked this pull request as draft April 6, 2025 19:16
Successfully merging this pull request may close these issues:

  • Cannot find word list "fr.utf-8.spl" or "fr.ascii.spl"