@esimonov

What

Read only the first line of the .jsonl file, rather than the entire file, to determine its dictinfo. That suffices because of the JSONL specification:

Each Line [in a JSONL file] is a Valid JSON Value - https://jsonlines.org/
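
To illustrate the idea, here is a minimal sketch of reading only the first record of a JSONL file. The helper name `first_jsonl_record` is hypothetical and is not taken from the Vocabsieve codebase; the sketch only shows that iterating over the file object stops after one line instead of loading the whole dump.

```python
import json


def first_jsonl_record(path):
    """Return the first non-empty line of a JSONL file, parsed as JSON.

    Hypothetical helper for illustration only. Iterating over the file
    object reads lazily, so a multi-gigabyte dump is never loaded whole.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate stray blank lines before the first record
                return json.loads(line)
    raise ValueError(f"{path} contains no JSON lines")
```

The returned value can then be inspected (for example, which keys it has) to decide which format the dictionary is in, without touching the rest of the file.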

Why

The English nouns Kaikki Wiktionary dump (~1.4 GB) fails to import on my Windows PC (16 GB RAM) with a MemoryError.

The stack trace leads to the dictinfo function, which loads the entire file into memory just to determine its format. By contrast, the parseKaikki function, which also processes JSONL files, is already optimised with line-by-line reads.

@esimonov
Author

Hey team, could you review this, please? Without these changes, I'm unable to add nouns to my deck, as the Wiktionary API is returning 403s.

2025-09-21 15:57:44.199 | ERROR    | vocabsieve.sources.wiktionary_source:_lookup:35 - Failed to get data from Wiktionary: HTTPError('403 Client Error: Forbidden for url: https://en.wiktionary.org/api/rest_v1/page/definition/Language')

@1over137
Contributor

Sorry, it seems this slipped through my inbox. This is a good idea, but can you also make it handle the case where it is given a jsonl.{xz,gz,bz2} file?
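
One way the compressed case could be handled is sketched below, assuming the compression format can be chosen from the file extension. The names `open_maybe_compressed` and `first_record` are hypothetical and do not describe the code in this branch. The standard-library `gzip`, `lzma`, and `bz2` modules all provide an `open` function with a text mode that decompresses as the file is read, so only the start of the archive is touched.

```python
import bz2
import gzip
import json
import lzma

# Map file suffixes to stdlib openers that decompress while streaming.
_OPENERS = {".gz": gzip.open, ".xz": lzma.open, ".bz2": bz2.open}


def open_maybe_compressed(path):
    """Open a plain or compressed text file by extension (hypothetical helper)."""
    for suffix, opener in _OPENERS.items():
        if path.endswith(suffix):
            return opener(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def first_record(path):
    """Parse only the first non-empty line of a possibly compressed JSONL file."""
    with open_maybe_compressed(path) as f:
        for line in f:
            if line.strip():
                return json.loads(line)
    raise ValueError(f"{path} contains no JSON lines")
```

Choosing the opener by extension keeps the sketch short; sniffing magic bytes would be an alternative if file names cannot be trusted.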

@esimonov
Author

@1over137 Absolutely! I updated my branch. Please make sure that you agree with the new contents of .gitignore: I had to update it in order to verify my changes with unit tests.

@esimonov
Author

Ping, in case your inbox was too busy again :)

@esimonov
Author

esimonov commented Oct 4, 2025

Weekly bump anyway.
