Runtime errors with ja, zh, and id. #550

Closed
cyanic-selkie opened this issue Sep 5, 2023 · 6 comments

@cyanic-selkie

Hi,

Thank you for the awesome library!

I am currently using dumpster-dip to generate a dataset from all Wikipedia languages. It ran fine for all languages except ja, zh, and id.

Specifically, for ja and zh I got the following error:

TypeError [Error]: Cannot read properties of undefined (reading '0')
    at Object.max (file:///.../node_modules/wtf_wikipedia/src/template/custom/text-only/functions.js:544:25)
    at parseTemplate (file:///.../node_modules/wtf_wikipedia/src/template/parse/index.js:60:32)
    at parseNested (file:///.../node_modules/wtf_wikipedia/src/template/index.js:18:24)
    at file:///.../node_modules/wtf_wikipedia/src/template/index.js:40:28
    at Array.forEach (<anonymous>)
    at allTemplates (file:///.../node_modules/wtf_wikipedia/src/template/index.js:40:10)
    at process (file:///.../node_modules/wtf_wikipedia/src/template/index.js:51:24)
    at new Section (file:///.../node_modules/wtf_wikipedia/src/02-section/Section.js:59:5)
    at parseSections (file:///.../node_modules/wtf_wikipedia/src/02-section/index.js:69:19)
    at new Document (file:///.../node_modules/wtf_wikipedia/src/01-document/Document.js:81:22)

For id, I got:

TypeError [Error]: Cannot read properties of undefined (reading 'substr')
    at str mid (file:///.../node_modules/wtf_wikipedia/src/template/custom/text-only/functions.js:68:20)
    at parseTemplate (file:///.../node_modules/wtf_wikipedia/src/template/parse/index.js:60:32)
    at parseNested (file:///.../node_modules/wtf_wikipedia/src/template/index.js:18:24)
    at file:///.../node_modules/wtf_wikipedia/src/template/index.js:15:36
    at Array.forEach (<anonymous>)
    at parseNested (file:///.../node_modules/wtf_wikipedia/src/template/index.js:15:20)
    at file:///.../node_modules/wtf_wikipedia/src/template/index.js:40:28
    at Array.forEach (<anonymous>)
    at allTemplates (file:///.../node_modules/wtf_wikipedia/src/template/index.js:40:10)
    at process (file:///.../node_modules/wtf_wikipedia/src/template/index.js:51:24)
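Both traces point at template functions reading a property of `undefined`, which is what happens when a template is called with fewer arguments than the function expects. The following is a minimal sketch of a defensive guard for that situation. It is not the wtf_wikipedia source and not the fix that was later released; `getArg` and `strMid` are hypothetical names, and the `{{str mid|text|start|length}}` semantics (1-based start) are simplified.

```javascript
// Hypothetical helper, not part of wtf_wikipedia: return a positional
// template argument, or a fallback when the template omits it.
function getArg(args, index, fallback = '') {
  if (!Array.isArray(args) || args[index] === undefined) {
    return fallback
  }
  return args[index]
}

// Simplified stand-in for a "str mid"-style template function.
// Missing arguments fall back to defaults instead of throwing.
function strMid(args) {
  const text = String(getArg(args, 0))
  const start = parseInt(getArg(args, 1, '1'), 10) || 1
  const rawLength = parseInt(getArg(args, 2, ''), 10)
  const length = Number.isNaN(rawLength) ? text.length : rawLength
  return text.slice(start - 1, start - 1 + length)
}

console.log(strMid(['abcdef', '2', '3'])) // 'bcd'
console.log(strMid([])) // '' rather than a TypeError
```

The design choice here is to degrade to an empty string when a template is malformed, so one bad page does not abort a full dump.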

It is also worth noting that for es to complete successfully, I had to set --max-old-space-size to 20000, which seems excessive, especially since no other language requires changing the default. If I left it at default (or even set to 10000), I got the following error:

Error [ERR_WORKER_OUT_OF_MEMORY]: Worker terminated due to reaching memory limit: JS heap out of memory
    at new NodeError (node:internal/errors:405:5)
    at [kOnExit] (node:internal/worker:313:26)
    at Worker.<computed>.onexit (node:internal/worker:229:20) {
  code: 'ERR_WORKER_OUT_OF_MEMORY'
}
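When chasing a heap limit like this, it can help to log heap usage while the job runs and see whether it grows steadily. The following is a rough sketch using Node's built-in `process.memoryUsage()`; `heapUsedMb` and `watchHeap` are hypothetical helper names, and the 10-second interval is an arbitrary choice. Note that `heapUsed` refers to the current thread only, so in a worker-based run it would have to be called inside the worker to see that worker's heap.

```javascript
// Hypothetical helper: current V8 heap usage of this thread, in megabytes.
function heapUsedMb() {
  const { heapUsed } = process.memoryUsage()
  return Math.round(heapUsed / 1024 / 1024)
}

// Log heap usage at a fixed interval; returns a function that stops logging.
function watchHeap(intervalMs = 10000, log = console.log) {
  const timer = setInterval(() => {
    log(`heap used: ${heapUsedMb()} MB`)
  }, intervalMs)
  timer.unref() // do not keep the process alive just for logging
  return () => clearInterval(timer)
}

const stop = watchHeap()
// ... start the long-running parse here, then call stop() when it finishes
stop()
```

A heap that climbs without levelling off suggests retained references rather than a single oversized page.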
@spencermountain
Owner

Thank you! I'll take a look at fixing these runtime errors tomorrow and will release a fix for them ASAP.
Cheers

spencermountain added a commit that referenced this issue Sep 6, 2023
@spencermountain spencermountain mentioned this issue Sep 6, 2023
Merged
@spencermountain
Owner

hey @cyanic-selkie - both errors should be fixed now in 10.1.6. Let me know if you see any others.

Yeah - the es memory issue looks like a memory leak in dumpster-dip - can you help me reproduce it?
I haven't seen it before.
cheers

@cyanic-selkie
Author

cyanic-selkie commented Sep 7, 2023

I just reran it for id, zh, and ja and it works without any errors.

The es issue remains. I am using Node 20.5.1 on a server with 64 threads and 128 GB of RAM. The code is here. I tried it with both 64 and 8 workers; the error happens in both cases after a few minutes of parsing. Do you need any additional information to help you reproduce it?

On a side note, I'd like to suggest using ^10 or similar for the wtf_wikipedia dependency version if you're using SemVer, since I had to clone the repository in order to update to the new version.
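For readers unfamiliar with caret ranges: under SemVer, `^10` accepts any release with major version 10 (at least 10.0.0 and below 11.0.0), so a patch release like 10.1.6 would be picked up without a change to the dependent package. The check below is only an illustrative sketch with a hypothetical helper name, not how npm resolves ranges internally, and it deliberately ignores prerelease tags.

```javascript
// Hypothetical helper: does `version` fall in the range written as ^<major>?
// Valid for a major version of 1 or higher; prerelease versions are rejected.
function satisfiesCaretMajor(version, major) {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version)
  if (!match) return false
  return Number(match[1]) === major
}

console.log(satisfiesCaretMajor('10.1.6', 10)) // true
console.log(satisfiesCaretMajor('11.0.0', 10)) // false
```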

@spencermountain
Owner

spencermountain commented Sep 9, 2023

thanks - that's a real doozy. Wonder why it's only Spanish??
I looked at the script, and you haven't declared a few of those variables, which may be the cause.

i just ran es on my mac and it ran smoothly:

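import path from 'path'
import dip from 'dumpster-dip'

const lang = 'es'
const dir = '/path/to/dumps' // placeholder: folder holding the unzipped dump

const opts = {
  input: path.join(dir, `/${lang}wiki-latest-pages-articles.xml`),
  outputMode: 'ndjson',
  outputDir: path.join(dir, lang),
  // return the full JSON for each page
  parse: function (doc) {
    return doc.json()
  }
}
dip(opts).then(() => {
  console.log('done!')
})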

will you try that, on your machine?
cheers

@spencermountain
Owner

good idea using ^10. Will add that to the next release.

@cyanic-selkie
Author

@spencermountain I just fixed the variable declarations and it works perfectly. I'm not used to JS, so thanks!
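For anyone else coming from another language: the original script is not reproduced in this thread, so the following is only a general illustration of the undeclared-variable pitfall, with hypothetical names throughout. In non-strict JavaScript, assigning to a name that was never declared with `let` or `const` silently creates a global, and that global keeps its value alive after the function returns, which may add up when done once per parsed page. Strict mode (which ES modules always use) turns the same mistake into a `ReferenceError`.

```javascript
// Function bodies built with the Function constructor are non-strict unless
// they opt in, so this behaves like code in a classic (non-module) script.
const sloppyParse = new Function('doc', `
  lastTitle = doc.title   // no let/const: becomes a property of globalThis
  return lastTitle
`)

const strictParse = new Function('doc', `
  'use strict'
  otherTitle = doc.title  // same mistake, but strict mode rejects it
  return otherTitle
`)

sloppyParse({ title: 'Madrid' })
console.log('lastTitle' in globalThis) // true: the value outlives the call

let strictError = null
try {
  strictParse({ title: 'Madrid' })
} catch (err) {
  strictError = err
}
console.log(strictError instanceof ReferenceError) // true
```

Declaring every variable with `const` or `let` inside the function that uses it avoids both outcomes.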
