Hi, thank you for sharing this outstanding repository!
I have been trying to use scripts/make_wikipedia.py to process a German Wikipedia dump:
python scripts/make_wikipedia.py --output wikipedia --lang de --date 20240201 --processes 16
Unfortunately, it has been running for several days and, if I interpret the output correctly, it has made very little progress:
[...]
WARNING:root:Template errors in article 'Buckenhof' (395836): title(0) recursion(96, 0, 0)
WARNING:root:Template errors in article 'Imsterberg' (395533): title(0) recursion(7929961, 0, 0)
WARNING:root:Template errors in article 'Spardorf' (395843): title(0) recursion(96, 0, 0)
WARNING:root:Template errors in article 'Marloffstein' (395848): title(0) recursion(96, 0, 0)
WARNING:root:Template errors in article 'Karres' (395572): title(0) recursion(7929961, 0, 0)
[...]
At this speed, it would take weeks to complete. Using htop, I can see that all processes are busy, so I don't think this is a multiprocessing problem (#58); I am also running it on a Linux machine.
This is likely a problem in the underlying wikiextractor library, but since there seems to be little to no activity there, I am interested in your experience with this script. Is it normal for it to take so long?
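For anyone who wants to reproduce the diagnosis, here is a minimal sketch (not part of this repository or of wikiextractor) that tallies the warnings above by their first recursion value, to see which values dominate the output. It assumes the log lines have exactly the format shown above; the function names `parse_warning` and `summarize` are hypothetical, and the meaning of the three recursion fields is not assumed.

```python
import re
from collections import Counter

# Matches warning lines in the format quoted above. The three recursion
# fields are kept as raw integers because their meaning is not documented here.
WARNING_RE = re.compile(
    r"^WARNING:root:Template errors in article '(?P<title>.*)' "
    r"\((?P<page_id>\d+)\): title\((?P<title_errs>\d+)\) "
    r"recursion\((?P<r0>\d+), (?P<r1>\d+), (?P<r2>\d+)\)$"
)


def parse_warning(line):
    """Return a dict for a matching warning line, or None otherwise."""
    m = WARNING_RE.match(line.strip())
    if m is None:
        return None
    return {
        "title": m.group("title"),
        "page_id": int(m.group("page_id")),
        "title_errors": int(m.group("title_errs")),
        "recursion": (int(m.group("r0")), int(m.group("r1")), int(m.group("r2"))),
    }


def summarize(lines):
    """Count warnings grouped by the first recursion value."""
    counts = Counter()
    for line in lines:
        parsed = parse_warning(line)
        if parsed is not None:
            counts[parsed["recursion"][0]] += 1
    return counts
```

If the script's output were redirected to a file (the file name is up to you), `summarize(open(path))` would return the counts; non-matching lines are skipped.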