make_wikipedia.py: long running time #121

Open
@chschroeder

Hi, thank you for sharing this outstanding repository!

I have been trying to use scripts/make_wikipedia.py to process a German Wikipedia dump:

python scripts/make_wikipedia.py --output wikipedia --lang de --date 20240201 --processes 16

Unfortunately, it has been running for several days, and if I interpret the output correctly, it has made very little progress:

[...]
WARNING:root:Template errors in article 'Buckenhof' (395836): title(0) recursion(96, 0, 0)
WARNING:root:Template errors in article 'Imsterberg' (395533): title(0) recursion(7929961, 0, 0)
WARNING:root:Template errors in article 'Spardorf' (395843): title(0) recursion(96, 0, 0)
WARNING:root:Template errors in article 'Marloffstein' (395848): title(0) recursion(96, 0, 0)
WARNING:root:Template errors in article 'Karres' (395572): title(0) recursion(7929961, 0, 0)
[...]
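
To get a rough overview of which articles trigger these warnings, the log lines quoted above can be parsed with a small script. This is only a sketch based on the line format shown here: the function names are hypothetical, the number in parentheses is assumed to be the page ID, and the meaning of the three `recursion(...)` values is not known, so they are kept as raw integers.

```python
import re
from collections import Counter

# Pattern derived from the warning lines quoted in this issue.
# Assumption: the number in parentheses after the title is the page ID.
WARNING_RE = re.compile(
    r"^WARNING:root:Template errors in article '(?P<title>.*)' "
    r"\((?P<page_id>\d+)\): title\((?P<title_errs>\d+)\) "
    r"recursion\((?P<r0>\d+), (?P<r1>\d+), (?P<r2>\d+)\)$"
)


def parse_warning(line):
    """Return the fields of one warning line as a dict, or None if it does not match."""
    m = WARNING_RE.match(line.strip())
    if m is None:
        return None
    return {
        "title": m.group("title"),
        "page_id": int(m.group("page_id")),
        "title_errors": int(m.group("title_errs")),
        # Meaning of the three values is unknown; kept as raw integers.
        "recursion": (int(m.group("r0")), int(m.group("r1")), int(m.group("r2"))),
    }


def summarize(lines):
    """Count warnings grouped by the first recursion value."""
    counts = Counter()
    for line in lines:
        record = parse_warning(line)
        if record is not None:
            counts[record["recursion"][0]] += 1
    return counts


sample = [
    "WARNING:root:Template errors in article 'Buckenhof' (395836): title(0) recursion(96, 0, 0)",
    "WARNING:root:Template errors in article 'Imsterberg' (395533): title(0) recursion(7929961, 0, 0)",
    "WARNING:root:Template errors in article 'Spardorf' (395843): title(0) recursion(96, 0, 0)",
    "WARNING:root:Template errors in article 'Marloffstein' (395848): title(0) recursion(96, 0, 0)",
    "WARNING:root:Template errors in article 'Karres' (395572): title(0) recursion(7929961, 0, 0)",
]

for value, count in summarize(sample).most_common():
    print(value, count)
```

Grouping by the first recursion value is just one way to see whether a few recurring values account for most of the warnings; it says nothing about how long each article takes to process.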

At this speed, it would take weeks to complete. Using htop, I can see that all processes are busy, so I don't think this is a multiprocessing problem (#58). I am also running it on a Linux machine.

This is likely a problem in the underlying wikiextractor library, but since there seems to be little to no activity there, I am interested in your experience with this script. Is it normal for it to take this long?
