A simple tool to create a sorted list of the most common words from a set of pages picked at random from Wikipedia.
To use it, you'll need the wikipedia Python library, installable with:
pip install wikipedia
Run the script with:
wwl.py [OPTIONS]
-h, --help: Shows the help message.
-p, --pages: Sets the number of pages to process (default 100).
-l, --lang: Sets the language of the pages to retrieve (default en).
-s, --special: Sets the special chars to use as splitters (the space is always used, default \\!\"/()[]{}=?\'<>,;.:-—_+*@#«»).
-m, --min: Sets the minimum length of the words to process (default 1).
-M, --max: Sets the maximum length of the words to process (0 for infinity, default 0).
-t, --threads: Sets the maximum number of threads working simultaneously to retrieve pages (default 1).
-T, --timeout: Sets the maximum time to wait for the threads to retrieve the pages once the last thread has started (0 for infinity, default 30).
-o, --output: Specifies the output file location (default output.txt).
-w, --words: Specifies the maximum number of words to save (0 for infinity, default 0).
-d, --debug: Shows debug level logs.
wwl.py -p 1000 -l it -m 8 -t 50 -w 100
Saves the 100 most common words of 8 or more characters from 1000 Italian pages, using up to 50 threads.
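To illustrate how the --special, --min, --max and --words options could interact, here is a minimal sketch of the splitting and counting step. It is not taken from wwl.py: the function name count_words, its parameters, the shortened default splitter set, and the lowercasing of words are all assumptions made for this example, and page retrieval is left out entirely.

```python
import re
from collections import Counter


def count_words(texts, special="!?,;.:()", min_len=1, max_len=0, max_words=0):
    """Hypothetical helper: count words across an iterable of page texts.

    A value of 0 for max_len or max_words means "no limit", mirroring
    the convention described for the -M and -w options.
    """
    # Split on whitespace or on any of the special characters.
    # re.escape keeps characters such as ']' or '-' literal inside the class.
    pattern = "[" + re.escape(special) + r"\s]+"
    counts = Counter()
    for text in texts:
        # Lowercasing is an assumption of this sketch, not documented behavior.
        for word in re.split(pattern, text.lower()):
            if not word or len(word) < min_len:
                continue
            if max_len and len(word) > max_len:
                continue
            counts[word] += 1
    # most_common(None) returns every entry, sorted by descending count.
    return counts.most_common(max_words or None)


if __name__ == "__main__":
    pages = ["The cat, the dog.", "The bird!"]
    for word, count in count_words(pages, min_len=3, max_words=2):
        print(word, count)
```

Building a single character class from the splitters keeps the split to one regular-expression pass per page, and Counter.most_common handles both the sorting and the optional cap on the number of words.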
WikipediaWordList is released under the Apache License 2.0.