
=============================================================================

Copyright (c) 2013. João Ventura (joaojonesventura@gmail.com).

==============================================================================

WikiCorpusExtractor is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License, version 3, as published by
the Free Software Foundation.

WikiCorpusExtractor is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
details.

You should have received a copy of the GNU General Public License along with
this program. If not, see http://www.gnu.org/licenses/.

=============================================================================

WikiCorpusExtractor is a Python library for creating corpora from Wikipedia XML dump files. The target audience is people who need a collection of texts for language processing tools.

The output of this library is a text file of the form:

Text which is tokenized , i.e., words and punctuation are separated by a space . Some special words like step-by-step or U.S.A. are correctly handled . ...
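
Since tokens are separated by single spaces, downstream tools only need a plain whitespace split to recover them. Below is a minimal sketch of reading such a corpus file; the path is a placeholder for whatever file the library produced, not something the snippet creates itself:

```python
# Minimal sketch: read a corpus file produced by WikiCorpusExtractor and
# recover the tokens of each line with a plain whitespace split.
# 'my_corpus.txt' is a placeholder for the file created by createCorpus().
corpus_path = 'Resources/corpora/my_corpus.txt'

with open(corpus_path) as corpus:
    for line in corpus:
        tokens = line.split()
        # hand `tokens` to your language processing pipeline
        print(tokens[:10])
```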

Usage for building an English corpus (for other languages, use a dump of the corresponding Wikipedia)

DOWNLOAD XML DUMP FILE
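
Dumps of the various Wikipedias are published at https://dumps.wikimedia.org/; smaller topical subsets can also be exported through Wikipedia's Special:Export page. Below is a minimal download sketch (Python 3); the wiki name, URL, and destination path are placeholders, not something WikiCorpusExtractor provides or requires:

```python
# Minimal download sketch (Python 3). The dump URL follows the usual
# dumps.wikimedia.org naming scheme; the wiki name ('simplewiki') and the
# destination path are placeholders, so pick whichever dump you actually need.
import os
import urllib.request

DUMP_URL = ('https://dumps.wikimedia.org/simplewiki/latest/'
            'simplewiki-latest-pages-articles.xml.bz2')
DEST = 'Resources/sources/simplewiki-latest-pages-articles.xml.bz2'

os.makedirs(os.path.dirname(DEST), exist_ok=True)
urllib.request.urlretrieve(DUMP_URL, DEST)
print('Downloaded', DUMP_URL, 'to', DEST)
```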

CREATE A CORPUS FROM THE XML DUMP FILE (Python example)

```python
from wikiXMLDump import WikiXMLDumpFile

#======== MAIN ==========
if __name__ == "__main__":

    # Sources
    enSource = 'Resources/sources/EN_Medicine_depth2.xml.bz2'

    # Create object
    wk = WikiXMLDumpFile(enSource)
    # Show a document
    wkDoc = wk.getWikiDocumentByTitle('Abortion')
    print wkDoc
    # Print the Portuguese translation of the title (if available)
    print wkDoc.getTranslatedTitle('pt')
    # Clean wikipedia markup and tokenize the text
    wkDoc.cleanText()
    wkDoc.tokenizeText()
    print wkDoc
    # Create a corpus of about 4M words with a minimum of about 500 words per document
    wk.createCorpus(filename='Resources/corpora/EN_Medicin_corpora.txt',
                    minWordsByDoc=500, maxWords=4000000)
```

Enjoy! :)