RanAR90/WikiCorpusExtractor


==============================================================================
 Copyright (c) 2013. João Ventura (joaojonesventura@gmail.com).
==============================================================================

 WikiCorpusExtractor is free software; you can redistribute it and/or modify it
 under the terms of the GNU General Public License, version 3,
 as published by the Free Software Foundation.

 WikiCorpusExtractor is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 GNU General Public License for more details.

 You should have received a copy of the GNU General Public License
 along with this program. If not, see http://www.gnu.org/licenses/.
==============================================================================

WikiCorpusExtractor is a Python library for creating corpora from Wikipedia XML dump files. The target audience is anyone who needs a collection of texts for language-processing tools.

The output of this library is a text file of the form:

Text which is tokenized , i.e., words and punctuation are separated by a space . Some special words like step-by-step or U.S.A. are correctly handled . ...
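Because tokens are separated by single spaces, downstream tools can consume the corpus with a plain whitespace split. A minimal sketch of reading such a file (the path matches the corpus created in the example below and is otherwise arbitrary):

# Read the generated corpus and split it into tokens.
# A whitespace split is enough because the extractor separates
# words and punctuation by a single space.
with open('Resources/corpora/EN_Medicine_corpus.txt') as f:
    tokens = f.read().split()

print(len(tokens))    # total token count
print(tokens[:10])    # first ten tokens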

Usage for building an English corpus (the same steps apply to the other Wikipedias for other languages)

DOWNLOAD XML DUMP FILE
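Full dumps are published at https://dumps.wikimedia.org/; a topic-specific file like the 'EN_Medicine_depth2.xml.bz2' used below is the kind of export you can build yourself, e.g. via Wikipedia's Special:Export page. A minimal download sketch, assuming Python 3 and the standard 'latest pages-articles' dump name (the full English dump is many gigabytes):

# Download the latest English Wikipedia articles dump.
import urllib.request

url = ('https://dumps.wikimedia.org/enwiki/latest/'
       'enwiki-latest-pages-articles.xml.bz2')
urllib.request.urlretrieve(url, 'Resources/sources/enwiki.xml.bz2')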

CREATE A CORPUS FROM THE XML DUMP FILE (Python example)

from wikiXMLDump import WikiXMLDumpFile

#======== MAIN ==========
if __name__ == "__main__":

    # Sources
    enSource = 'Resources/sources/EN_Medicine_depth2.xml.bz2'

    # Create object
    wk = WikiXMLDumpFile(enSource)
    # Show a document
    wkDoc = wk.getWikiDocumentByTitle('Abortion')
    print(wkDoc)
    # Print the Portuguese translation of the title (if available)
    print(wkDoc.getTranslatedTitle('pt'))
    # Clean Wikipedia markup and tokenize the text
    wkDoc.cleanText()
    wkDoc.tokenizeText()
    print(wkDoc)
    # Create a corpus of about 4M words with a minimum of about 500 words per document
    wk.createCorpus(filename='Resources/corpora/EN_Medicine_corpus.txt',
                    minWordsByDoc=500, maxWords=4000000)

Enjoy! :)
