A Python library for extracting clean text from Wikipedia articles. This is a refactored and modularized version of the original WikiExtractor tool, designed to be more maintainable and easier to integrate into other projects.
- Clean text extraction from Wikipedia markup
- Template expansion support
- Multiple output formats: plain text, JSON, Markdown
- Configurable processing options
- Modular architecture for easy customization
- Support for multiple Wikipedia language editions
- HTML entity handling and cleanup
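
To make the last item concrete, here is a minimal sketch of entity cleanup using only the Python standard library. It is not taken from this library's source: the helper name `clean_entities` is hypothetical, and the library's own handling may differ in detail.

```python
import html
import re


def clean_entities(text: str) -> str:
    """Decode HTML entities and tidy the whitespace they leave behind.

    Hypothetical helper for illustration only; not part of wiki_extractor.
    """
    decoded = html.unescape(text)              # "&amp;" -> "&", "&eacute;" -> "é"
    decoded = decoded.replace("\u00a0", " ")   # non-breaking space -> plain space
    return re.sub(r"[ \t]+", " ", decoded).strip()  # collapse runs of spaces/tabs


print(clean_entities("Caf&eacute; &amp; bar&nbsp;&nbsp;&lt;1&gt;"))
```

Note that `html.unescape` decodes a single level, so a double-escaped sequence such as `&amp;amp;` becomes `&amp;` rather than `&`.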
 
git clone https://github.com/Phongng26/wiki-extractor.git
cd wiki-extractor
pip install -r requirements.txt

pip install wiki-extractor

"""
Basic usage example for WikiExtractor
Simple demonstration with raw Wikipedia markup
"""
from wiki_extractor.extractor import Extractor
# Example raw Wikipedia markup (usually fetched via the Wikipedia API)
raw_text = """
{{Short description|Quantum algorithm}}
'''Shor's algorithm''' is a [[quantum algorithm]] for integer factorization...
"""
# Initialize extractor
extractor = Extractor(
    id="1",
    revid="101",
    urlbase="https://en.wikipedia.org/wiki",
    title="Shor's algorithm",
    page=raw_text
)
# Extract clean text (list of paragraphs)
result = extractor.clean_text(raw_text)
print("Number of paragraphs:", len(result))
print("First paragraph:", result[0])

The library provides several configuration options:
- keepLinks: Preserve internal links in output
- keepSections: Keep section structure
- HtmlFormatting: Enable HTML formatting
- markdown: Output in Markdown format
- language: Target language code
- discardSections: Set of section titles to discard
- discardTemplates: Set of template names to discard
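
As a rough illustration of what an option like discardSections implies, the sketch below drops whole sections from raw wikitext by title. It is a standalone approximation under stated assumptions, not the library's implementation: the function `drop_sections`, its case-insensitive matching, and the choice to drop nested subsections along with their parent are all assumptions made for the example.

```python
import re

# Matches MediaWiki section headings such as "== History ==" (levels 2 to 6).
HEADING = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")


def drop_sections(wikitext: str, discard_sections: set[str]) -> str:
    """Remove every section whose title is in discard_sections.

    Hypothetical helper: matching is case-insensitive and subsections
    nested under a discarded heading are removed with it.
    """
    discard = {title.lower() for title in discard_sections}
    kept = []
    skip_level = None  # heading depth of the section currently being skipped
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            if skip_level is not None and level <= skip_level:
                skip_level = None  # a sibling or parent heading ends the skip
            if skip_level is None and title.lower() in discard:
                skip_level = level
        if skip_level is None:
            kept.append(line)
    return "\n".join(kept)


sample = "\n".join([
    "Intro",
    "== History ==",
    "Old",
    "=== Early ===",
    "Older",
    "== References ==",
    "Ref1",
    "== See also ==",
    "Links",
])
print(drop_sections(sample, {"History", "References"}))
```

Filtering on raw headings before any other cleanup keeps the example independent of how the extractor represents sections internally, which this README does not specify.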
- Python 3.10+
 
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
 
# Run tests
python -m pytest tests/
# Run tests with coverage
python -m pytest tests/ --cov=wiki_extractor

This project is licensed under the MIT License - see the LICENSE file for details.
- Based on the original WikiExtractor by Giuseppe Attardi
- Inspired by the MediaWiki markup processing community
 
See CHANGELOG.md for a detailed history of changes.