peter17/mediawiki-parser

An experimental Python parser for MediaWiki syntax with a focus on extensibility and comprehensibility

Presentation

This is a parser for MediaWiki's (MW) syntax. Its goal is to transform wikitext into an abstract syntax tree (AST) and then render this AST into various formats such as plain text and HTML.

It is an original work by Peter Potrowl and his mentor Erik Rose, carried out during the Google Summer of Code 2011.

Requirements

This parser relies on Pijnu. You must install the latest version of Pijnu, available at: https://github.com/peter17/pijnu

Do not use the version from http://spir.wikidot.com, which is outdated.

For a basic installation, just run:

pip install mediawiki-parser

How it works

Two files, preprocessor.pijnu and mediawiki.pijnu, describe the MW syntax using patterns that form a grammar. Pijnu interprets those grammars and uses them to match the wikitext content and build the AST.

Then, specific Python functions render the leaves of the AST into the desired format.

Two grammars are needed because the templates in the wikitext are first substituted by a preprocessor before the content of the page is actually parsed.
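
As a rough sketch of that two-pass flow (the template name, its content and the input wikitext here are invented for illustration; the real API is detailed further down):

templates = {'greeting': "Hello ''world''"}

from mediawiki_parser.preprocessor import make_parser
preprocessor = make_parser(templates)

# First pass: the preprocessor grammar expands "{{greeting}}!" into "Hello ''world''!".
preprocessed_text = preprocessor.parse("{{greeting}}!\n")

# Second pass: the main grammar parses the expanded wikitext into an AST,
# whose leaves a postprocessor (here the plain-text one) then renders.
from mediawiki_parser.text import make_parser as make_text_parser
renderer = make_text_parser()
output = renderer.parse(preprocessed_text.leaves())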

Building the parsers

The preprocessor and mediawiki parsers must be built from the Pijnu grammars before you can use mediawiki-parser. You can build them through setup.py, possibly setting PYTHONPATH to point at pijnu:

cd /PATH/TO/mediawiki-parser/
env PYTHONPATH=/PATH/TO/pijnu python setup.py build_parsers

How to test

The simplest way to test the tool at the moment is to put wikitext in the wikitext.txt file. Then, run:

python parser.py

and the wikitext will be rendered as HTML in the article.htm file.

Other ways might be implemented in the future.

Unit tests

Install nose and run:

cd /PATH/TO/mediawiki-parser/
env PYTHONPATH=/PATH/TO/pijnu/ nosetests tests

How to use in a program

Example for HTML

To render wikitext as HTML from a Python program, you can use the following lines:

templates = {}
allowed_tags = []
allowed_self_closing_tags = []
allowed_attributes = []
interwiki = {}
namespaces = {}

from mediawiki_parser.preprocessor import make_parser
preprocessor = make_parser(templates)

from mediawiki_parser.html import make_parser
parser = make_parser(allowed_tags, allowed_self_closing_tags, allowed_attributes, interwiki, namespaces)

# 'source' should contain the wikitext to render
preprocessed_text = preprocessor.parse(source)
output = parser.parse(preprocessed_text.leaves())

The output string will contain the rendered HTML. Describe the behavior you expect by filling in the variables defined on the first lines (a filled-in sketch follows the list below):
  • if the wikitext calls foreign templates, put their names and content in the templates dict (e.g.: {'my template': 'my template content'})
  • if some HTML tags are allowed on your wiki, list them in the allowed_tags list (e.g.: ['center', 'big', 'small', 'span']; avoid 'script' and some others, for security reasons)
  • if some self-closing HTML tags are allowed on your wiki, list them in the allowed_self_closing_tags list (e.g.: ['br', 'hr']; avoid 'script' and some others, for security reasons)
  • for the allowed HTML tags, list the attributes they may use in the allowed_attributes list (e.g.: ['style', 'class']; avoid 'onclick' and some others, for security reasons)
  • if you want to be able to use interwiki links, list the foreign wikis in the interwiki dict (e.g.: {'fr': 'http://fr.wikipedia.org/wiki/'})
  • if you want to be able to distinguish between standard links, file inclusions or categories, list the namespaces of your wiki in the namespaces dict (e.g.: {'Template': 10, 'Category': 14, 'File': 6} where the numbers are the namespace codes used in MW)
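
Putting the example values from the list together, a filled-in sketch might look like this (the wikitext in source is invented for illustration, and the exact HTML produced depends on the postprocessor):

templates = {'my template': 'my template content'}
allowed_tags = ['center', 'big', 'small', 'span']
allowed_self_closing_tags = ['br', 'hr']
allowed_attributes = ['style', 'class']
interwiki = {'fr': 'http://fr.wikipedia.org/wiki/'}
namespaces = {'Template': 10, 'Category': 14, 'File': 6}

# Invented wikitext that exercises a heading, some formatting and a template call
source = "== Title ==\n\nSome '''bold''' text and {{my template}}.\n"

from mediawiki_parser.preprocessor import make_parser
preprocessor = make_parser(templates)

from mediawiki_parser.html import make_parser
parser = make_parser(allowed_tags, allowed_self_closing_tags, allowed_attributes, interwiki, namespaces)

preprocessed_text = preprocessor.parse(source)
output = parser.parse(preprocessed_text.leaves())  # 'output' holds the rendered HTML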

Example for text

To render wikitext as plain text from a Python program, you can use the following lines:

templates = {}

from mediawiki_parser.preprocessor import make_parser
preprocessor = make_parser(templates)

from mediawiki_parser.text import make_parser
parser = make_parser()

preprocessed_text = preprocessor.parse(source)
output = parser.parse(preprocessed_text.leaves())

The output string will contain the rendered text. If the wikitext calls foreign templates, put their names and content in the templates dict (e.g.: {'my template': 'my template content'}).

Example for templates substitution

If you just want to substitute the templates in a given wikitext, you can call the preprocessor alone, without any rendering postprocessor:

templates = {}

from mediawiki_parser.preprocessor import make_parser
preprocessor = make_parser(templates)

output = preprocessor.parse(source)

The output string will contain the resulting wikitext. Put the template names and their content in the templates dict (e.g.: {'my template': 'my template content'}).
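
As a quick illustration (the input wikitext and the template are invented; the exact whitespace handling of the result is not guaranteed here):

templates = {'my template': 'my template content'}

from mediawiki_parser.preprocessor import make_parser
preprocessor = make_parser(templates)

# "{{my template}}" in the input is replaced by its content from the dict
output = preprocessor.parse("Before {{my template}} after.\n")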

Postprocessors

The parser produces an AST. In order to provide human-readable output, three postprocessors are provided:
  • html.py, for HTML output
  • text.py, for text output
  • raw.py, for raw output

For now, we have mainly focused on the HTML postprocessor; the text output might not be as clean as expected.

You can adapt them according to your needs.

Known bugs

This tool should be able to render any wikitext page into text or HTML.

However, it does not intend to be bug-for-bug compatible with MW. For instance, using HTML entities in template calls (e.g.: '{{temp&copy;late}}') is currently not supported.

Please don't hesitate to report bugs that you may find when using this tool.

Special thanks

  • To Nicholas Burlett for his directory restructure, performance improvements and other fixes
