wikt2pron

A Wiktionary Pronunciation Collector

Wikt2pron is a Python toolkit that converts pronunciations in the enwiktionary XML dump to cmudict format. It currently supports the IPA and X-SAMPA formats. The project was developed during GSoC 2017 with the CMU Sphinx community.

Collected pronunciation dictionaries and related example models can be downloaded at Dropbox.

Requirements

wikt2pron requires:

Installation

# download the latest version
$ git clone https://github.com/abuccts/wikt2pron.git
$ cd wikt2pron

# install and run the tests
$ python setup.py install
$ python setup.py -q test

# build the HTML documentation
$ make -C docs html

Usage

Extract pronunciation from Wiktionary XML dump

First, create an instance of the Wiktionary class:

>>> from pywiktionary import Wiktionary
>>> wikt = Wiktionary(XSAMPA=True)

Use the example XML dump in [pywiktionary/data]:

>>> dump_file = "pywiktionary/data/enwiktionary-test-pages-articles-multistream.xml"
>>> pron = wikt.extract_IPA(dump_file)

Here's the extracted result:

>>> from pprint import pprint
>>> pprint(pron)
[{'id': 16,
  'pronunciation': {'English': [{'IPA': '/ˈdɪkʃ(ə)n(ə)ɹɪ/',
                                 'X-SAMPA': '/"dIkS(@)n(@)r\\I/',
                                 'lang': 'en'},
                                {'IPA': '/ˈdɪkʃənɛɹi/',
                                 'X-SAMPA': '/"dIkS@nEr\\i/',
                                 'lang': 'en'}]},
  'title': 'dictionary'},
 {'id': 65195,
  'pronunciation': {'English': 'IPA not found.'},
  'title': 'battleship'},
 {'id': 39478,
  'pronunciation': {'English': [{'IPA': '/ˈmɜːdə(ɹ)/',
                                 'X-SAMPA': '/"m3:d@(r\\)/',
                                 'lang': 'en'},
                                {'IPA': '/ˈmɝ.dɚ/',
                                 'X-SAMPA': '/"m3`.d@`/',
                                 'lang': 'en'}]},
  'title': 'murder'},
 {'id': 80141,
  'pronunciation': {'English': [{'IPA': '/ˈdæzəl/',
                                 'X-SAMPA': '/"d{z@l/',
                                 'lang': 'en'}]},
  'title': 'dazzle'}]
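
The result mixes two shapes: a language maps either to a list of pronunciations or to the string 'IPA not found.'. As a minimal sketch of how one might flatten it, the helper below (iter_pronunciations is a hypothetical name, not part of pywiktionary) yields one row per pronunciation and skips the languages without one. It assumes only the structure shown above.

```python
def iter_pronunciations(entries):
    """Yield (title, language, IPA, X-SAMPA) for every pronunciation found."""
    for entry in entries:
        for language, prons in entry["pronunciation"].items():
            # Languages without a pronunciation hold a message string, not a list.
            if not isinstance(prons, list):
                continue
            for pron in prons:
                # X-SAMPA is read with .get() in case it is absent from an entry.
                yield entry["title"], language, pron["IPA"], pron.get("X-SAMPA")


# Sample data copied from the result shown above (shortened to two entries).
sample = [
    {"id": 65195,
     "pronunciation": {"English": "IPA not found."},
     "title": "battleship"},
    {"id": 80141,
     "pronunciation": {"English": [{"IPA": "/ˈdæzəl/",
                                    "X-SAMPA": '/"d{z@l/',
                                    "lang": "en"}]},
     "title": "dazzle"},
]

rows = list(iter_pronunciations(sample))
print(rows)
```

Only "dazzle" produces a row here, because "battleship" has no pronunciation list.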

Lookup pronunciation for a word

First, create an instance of the Wiktionary class:

>>> from pywiktionary import Wiktionary
>>> wikt = Wiktionary(XSAMPA=True)

Look up a word using the lookup method:

>>> word = wikt.lookup("present")

The entry for the word "present" is at https://en.wiktionary.org/wiki/present, and here is the lookup result:

>>> from pprint import pprint
>>> pprint(word)
{'Catalan': 'IPA not found.',
 'Danish': [{'IPA': '/prɛsanɡ/', 'X-SAMPA': '/prEsang/', 'lang': 'da'},
            {'IPA': '[pʰʁ̥ɛˈsɑŋ]', 'X-SAMPA': '[p_hR_0E"sAN]', 'lang': 'da'}],
 'English': [{'IPA': '/ˈpɹɛzənt/', 'X-SAMPA': '/"pr\\Ez@nt/', 'lang': 'en'},
             {'IPA': '/pɹɪˈzɛnt/', 'X-SAMPA': '/pr\\I"zEnt/', 'lang': 'en'},
             {'IPA': '/pɹəˈzɛnt/', 'X-SAMPA': '/pr\\@"zEnt/', 'lang': 'en'}],
 'Ladin': 'IPA not found.',
 'Middle French': 'IPA not found.',
 'Old French': 'IPA not found.',
 'Swedish': [{'IPA': '/preˈsent/', 'X-SAMPA': '/pre"sent/', 'lang': 'sv'}]}

To look up a word in a specific language, specify the lang parameter:

>>> wikt = Wiktionary(lang="English", XSAMPA=True)
>>> word = wikt.lookup("read")
>>> pprint(word)
[{'IPA': '/ɹiːd/', 'X-SAMPA': '/r\\i:d/', 'lang': 'en'},
 {'IPA': '/ɹɛd/', 'X-SAMPA': '/r\\Ed/', 'lang': 'en'}]
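
The two calls above return different shapes: a dict keyed by language name without lang, and a list of pronunciations with it. The following sketch shows one way to read IPA strings from either shape. The helper name ipa_list is hypothetical, and treating any non-list value as "nothing found" is an assumption based on the 'IPA not found.' strings shown above.

```python
def ipa_list(result, language=None):
    """Return the IPA strings in a lookup result, or [] if none were found."""
    # Without a lang filter the result is a dict keyed by language name.
    if isinstance(result, dict):
        result = result.get(language, [])
    # A plain string such as 'IPA not found.' means there is nothing to return.
    if not isinstance(result, list):
        return []
    return [pron["IPA"] for pron in result]


# Sample data copied from the lookup results shown above (shortened).
all_langs = {
    "Catalan": "IPA not found.",
    "Swedish": [{"IPA": "/preˈsent/", "X-SAMPA": '/pre"sent/', "lang": "sv"}],
}
english_only = [
    {"IPA": "/ɹiːd/", "X-SAMPA": "/r\\i:d/", "lang": "en"},
    {"IPA": "/ɹɛd/", "X-SAMPA": "/r\\Ed/", "lang": "en"},
]

print(ipa_list(all_langs, "Swedish"))
print(ipa_list(all_langs, "Catalan"))
print(ipa_list(english_only))
```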

IPA -> X-SAMPA conversion

>>> from pywiktionary import IPA
>>> IPA_text = "/t͡ʃeɪnd͡ʒ/" # en: [[change]]
>>> XSAMPA_text = IPA.IPA_to_XSAMPA(IPA_text)
>>> XSAMPA_text
"/t__SeInd__Z/"
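
To illustrate the idea behind such a conversion, here is a toy character-by-character mapping. It is not the pywiktionary implementation: the table is a tiny hand-picked subset of the IPA to X-SAMPA correspondences that appear in the examples above, and the names TOY_IPA_TO_XSAMPA and toy_ipa_to_xsampa are hypothetical.

```python
# Tiny hand-picked subset of the IPA -> X-SAMPA table; a full table is far larger.
TOY_IPA_TO_XSAMPA = {
    "ˈ": '"',    # primary stress
    "ː": ":",    # length mark
    "ɹ": "r\\",  # alveolar approximant
    "ə": "@",
    "ɛ": "E",
    "æ": "{",
    "ɪ": "I",
    "ʃ": "S",
}


def toy_ipa_to_xsampa(ipa_text):
    """Map each character through the toy table; unknown characters pass through."""
    return "".join(TOY_IPA_TO_XSAMPA.get(ch, ch) for ch in ipa_text)


print(toy_ipa_to_xsampa("/ˈdæzəl/"))
print(toy_ipa_to_xsampa("/ɹiːd/"))
```

A per-character lookup like this cannot handle symbols that span several code points, such as tie bars or combining diacritics, so a complete converter would presumably need to match longer sequences first.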

Citation

If you use wikt2pron in your research and want to cite it, please use the following BibTeX:

@misc{xiong2017wikt2pron,
  title={Wikt2pron: A Wiktionary Pronunciation Collector},
  author={Xiong, Yifan},
  howpublished={\url{https://github.com/abuccts/wikt2pron}},
  year={2017}
}
