Wiktra is a versatile Unicode transliteration tool that brings the linguistic precision of Wiktionary's community-curated transliteration modules to your command line and Python projects. It allows you to convert text from one writing system (script) to another with a high degree of accuracy.
Project Locations:
- Upstream (slower releases): github.com/kbatsuren/wiktra
- Active Development (Wiktra 2+): github.com/twardoch/wiktra2 (This documentation primarily refers to Wiktra 2+)
At its core, Wiktra transliterates text. This means it converts characters or words from one script (e.g., Cyrillic, Arabic, Devanagari) into another (e.g., Latin script). Unlike simple character-by-character replacement, Wiktra utilizes sophisticated rule-based transliteration modules written in Lua, developed and maintained by linguists and contributors on Wiktionary. These modules understand the nuances of how languages are written, leading to more accurate and contextually appropriate results.
Wiktra provides:
- A command-line interface (CLI) tool:
wiktrapy
- A Python 3 module:
wiktra
Wiktra 1.0 was originally developed by Khuyagbaatar Batsuren. Wiktra 2 was significantly rewritten by Adam Twardoch.
Wiktra is designed for a diverse range of users:
- Linguists and Language Researchers: Access accurate transliterations for a vast number of languages and scripts, aiding in comparative studies, data processing, and phonetic analysis.
- Software Developers: Integrate transliteration capabilities into applications dealing with multilingual text, internationalization (i18n), or natural language processing (NLP).
- Archivists and Librarians: Standardize text from various scripts for cataloging and digital preservation.
- Students and Language Learners: Understand pronunciation and script conversions for different languages.
- Anyone working with multilingual data: Convert text to a common script for easier processing, analysis, or display.
- High-Quality Transliterations: Leverages the extensive, collaboratively maintained Lua-based transliteration modules from Wiktionary, ensuring a high standard of accuracy.
- Broad Language and Script Support: Wiktra 2 supports a vast number of languages and scripts (e.g., 514 languages in 102 scripts in the new API as per original README), covering a significant portion of Wiktionary's transliteration capabilities. Use
wiktrapy --stats
for a current list. - Flexibility: Usable as both a standalone CLI tool for quick conversions and as a Python library for integration into larger projects.
- Offline Capability: Once Wiktionary modules are packaged or updated locally, transliteration can be performed offline.
- Open Source: Licensed under GPLv2, allowing for community contributions and transparency.
Wiktra requires Python 3.9+ and Lua (specifically LuaJIT is recommended for performance with the lupa
bridge).
General Installation (using pip):
The primary way to install Wiktra is via pip:
python3 -m pip install wiktra
This will attempt to install Wiktra and its Python dependencies, including lupa
, which bridges Python and Lua. The lupa
installation might require Lua development headers to be present on your system.
macOS:
For macOS, a convenience script install-mac.sh
(available in the source repository) can help install prerequisites like Lua via Homebrew:
- Download or clone the Wiktra repository.
- Navigate to the repository's root directory in your terminal.
- Run the script:
./install-mac.sh
- Then, install Wiktra using pip (if the script doesn't do it already, or to ensure you have the latest version from PyPI):
# If installing from a local clone after running the script: python3 -m pip install --upgrade . # Or to get the latest from PyPI: python3 -m pip install --upgrade wiktra
Linux (Debian/Ubuntu Example):
You'll need to install Python 3, pip, and Lua development files.
sudo apt update
sudo apt install python3 python3-pip liblua5.1-0-dev luajit
# For lupa, LuaJIT (libluajit-5.1-dev) is often preferred over standard Lua dev packages.
# Depending on your distribution and lupa version, you might need different Lua versions like lua5.3-dev etc.
python3 -m pip install wiktra
Windows:
Installation on Windows can be more complex due to lupa
compilation.
- Install Python 3.9+ (e.g., from python.org). Make sure to add Python to your PATH.
- Installing
lupa
typically requires a C compiler (like Microsoft C++ Build Tools, often installed with Visual Studio) and Lua (e.g., by compiling Lua from source, or using a package manager like Scoop or Chocolatey to install Lua/LuaJIT). - It's often easier if pre-compiled
lupa
wheels are available for your Python version and architecture on PyPI. If not, manual setup of the build environment is necessary. - Once
lupa
can be installed (i.e., its prerequisites are met), Wiktra can be installed via pip:pip install wiktra
Note: The original README mentioned that version 2 had not been working well on Ubuntu and Windows 10 at one point. While efforts are made to ensure cross-platform compatibility, installing lupa
correctly is often the main hurdle. Refer to the lupa
documentation for specific guidance on its installation.
Troubleshooting Installation:
LuaError: module 'wikt.mw' not found
or similar Lua errors: This typically means the Lua runtime cannot find the Wiktionary modules. Wiktra attempts to set theLUA_PATH
environment variable correctly during runtime. If issues persist, it might indicate a problem with howlupa
is locating Lua files or an incomplete installation.lupa
installation issues: These are common. Ensure you have a C compiler and the correct Lua (or LuaJIT) development libraries (headers) installed. Consultlupa
's documentation and open issues for platform-specific advice. Using virtual environments (e.g.,venv
) is highly recommended.
Wiktra offers two main ways to perform transliterations:
The wiktrapy
tool is perfect for quick transliterations or use in shell scripts.
Basic syntax:
wiktrapy [options] -t "Your text here"
# or pipe text into it
echo "Your text here" | wiktrapy [options]
Examples:
-
Automatic language/script detection (transliterates to Latin by default):
wiktrapy -t "Привет" # Expected Output: Privet
echo "नमस्ते" | wiktrapy # Expected Output: namaste
-
Specifying input language and script (for explicit transliteration):
wiktrapy -t "Привет" -l ru -s Cyrl # Expected Output: Privet
Here,
-l ru
specifies Russian and-s Cyrl
specifies Cyrillic script. -
Specifying output script:
# This example assumes a module exists for English (Latn) to Cyrillic (Cyrl) # wiktrapy -t "Hello" -l en -s Latn -o Cyrl
The default output script is
Latn
(Latin). -
Listing supported scripts and orthographies:
wiktrapy --stats
-
Getting help for all options:
wiktrapy -h
For more programmatic control, use the wiktra
Python module. The recommended way is to use the Transliterator
class.
Example (New API - Recommended):
from wiktra.Wiktra import Transliterator
# Create a Transliterator instance
# This is best done once if you're doing multiple transliterations
tr = Transliterator()
# Transliterate text with automatic language/script detection
# (will try to guess input script and use 'und' - undefined language for that script)
text_cyrillic = "Привет мир"
latin_text = tr.tr(text_cyrillic)
print(f"'{text_cyrillic}' -> '{latin_text}'")
# Expected Output: 'Привет мир' -> 'Privet mir'
text_devanagari = "नमस्ते दुनिया"
latin_text_dev = tr.tr(text_devanagari)
print(f"'{text_devanagari}' -> '{latin_text_dev}'")
# Expected Output: 'नमस्ते दुनिया' -> 'namaste duniyaa'
# Explicitly specify language, input script, and output script
text_russian = "Русский текст"
# lang='ru' (Russian), sc='Cyrl' (Cyrillic), to_sc='Latn' (Latin)
transliterated_explicit = tr.tr(text_russian, lang='ru', sc='Cyrl', to_sc='Latn', explicit=True)
print(f"'{text_russian}' (explicit) -> '{transliterated_explicit}'")
# Expected Output: 'Русский текст' (explicit) -> 'Russkij tekst'
# Using the class instance is more efficient for multiple transliterations
# as the Lua runtime and modules are initialized only once.
- If
explicit=True
, you must providelang
(input language code, e.g., ISO 639) andsc
(input script code, e.g., ISO 15924).to_sc
(output script code) defaults toLatn
if not specified. - If
explicit=False
(the default), Wiktra attempts to guess the input script ifsc
is not provided. It then typically assumes an "undefined" (und
) language for that script, unlesslang
is also provided.
Legacy Function (translite
):
A legacy translite
function is also available, primarily for compatibility with older versions of Wiktra or specific use cases that relied on its distinct language code mapping.
from wiktra.Wiktra import translite as tr_legacy
# Example for Mongolian (Cyrillic) using its legacy code 'mon'
mongolian_text = "монгол бичлэг"
transliterated_mongolian = tr_legacy(mongolian_text, 'mon')
print(f"'{mongolian_text}' (legacy) -> '{transliterated_mongolian}'")
# Expected Output: 'монгол бичлэг' (legacy) -> 'mongol bichleg'
It is generally recommended to use the new Transliterator.tr()
method for its more standardized approach to language/script codes and broader capabilities.
Wiktra can update its local collection of Wiktionary transliteration modules using the wiktrapy_update
command:
wiktrapy_update -h # For options
wiktrapy_update
This helps keep your transliterations aligned with the latest rules from Wiktionary.
This section provides a deeper insight into Wiktra's architecture, its core components, and guidelines for coding and contributing.
Wiktra's ability to perform complex transliterations stems from its use of Lua modules sourced directly from Wiktionary, executed within a Python environment.
1. Python-Lua Integration via lupa
:
The core of Wiktra's cross-language functionality is the lupa
library. lupa
provides a bridge between Python and Lua (specifically designed for LuaJIT, but can work with standard Lua), allowing Python code to:
- Initialize and manage a Lua runtime environment.
- Load and execute Lua code, including the Wiktionary transliteration modules.
- Pass data (like the text to be transliterated and language/script parameters) from Python to Lua functions.
- Receive results back from Lua functions into Python.
2. The Transliterator
Class (wiktra/Wiktra.py
):
This is the central class orchestrating transliteration.
- Initialization (
__init__
): When aTransliterator
object is created, it:- Initializes a
LuaRuntime
instance fromlupa
. - Loads the mapping between language/script codes and Wiktionary Lua modules from
wiktra/wikt/data/data.json
. This JSON file is key to identifying the correct Lua module for a given transliteration request. - Programmatically configures the
LUA_PATH
environment variable to ensure Lua can locate the necessary modules within thewiktra/wikt/
directory structure.
- Initializes a
- Transliteration Method (
tr
): This is the primary method for the new API.- Language/Script Handling:
- If
explicit=True
, it directly uses the providedlang
(language),sc
(input script), andto_sc
(output script). - If
explicit=False
(default), it callsauto_script_lang
to deduce the input script and language if they are not fully specified.
- If
auto_script_lang
Method: This helper method determines the script of the input text usingfontTools.unicodedata.ucd.script()
if no script is provided. It then useslangcodes.closest_match()
to find the best matching language/script combination available in Wiktra's supported list (derived fromdata.json
) against the (partially) specified or detected input.- Module Loading and Execution: Based on the determined language, input script, and target output script, it looks up the appropriate Lua module name (e.g.,
ru-translit
,ar-translit
) from themod_map
(loaded fromdata.json
). It then constructs and executes a Lua command likerequire("wikt.translit.MODULE_NAME").tr("text_to_transliterate", "lang_code", "script_code")
. The result from the Lua function is then returned to Python.
- Language/Script Handling:
- Legacy Method (
tr_legacy
): This method supports the oldertranslite
function's interface. It uses an internal, hardcodedlang_map
(defined inWiktra.py
) to map legacy language codes to Wiktionary module names and script codes before invoking the Lua modules.
3. Wiktionary Lua Modules (wiktra/wikt/
):
This directory and its subdirectories contain the Lua code and data sourced from Wiktionary.
- Transliteration Modules (
wiktra/wikt/translit/
): These are individual Lua files (e.g.,ru-translit.lua
,ar-translit.lua
) containing the transliteration rules for specific languages or scripts. Each module typically exposes atr(text, lang_code, script_code)
function that Wiktra calls. - Core Lua Dependencies (
wiktra/wikt/mw/
,wiktra/wikt/mw-*.lua
, etc.): These are supporting Lua modules, also from Wiktionary, providing common functionalities (like Unicode string manipulation viamw.ustring
, message handling, site utilities) that the transliteration modules often depend on.mwInit.lua
likely initializes this MediaWiki-like Lua environment for the modules. - Data Files (
wiktra/wikt/data/
):data.json
: The primary mapping used byTransliterator
to find the correct Lua module for a new API request (maps script -> language -> output_script -> module_info).data.yaml
: A human-readable YAML version of the module mappings, also listing the Wiktionary transliteration modules used. Useful for reference and potentially for generation ofdata.json
.- Other data files (e.g., within
translit/data/
or language-specific data modules) are used directly by the Lua modules themselves. make-data-lang.lua
appears to be a script used in the process of generating or updating language data mappings.
4. CLI Entry Point (wiktra/__main__.py
):
This script powers the wiktrapy
command-line tool.
- It uses Python's
argparse
module to define and parse command-line arguments (e.g.,-t
for text,-l
for language,-s
for script). - It reads input text from the
-t
argument, an input file specified by-i
, or directly fromstdin
if no text source is given. - It instantiates the
Transliterator
class fromwiktra.Wiktra
. - It then calls the
tr()
method of theTransliterator
instance with the processed arguments and prints the transliterated result to standard output. - It also handles the
--stats
option to display a list of supported scripts and orthographies by querying theTransliterator
instance.
5. Module Update Mechanism (wiktra/update.py
and wiktrapy_update
):
Wiktra includes a built-in mechanism to update its local cache of Wiktionary Lua modules and associated data.
- The
wiktrapy_update
console script (which callsmain
inwiktra.update
) manages this process. - This involves fetching the latest versions of the Lua modules and data files from Wiktionary (the exact source and methods are implemented in
update.py
). This functionality is crucial for keeping Wiktra's transliteration capabilities current with the ongoing improvements made by the Wiktionary community.
We welcome contributions to Wiktra! Here are some guidelines:
Coding Style:
- Python: Please adhere to PEP 8. Consider using tools like Flake8 for linting and Black or Ruff for formatting to maintain consistency.
- Lua: Aim for clarity and consistency with the existing Lua code style, which originates from Wiktionary.
Dependencies:
- Python dependencies are listed in
requirements.txt
and declared insetup.py
. - When adding or modifying dependencies, ensure these files are updated. Use of virtual environments (e.g.,
python -m venv .venv
) is strongly recommended for development. - Lua module dependencies are generally bundled within the
wiktra/wikt/
directory.
Testing:
- While a comprehensive, dedicated test suite might not be fully established, testing is vital.
- Contributions, especially those adding new features or fixing bugs, should ideally be accompanied by tests.
- Many Wiktionary Lua modules link to their test cases on Wiktionary (e.g., on pages like
Module:xx-translit/testcases
). These can serve as excellent references for expected behavior. - When modifying transliteration logic or adding support for new languages/scripts, test with a diverse set of inputs, including common words, special characters, and edge cases.
Contribution Process:
- Fork the Repository: Start by forking the active Wiktra repository (see Project Locations above).
- Create a Branch: For your changes, create a new branch in your fork (e.g.,
feature/add-georgian-translit
orfix/unicode-error-arabic
). - Code: Implement your changes, following the coding style guidelines.
- Test: Thoroughly test your modifications.
- Commit: Write clear, descriptive commit messages. A common convention is a short subject line (max 50 characters), a blank line, and then a more detailed explanation if necessary.
- Push: Push your changes to your branch in your forked repository.
- Create a Pull Request: Open a pull request against the main branch of the active Wiktra repository. Clearly describe your changes, the problem they solve, and how you tested them.
Reporting Issues:
- Before submitting a new issue, please check the existing issues on GitHub to see if your problem or suggestion has already been reported.
- If not, create a new issue, providing as much detail as possible:
- Wiktra version (
wiktrapy -V
). - Python version.
- Operating system and version.
- Clear steps to reproduce the bug.
- The exact input text, language/script parameters used.
- The expected behavior versus the actual behavior or error message.
- Wiktra version (
License:
Wiktra is distributed under the GPLv2 (GNU General Public License version 2). All contributions to the project are also expected to be made under this license.
This README aims to be a comprehensive guide for both users and developers of Wiktra. For further details, exploring the source code and the linked Wiktionary resources is encouraged.