A comprehensive toolkit for uploading and managing Tibetan Buddhist texts and resources across Wikimedia platforms.
With Wiki Utils, you can automate uploads to Wikimedia Commons, add rich metadata to Wikidata, and upload text content to Wikisource with full language support for Tibetan texts.
- Overview
- Project Dependencies
- Installation
- Configuration
- Components
- Logging
- Troubleshooting
- Contributing
Wiki Utils is designed for content maintainers and librarians working with Tibetan Buddhist texts who want to make these resources available on Wikimedia platforms. The package provides three main components:
- WikiCommons Batch Upload: Upload PDFs and other media files to Wikimedia Commons with rich metadata, captions, and proper licensing.
- Wikidata BDRC Utilities: Query and retrieve information from Wikidata for Buddhist Digital Resource Center (BDRC) resources.
- Wikisource Text Upload: Upload and manage text content on Wikisource with page-by-page organization.
Before using Wiki Utils, ensure you have:
- Python 3.8 or higher
- pywikibot (core library for interacting with Wikimedia APIs)
- pandas (for CSV data handling)
- requests (for API interactions)
- Clone the repository:
    git clone https://github.com/OpenPecha/wikidata_pipeline.git
    cd wikidata_pipeline
- Create and activate a virtual environment:
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install the required dependencies:
    pip install -U pip
    pip install -e .
    pip install .[dev]
    pre-commit install
Wiki Utils uses pywikibot for authentication and API interactions. You need to configure pywikibot before using the scripts:
- Create a user-config.py file in the project root directory with the following content:
    family = 'commons'
    mylang = 'commons'
    # Configuration for Wikimedia Commons
    usernames['commons']['commons'] = 'YourUsername'
    # Configuration for multilingual Wikisource
    usernames['wikisource']['mul'] = 'YourUsername'
    password_file = "user-password.py"
- Obtain a bot password from Wikimedia:
  - Log in to your Wikimedia account on Wikimedia Commons or Wikisource.
  - Go to the Special:BotPasswords page (e.g., https://commons.wikimedia.org/wiki/Special:BotPasswords).
  - Create a new bot password with an appropriate name (e.g., "WikiUtils").
  - Select the required permissions (minimum required: Edit pages, Upload new files, Upload by URL).
  - Click "Create" and save the generated credentials.
- Create a user-password.py file in the project root directory with your bot password:
    # This is an automatically generated file used to store BotPasswords.
    # See https://www.mediawiki.org/wiki/Manual:Pywikibot/BotPasswords for more information.
    ('YourUsername', BotPassword('BotName', 'your-bot-password-token'))
  Replace YourUsername with your Wikimedia username, BotName with the name you gave your bot, and your-bot-password-token with the token provided by Wikimedia.
- Set appropriate file permissions to protect your credentials:
    chmod 600 user-password.py
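To confirm that the permission change took effect, a short check along these lines may help. It is only a sketch for POSIX systems: the helper name `is_private` is hypothetical and not part of Wiki Utils, and the demonstration runs on a throwaway temporary file rather than on real credentials.

```python
import os
import stat
import tempfile


def is_private(path):
    """Return True if neither group nor others have any permission on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Demonstrate on a temporary file instead of a real user-password.py.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_path = handle.name

os.chmod(demo_path, 0o644)  # readable by group and others
before = is_private(demo_path)
os.chmod(demo_path, 0o600)  # same effect as `chmod 600`
after = is_private(demo_path)
os.remove(demo_path)

print(before, after)
```

In a real checkout you would call `is_private("user-password.py")` from the project root.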
The WikiCommons component allows you to upload images and PDFs to Wikimedia Commons with proper metadata, captions in multiple languages, licensing information, and categories.
Prepare a JSON configuration file with the following structure (example in data/commons_upload_config.json):
[
  {
    "image_path": "/path/to/your/file.pdf",
    "image_title": "TibetanText_Title.pdf",
    "info_template": {
      "description": {
        "bo": "བོད་ཀྱི་རྒྱལ་རབས།",
        "en": "Tibetan Historical Text"
      },
      "date": "2024",
      "source": "BDRC",
      "author": "Traditional"
    },
    "captions": {
      "bo": "བོད་ཀྱི་རྒྱལ་རབས།",
      "en": "Tibetan Historical Text"
    },
    "license_templates": [
      "PD-old-70",
      "PD-US-expired"
    ],
    "categories": [
      "Tibetan manuscripts",
      "Buddhist texts",
      "Tibetan history"
    ]
  }
]
Each entry in the array represents one file to upload with its metadata.
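Before starting a batch, it can be useful to check that every entry carries the expected keys. The sketch below is not part of Wiki Utils: `find_problems` is a hypothetical helper, and treating all six top-level keys from the example as required is an assumption, since the source does not say which fields are optional.

```python
import json

# Assumption: every top-level key shown in the example entry is required.
REQUIRED_KEYS = {
    "image_path",
    "image_title",
    "info_template",
    "captions",
    "license_templates",
    "categories",
}


def find_problems(entries):
    """Return human-readable problems; an empty list means no keys are missing."""
    problems = []
    for index, entry in enumerate(entries):
        missing = sorted(REQUIRED_KEYS - entry.keys())
        if missing:
            problems.append(f"entry {index}: missing keys {missing}")
    return problems


sample = json.loads("""
[
  {"image_path": "/path/to/your/file.pdf",
   "image_title": "TibetanText_Title.pdf",
   "info_template": {"description": {"en": "Tibetan Historical Text"},
                     "date": "2024", "source": "BDRC", "author": "Traditional"},
   "captions": {"en": "Tibetan Historical Text"},
   "license_templates": ["PD-old-70"],
   "categories": ["Tibetan manuscripts"]},
  {"image_path": "/path/to/another.pdf"}
]
""")

problems = find_problems(sample)
for problem in problems:
    print(problem)
```

Here the first entry passes and the second is reported because it only provides `image_path`.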
To upload files to WikiCommons:
python -m wiki_utils.wikicommons.batch_upload
By default, the script looks for the configuration file at data/commons_upload_config.json. You can modify the path in the script if needed.
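To make the role of each configuration field concrete, the following sketch shows one plausible way a configuration entry could be turned into a Commons file description page. It is an illustration only: `build_wikitext` is a hypothetical function, and the actual layout produced by the batch upload script may differ. The `{{Information}}` template, language templates such as `{{en|1=...}}`, and `[[Category:...]]` links are standard Commons conventions. Captions are not included because Commons stores them as structured data rather than in the page wikitext.

```python
def build_wikitext(entry):
    """Assemble a file description page from one configuration entry."""
    info = entry["info_template"]
    # One language template per description, e.g. {{en|1=Some text}}.
    description = "".join(
        "{{" + lang + "|1=" + text + "}}"
        for lang, text in info["description"].items()
    )
    lines = [
        "=={{int:filedesc}}==",
        "{{Information",
        "|description=" + description,
        "|date=" + info["date"],
        "|source=" + info["source"],
        "|author=" + info["author"],
        "}}",
        "",
        "=={{int:license-header}}==",
    ]
    lines += ["{{" + name + "}}" for name in entry["license_templates"]]
    lines.append("")
    lines += ["[[Category:" + name + "]]" for name in entry["categories"]]
    return "\n".join(lines)


entry = {
    "info_template": {
        "description": {"bo": "བོད་ཀྱི་རྒྱལ་རབས།", "en": "Tibetan Historical Text"},
        "date": "2024",
        "source": "BDRC",
        "author": "Traditional",
    },
    "license_templates": ["PD-old-70", "PD-US-expired"],
    "categories": ["Tibetan manuscripts", "Buddhist texts"],
}

wikitext = build_wikitext(entry)
print(wikitext)
```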
The Wikidata component provides utilities for querying and retrieving information from Wikidata related to BDRC (Buddhist Digital Resource Center) resources.
from wiki_utils.wikidata.bdrc_utils import get_wikidata_metadata
# Get metadata for a BDRC work
work_id = "WA0RK0529"
metadata = get_wikidata_metadata(
    work_id,
    language="en",
    properties=["P31", "P4969", "P1476"]
)
print(metadata)
The Wikisource component allows you to upload text content to Wikisource pages, organized by page numbers.
- Text Files: Place your text files in the data/text/ directory. Each file should follow this format:
    Page no: 1
    ལེའུ་དང་པོ། བོད་ཀྱི་སྔ་རབས་ལོ་རྒྱུས། འདི་ནི་བོད་ཀྱི་སྔ་རབས་ཀྱི་ལོ་རྒྱུས་ཡིན།
    Page no: 2
    ལེའུ་གཉིས་པ། ...
- Work List CSV: Create a CSV file (data/work_list.csv) with the following columns:
  - Index: The Wikisource index page title (e.g., "Index:My_Tibetan_Book.pdf")
  - text: Name of the text file in the data/text/ directory
  Example:
    Index,text
    Index:TibetanHistory_Vol1.pdf,tibetan_history_vol1.txt
    Index:TibetanDharma_Vol1.pdf,tibetan_dharma_vol1.txt
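The text file format above can be split into pages with a small parser. This is a minimal sketch, not the parser Wiki Utils ships: `split_pages` is a hypothetical name, and it assumes each "Page no: N" header sits on its own line and that text before the first header can be ignored.

```python
import re

# Assumption: a header is a full line of the form "Page no: <digits>".
PAGE_HEADER = re.compile(r"^Page no:\s*(\d+)\s*$")


def split_pages(text):
    """Return {page_number: page_text} for a 'Page no: N' delimited file."""
    pages = {}
    current = None
    for line in text.splitlines():
        match = PAGE_HEADER.match(line.strip())
        if match:
            current = int(match.group(1))
            pages[current] = []  # a repeated page number replaces the earlier one
        elif current is not None:
            pages[current].append(line)
    return {number: "\n".join(lines).strip() for number, lines in pages.items()}


sample = """Page no: 1
ལེའུ་དང་པོ། བོད་ཀྱི་སྔ་རབས་ལོ་རྒྱུས།
Page no: 2
ལེའུ་གཉིས་པ།
"""

pages = split_pages(sample)
for number, body in pages.items():
    print(number, body)
```

When reading real files, open them with `encoding="utf-8"` so the Tibetan text is decoded correctly.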
To upload texts to Wikisource:
python -m wiki_utils.wikisource.etext_upload
By default, the script looks for the work list at data/work_list.csv. You can modify the path in the script if needed.
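The sketch below illustrates how a work list row relates to individual Wikisource pages. On Wikisource, the pages of a proofread file conventionally live at `Page:<file name>/<page number>`, next to the `Index:<file name>` page. Whether the upload script derives titles this way or looks them up from the index page is not stated in the source, so treat `page_title` as a hypothetical helper. The standard-library `csv` module is used here to keep the example self-contained, although the project itself lists pandas for CSV handling.

```python
import csv
import io


def page_title(index_title, page_number):
    """Map 'Index:Book.pdf' and 3 to 'Page:Book.pdf/3'."""
    prefix = "Index:"
    if not index_title.startswith(prefix):
        raise ValueError(f"not an index title: {index_title!r}")
    return f"Page:{index_title[len(prefix):]}/{page_number}"


# In-memory stand-in for data/work_list.csv.
work_list = io.StringIO(
    "Index,text\n"
    "Index:TibetanHistory_Vol1.pdf,tibetan_history_vol1.txt\n"
    "Index:TibetanDharma_Vol1.pdf,tibetan_dharma_vol1.txt\n"
)

rows = list(csv.DictReader(work_list))
first_pages = [page_title(row["Index"], 1) for row in rows]
print(first_pages)
```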
Wiki Utils implements comprehensive logging mechanisms to track upload activities and results:
- Wikisource Upload Logs:
  - Location: src/wiki_utils/wikisource/upload_log.csv
  - Format: CSV with timestamp, index_title, page_number, page_title, status, and error_message
- Console Output:
  - All scripts provide detailed console output during execution
  - Shows progress, successes, and errors in real time
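As an illustration of the log format, the following sketch writes rows with the six columns listed above. The column names come from this README. The function name `append_log_row`, the status values, and the timestamp format are assumptions, and an in-memory buffer stands in for the real log file.

```python
import csv
import io
from datetime import datetime, timezone

LOG_FIELDS = [
    "timestamp",
    "index_title",
    "page_number",
    "page_title",
    "status",
    "error_message",
]


def append_log_row(handle, index_title, page_number, page_title, status, error_message=""):
    """Append one upload result, writing the header first if the log is empty."""
    writer = csv.DictWriter(handle, fieldnames=LOG_FIELDS)
    if handle.tell() == 0:
        writer.writeheader()
    writer.writerow({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "index_title": index_title,
        "page_number": page_number,
        "page_title": page_title,
        "status": status,
        "error_message": error_message,
    })


buffer = io.StringIO()  # stand-in for upload_log.csv
append_log_row(buffer, "Index:TibetanHistory_Vol1.pdf", 1,
               "Page:TibetanHistory_Vol1.pdf/1", "success")
append_log_row(buffer, "Index:TibetanHistory_Vol1.pdf", 2,
               "Page:TibetanHistory_Vol1.pdf/2", "failed", "page not found")

logged = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(len(logged), logged[1]["status"])
```

A log in this shape can be filtered by `status` to find the pages that need to be retried.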
| Issue | Solution |
| --- | --- |
| Authentication failures | Ensure pywikibot is correctly configured with valid credentials. Run python -m pywikibot login to reset and update credentials. |
| File upload errors | Check that your file paths in the JSON configuration are correct and absolute. Ensure file formats are supported by Commons (PDF, JPG, PNG, etc.). |
| Text parsing issues | Verify text files follow the correct format with "Page no: X" headers. Check encoding is UTF-8. |
| API throttling/limits | Reduce batch size or add delays between uploads if hitting API rate limits. |
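For the throttling case, one common way to add delays between uploads is to retry with exponential backoff. The sketch below is generic and not taken from Wiki Utils: `with_retries` and `flaky_upload` are hypothetical, and `RuntimeError` is only a stand-in for whatever exception the real upload call raises when it is rate limited.

```python
import time


def with_retries(action, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call action(); on RuntimeError wait base_delay * 2**n seconds, then retry."""
    for attempt in range(attempts):
        try:
            return action()
        except RuntimeError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky_upload():
    """Stand-in upload that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limited")
    return "uploaded"


delays = []  # record the waits instead of actually sleeping
result = with_retries(flaky_upload, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the demonstration instant. In real use the default `time.sleep` would apply the waits.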
Wikisource text upload creates and uses cache files in the cache/ directory to optimize page lookups. If you encounter stale data issues, delete the cache files to force fresh fetching.
Contributions to Wiki Utils are welcome! Please feel free to submit a Pull Request.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
If you'd like to help out, check out our contributing guidelines.
For more information:
- File an issue.
- Email us at openpecha[at]gmail.com.
- Join our Discord.