End‑to‑end tools to translate Japanese PPTX and DOCX to English while preserving layout and style. Supports fully local/manual workflows (no API) and online model‑assisted runs, with caching, audits, and formatting safeguards.
Quick links:
- Architecture: docs/ARCHITECTURE.md
- Repository Structure: docs/REPO_STRUCTURE.md
- Proposed Restructure: docs/RESTRUCTURE_PLAN.md
- Style Guide (Gengo-aligned): STYLE_GUIDE.md (or set STYLE_GUIDE_FILE)
- Glossary: glossary.json
Common tasks:
- PPTX (offline with cache + formatting):
python scripts/translate_pptx_inplace.py --in inputs/demo.pptx --out outputs/demo_en.pptx --offline --glossary glossary.json
- PPTX (apply cache only, no API):
python scripts/apply_cache_only.py --in inputs/demo.pptx --out outputs/demo_en.pptx --cache translation_cache.json
- DOCX (manual/local), prepare:
  python scripts/manual_docx_translation.py prepare --input inputs/source.docx --template translations/source_template.json
- DOCX (manual/local), apply:
  python scripts/manual_docx_translation.py apply --input inputs/source.docx --translations translations/source_translations.json --output outputs/source_en.docx
---
A production-ready translation system for converting Japanese PowerPoint presentations to English while preserving layout, formatting, and visual elements.
📊 Project Status Summary (Today)
- Stopped word-splitting & newline bugs: Replaced `set_para_text` with a word-aware version that inserts `<a:br/>` correctly and never cuts words mid-run.
- Hardened JSON handling: Replaced fragile bracket-counting with `JSONDecoder.raw_decode`, multi-strategy extraction, batch splitting on failure, and a clampable auto-batch (`--max-array-items`) so chatty outputs don't kill runs.
- Killed identity/non-translated cache entries: Added `scrub_cache.py` and improved the JP-count audit to ignore punctuation (e.g., ・), reducing false positives.
- Concurrency & batching tuned: Demonstrated stable settings (`--concurrency` 4–8, small auto-batches) and removed accidental full re-runs (`--fresh`).
- Style: mechanics only (not voice):
  - Added `style_mechanics_normalize.py` and a stronger `style_autofix_from_report.py` to fix ASCII/full-width, dashes, %/¥ spacing, units, stray punctuation, ellipses, and bullet punctuation, without altering tone.
  - Added a summarizer to see which rules the checker complains about most.
- Scope-correct residuals: New `audit_translated_only.py` counts JP only in translated EN text; ignores SmartArt/charts/images by design.
- Overflow solved at the XML layer: Introduced slide-safe layout knobs:
  - `<a:normAutofit>` (shrink to fit),
  - tighter `<a:bodyPr>` insets (optional),
  - normalized `<a:lnSpc>` (line spacing).
  - Exposed flags: `--autofit-mode {norm,shape,none}`, `--font-scale-min`, `--line-spacing-pct`, `--tight-margins`.
- ET warning future-proofed: Replaced `r.find(...) or ET.SubElement(...)` with an explicit `if t is None: …`.
- Cache refinement pass: Normalized punctuation (NFKC), ranges (–), yen/percent formatting, time ranges, pluralization, and consistent webinar terminology (attendee, registrant, operations, etc.). Curated overrides for key headlines and Majisemi terms.
- Title consistency: Restored Title Case on headings only (acronym-aware, hyphen-aware), leaving bullets/body untouched.
- Runs: Full online translations (4o/4o-mini) with batch split & retries → cache filled; offline apply to preserve layout.
- Artifacts:
  - `translation_cache.refined.json`: mechanics/terminology upgrades.
  - `translation_cache.retitled.json`: refined + Title Case for headings.
  - `cache_diff.csv`: JP | old EN | new EN.
  - Updated PPTX outputs (e.g., `outputs/styled_offline.pptx`).
- Formatting: Fixed via autofit + margins + line spacing; only a short manual pass needed where English still pushes boundaries.
- Style checker: Mechanical issues substantially reduced; remaining flags are mostly preference/tone, or out-of-scope artifacts if the old audit is used.
- Cache: Ready to use:
  cp translation_cache.retitled.json translation_cache.json
  python3 scripts/translate_pptx_inplace.py --offline \
    --in inputs/68b42f175c652_f711fcda865b11f0b6cecace4a312dcf.pptx \
    --out outputs/final_retitled.pptx
- Bake title-case at write-time: Only for Title/CenteredTitle placeholders.
- Use the translated-only audit in CI: Drop legacy residual counters that include non-text artifacts.
- Set sensible defaults: `--autofit-mode norm --font-scale-min 90000 --line-spacing-pct 100000 --tight-margins`.
- Clamp batching: Respect `--max-array-items` min=6 to reduce JSON hiccups on short blurbs.
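The hardened JSON handling above can be sketched as follows. The function name and fallback order are illustrative, not the repo's exact code; the key idea is that `JSONDecoder.raw_decode` tolerates trailing prose after a valid array, which bracket-counting cannot.

```python
import json

def extract_json_array(raw: str) -> list:
    """Pull the first JSON array out of a chatty model response.

    Strategy 1: parse the whole string. Strategy 2: scan for '[' and use
    JSONDecoder.raw_decode, which ignores any prose after the array.
    """
    decoder = json.JSONDecoder()
    try:
        value = json.loads(raw)
        if isinstance(value, list):
            return value
    except json.JSONDecodeError:
        pass
    for i, ch in enumerate(raw):
        if ch == "[":
            try:
                value, _end = decoder.raw_decode(raw, i)
                if isinstance(value, list):
                    return value
            except json.JSONDecodeError:
                continue
    raise ValueError("no JSON array found in model output")
```

For example, `extract_json_array('Sure! ["a", "b"] Hope this helps.')` recovers `["a", "b"]` even though the raw string is not valid JSON.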
🎯 Next Steps & Roadmap (Zero-Touch, GPT-5, More Formats)
Here's a tight, no-nonsense plan to make this zero-touch, faster, and broader.
Goal: User drops a file in Drive → system detects → translates → uploads finished pack to Drive (final PPTX/Doc + bilingual CSV + audit) → optional Slack/email ping.
- Folder contract (no UI needed)
  - Drive:/TranslationInbox/ (incoming, read-only to users)
  - Drive:/TranslationOut/ (deliverables)
  - Drive:/TranslationArchive/ (originals + logs)
- Detection
  - Simplest & robust: GitHub Actions (cron every 2–5 min) + Drive Changes API using a stored `startPageToken`.
  - Keep a `jobs/STATE.json` in the repo (or Redis) with processed file IDs to avoid duplicates.
- Job manifest
  - On new file: create `job_<fileId>.json` with:
    {"fileId":"...", "name":"...", "mime":"application/vnd.openxmlformats-officedocument.presentationml.presentation", "created":"...", "status":"QUEUED", "model":"gpt-4o-mini", "style":"gengo"}
  - Status transitions: QUEUED → EXTRACTING → TRANSLATING → QA → DELIVERED (or FAILED, with reason).
- Processing runner (idempotent)
  - Extract → Batch translate (cache-first, slide/block-level) → Autofit & layout pass → Style/autofix → Translated-only audit → Package.
  - Upload results to Drive `TranslationOut/` with a suffix, `originalName.en-US.[timestamp].pptx`, plus CSV/JSON.
  - Always move the source to `TranslationArchive/` and attach a `job.log`.
- Notifications
  - Optional: Email via Gmail API or a Slack webhook with links to Drive outputs + a small summary (residual=0, changed=N, cost estimate).
- Observability
  - Write compact metrics per job: `tokens_in/out`, `cache_hit_rate`, `api_errors`, total duration.
  - (Optional) OpenAI Webhooks: add a `/webhook` endpoint (tiny Flask/Cloud Run) to receive translate-batch updates; mirror to the job manifest.
Definition of Done
- Drop a PPTX in `TranslationInbox/` → see final artifacts in `TranslationOut/` with residual JP=0 (translated-only audit) and a Drive comment or Slack ping.
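The job manifest and status transitions above can be sketched as a small state machine. The field names mirror the manifest example; the helper names (`new_job`, `advance`) and the rule that FAILED is reachable from any active state are illustrative assumptions.

```python
import time

# Legal moves in the QUEUED -> ... -> DELIVERED lifecycle; FAILED is
# reachable from every active state (assumption, not spec).
TRANSITIONS = {
    "QUEUED": {"EXTRACTING", "FAILED"},
    "EXTRACTING": {"TRANSLATING", "FAILED"},
    "TRANSLATING": {"QA", "FAILED"},
    "QA": {"DELIVERED", "FAILED"},
}

def new_job(file_id, name, mime, model="gpt-4o-mini"):
    """Build a manifest dict shaped like job_<fileId>.json above."""
    return {
        "fileId": file_id,
        "name": name,
        "mime": mime,
        "created": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "status": "QUEUED",
        "model": model,
        "style": "gengo",
    }

def advance(job, new_status, reason=""):
    """Move a job forward, rejecting illegal transitions."""
    if new_status not in TRANSITIONS.get(job["status"], set()):
        raise ValueError(f"illegal transition {job['status']} -> {new_status}")
    job["status"] = new_status
    if reason:
        job["reason"] = reason
    return job
```

Keeping the transition table explicit makes the runner idempotent-friendly: re-processing a DELIVERED job simply has no legal next state.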
Problem you saw: `response_format` caused failures; client/model mismatch.
- Adapter layer
  - Add `llm_adapter.py` with a single `translate_batch(items, model, sys, temp)` that:
    - Uses `/chat/completions` for 4o/4o-mini/4.1; no `response_format`.
    - If `model.startswith("gpt-5")`, chooses the correct endpoint & params (no unsupported args).
    - Always wraps prompts with a strict JSON-array-only instruction and validates with `JSONDecoder.raw_decode` (you already added).
  - Feature flag: `--primary-model gpt-5` with a fallback chain (5 → 4.1 → 4o → 4o-mini) on capability/HTTP errors.
- Compatibility switch
  - Centralize all OpenAI kwargs in one place; forbid stray params.
  - Add `--dry-run` to print composed payloads for a single batch to verify.
- Resilience
  - Per-batch retries with backoff; if still chatty → auto-split (`--on-batch-fail split`, already there).
  - Cost guard: `--max-output-tokens` and a per-job token budget; abort gracefully if exceeded.
- Tests
- Golden tests: 10 representative JP lines → verify strict JSON array parse and stable output across models.
DoD
- `--primary-model gpt-5` runs end-to-end; if not available, auto-fallback without failing the job.
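The fallback chain can be sketched as below. This is a minimal sketch: `call_model` is an injected stand-in for the real per-model API call, and the requirement that the output list match the input length is an assumed sanity check, not confirmed repo behavior.

```python
FALLBACK_CHAIN = ["gpt-5", "gpt-4.1", "gpt-4o", "gpt-4o-mini"]

def translate_batch(items, primary_model, call_model):
    """Try the primary model, then walk the rest of the chain.

    call_model(items, model) performs one API call and returns a parsed
    list; capability/HTTP errors surface as exceptions and trigger the
    next model in the chain.
    """
    start = FALLBACK_CHAIN.index(primary_model) if primary_model in FALLBACK_CHAIN else 0
    last_err = None
    for model in FALLBACK_CHAIN[start:]:
        try:
            out = call_model(items, model)
            # Sanity check: one translation per input item.
            if isinstance(out, list) and len(out) == len(items):
                return model, out
        except Exception as err:
            last_err = err
    raise RuntimeError(f"all models failed: {last_err}")
```

Returning the model actually used lets the job manifest record which rung of the chain served the batch.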
Unify on a "Document Abstraction Layer" (DAL)
- Common interfaces
  class Extractor:
      def extract(self, path) -> list[Block]  # Block = {id, kind, meta, jp_text}

  class BackProjector:
      def apply(self, path_in, path_out, translations: dict[id, en_text]) -> None
- Handlers (start with easiest)
  - DOCX: `python-docx`. Extract paragraphs, headings, tables (cell text). Back-project by run order; preserve styles.
  - Markdown / TXT: trivial; line/block based.
  - XLSX: `openpyxl`. Translate values only (skip formulas). Keep data types; don't touch numbers/dates.
  - SRT/VTT: segment by cue; preserve timestamps.
  - PDF (text-only first): `pdfminer.six` for text; back-project as a bilingual PDF, or export to DOCX then reassemble (full layout-faithful PDF is a separate project; defer).
  - Google Docs/Slides: fetch via Drive export (DOCX/PPTX) and reuse the above; native API mapping optional later.
- Re-use your core
  - Same batch translator, cache, glossary, style autofix, and translated-only audit.
  - Same autofit concept where applicable:
    - DOCX: allow "Automatically adjust right indent when grid is defined"; tighten spacing; avoid font size drops below a floor.
    - XLSX: enable wrap, column autosize (optional).
DoD
- Drop a DOCX/XLSX/TXT → get a translated file + bilingual CSV in `TranslationOut/`.
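The DAL interfaces can be made concrete with the "trivial" TXT handler. This sketch swaps the path-based signatures for string-in/string-out to stay self-contained; the class names and block-id scheme are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    id: str
    kind: str          # "line", "para", "cell", ...
    jp_text: str
    meta: dict = field(default_factory=dict)

class TxtExtractor:
    """Simplest handler: one Block per non-empty line."""
    def extract(self, text: str) -> list:
        return [Block(id=f"b{i}", kind="line", jp_text=line)
                for i, line in enumerate(l for l in text.splitlines() if l.strip())]

class TxtBackProjector:
    """Re-emit the source with translated lines substituted by block id."""
    def apply(self, text: str, translations: dict) -> str:
        out, i = [], 0
        for line in text.splitlines():
            if line.strip():
                out.append(translations.get(f"b{i}", line))
                i += 1
            else:
                out.append(line)   # preserve blank lines (layout)
        return "\n".join(out)
```

Because the extractor and back-projector share the id scheme, the same cache-first batch translator can sit between them untouched, which is the whole point of the DAL.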
- One command for everything: `tt submit <local_file>` → uploads to `TranslationInbox/` and pings the runner.
- Cache across projects: move the cache to a shared KV (SQLite/Redis) with normalized JP keys (NFKC + whitespace fold) and optional fuzzy matching (rapidfuzz) at ≥0.96 similarity.
- Defaults baked in: `--autofit-mode norm --font-scale-min 90000 --line-spacing-pct 100000 --style-preset gengo`.
- Strict gates: Use the translated-only audit in CI; fail if residual > 0; warn on style mechanics only.
- Cost estimator: quick preflight on extracted blocks: estimated tokens × model price → attach to the job manifest and Slack ping.
- Jobs + Drive poller (GH Action + small script)
  - `scripts/drive_poller.py` (changes.list → manifest → enqueue).
  - Action workflow `on: schedule:` runs every few minutes; uses SA creds; posts status.
- LLM adapter + fallback
  - `llm_adapter.py` with an endpoint/param matrix; add `--primary-model` and a fallback list.
- DAL + DOCX handler
  - `extract_docx.py` / `apply_docx.py`; wire into a `translate_any.py` driver (detect by MIME/extension).
- Notifier
  - Simple Slack webhook or Gmail email with Drive links and cost/timing.
- Cache sharing
  - `cache_store.py` with a SQLite file `cache.db` (table: jp_norm TEXT PK, en TEXT, ts INT, src TEXT).
- Defaults + flags cleanup
  - Config file `.translationrc` (YAML/JSON) for model, style, autofit defaults; the CLI reads it so you don't have to pass flags.
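A minimal sketch of `cache_store.py` over the schema proposed above (the class and method names are illustrative; only the table shape comes from this plan):

```python
import sqlite3
import time

class CacheStore:
    """SQLite-backed shared translation cache: jp_norm -> en."""

    def __init__(self, path: str = "cache.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "jp_norm TEXT PRIMARY KEY, en TEXT, ts INT, src TEXT)"
        )

    def put(self, jp_norm: str, en: str, src: str = "manual") -> None:
        # INSERT OR REPLACE keeps the newest translation per key.
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?)",
            (jp_norm, en, int(time.time()), src),
        )
        self.db.commit()

    def get(self, jp_norm: str):
        row = self.db.execute(
            "SELECT en FROM cache WHERE jp_norm = ?", (jp_norm,)
        ).fetchone()
        return row[0] if row else None
```

Pointing `path` at `":memory:"` gives a throwaway store for tests, while a shared `cache.db` file gives the cross-project reuse described above.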
If you want, I can turn this into three small PRs: (1) Drive poller + job runner, (2) LLM adapter + GPT-5 fallback, (3) DAL with DOCX handler.
- Smart batch sizing: Auto-optimizes API requests per model
- Comprehensive logging: Real-time progress with ETA estimates
- Robust error handling: Auto-retry with intelligent backoff
- Layout preservation: Maintains original formatting and design
- Style consistency: Unified tone and terminology across slides
- Content-aware processing: Handles titles, bullets, tables differently
- Expansion management: Prevents text overflow with smart compression
- Glossary integration: Ensures consistent translation of key terms
- Translation caching: Avoids re-translating identical content
- Bilingual output: CSV mapping for quality assurance
- Performance metrics: Detailed audit reports and statistics
- Webhook integration: Real-time progress tracking (optional)
export OPENAI_API_KEY=your_key_here

# Production presets (recommended)
python scripts/translate_pptx_inplace.py \
--in input.pptx \
--out output_en.pptx \
--model gpt-4o-2024-08-06
# Cost-optimized option
python scripts/translate_pptx_inplace.py \
--in input.pptx \
--out output_en.pptx \
--model gpt-4o-mini

| Preset | Model | Batch Size | Use Case |
|---|---|---|---|
| Conservative | gpt-4o-2024-08-06 | 8-12 (auto) | Maximum reliability |
| Balanced | gpt-4o-2024-08-06 | 10-14 (auto) | Recommended |
| Cost-lean | gpt-4o-mini | 12-16 (auto) | Good quality, lower cost |
Batch sizes are automatically calculated based on content complexity and token limits.
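The auto-calculation can be sketched as a token-budget heuristic. The 10k/8k token targets come from the performance notes later in this README; `estimate_tokens` is a deliberately crude stand-in (the real sizing also weighs retry history and content complexity), and all names here are illustrative.

```python
# Per-model output-token targets (from the performance notes).
MODEL_TOKEN_TARGETS = {"gpt-4o-2024-08-06": 10_000, "gpt-4o-mini": 8_000}

def estimate_tokens(text: str) -> int:
    # Rough JA-heavy heuristic: ~1 token per character.
    return max(1, len(text))

def auto_batch_size(texts, model, lo=8, hi=16):
    """Pick a batch size so a batch stays under the model's token target."""
    target = MODEL_TOKEN_TARGETS.get(model, 8_000)
    avg = sum(estimate_tokens(t) for t in texts) / max(1, len(texts))
    per_item = avg * 3  # prompt overhead + JP source + EN output, rough
    size = int(target // max(1.0, per_item))
    return max(lo, min(hi, size))  # clamp into the preset range
```

Short bullet-like texts land at the top of the clamp range; long table cells push the size down to the floor, matching the "complex content: use the lower end" guidance below.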
python scripts/translate_pptx_inplace.py [OPTIONS]
Required:
--in INPUT.pptx Input PowerPoint file
--out OUTPUT.pptx Output translated file
Optional:
--model MODEL AI model (default: auto-optimized)
--batch N Batch size (default: auto-calculated)
--cache FILE Translation cache (default: translation_cache.json)
--glossary FILE Terminology glossary (default: glossary.json)
--slides RANGE Process specific slides (e.g., "1-10")
--style-preset PRESET Style guide preset (gengo, minimal)

├── scripts/
│ ├── translate_pptx_inplace.py # Main translation engine
│ ├── style_checker.py # Style consistency system
│ ├── eta.py # Progress estimation
│ ├── webhook_server.py # Real-time progress tracking
│ └── audit_style.py # Quality analysis
├── tools/
│ ├── derive_deck_tone.py # Tone analysis
│ └── estimate_cost.py # Cost estimation
├── inputs/ # Source presentations
├── outputs/ # Translated results
└── data/ # Glossaries and configs
Create glossary.json for consistent terminology:
{
"株式会社": "Corporation",
"取締役": "Director",
"戦略": "Strategy"
}

Configure tone and style preferences:
{
"formality": "business_formal",
"technical_terms": "preserve_english",
"bullet_style": "concise_fragments"
}

Run the webhook server for real-time updates:
# Terminal 1: Start webhook server
uvicorn scripts.webhook_server:app --port 8000
# Terminal 2: Run translation
python scripts/translate_pptx_inplace.py --in input.pptx --out output.pptx

Each translation run generates:
| File | Description |
|---|---|
| `output_en.pptx` | Translated presentation |
| `bilingual.csv` | Side-by-side translation mapping |
| `audit.json` | Translation statistics and metrics |
| `translation_cache.json` | Cached translations for efficiency |
| `translation.log` | Detailed execution log |
- Token-aware sizing: Calculates optimal batch sizes based on model limits
- Dynamic adjustment: Reduces batch size automatically on high retry rates
- Content analysis: Adjusts for complex content (tables, technical text)
- Multi-stage processing: Pre-translation normalization → Translation → Post-processing
- Authority corrections: Deterministic style fixes based on diagnostics
- Tone preservation: Maintains consistent voice across the document
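The pre-translation normalization stage can be sketched as a mechanics-only pass. The rule set below is a small illustrative subset (full-width folding, % and ¥ spacing, ellipses), not the shipped checker's rules.

```python
import re
import unicodedata

def normalize_mechanics(text: str) -> str:
    """Mechanics-only fixes; never touches wording or tone."""
    out = unicodedata.normalize("NFKC", text)   # full-width -> ASCII, … -> ...
    out = re.sub(r"\s+%", "%", out)             # "50 %" -> "50%"
    out = re.sub(r"¥\s+(\d)", r"¥\1", out)      # "¥ 100" -> "¥100"
    return re.sub(r"\s{2,}", " ", out).strip()  # collapse doubled spaces
```

Because every rule is a deterministic string transform, the same pass can run before translation (to stabilize cache keys) and after (as an authority correction), with identical results.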
- Progressive backoff: 1s, 2s, 3s delays on retries
- Graceful degradation: Falls back to smaller batches on failures
- Cache recovery: Preserves work through interruptions
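The progressive 1s/2s/3s backoff can be sketched as a small wrapper; `with_retries` is an illustrative helper, not the script's actual function.

```python
import time

def with_retries(fn, attempts=4, delays=(1, 2, 3)):
    """Call fn(); on failure sleep 1s, 2s, 3s between retries, then re-raise."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            time.sleep(delays[min(attempt, len(delays) - 1)])
```

Pairing this with the auto-split-on-failure flag means a chatty batch gets both more time and, eventually, smaller payloads.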
- gpt-4o models: 8-14 items (10k token target)
- gpt-4o-mini: 12-18 items (8k token target)
- Complex content: Use lower end of ranges
- Simple text: Can use higher batch sizes
- Cache efficiency: ~90% cache hit rate on re-runs
- Model selection: gpt-4o-mini offers 10x cost savings
- Batch optimization: Reduces API call overhead
High retry rates (>5%)
- System automatically reduces batch size
- Check API key limits and quotas
- Consider using gpt-4o-mini for better stability
Text overflow in slides
- Enable PowerPoint's "Shrink text on overflow"
- Use style presets for more concise translations
- Adjust font sizes manually if needed
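"Shrink text on overflow" corresponds to the `<a:normAutofit>` knob used by the XML-layer fix described earlier. A minimal sketch with stdlib ElementTree follows; the function name and defaults are illustrative, and the values use DrawingML's 1/1000-percent units (90000 = 90%, matching `--font-scale-min 90000`).

```python
import xml.etree.ElementTree as ET

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
ET.register_namespace("a", A_NS)

def apply_norm_autofit(body_pr, font_scale=90000, lnspc_reduction=10000):
    """Set <a:normAutofit> on an <a:bodyPr>: shrink-to-fit with a floor."""
    # A bodyPr carries at most one autofit child, so clear any existing one.
    for tag in ("normAutofit", "spAutoFit", "noAutofit"):
        for el in body_pr.findall(f"{{{A_NS}}}{tag}"):
            body_pr.remove(el)
    fit = ET.SubElement(body_pr, f"{{{A_NS}}}normAutofit")
    fit.set("fontScale", str(font_scale))          # minimum font scale
    fit.set("lnSpcReduction", str(lnspc_reduction))  # line-spacing squeeze
    return fit
```

Applying this at the XML layer, rather than clicking through PowerPoint, is what lets the offline apply step fix overflow deck-wide in one pass.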
Cache corruption
- Delete `translation_cache.json` to reset
- Use `--cache new_cache.json` for a fresh cache
# Enable verbose logging
export PYTHONPATH=scripts
python -u scripts/translate_pptx_inplace.py --in input.pptx --out output.pptx 2>&1 | tee debug.log

- OCR integration: Translate text in images
- Multi-language support: Beyond JA→EN
- Real-time collaboration: Shared translation sessions
- Template management: Reusable style configurations
- Quality scoring: Automatic translation assessment
MIT License - see LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests and documentation
- Submit a pull request
Built with ❤️ for efficient, high-quality presentation translation.