A governed multilingual backend foundation for future AI systems.
ARC Language Module is not just a translator. It is a language knowledge engine that helps an AI system know:
- what languages it has data for
- what scripts, variants, pronunciation hints, and lineage relationships exist
- what it can actually translate right now
- what still depends on external providers or corpora
- what was seeded, imported, changed, or left unresolved
That makes it a better fit for serious AI infrastructure than projects that only expose a translation endpoint.
This table is here to make the repo's niche obvious fast: ARC Language Module is best when you need a governed language backend, not just a translator endpoint.
| Capability / fit | ARC Language Module | Argos Translate | LibreTranslate | Firefox Translations / Bergamot | Unicode CLDR |
|---|---|---|---|---|---|
| Structured language graph | Yes — core strength | Limited | Limited | No | Yes — locale/reference focused |
| Runtime translation | Partial / routed | Yes — core strength | Yes — core strength | Yes — browser focused | No |
| Offline / local-first operation | Yes | Yes | Yes | Yes | Data/library dependent |
| API surface | Yes | Limited / wrapper dependent | Yes — core strength | No public ops API focus | No |
| CLI / operator workflows | Yes | Yes | Limited admin focus | No | Limited tooling focus |
| Coverage / readiness matrix | Yes — core strength | No | No | No | Partial via locale coverage |
| Provenance / governed ingestion | Yes — core strength | No | No | No | Contributor/repository process, not runtime governance |
| Release / evidence snapshots | Yes | No | No | No | No |
| Best used for | AI language substrate, multilingual control plane, governed routing | Offline translation library | Self-hosted translation API | Private browser/page translation | Locale data and internationalization reference |
| Stronger than ARC at | N/A (baseline) | Raw offline MT packaging | Simple translation API deployment | Seamless in-browser page translation | Breadth of locale standards/reference data |
| Stronger than others at | Governed language infrastructure: auditability, routing, graph modeling | Offline MT inference | Translation API simplicity | Browser-native private translation | Standards/reference ecosystem depth |
- Choose ARC Language Module when you need to know what languages you support, how well you support them, what data you have, what runtime paths exist, and what changed over time.
- Choose Argos Translate when you mainly want local/offline translation models.
- Choose LibreTranslate when you mainly want a translation API you can self-host quickly.
- Choose Firefox Translations / Bergamot when you mainly want private, on-device browser translation.
- Choose Unicode CLDR when you mainly want locale/reference data for i18n and formatting.
Think of this as the brain + filing system + traffic controller behind a multilingual AI stack.
It gives you:
- a language graph stored in SQLite
- a CLI and API for operators and applications
- seeded language knowledge you can inspect and extend
- runtime routing that separates “we know this language” from “we can translate or speak it right now”
- coverage, readiness, and policy surfaces so unsupported paths are visible instead of hidden
- evidence and release snapshots so the package can explain what it contains and what it claims
If you want a one-line summary:
ARC Language Module is a production-track substrate for AI systems that need structured multilingual knowledge, honest capability tracking, and controlled routing between data and runtime providers.
It keeps language records in a real database rather than loose notes or hardcoded conditionals.
That includes things like:
- language records
- aliases and alternate names
- scripts
- lineage / family relationships
- variants (dialects, registers, orthographies, historical stages)
- pronunciation profiles
- broad phonology hints
- transliteration profiles
- seeded phrase translations
- capability/readiness records
- governed language graph surfaces for efficient downstream model/context use
It can answer practical questions such as:
- Which languages are loaded?
- Which scripts are attached to each language?
- Which languages have pronunciation or phonology profiles?
- Which surfaces are seeded versus missing?
- Which capabilities are production, reviewed, experimental, or absent?
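
As a rough illustration of how such questions map onto a SQLite-backed graph, here is a minimal sketch. The table and column names (`languages`, `language_scripts`, `pronunciation_profiles`) and the demo rows are assumptions made for this example; the real schema lives under `sql/` and may differ.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
SCHEMA = """
CREATE TABLE languages (language_id TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE language_scripts (language_id TEXT NOT NULL, script_code TEXT NOT NULL);
CREATE TABLE pronunciation_profiles (language_id TEXT NOT NULL, notes TEXT);
"""


def build_demo_graph() -> sqlite3.Connection:
    """Create an in-memory database with a few illustrative rows (not the real seed data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO languages VALUES (?, ?)",
        [("eng", "English"), ("hin", "Hindi"), ("chr", "Cherokee")],
    )
    conn.executemany(
        "INSERT INTO language_scripts VALUES (?, ?)",
        [("eng", "Latn"), ("hin", "Deva"), ("chr", "Cher")],
    )
    # The demo deliberately leaves one language without a profile.
    conn.executemany(
        "INSERT INTO pronunciation_profiles VALUES (?, ?)",
        [("eng", "demo"), ("hin", "demo")],
    )
    conn.commit()
    return conn


def languages_missing_pronunciation(conn: sqlite3.Connection) -> list[str]:
    """Return language IDs that have no pronunciation profile attached."""
    rows = conn.execute(
        """
        SELECT l.language_id
        FROM languages AS l
        LEFT JOIN pronunciation_profiles AS p ON p.language_id = l.language_id
        WHERE p.language_id IS NULL
        ORDER BY l.language_id
        """
    ).fetchall()
    return [row[0] for row in rows]


def scripts_by_language(conn: sqlite3.Connection) -> dict[str, list[str]]:
    """Group attached script codes by language ID."""
    result: dict[str, list[str]] = {}
    query = "SELECT language_id, script_code FROM language_scripts ORDER BY language_id, script_code"
    for language_id, script_code in conn.execute(query):
        result.setdefault(language_id, []).append(script_code)
    return result
```

Because the graph is plain SQL, the "seeded versus missing" question reduces to a `LEFT JOIN` that keeps only the rows with no match.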
This repo does not pretend that every language is fully runtime-ready.
It can route a request through:
- seeded local phrase support
- optional local/runtime providers
- external provider bridges
- explicit “not ready” or “gap” states
That makes it a language operations layer, not just a translator wrapper.
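
The routing order above can be sketched as a small decision function. This is an illustration under stated assumptions, not the module's actual router: `RouteStatus`, `RouteDecision`, and `route_translation` are hypothetical names, and the real service may order or name its paths differently.

```python
from dataclasses import dataclass
from enum import Enum


class RouteStatus(str, Enum):
    SEEDED = "seeded_phrase"
    LOCAL = "local_provider"
    EXTERNAL = "external_bridge"
    GAP = "gap"


@dataclass(frozen=True)
class RouteDecision:
    status: RouteStatus
    detail: str


def route_translation(
    text: str,
    source: str,
    target: str,
    seeded: dict[tuple[str, str, str], str],
    local_pairs: set[tuple[str, str]],
    external_pairs: set[tuple[str, str]],
) -> RouteDecision:
    """Pick the first available path, and report a gap instead of guessing."""
    key = (source, target, text.strip().lower())
    if key in seeded:
        # A seeded phrase is answered directly from the graph.
        return RouteDecision(RouteStatus.SEEDED, seeded[key])
    if (source, target) in local_pairs:
        return RouteDecision(RouteStatus.LOCAL, "dispatch to a local provider")
    if (source, target) in external_pairs:
        return RouteDecision(RouteStatus.EXTERNAL, "dispatch to an external bridge")
    # Knowing a language is not the same as being able to translate it right now.
    return RouteDecision(RouteStatus.GAP, f"no runtime path for {source}->{target}")
```

The design point is that the gap is a first-class return value, so callers never have to infer "unsupported" from an exception or an empty string.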
The CLI/API surfaces can be used for:
- coverage reports
- implementation/readiness matrices
- policy snapshots
- acquisition workspace planning
- import validation
- evidence bundle exports
- release integrity checks
The package supports dry-run-safe ingestion and provenance-aware updates, so new datasets can be staged and checked instead of blindly merged.
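
One way to picture dry-run-safe ingestion is a transaction that is rolled back unless the operator opts in. The sketch below assumes a hypothetical `aliases` table with a `source` column for provenance; `make_demo_db` and `import_aliases` are illustrative names, not the package's API.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE languages (language_id TEXT PRIMARY KEY);
        CREATE TABLE aliases (language_id TEXT NOT NULL, alias TEXT NOT NULL, source TEXT NOT NULL);
        """
    )
    conn.execute("INSERT INTO languages VALUES ('eng')")
    conn.commit()
    return conn


def import_aliases(conn, rows, source, dry_run=True):
    """Stage alias rows with provenance; nothing persists while dry_run is True."""
    report = {"accepted": 0, "rejected": [], "committed": False}
    known = {row[0] for row in conn.execute("SELECT language_id FROM languages")}
    try:
        for language_id, alias in rows:
            if language_id not in known:
                report["rejected"].append((language_id, alias, "unknown language_id"))
                continue
            conn.execute(
                "INSERT INTO aliases (language_id, alias, source) VALUES (?, ?, ?)",
                (language_id, alias, source),
            )
            report["accepted"] += 1
        if dry_run:
            conn.rollback()  # discard the staged rows, keep the report
        else:
            conn.commit()
            report["committed"] = True
    except sqlite3.Error:
        conn.rollback()
        raise
    return report
```

Running the same import twice, first with `dry_run=True` and then with `dry_run=False`, gives the operator a report to review before any row is merged.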
To keep claims honest, this package is not:
- a universal best-in-class machine translation model
- a finished speech/TTS stack
- a complete transliteration engine for every script pair
- a giant cloud service by itself
It is strongest when used as a multilingual control layer inside a larger AI product or research stack.
Most language projects specialize in one narrow slice:
- translation only
- locale/reference data only
- browser translation only
- API hosting only
Future AI systems need more than that.
They need to know:
- what language knowledge they own
- what runtime tools are available
- which paths are trustworthy
- what support is partial or missing
- how to ingest better data without losing provenance
- how to expose all of this to both humans and software
That is the lane ARC Language Module is trying to lead:
not “best translator in the world,” but best governed language substrate for future AI systems that need multilingual memory, routing, readiness, and auditability.
ARC Language Module is not a hidden dataset dump and it does not replace real training data. Its role is different: it gives ARC systems and compatible LLM stacks a structured language graph, so the model does not have to relearn every language relationship from stored examples alone.
Instead of treating each language as isolated text, the module stores language identity, script, family, branch, lineage, variants, phonology hints, pronunciation hints, transliteration hints, aliases, and custom lineage overlays. That gives future model training and retrieval systems a reusable linguistic scaffold.
```mermaid
flowchart TD
    ARC[ARC Language Module] --> IDS[Language IDs / ISO codes]
    ARC --> SCRIPT[Scripts + orthography]
    ARC --> LINEAGE[Family / branch / lineage graph]
    ARC --> PHONO[Phonology profiles]
    ARC --> PRON[Pronunciation hints]
    ARC --> TRANS[Transliteration hints]
    ARC --> VAR[Variants / dialect notes]
    ARC --> CUSTOM[Custom language + custom lineage intake]
    LINEAGE --> IE[Indo-European]
    LINEAGE --> SEM[Afro-Asiatic / Semitic]
    LINEAGE --> IA[Indo-Aryan]
    LINEAGE --> DRAV[Dravidian]
    LINEAGE --> SIN[Sino-Tibetan]
    LINEAGE --> JAPONIC[Japonic]
    LINEAGE --> KOREANIC[Koreanic]
    LINEAGE --> TURKIC[Turkic]
    LINEAGE --> NIGER[Niger-Congo]
    LINEAGE --> ATH[Athabaskan]
    LINEAGE --> IROQ[Iroquoian]
    LINEAGE --> ALG[Algonquian]
    SCRIPT --> LATN[Latin]
    SCRIPT --> CYRL[Cyrillic]
    SCRIPT --> ARAB[Arabic / Nastaliq]
    SCRIPT --> DEVA[Devanagari]
    SCRIPT --> HANI[Han]
    SCRIPT --> JP[Kanji / Kana]
    SCRIPT --> HANG[Hangul]
    SCRIPT --> ETH[Ge'ez]
    SCRIPT --> CANS[Canadian Aboriginal Syllabics]
    SCRIPT --> CHER[Cherokee Syllabary]
    PHONO --> SOUND[Sound-shape hints]
    PRON --> SOUND
    TRANS --> BRIDGE[Cross-script bridge]
    VAR --> BRIDGE
    CUSTOM --> ARC
    ARC --> LLM[LLM / ARC-Neuron / compatible model]
    ARC --> OMNI[Omnibinary Runtime]
    ARC --> RAR[Arc-RAR bundles]
    ARC --> STREAM[ARC-StreamMemory visual modules]
    LLM --> LOWER[Lower need to store every language relation as raw memorized dataset rows]
    LOWER --> PARAM[More efficient parameter use through structured linguistic priors]
```
The language graph is designed to plug into the wider ARC stack without pretending those systems are bundled into this package:
- ARC-Neuron / LLMBuilder can use the module as a lexical/provenance scaffold for model-growth and candidate evaluation.
- Omnibinary Runtime can preserve language graph events, hashes, and source-spine references as device-portable binary continuity.
- Arc-RAR can package language manifests, graph snapshots, receipts, and rollback evidence into restorable archive bundles.
- ARC-StreamMemory can attach visual/video memory modules to language-aware receipts and AI-readable observation trails.
- ProtoSynth / Neural Synth can later visualize language lineage, scripts, variants, and time-to-space projections as navigable cognition maps.
A normal model without a language graph has to infer language relationships mostly from raw examples:
language behavior ≈ memorized examples + learned statistics
ARC Language Module adds a structured prior:
language behavior ≈ examples + lineage graph + script map + phonology map + transliteration map + variant map
So the model does not need to encode every language connection separately in its weights. It can reference a reusable graph.
Simplified:
Effective language coverage = model weights × structured language graph × verified examples
Or:
C_eff = W_model × G_language × E_verified
Where:
- W_model = the actual model weights
- G_language = the structured language graph from ARC Language Module
- E_verified = verified examples, corrections, and future datasets
The important point is that G_language raises the usefulness of the same model weights because related languages can share structure through lineage, script, phonology, transliteration, and variants.
This changes the “parameter bar” in a practical sense: the system is not relying only on raw stored examples. It has a retrievable, auditable language scaffold that helps future ARC-style systems align new examples against known language structure.
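
To make "related languages can share structure through lineage" concrete, here is a toy sketch that finds the lineage two languages have in common. The `LINEAGE` table is a hand-written assumption for illustration, not an export of the seed graph, and `shared_lineage` is a hypothetical helper.

```python
# Illustrative lineage paths, ordered from family down to branch.
LINEAGE = {
    "hin": ["Indo-European", "Indo-Iranian", "Indo-Aryan"],
    "ben": ["Indo-European", "Indo-Iranian", "Indo-Aryan"],
    "eng": ["Indo-European", "Germanic"],
    "tam": ["Dravidian"],
}


def shared_lineage(lang_a: str, lang_b: str) -> list[str]:
    """Return the common prefix of two lineage paths (empty if unrelated)."""
    shared = []
    for node_a, node_b in zip(LINEAGE[lang_a], LINEAGE[lang_b]):
        if node_a != node_b:
            break
        shared.append(node_a)
    return shared
```

A retrieval or training pipeline could use the length of that shared path as one signal for deciding which verified examples are worth aligning against each other.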
The current seed graph includes 35 languages with supporting surfaces for:
- language identity
- aliases
- scripts
- family / branch lineage
- variants
- transliteration hints
- pronunciation hints
- phonology profiles
- custom language submission
- custom lineage overlays
This does not mean the system already speaks all 35 languages at full native quality. It means ARC has a structured foundation for organizing, comparing, extending, and verifying language knowledge.
External datasets are still useful, but they become more efficient when they enter through the graph.
Instead of adding raw language data blindly:
dataset → model
ARC can do:
dataset → manifest → language graph alignment → lineage/script/phonology checks → candidate training/evaluation
This protects provenance and makes future dataset ingestion more controlled.
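
The alignment step in that pipeline could look something like the check below. It is a sketch under assumptions: the manifest fields (`dataset_id`, `source`, `license`, `language_id`, `script`), the `DEMO_GRAPH` shape, and `check_manifest` are invented for this example and are not the module's manifest format.

```python
# Hypothetical in-memory view of the graph: language ID -> attached script codes.
DEMO_GRAPH = {
    "hin": {"scripts": {"Deva"}},
    "eng": {"scripts": {"Latn"}},
}

REQUIRED_FIELDS = ("dataset_id", "source", "license", "language_id", "script")


def check_manifest(manifest: dict, graph: dict) -> list[str]:
    """Return a list of problems; an empty list means the manifest aligns with the graph."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not manifest.get(name)]
    language_id = manifest.get("language_id")
    script = manifest.get("script")
    if language_id and language_id not in graph:
        problems.append(f"unknown language_id: {language_id}")
    elif language_id and script and script not in graph[language_id]["scripts"]:
        problems.append(f"script {script} is not attached to {language_id}")
    return problems
```

A dataset only moves on to candidate training or evaluation when the returned list is empty; otherwise the problems become review notes attached to the manifest.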
The module can add custom languages or project-specific symbolic languages through governed intake:
new language
→ ID / aliases
→ script / orthography
→ phonology hints
→ lineage or custom lineage
→ variants
→ examples
→ review
→ approved graph entry
That lets ARC grow its language map without pretending every new language is already proven model knowledge.
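
The intake flow above can be approximated as a gate that blocks on the first missing stage. The stage names, the idea that every stage is mandatory, and `intake_status` itself are assumptions for this sketch; the real review policy may be more permissive.

```python
# Stage keys mirror the intake steps listed above; treating all of them as
# required is an assumption of this sketch.
REQUIRED_STAGES = ("identity", "script", "phonology", "lineage", "variants", "examples")


def intake_status(submission: dict, reviewed: bool = False) -> str:
    """Report where a custom-language submission stands in the intake flow."""
    for stage in REQUIRED_STAGES:
        if not submission.get(stage):
            return f"blocked: missing {stage}"
    # Complete data is still not an approved graph entry until a reviewer signs off.
    return "approved" if reviewed else "pending review"
```

Keeping "pending review" distinct from "approved" is what stops a complete-looking submission from being treated as proven knowledge.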
For a focused standalone version of this section, see docs/LANGUAGE_GRAPH_AND_PARAMETER_EFFICIENCY.md.
Different projects solve different problems well.
- Argos Translate is strong for offline open-source translation packages.
- LibreTranslate is strong for self-hosted translation APIs.
- Firefox Translations / Bergamot is strong for local in-browser translation.
- Unicode CLDR is strong for locale/reference data used across software ecosystems.
- ARC Language Module is strongest as the governed orchestration layer that sits above or beside those kinds of tools.
This is a role comparison, not a latency or BLEU benchmark.
| Project | Primary strength | Best use case | What it does not focus on |
|---|---|---|---|
| ARC Language Module | Governed multilingual substrate | AI backends that need language knowledge + routing + readiness + auditability | Being a single best MT engine |
| Argos Translate | Offline open-source translation | Local translation packages and desktop/local workflows | Broader governance / language graph surfaces |
| LibreTranslate | Self-hosted translation API | Drop-in translation endpoints and private deployment | Rich language-knowledge modeling |
| Firefox Translations / Bergamot | Private on-device browser translation | Website translation inside the browser | Operator-facing language registry and ingestion governance |
| Unicode CLDR | Locale/reference data | Internationalization, formatting, display names, locale metadata | Runtime translation orchestration |
For a more explicit comparison, see docs/COMPETITOR_COMPARISON.md.
Current release-integrity snapshot from the repo's single-source version path:
| Surface | Value |
|---|---|
| Version | 0.27.0 |
| Languages | 35 |
| Phrase translations | 385 |
| Language variants | 104 |
| Language capabilities | 245 |
| Pronunciation profiles | 35 |
| Phonology profiles | 35 |
| Transliteration profiles | 21 |
| Semantic concepts | 30 |
| Concept links | 46 |
Provider support is intentionally modeled separately from core graph truth. Runtime provider availability depends on what is installed, registered, and enabled in the target environment.
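
The "installed, registered, and enabled" distinction can be sketched with the standard library alone. `provider_status` is a hypothetical helper, and the registered/enabled flags are passed in by hand here; in the real package they would come from its own provider registry.

```python
import importlib.util


def provider_status(module_name: str, registered: bool, enabled: bool) -> dict:
    """Combine the three conditions a runtime provider must meet to be usable."""
    # find_spec returns None when a top-level module cannot be located,
    # so this checks installation without importing the provider.
    installed = importlib.util.find_spec(module_name) is not None
    return {
        "module": module_name,
        "installed": installed,
        "registered": registered,
        "enabled": enabled,
        "available": installed and registered and enabled,
    }
```

For example, `provider_status("argostranslate", registered=True, enabled=True)` would report `available` as true only on machines where the Argos Translate package is actually installed, while the graph records for the same languages stay unchanged.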
```shell
pip install -e .
PYTHONPATH=src python -m arc_lang.cli.main init-db
PYTHONPATH=src python -m arc_lang.cli.main seed-common-languages
PYTHONPATH=src python -m arc_lang.cli.main stats
PYTHONPATH=src python -m arc_lang.cli.main coverage-report
PYTHONPATH=src python -m arc_lang.cli.main system-status
PYTHONPATH=src python -m arc_lang.cli.main build-implementation-matrix
PYTHONPATH=src python -m arc_lang.cli.main release-snapshot
```

- What languages are in the graph right now?
- Which ones are missing transliteration or pronunciation support?
- Which variants exist for a given language?
- What translation/assertion data came from which source?
- Which capabilities are seeded, reviewed, experimental, or production?
- What changed between releases?
- Which providers are needed for a requested runtime path?
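
The "what changed between releases" question comes down to comparing two snapshots of graph counts. In the sketch below, `diff_snapshots` is a hypothetical helper, the `current` counts are copied from the release table above, and the `previous` counts are invented purely for illustration.

```python
def diff_snapshots(previous: dict[str, int], current: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return {surface: (old, new)} for every surface whose count changed."""
    changes = {}
    for surface in sorted(set(previous) | set(current)):
        old, new = previous.get(surface, 0), current.get(surface, 0)
        if old != new:
            changes[surface] = (old, new)
    return changes


# Counts from the 0.27.0 release table in this README.
current = {"languages": 35, "phrase_translations": 385, "transliteration_profiles": 21}
# Hypothetical earlier snapshot; these numbers are not from any real release.
previous = {"languages": 35, "phrase_translations": 350, "transliteration_profiles": 18}

print(diff_snapshots(previous, current))
```

Surfaces that did not change are left out, so the result reads as a short change log.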
The project is split into clear layers:
- `core/`: config, db, models
- `services/`: language logic, ingestion, routing, policy, evidence, coverage
- `api/`: FastAPI surface grouped by concern
- `cli/`: operator entrypoints and handlers
- `config/`: seed manifests and curated inputs
- `sql/`: schema and indexes
- `docs/`: architecture, runtime, policy, onboarding, and comparison docs
Deep dives:
- docs/ARCHITECTURE.md
- docs/RUNTIME_ORCHESTRATION.md
- docs/POLICY_AND_EVIDENCE.md
- docs/IMPLEMENTATION_MANIFESTS_AND_PHONOLOGY.md
- docs/COMPETITOR_COMPARISON.md
```shell
PYTHONPATH=src python -m arc_lang.cli.main release-snapshot
```

This emits:
- the package version
- pyproject/version consistency checks
- API health/version integrity checks
- live graph counts for release verification
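
The version consistency check can be pictured as comparing every reported version against one reference. This is a sketch only: `check_version_consistency` and the source names (`pyproject`, `package`, `api`) are assumptions for illustration, not the command's real implementation or output format.

```python
def check_version_consistency(versions: dict[str, str]) -> list[str]:
    """Return human-readable mismatches; an empty list means all sources agree."""
    if not versions:
        return ["no version sources provided"]
    # The first source listed is treated as the reference.
    reference_name, reference = next(iter(versions.items()))
    return [
        f"{name} reports {value}, expected {reference} (from {reference_name})"
        for name, value in versions.items()
        if value != reference
    ]
```

Calling it with `{"pyproject": "0.27.0", "package": "0.27.0", "api": "0.27.0"}` returns an empty list, and any drift between the sources shows up as one message per mismatching source.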
This package can connect to or sit beside external tooling, but does not bundle all of them by default.
| Provider / source | Role |
|---|---|
| Argos Translate | Local neural MT option |
| NLLB-style external inference | Large-scale MT bridge path |
| PersonaPlex-style speech provider | Speech boundary surface |
| Glottolog | External genealogy/reference corpus |
| ISO 639-3 | Authoritative language identifiers |
| CLDR | Script/locale/reference data |
Use the most specific topics first so the repo lands in the right lane:
multilingual
translation
language-detection
transliteration
pronunciation
phonology
natural-language-processing
multilingual-nlp
knowledge-graph
language-technology
fastapi
sqlite
cli
api
governance
auditability
orchestration
local-first
offline-first
artificial-intelligence
Governed multilingual language-ops substrate for AI systems: language knowledge, provider routing, auditability, readiness, CLI, and API.
A control layer for multilingual AI systems, not just a translator.
If this repo is useful to you:
- Star the repository
- Open issues for bugs, corpus gaps, or runtime/provider edge cases
- Send pull requests for new language data, provider integrations, or hardening work
- Share it with people building multilingual AI, localization systems, or language tools
- Support development on GitHub Sponsors
Current production-track validation for this codebase includes:
- 336 passing tests
- wheel and sdist build verification
- installed-wheel smoke validation
- FastAPI app-load verification
- CLI help / release snapshot verification
These checks support the repo's current positioning as a production-track language infrastructure package, while real-world deployment quality still depends on the target environment, provider integrations, telemetry, and soak testing.
This project is intended to ship under the MIT License. Add a root LICENSE file in the public repository so the visible GitHub repo matches the package metadata.
