ARC Language Module

GitHub Sponsors · Python 3.10+ · SQLite backed · FastAPI API · CLI operator tooling · Production track

A governed multilingual backend foundation for future AI systems.

ARC Language Module is not just a translator. It is a language knowledge engine that helps an AI system know:

  • what languages it has data for
  • what scripts, variants, pronunciation hints, and lineage relationships exist
  • what it can actually translate right now
  • what still depends on external providers or corpora
  • what was seeded, imported, changed, or left unresolved

That makes it a better fit for serious AI infrastructure than projects that only expose a translation endpoint.

At-a-glance feature fit

This table is here to make the repo's niche obvious fast: ARC Language Module is best when you need a governed language backend, not just a translator endpoint.

| Capability / fit | ARC Language Module | Argos Translate | LibreTranslate | Firefox Translations / Bergamot | Unicode CLDR |
|---|---|---|---|---|---|
| Structured language graph | Yes — core strength | Limited | Limited | No | Yes — locale/reference focused |
| Runtime translation | Partial / routed | Yes — core strength | Yes — core strength | Yes — browser focused | No |
| Offline / local-first operation | Yes | Yes | Yes | Yes | Data/library dependent |
| API surface | Yes | Limited / wrapper dependent | Yes — core strength | No public ops API focus | No |
| CLI / operator workflows | Yes | Yes | Limited admin focus | No | Limited tooling focus |
| Coverage / readiness matrix | Yes — core strength | No | No | No | Partial via locale coverage |
| Provenance / governed ingestion | Yes — core strength | No | No | No | Contributor/repository process, not runtime governance |
| Release / evidence snapshots | Yes | No | No | No | No |
| Best used for | AI language substrate, multilingual control plane, governed routing | Offline translation library | Self-hosted translation API | Private browser/page translation | Locale data and internationalization reference |
| Stronger than ARC at | Auditability, routing, graph modeling | Raw offline MT packaging | Simple translation API deployment | Seamless in-browser page translation | Breadth of locale standards/reference data |
| Stronger than others at | Governed language infrastructure | Offline MT inference | Translation API simplicity | Browser-native private translation | Standards/reference ecosystem depth |

Quick read of the table

  • Choose ARC Language Module when you need to know what languages you support, how well you support them, what data you have, what runtime paths exist, and what changed over time.
  • Choose Argos Translate when you mainly want local/offline translation models.
  • Choose LibreTranslate when you mainly want a translation API you can self-host quickly.
  • Choose Firefox Translations / Bergamot when you mainly want private, on-device browser translation.
  • Choose Unicode CLDR when you mainly want locale/reference data for i18n and formatting.

What this repo is, in plain English

Think of this as the brain + filing system + traffic controller behind a multilingual AI stack.

It gives you:

  • a language graph stored in SQLite
  • a CLI and API for operators and applications
  • seeded language knowledge you can inspect and extend
  • runtime routing that separates “we know this language” from “we can translate or speak it right now”
  • coverage, readiness, and policy surfaces so unsupported paths are visible instead of hidden
  • evidence and release snapshots so the package can explain what it contains and what it claims

If you want a one-line summary:

ARC Language Module is a production-track substrate for AI systems that need structured multilingual knowledge, honest capability tracking, and controlled routing between data and runtime providers.


What it can do today

1) Store structured language knowledge

It keeps language records in a real database rather than loose notes or hardcoded conditionals.

That includes things like:

  • language records
  • aliases and alternate names
  • scripts
  • lineage / family relationships
  • variants (dialects, registers, orthographies, historical stages)
  • pronunciation profiles
  • broad phonology hints
  • transliteration profiles
  • seeded phrase translations
  • capability/readiness records
  • governed language graph surfaces for efficient downstream model/context use
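The shape of such a store can be sketched in plain SQLite. The tables, columns, and sample rows below are hypothetical illustrations of the idea, not the package's actual schema:

```python
# Illustrative sketch only: a minimal SQLite shape for a language graph.
# Table and column names here are hypothetical, not arc_lang's real schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE languages (
    id   TEXT PRIMARY KEY,   -- e.g. an ISO 639-3 code
    name TEXT NOT NULL
);
CREATE TABLE aliases (
    language_id TEXT REFERENCES languages(id),
    alias       TEXT NOT NULL
);
CREATE TABLE scripts (
    language_id TEXT REFERENCES languages(id),
    script      TEXT NOT NULL  -- e.g. an ISO 15924 code
);
CREATE TABLE lineage (
    child_id TEXT REFERENCES languages(id),
    parent   TEXT NOT NULL     -- family or branch label
);
""")

conn.executemany("INSERT INTO languages VALUES (?, ?)",
                 [("spa", "Spanish"), ("por", "Portuguese")])
conn.executemany("INSERT INTO scripts VALUES (?, ?)",
                 [("spa", "Latn"), ("por", "Latn")])
conn.executemany("INSERT INTO lineage VALUES (?, ?)",
                 [("spa", "Indo-European/Romance"),
                  ("por", "Indo-European/Romance")])

# Because relationships are rows, "which languages share a lineage label?"
# is a query rather than a hardcoded conditional.
rows = conn.execute("""
    SELECT a.child_id, b.child_id FROM lineage a
    JOIN lineage b ON a.parent = b.parent AND a.child_id < b.child_id
""").fetchall()
print(rows)  # [('por', 'spa')]
```

The point of the sketch is that lineage, scripts, and aliases become queryable records instead of notes scattered through code.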

2) Tell you what the system actually knows

It can answer practical questions such as:

  • Which languages are loaded?
  • Which scripts are attached to each language?
  • Which languages have pronunciation or phonology profiles?
  • Which surfaces are seeded versus missing?
  • Which capabilities are production, reviewed, experimental, or absent?
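A readiness query of that kind can be illustrated with a small rollup. The status names mirror the states listed above; the data and function are hypothetical, not the package's API:

```python
# Hypothetical capability records keyed by language; statuses follow the
# production / reviewed / experimental / absent states described above.
CAPABILITIES = {
    "spa": {"translation": "production", "pronunciation": "reviewed"},
    "chr": {"translation": "experimental", "pronunciation": "absent"},
}

def coverage_report(capabilities):
    """Group languages by readiness status for each capability."""
    report = {}
    for lang, caps in capabilities.items():
        for cap, status in caps.items():
            report.setdefault(cap, {}).setdefault(status, []).append(lang)
    return report

print(coverage_report(CAPABILITIES))
```

Even this toy version answers "which languages have production translation?" without guessing.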

3) Route translation requests honestly

This repo does not pretend that every language is fully runtime-ready.

It can route a request through:

  • seeded local phrase support
  • optional local/runtime providers
  • external provider bridges
  • explicit “not ready” or “gap” states

That makes it a language operations layer, not just a translator wrapper.
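The routing ladder above can be sketched as a fallback chain that ends in an explicit gap state instead of a silent failure. All names here are illustrative assumptions, not the package's actual routing API:

```python
# Honest routing sketch: seeded phrases first, then optional providers,
# then an explicit "not ready" result. Names are illustrative only.
SEEDED = {("eng", "spa", "hello"): "hola"}
PROVIDERS = {}  # e.g. {"argos": callable} when a runtime provider is installed

def route_translation(src, dst, text):
    key = (src, dst, text.lower())
    if key in SEEDED:
        return {"status": "seeded", "text": SEEDED[key]}
    for name, provider in PROVIDERS.items():
        # First registered provider that exists handles the request.
        return {"status": "provider", "provider": name,
                "text": provider(src, dst, text)}
    return {"status": "not_ready",
            "reason": f"no runtime path for {src}->{dst}"}

print(route_translation("eng", "spa", "hello"))  # seeded hit
print(route_translation("eng", "chr", "hello"))  # explicit gap, not a fake answer
```

The design choice worth noting is the last branch: an unsupported path returns a structured gap result that callers can inspect.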

4) Support operator workflows

The CLI/API surfaces can be used for:

  • coverage reports
  • implementation/readiness matrices
  • policy snapshots
  • acquisition workspace planning
  • import validation
  • evidence bundle exports
  • release integrity checks

5) Ingest and govern new language data

The package supports dry-run-safe ingestion and provenance-aware updates, so new datasets can be staged and checked instead of blindly merged.
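Dry-run-safe ingestion can be sketched as a staging pass that reports what would change before anything is committed. The validation rules and field names below are assumptions for illustration:

```python
# Dry-run ingestion sketch: records are validated and staged; nothing is
# committed unless dry_run=False. Field names are illustrative assumptions.
def ingest(records, known_ids, dry_run=True):
    staged, problems = [], []
    for rec in records:
        if rec.get("language_id") not in known_ids:
            problems.append((rec, "unknown language_id"))
        elif not rec.get("source"):
            problems.append((rec, "missing provenance source"))
        else:
            staged.append(rec)
    if dry_run:
        return {"would_add": len(staged), "problems": problems}
    # A real run would commit `staged` here, carrying `source` forward
    # so every record keeps its provenance.
    return {"added": len(staged), "problems": problems}

report = ingest(
    [{"language_id": "spa", "source": "manual-seed"},
     {"language_id": "zzz"}],
    known_ids={"spa", "por"},
)
print(report)
```

Requiring a provenance `source` at the gate is what makes later audits possible: a record without an origin never enters the graph.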


What it is not

To keep claims honest, this package is not:

  • a universal best-in-class machine translation model
  • a finished speech/TTS stack
  • a complete transliteration engine for every script pair
  • a giant cloud service by itself

It is strongest when used as a multilingual control layer inside a larger AI product or research stack.


Why this matters for future AI

Most language projects specialize in one narrow slice:

  • translation only
  • locale/reference data only
  • browser translation only
  • API hosting only

Future AI systems need more than that.

They need to know:

  • what language knowledge they own
  • what runtime tools are available
  • which paths are trustworthy
  • what support is partial or missing
  • how to ingest better data without losing provenance
  • how to expose all of this to both humans and software

That is the lane ARC Language Module is trying to lead:

not “best translator in the world,” but best governed language substrate for future AI systems that need multilingual memory, routing, readiness, and auditability.


Language graph and parameter efficiency

ARC Language Module is not a hidden dataset dump and it does not replace real training data. Its role is different: it gives ARC systems and compatible LLM stacks a structured language graph so the model does not have to relearn every language relationship only from stored examples.

Instead of treating each language as isolated text, the module stores language identity, script, family, branch, lineage, variants, phonology hints, pronunciation hints, transliteration hints, aliases, and custom lineage overlays. That gives future model training and retrieval systems a reusable linguistic scaffold.

```mermaid
flowchart TD
    ARC[ARC Language Module] --> IDS[Language IDs / ISO codes]
    ARC --> SCRIPT[Scripts + orthography]
    ARC --> LINEAGE[Family / branch / lineage graph]
    ARC --> PHONO[Phonology profiles]
    ARC --> PRON[Pronunciation hints]
    ARC --> TRANS[Transliteration hints]
    ARC --> VAR[Variants / dialect notes]
    ARC --> CUSTOM[Custom language + custom lineage intake]

    LINEAGE --> IE[Indo-European]
    LINEAGE --> SEM[Afro-Asiatic / Semitic]
    LINEAGE --> IA[Indo-Aryan]
    LINEAGE --> DRAV[Dravidian]
    LINEAGE --> SIN[Sino-Tibetan]
    LINEAGE --> JAPONIC[Japonic]
    LINEAGE --> KOREANIC[Koreanic]
    LINEAGE --> TURKIC[Turkic]
    LINEAGE --> NIGER[Niger-Congo]
    LINEAGE --> ATH[Athabaskan]
    LINEAGE --> IROQ[Iroquoian]
    LINEAGE --> ALG[Algonquian]

    SCRIPT --> LATN[Latin]
    SCRIPT --> CYRL[Cyrillic]
    SCRIPT --> ARAB[Arabic / Nastaliq]
    SCRIPT --> DEVA[Devanagari]
    SCRIPT --> HANI[Han]
    SCRIPT --> JP[Kanji / Kana]
    SCRIPT --> HANG[Hangul]
    SCRIPT --> ETH[Ge'ez]
    SCRIPT --> CANS[Canadian Aboriginal Syllabics]
    SCRIPT --> CHER[Cherokee Syllabary]

    PHONO --> SOUND[Sound-shape hints]
    PRON --> SOUND
    TRANS --> BRIDGE[Cross-script bridge]
    VAR --> BRIDGE
    CUSTOM --> ARC

    ARC --> LLM[LLM / ARC-Neuron / compatible model]
    ARC --> OMNI[Omnibinary Runtime]
    ARC --> RAR[Arc-RAR bundles]
    ARC --> STREAM[ARC-StreamMemory visual modules]

    LLM --> LOWER[Lower need to store every language relation as raw memorized dataset rows]
    LOWER --> PARAM[More efficient parameter use through structured linguistic priors]
```

Connected ARC ecosystem roles

The language graph is designed to plug into the wider ARC stack without pretending those systems are bundled into this package:

  • ARC-Neuron / LLMBuilder can use the module as a lexical/provenance scaffold for model-growth and candidate evaluation.
  • Omnibinary Runtime can preserve language graph events, hashes, and source-spine references as device-portable binary continuity.
  • Arc-RAR can package language manifests, graph snapshots, receipts, and rollback evidence into restorable archive bundles.
  • ARC-StreamMemory can attach visual/video memory modules to language-aware receipts and AI-readable observation trails.
  • ProtoSynth / Neural Synth can later visualize language lineage, scripts, variants, and time-to-space projections as navigable cognition maps.

Mathematical intuition

A normal model without a language graph has to infer language relationships mostly from raw examples:

language behavior ≈ memorized examples + learned statistics

ARC Language Module adds a structured prior:

language behavior ≈ examples + lineage graph + script map + phonology map + transliteration map + variant map

So the model does not need to store every language connection as a separate memorized dataset weight. It can reference a reusable graph.

Simplified:

Effective language coverage = model weights × structured language graph × verified examples

Or:

C_eff = W_model × G_language × E_verified

Where:

  • W_model = the actual model weights
  • G_language = the structured language graph from ARC Language Module
  • E_verified = verified examples, corrections, and future datasets

The important point is that G_language raises the usefulness of the same model weights because related languages can share structure through lineage, script, phonology, transliteration, and variants.

This changes the “parameter bar” in a practical sense: the system is not relying only on raw stored examples. It has a retrievable, auditable language scaffold that helps future ARC-style systems align new examples against known language structure.
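One way to make this intuition concrete: memorizing every pairwise relation among N languages needs on the order of N(N-1)/2 entries, while a shared lineage tree needs only about N-1 edges. This is a back-of-envelope illustration of the scaling, not a claim about actual model weights:

```python
# Back-of-envelope scaling: pairwise memorization vs. a shared lineage tree.
def pairwise_relations(n):
    """Entries needed to memorize every pair of language relations."""
    return n * (n - 1) // 2

def tree_edges(n):
    """Edges in a lineage tree that related languages can share."""
    return n - 1

for n in (35, 100, 500):
    print(n, pairwise_relations(n), tree_edges(n))
```

At the seed size of 35 languages the gap is already an order of magnitude, and it widens quadratically as the graph grows.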

Current scope

The current seed graph includes 35 languages with supporting surfaces for:

  • language identity
  • aliases
  • scripts
  • family / branch lineage
  • variants
  • transliteration hints
  • pronunciation hints
  • phonology profiles
  • custom language submission
  • custom lineage overlays

This does not mean the system already speaks all 35 languages at full native quality. It means ARC has a structured foundation for organizing, comparing, extending, and verifying language knowledge.

Why this matters for future datasets

External datasets are still useful, but they become more efficient when they enter through the graph.

Instead of adding raw language data blindly:

dataset → model

ARC can do:

dataset → manifest → language graph alignment → lineage/script/phonology checks → candidate training/evaluation

This protects provenance and makes future dataset ingestion more controlled.
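The graph-alignment step in that pipeline can be sketched as a check of each dataset row against known language IDs and declared scripts. The graph contents and field names here are illustrative assumptions:

```python
# Graph alignment sketch: dataset rows are checked against the language
# graph before they are considered for training. Names are illustrative.
GRAPH = {
    "spa": {"scripts": {"Latn"}},
    "rus": {"scripts": {"Cyrl"}},
}

def align(rows):
    """Split dataset rows into graph-consistent and flagged sets."""
    accepted, flagged = [], []
    for row in rows:
        info = GRAPH.get(row["language_id"])
        if info is None:
            flagged.append((row, "language not in graph"))
        elif row["script"] not in info["scripts"]:
            flagged.append((row, "script mismatch"))
        else:
            accepted.append(row)
    return accepted, flagged

accepted, flagged = align([
    {"language_id": "spa", "script": "Latn", "text": "hola"},
    {"language_id": "spa", "script": "Cyrl", "text": "привет"},
])
print(len(accepted), len(flagged))  # 1 1
```

A Cyrillic row labeled Spanish gets flagged instead of silently entering the candidate pool, which is the controlled-ingestion property the pipeline describes.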

Custom language growth

The module can add custom languages or project-specific symbolic languages through governed intake:

new language
→ ID / aliases
→ script / orthography
→ phonology hints
→ lineage or custom lineage
→ variants
→ examples
→ review
→ approved graph entry

That lets ARC grow its language map without pretending every new language is already proven model knowledge.
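The intake flow above can be sketched as an explicit stage progression, where an entry only reaches the graph after the review stage. The stage names mirror the pipeline above; the mechanics are illustrative, not the package's real workflow:

```python
# Governed intake sketch: a submission advances through explicit stages
# and is only "approved" after review. Stage names mirror the pipeline text.
STAGES = ["submitted", "id_aliases", "script_orthography", "phonology",
          "lineage", "variants", "examples", "review", "approved"]

def advance(entry):
    """Move an intake entry one stage forward, stopping at the final stage."""
    i = STAGES.index(entry["stage"])
    if i + 1 < len(STAGES):
        entry["stage"] = STAGES[i + 1]
    return entry

entry = {"name": "example-conlang", "stage": "submitted"}
while entry["stage"] != "approved":
    advance(entry)
print(entry["stage"])  # approved
```

The point is that every new language carries a visible stage, so "in the graph" and "still under review" never blur together.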

For a focused standalone version of this section, see docs/LANGUAGE_GRAPH_AND_PARAMETER_EFFICIENCY.md.


Where it sits compared to other projects

Different projects solve different problems well.

  • Argos Translate is strong for offline open-source translation packages.
  • LibreTranslate is strong for self-hosted translation APIs.
  • Firefox Translations / Bergamot is strong for local in-browser translation.
  • Unicode CLDR is strong for locale/reference data used across software ecosystems.
  • ARC Language Module is strongest as the governed orchestration layer that sits above or beside those kinds of tools.

Qualitative comparison by role

This is a role comparison, not a latency or BLEU benchmark.

Language infrastructure comparison chart

Comparison table

| Project | Primary strength | Best use case | What it does not focus on |
|---|---|---|---|
| ARC Language Module | Governed multilingual substrate | AI backends that need language knowledge + routing + readiness + auditability | Being a single best MT engine |
| Argos Translate | Offline open-source translation | Local translation packages and desktop/local workflows | Broader governance / language graph surfaces |
| LibreTranslate | Self-hosted translation API | Drop-in translation endpoints and private deployment | Rich language-knowledge modeling |
| Firefox Translations / Bergamot | Private on-device browser translation | Website translation inside the browser | Operator-facing language registry and ingestion governance |
| Unicode CLDR | Locale/reference data | Internationalization, formatting, display names, locale metadata | Runtime translation orchestration |

For a more explicit comparison, see docs/COMPETITOR_COMPARISON.md.


Seed and package snapshot

Current release-integrity snapshot from the repo's single-source version path:

| Surface | Count |
|---|---|
| Version | 0.27.0 |
| Languages | 35 |
| Phrase translations | 385 |
| Language variants | 104 |
| Language capabilities | 245 |
| Pronunciation profiles | 35 |
| Phonology profiles | 35 |
| Transliteration profiles | 21 |
| Semantic concepts | 30 |
| Concept links | 46 |

Provider support is intentionally modeled separately from core graph truth. Runtime provider availability depends on what is installed, registered, and enabled in the target environment.


Quick start

```shell
pip install -e .

PYTHONPATH=src python -m arc_lang.cli.main init-db
PYTHONPATH=src python -m arc_lang.cli.main seed-common-languages
PYTHONPATH=src python -m arc_lang.cli.main stats
PYTHONPATH=src python -m arc_lang.cli.main coverage-report
PYTHONPATH=src python -m arc_lang.cli.main system-status
PYTHONPATH=src python -m arc_lang.cli.main build-implementation-matrix
PYTHONPATH=src python -m arc_lang.cli.main release-snapshot
```

Example operator questions this repo can answer

  • What languages are in the graph right now?
  • Which ones are missing transliteration or pronunciation support?
  • Which variants exist for a given language?
  • What translation/assertion data came from which source?
  • Which capabilities are seeded, reviewed, experimental, or production?
  • What changed between releases?
  • Which providers are needed for a requested runtime path?

Architecture at a glance

The project is split into clear layers:

  • core/ — config, db, models
  • services/ — language logic, ingestion, routing, policy, evidence, coverage
  • api/ — FastAPI surface grouped by concern
  • cli/ — operator entrypoints and handlers
  • config/ — seed manifests and curated inputs
  • sql/ — schema and indexes
  • docs/ — architecture, runtime, policy, onboarding, and comparison docs

Deep dives: see docs/ for architecture, runtime, policy, onboarding, and comparison guides.


Release integrity

```shell
PYTHONPATH=src python -m arc_lang.cli.main release-snapshot
```

This emits:

  • the package version
  • pyproject/version consistency checks
  • API health/version integrity checks
  • live graph counts for release verification
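The pyproject/version consistency idea can be sketched as a simple comparison between the declared version and what the installed package reports. The regex parse and sample values below are simplifications for illustration; real code would use a TOML parser:

```python
# Version consistency sketch: compare the version declared in pyproject.toml
# with what the installed package reports. The regex parse is a deliberate
# simplification; a real check would use a TOML parser.
import re

PYPROJECT = '''
[project]
name = "arc-language-module"
version = "0.27.0"
'''

PACKAGE_VERSION = "0.27.0"  # hypothetical value the installed package reports

def declared_version(pyproject_text):
    """Extract the [project] version string from pyproject.toml text."""
    match = re.search(r'^version\s*=\s*"([^"]+)"', pyproject_text, re.MULTILINE)
    return match.group(1) if match else None

print("versions match:", declared_version(PYPROJECT) == PACKAGE_VERSION)
```

A snapshot that fails this comparison is a release-integrity signal: the published metadata and the running code have drifted apart.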

External dependencies and optional providers

This package can connect to or sit beside external tooling, but does not bundle all of them by default.

| Provider / source | Role |
|---|---|
| Argos Translate | Local neural MT option |
| NLLB-style external inference | Large-scale MT bridge path |
| PersonaPlex-style speech provider | Speech boundary surface |
| Glottolog | External genealogy/reference corpus |
| ISO 639-3 | Authoritative language identifiers |
| CLDR | Script/locale/reference data |

Repository metadata

Suggested GitHub topics

Use the most specific topics first so the repo lands in the right lane:

multilingual
translation
language-detection
transliteration
pronunciation
phonology
natural-language-processing
multilingual-nlp
knowledge-graph
language-technology
fastapi
sqlite
cli
api
governance
auditability
orchestration
local-first
offline-first
artificial-intelligence

Suggested GitHub About text

Governed multilingual language-ops substrate for AI systems: language knowledge, provider routing, auditability, readiness, CLI, and API.

Short promotional line

A control layer for multilingual AI systems, not just a translator.


Support the project

If this repo is useful to you:

  • Star the repository
  • Open issues for bugs, corpus gaps, or runtime/provider edge cases
  • Send pull requests for new language data, provider integrations, or hardening work
  • Share it with people building multilingual AI, localization systems, or language tools
  • Support development on GitHub Sponsors

Release and validation status

Current production-track validation for this codebase includes:

  • 336 passing tests
  • wheel and sdist build verification
  • installed-wheel smoke validation
  • FastAPI app-load verification
  • CLI help / release snapshot verification

These checks support the repo's current positioning as a production-track language infrastructure package, while real-world deployment quality still depends on the target environment, provider integrations, telemetry, and soak testing.


License

This project is intended to ship under the MIT License. Add a root LICENSE file in the public repository so the visible GitHub repo matches the package metadata.

About

ARC Language Module is a governed multilingual backend foundation for future AI systems. It combines a language graph, an ingestion pipeline, runtime routing, coverage/readiness reporting, and evidence surfaces so an AI stack can see what it has, what is missing, and how to route work honestly.
