This project provides a JavaScript dataset and helper code for recognizing and categorizing text by Unicode script.
It maps characters to their writing systems (such as Latin, Arabic, Cyrillic, and Devanagari) based on the Unicode 10.0 database and related sources.
- Detects which script a given character belongs to.
- Includes metadata for each script:
  - Name (e.g., Arabic, Latin, Cyrillic)
  - Unicode ranges
  - Writing direction (ltr, rtl, ttb)
  - Approximate year of origin
  - Whether the script is still in use
  - Reference link (Wikipedia)
- Supports more than 140 world writing systems.
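
The metadata fields listed above could be stored as plain objects. The following is a minimal sketch of one possible entry shape, not the project's actual data. The `SAMPLE_SCRIPTS` array, the `describeScript` helper, the `living` and `link` property names, and the simplified single-block range are all assumptions made for illustration. The range is written half-open (`from` inclusive, `to` exclusive).

```javascript
// Hypothetical sample entry; the real dataset may use different
// field names and list several ranges per script.
const SAMPLE_SCRIPTS = [
  {
    name: "Bengali",
    // Simplified: the whole Bengali Unicode block (U+0980 to U+09FF),
    // expressed as a half-open range [from, to).
    ranges: [[0x0980, 0x0a00]],
    direction: "ltr", // one of "ltr", "rtl", "ttb"
    year: 1050, // approximate year of origin
    living: true, // assumed field name for "still in use"
    link: "https://en.wikipedia.org/wiki/Bengali_alphabet", // assumed field name
  },
];

// Hypothetical helper: summarize one entry as a short string.
function describeScript(script) {
  const status = script.living ? "in use" : "historical";
  return `${script.name} (${script.direction}, c. ${script.year}, ${status})`;
}

console.log(describeScript(SAMPLE_SCRIPTS[0]));
// → Bengali (ltr, c. 1050, in use)
```

Keeping each script as a flat record means a lookup only has to scan `ranges`, while the remaining fields travel along with whatever entry matches.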
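// Example: find which script a character belongs to
// SCRIPTS is assumed to be the project's array of script records.
function characterScript(code) {
  for (let script of SCRIPTS) {
    // Each range is half-open: from is inclusive, to is exclusive.
    if (script.ranges.some(([from, to]) => code >= from && code < to)) {
      return script;
    }
  }
  return null; // no script covers this code point
}

console.log(characterScript("শ".codePointAt(0)));
// → { name: "Bengali", direction: "ltr", year: 1050, ... }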
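
One detail matters when feeding characters into a lookup like this: JavaScript strings are sequences of UTF-16 code units, so a character outside the Basic Multilingual Plane occupies two units. `charCodeAt` returns only the first unit (a surrogate), while `codePointAt` returns the full code point. The sketch below illustrates the difference, along with the exclusive upper bound of each range. The one-entry `DEMO_SCRIPTS` array and the `findScript` helper are hypothetical names for this illustration, and the single range covers only the Gothic Unicode block (U+10330 to U+1034F); the real dataset's ranges may differ.

```javascript
// Hypothetical one-entry dataset; ranges are half-open: [from, to).
const DEMO_SCRIPTS = [
  { name: "Gothic", ranges: [[0x10330, 0x10350]], direction: "ltr" },
];

// Same range test as the lookup above, with the dataset passed in.
function findScript(code, scripts) {
  for (const script of scripts) {
    if (script.ranges.some(([from, to]) => code >= from && code < to)) {
      return script;
    }
  }
  return null;
}

// GOTHIC LETTER AHSA, a character outside the Basic Multilingual Plane.
const letter = "\u{10330}";

console.log(letter.length); // → 2 (two UTF-16 code units)
console.log(letter.charCodeAt(0)); // → 55296 (0xD800, a high surrogate)
console.log(letter.codePointAt(0)); // → 66352 (0x10330, the full code point)

// The surrogate value falls in no script range, so the lookup fails.
console.log(findScript(letter.charCodeAt(0), DEMO_SCRIPTS)); // → null
console.log(findScript(letter.codePointAt(0), DEMO_SCRIPTS).name); // → Gothic

// The upper bound is exclusive, so 0x10350 itself is not matched.
console.log(findScript(0x10350, DEMO_SCRIPTS)); // → null
```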
Detect script direction
let sample = "مرحبا"; // Arabic text
let script = characterScript(sample.codePointAt(0));
console.log(`Direction: ${script.direction}`);
// → Direction: rtl
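
Real text often mixes scripts, so checking only the first character can mislead. As a rough sketch, and not something the project is stated to provide, one could look up every code point and take the most frequent script. The `DEMO_SCRIPTS` array, its simplified ranges (basic Latin letters and the main Arabic block only), and the `findScript` and `dominantScript` helpers are hypothetical names introduced for this illustration.

```javascript
// Hypothetical, simplified dataset; ranges are half-open: [from, to).
const DEMO_SCRIPTS = [
  { name: "Latin", ranges: [[0x41, 0x5b], [0x61, 0x7b]], direction: "ltr" },
  { name: "Arabic", ranges: [[0x0600, 0x0700]], direction: "rtl" },
];

function findScript(code, scripts) {
  for (const script of scripts) {
    if (script.ranges.some(([from, to]) => code >= from && code < to)) {
      return script;
    }
  }
  return null;
}

// Return the script that covers the most characters, or null if none match.
function dominantScript(text, scripts) {
  const counts = new Map();
  // for...of iterates a string by code point, not by UTF-16 code unit.
  for (const char of text) {
    const script = findScript(char.codePointAt(0), scripts);
    if (script) counts.set(script, (counts.get(script) || 0) + 1);
  }
  let best = null;
  let bestCount = 0;
  for (const [script, count] of counts) {
    // Strict comparison: on a tie, the script seen first wins.
    if (count > bestCount) {
      best = script;
      bestCount = count;
    }
  }
  return best;
}

const mixed = "hello مرحبا بالعالم"; // 5 Latin letters, 12 Arabic letters
const result = dominantScript(mixed, DEMO_SCRIPTS);
console.log(result ? result.direction : "unknown");
// → rtl
```

Spaces, digits, and punctuation match no entry in this simplified dataset, so they are skipped rather than counted.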
Applications
- Language/text recognition
- Syntax highlighting
- Internationalization (i18n)
- Font rendering
- Digital humanities research