Text translator library based on LLMs, especially Hugging Face's EncoderDecoderModel
Package | Repo | Description |
---|---|---|
EDMTranslator | | Main library |
- .NET 6 or above
- At least 3.5 GB of free RAM before running the translator
- JESCJaEnTranslator (sappho192/jesc-ja-en-translator): Japanese-to-English translator based on `tohoku-nlp/bert-base-japanese-v2` and `openai-community/gpt2`, fine-tuned with the JESC dataset
- FF14JaKoTranslator (sappho192/ffxiv-ja-ko-translator): Japanese-to-Korean translator based on `tohoku-nlp/bert-base-japanese-v2` and `skt/kogpt2-base-v2`, fine-tuned with the FF14 dataset
- AihubJaKoTranslator (sappho192/aihub-ja-ko-translator): Japanese-to-Korean translator based on `tohoku-nlp/bert-base-japanese-v2` and `skt/kogpt2-base-v2`, fine-tuned with the AIHub dataset
- More to be added...
The following guide assumes that you are using the JESCJaEnTranslator mentioned above.
- From NuGet, install the `EDMTranslator` package
- Then install the `Tokenizers.DotNet.runtime.win` package as well
- Download the UniDic MeCab dictionary `unidic-mecab-2.1.2_bin.zip` from https://clrd.ninjal.ac.jp/unidic_archive/cwj/2.1.2/ and extract the archive to a location of your choice
- Download the translator model from sappho192/jesc-ja-en-translator (specifically `onnx_jesc-ja-en.7z`) and extract the archive to a location of your choice
Write code like the following and you are good to go 🫡
Note that you need to set `encoderDictDir` and `modelDir` to the locations where you extracted the dictionary and the model.
```csharp
// Console application which translates Japanese sentences to English with JESCJaEnTranslator
using EDMTranslator.Tokenization;
using EDMTranslator.Translation;

// Prepare the tokenizer
var encoderVocabPath = await BertJapaneseTokenizer.HuggingFace.GetVocabFromHub("tohoku-nlp/bert-base-japanese-v2");
var hubName = "openai-community/gpt2";
var decoderVocabFilename = "tokenizer.json";
var decoderVocabPath = await Tokenizers.DotNet.HuggingFace.GetFileFromHub(hubName, decoderVocabFilename, "deps");
string encoderDictDir = @"D:\DATASET\unidic-mecab-2.1.2_bin";
var tokenizer = new BertJa2GPTTokenizer(
    encoderDictDir: encoderDictDir, encoderVocabPath: encoderVocabPath,
    decoderVocabPath: decoderVocabPath);

void TestTokenizer(ITokenizer tokenizer)
{
    Console.WriteLine("--Tokenizer test--");
    Console.WriteLine("[Encode]");
    var sentenceJa = "打ち合わせが終わった後にご飯を食べましょう。";
    Console.WriteLine($"Input: {sentenceJa}");
    var (embeddingsJa, attentionMask) = tokenizer.Encode(sentenceJa);
    Console.WriteLine($"Encoded: {string.Join(", ", embeddingsJa)}");

    Console.WriteLine("[Decode]");
    // Tokens of "i was nervous before the exam, and i had a fever."
    var tokens = new uint[] { 72, 373, 10927, 878, 262, 2814, 11, 290, 1312, 550, 257, 17372, 13 };
    Console.WriteLine($"Input: {string.Join(", ", tokens)}");
    var decoded = tokenizer.Decode(tokens);
    Console.WriteLine($"Decoded: {decoded}");
}
TestTokenizer(tokenizer);

// Prepare the translator
string modelDir = @"D:\MODEL\jesc-ja-en-translator\onnx"; // The folder should contain encoder_model.onnx and decoder_model_merged.onnx
var translator = new JESCJaEnTranslator(tokenizer, modelDir);

void TestTranslator(JESCJaEnTranslator translator)
{
    Console.WriteLine("--Translator test--");
    Translate(translator, "打ち合わせが終わった後にご飯を食べましょう。");
    Translate(translator, "試験前に緊張したあまり、熱がでてしまった。");
    Translate(translator, "山田は英語にかけてはクラスの誰にも負けない。");
    Translate(translator, "この本によれば、最初の人工橋梁は新石器時代にさかのぼるという。");
}
TestTranslator(translator);

static void Translate(JESCJaEnTranslator translator, string sentence)
{
    Console.WriteLine($"SourceText: {sentence}");
    string translated = translator.Translate(sentence);
    Console.WriteLine($"Translated: {translated}");
}
```
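Since a wrong `modelDir` is an easy mistake to make, it may help to check the folder before constructing the translator. The following is a minimal standalone sketch that uses only `System.IO`. `FindMissingModelFiles` is a hypothetical helper and not part of EDMTranslator, and the two file names are taken from the comment in the sample above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper (not part of EDMTranslator): returns the names of the
// expected ONNX files that are not present in modelDir.
static List<string> FindMissingModelFiles(string modelDir)
{
    // File names assumed from the comment next to modelDir in the sample.
    string[] required = { "encoder_model.onnx", "decoder_model_merged.onnx" };
    return required
        .Where(name => !File.Exists(Path.Combine(modelDir, name)))
        .ToList();
}

// Example path only; replace it with the folder where you extracted the model.
string modelDirToCheck = @"D:\MODEL\jesc-ja-en-translator\onnx";
var missing = FindMissingModelFiles(modelDirToCheck);
if (missing.Count > 0)
{
    Console.Error.WriteLine($"Missing in {modelDirToCheck}: {string.Join(", ", missing)}");
}
```

If the list comes back empty, the folder contains both files and can be passed to `JESCJaEnTranslator` as `modelDir`.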
- Prepare the following:
  - .NET build system (`dotnet 6.0, 7.0, 8.0`)
  - PowerShell (`7.4.2` or above recommended)
- Run `cbuild.ps1`

The build artifact will be saved in the `nuget` directory.