Darcara/TextAnalysis

Sentence splitting, named entity recognition, translation and more


Sentence splitting with SaT / WtP

Segment Any Text (SaT, June 2024) is the successor to Where's the Point (WtP, July 2023). The code for both papers is available on GitHub.
SaT supports 85 languages; the detailed list is available in the project's GitHub README.
Models for SaT come in 3 flavors:

  • Base models with 1, 3, 6, 9 or 12 layers available on HuggingFace
    More layers means higher accuracy, but longer inference time
  • Low-Rank Adaptation (LoRA) modules are available for 3 and 12 layer base models in their respective repositories
    The LoRA modules enable the base models to be adapted to specific domains and styles
  • Supervised Mixture (sm) models with 1, 3, 6, 9 or 12 layers available on HuggingFace
    SM models have been trained with a "supervised mixture" of diverse styles and corruptions. They score higher on both English and multilingual text.

This project supports the *-sm model family in ONNX format.
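A minimal usage sketch follows; the type and method names (SaTSentencizer, FromPretrained, Sentencize) are illustrative assumptions, not necessarily this project's actual API:

```csharp
// Sketch: split raw text into sentences with a sat-*-sm ONNX model.
// SaTSentencizer, FromPretrained and Sentencize are hypothetical names,
// not necessarily this project's actual API.
using System;

// Load the ONNX model together with its sentencepiece tokenizer.
using var sentencizer = SaTSentencizer.FromPretrained("sat-3l-sm");

const string text = "this is a test this is another test";
foreach (string sentence in sentencizer.Sentencize(text))
    Console.WriteLine(sentence);
```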

Configuration

The SaT models benefit greatly from GPU acceleration. When running on the GPU, setting the SessionConfiguration.Batching to batch=4 is best. When running on the CPU, setting the SessionConfiguration.Batching to batch=1 with InterOperationThreads=1 and IntraOperationThreads=2 gives the best results. Higher values for IntraOperationThreads will slightly decrease computing time, but use a lot more processing power. It is preferable to sentencize multiple texts in parallel instead.
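As a sketch, the two recommended configurations might be expressed like this (SessionConfiguration and Batching are named as in the paragraph above; the exact shape of the types is an assumption):

```csharp
// Sketch of the recommended session settings. Property names follow the
// paragraph above; the precise shape of the types is an assumption.
var cpuConfig = new SessionConfiguration {
    Batching = Batching.Size(1),   // one text per model call on the CPU
    InterOperationThreads = 1,
    IntraOperationThreads = 2,     // higher values cost more CPU than they save
};

var gpuConfig = new SessionConfiguration {
    Batching = Batching.Size(4),   // larger batches amortize GPU transfer overhead
};
```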

A consuming project must reference a proper ONNX runtime. For Windows deployments the Microsoft.ML.OnnxRuntime.DirectML NuGet package (with prereleases) together with the Microsoft.AI.DirectML NuGet package (with prereleases) will yield the best performance. Setting the RuntimeIdentifier in the project csproj to win-x64 is required.
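The relevant csproj entries might look like this (package versions are placeholders; pick current prereleases from NuGet):

```xml
<!-- Sketch of the relevant csproj entries; versions are placeholders. -->
<PropertyGroup>
  <RuntimeIdentifier>win-x64</RuntimeIdentifier>
</PropertyGroup>

<ItemGroup>
  <PackageReference Include="Microsoft.ML.OnnxRuntime.DirectML" Version="1.*-*" />
  <PackageReference Include="Microsoft.AI.DirectML" Version="1.*-*" />
</ItemGroup>
```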

Evaluation

The corpora scores are taken from the original SaT GitHub repository.
This benchmark used the novel "The Adventures of Tom Sawyer, Complete" by Mark Twain from Project Gutenberg.
The -model columns give the speed of the model runtime alone, whereas -complete includes all pre- and post-processing of the data, including the word tokenization.

| Model | English Score | Multilingual Score | CPU-model | CPU-complete | GPU-model | GPU-complete |
| --- | --- | --- | --- | --- | --- | --- |
| sat-1l | 88.5 | 84.3 | | | | |
| sat-1l-sm | 88.2 | 87.9 | | | | |
| sat-3l | 93.7 | 89.2 | | | | |
| sat-3l-sm | 96.5 | 93.5 | | | | |
| sat-6l | 94.1 | 89.7 | | | | |
| sat-6l-sm | 96.9 | 95.1 | | | | |
| sat-9l | 94.3 | 90.3 | | | | |
| sat-12l | 94.0 | 90.4 | | | | |
| sat-12l-sm | 97.4 | 96.0 | | | | |

Implementation notes

Word tokenization is done with sentencepiece using the xlm-roberta-base (Alt1, Alt2) model. It is used from C# with the help of the SentencePieceTokenizer library.
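As a rough sketch of that step (the type and member names below are assumptions about the SentencePieceTokenizer library, not its documented API):

```csharp
// Sketch: encode text into xlm-roberta-base word pieces before inference.
// Type and member names are assumptions, not the library's documented API.
using var tokenizer = new SentencePieceTokenizer("sentencepiece.bpe.model");

// The token ids feed the SaT ONNX model; character offsets are needed
// afterwards to map predicted boundaries back into the original string.
int[] ids = tokenizer.Encode("Some text to split into sentences.");
```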

Language Identification

| Library | Model | Accuracy | Reliability¹ | Time per prediction² | Memory³ | Unsupported Languages |
| --- | --- | --- | --- | --- | --- | --- |
| Panlingo.CLD2 | CLD2 | word: 32.14%<br>pairs: 65.14%<br>sent: 91.18% | word: 93.97%<br>pairs: 91.58%<br>sent: 94.91% | 0.005ms | 15 KiB | Hebrew, Norwegian_Bokmal |
| Panlingo.CLD3 | CLD3 | word: 43.22%<br>pairs: 60.53%<br>sent: 84.17% | word: 48.07%<br>pairs: 64.70%<br>sent: 87.23% | 0.038ms | 15 KiB | Hebrew, Ganda, Norwegian_Bokmal, Norwegian_Nynorsk, Tagalog, Tswana, Tsonga |
| Panlingo.FastText | FastText - 176 compressed | word: 45.36%<br>pairs: 58.92%<br>sent: 78.43% | word: 52.34%<br>pairs: 60.10%<br>sent: 78.43% | 0.086ms | 10 MiB | Ganda, Maori, Norwegian_Bokmal, Shona, Sotho_Southern, Tswana, Tsonga, Xhosa, Zulu |
| Panlingo.FastText | FastText - 176 | word: 50.70%<br>pairs: 64.55%<br>sent: 80.71% | - | 0.104ms | 142 MiB | Ganda, Maori, Norwegian_Bokmal, Shona, Sotho_Southern, Tswana, Tsonga, Xhosa, Zulu |
| Panlingo.FastText | FastText - 217 | word: 52.01%<br>pairs: 69.98%<br>sent: 84.87% | - | 0.940ms | 1.18 GiB | Arabic, Azerbaijani, Persian, Latin, Latvian, Mongolian, Malay_macrolanguage, Albanian, Swahili_macrolanguage |
| FastText.NetWrapper | FastText - 176 | word: 50.70%<br>pairs: 64.55%<br>sent: 80.71% | - | 0.009ms | 135 MiB | Ganda, Maori, Norwegian_Bokmal, Shona, Sotho_Southern, Tswana, Tsonga, Xhosa, Zulu |
| FastText.NetWrapper | FastText - 217 | word: 52.01%<br>pairs: 69.98%<br>sent: 84.87% | - | 0.081ms | 1.13 GiB | Arabic, Azerbaijani, Persian, Latin, Latvian, Mongolian, Malay_macrolanguage, Albanian, Swahili_macrolanguage |
| Panlingo.Whatlang | Whatlang | word: 40.32%<br>pairs: 51.31%<br>sent: 68.23% | - | 0.042ms | 75 KiB | Bosnian, Welsh, Basque, Persian, Irish, Icelandic, Kazakh, Ganda, Maori, Mongolian, Malay_macrolanguage, Norwegian_Nynorsk, Somali, Albanian, Sotho_Southern, Swahili_macrolanguage, Tswana, Tsonga, Xhosa, Yoruba, Chinese |
| Panlingo.Lingua | Lingua - Low accuracy | word: 60.16%<br>pairs: 78.35%<br>sent: 93.38% | word: 62.38%<br>pairs: 80.20%<br>sent: 93.99% | 0.089ms | 100 MiB | - |
| Panlingo.Lingua | Lingua - High accuracy | word: 73.94%<br>pairs: 89.06%<br>sent: 96.01% | - | 0.264ms | 100 MiB | - |
| SearchPioneer.Lingua<br>pure .NET port | Lingua - Low accuracy | word: 59.89%<br>pairs: 78.23%<br>sent: 93.28% | - | 0.452ms | 100 MiB | - |
| SearchPioneer.Lingua | Lingua - High accuracy | word: 73.64%<br>pairs: 88.98%<br>sent: 95.83% | - | 0.565ms | 100 MiB | - |

¹ Reliability is the sum of the accurate predictions and the 'unknown' predictions of a library. It is usually better for a library to report that it cannot ascertain the language than to return a random wrong one.
² Average over 1000 single-word, word-pair and sentence predictions for each language. Actual timings will depend on your CPU.
³ Approximate memory requirement for one predictor. Includes native memory for libraries, rules and ML models.
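For orientation, detection with the Lingua family typically looks like the sketch below. The member names follow the upstream Lingua API; the exact namespaces and casing in the .NET ports may differ:

```csharp
// Sketch of language detection with a Lingua-style API. Names follow the
// upstream Lingua library; the .NET ports may differ slightly.
using SearchPioneer.Lingua;

var detector = LanguageDetectorBuilder
    .FromAllLanguages()   // restricting the language set improves speed and accuracy
    .Build();

Language language = detector.DetectLanguageOf("languages are awesome");
```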


Compiling ONNX Runtime on Windows

Prerequisites

Reference https://onnxruntime.ai/docs/build/inferencing.html

  • Python 3.12
  • CMake
  • Visual Studio 2022 (with MSVC v143 C++ x64/x86 BuildTools(v14.41-17.11))
  • Make sure the build folder is empty or missing before starting
git clone https://github.com/microsoft/onnxruntime
-- or --
git fetch
git checkout v1.20.1

onnxruntime> PATH=%PATH%;C:\Program Files\Python312
onnxruntime> build.bat --cmake_path "C:\Program Files\CMake\bin\cmake.exe" --ctest_path "C:\Program Files\CMake\bin\ctest.exe" --config Release --build_shared_lib --parallel --compile_no_warning_as_error --skip_tests --use_mimalloc --use_dml

These options currently cause problems and should be avoided:
--build_nuget and --use_extensions

The resulting binaries will be in build\Windows\Release\Release.
