Releases: WorksApplications/Sudachi
Sudachi version 0.8.0
CAUTION
The v0.8.* is intended as an intermediate release series before the v1.
Please pin the exact version when using this series, as breaking behavioral changes may be introduced even in patch releases.
Changed
- Change PathAnchor behavior for elasticsearch-sudachi (#361)
  - PathAnchor.Classpath now loads data via class loader.
  - PathAnchor.None no longer resolves paths. You may need to use PathAnchor.filesystem() instead to resolve based on CWD.
  - Fix PathAnchor.Chain.resource. We recommend using it instead of toResource.
- 0-th column of DictionaryPrinter output is now normalized (#242)
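To illustrate the distinction behind the PathAnchor change, here is a minimal sketch in plain Java that does not use any Sudachi classes. It contrasts looking a name up through a class loader with resolving it against the current working directory. The class ResolutionSketch, its methods, and the file name sudachi.json are hypothetical and chosen only for illustration; Sudachi's actual implementation may differ.

```java
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration; not part of the Sudachi API.
final class ResolutionSketch {
    // Class-loader lookup: returns null when the resource is not on the classpath.
    static URL fromClasspath(String name) {
        return ResolutionSketch.class.getClassLoader().getResource(name);
    }

    // Filesystem lookup: a relative name is resolved against the current working directory.
    static Path fromWorkingDirectory(String name) {
        return Paths.get("").toAbsolutePath().resolve(name);
    }

    public static void main(String[] args) {
        System.out.println("classpath:  " + fromClasspath("sudachi.json"));
        System.out.println("filesystem: " + fromWorkingDirectory("sudachi.json"));
    }
}
```

The two lookups can disagree: a file that exists in the working directory is invisible to the class loader unless that directory is on the classpath, which is why code that relied on implicit resolution may need an explicit filesystem anchor.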
Added
- Add TextNormalizer (#242)
- TextNormalizer normalizes text using the same process as the analysis.
Full Changelog: v0.7.5...v0.8.0
Sudachi version 0.7.5
Highlights
- The behavior of the dictionary printer and builder has changed (#234)
  - DictionaryPrinter now prints word references in the (Surface, POS, Reading) triple format instead of the line number format.
  - DictionaryBuilder now allows the dictionary form to be written in the triple format, not only the line number format.
Added
- Benchmark scripts are added (#235)
Fixed
Sudachi version 0.7.4
Highlights
- Add Tokenizer.lazyTokenizeSentences(SplitMode mode, Readable input), which performs analysis lazily and reduces memory usage (#231)
  - Tokenizer.tokenizeSentences(SplitMode mode, Reader input) is marked as deprecated.
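The memory benefit of lazy analysis comes from pulling input only when the next sentence is requested. The following is a minimal sketch of that idea in plain Java; it is not Sudachi's implementation and does not call the Sudachi API. The class LazySentences, the 16-char chunk size, and splitting only on the ideographic full stop (U+3002) are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical illustration of lazy sentence iteration; not Sudachi's implementation.
final class LazySentences implements Iterator<String> {
    private static final String STOP = "\u3002"; // ideographic full stop

    private final Readable input;
    private final CharBuffer chunk = CharBuffer.allocate(16); // small on purpose
    private final StringBuilder pending = new StringBuilder();
    private boolean eof = false;

    LazySentences(Readable input) {
        this.input = input;
    }

    @Override
    public boolean hasNext() {
        // Read more input only while no complete sentence is buffered.
        while (!eof && pending.indexOf(STOP) < 0) {
            try {
                chunk.clear();
                if (input.read(chunk) < 0) {
                    eof = true;
                } else {
                    chunk.flip();
                    pending.append(chunk);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return pending.length() > 0;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int end = pending.indexOf(STOP);
        // Without a terminator, the remaining text is returned as the last sentence.
        int cut = end < 0 ? pending.length() : end + 1;
        String sentence = pending.substring(0, cut);
        pending.delete(0, cut);
        return sentence;
    }
}
```

A caller would wrap any Readable, for example new LazySentences(new StringReader(text)), and process one sentence per call to next(). Only the unread tail of the current chunk is held in memory, rather than the results for the whole input.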
Fixed
Sudachi version 0.7.3
This is a support release for Elasticsearch/OpenSearch integration 3.1.0 release.
Highlights
- Added Config.fromResource method for reading Configs via PathAnchor (#212)
Internals
Notes about v0.7.2
Release v0.7.2 contains a subset of the functionality of this release but lacks crucial features. It is not a broken release, but it has no user-visible changes from v0.7.1.
Sudachi version 0.7.1
This is a maintenance release
Highlights
- Fixed analysis truncation when using analysis with sentence splitting and the input does not contain data which can be treated as splittable sentences
- Fixed O(N^2) performance in sentence splitting when underlying reader does not fill buffer fully at once
- Stop calling into reader with full buffer
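The last two fixes concern readers that return fewer characters than requested. As a minimal sketch of the general pattern, and not Sudachi's actual code, the loop below keeps reading into the free part of a buffer until it is full or the input ends, and never calls the reader when no space remains. The class BufferFill and the helper trickle, which simulates a reader that yields one char per call, are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration; not Sudachi's implementation.
final class BufferFill {
    // Reads until the buffer is full or the reader is exhausted.
    // Returns the number of chars now held in the buffer.
    static int fill(Reader reader, char[] buffer, int filled) {
        try {
            while (filled < buffer.length) { // never call read() with no free space
                int n = reader.read(buffer, filled, buffer.length - filled);
                if (n < 0) {
                    break; // end of input
                }
                filled += n; // a short read only advances the offset
            }
            return filled;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // A reader that returns at most one char per call, to simulate partial fills.
    static Reader trickle(String text) {
        return new StringReader(text) {
            @Override
            public int read(char[] cbuf, int off, int len) throws IOException {
                return super.read(cbuf, off, Math.min(len, 1));
            }
        };
    }
}
```

Because each short read appends at the current offset instead of restarting work on the data already buffered, the total cost stays linear in the input size even when the reader delivers one char at a time.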
0.6.4
This is a maintenance release
Highlights
- Fixed analysis truncation when using analysis with sentence splitting and the input does not contain data which can be treated as splittable sentences
- Fixed O(N^2) performance in sentence splitting when underlying reader does not fill buffer fully at once
- Stop calling into reader with full buffer
Sudachi version 0.6.3
Port relaxed boundary mode from 0.7.0 while keeping ABI compatibility with pre-0.7.0 versions.
Sudachi version 0.7.0
Highlights
- Tokenizer.tokenize API returns MorphemeList instead of List<Morpheme>. This change is ABI-incompatible with previous versions and applications which use Sudachi require recompilation. The change should be source-compatible with no changes required to the source code which uses Sudachi.
- New API: MorphemeList.split: resplit C-mode token sequence to lower level without re-analyzing the whole string.
- Added relaxed boundary matching mode for Regex OOV handler
Sudachi version 0.6.2
Highlights
- Fixed invalid POS tags which appeared when using user-defined POS tags both in user dictionaries and OOV handlers. You are not affected by this bug if you did not use user-defined POS in OOV handlers.
Sudachi version 0.6.1
Highlights
- DO NOT USE 0.6.0, IT IS INCOMPATIBLE WITH 0.6.1
- Regex OOV plugin has configurable maximum token length
- SettingsAnchor renamed to PathAnchor to make its purpose clearer
- Add useful Config methods, e.g. for a common case of loading default configuration with provided PathAnchor to resolve default paths in another directory.
- Filesystem-based PathAnchor now works correctly when a SecurityManager is present (e.g. in Elasticsearch).
Regex OOV length
Use the maxLength field of the plugin configuration object to set the maximum allowed length, in UTF-8 bytes (32 by default). The unit will change to Unicode code points in the future.
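Since the limit is currently counted in UTF-8 bytes and not in characters, the same maxLength allows far fewer Japanese characters than ASCII ones. The sketch below shows the two measurements side by side using only the Java standard library. The class LengthUnits and its method names are hypothetical and are not part of the Sudachi API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the Sudachi API.
final class LengthUnits {
    // Length in UTF-8 bytes, the unit currently used by maxLength.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Length in Unicode code points, the unit planned for the future.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String ascii = "sudachi";
        String kana = "\u3059\u3060\u3061"; // "sudachi" written in hiragana
        System.out.println(utf8Bytes(ascii) + " bytes, " + codePoints(ascii) + " code points");
        System.out.println(utf8Bytes(kana) + " bytes, " + codePoints(kana) + " code points");
    }
}
```

Most kana and kanji occupy 3 bytes in UTF-8, so the default of 32 bytes corresponds to roughly 10 such characters.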