26 May 05:18

fa6f4c2

Latest

Sudachi version 0.8.0

CAUTION

The v0.8.* series is intended as an intermediate release series before v1.
Please pin the exact version when using this series, as breaking behavioral changes may be introduced even in patch releases.

Changed

Change PathAnchor behavior for elasticsearch-sudachi (#361)
- PathAnchor.Classpath now loads data via class loader.
- PathAnchor.None no longer resolves paths. You may need to use PathAnchor.filesystem() instead to resolve relative to the CWD.
- Fix PathAnchor.Chain.resource. We recommend using it instead of toResource.
0-th column of DictionaryPrinter output is now normalized (#242)
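
The difference between class-loader resolution and CWD-based filesystem resolution, as described for PathAnchor above, can be sketched with plain JDK calls. This is only an illustration: it does not use Sudachi's PathAnchor API, and the class `ResolutionSketch` and the resource name are hypothetical.

```java
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration of the two lookup strategies; not part of Sudachi.
final class ResolutionSketch {
    // Class-loader lookup: searches the classpath and returns null when the resource is absent.
    static URL fromClasspath(String name) {
        return ResolutionSketch.class.getClassLoader().getResource(name);
    }

    // Filesystem lookup: a relative name is resolved against the current working directory.
    static Path fromWorkingDirectory(String name) {
        return Paths.get(name).toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        String name = "sudachi.json"; // example name only
        System.out.println("classpath:  " + fromClasspath(name));
        System.out.println("filesystem: " + fromWorkingDirectory(name));
    }
}
```

The same relative name can therefore point at different data depending on the strategy, which is why code that relied on implicit resolution may need an explicit anchor after upgrading.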

Added

Add TextNormalizer (#242)
- TextNormalizer normalizes text using the same process as the analysis.

Full Changelog: v0.7.5...v0.8.0

05 Nov 05:55

github-actions

v0.7.5

daa6cb0

Sudachi version 0.7.5

Highlights

The behavior of the dictionary printer and builder has changed (#234)
- DictionaryPrinter now prints word references in the (Surface, POS, Reading) triple format, instead of the line number format.
- DictionaryBuilder now allows the dictionary form to be written in the triple format, not only the line number format.

Added

Benchmark scripts are added (#235)

Fixed

Tutorial and readme are updated (#237, #240)
Config.Resource.asByteBuffer now always returns ByteBuffer with little endian byte order (#239)
- StringUtil.readAllBytes also now returns ByteBuffer with little endian byte order.
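
Why the byte order matters can be shown with the JDK alone. The sketch below does not call Sudachi; `ByteOrderSketch` is a hypothetical class that reads the same four bytes under both orders.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration; not part of Sudachi.
final class ByteOrderSketch {
    // Reads the first four bytes of data as an int using the given byte order.
    static int firstInt(byte[] data, ByteOrder order) {
        return ByteBuffer.wrap(data).order(order).getInt(0);
    }

    public static void main(String[] args) {
        byte[] data = {0x01, 0x00, 0x00, 0x00};
        // A freshly wrapped ByteBuffer is big endian by default.
        System.out.println(ByteBuffer.wrap(data).order());           // BIG_ENDIAN
        System.out.println(firstInt(data, ByteOrder.BIG_ENDIAN));    // 16777216
        System.out.println(firstInt(data, ByteOrder.LITTLE_ENDIAN)); // 1
    }
}
```

Because a new ByteBuffer defaults to big endian, code that reads little-endian data has to set the order explicitly unless the provider of the buffer already did so.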

02 Jul 07:27

github-actions

v0.7.4

4767d60

Sudachi version 0.7.4

Highlights

Add Tokenizer.lazyTokenizeSentences(SplitMode mode, Readable input), which performs analysis lazily and reduces memory usage (#231)
- Tokenizer.tokenizeSentences(SplitMode mode, Reader input) is marked as deprecated.
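
The memory benefit of a lazy API over a Readable can be illustrated without Sudachi. The sketch below is a toy: `LazySentencesSketch` is a hypothetical class that splits only on the ideographic full stop (U+3002), which is an assumption made for brevity and not Sudachi's sentence detection. It pulls input in small chunks and buffers only what is needed for the next sentence.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical toy splitter; not Sudachi's implementation.
final class LazySentencesSketch implements Iterator<String> {
    private static final String STOP = "\u3002"; // ideographic full stop
    private final Readable input;
    private final CharBuffer chunk = CharBuffer.allocate(16);
    private final StringBuilder pending = new StringBuilder();
    private boolean eof = false;

    LazySentencesSketch(Readable input) {
        this.input = input;
    }

    @Override
    public boolean hasNext() {
        // Pull more input only while no complete sentence is buffered.
        while (!eof && pending.indexOf(STOP) < 0) {
            chunk.clear();
            try {
                if (input.read(chunk) < 0) {
                    eof = true;
                    break;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            chunk.flip();
            pending.append(chunk);
        }
        return pending.length() > 0;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int end = pending.indexOf(STOP);
        // Without a terminator, the remaining text is the last sentence.
        int cut = end < 0 ? pending.length() : end + 1;
        String sentence = pending.substring(0, cut);
        pending.delete(0, cut);
        return sentence;
    }

    public static void main(String[] args) {
        Iterator<String> it = new LazySentencesSketch(new StringReader("ab\u3002cd\u3002ef"));
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

An eager variant would have to read the whole input and hold every result before returning, whereas here the caller can process and discard each sentence as it arrives.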

Fixed

Do not segfault on tokenizing with closed dictionary (#217)
The default config sudachi.json sets non-existent property joinKanjiNumeric in JoinNumericPlugin (#221)
Fix incorrect size calculation when expanding (#227)
Update tutorial.md (#226)

26 Jun 02:07

github-actions

v0.7.3

faa14bc

Sudachi version 0.7.3

This is a support release for Elasticsearch/OpenSearch integration 3.1.0 release.

Highlights

Added Config.fromResource method for reading Configs via PathAnchor. (#212)

Internals

Plugin classloading is done by PathAnchor and supports multiple classloaders (#210, #209)

Notes about v0.7.2

Release v0.7.2 contains a subset of the functionality of this release but did not contain crucial features. It is not a broken release, but there are no user-visible changes from v0.7.1.

09 Mar 09:51

github-actions

v0.7.1

48dafcd

Sudachi version 0.7.1

This is a maintenance release

Highlights

Fixed analysis truncation that occurred when analyzing with sentence splitting and the input contained no data that could be treated as splittable sentences
Fixed O(N^2) performance in sentence splitting when underlying reader does not fill buffer fully at once
Stop calling into reader with full buffer
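
The last two fixes concern a general property of java.io.Reader: a single read call may return fewer characters than requested. The sketch below is not Sudachi's code; `PartialReadSketch` and `TrickleReader` are hypothetical names. It shows a fill loop that tolerates short reads and never calls the reader once the buffer is full.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration; not Sudachi's implementation.
final class PartialReadSketch {
    // Returns at most one char per call to simulate a reader that does not fill the buffer.
    static final class TrickleReader extends Reader {
        private final Reader inner;

        TrickleReader(Reader inner) {
            this.inner = inner;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            return inner.read(buf, off, 1);
        }

        @Override
        public void close() throws IOException {
            inner.close();
        }
    }

    // Appends to buf starting at index filled and returns the new fill level.
    // Stops at end of input or when the buffer is full, so read is never called with zero space.
    static int fill(Reader reader, char[] buf, int filled) {
        try {
            while (filled < buf.length) {
                int n = reader.read(buf, filled, buf.length - filled);
                if (n < 0) {
                    break; // end of input
                }
                filled += n;
            }
            return filled;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        char[] buf = new char[4];
        int filled = fill(new TrickleReader(new StringReader("abcdef")), buf, 0);
        System.out.println(new String(buf, 0, filled)); // abcd
    }
}
```

Appending at the current fill level, rather than rescanning or recopying the buffer after each short read, keeps the total work linear in the input size.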

09 Mar 09:51

github-actions

v0.6.4

a576e58

0.6.4

This is a maintenance release

Highlights

Fixed analysis truncation that occurred when analyzing with sentence splitting and the input contained no data that could be treated as splittable sentences
Fixed O(N^2) performance in sentence splitting when underlying reader does not fill buffer fully at once
Stop calling into reader with full buffer

29 Aug 12:50

github-actions

v0.6.3

1856754

Sudachi version 0.6.3

Pre-release

Port relaxed boundary mode from 0.7.0 while keeping ABI compatibility with pre-0.7.0 versions.

16 Aug 03:00

github-actions

v0.7.0

7ba2b99

Sudachi version 0.7.0

Highlights

Tokenizer.tokenize API returns MorphemeList instead of List<Morpheme>. This change is ABI-incompatible with previous versions and applications which use Sudachi require recompilation. The change should be source-compatible with no changes required to the source code which uses Sudachi.
New API: MorphemeList.split, which resplits a C-mode token sequence to a lower level without re-analyzing the whole string.
Added relaxed boundary matching mode for Regex OOV handler
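
The return type change noted above is source-compatible but not ABI-compatible because compiled Java call sites record the full method descriptor, including the return type. The sketch below uses hypothetical stand-ins (`WordList`, `OldApi`, `NewApi`) instead of Sudachi's classes, and assumes, as the source-compatibility claim suggests, that the new return type is still usable as a List.

```java
import java.util.AbstractList;
import java.util.List;

// Hypothetical stand-ins; not Sudachi's classes.
final class AbiSketch {
    // A specialized list type, playing the role of the new return type.
    static final class WordList extends AbstractList<String> {
        private final String[] words;

        WordList(String... words) {
            this.words = words.clone();
        }

        @Override
        public String get(int index) {
            return words[index];
        }

        @Override
        public int size() {
            return words.length;
        }
    }

    // "Old" API shape: declared return type is the List interface.
    static final class OldApi {
        static List<String> tokenize(String text) {
            return new WordList(text.split(" "));
        }
    }

    // "New" API shape: declared return type is the specialized list.
    static final class NewApi {
        static WordList tokenize(String text) {
            return new WordList(text.split(" "));
        }
    }

    // Looks up the declared return type, which is part of the compiled method descriptor.
    static Class<?> returnTypeOf(Class<?> api) {
        try {
            return api.getDeclaredMethod("tokenize", String.class).getReturnType();
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The same caller source compiles against both shapes.
        List<String> before = OldApi.tokenize("a b");
        List<String> after = NewApi.tokenize("a b");
        System.out.println(before.size() + " " + after.size());
        // The descriptors differ, so old call sites cannot link against the new method.
        System.out.println(returnTypeOf(OldApi.class).getName());
        System.out.println(returnTypeOf(NewApi.class).getName());
    }
}
```

Code compiled against the old shape looks for a method returning List and fails at link time with NoSuchMethodError when only the new shape is present, which is why recompiling is enough to fix it.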

21 Jun 01:05

github-actions

v0.6.2

0b9db89

Sudachi version 0.6.2

Highlights

Fixed invalid POS tags which appeared when using user-defined POS tags both in user dictionaries and OOV handlers. You are not affected by this bug if you did not use user-defined POS in OOV handlers.

10 Jun 08:24

github-actions

v0.6.1

c66c7a0

Sudachi version 0.6.1

Highlights

DO NOT USE 0.6.0, IT IS INCOMPATIBLE WITH 0.6.1
Regex OOV plugin has configurable maximum token length
SettingsAnchor renamed to PathAnchor to make its purpose clearer
Add useful Config methods, e.g. for the common case of loading the default configuration with a provided PathAnchor to resolve default paths in another directory.
Filesystem-based PathAnchor now works correctly when a SecurityManager is present (e.g. in Elasticsearch).

Regex OOV length

Use the maxLength field of the plugin configuration object to set the maximum allowed length, in UTF-8 bytes (32 by default). The unit will change to Unicode code points in the future.
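
The two units give very different limits for Japanese text. The sketch below uses only the JDK; `LengthUnitsSketch` is a hypothetical class that is not part of Sudachi, and it compares the UTF-8 byte length with the code point count of the same string.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of Sudachi.
final class LengthUnitsSketch {
    // Length in UTF-8 bytes, the unit currently used by maxLength.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Length in Unicode code points, the announced future unit.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String ascii = "abc";
        String kanji = "\u6771\u4eac\u90fd"; // three kanji, written as escapes to avoid source-encoding issues
        System.out.println(utf8Bytes(ascii) + " bytes, " + codePoints(ascii) + " code points"); // 3 bytes, 3 code points
        System.out.println(utf8Bytes(kanji) + " bytes, " + codePoints(kanji) + " code points"); // 9 bytes, 3 code points
    }
}
```

Kana and common kanji take three bytes each in UTF-8, so the default of 32 bytes allows at most 10 such characters, while the same limit allows 32 ASCII characters.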

Releases: WorksApplications/Sudachi
