Releases: WorksApplications/Sudachi

Sudachi version 0.8.0

26 May 05:18
fa6f4c2

CAUTION

The v0.8.* series is intended as an intermediate release series before v1.
Please pin the exact version when using this series, as breaking behavioral changes may be introduced even in patch releases.

Changed

  • Change PathAnchor behavior for elasticsearch-sudachi (#361)
    • PathAnchor.Classpath now loads data via class loader.
    • PathAnchor.None no longer resolves paths. You may need to use PathAnchor.filesystem() instead to resolve paths relative to the current working directory (CWD).
    • Fix PathAnchor.Chain.resource. We recommend using it instead of toResource.
  • 0-th column of DictionaryPrinter output is now normalized (#242)
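
The PathAnchor changes above distinguish two general lookup strategies: loading through a class loader and resolving against the current working directory. The sketch below illustrates that distinction with the Java standard library only. It is not Sudachi's implementation, and the class and method names (ResolutionSketch, resolveAgainstCwd, resolveViaClassLoader) are hypothetical.

```java
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration; none of these names belong to Sudachi.
class ResolutionSketch {
    /** Resolves a relative name against the current working directory. */
    static Path resolveAgainstCwd(String name) {
        // Paths.get("") denotes the CWD; make it absolute before resolving.
        return Paths.get("").toAbsolutePath().resolve(name);
    }

    /** Looks a resource up through a class loader; null means "not on the classpath". */
    static URL resolveViaClassLoader(String name) {
        return ResolutionSketch.class.getClassLoader().getResource(name);
    }

    public static void main(String[] args) {
        // The filesystem result always exists as a path, even if no such file does.
        System.out.println(resolveAgainstCwd("sudachi.json"));
        // The class loader result depends entirely on what is packaged on the classpath.
        System.out.println(resolveViaClassLoader("sudachi.json"));
    }
}
```

The two lookups can disagree for the same name, which is why code that relied on implicit CWD resolution may need to request it explicitly.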

Added

  • Add TextNormalizer (#242)
    • TextNormalizer normalizes text using the same process as the analysis.

Full Changelog: v0.7.5...v0.8.0

Sudachi version 0.7.5

05 Nov 05:55

Highlights

  • The behavior of the dictionary printer and builder has changed (#234)
    • DictionaryPrinter now prints word references in the (Surface, POS, Reading) triple format, instead of the line number format.
    • DictionaryBuilder now allows the dictionary form to be written in the triple format, not only the line number format.

Added

  • Benchmark scripts are added (#235)

Fixed

  • Tutorial and readme are updated (#237, #240)
  • Config.Resource.asByteBuffer now always returns ByteBuffer with little endian byte order (#239)
    • StringUtil.readAllBytes also now returns ByteBuffer with little endian byte order.
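
Byte order matters here because a freshly created java.nio.ByteBuffer is always big-endian, regardless of platform. The following minimal sketch uses only the standard library to show how the same four bytes decode under each order. ByteOrderSketch and its methods are hypothetical names and not part of Sudachi.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration; not Sudachi code.
class ByteOrderSketch {
    /** Decodes the first 32-bit integer with an explicit little-endian order. */
    static int readFirstIntLittleEndian(byte[] raw) {
        ByteBuffer buffer = ByteBuffer.wrap(raw);
        buffer.order(ByteOrder.LITTLE_ENDIAN);
        return buffer.getInt(0);
    }

    /** Decodes the first 32-bit integer with the default (big-endian) order. */
    static int readFirstIntDefaultOrder(byte[] raw) {
        return ByteBuffer.wrap(raw).getInt(0);
    }

    public static void main(String[] args) {
        byte[] raw = {0x01, 0x00, 0x00, 0x00};
        System.out.println(readFirstIntLittleEndian(raw)); // prints 1
        System.out.println(readFirstIntDefaultOrder(raw)); // prints 16777216
    }
}
```

Code that reads multi-byte values from a buffer whose order was never set explicitly can therefore decode the wrong numbers.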

Sudachi version 0.7.4

02 Jul 07:27

Highlights

  • Add Tokenizer.lazyTokenizeSentences(SplitMode mode, Readable input), which performs analysis lazily and reduces memory usage (#231)
    • Tokenizer.tokenizeSentences(SplitMode mode, Reader input) is marked as deprecated.
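
To illustrate what lazy processing of a Readable means in general, here is a minimal standard-library sketch of a pull-based iterator that reads input only when the caller asks for the next sentence. It is not Sudachi's implementation: LazySentences is a hypothetical class, it treats '.' as the only sentence terminator, and it yields plain strings rather than morphemes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical illustration of lazy, pull-based sentence iteration.
class LazySentences implements Iterator<String> {
    private final Readable input;
    private final CharBuffer chunk = CharBuffer.allocate(16); // deliberately tiny
    private final StringBuilder pending = new StringBuilder();
    private int scanned = 0; // chars of pending already known to hold no terminator
    private boolean eof = false;

    LazySentences(Readable input) {
        this.input = input;
    }

    /** Returns the index of the first '.' in pending, or -1, without rescanning old chars. */
    private int terminatorIndex() {
        for (int i = scanned; i < pending.length(); i++) {
            if (pending.charAt(i) == '.') {
                return i;
            }
        }
        scanned = pending.length();
        return -1;
    }

    /** Pulls one more chunk from the input, or records end of input. */
    private void readChunk() {
        try {
            chunk.clear();
            if (input.read(chunk) < 0) {
                eof = true;
                return;
            }
            chunk.flip();
            pending.append(chunk);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        // Read only as much as is needed to complete one sentence.
        while (terminatorIndex() < 0 && !eof) {
            readChunk();
        }
        return pending.length() > 0;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int end = terminatorIndex();
        int cut = end < 0 ? pending.length() : end + 1; // trailing text without '.' is the last item
        String sentence = pending.substring(0, cut);
        pending.delete(0, cut);
        scanned = 0;
        return sentence;
    }

    public static void main(String[] args) {
        Iterator<String> it = new LazySentences(new StringReader("First. Second. Third"));
        while (it.hasNext()) {
            System.out.println(it.next().trim());
        }
    }
}
```

Because only the text up to the next terminator is buffered, memory use depends on sentence length rather than on the size of the whole input.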

Fixed

  • Do not segfault on tokenizing with closed dictionary (#217)
  • The default config sudachi.json sets non-existent property joinKanjiNumeric in JoinNumericPlugin (#221)
  • Fix incorrect size calculation when expanding (#227)
  • Update tutorial.md (#226)

Sudachi version 0.7.3

26 Jun 02:07

This is a support release for Elasticsearch/OpenSearch integration 3.1.0 release.

Highlights

  • Added Config.fromResource method for reading Configs via PathAnchor. (#212)

Internals

  • Plugin classloading is done by PathAnchor and supports multiple classloaders (#210, #209)

Notes about v0.7.2

Release v0.7.2 contains a subset of the functionality of this release but lacks some crucial features. It is not a broken release, but it has no user-visible changes from v0.7.1.

Sudachi version 0.7.1

09 Mar 09:51

This is a maintenance release

Highlights

  • Fixed analysis truncation when using analysis with sentence splitting and the input does not contain data which can be treated as splittable sentences
  • Fixed O(N^2) performance in sentence splitting when underlying reader does not fill buffer fully at once
  • Stop calling into reader with full buffer

Sudachi version 0.6.4

09 Mar 09:51

This is a maintenance release

Highlights

  • Fixed analysis truncation when using analysis with sentence splitting and the input does not contain data which can be treated as splittable sentences
  • Fixed O(N^2) performance in sentence splitting when underlying reader does not fill buffer fully at once
  • Stop calling into reader with full buffer

Sudachi version 0.6.3

29 Aug 12:50

Pre-release

Port relaxed boundary mode from 0.7.0 while keeping ABI compatibility with pre-0.7.0 versions.

Sudachi version 0.7.0

16 Aug 03:00

Highlights

  • Tokenizer.tokenize API returns MorphemeList instead of List<Morpheme>. This change is ABI-incompatible with previous versions, so applications that use Sudachi must be recompiled. The change should be source-compatible, with no changes required to source code that uses Sudachi.
  • New API: MorphemeList.split: re-splits a C-mode token sequence into lower-level units without re-analyzing the whole string.
  • Added relaxed boundary matching mode for Regex OOV handler
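
The difference between source and ABI compatibility in the Tokenizer.tokenize change can be sketched with stand-in types. On the JVM, a compiled call site records the method's full descriptor, including its return type, so narrowing a return type breaks already-compiled callers (typically with NoSuchMethodError) even though the same source still compiles. The sketch assumes the new return type is itself a List; Token, TokenList and tokenize below are hypothetical stand-ins, not Sudachi's classes.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; not Sudachi's Morpheme, MorphemeList or Tokenizer.
class ReturnTypeChangeSketch {
    static final class Token {
        final String surface;

        Token(String surface) {
            this.surface = surface;
        }
    }

    /** A list type that could carry extra operations beyond plain List. */
    static final class TokenList extends AbstractList<Token> {
        private final List<Token> items;

        TokenList(List<Token> items) {
            this.items = items;
        }

        @Override
        public Token get(int index) {
            return items.get(index);
        }

        @Override
        public int size() {
            return items.size();
        }
    }

    /** "New" signature: returns TokenList where an older version returned List<Token>. */
    static TokenList tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        for (String part : text.split(" ")) {
            tokens.add(new Token(part));
        }
        return new TokenList(tokens);
    }

    public static void main(String[] args) {
        // Caller source written against the old signature compiles unchanged,
        // because a TokenList is assignable to List<Token>.
        List<Token> tokens = tokenize("a b c");
        System.out.println(tokens.size()); // prints 3
    }
}
```

Recompiling the caller regenerates the call site with the new descriptor, which is why recompilation alone is enough.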

Sudachi version 0.6.2

21 Jun 01:05

Highlights

  • Fixed invalid POS tags which appeared when using user-defined POS tags both in user dictionaries and OOV handlers. You are not affected by this bug if you did not use user-defined POS in OOV handlers.

Sudachi version 0.6.1

10 Jun 08:24

Highlights

  • DO NOT USE 0.6.0, IT IS INCOMPATIBLE WITH 0.6.1
  • Regex OOV plugin has configurable maximum token length
  • SettingsAnchor renamed to PathAnchor to make its purpose clearer
  • Add useful Config methods, e.g. for a common case of loading default configuration with provided PathAnchor to resolve default paths in another directory.
  • Filesystem-based PathAnchor now works correctly when a SecurityManager is present (e.g. in Elasticsearch).

Regex OOV length

Use the maxLength field of the plugin configuration object to set the maximum allowed length in UTF-8 bytes (32 by default). The unit will change to Unicode code points in the future.
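
Since the limit is counted in UTF-8 bytes, the same number of characters can consume very different amounts of it. Most Japanese characters take three bytes in UTF-8, so a byte limit is reached much sooner than for ASCII text. The sketch below only compares the two length measures with the standard library. MaxLengthUnits is a hypothetical helper, and how the plugin applies the limit internally is not shown here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for comparing length units; not part of Sudachi.
class MaxLengthUnits {
    /** Length of the string when encoded as UTF-8, in bytes. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Length of the string in Unicode code points. */
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String ascii = "sudachi-0.6.1";
        // Katakana for "sudachi" (3 characters), written as Unicode escapes, repeated 4 times.
        String kana = "\u30B9\u30C0\u30C1".repeat(4);

        System.out.println(utf8Length(ascii) + " bytes, " + codePointLength(ascii) + " code points");
        // prints "13 bytes, 13 code points"
        System.out.println(utf8Length(kana) + " bytes, " + codePointLength(kana) + " code points");
        // prints "36 bytes, 12 code points"
    }
}
```

A 12-character katakana string is thus already longer than the default of 32 when measured in bytes.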