Commit dc39c0e

Add NER support
1 parent b5cdd9a commit dc39c0e

15 files changed, 1837 additions and 753 deletions

README.md

+30-6
@@ -24,10 +24,12 @@ sist2 (Simple incremental search tool)
 * Recursive scan inside archive files \*\*
 * OCR support with tesseract \*\*\*
 * Stats page & disk utilisation visualization
+* Named-entity recognition (client-side) \*\*\*\*

 \* See [format support](#format-support)
 \*\* See [Archive files](#archive-files)
 \*\*\* See [OCR](#ocr)
+\*\*\*\* See [Named-Entity Recognition](#NER)

 ## Getting Started

@@ -56,7 +58,7 @@ services:
 entrypoint: python3 /root/sist2-admin/sist2_admin/app.py
 ```

-Navigate to http://localhost:8080/ to configure sist2-admin.
+Navigate to http://localhost:8080/ to configure sist2-admin.

 ### Using the executable file *(Linux/WSL only)*

@@ -67,10 +69,9 @@ Navigate to http://localhost:8080/ to configure sist2-admin.
 docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.17.9
 ```

-2. Download the [latest sist2 release](https://github.com/simon987/sist2/releases).
-Select the file corresponding to your CPU architecture and mark the binary as executable with `chmod +x`.
-3. See [usage guide](docs/USAGE.md) for command line usage.
-
+2. Download the [latest sist2 release](https://github.com/simon987/sist2/releases).
+Select the file corresponding to your CPU architecture and mark the binary as executable with `chmod +x`.
+3. See [usage guide](docs/USAGE.md) for command line usage.

 Example usage:

@@ -124,7 +125,7 @@ The `simon987/sist2` image comes with common languages
 (hin, jpn, eng, fra, rus, spa, chi_sim, deu) pre-installed.

 You can use the `+` separator to specify multiple languages. The language
-name must be identical to the `*.traineddata` file installed on your system
+name must be identical to the `*.traineddata` file installed on your system
 (use `chi_sim` rather than `chi-sim`).

 Examples:
@@ -135,6 +136,29 @@ sist2 scan --ocr-images --ocr-lang eng ~/Images/Screenshots/
 sist2 scan --ocr-ebooks --ocr-images --ocr-lang eng+chi_sim ~/Chinese-Bilingual/
 ```

+### NER
+
+sist2 v3.0.4+ supports named-entity recognition (NER). Simply add a supported repository URL to
+**Configuration** > **Machine learning options** > **Model repositories**
+to enable it.
+
+The text processing is done in your browser, no data is sent to any third-party services.
+See [simon987/sist2-ner-models](https://raw.githubusercontent.com/simon987/sist2-ner-models/main/repo.json) for more details.
+
+#### List of available repositories:
+
+| URL | Maintainer | Purpose |
+|---------------------------------------------------------------------------------------------------------|-----------------------------------------|---------|
+| [simon987/sist2-ner-models](https://raw.githubusercontent.com/simon987/sist2-ner-models/main/repo.json) | [simon987](https://github.com/simon987) | General |
+
+
+<details>
+<summary>Screenshot</summary>
+
+![ner](docs/ner.png)
+
+</details>
+
 ## Build from source

 You can compile **sist2** by yourself if you don't want to use the pre-compiled binaries
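
Reviewer note on the OCR hunk in this README diff: the text says `--ocr-lang` codes are joined with `+` and that each code must match an installed `*.traineddata` file name exactly. As a rough illustration only (this is not part of the commit or of sist2), a small Python helper could check a value before starting a scan. The function name `missing_ocr_languages` is hypothetical, and the tessdata directory is whatever location your Tesseract install uses; the sketch does not assume a default path.

```python
from pathlib import Path


def missing_ocr_languages(ocr_lang: str, tessdata_dir: Path) -> list[str]:
    """Return the codes in `ocr_lang` with no matching `<code>.traineddata` file.

    `ocr_lang` uses the `+` separator described in the README, e.g. "eng+chi_sim".
    Hypothetical helper for illustration; sist2 itself does not ship it.
    """
    # File stems of installed models, e.g. "chi_sim" for "chi_sim.traineddata".
    installed = {p.stem for p in tessdata_dir.glob("*.traineddata")}
    # Split on "+" and ignore empty pieces such as a trailing separator.
    requested = [code for code in ocr_lang.split("+") if code]
    # Exact, case-sensitive comparison: "chi-sim" does not match "chi_sim".
    return [code for code in requested if code not in installed]
```

With only `eng.traineddata` and `chi_sim.traineddata` present in the directory, the value `eng+chi_sim` yields an empty list, while `eng+chi-sim` reports `chi-sim` as missing, which is the mistake the README warns about.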

docs/ner.png

448 KB