Known Words CSV Exporter with MeCab Lemmatization and Advanced Filtering by tunjan · Pull Request #21 · Ajatt-Tools/Japanese

tunjan · 2025-05-08T20:20:49Z

This tool allows users to extract words from their Anki collection, process them (optionally using MeCab for Japanese lemmatization), and export them to a CSV file. It's designed to help users build and maintain lists of known vocabulary, potentially for use with other language learning tools or for analysis.

Core Functionality:

The add-on provides a dialog interface to configure and execute the export process. Users can:

Specify Anki Data Source:
- Filter notes by note type name (e.g., "Japanese," "Basic").
- Select the specific field within those notes that contains the words/sentences to process (e.g., "Expression," "Sentence").
- Set a minimum card interval to only include words from mature cards.
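
The data-source filters above map naturally onto an Anki search string. The following is a minimal sketch, assuming the add-on selects cards through Anki's search syntax (`note:` for the note type, `prop:ivl` for the interval in days); the PR text does not confirm this, and `build_search_query` is a hypothetical helper name.

```python
def build_search_query(note_type: str, min_interval_days: int = 0) -> str:
    """Build an Anki search string for a note type and a minimum interval.

    Hypothetical helper: note type names containing double quotes are not
    handled in this sketch.
    """
    # Quote the whole term so note type names with spaces stay one term.
    parts = [f'"note:{note_type}"']
    if min_interval_days > 0:
        # prop:ivl compares the card interval, measured in days.
        parts.append(f"prop:ivl>={min_interval_days}")
    return " ".join(parts)


query = build_search_query("Japanese", min_interval_days=21)
```

Inside Anki, a string like this could be passed to `mw.col.find_cards(query)` to get the matching card IDs.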
Manage CSV Output:
- Read Existing CSV: Optionally load an existing "known words" CSV. The add-on expects "Word" and "Source" columns.
- Operation Mode:
  - Update Selected CSV: Merge new Anki data with an existing CSV, adding new words, updating sources, and removing words that are no longer found in Anki, but only if their source was previously recorded as anki.
  - Save As New CSV: Export all processed words to a new CSV file.
- Automatic Filename Timestamping: Optionally append a _YYYY-MM-DD_HHMMSS timestamp to new CSV filenames.
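
The update mode and the filename timestamp can be sketched as follows. This is an illustration under assumptions, not the PR's code: the function names are hypothetical, the CSV is represented as a `{word: source}` mapping, and because the description does not say how sources are updated for words present in both places, this sketch simply keeps the existing source.

```python
from datetime import datetime
from pathlib import Path


def merge_known_words(existing: dict[str, str], anki_words: set[str]) -> dict[str, str]:
    """Merge words found in Anki into an existing {word: source} mapping."""
    merged: dict[str, str] = {}
    for word, source in existing.items():
        # Drop words that came from Anki but are no longer found there.
        if source == "anki" and word not in anki_words:
            continue
        merged[word] = source
    for word in sorted(anki_words):
        # Add new words; keep the recorded source of words already present.
        merged.setdefault(word, "anki")
    return merged


def timestamped_path(path: Path, now: datetime) -> Path:
    """Append _YYYY-MM-DD_HHMMSS to the file stem, keeping the extension."""
    stamp = now.strftime("_%Y-%m-%d_%H%M%S")
    return path.with_name(f"{path.stem}{stamp}{path.suffix}")
```

For example, merging `{"食べる": "anki", "猫": "manual", "走る": "anki"}` with the Anki set `{"食べる", "犬"}` keeps 食べる and 猫, drops 走る, and adds 犬 with the source anki.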
Advanced Word Processing (especially for Japanese):
- MeCab Lemmatization:
  - If MeCab (a Japanese morphological analyzer) is available, users can choose to lemmatize words from the Anki field (e.g., "食べました" -> "食べる").
  - Custom Stopwords: Define a list of custom stopwords (lemmas) to be excluded from the export. These can either supplement built-in stopwords (like する, ある) or replace them entirely.
  - Part-of-Speech (POS) Filtering: Common particles, symbols, prefixes, etc., are automatically filtered out during lemmatization.
  - MeCab Test Tool: A built-in utility allows users to test the current MeCab lemmatization settings (stopwords, POS filtering) on sample Japanese text.
- Basic Word Extraction (if MeCab is unavailable or disabled):
  - Words are extracted by splitting the field content by common delimiters and removing punctuation/HTML.
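
A rough sketch of the two processing paths described above. It assumes MeCab output has already been parsed into `(lemma, pos)` pairs with IPADIC-style top-level POS tags; the stopword and POS sets, the delimiter list, and all function names are illustrative assumptions, since the add-on's actual lists are not shown in the PR text.

```python
import html
import re

# Illustrative defaults only; the add-on's real lists are not shown in the PR.
DEFAULT_STOPWORDS = frozenset({"する", "ある"})
EXCLUDED_POS = frozenset({"助詞", "記号", "接頭詞"})  # particles, symbols, prefixes


def effective_stopwords(custom, replace_builtin=False):
    """Custom stopwords either supplement or replace the built-in ones."""
    custom = frozenset(custom)
    return custom if replace_builtin else DEFAULT_STOPWORDS | custom


def filter_lemmas(tokens, stopwords=DEFAULT_STOPWORDS, excluded_pos=EXCLUDED_POS):
    """Return unique lemmas from (lemma, pos) pairs, in first-seen order."""
    result = []
    for lemma, pos in tokens:
        if pos in excluded_pos or lemma in stopwords:
            continue
        if lemma not in result:
            result.append(lemma)
    return result


def basic_extract(field_text: str) -> list[str]:
    """Fallback without MeCab: strip HTML, split on delimiters, trim punctuation."""
    text = re.sub(r"<[^>]+>", " ", field_text)  # drop HTML tags
    text = html.unescape(text)                  # decode entities such as &nbsp;
    # Split on whitespace and a few common ASCII/Japanese delimiters.
    parts = re.split(r"[\s,;/、。・]+", text)
    # Trim leftover punctuation from both ends of each piece.
    words = [p.strip("!?.:\"'()[]「」『』！？") for p in parts]
    return [w for w in words if w]
```

Note that the fallback path cannot segment unspaced Japanese text; it only separates items that are already delimited in the field.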
Dictionary Filtering:
- Optionally filter the extracted words/lemmas against a user-provided dictionary file (plain text, one word per line). Only words/lemmas present in this dictionary will be included in the final CSV.
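
Dictionary filtering amounts to a set-membership check. A minimal sketch, assuming a UTF-8 text file with one word per line (the function names are hypothetical):

```python
from typing import Iterable


def parse_dictionary(lines: Iterable[str]) -> set[str]:
    """Collect one word per line, ignoring blank lines and surrounding whitespace."""
    return {line.strip() for line in lines if line.strip()}


def filter_by_dictionary(words: list[str], dictionary: set[str]) -> list[str]:
    """Keep only words present in the dictionary, preserving input order."""
    return [word for word in words if word in dictionary]
```

An open file object can be passed directly as `lines`, for example `parse_dictionary(open(path, encoding="utf-8"))` wrapped in a `with` block.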
Settings Persistence:
- The dialog remembers the last used settings (paths, filters, options) for convenience.

Key Components:

ExportVocabCsvDialog: The main Qt dialog for user interaction and settings.
KnownWordsProcessor: Handles the core logic of reading CSVs, fetching/processing Anki data, merging word lists, and writing output CSVs.
MeCabProcessor: Encapsulates all MeCab-related functionality, including initialization, lemmatization, POS filtering, stopword management, and self-testing.
Graceful degradation: if MeCab is not installed or not properly configured, the lemmatization features are disabled.
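
One common way to get this kind of graceful degradation is an optional import. The sketch below assumes a Python binding importable as `MeCab` (as provided by the mecab-python3 package); the PR may instead rely on a bundled MeCab executable, in which case the check would look different. `load_optional_module` and `LEMMATIZATION_AVAILABLE` are hypothetical names.

```python
import importlib


def load_optional_module(name: str):
    """Return the imported module, or None if it is missing or fails to load."""
    try:
        return importlib.import_module(name)
    except (ImportError, OSError):
        # Not installed, or installed but its native library cannot be loaded.
        return None


# Assumed binding name; the UI would disable lemmatization when this is False.
mecab = load_optional_module("MeCab")
LEMMATIZATION_AVAILABLE = mecab is not None
```

The dialog could then grey out the MeCab options and fall back to basic word extraction whenever `LEMMATIZATION_AVAILABLE` is false.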

Purpose & Use Cases:

Creating a "known words" list for import into reading assistance tools (e.g., browser extensions that highlight known/unknown words on Japanese websites).
Tracking vocabulary acquisition over time.
Generating word lists for further study or analysis.
Migrating vocabulary data between different systems.

tunjan and others added 10 commits May 8, 2025 22:13

Create main.py

de927b1

Create __init__.py

5a810c0

Update gui.py

981e760

update submodules

2af2a5c

format code

b6396c3

extract function

b33e516

add return type

401bf42

add copyright

4a40858

annotate collection

dd1a05e

move init

be0d125

tunjan wants to merge 10 commits into Ajatt-Tools:main from tunjan:main

2 participants
