Known Words CSV Exporter with MeCab Lemmatization and Advanced Filtering#21

Open
tunjan wants to merge 10 commits into Ajatt-Tools:main from tunjan:main

tunjan commented May 8, 2025

This tool allows users to extract words from their Anki collection, process them (optionally using MeCab for Japanese lemmatization), and export them to a CSV file. It's designed to help users build and maintain lists of known vocabulary, potentially for use with other language learning tools or for analysis.

Core Functionality:

The add-on provides a dialog interface to configure and execute the export process. Users can:

  1. Specify Anki Data Source:

    • Filter notes by note type name (e.g., "Japanese," "Basic").
    • Select the specific field within those notes that contains the words/sentences to process (e.g., "Expression," "Sentence").
    • Set a minimum card interval to only include words from mature cards.
  2. Manage CSV Output:

    • Read Existing CSV: Optionally load an existing "known words" CSV. The add-on expects "Word" and "Source" columns.
    • Operation Mode:
      • Update Selected CSV: Merge new Anki data into an existing CSV. This adds new words, updates sources, and removes words that are no longer found in Anki (only if their Source was previously marked as anki).
      • Save As New CSV: Export all processed words to a new CSV file.
    • Automatic Filename Timestamping: Optionally append a _YYYY-MM-DD_HHMMSS timestamp to new CSV filenames.
  3. Advanced Word Processing (especially for Japanese):

    • MeCab Lemmatization:
      • If MeCab (Japanese morphological analyzer) is available, users can choose to lemmatize words from the Anki field (e.g., "食べました" -> "食べる").
      • Custom Stopwords: Define a list of custom stopwords (lemmas) to be excluded from the export. These can either supplement built-in stopwords (like する, ある) or replace them entirely.
      • Part-of-Speech (POS) Filtering: Common particles, symbols, prefixes, etc., are automatically filtered out during lemmatization.
      • MeCab Test Tool: A built-in utility allows users to test the current MeCab lemmatization settings (stopwords, POS filtering) on sample Japanese text.
    • Basic Word Extraction (if MeCab is unavailable or disabled):
      • Words are extracted by splitting the field content by common delimiters and removing punctuation/HTML.
  4. Dictionary Filtering:

    • Optionally filter the extracted words/lemmas against a user-provided dictionary file (plain text, one word per line). Only words/lemmas present in this dictionary will be included in the final CSV.
  5. Settings Persistence:

    • The dialog remembers the last used settings (paths, filters, options) for convenience.
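
The "Update Selected CSV" merge from item 2 could be sketched roughly as follows. This is an illustration under stated assumptions, not the add-on's actual code: the function names `merge_known_words` and `rows_to_csv` and the literal source label `"anki"` are hypothetical, and only the "Word" and "Source" column names come from the description above. How the add-on updates the Source of a word that exists in both places is not specified here, so the sketch simply keeps the existing value.

```python
import csv
import io

ANKI_SOURCE = "anki"  # assumed label for rows that originated from Anki


def merge_known_words(existing_rows, anki_words):
    """Merge words found in Anki into rows loaded from an existing CSV.

    existing_rows: list of dicts with "Word" and "Source" keys.
    anki_words: set of words currently found in the Anki collection.
    """
    merged = {}
    for row in existing_rows:
        word, source = row["Word"], row["Source"]
        # Drop rows that came from Anki but are no longer found there.
        if source == ANKI_SOURCE and word not in anki_words:
            continue
        merged[word] = source
    for word in anki_words:
        # Add new words; keep the existing source for words already listed.
        merged.setdefault(word, ANKI_SOURCE)
    return [{"Word": w, "Source": s} for w, s in sorted(merged.items())]


def rows_to_csv(rows):
    """Serialize rows to CSV text with the expected "Word,Source" header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["Word", "Source"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Rows with any other Source value (for example, manually added words) are never removed by this sketch, which matches the "if previously marked as anki" condition in the description.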

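The non-MeCab path (item 3, basic word extraction) and the dictionary filter (item 4) might look something like the sketch below. The function names and the exact delimiter set are assumptions made for illustration; the description only says that content is split on "common delimiters" with punctuation and HTML removed, and that the dictionary is a plain-text file with one word per line.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
# Assumed delimiter set: whitespace plus common ASCII and Japanese punctuation.
DELIMITER_RE = re.compile(r"[\s,.;:!?()\[\]、。・「」『』！？]+")


def extract_words(field_text):
    """Strip HTML tags and entities, then split on delimiters."""
    text = TAG_RE.sub(" ", field_text)
    text = html.unescape(text)
    return [token for token in DELIMITER_RE.split(text) if token]


def filter_by_dictionary(words, dictionary_lines):
    """Keep only words that appear in the dictionary (one word per line)."""
    allowed = {line.strip() for line in dictionary_lines if line.strip()}
    return [word for word in words if word in allowed]
```

Because Japanese text has no spaces between words, this kind of splitting only works well when the field holds a single word or a delimited list, which is presumably why lemmatization with MeCab is offered as the primary path.
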
Key Components:

  • ExportVocabCsvDialog: The main Qt dialog for user interaction and settings.
  • KnownWordsProcessor: Handles the core logic of reading CSVs, fetching/processing Anki data, merging word lists, and writing output CSVs.
  • MeCabProcessor: Encapsulates all MeCab-related functionality, including initialization, lemmatization, POS filtering, stopword management, and self-testing.
  • Graceful degradation: if MeCab is not installed or not properly configured, the lemmatization features are disabled and basic word extraction is used instead.
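
To make the lemmatization, POS filtering, and stopword handling attributed to MeCabProcessor more concrete, here is a minimal sketch that works on MeCab's raw text output rather than calling MeCab itself. It assumes the IPADIC output layout (surface form, a tab, then comma-separated features with the base form in the seventh position). The function name, the excluded POS set, and the two default stopwords are illustrative assumptions; the PR's actual lists are not shown in this description.

```python
# POS tags to skip, using IPADIC names (assumed set):
# particles, auxiliary verbs, symbols, prefixes.
EXCLUDED_POS = {"助詞", "助動詞", "記号", "接頭詞"}
# Stand-in for the built-in stopwords mentioned in the description.
DEFAULT_STOPWORDS = {"する", "ある"}

# Typical IPADIC-style output for "食べました".
SAMPLE = (
    "食べ\t動詞,自立,*,*,一段,連用形,食べる,タベ,タベ\n"
    "まし\t助動詞,*,*,*,特殊・マス,連用形,ます,マシ,マシ\n"
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
    "EOS\n"
)


def lemmas_from_mecab_output(raw_output, custom_stopwords=(), replace_defaults=False):
    """Extract lemmas from IPADIC-style MeCab output text."""
    # Custom stopwords either supplement or replace the defaults.
    stopwords = set(custom_stopwords)
    if not replace_defaults:
        stopwords |= DEFAULT_STOPWORDS
    lemmas = []
    for line in raw_output.splitlines():
        if line == "EOS" or "\t" not in line:
            continue
        surface, feature_text = line.split("\t", 1)
        features = feature_text.split(",")
        if features[0] in EXCLUDED_POS:
            continue
        # Base form is the 7th feature; "*" means it is unknown,
        # in which case the surface form is used instead.
        if len(features) > 6 and features[6] != "*":
            lemma = features[6]
        else:
            lemma = surface
        if lemma not in stopwords:
            lemmas.append(lemma)
    return lemmas


print(lemmas_from_mecab_output(SAMPLE))  # → ['食べる']
```

Other MeCab dictionaries (for example UniDic) use a different feature layout, so the index of the base form would need adjusting there.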

Purpose & Use Cases:

  • Creating a "known words" list for import into reading assistance tools (e.g., browser extensions that highlight known/unknown words on Japanese websites).
  • Tracking vocabulary acquisition over time.
  • Generating word lists for further study or analysis.
  • Migrating vocabulary data between different systems.
