This project provides a multithreaded Naive Bayesian text classification
library written in TypeScript for NodeJS. It is designed for efficient
supervised learning and probabilistic classification of textual data into
predefined categories. The classifier supports parallelized directory
training via Node’s worker_threads, enabling significant performance
gains on multicore systems.
- Naive Bayes text classification with Laplace (add-one) smoothing
- Multithreaded training for improved scalability
- Support for arbitrary categories and document-based datasets
- Porter stemming for word normalization, improving feature detection
- Higher-order (multi-word) models for improved flexibility and accuracy on large datasets
- Automatic vocabulary management
- Stopword filtering from file
- Softmax scoring and maximum likelihood classification utilities
- Synchronous and asynchronous training modes
- Simple API suitable for classification of natural language documents
Clone or copy the module into your NodeJS project:
git clone https://github.com/thepixelist11/bayesian-text-classifier.git
cd bayesian-text-classifier
npm install

Ensure that your NodeJS version supports ES modules and worker_threads (Node ≥ 16 recommended).
Below is an example of how the classifier may be used in a NodeJS environment to classify categories and star ratings for Amazon reviews:
import path from "path";
import url from "url";
import * as btc from "./bayesian-text-classifier.js";
const __dirname = url.fileURLToPath(new URL('.', import.meta.url));
const main = async () => {
  const stopwords = btc.getStopwordsFromFile(path.join(__dirname, "english-stopwords"));
  const classifier_stars = new btc.Classifier(stopwords, 2); // Using bigrams (depth 2) for stars
  const classifier_category = new btc.Classifier(stopwords);

  await classifier_stars.trainDir(path.join(__dirname, "../data/amazon-reviews-stars"));
  await classifier_category.trainDir(path.join(__dirname, "../data/amazon-reviews-category"));

  const reviews = [
    "These eyelashes do the job for the price...",
    "They burned my feet, but they did keep them warm...",
    "It was too small to fit any paper and the ink was purple...",
  ];

  for (const text of reviews) {
    const stars_scores = classifier_stars.classify(text, btc.softmax);
    const category_scores = classifier_category.classify(text, btc.softmax);
    console.log(`${text}: ${btc.getMostLikely(stars_scores)} stars, ${btc.getMostLikely(category_scores)}`);
  }
};
main();

This example requires directories data/amazon-reviews-stars and
data/amazon-reviews-category containing data in the format below.
Each category used for training must correspond to either:
- A directory containing one or more text documents. The structure should resemble the following:
data/
├── amazon-reviews-stars/
│ ├── 1/
│ │ ├── review1.txt
│ │ └── review2.txt
│ ├── 2/
│ ├── 3/
│ ├── 4/
│ └── 5/
└── amazon-reviews-category/
├── electronics/
├── beauty/
└── home/
Each subdirectory name represents a category label. The classifier will read all files within each category during training.
- A file containing one document per line. The structure should resemble the following:
data/
├── amazon-reviews-stars/
│ ├── 1.txt
│ ├── 2.txt
│ ├── 3.txt
│ ├── 4.txt
│ └── 5.txt
└── amazon-reviews-category/
├── electronics.txt
├── beauty.txt
└── home.txt
Each file name represents a category label. The classifier will read all lines within each category during training.
Typically, for short documents (like Amazon reviews) the file-based approach is ideal, while for large documents the directory-based approach works better. Note that for very large datasets, creating one file per document may exhaust filesystem resources (e.g., inode limits on Linux).
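For instance, a hypothetical data/amazon-reviews-stars/5.txt in the file-based layout would hold one five-star review per line:

These eyelashes do the job for the price...
Great product, exactly as described and arrived early.
Perfect fit, very comfortable, would buy again.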
Reads a file containing stopwords (one or more per line, separated by non-word characters) and returns a Set of stopwords.
Computes the softmax transformation of a log-probability value given the entire score set. Useful for converting log-likelihoods into normalized probabilities.
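Per the outputFn signature used by Classifier.classify (see below), softmax receives one score and the full score set. A minimal sketch of the standard form, assuming the scores are log-probabilities (the library's actual implementation may differ):

// Sketch only: softmax of one log-score against the full set of log-scores.
// A production version would subtract Math.max(...vals) first for numerical stability.
const softmaxSketch = (val: number, vals: number[]): number =>
  Math.exp(val) / vals.reduce((sum, v) => sum + Math.exp(v), 0);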
Returns the label corresponding to the highest score in a given score map.
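Conceptually this is an argmax over the map's entries; a minimal sketch of the equivalent behavior:

// Sketch only: return the key with the highest value; throws on an empty map.
const getMostLikelySketch = (scores: Record<string, number>): string =>
  Object.entries(scores).reduce((best, cur) => (cur[1] > best[1] ? cur : best))[0];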
Constructs a new Naive Bayesian classifier instance.
The stopwords set is used to exclude frequent, non-informative terms.
The depth (defaults to 1) sets the order of the classifier. Higher
depths increase training time but can improve accuracy by capturing
multi-word contexts, at the cost of a sparser feature space (see the
curse of dimensionality). Experiment with different depths to see what
works best for your use case.
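For intuition, a depth-n classifier treats every 1- to n-word sequence in the (stopword-filtered, stemmed) token stream as a feature. The sketch below illustrates the idea; it is not the library's internal tokenizer, and ngrams is a hypothetical helper:

// Illustrative only: derive all 1..depth word sequences from a token stream.
const ngrams = (tokens: string[], depth: number): string[] => {
  const out: string[] = [];
  for (let n = 1; n <= depth; n++) {
    for (let i = 0; i + n <= tokens.length; i++) {
      out.push(tokens.slice(i, i + n).join(" "));
    }
  }
  return out;
};

// ngrams(["eyelash", "do", "job"], 2)
// -> ["eyelash", "do", "job", "eyelash do", "do job"]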
The alpha (defaults to 1) sets the constant for additive (Laplace)
smoothing, which smooths count data, especially for terms the classifier
has not been trained on.
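With additive smoothing, the likelihood of a term w in a category c is estimated along the lines of:

P(w | c) = (count(w, c) + alpha) / (totalTokens(c) + alpha * |V|)

where |V| is the vocabulary size. With alpha = 1 (add-one smoothing), a term never observed in c still receives a small non-zero probability rather than zeroing out the whole document's score.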
The stem parameter (defaults to true) determines whether Porter
stemming is used to normalize words by reducing them to their stems
(e.g., "connected", "connecting", and "connection" all reduce to
"connect"). All stopwords are filtered before words are stemmed.
Trains the classifier on a single text sample associated with the specified category.
Trains the classifier on the contents of a single file under a given category.
Trains the classifier in parallel using worker threads. Each category (or file) is processed by a separate worker thread to accelerate training.
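Assuming the call mirrors trainDir from the example above (an assumption; check the exported signature), usage would look like:

// Assumption: trainDirParallel accepts the same directory path as trainDir.
await classifier_stars.trainDirParallel(path.join(__dirname, "../data/amazon-reviews-stars"));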
Classifier.classify(document: string, outputFn?: (val: number, vals: number[]) => number): Record<string, number>
Computes the log-probability (or transformed probability) for each category given the input document.
If outputFn (e.g., softmax) is provided, the results are transformed accordingly.
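For example, using the classifiers from the example above:

const logScores = classifier_category.classify("The ink was purple"); // raw log-probabilities
const probs = classifier_category.classify("The ink was purple", btc.softmax); // normalized probabilities
console.log(btc.getMostLikely(probs)); // label with the highest probability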
Computes prior and likelihood probabilities based on accumulated word and document counts.
This method is invoked automatically after trainDir() or
trainDirParallel(), and must have been called before attempting classification.
- Uses Laplace smoothing (α defaults to 1.0) to prevent zero-probability words.
- Maintains an explicit vocabulary to ensure consistent probability computation.
- Stopwords are ignored during tokenization to improve classification accuracy.
- The classifier expects UTF-8 encoded text files. Support for additional formats will be implemented in the future.
A minimal english-stopwords file might look like:

the
a
an
and
or
in
on
with
at
of
to
for
This project is licensed under the MIT License. You may freely use, modify, and distribute this software with proper attribution.