Releases: lanl/T-ELF
v0.0.20
v0.0.19
- Fixes a bug in HNMFk checkpointing where, when continuing from a checkpoint on an HPC system, not all nodes would be free on the job queue.
- Fixes a bug with BST post-order search where the order was incorrect.
- Adds BST in-order search capability. NMFk hyper-parameter changed accordingly:
k_search_method : str, optional
Which approach to use when searching for the rank or k. The default is "linear".
* ``k_search_method='linear'`` will linearly visit each K given in ``Ks`` hyper-parameter of the ``fit()`` function.
* ``k_search_method='bst_post'`` will perform post-order binary search. When an ideal rank is found, determined by the selected ``predict_k_method``, all lower ranks are pruned from the search space.
* ``k_search_method='bst_pre'`` will perform pre-order binary search. When an ideal rank is found, determined by the selected ``predict_k_method``, all lower ranks are pruned from the search space.
* ``k_search_method='bst_in'`` will perform in-order binary search. When an ideal rank is found, determined by the selected ``predict_k_method``, all lower ranks are pruned from the search space.
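The three BST options differ only in the order in which the candidate ranks are visited. The sketch below is a minimal, standalone illustration and not the T-ELF implementation: it assumes a balanced tree built from the sorted ``Ks`` and a user-supplied predicate standing in for the selected ``predict_k_method``. The names `bst_visit_order`, `search_with_pruning`, and `is_ideal` are hypothetical.

```python
def bst_visit_order(ks, method="bst_pre"):
    """Order in which a balanced BST over the sorted ks would be visited (assumed layout)."""
    if method not in ("bst_pre", "bst_in", "bst_post"):
        raise ValueError(f"unknown method: {method}")
    ks = sorted(ks)
    if not ks:
        return []
    mid = len(ks) // 2
    left = bst_visit_order(ks[:mid], method)
    right = bst_visit_order(ks[mid + 1:], method)
    if method == "bst_pre":
        return [ks[mid]] + left + right
    if method == "bst_in":
        return left + [ks[mid]] + right
    return left + right + [ks[mid]]  # bst_post


def search_with_pruning(ks, is_ideal, method="bst_pre"):
    """Visit ks in BST order; once an ideal rank is found, skip all lower ranks."""
    best = None
    visited = []
    for k in bst_visit_order(ks, method):
        if best is not None and k < best:
            continue  # pruned: lower than an already accepted rank
        visited.append(k)
        if is_ideal(k):
            best = k
    return best, visited


# Hypothetical criterion: ranks up to 6 pass the k prediction test.
best, visited = search_with_pruning(range(2, 9), lambda k: k <= 6, "bst_pre")
print(best, visited)
```

In this toy run the pre-order search evaluates four of the seven candidate ranks, which is where the reduction in search space comes from.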
v0.0.18
- Fixes a bug where Ks were not organized correctly for BST post-order and pre-order search.
- Fixes a bug for `H_sill_thresh`; the threshold can now also be set to negative values.
- Adds an option to use the W silhouette, the H silhouette, or both for k prediction. The choice of `predict_k_method` also changes how the BST search is done with `k_search_method`. The following NMFk hyper-parameters are modified accordingly:
predict_k_method : str, optional
Method to use when performing automatic k prediction. Default is "WH_sill".
predict_k_method='pvalue' # will use L-Statistics with column-wise error for automatically estimating the number of latent factors.
predict_k_method='WH_sill' # will use Silhouette scores from minimum of W and H latent factors for estimating the number of latent factors.
predict_k_method='W_sill' # will use Silhouette scores from W latent factor for estimating the number of latent factors.
predict_k_method='H_sill' # will use Silhouette scores from H latent factor for estimating the number of latent factors.
predict_k_method='sill' # will default to ``predict_k_method='WH_sill'``.
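As a rough illustration of how the silhouette-based options differ, the sketch below picks the largest k whose selected silhouette score reaches a threshold. It is a standalone approximation, not the T-ELF code: the function name `predict_k`, the `>=` comparison, and the example scores are assumptions, and the `'pvalue'` option is left out.

```python
def predict_k(ks, w_sil, h_sil, method="WH_sill", sill_thresh=0.8):
    """Largest k whose selected silhouette score is at least sill_thresh (assumed rule)."""
    if method == "sill":
        method = "WH_sill"  # 'sill' defaults to 'WH_sill'
    if method == "W_sill":
        scores = list(w_sil)
    elif method == "H_sill":
        scores = list(h_sil)
    elif method == "WH_sill":
        # minimum of the W and H silhouettes at each k
        scores = [min(w, h) for w, h in zip(w_sil, h_sil)]
    else:
        raise ValueError(f"unsupported method in this sketch: {method}")
    passing = [k for k, s in zip(ks, scores) if s >= sill_thresh]
    return max(passing) if passing else None


# Made-up silhouette scores for k = 2..5.
ks = [2, 3, 4, 5]
w_sil = [0.95, 0.90, 0.85, 0.40]
h_sil = [0.90, 0.85, 0.50, 0.30]
for method in ("W_sill", "H_sill", "WH_sill"):
    print(method, predict_k(ks, w_sil, h_sil, method))
```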
v0.0.17
New Features
- Introduces a new Vulture subclass, `VocabularyConsolidator`, under `TELF.pre_processing.Vulture.tokens_analysis`, designed to consolidate vocabularies and textual terms.
- Refactors NMFk, RESCALk, HNMFk, and SymNMFk to enhance modularity. Helper functions are created under `TELF.factorization.utilities` to modularize the code.
- Adds a new search criterion for identifying the optimal rank, or K, to NMFk, HNMFk, WNMFk, and RNMFk. This enhancement introduces a significant speedup to each algorithm. The new criterion uses a Binary Search Tree to determine the optimal rank, reducing the search space and the time needed for factorization. This K search feature is also compatible with High Performance Computing (HPC) systems: changes made to the K search space by any node are synchronized across all nodes. NMFk has been updated to include new hyper-parameters for these search settings:
k_search_method : str, optional
Which approach to use when searching for the rank or k. The default is "linear".
* ``k_search_method='linear'`` will linearly visit each K given in ``Ks`` hyper-parameter of the ``fit()`` function.
* ``k_search_method='bst_post'`` will perform post-order binary search. When an ideal rank is found with ``min(W silhouette, H silhouette) >= sill_thresh``, all lower ranks are pruned from the search space.
* ``k_search_method='bst_pre'`` will perform pre-order binary search. When an ideal rank is found with ``min(W silhouette, H silhouette) >= sill_thresh``, all lower ranks are pruned from the search space.
H_sill_thresh : float, optional
Setting for removing higher ranks from the search space. The default is -1.
When searching for the optimal rank with binary search using ``k_search_method='bst_post'`` or ``k_search_method='bst_pre'``, this hyper-parameter can be used to cut off higher ranks from the search space.
The cut-off of higher ranks from the search space is based on a threshold for the H silhouette. When an H silhouette below ``H_sill_thresh`` is found for a given rank or K, all higher ranks are removed from the search space.
If ``H_sill_thresh=-1``, it is not used.
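The upper cut-off described for ``H_sill_thresh`` can be pictured with a small standalone sketch. This is an illustration under assumptions, not the library code: `prune_higher_ranks` is a hypothetical helper, and keeping the rank k itself in the search space is a choice made here for the example.

```python
def prune_higher_ranks(remaining_ks, k, h_silhouette, H_sill_thresh=-1):
    """Drop ranks above k when the H silhouette at k falls below H_sill_thresh."""
    if H_sill_thresh == -1:
        return list(remaining_ks)  # cut-off disabled, as in the v0.0.17 default
    if h_silhouette < H_sill_thresh:
        # assumption: k itself stays, only strictly higher ranks are removed
        return [c for c in remaining_ks if c <= k]
    return list(remaining_ks)


# At k=5 the H silhouette (0.1) is below the threshold (0.3), so 6, 7, 8 are removed.
print(prune_higher_ranks([2, 3, 4, 5, 6, 7, 8], k=5, h_silhouette=0.1, H_sill_thresh=0.3))
```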
Bugs
- Fixes a bug in RESCALk plotting where the plotting function was expecting W and H silhouettes.
- Fixes a bug where k prediction would not work if none of the `W` or `H` silhouettes are above the `sill_thresh` hyper-parameter. In that case, the fix selects a new `sill_thresh` based on the rule `self.sill_thresh = min([max(sils_min_W), max(sils_min_H)])`.
- Fixes a bug in document substitutions of Vulture where an error is raised if no corpus substitutions are passed.
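The fallback rule for `sill_thresh` can be tried out in isolation. The sketch below wraps the quoted expression in a hypothetical function, `resolve_sill_thresh`; the `>=` comparison used to decide whether any silhouette clears the threshold is an assumption, as are the example scores.

```python
def resolve_sill_thresh(sils_min_W, sils_min_H, sill_thresh):
    """Keep sill_thresh if any W or H silhouette reaches it, else fall back."""
    all_scores = list(sils_min_W) + list(sils_min_H)
    if any(s >= sill_thresh for s in all_scores):
        return sill_thresh
    # rule from the release note: min([max(sils_min_W), max(sils_min_H)])
    return min([max(sils_min_W), max(sils_min_H)])


# No score reaches 0.8, so the threshold falls back to min(0.6, 0.7).
print(resolve_sill_thresh([0.6, 0.5, 0.3], [0.7, 0.4, 0.2], sill_thresh=0.8))
```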
v0.0.16
- Fixes a bug in the HPC HNMFk capability where checkpointing would not save when using the custom callback functionality.
- Fixes a bug in the stopwords option in Vulture Clean that excludes hyphens from stop word checks (a boolean was used in place of an iterable).
- Fixes a bug to flatten the output dictionary in the Vulture Acronyms module (a dictionary iteration bug).
- Fixes a bug where the `itertools` import was missing in Vulture materials permutations.
- Fixes a bug in Vulture materials permutations for the `save_path` definition.
- Adds Ks range and X shape checks for HNMFk to make sure the decomposition can still be done when using the callback functionality.
- Adds a feature to include lowercased materials in permutations.
- Adds future for material permutations.
- Adds multithreaded string consolidation in Levenshtein.
- Changes the Levenshtein consolidation criterion from the shortest string to the most common string.
- Moves HNMFk leaf node termination, based on the sample threshold, to after factorization, so that the latent factors W and H are obtained even for nodes where the number of samples is less than the threshold.
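To make the change in the Levenshtein consolidation criterion concrete, the sketch below contrasts the two ways of choosing a representative for a group of near-duplicate strings. It is a hypothetical illustration only: the grouping step (finding strings within some edit distance) is assumed to have happened already, and the function names are not taken from T-ELF.

```python
from collections import Counter


def pick_shortest(variants):
    """Earlier criterion: represent the group by its shortest string."""
    return min(variants, key=len)


def pick_most_common(variants):
    """Newer criterion: represent the group by its most frequent string."""
    return Counter(variants).most_common(1)[0][0]


# A made-up group of similar tokens as they might occur in a corpus.
group = ["nanoparticle", "nano-particle", "nanoparticle", "nanoparticl"]
print(pick_shortest(group))
print(pick_most_common(group))
```

With the shortest-string rule a truncated or misspelled variant can win; the most-common rule favors the spelling that actually dominates the corpus.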
v0.0.15
- Fixes a bug where a Vulture Acronym Operator edge case produced wrong results when using substitutions.
- Fixes a bug where Vulture cleaning operations for stop words would not remove hyphenated words if they contain a stop word.
- Fixes minor bugs where conda environment activation was done incorrectly in the HPC example scripts.
- Reorganizes the Vulture Acronym Operator example notebook to show when the cleaning is done and when the acronym operation is done with substitutions.
- Fixes the acronym warning message, which printed a class attribute instead of the data.
- Adds HPC capability to HNMFk.
- Adds checkpointing capability for HNMFk.
- Adds online node operations for HNMFk, reducing the space taken by graph nodes.
- Adds per document based substitutions operator feature to Vulture.
- Adds Levenshtein distance based acronym consolidation for post-processing of acronyms.
v0.0.14
- Adds callback functionality to HNMFk for generating new data matrix X at each NMFk application. This allows Semantic HNMFk by re-generating TF-IDF matrix at each node.
- Adds capability to HNMFk for saving custom user data in each node when using `generate_X_callback`.
- Records the X shape and Ks range after pruning, and notes in the prune status when decomposition is no longer possible after pruning.
- HNMFk now uses Path library to generate sub-directories automatically.
- Fixed bug where max(Ks) is more than min(X.shape) after pruning in NMFk.
- Fixed a bug where HNMFk is loading wrong factors when k=2 is True.
- Fixed a bug where NMFk would try to decompose data after pruning even if not possible (for example, if the number of samples left is 1, or the K range is empty based on the rule `k < min(X.shape)`).
- Fixed a bug where `Beaver.get_vocabulary()` was not consistent with the vocabulary that is generated in the other matrix creation routines.
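The feasibility rule `k < min(X.shape)` mentioned in these fixes is easy to check by hand. The snippet below is a minimal sketch with a hypothetical helper, `feasible_ks`, that filters a K range against a matrix shape; it does not reproduce how NMFk or HNMFk perform the check internally.

```python
def feasible_ks(ks, shape):
    """Keep only the ranks that satisfy k < min(X.shape)."""
    limit = min(shape)
    return [k for k in ks if k < limit]


# After pruning, X might shrink to 5 rows: only k = 2, 3, 4 remain valid.
print(feasible_ks(range(2, 10), (5, 100)))

# With a single sample left the K range is empty, so no decomposition is possible.
print(feasible_ks(range(2, 10), (1, 100)))
```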
v0.0.13
- Adds HNMFk: hierarchical non-negative matrix factorization with automatic model determination, with custom settings including missing value prediction. HNMFk has multi-processing capabilities for both CPU and GPU systems. HPC capabilities for HNMFk are planned to be added later.
- Fixes a bug in the HPC example for WNMFk where the number of nodes was not correct in the hyper-parameters.
v0.0.12
- Added ability to plot both silhouettes of latent patterns (W matrix) and the latent clusters (H matrix) to assist selecting the number of hidden patterns and the corresponding number of hidden clusters.
- `predict_k_method` default is changed to `"sill"`.
- NMFk plot will no longer include the blue relative error line when `calculate_error=False`.
- New `predict_k_method="sill"` will predict k based on:
  - The maximum k where W silhouette is above the threshold `sill_thresh`: Wk
  - The maximum k where H silhouette is above the threshold `sill_thresh`: Hk
  - Final k, or number of hidden signals, will be `k=min(Wk, Hk)`.
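The `k=min(Wk, Hk)` rule from this release can be written out as a short standalone sketch. The function name `predict_k_sill`, the `>=` comparison, and the example scores are assumptions made for illustration; this is not the NMFk source.

```python
def predict_k_sill(ks, w_sil, h_sil, sill_thresh):
    """Return min(Wk, Hk), where each is the largest k passing the threshold."""
    wk = max((k for k, s in zip(ks, w_sil) if s >= sill_thresh), default=None)
    hk = max((k for k, s in zip(ks, h_sil) if s >= sill_thresh), default=None)
    if wk is None or hk is None:
        return None  # no rank passes for at least one of the factors
    return min(wk, hk)


# Made-up scores for k = 2..5: Wk = 4, Hk = 3, so the final k is 3.
print(predict_k_sill([2, 3, 4, 5],
                     w_sil=[0.95, 0.90, 0.85, 0.40],
                     h_sil=[0.90, 0.85, 0.50, 0.30],
                     sill_thresh=0.8))
```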
v0.0.11
- Adds acronym identification and substitution capability to Vulture.
- Fixes the dependency list in .yml files for installation.