rtbs-dev/affinis

Affinis

Tools for inferring relations from binary co-occurrence data

Affinis is a library of tools for assisting in unsupervised structure learning on sparse, binary data.

What does it help with?

With large, sparse feature matrices, especially ones with binary or integer-valued entries, a common task is to recover the underlying structure of the feature space from the observations alone.

For example, given a bag-of-words matrix (a type of NLP embedding), the goal is to work out how the tokens/concepts (columns) in the corpus relate to each other, using only the set of documents (rows) that record which tokens co-occur.
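To make the setup concrete, here is a minimal sketch of a binary bag-of-words matrix and the raw co-occurrence counts derived from it. It uses plain NumPy rather than any affinis function, and the toy tokens and documents are invented for illustration.

```python
import numpy as np

# Toy corpus (made up for illustration): each set is one document.
tokens = ["graph", "tree", "forest", "pump"]
docs = [
    {"graph", "tree"},
    {"tree", "forest"},
    {"graph", "tree", "forest"},
    {"pump"},
]

# Binary bag-of-words matrix X: rows are documents, columns are tokens.
X = np.array([[int(t in d) for t in tokens] for d in docs])

# Co-occurrence counts: entry (i, j) is the number of documents that
# contain both token i and token j; the diagonal holds per-token counts.
cooc = X.T @ X
print(cooc)
```

Raw counts like these are only a starting point: they are dominated by frequent tokens, which is one reason so many different normalizations and inference techniques exist.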

Techniques for this vary widely, and different communities hold different practices and assumptions about what counts as an appropriate approach. Affinis provides a library of implementations, with a consistent interface, for approaching this problem.

What's inside?

Affinis should be considered a prototype intended to support research and community benchmarking (approximately TRL 4-5).

The library's core features live in the associations module. It collects functions from a wide variety of disciplines that accept a feature matrix $X$ with $n$ features (columns) and return an $n\times n$ square matrix of association measures.
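As an illustration of that input/output contract, the sketch below implements one classic association measure, the Ochiai coefficient (cosine similarity between binary columns), in plain NumPy. The function name `ochiai_association` is hypothetical and is not claimed to match any name or signature in the associations module.

```python
import numpy as np


def ochiai_association(X):
    """Cosine similarity between the columns of a binary matrix X.

    Hypothetical helper for illustration only, not part of affinis.
    Takes an (m x n) matrix and returns an (n x n) association matrix.
    """
    X = np.asarray(X, dtype=float)
    cooc = X.T @ X                     # pairwise co-occurrence counts
    counts = np.diag(cooc)             # how often each feature occurs
    denom = np.sqrt(np.outer(counts, counts))
    # Features that never occur get association 0 instead of NaN.
    return np.divide(cooc, denom, out=np.zeros_like(cooc), where=denom > 0)


# Four observations (rows) of three binary features (columns).
X = np.array([
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
    [0, 0, 0],
])
A = ochiai_association(X)
print(A.round(3))
```

The result is symmetric with ones on the diagonal for any feature that occurs at least once, which is the shape of output described above.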

Other things to see:

  • Reference implementations of our new Forest Pursuit algorithm.

    Forest Pursuit is lazily executable, trivially parallelizable, and scales approximately linearly with the size of your feature matrix for diffusion-like problems (quadratically in the worst case otherwise).

  • Universal smoothing API: use pseudocts= to apply a Beta-Binomial prior
  • Uses the PyData sparse library to avoid fully instantiating $X$ in memory
  • Plotting utilities (including a vectorized implementation of so-called Hinton diagrams)
  • Linear-algebra-based graph utilities:
    • Edge probability in random spanning trees/forests
    • Minimum-connectivity graph weight thresholding
    • Closed-form edge-to-node-pair index mapping for undirected graph edge subsampling
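Three of the ideas in the list above can be sketched in a few lines of NumPy. None of this is the affinis implementation: all function names are hypothetical, the smoothing sketch assumes the pseudocount is split evenly between "present" and "absent" (which may differ from how pseudocts= is actually interpreted), and the spanning-tree sketch assumes a connected graph given as a symmetric weight matrix with a zero diagonal.

```python
import math
import numpy as np


def smoothed_frequency(successes, trials, pseudocts=1.0):
    """Posterior-mean rate under an assumed symmetric Beta(pseudocts, pseudocts) prior."""
    return (successes + pseudocts) / (trials + 2.0 * pseudocts)


def edge_index_to_pair(k, n):
    """Map flat edge index k to the node pair (i, j) with i < j.

    Edges of an undirected graph on n nodes are assumed to be numbered in
    row-major upper-triangular order: (0,1), (0,2), ..., (n-2, n-1).
    """
    total = n * (n - 1) // 2
    if not 0 <= k < total:
        raise IndexError("edge index out of range")
    # Integer square root keeps the closed form exact for large n.
    i = n - 2 - (math.isqrt(8 * (total - k) - 7) - 1) // 2
    j = k + i + 1 - total + (n - i) * (n - i - 1) // 2
    return i, j


def spanning_tree_edge_probs(W):
    """Probability that each edge appears in a random spanning tree.

    Uses the identity P(edge ij) = w_ij * R_ij, where R_ij is the effective
    resistance computed from the pseudoinverse of the graph Laplacian.
    Assumes W is symmetric, non-negative, zero on the diagonal, and connected.
    """
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian
    Lp = np.linalg.pinv(L)                  # Moore-Penrose pseudoinverse
    d = np.diag(Lp)
    R = d[:, None] + d[None, :] - 2.0 * Lp  # effective resistances
    return W * R


# Triangle graph with unit weights: each edge is in 2 of the 3 spanning trees.
W = np.ones((3, 3)) - np.eye(3)
P = spanning_tree_edge_probs(W)
pairs = [edge_index_to_pair(k, 4) for k in range(6)]
rate = smoothed_frequency(3, 10, pseudocts=1.0)
```

A useful sanity check for the edge probabilities is that they sum to the number of nodes minus one over the distinct edges, since every spanning tree has exactly that many edges.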

Work-in-Progress:

  • Gibbs-sampling technique for fully Bayesian semiparametric edge probability estimation

Installation

Affinis is currently awaiting pre-publication review. For development purposes, a reference installation is available via pip:

pip install git+https://github.com/usnistgov/affinis.git

Other Information

Contact the PI

Rachael Sexton

  • rachael.sexton@nist.gov
  • NIST Engineering Laboratory
  • Systems Integration Division
  • Information Modeling & Testing Group

Related Material

  • Link to documentation webpage: WIP
  • Original work first describing Forest Pursuit: dissertation link
  • Citation:

    AWAITING PUBLICATION APPROVAL
