- Project Overview
- Project Hierarchy
- Project Prerequisites
- Data-Driven Approach
- Knowledge-Injected Approach
- Original Project
- Publications
- Acknowledgements
- License
- Contact
This codebase implements LearnSPN, a structure learning algorithm for Sum-Product Networks (SPNs), a class of models now commonly referred to as Probabilistic Circuits (PCs). The project was originally created and licensed by Robert Gens and Pedro Domingos; further details are provided in their paper, Learning the Structure of Sum-Product Networks (ICML 2013).
The codebase has been extended and adapted for the Neural Probabilistic Circuit (NPC) project to construct and generate PCs in two ways:
- Data-Driven Approach: Uses the LearnSPN algorithm to automatically learn circuit structures from data. The project has been modified to output structures in a standardized format compatible with the NPC pipeline.
- Knowledge-Injected Approach: Supports manually defined circuit structures that encode human domain knowledge directly, enabling explicit logical reasoning within the NPC framework.
Originally designed to handle only binary distributions, the implementation has been modified to support one-hot categorical distributions. Together, these capabilities make this codebase the foundation for generating the PCs used throughout the NPC project.
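The move from binary to one-hot categorical distributions amounts to representing each categorical variable as a group of mutually exclusive binary indicators. A minimal illustration (`to_one_hot` is a hypothetical helper for exposition, not part of the codebase):

```python
# Illustrative sketch of the one-hot encoding assumed by the extended
# implementation; `to_one_hot` is a hypothetical helper, not project code.

def to_one_hot(value, num_categories):
    """Map a categorical value in [0, num_categories) to 0/1 indicators."""
    row = [0] * num_categories
    row[value] = 1
    return row

# A 3-category variable becomes three binary indicators,
# exactly one of which is set per instance.
print(to_one_hot(2, 3))  # [0, 0, 1]
```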
This project is part of the NPC pipeline. To ensure compatibility and maintain consistent references across the pipeline, organize the project directories as follows:
npc
├── datasets
├── learnspn
├── npc-dataset-utils
├── npc-models
└── venv
All subsequent instructions assume the above project hierarchy.
The PC datasets used by this project reside under npc/learnspn/data. These datasets are generated by the npc-dataset-utils project into the npc/datasets directory; the entries under npc/learnspn/data reference them.
Before running this project, first ensure that all datasets, including the PC datasets, are properly set up under npc/datasets by following the instructions in the npc-dataset-utils project.
Then, verify that the symlinks under npc/learnspn/data correctly point to the corresponding generated PC datasets within npc/datasets, or replace the symlinks with the actual PC dataset files.
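The symlink check above can be scripted. A minimal sketch (the function name is illustrative, and the path assumes the hierarchy shown earlier):

```python
# Sketch: find broken symlinks under npc/learnspn/data (path per the
# project hierarchy above; `broken_links` is an illustrative name).
from pathlib import Path

def broken_links(data_dir):
    """Return symlinks in data_dir whose targets no longer exist."""
    return [p for p in Path(data_dir).iterdir()
            if p.is_symlink() and not p.exists()]

# Usage: print(broken_links("npc/learnspn/data"))
```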
This project requires the following system packages:
Ubuntu:
apt install openjdk-17-jdk python3.10 python3-venv
Arch Linux:
yay -S jdk17-openjdk python310
Java 21 has also been verified to work with this project, though Java 17 is recommended for maximum compatibility.
This project was developed on Ubuntu and tested on both Ubuntu and Arch Linux. Other Linux distributions, macOS, or Windows Subsystem for Linux (WSL) may also work with additional setup. However, these platforms are not officially supported.
This project is designed to run within a simple Python virtual environment. Create and activate the environment as follows:
cd npc
deactivate
python3.10 -m venv venv
source venv/bin/activate
python3.10 -m pip install -r learnspn/requirements.txt
Always ensure the virtual environment is activated before running the project.
Start by reviewing npc/learnspn/scripts/learnspn/learnspn.bash for all permissible dataset prefixes and ensure all relevant parameters within the script are set to the desired values.
Next, confirm that the number of variables and instances defined in npc/learnspn/src/data/Discretized.java for each dataset split matches the dimensions of the corresponding PC dataset splits under npc/learnspn/data.
For example:
public static class AwA2 extends Discretized
{
public AwA2()
{
super("awa2", 5, 29857, 3732, 3733);
}
}
Based on the above declaration for the AwA2 dataset, there must be:
- 5 values per line and 3,733 lines in npc/learnspn/data/awa2.test.data
- 5 values per line and 29,857 lines in npc/learnspn/data/awa2.ts.data
- 5 values per line and 3,732 lines in npc/learnspn/data/awa2.valid.data
It is critical to configure these dimensions correctly in npc/learnspn/src/data/Discretized.java: the original implementation does not validate the number of variables and instances, so it may run to completion and silently produce incorrect results if the dimensions are misconfigured.
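Since the implementation does not validate dimensions itself, a quick sanity check before launching a run can save time. A sketch (the function name is illustrative, and it assumes comma-separated rows; adjust the delimiter if your files differ):

```python
# Sketch: verify a split file's dimensions against the values declared in
# Discretized.java (illustrative helper; assumes comma-separated rows).

def check_split(path, num_vars, num_instances):
    with open(path) as f:
        rows = [line.strip() for line in f if line.strip()]
    if len(rows) != num_instances:
        raise ValueError(f"{path}: expected {num_instances} lines, got {len(rows)}")
    for i, row in enumerate(rows):
        if len(row.split(",")) != num_vars:
            raise ValueError(f"{path}, line {i + 1}: expected {num_vars} values")

# Usage (AwA2 test split per the declaration above):
# check_split("npc/learnspn/data/awa2.test.data", 5, 3733)
```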
Once the above parameters are verified, compile and run the LearnSPN algorithm:
cd npc/learnspn/scripts/learnspn
./learnspn.bash <dataset prefix>
The constructed PC is stored as npc/learnspn/outputs/learnspn/<dataset prefix>.spn.txt.
Start by reviewing npc/learnspn/scripts/manual/manual.py and ensure all relevant parameters within the script are set to the desired values.
Then, the manual PC is constructed as follows:
cd npc/learnspn/scripts/manual
./manual.py
The constructed PC is stored as npc/learnspn/outputs/manual/<dataset prefix>.spn.txt.
The following README was authored by the creators of the original project, Robert Gens and Pedro Domingos. The original instructions are included for reference only and may no longer apply to the latest version of the project.
LearnSPN Version 1.0
6/17/13 Robert Gens rcg@cs.washington.edu
This is raw, unoptimized research code. We provide the subroutines as described in the paper: pairwise independence (G-test) and online hard EM over a naive Bayes mixture model. We will likely expand this code with other subroutines.
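The second subroutine mentioned above, hard EM over a naive Bayes mixture, can be sketched as follows. This is an illustrative batch re-implementation in Python for binary data, not the project's Java code, and it omits the online aspect:

```python
# Illustrative hard-EM clustering of binary rows under a naive Bayes mixture,
# in the spirit of the subroutine described above (batch sketch, not the
# project's online Java implementation).
import math
import random

def hard_em(data, k, iters=10, alpha=1.0, seed=0):
    """Return a hard cluster assignment for each binary row in `data`."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    assign = [rng.randrange(k) for _ in range(n)]
    for _ in range(iters):
        # M-step: Laplace-smoothed Bernoulli parameters per cluster.
        sizes = [2.0 * alpha] * k
        ones = [[alpha] * d for _ in range(k)]
        for row, c in zip(data, assign):
            sizes[c] += 1.0
            for j, v in enumerate(row):
                ones[c][j] += v
        # E-step (hard): reassign each row to its most likely cluster.
        for i, row in enumerate(data):
            scores = []
            for c in range(k):
                ll = math.log(sizes[c])  # cluster prior, up to a constant
                for j, v in enumerate(row):
                    theta = ones[c][j] / sizes[c]
                    ll += math.log(theta if v else 1.0 - theta)
                scores.append(ll)
            assign[i] = max(range(k), key=scores.__getitem__)
    return assign
```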
Below are several commands to run experiments as in the paper (run from the main directory). Java bytecode is provided so you don't have to compile.
If you have any questions about the code or paper, don't hesitate to email me (rcg@cs.washington.edu).
**Run LearnSPN**
This command will run LearnSPN, iterate over a few smoothing values, and then output the SPN with the highest validation log-likelihood to d12.spn.
java -cp bin exp.RunSLSPN DATA 12 GF 10 CP 0.6 INDEPINST 4 N d12.spn
**Compute the log-likelihood for the test set**
Shows the average LL and total time at the end.
java -cp bin exp.inference.SPNInfPLL DATA 12 N d12.spn
**Compute the pseudo log-likelihood for the test set**
Shows the average PLL and total time at the end.
java -cp bin exp.inference.SPNInfPLL DATA 12 N d12.spn
**Generate 1000 queries for each of 10 proportions from test set**
This will create a file for query settings (.q) and a corresponding file for evidence settings (.ev)
mkdir data/nltcs/
java -cp bin exp.inference.GenQEV DATA 12
**Run inference over the set of queries with 30% query and 20% evidence**
Lists the CLL of each instance. Shows the average CLL and total time at the end.
java -cp bin exp.inference.SPNInf N d12.spn Q nltcs/VE_Q0.30_E0.20.q EV nltcs/VE_Q0.30_E0.20.ev
**Run same queries as previous example but with marginal inference (CMLL)**
java -cp bin exp.inference.SPNInfCMLL N d12.spn Q nltcs/VE_Q0.30_E0.20.q EV nltcs/VE_Q0.30_E0.20.ev
Parameters:
"N" is the filename of the saved/loaded SPN
"INDEPINST" is just to prevent the algorithm from needlessly running pairwise independence tests on matrices with this many or fewer instances (in which case all variables are independent)
Grid search in paper:
Cluster penalty "CP": {0.2, 0.4, 0.6, 0.8}
Significance threshold "GF": {10 , 15} (corresponding to p-values 0.0015 and 0.0001, respectively)
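The pairwise G-test behind the "GF" threshold can be sketched for two binary variables (an illustrative Python version; the Java implementation may differ in details):

```python
# Illustrative G-test for pairwise independence of two binary variables,
# a sketch of the subroutine named above (not the project's Java code).
import math

def g_statistic(xs, ys):
    """G = 2 * sum_ab O_ab * ln(O_ab / E_ab) over the 2x2 contingency table."""
    n = len(xs)
    obs = [[0, 0], [0, 0]]
    for x, y in zip(xs, ys):
        obs[x][y] += 1
    g = 0.0
    for a in (0, 1):
        for b in (0, 1):
            o = obs[a][b]
            if o > 0:
                e = (obs[a][0] + obs[a][1]) * (obs[0][b] + obs[1][b]) / n
                g += 2.0 * o * math.log(o / e)
    return g

# A low G (below the "GF" threshold) means the pair looks independent.
```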
Datasets "DATA"
0 EachMovie
1 MSWeb
2 KDD
6 Audio
7 Book
9 Jester
10 MSNBC
11 Netflix
12 NLTCS
13 Plants
19 Accidents
20 Ad
21 BBC
22 C20NG
23 CWebKB
24 DNA
25 Kosarak
26 Retail
27 Pumsb_Star
28 CR52
If you use this project, please cite the relevant publications listed below:
@article{chen2025neural,
title={Neural probabilistic circuits: Enabling compositional and interpretable predictions through logical reasoning},
author={Chen, Weixin and Yu, Simon and Shao, Huajie and Sha, Lui and Zhao, Han},
journal={arXiv preprint arXiv:2501.07021},
year={2025}
}
@inproceedings{chenneural,
title={Neural Probabilistic Circuits: An Overview},
author={Chen, Weixin and Yu, Simon and Shao, Huajie and Sha, Lui and Zhao, Han},
booktitle={Eighth Workshop on Tractable Probabilistic Modeling}
}
@inproceedings{gens2013learning,
title={Learning the structure of sum-product networks},
author={Gens, Robert and Domingos, Pedro},
booktitle={International conference on machine learning},
pages={873--880},
year={2013},
organization={PMLR}
}
Special thanks to Rahim Khan, Tommy Tang, Alex Tanthiptham, and Trusha Vernekar for their contributions to the implementation, testing, and experiments involved in this project.
This codebase is released under the license of the original project, which can be viewed under LICENSE.
For questions, feedback, or comments, open an issue or reach out to Simon Yu.
Written by Simon Yu.