This repository provides examples of training logistic regression models on FHE-encrypted data using the OpenFHE library. The implementation demonstrates how to use OpenFHE for model training and is intended for illustrative purposes only, not for benchmarking: many substantially more efficient approaches to logistic regression training exist, but they either rely on proprietary designs or are poorly suited to illustration. The specific approach used here is based on Nesterov-accelerated gradient descent.
Note: These examples were developed as part of the DARPA DPRIVE program. They are solely for research purposes and should not be used for benchmarking or production purposes where performance is critical. Although the sample code was contributed by Duality Technologies, it is not related to the privacy-preserving logistic regression training capabilities provided in past, present, or future Duality Technologies products.
In this repository we have implemented logistic regression model training and inference on the 2014 US Infant Mortality Dataset (provided in the `train_data` directory).
Given our design matrix, $X$, and our vector of weights, $\bar{w}$, our model predicts $\hat{y} = \sigma(X\bar{w})$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
Note: the sigmoid function is also called the logistic function; it gives logistic regression its name.
Note: one limitation of FHE is that we cannot directly evaluate non-linear functions. As such, we use polynomial approximations of our non-linear functions. In OpenFHE, we provide the utility `EvalChebyshev` to compute such approximations. See our official documentation and our working example.
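As a rough illustration, here is a minimal, hypothetical sketch (not the repo's actual code) of approximating the sigmoid homomorphically with OpenFHE's `EvalChebyshevFunction`; the range, degree, and CKKS parameters below are illustrative choices:

```cpp
#include "openfhe.h"

#include <cmath>
#include <iostream>
#include <vector>

using namespace lbcrypto;

int main() {
    // Illustrative CKKS parameters; real settings depend on the workload.
    CCParams<CryptoContextCKKSRNS> params;
    params.SetMultiplicativeDepth(10);  // enough levels for a degree-59 series
    params.SetScalingModSize(50);
    CryptoContext<DCRTPoly> cc = GenCryptoContext(params);
    cc->Enable(PKE);
    cc->Enable(KEYSWITCH);
    cc->Enable(LEVELEDSHE);
    cc->Enable(ADVANCEDSHE);  // required for EvalChebyshevFunction

    auto keys = cc->KeyGen();
    cc->EvalMultKeyGen(keys.secretKey);

    std::vector<double> x = {-8.0, -1.0, 0.0, 1.0, 8.0};
    auto ct = cc->Encrypt(keys.publicKey, cc->MakeCKKSPackedPlaintext(x));

    // Chebyshev interpolation of sigma(v) = 1/(1 + e^{-v}) over [-16, 16],
    // using an (illustrative) degree-59 polynomial.
    auto ctSigmoid = cc->EvalChebyshevFunction(
        [](double v) { return 1.0 / (1.0 + std::exp(-v)); }, ct, -16, 16, 59);

    Plaintext result;
    cc->Decrypt(keys.secretKey, ctSigmoid, &result);
    result->SetLength(x.size());
    std::cout << "sigmoid approx: " << result << std::endl;
    return 0;
}
```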
Our loss function is the cross-entropy loss:

$$J(\bar{w}) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$

We optimize our model via the gradient descent method, so our objective is:

$$\min_{\bar{w}} J(\bar{w})$$
Nesterov Accelerated Gradient (NAG) can be thought of as classical gradient descent with a second "phase" that involves a special momentum parameter.
Given our parameters:

- $\theta$ is our $\bar{w}$, our "weights"
- $\phi$ is our momentum tracker

our update can be expressed as follows (with learning rate $\alpha$ and momentum parameter $\gamma$):

$$\phi_{t+1} = \theta_t - \alpha \nabla J(\theta_t)$$

$$\theta_{t+1} = \phi_{t+1} + \gamma\left(\phi_{t+1} - \phi_t\right)$$
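To make the two phases concrete, here is a small, hypothetical in-the-clear sketch of the update on a toy 1-D objective $f(\theta) = (\theta - 3)^2$ (the values of $\alpha$ and $\gamma$ are illustrative):

```cpp
#include <iostream>

int main() {
    double theta = 0.0;        // weights
    double phi   = 0.0;        // momentum tracker
    const double alpha = 0.1;  // learning rate
    const double gamma = 0.9;  // momentum parameter

    for (int i = 0; i < 100; ++i) {
        double grad    = 2.0 * (theta - 3.0);       // f'(theta)
        double phiNext = theta - alpha * grad;      // phase 1: gradient step
        theta = phiNext + gamma * (phiNext - phi);  // phase 2: momentum step
        phi   = phiNext;
    }
    std::cout << "theta converges to ~3: " << theta << std::endl;
    return 0;
}
```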
Although logistic regression in-the-clear and in FHE are similar, we leverage a number of optimizations. In particular, since the gradient of the loss is $\frac{1}{n}X^T(\sigma(X\bar{w}) - y)$, several factors can be pre-computed in the clear to reduce the number of required ciphertext multiplications:

- pre-computing $-X^T$ prior to the gradient descent step
- pre-scaling $-X^T$ by dividing by the number of samples
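As a plain-C++ sketch of this pre-computation (the function name here is hypothetical, not the repo's API):

```cpp
#include <cstddef>
#include <vector>

// Transpose, negate, and pre-scale the design matrix in the clear:
// negXt[j][i] = -X[i][j] / n, so the encrypted gradient step needs only a
// plaintext-by-ciphertext product rather than extra ciphertext operations.
std::vector<std::vector<double>> PrecomputeNegScaledXt(
        const std::vector<std::vector<double>>& X) {
    const std::size_t n = X.size();     // number of samples
    const std::size_t d = X[0].size();  // number of features
    std::vector<std::vector<double>> negXt(d, std::vector<double>(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < d; ++j)
            negXt[j][i] = -X[i][j] / static_cast<double>(n);
    return negXt;
}
```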
- Install OpenFHE as per the instructions
- Build this repo:

  ```
  mkdir build
  cd build
  cmake ..
  make -j N
  ```

  where `N` is the number of cores you want to use.
- Go into your build directory and run `./lr_nag`.
- `-b` flag: whether to run with bootstrapping or not. DEFAULT: true
- `-n` int: number of iterations for training. DEFAULT: 200
- `-r` int: rows to read from the dataset. DEFAULT: -1 (read all)
- `-x` string: train features (CSV). DEFAULT: `train_data/X_norm_1024.csv`
- `-y` string: train labels (CSV). DEFAULT: `train_data/y_1024.csv`
- `-j` string: test features (CSV). DEFAULT: `train_data/X_norm.csv`
- `-k` string: test labels (CSV). DEFAULT: `train_data/y.csv`
- `-d` int: ring dimension. DEFAULT: `1 << 17`
- `-w` string: output file prefix. DEFAULT: see below
- `-p` int: output precision. DEFAULT: 0. If non-zero, we run a 2-iteration bootstrap. See below for more information
The `-w` default depends on the formulation (sgd/nag) and amounts to either `../results/nag_` or `../results/sgd_`.
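For example, an illustrative invocation (flag values chosen arbitrarily):

```
./lr_nag -n 100 -r 1024 -x train_data/X_norm_1024.csv -y train_data/y_1024.csv
```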
In the case of 64-bit precision (specified when installing OpenFHE), we can run the bootstrap twice, which improves precision and should rival that of bootstrapping in 128-bit mode. If you specify a non-zero precision, we run in 2-iteration mode; otherwise, just a single iteration. See iterative-ckks-bootstrapping for more information.
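This corresponds to OpenFHE's iterative bootstrapping API, roughly as in the following sketch (the precision value is illustrative, and `cc`/`ct` are assumed to be a bootstrapping-enabled context and a ciphertext):

```cpp
// Single-iteration bootstrap (default behavior):
auto ctRefreshed1 = cc->EvalBootstrap(ct);

// Two-iteration bootstrap targeting a user-supplied precision (illustrative):
auto ctRefreshed2 = cc->EvalBootstrap(ct, 2 /*numIterations*/, 17 /*precision*/);
```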
Note how we pack the Theta and the Phi into a single ciphertext. This allows us to run only a single bootstrap instead of two (one for each parameter). See advanced-ckks-bootstrapping for more information.
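A minimal, hypothetical sketch of the idea (the function name is illustrative, not the repo's API): concatenate theta and phi into one packed plaintext so that a single `EvalBootstrap` call refreshes both parameter vectors:

```cpp
#include "openfhe.h"

#include <vector>

using namespace lbcrypto;

// Assumes `cc` is an initialized, bootstrapping-enabled CryptoContext and
// `pk` is the corresponding public key.
Ciphertext<DCRTPoly> PackAndBootstrap(CryptoContext<DCRTPoly> cc,
                                      const PublicKey<DCRTPoly>& pk,
                                      const std::vector<double>& theta,
                                      const std::vector<double>& phi) {
    // Lay out the two parameter vectors side by side: [theta | phi].
    std::vector<double> packed(theta);
    packed.insert(packed.end(), phi.begin(), phi.end());

    auto ct = cc->Encrypt(pk, cc->MakeCKKSPackedPlaintext(packed));
    return cc->EvalBootstrap(ct);  // one bootstrap instead of two
}
```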
- `cheb_analysis.cpp`: for a given polynomial degree and range to estimate over, generates the estimations and outputs the contents to a file in the `py_scripts/` folder. The file can then be analyzed to study the error between the estimated value and the actual value at various points.
- `data_io`: header and source file for reading in a CSV file.
- `enc_matrix`: header and source file for various encrypted matrix operations, primarily encrypted matrix multiplications.
- `lr_nag.cpp`: the "main" file to kick off the logistic regression training.
- `lr_train_funcs`: header and source file for handling training.
- `lr_types.h`: type aliases.
- `parameters.h`: code for crypto-parameter setting and parsing from command-line arguments.
- `pt_matrix`: code for plaintext matrix operations, e.g. matrix multiplication, transpose, addition.
- `utils`: printing and packing plaintext matrices.
Contains various miscellaneous Python scripts.
Note: the code for training in `parameter_search.ipynb` and `step_by_step_training_debugger.ipynb` is very similar. However, the code is separated because we use `numba` to accelerate our hyperparameter search, and numba is not amenable to debugging via `print`.
`parameter_search.ipynb`:

- Code to run our hyperparameter search
- Code to analyze the:
  - convergence: according to criteria set in the original R scripts
  - loss
  - AUC-ROC
- Outputs the plots into the `plots` subfolder
- Note: we ran the search process for a maximum of 1,000 epochs before terminating, at which point we mark the run as "not converged"
`step_by_step_training_debugger.ipynb`:

- when prototyping the code, I found it useful to have a Python implementation to compare against, to ensure that my implementation was not diverging. This notebook implements the NAG training loop.
Where we store the results of our encrypted runs.
Note: our encrypted run produces four artifacts:
- train predictions
- test predictions
- train loss
- test loss
Investigates what happens as we modify the `sigmoidApprox` parameters. The goal is to explore:
- the amount of error at various approximation degrees
- the effect that changing the degrees has on the training process
`chebyshev_approx_analysis.ipynb`:

- reads in data from `raw_data/sigmoidResults_X_Y.txt`, where `X` describes the estimation range `(-X, X)` and `Y` describes the polynomial degree of the estimation
- plots the:
  - values across a specified range
  - estimation error (exact - approximation)
  - stacked plot of the above two
- note: the data that is read in is generated by `cheb_analysis.cpp`
`train_analysis.ipynb`:

- studies the extent to which the approximation degree impacts the training process
- reads in data from `raw_data/X_Y_loss.csv`, where `X` is the optimization method and `Y` is the approximation degree
- note: we explored two optimization methods:
  - Nesterov-accelerated gradient descent (nag)
  - gradient descent (sgd)
Contains the data files. We prototyped on (`X_norm_64`, `y_64`), then validated on (`X_norm_1024`, `y_1024`) before moving to the full-scale (`X_norm_32764`, `y_32764`).
`reduceDataset.py` subsamples the dataset so that the numbers of true and false cases are equal.
These examples were mainly developed by Ian Quah, with some contributions/suggestions from Ahmad Al Badawi, David Bruce Cousins, and Yuriy Polyakov.
Distribution Statement "A" (Approved for Public Release, Distribution Unlimited). This work is supported in part by DARPA through HR0011-21-9-0003 and HR0011-20-9-0102. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.