Network regression embeddings reveal cell-type transcription factor coordination for gene regulation
NetREm is a software package that utilizes network-constrained regularization for biological applications and other network-based learning tasks. In biology, traditional regression methods can struggle with correlated predictors, particularly transcription factors (TFs) that regulate target genes (TGs) in gene regulatory networks (GRNs). NetREm incorporates information from prior biological networks to improve predictions and identify complex relationships among predictors (e.g. TF-TF coordination: direct/indirect interactions among TFs). This approach can highlight important nodes and edges in the network, reveal novel regularized embeddings for genes, provide insights into underlying biological processes, identify subnetworks of predictors that group together to influence the response variable, and improve model accuracy and biological/clinical significance of the models. NetREm can incorporate multiple types of network data, including Protein-Protein Interaction (PPI) networks, gene co-expression networks, and metabolic networks. In summary, network-constrained regularization may bolster the construction of more accurate and interpretable models that incorporate prior knowledge of the network structure among predictors.
The minimum requirement is a computer with 8 GB of RAM and 32 GB of storage. For large prior graph networks, 32 GB of RAM is recommended.
The software uses Python 3.10. After downloading the NetREm Github code, conda/Anaconda users can use the following steps to install:
- In the Anaconda navigator prompt, create a virtual environment of Python 3.10 by running:
conda create -n NetREm python=3.10
- Activate the environment:
conda activate NetREm
- Make sure to change the current directory to the NetREm folder.
- Install the packages and dependencies (math, matplotlib, networkx, numpy, typing, os, pandas, plotly.express, random, scipy, scikit-optimize, sklearn, sys, tqdm, warnings):
pip install -r requirements.txt
Please note that if you encounter import errors from files or functions in the code folder (such as Netrem_model_builder.py
), add an empty file named init.py to the code folder, and add the "code." prefix to all imports from the "code" folder. For example, import Netrem_model_builder as nm
➡️ import code.Netrem_model_builder as nm
.
NetREm fits a Network-constrained Lasso regression machine learning model with user-provided weights for the prior network. Here, netrem is the main function with the following usage:
netrem(
edge_list,
beta_net = 1,
alpha_lasso = 0.01,
default_edge_weight = 0.01,
edge_vals_for_d = True,
w_transform_for_d = "none",
degree_threshold = 0.5,
gene_expression_nodes = [],
overlapped_nodes_only = False,
y_intercept = False,
view_network = False,
model_type = "Lasso",
...
)
Parameter | Definition |
---|---|
edge_list |
list A list of lists corresponding to a prior network involving predictors (nodes) and relationships among them (edges): [[source1, target1, weight1], ..., [sourceZ, targetZ, weightZ]]. Here, weight1, ..., weightZ are optional. Nodes found in the edge_list are referred to as network nodes
|
beta_net |
float, default = 1 Regularization parameter for network penalization: |
alpha_lasso |
float, default = 0.01 A numerical regularization parameter for the lasso term ( model_type = LassoCV . Larger values typically reduce the number of final predictors in the model. |
default_edge_weight |
float, default = 0.01 Default edge weight ( |
edge_vals_for_d |
boolean, default = True If True, we focus on summing the edge weights to calculate the node degree |
w_transform_for_d |
string, default = "none" Other options are "sqrt", "square", "avg". Here, "none" means we add the weights of the edges that are connected to the node to get the node degree. These other options represent transformations that can be done on this sum to yield various other node degree values |
degree_threshold |
float, default = 0.5 If edge_vals_for_d is False, then Edges with weight |
gene_expression_nodes |
list, default = [] A list of predictors (e.g. TFs) to use that typically is found as columns in the training gene expression data Any gene_expression_nodes not found in the edge_list are added internally into the network prior edge_list using pairwise default_edge_weight . Specifying gene_expression_nodes is optional but may boost the speed of training and fitting NetREm models (by adjusting the network prior in the beginning). Thus, if the gene expression data (gene_expression_nodes . Otherwise, NetREm automatically determines gene_expression_nodes when fitting the model with overlapped_nodes_only . |
overlapped_nodes_only |
boolean, default = False This determines if NetREm should focus on common nodes found in network nodes (from edge_list ) and gene expression data (based on gene_expression_nodes ). Here, network nodes not found in the gene expression data will always be removed. The priority is given to gene_expression_nodes since those have gene expression values that are used by the regression. • If overlapped_nodes_only = False , the predictors will come from gene_expression_nodes , even if those are not found in the network edge_list . Some predictors may lack relationships in the prior network. • If overlapped_nodes_only = True , the predictors used will need to be a common node: network node also found in the gene_expression_nodes . See overlapped_nodes_only.pdf for hands-on examples. |
standardize_X |
boolean, default = True This determines if NetREm should standardize |
standardize_y |
boolean, default = True This determines if NetREm should standardize |
center_y |
boolean, default = False This determines if NetREm should center |
y_intercept |
boolean, default = 'False' This is the fit_intercept parameter found in the Lasso and LassoCV classes in sklearn. • If y_intercept = True , the model will be fit with a y-intercept term included. • If y_intercept = False , the model will be fit with no y-intercept term. |
view_network |
boolean, default = False • If view_network = True , then NetREm outputs visualizations of the prior graph network. Recommended for small networks (instead of for dense hairballs) If view_network = False , then NetREm saves time by not outputting visuals of the network. |
model_type |
{'Lasso', 'LassoCV'}, default = 'Lasso' • Lasso: user specifies value of • LassoCV: NetREm performs cross-validation (CV) on training data to determine optimal |
... (additional parameters) | Read more in the User Guide: Additional Parameters for more parameters after model_type |
We input an edge list of the prior graph network (constrains the model via network-based regularization) and a beta_net (model_type = Lasso
) or NetREm may select an optimal model_type = LasssoCV
). Then, netrem builds an estimator object from the class Netrem that can then take in input
- A NetREm network-regularized linear model estimator object from the NetREmModel class.
Method | Definition | Returns |
---|---|---|
fit( |
Building and training the NetREm model with |
Fitted NetREm Object with several updated attributes. For instance, if model_type = LassoCV , will also return optimal value for |
predict( |
Use our model to predict values for our response variable |
Numpy array of |
test_mse( |
Evaluate our model performance capabilities on testing data using Mean Squared Error (MSE) as our metric. | Numeric value corresponding to the Mean Square Error (MSE). |
We assume that our
Parameter | Definition |
---|---|
Pandas dataframe ( |
|
Pandas dataframe ( |
We can retrieve our model coefficients and other attributes by calling these outputs:
Output | Definition |
---|---|
model_coef_df |
Pandas dataframe of the Lasso model coefficients for all of the predictors and y-intercept (if y_intercept = True ) |
model_nonzero_coef_df | Filtered Pandas dataframe of the Lasso model coefficients for only the predictors and y-intercept (if y_intercept = True ) that have non-zero values. |
optimal_alpha | If model_type = LassoCV , returns the optimal |
predY_train | NetREm's predicted values for the training response |
mse_train | Mean Square Error (MSE): predicted predY_train versus actual training values for the response variable Y. |
sorted_coef_df | Pandas dataframe that sorts the final model coefficients (including the y-intercept) based on their absolute values. The rank is provided from least (most important: highest absolute value coefficient) to highest (least important in model). |
final_corr_vs_coef_df |
Pandas dataframe with 3 rows. • NetREm regression coefficient for predictor • correlation of each predictor with y based on the training data • absolute value ranking of the coefficients for the predictors |
combined_df |
Pandas dataframe with a row for each predictor and several columns detailing: • general NetREm model information: y_intercept , train MSE, beta_net , alpha_lasso , original number of predictors in • predictor-specific results: NetREm coefficient for predictor, absolute value of NetREm coefficient, rank of the absolute value of the coefficient (low ranks imply higher |
- predict(
$X$ )
We can use our model to predict values for our response variable
Parameter | Definition |
---|---|
Pandas dataframe ( |
Returns:
Numpy array of
- test_mse(
$X$ ,$y$ )
We can evaluate our model performance capabilities on data like testing data using the Mean Squared Error (MSE) as our metric.
Parameter | Definition |
---|---|
Pandas dataframe ( |
|
Pandas dataframe ( |
Returns: Numeric value corresponding to the Mean Square Error (MSE).
=======
Attribute | Definition |
---|---|
model_coef_df |
Pandas dataframe of the Lasso model coefficients for all of the predictors and y-intercept (if y_intercept = True ) |
model_nonzero_coef_df | Filtered Pandas dataframe of the Lasso model coefficients for only the predictors and y-intercept (if y_intercept = True ) that have non-zero values. |
optimal_alpha | If model_type = LassoCV , returns the optimal |
predY_train | NetREm's predicted values for the training response |
mse_train | Mean Square Error (MSE): predicted predY_train versus actual training values for the response variable Y. |
sorted_coef_df | Pandas dataframe that sorts the final model coefficients (including the y-intercept) based on their absolute values. The rank is provided from least (most important: highest absolute value coefficient) to highest (least important in model). |
final_corr_vs_coef_df |
Pandas dataframe with 3 rows. • NetREm regression coefficient for predictor • correlation of each predictor with y based on the training data • absolute value ranking of the coefficients for the predictors |
combined_df |
Pandas dataframe with a row for each predictor and several columns detailing: • general NEtREm model information: y_intercept , train MSE, beta_net , alpha_lasso , original number of predictors in • predictor-specific results: NetREm coefficient for predictor, absolute value of NetREm coefficient, rank of the absolute value of the coefficient (low ranks imply higher absolute value for the NetREm coefficient) |
B_interaction_df |
Pandas dataframe with a row for each pair of final predictors selected in the final model (with nonzero efficients) and several columns detailing the • general NEtREm model information: model_type , beta_net • predictor-predictor specific results: coord_score_cs (this is the key value we focus on as it is the coordination score), original_corr (that is the same as standardized_corr), input network weight, NetREm coefficient for each predictor, novel_link (if yes, it is artifical default edge added to the input network) |
The goal is to build a machine learning model to predict the gene expression levels of our target gene (TG) sparsity_factor_perc
is 40%, so roughly 4,000 out of the 10,000 cells for each variable will be ≈0; we incorporate these dropoout, so that the underlying data mimics single-cell expression data. Thus,
The dimensions of
Please note that the code for this demo example is demo_toy.py in the demo folder.
from DemoDataBuilderXandY import generate_dummy_data
from Netrem_model_builder import netrem
import PriorGraphNetwork as graph
import error_metrics as em
import essential_functions as ef
import netrem_evaluation_functions as nm_eval
num_samples = 10000 # number of samples (e.g. single-cells)
corrs_list = [0.9, 0.5, 0.4, -0.3, -0.8] # the individual correlations (r) of each X variable with y
train_p = 70 # the % of original data we use for M
dummy_data = generate_dummy_data(corrVals = corrs_list, # the # of elements in corrVals is the # of predictors (X)
num_samples_M = num_samples, # the number of samples M is: (num_samples_M)*(train_data_percent)
sparsity_factor_perc = 40, # approximately what % of data for each variable is sparse (0)
train_data_percent = train_p) # the remainder out of 10,000 will be kept for testing. If 100, then ALL data is used for training and testing.
The Python console or Jupyter notebook will print out the following:
same_train_test_data = False
Please note that since we hold out 30.0% of our 100000 samples for testing, we have:
X_train = 7,000 rows (samples) and 5 columns (N = 5 predictors) for training.
X_test = 3,000 rows (samples) and 5 columns (N = 5 predictors) for testing.
y_train = 7,000 corresponding rows (samples) for training.
y_test = 3,000 corresponding rows (samples) for testing.
The
X_df = dummy_data.X_df
X_df.head(10) # please view the first 10 rows
TF1 | TF2 | TF3 | TF4 | TF5 | |
---|---|---|---|---|---|
0 | 0.000000 | 1.431629 | 0.000000 | 0.000000 | 0.000000 |
1 | 0.000000 | 0.992102 | 0.774417 | -1.464686 | 0.000000 |
2 | -0.856293 | 0.000000 | -1.682493 | 0.906734 | -0.557047 |
3 | -0.657480 | 1.008181 | 0.000000 | 1.426829 | -0.699446 |
4 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
5 | 0.000000 | -1.001249 | 0.000000 | 0.000000 | 1.882605 |
6 | 1.258316 | 1.228853 | 1.789500 | -1.373181 | -3.245509 |
7 | 1.090277 | 0.000000 | 0.000000 | 1.250911 | -0.683330 |
8 | 0.666747 | 0.000000 | 0.654979 | 0.573183 | 0.000000 |
9 | -3.076390 | -1.082092 | -1.573686 | 0.742459 | 3.025879 |
y_df = dummy_data.y_df
y_df.head(10) # please view the first 10 rows
y | |
---|---|
0 | 0.711674 |
1 | 0.000000 |
2 | -1.001871 |
3 | 0.000000 |
4 | 0.000000 |
5 | -1.141293 |
6 | 2.654407 |
7 | 1.440605 |
8 | 0.000000 |
9 | -3.121532 |
# M = 7,000 samples (cells) for training data (used to train and fit NetREm model)
X_train = dummy_data.view_X_train_df()
y_train = dummy_data.view_y_train_df()
# 3,000 samples (cells) for testing data
X_test = dummy_data.view_X_test_df()
y_test = dummy_data.view_y_test_df()
Our generated single-cell gene expression data (X, y) looks like this:
X_train.corr() # pairwise correlations among the training samples
TF1 | TF2 | TF3 | TF4 | TF5 | |
---|---|---|---|---|---|
TF1 | 1.000000 | 0.438854 | 0.360903 | -0.269164 | -0.708270 |
TF2 | 0.438854 | 1.000000 | 0.176441 | -0.139661 | -0.391539 |
TF3 | 0.360903 | 0.176441 | 1.000000 | -0.128266 | -0.329087 |
TF4 | -0.269164 | -0.139661 | -0.128266 | 1.000000 | 0.234778 |
TF5 | -0.708270 | -0.391539 | -0.329087 | 0.234778 | 1.000000 |
X_test.corr() # pairwise correlations among the testing samples
TF1 | TF2 | TF3 | TF4 | TF5 | |
---|---|---|---|---|---|
TF1 | 1.000000 | 0.426107 | 0.336718 | -0.266472 | -0.759382 |
TF2 | 0.426107 | 1.000000 | 0.153369 | -0.136285 | -0.379380 |
TF3 | 0.336718 | 0.153369 | 1.000000 | -0.088676 | -0.294519 |
TF4 | -0.266472 | -0.136285 | -0.088676 | 1.000000 | 0.243514 |
TF5 | -0.759382 | -0.379380 | -0.294519 | 0.243514 | 1.000000 |
dummy_data.combined_correlations_df # breakdown of the correlations in training, testing, and overall data
i | predictor | expected_corr_with_Y | actual_corr | difference | X_group | num_samples | sparsity_factor_perc | |
---|---|---|---|---|---|---|---|---|
0 | 0 | TF1 | 0.9 | 0.922158 | 0.022158 | Overall | unique 10000 | 70 |
1 | 1 | TF2 | 0.5 | 0.462752 | 0.037248 | Overall | unique 10000 | 70 |
2 | 2 | TF3 | 0.4 | 0.363821 | 0.036179 | Overall | unique 10000 | 70 |
3 | 3 | TF4 | -0.3 | -0.272976 | 0.027024 | Overall | unique 10000 | 70 |
4 | 4 | TF5 | -0.8 | -0.802524 | 0.002524 | Overall | unique 10000 | 70 |
0 | 0 | TF1 | 0.9 | 0.920876 | 0.020876 | Training | unique 7000 | 70 |
1 | 1 | TF2 | 0.5 | 0.459479 | 0.040521 | Training | unique 7000 | 70 |
2 | 2 | TF3 | 0.4 | 0.365248 | 0.034752 | Training | unique 7000 | 70 |
3 | 3 | TF4 | -0.3 | -0.267220 | 0.032780 | Training | unique 7000 | 70 |
4 | 4 | TF5 | -0.8 | -0.795406 | 0.004594 | Training | unique 7000 | 70 |
0 | 0 | TF1 | 0.9 | 0.925128 | 0.025128 | Testing | unique 3000 | 70 |
1 | 1 | TF2 | 0.5 | 0.470230 | 0.029770 | Testing | unique 3000 | 70 |
2 | 2 | TF3 | 0.4 | 0.360628 | 0.039372 | Testing | unique 3000 | 70 |
3 | 3 | TF4 | -0.3 | -0.285976 | 0.014024 | Testing | unique 3000 | 70 |
4 | 4 | TF5 | -0.8 | -0.818860 | 0.018860 | Testing | unique 3000 | 70 |
We note that NetREm will make this input TF-TF Protein-Protein Interaction (PPI) Network (PPIN) fully connected by adding a default edge weight ŋ = 0.01 for missing edges. This is for numerical stability and to also help propel the discovery of novel TF-TF coordination links.
# prior network edge_list (missing edges or edges with no edge weight will be added with the default_edge_list so the network is fully-connected):
edge_list = [["TF1", "TF2", 0.8], ["TF4", "TF5", 0.95], ["TF1", "TF3"], ["TF1", "TF4"], ["TF1", "TF5"],
["TF2", "TF3"], ["TF2", "TF4"], ["TF2", "TF5"], ["TF3", "TF4"], ["TF3", "TF5"]]
beta_network_val = 1
alpha_lasso_val = 0.1 # the default, so we do not need to specify it.
# by default, model_type is Lasso, so alpha_lasso_val will be specified for the alpha_lasso parameter.
# However, if we specify model_type = Lasso, so our alpha_lasso is determined by cross-validation on training data).
# Building the network regularized regression model:
# By default, edges are constructed between all of the nodes; nodes with a missing edge are assigned the default_edge_weight.
netrem_demo = netrem(edge_list = edge_list,
beta_net = beta_network_val,
alpha_lasso = alpha_lasso_val, # the default is 0.1.
model_type = "Lasso",
view_network = True)
# Fitting the NetREm model on training data: X_train and y_train:
netrem_demo.fit(X_train, y_train)
To view and extract the predicted model coefficients c for the predictors:
netrem_demo.model_coef_df
y_intercept | TF1 | TF2 | TF3 | TF4 | TF5 | |
---|---|---|---|---|---|---|
0 | None | 0.376611 | 0.249198 | 0.034235 | -0.091983 | -0.188037 |
In the context of gene regulation (in biology), we predict that predictors with negative NetREm coefficients for target gene (TG)
To view the TG-specific TF-TF coordination (of direct and/or indirect interactions among TFs) that NetREm learned for this target gene B_interaction_df
.
netrem_demo.B_interaction_df
TF1 | TF2 | TF3 | TF4 | TF5 | |
---|---|---|---|---|---|
TF1 | 0.000000 | 71.466718 | 41.657612 | -38.149829 | -97.923796 |
TF2 | 71.466718 | 0.000000 | 16.547463 | -20.520985 | -54.808259 |
TF3 | 41.657612 | 16.547463 | 0.000000 | -24.335893 | -51.672959 |
TF4 | -38.149829 | -20.520985 | -24.335893 | 0.000000 | 100.000000 |
TF5 | -97.923796 | -54.808259 | -51.672959 | 100.000000 | 0.000000 |
Based on the NetREm model, we predict:
- TF predictors with positive
$c^{*} > 0$ coefficient signs are activators of TG: linked to increased gene expression levels$y$ of the TG. - TF predictors with negative
$c^{*} < 0$ coefficient signs are repressors of TG: linked to decreased gene expression levels$y$ of the TG. - TF predictors with negative
$c^{*} = 0$ coefficient signs are not selected as final TFs for the TG based on the input data.
Our TG-specific TF-TF coordination network (i.e. coordination network among predictors for this
These tend to reflect the grouped variable selection property of network-regularized regression models that tend to select TF predictors that also group together in biological pathways in the original Protein-Protein Interaction Network (PPIN).
We predict positive scores 0 <
- TF1, TF2, TF3 have a common goal of co-activating TG expression levels
$y$ ➡️ increased$y$ - TF4 and TF5 have a common goal of co-repressing TG expression levels
$y$ ➡️ decreased$y$
We predict negative scores -100 ≤
- TF1, TF2, TF3 each have antagonistic relations with TF4 and TF5.
We can test the performance of our data on testing data
pred_y_test = netrem_demo.predict(X_test) # predicted values for y_test
mse_test = netrem_demo.test_mse(X_test, y_test)
print(f"The testing Mean Square Error (MSE) is {mse_test}")
The testing Mean Square Error (MSE) is 0.21834495195611514
We can analyze more metrics about our NetREm model results as below, where:
- low NMSE values are great
- high values for SNR and PSNR are ideal:
nmse_test = em.nmse(y_test.values.flatten(), pred_y_test)
print(f"The testing Normalized Mean Square Error (NMSE) is {nmse_test}")
The testing Normalized Mean Square Error (NMSE) is 0.22013278683483584
snr_test = em.snr(y_test.values.flatten(), pred_y_test)
print(f"The testing Signal-to-Noise Ratio (SNR) is {snr_test}")
The testing Signal-to-Noise Ratio (SNR) is 6.573152683008163
psnr_test = em.psnr(y_test.values.flatten(), pred_y_test)
print(f"The testing Peak Signal-to-Noise Ratio (PSNR) is {psnr_test}")
The testing Peak Signal-to-Noise Ratio (PSNR) is 17.629710652383
These are additional metrics:
netrem_demo.final_corr_vs_coef_df
info | input_data | TF1 | TF2 | TF3 | TF4 | TF5 | |
---|---|---|---|---|---|---|---|
0 | network regression coeff. with y: y | X_train | 0.376611 | 0.249198 | 0.034235 | -0.091983 | -0.188037 |
0 | corr (r) with y: y | X_train | 0.893875 | 0.494397 | 0.401819 | -0.295085 | -0.792749 |
0 | Absolute Value NetREm Coefficient Ranking | X_train | 1 | 2 | 5 | 4 | 3 |
In the context of gene regulation, our results may thereby be interpreted in the following way:
Nonetheless, NetREm can be applied to solve a suite of regression problems where there is an underlying connection among the predictors (that group together in meaningful subnetworks to influence the outcome
We also provide a suite of evaluation functions and explanations of more advanced functionalities in our User Guide. For instance, we provide more details on our netrem and netremcv functions in terms of predicting gene regulation in myelinating Schwann cells here.
netrem_demo.combined_df
coef | TF | TG | info | train_mse | train_nmse | train_snr | train_psnr | beta_net | alpha_lasso | AbsoluteVal_coefficient | Rank | final_model_TFs | TFs_input_to_model | original_TFs_in_X | standardized_X | standardized_y | centered_y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | y_intercept | y | netrem_no_intercept | 0.219221 | 0.219221 | 6.591179 | 17.777226 | 1 | 0.1 | NaN | 6 | 5 | 5 | 5 | True | True | False |
1 | 0.376611 | TF1 | y | netrem_no_intercept | 0.219221 | 0.219221 | 6.591179 | 17.777226 | 1 | 0.1 | 0.376611 | 1 | 5 | 5 | 5 | True | True | False |
2 | 0.249198 | TF2 | y | netrem_no_intercept | 0.219221 | 0.219221 | 6.591179 | 17.777226 | 1 | 0.1 | 0.249198 | 2 | 5 | 5 | 5 | True | True | False |
3 | 0.034235 | TF3 | y | netrem_no_intercept | 0.219221 | 0.219221 | 6.591179 | 17.777226 | 1 | 0.1 | 0.034235 | 5 | 5 | 5 | 5 | True | True | False |
4 | -0.091983 | TF4 | y | netrem_no_intercept | 0.219221 | 0.219221 | 6.591179 | 17.777226 | 1 | 0.1 | 0.091983 | 4 | 5 | 5 | 5 | True | True | False |
5 | -0.188037 | TF5 | y | netrem_no_intercept | 0.219221 | 0.219221 | 6.591179 | 17.777226 | 1 | 0.1 | 0.188037 | 3 | 5 | 5 | 5 | True | True | False |
netrem_demo.TF_coord_scores_pairwise_df
23 | TF4 | TF5 | 100.000000 | 100.000000 |
---|---|---|---|---|
19 | TF5 | TF4 | 100.000000 | 100.000000 |
20 | TF1 | TF5 | -97.923796 | 97.923796 |
4 | TF5 | TF1 | -97.923796 | 97.923796 |
1 | TF2 | TF1 | 71.466718 | 71.466718 |
5 | TF1 | TF2 | 71.466718 | 71.466718 |
9 | TF5 | TF2 | -54.808259 | 54.808259 |
21 | TF2 | TF5 | -54.808259 | 54.808259 |
14 | TF5 | TF3 | -51.672959 | 51.672959 |
22 | TF3 | TF5 | -51.672959 | 51.672959 |
10 | TF1 | TF3 | 41.657612 | 41.657612 |
2 | TF3 | TF1 | 41.657612 | 41.657612 |
15 | TF1 | TF4 | -38.149829 | 38.149829 |
3 | TF4 | TF1 | -38.149829 | 38.149829 |
17 | TF3 | TF4 | -24.335893 | 24.335893 |
13 | TF4 | TF3 | -24.335893 | 24.335893 |
8 | TF4 | TF2 | -20.520985 | 20.520985 |
16 | TF2 | TF4 | -20.520985 | 20.520985 |
7 | TF3 | TF2 | 16.547463 | 16.547463 |
11 | TF2 | TF3 | 16.547463 | 16.547463 |
organize_predictor_interaction_network(netrem_demo)
node_1 | node_2 | coord_score_cs | sign | potential_interaction | absVal_coordScore | model_type | info | candidate_TFs_N | target_gene_y | ... | cs_magnitude_rank | cs_magnitude_percentile | TF_TF | original_corr | standardized_corr | PPI_score | novel_link | absVal_diff_cs_and_originalCorr | c_1 | c_2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | TF4 | TF5 | 100.000000 | :) | :) cooperative (+) | 100.000000 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 1.0 | 95.0 | TF4_TF5 | 0.234778 | 0.234778 | 0.95 | no | 99.765222 | -0.091983 | -0.188037 |
1 | TF5 | TF4 | 100.000000 | :) | :) cooperative (+) | 100.000000 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 1.0 | 95.0 | TF5_TF4 | 0.234778 | 0.234778 | 0.95 | no | 99.765222 | -0.188037 | -0.091983 |
2 | TF1 | TF5 | -97.923796 | :( | :( competitive (-) | 97.923796 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 3.0 | 85.0 | TF1_TF5 | -0.708270 | -0.708270 | 0.01 | yes | 97.215526 | 0.376611 | -0.188037 |
3 | TF5 | TF1 | -97.923796 | :( | :( competitive (-) | 97.923796 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 3.0 | 85.0 | TF5_TF1 | -0.708270 | -0.708270 | 0.01 | yes | 97.215526 | -0.188037 | 0.376611 |
4 | TF2 | TF1 | 71.466718 | :) | :) cooperative (+) | 71.466718 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 5.0 | 75.0 | TF2_TF1 | 0.438854 | 0.438854 | 0.80 | no | 71.027863 | 0.249198 | 0.376611 |
5 | TF1 | TF2 | 71.466718 | :) | :) cooperative (+) | 71.466718 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 5.0 | 75.0 | TF1_TF2 | 0.438854 | 0.438854 | 0.80 | no | 71.027863 | 0.376611 | 0.249198 |
6 | TF5 | TF2 | -54.808259 | :( | :( competitive (-) | 54.808259 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 7.0 | 65.0 | TF5_TF2 | -0.391539 | -0.391539 | 0.01 | yes | 54.416720 | -0.188037 | 0.249198 |
7 | TF2 | TF5 | -54.808259 | :( | :( competitive (-) | 54.808259 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 7.0 | 65.0 | TF2_TF5 | -0.391539 | -0.391539 | 0.01 | yes | 54.416720 | 0.249198 | -0.188037 |
8 | TF5 | TF3 | -51.672959 | :( | :( competitive (-) | 51.672959 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 9.0 | 55.0 | TF5_TF3 | -0.329087 | -0.329087 | 0.01 | yes | 51.343872 | -0.188037 | 0.034235 |
9 | TF3 | TF5 | -51.672959 | :( | :( competitive (-) | 51.672959 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 9.0 | 55.0 | TF3_TF5 | -0.329087 | -0.329087 | 0.01 | yes | 51.343872 | 0.034235 | -0.188037 |
10 | TF3 | TF1 | 41.657612 | :) | :) cooperative (+) | 41.657612 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 11.0 | 45.0 | TF3_TF1 | 0.360903 | 0.360903 | 0.01 | yes | 41.296709 | 0.034235 | 0.376611 |
11 | TF1 | TF3 | 41.657612 | :) | :) cooperative (+) | 41.657612 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 11.0 | 45.0 | TF1_TF3 | 0.360903 | 0.360903 | 0.01 | yes | 41.296709 | 0.376611 | 0.034235 |
12 | TF1 | TF4 | -38.149829 | :( | :( competitive (-) | 38.149829 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 13.0 | 35.0 | TF1_TF4 | -0.269164 | -0.269164 | 0.01 | yes | 37.880665 | 0.376611 | -0.091983 |
13 | TF4 | TF1 | -38.149829 | :( | :( competitive (-) | 38.149829 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 13.0 | 35.0 | TF4_TF1 | -0.269164 | -0.269164 | 0.01 | yes | 37.880665 | -0.091983 | 0.376611 |
14 | TF3 | TF4 | -24.335893 | :( | :( competitive (-) | 24.335893 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 15.0 | 25.0 | TF3_TF4 | -0.128266 | -0.128266 | 0.01 | yes | 24.207627 | 0.034235 | -0.091983 |
15 | TF4 | TF3 | -24.335893 | :( | :( competitive (-) | 24.335893 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 15.0 | 25.0 | TF4_TF3 | -0.128266 | -0.128266 | 0.01 | yes | 24.207627 | -0.091983 | 0.034235 |
16 | TF4 | TF2 | -20.520985 | :( | :( competitive (-) | 20.520985 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 17.0 | 15.0 | TF4_TF2 | -0.139661 | -0.139661 | 0.01 | yes | 20.381324 | -0.091983 | 0.249198 |
17 | TF2 | TF4 | -20.520985 | :( | :( competitive (-) | 20.520985 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 17.0 | 15.0 | TF2_TF4 | -0.139661 | -0.139661 | 0.01 | yes | 20.381324 | 0.249198 | -0.091983 |
18 | TF3 | TF2 | 16.547463 | :) | :) cooperative (+) | 16.547463 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 19.0 | 5.0 | TF3_TF2 | 0.176441 | 0.176441 | 0.01 | yes | 16.371022 | 0.034235 | 0.249198 |
19 | TF2 | TF3 | 16.547463 | :) | :) cooperative (+) | 16.547463 | Lasso | matrix of TF-TF interactions | 5 | y | ... | 19.0 | 5.0 | TF2_TF3 | 0.176441 | 0.176441 | 0.01 | yes | 16.371022 | 0.249198 | 0.034235 |
20 rows × 25 columns
We also note that since w_transform_for_d
= "none", then to calculate the degree of each TF predictor node, we calculate the sum of all edges
Instead of self-loops, we have:
We preprocess the input TF-TF PPI network to make it fully-connected for all pairwise TF-TF edges (default_edge_weight
of 0.01) and diagonals reflecting averaged connectivity values.
Please note that the manuscript associated with NetREm is available as a pre-print on biorxiv: Link to Paper
Citation:
- Saniya Khullar, Xiang Huang, Raghu Ramesh, John Svaren, Daifeng Wang, NetREm: Network Regression Embeddings reveal cell-type transcription factor coordination for gene regulation, bioRxiv 2023.10.25.563769; doi: https://doi.org/10.1101/2023.10.25.563769
[1]: Caiyan Li, Hongzhe Li, Network-constrained regularization and variable selection for analysis of genomic data, Bioinformatics, Volume 24, Issue 9, May 2008, Pages 1175–1182, https://doi.org/10.1093/bioinformatics/btn081
[2]: Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
[3]: Zhang, Y., Akutsu, T., & Ching, W. K. (2016). Incorporating network topology and phylogenetic constraints for protein-protein interaction prediction. BMC bioinformatics, 17(1), 1-15. https://doi.org/10.1186/s12859-016-1310-4
[4]: Jia, C., Zhang, Y., Chen, K., & Zhang, S. (2018). A novel feature extraction method with improved consistency for identifying cell cycle regulated genes. Bioinformatics, 34(5), 896-903. https://doi.org/10.1093/bioinformatics/btx657
[5]: Lu, Y., Chen, X., & Hu, Z. (2017). Recognition of protein/gene names from text using an ensemble of classifiers and effective abbreviation resolution. BMC bioinformatics, 18(1), 1-11. https://doi.org/10.1186/s12859-017-1515-1
[6]: Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825-2830.