Commit 6c1118f

Add vertical federated learning solution support for Azure deployments (#140)
Update vertical_fl submodule to the latest commit from fedlearner fix_dev_sgx branch. Fix port numbers in test-ps-sgx.sh. Update VFL documentation.
1 parent f122de4 commit 6c1118f

File tree

8 files changed (+773, -122 lines changed)


cczoo/vertical_fl/README.md

Lines changed: 90 additions & 62 deletions
@@ -3,35 +3,41 @@
 
 - Ubuntu 18.04. This solution should work on other Linux distributions as well, but for simplicity we provide the steps for Ubuntu 18.04 only.
 
-- Docker Engine. Docker Engine is an open source containerization technology for building and containerizing your applications. In this solution, Gramine, Fedlearner, gRPC will be built in Docker images. Please follow [this guide](https://docs.docker.com/engine/install/ubuntu/#install-using-the-convenience-script) to install Docker engine.
+- Docker Engine. Docker Engine is an open source containerization technology for building and containerizing your applications. In this solution, Gramine, Fedlearner, and gRPC will be built in a Docker image. Please follow [this guide](https://docs.docker.com/engine/install/ubuntu/#install-using-the-convenience-script) to install Docker Engine. The Docker daemon's storage location (/var/lib/docker, for example) should have at least 32 GB available.
 
 - SGX capable platform. Intel SGX Driver and SDK/PSW. You need a machine that supports Intel SGX and FLC/DCAP. Please follow [this guide](https://download.01.org/intel-sgx/latest/linux-latest/docs/) to install the Intel SGX driver and SDK/PSW. One way to verify SGX enabling status in your machine is to run [QuoteGeneration](https://github.com/intel/SGXDataCenterAttestationPrimitives/blob/master/QuoteGeneration) and [QuoteVerification](https://github.com/intel/SGXDataCenterAttestationPrimitives/blob/master/QuoteVerification) successfully.
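As an aside to the prerequisite above: before building anything, it can help to confirm that the SGX device nodes used by the `docker run` commands later in this guide exist on the host. This is an illustrative check, not part of the original instructions; device names can differ with older out-of-tree SGX drivers.

```
# Illustrative host-side check: these are the device nodes passed to the
# containers via --device in the docker run commands below.
ls -l /dev/sgx_enclave /dev/sgx_provision
```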
-Here, we will demonstrate how to run leader and follower from two containers.
-
+Here, we will demonstrate vertical federated learning using a leader container and a follower container.
 
 ## Executing Fedlearner in SGX
 
 ### 1. Download source code
 
+Download the [Fedlearner source code](https://github.com/bytedance/fedlearner/tree/fix_dev_sgx), which is a git submodule of CCZoo.
+
 ```
-git clone -b fix_dev_sgx https://github.com/bytedance/fedlearner.git
-cd fedlearner
 git submodule init
 git submodule update
+cd cczoo/vertical_fl
+./apply_overlay.sh
+cd vertical_fl
 ```
 
 ### 2. Build Docker image
 
+`build_dev_docker_image.sh` provides the `proxy_server` parameter to specify a network proxy, and accepts an optional argument to specify the Docker image tag.
+
+For deployments on Microsoft Azure:
 ```
-img_tag=Your_defined_tag
-./sgx/build_dev_docker_image.sh ${img_tag}
+AZURE=1 ./sgx/build_dev_docker_image.sh
+```
+For other cloud deployments:
+```
+./sgx/build_dev_docker_image.sh
 ```
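As a concrete illustration of the options described above, the following hypothetical invocation combines the Azure switch with a custom image tag; `v1.0` is only an example value, not one required by the solution.

```
# Hypothetical example: build for Azure and tag the image fedlearner-sgx-dev:v1.0.
# The positional argument is the image tag; it defaults to "latest" when omitted.
AZURE=1 ./sgx/build_dev_docker_image.sh v1.0
```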
 
-*Note:* `build_dev_docker_image.sh` provides parameter `proxy_server` to help you set your network proxy. It can be removed from this script if it is not needed.
-
-You will get the built image:
+Example of the built image:
 
 ```
 REPOSITORY           TAG       IMAGE ID        CREATED        SIZE
@@ -40,34 +46,39 @@ fedlearner-sgx-dev latest 8c3c7a05f973 45 hours ago 15.2GB
 
 ### 3. Start Container
 
-In terminal 1, start container to run leader:
+Start the leader and follower containers:
 
 ```
-docker run -it \
-    --name=fedlearner_leader \
-    --restart=unless-stopped \
-    -p 50051:50051 \
-    --device=/dev/sgx_enclave:/dev/sgx/enclave \
-    --device=/dev/sgx_provision:/dev/sgx/provision \
-    fedlearner-sgx-dev:latest \
-    bash
+docker run -itd --name=fedlearner_leader --restart=unless-stopped -p 50051:50051 \
+    --device=/dev/sgx_enclave:/dev/sgx/enclave --device=/dev/sgx_provision:/dev/sgx/provision fedlearner-sgx-dev:latest bash
+
+docker run -itd --name=fedlearner_follower --restart=unless-stopped -p 50052:50052 \
+    --device=/dev/sgx_enclave:/dev/sgx/enclave --device=/dev/sgx_provision:/dev/sgx/provision fedlearner-sgx-dev:latest bash
 ```
 
-In terminal 2, start container to run follower:
+Take note of the container IP addresses for later steps:
 
 ```
-docker run -it \
-    --name=fedlearner_follwer \
-    --restart=unless-stopped \
-    -p 50052:50052 \
-    --device=/dev/sgx_enclave:/dev/sgx/enclave \
-    --device=/dev/sgx_provision:/dev/sgx/provision \
-    fedlearner-sgx-dev:latest \
-    bash
+docker inspect --format '{{ .NetworkSettings.IPAddress }}' fedlearner_leader
+docker inspect --format '{{ .NetworkSettings.IPAddress }}' fedlearner_follower
+```
+
+In terminal 1, enter the leader container shell:
+
+```
+docker exec -it fedlearner_leader bash
+```
+
+In terminal 2, enter the follower container shell:
+
+```
+docker exec -it fedlearner_follower bash
 ```
 
 #### 3.1 Configure PCCS
 
+- For deployments on Microsoft Azure, skip this section, as configuring the PCCS is not necessary on Azure.
+
 - If you are using public cloud instance, please replace the PCCS url in `/etc/sgx_default_qcnl.conf` with the new pccs url provided by the cloud.
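For non-Azure cloud instances, the PCCS endpoint can be switched with a one-line edit. The sketch below is hypothetical: it assumes the key/value style of `/etc/sgx_default_qcnl.conf` (a `PCCS_URL=` line) and uses a placeholder URL; if the image ships the JSON-style config, update its `pccs_url` field instead.

```
# Hypothetical example: point the quoting library at the cloud provider's PCCS.
# Replace the URL with the endpoint documented by your cloud provider.
pccs_url="https://pccs.example-cloud.com:8081/sgx/certification/v3/"
sed -i "s|^PCCS_URL=.*|PCCS_URL=${pccs_url}|" /etc/sgx_default_qcnl.conf
```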
 
 ```
@@ -84,15 +95,15 @@ docker run -it \
 
 #### 3.2 Start aesm service
 
-Execute below script in both leader and follower container:
+Start the aesm service in both the leader and follower containers:
 
 ```
 /root/start_aesm_service.sh
 ```
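As an optional, illustrative sanity check (not part of the original instructions), the AESM daemon exposes a Unix socket at its default path once it is running; assuming the standard install location, you can confirm it exists in each container:

```
# Illustrative check: the AESM socket should be present after the service starts.
ls -l /var/run/aesmd/aesm.socket
```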
 
 #### 4. Prepare data
 
-Generate data in both leader and follower container:
+Generate data in both the leader and follower containers:
 
 ```
 cd /gramine/CI-Examples/wide_n_deep
@@ -101,14 +112,15 @@ cd /gramine/CI-Examples/wide_n_deep
 
 #### 5. Compile applications
 
-Compile applications in both leader and follower container:
+Compile applications in both the leader and follower containers:
 
 ```
 cd /gramine/CI-Examples/wide_n_deep
 ./test-ps-sgx.sh make
 ```
 
-Please find `mr_enclave`,`mr_signer` from the print log as below:
+Take note of the `mr_enclave` and `mr_signer` values in the resulting log from the leader container.
+The following is an example log:
 
 ```
 + make
@@ -121,7 +133,7 @@ Please find `mr_enclave`,`mr_signer` from the print log as below:
 isv_svn: 0
 ```
 
-Then, update the leader's `dynamic_config.json` under current folder with follower's `mr_enclave`,`mr_signer`. Also, update follower's `dynamic_config.json` with leader's `mr_enclave`,`mr_signer`.
+In both the leader and follower containers, confirm that `mr_enclave` and `mr_signer` in `dynamic_config.json` are set to the values from the leader container's log. Use the actual values from that log, not the values from the example log above.
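An illustrative way to compare the configured measurements with the build log is to grep for the two keys; this assumes `dynamic_config.json` in the working directory contains `mr_enclave` and `mr_signer` entries, as described above.

```
# Illustrative check inside each container: print the configured measurement
# values so they can be compared with the mr_enclave / mr_signer lines
# printed by "./test-ps-sgx.sh make".
cd /gramine/CI-Examples/wide_n_deep
grep -E 'mr_enclave|mr_signer' dynamic_config.json
```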
 
 ```
 dynamic_config.json:
@@ -140,60 +152,76 @@ dynamic_config.json:
 ```
 
-#### 6. Config leader and follower's IP
+#### 6. Run the distributed training
 
-In leader's `test-ps-sgx.sh`, for `--peer-addr` , please replace `localhost` with `follower_contianer_ip`
+Start the training process in the follower container:
 
 ```
-elif [ "$ROLE" == "leader" ]; then
-    make_custom_env
-    rm -rf model/leader
-    ......
-    taskset -c 4-7 stdbuf -o0 gramine-sgx python -u leader.py \
-        --local-addr=localhost:50051 \
-        --peer-addr=follower_contianer_ip:50052
+cd /gramine/CI-Examples/wide_n_deep
+peer_ip=REPLACE_WITH_LEADER_IP_ADDR
+./test-ps-sgx.sh follower $peer_ip
 ```
 
-In follower's `test-ps-sgx.sh`, for `--peer-addr` , please replace `localhost` with `leader_contianer_ip`
+Wait until the follower training process is ready, which is when the following log line is displayed:
 
 ```
-elif [ "$ROLE" == "follower" ]; then
-    make_custom_env
-    rm -rf model/follower
-    ......
-    taskset -c 12-15 stdbuf -o0 gramine-sgx python -u follower.py \
-        --local-addr=localhost:50052 \
-        --peer-addr=leader_container_ip:50051
+2022-10-12 02:53:47,002 [INFO]: waiting master ready... (fl_logging.py:95)
 ```
 
-*Note:* Get the container ip under your host:
+Start the training process in the leader container:
 
 ```
-docker inspect --format '{{ .NetworkSettings.IPAddress }}' container_id
+cd /gramine/CI-Examples/wide_n_deep
+peer_ip=REPLACE_WITH_FOLLOWER_IP_ADDR
+./test-ps-sgx.sh leader $peer_ip
 ```
 
-#### 7.Run the distributing training
+The following logs appear on the leader once the leader and follower have established communication:
 
-Under leader container:
+```
+2022-10-12 05:22:27,056 [INFO]: [Channel] state changed from CONNECTING_UNCONNECTED to CONNECTING_CONNECTED, event: PEER_CONNECTED (fl_logging.py:95)
+2022-10-12 05:22:27,067 [INFO]: [Channel] state changed from CONNECTING_CONNECTED to READY, event: CONNECTED (fl_logging.py:95)
+2022-10-12 05:22:27,068 [DEBUG]: [Bridge] stream transmit started (fl_logging.py:98)
+```
+
+The following logs on the leader are an example of a training iteration:
 
 ```
-cd /gramine/CI-Examples/wide_n_deep
-./test-ps-sgx.sh leader
+2022-10-12 05:23:52,356 [DEBUG]: [Bridge] send start iter_id: 123 (fl_logging.py:98)
+2022-10-12 05:23:52,483 [DEBUG]: [Bridge] receive peer commit iter_id: 122 (fl_logging.py:98)
+2022-10-12 05:23:52,484 [DEBUG]: [Bridge] received peer start iter_id: 123 (fl_logging.py:98)
+2022-10-12 05:23:52,736 [DEBUG]: [Bridge] received data iter_id: 123, name: act1_f (fl_logging.py:98)
+2022-10-12 05:23:52,737 [DEBUG]: [Bridge] Data: received iter_id: 123, name: act1_f after 0.117231 sec (fl_logging.py:98)
+2022-10-12 05:23:52,739 [DEBUG]: [Bridge] Data: send iter_id: 123, name: act1_f_grad (fl_logging.py:98)
+2022-10-12 05:23:52,817 [DEBUG]: [Bridge] receive peer commit iter_id: 123 (fl_logging.py:98)
+2022-10-12 05:23:52,818 [DEBUG]: [Bridge] received peer start iter_id: 124 (fl_logging.py:98)
+2022-10-12 05:23:53,070 [DEBUG]: [Bridge] received data iter_id: 124, name: act1_f (fl_logging.py:98)
+2022-10-12 05:23:53,168 [DEBUG]: [Bridge] send commit iter_id: 123 (fl_logging.py:98)
+2022-10-12 05:23:53,170 [DEBUG]: after session run. time: 0.814208 sec (fl_logging.py:98)
 ```
 
-Under follower container:
+When the training processes are done, the leader will display these logs:
 
 ```
-cd /gramine/CI-Examples/wide_n_deep
-./test-ps-sgx.sh follower
+**************export model hook**************
+sess : <tensorflow.python.client.session.Session object at 0x7e8fb898>
+model: <fedlearner.trainer.estimator.FLModel object at 0x8ee60f98>
+export_dir: model/leader/saved_model/1665552233
+inputs: {'examples': <tf.Tensor 'examples:0' shape=<unknown> dtype=string>, 'act1_f': <tf.Tensor 'act1_f:0' shape=<unknown> dtype=float32>}
+outpus: {'output': <tf.Tensor 'MatMul_2:0' shape=(None, 2) dtype=float32>}
+*********************************************
+2022-10-12 05:24:07,675 [INFO]: export_model done (fl_logging.py:95)
+2022-10-12 05:24:07,676 [INFO]: Trainer Master status transfer, from WORKER_COMPLETED to COMPLETED (fl_logging.py:95)
+2022-10-12 05:24:09,017 [INFO]: master completed (fl_logging.py:95)
 ```
 
-Finally, the model file will be placed at
+The updated model files are saved in these locations:
 
 ```
-./model/leader/id/save_model.pd
+./model/leader/saved_model/<id>/saved_model.pb
 ```
 
 ```
-./model/follower/id/save_model.pd
+./model/follower/saved_model/<id>/saved_model.pb
 ```
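After training completes, the exported SavedModel can be inspected as an optional, illustrative step; this assumes TensorFlow's `saved_model_cli` utility is available inside the container (it ships with TensorFlow installations).

```
cd /gramine/CI-Examples/wide_n_deep
# Pick the newest export directory created by the leader's training run.
export_dir=$(ls -dt model/leader/saved_model/*/ | head -n 1)
# Show the exported signatures, inputs, and outputs.
saved_model_cli show --dir "${export_dir}" --all
```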

cczoo/vertical_fl/apply_overlay.sh

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+#!/bin/bash
+
+cp overlay/fedlearner-sgx-dev.dockerfile vertical_fl/
+cp overlay/build_dev_docker_image.sh vertical_fl/sgx/
+cp overlay/test-ps-sgx.sh vertical_fl/sgx/gramine/CI-Examples/wide_n_deep/
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+#!/bin/bash
+set -e
+
+if [ ! -n "$1" ] ; then
+    image_tag=latest
+else
+    image_tag=$1
+fi
+
+if [ -z "$AZURE" ] ; then
+    azure=
+else
+    azure=1
+fi
+
+cd `dirname "$0"`/..
+
+# You can remove build-arg http_proxy and https_proxy if your network doesn't need it
+#no_proxy="localhost,127.0.0.1"
+#proxy_server="http://test-proxy:port"
+
+DOCKER_BUILDKIT=0 docker build \
+    -f fedlearner-sgx-dev.dockerfile \
+    -t fedlearner-sgx-dev:${image_tag} \
+    --network=host \
+    --build-arg http_proxy=${proxy_server} \
+    --build-arg https_proxy=${proxy_server} \
+    --build-arg no_proxy=${no_proxy} \
+    --build-arg AZURE=${azure} \
+    .
+
+cd -
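For reference, a hypothetical way to supply the proxy settings mentioned in the script's comment without editing the file is to export them before invoking it; the proxy URL below is a placeholder.

```
# Hypothetical usage: the exported variables are forwarded to "docker build"
# as the http_proxy/https_proxy/no_proxy build arguments.
export proxy_server="http://proxy.example.com:3128"
export no_proxy="localhost,127.0.0.1"
./sgx/build_dev_docker_image.sh latest
```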
