Description
What happened + What you expected to happen
If I set experiment_mode to any value other than "simulation" (for example, "standalone"), FedScale fails to run. The femnist_cluster.yml is:
# Configuration file of FAR training experiment
# ========== Cluster configuration ==========
# ip address of the parameter server (need 1 GPU process)
ps_ip: 192.168.124.102
# ip address of each worker:# of available gpus process on each gpu in this node
# Note that if we collocate ps and worker on same GPU, then we need to decrease this number of available processes on that GPU by 1
# E.g., master node has 4 available processes, then 1 for the ps, and worker should be set to: worker:3
worker_ips:
- 192.168.124.104:[1]
- 192.168.124.105:[1]
- 192.168.124.106:[1]
exp_path: $FEDSCALE_HOME/fedscale/cloud
# Entry function of executor and aggregator under $exp_path
executor_entry: execution/executor.py
aggregator_entry: aggregation/aggregator.py
auth:
    ssh_user: "whr"
    ssh_private_key: ~/.ssh/id_rsa
# cmd to run before we can indeed run FAR (in order)
setup_commands:
- source /usr/local/miniconda3/bin/activate fedscale
# ========== Additional job configuration ==========
# Default parameters are specified in config_parser.py, wherein more description of the parameter can be found
job_conf:
- job_name: femnist_cluster # Generate logs under this folder: log_path/job_name/time_stamp
- log_path: $FEDSCALE_HOME/benchmark # Path of log files
- num_participants: 2 # Number of participants per round, we use K=100 in our paper, large K will be much slower
- data_set: femnist # Dataset: openImg, google_speech, stackoverflow
- data_dir: $FEDSCALE_HOME/benchmark/dataset/data/femnist # Path of the dataset
- data_map_file: $FEDSCALE_HOME/benchmark/dataset/data/femnist/client_data_mapping/train.csv # Allocation of data to each client, turn to iid setting if not provided
- device_conf_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_device_capacity # Path of the client trace
- device_avail_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_behave_trace
- model: resnet18 # NOTE: Please refer to our model zoo README and use models for these small image (e.g., 32x32x3) inputs
# - model_zoo: fedscale-torch-zoo
- eval_interval: 10 # How many rounds to run a testing on the testing set
- rounds: 1000 # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
- filter_less: 21 # Remove clients w/ less than 21 samples
- num_loaders: 2
- local_steps: 5
- learning_rate: 0.05
- batch_size: 20
- test_bsz: 20
- use_cuda: True
- save_checkpoint: False
- experiment_mode: standalone
- overcommitment: 1.0
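For readers comparing the config with the log, note that job_conf is a YAML list of single-key mappings, while the log prints one flat Namespace(...). A minimal sketch of that flattening follows, assuming the list has already been loaded into Python objects; the helper name flatten_job_conf is hypothetical and is not taken from FedScale's code:

```python
def flatten_job_conf(job_conf):
    """Merge a list of single-key dicts into one flat dict (later keys win)."""
    flat = {}
    for entry in job_conf:
        flat.update(entry)
    return flat


# A few entries copied from the job_conf section above, in the shape a YAML
# parser would typically produce for a list of "- key: value" items.
job_conf = [
    {"job_name": "femnist_cluster"},
    {"num_participants": 2},
    {"experiment_mode": "standalone"},
    {"overcommitment": 1.0},
]

flat = flatten_job_conf(job_conf)

# The issue is about any mode other than "simulation".
is_simulation = flat.get("experiment_mode") == "simulation"
print(flat["experiment_mode"], is_simulation)
```

This only illustrates the shape of the configuration; how FedScale actually converts these entries into command-line arguments may differ.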
The log is:
2023-12-27 14:39:19.056225: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:19.152964: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-27 14:39:19.480238: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:19.480270: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:19.480272: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:19 INFO [aggregator.py:44] Job args Namespace(adam_epsilon=1e-08, backbone='./resnet50.pth', backend='gloo', batch_size=20, bidirectional=True, blacklist_max_len=0.3, blacklist_rounds=-1, block_size=64, cfg_file='./utils/rcnn/cfgs/res101.yml', clf_block_size=32, clip_bound=0.9, clip_threshold=3.0, clock_factor=2.4368231046931412, conf_path='~/dataset/', connection_timeout=60, cuda_device=None, cut_off_util=0.05, data_cache='', data_dir='/home/whr/code/FedScale/benchmark/dataset/data/femnist', data_map_file='/home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv', data_set='femnist', decay_factor=0.98, decay_round=10, device_avail_file='/home/whr/code/FedScale/benchmark/dataset/data/device_info/client_behave_trace', device_conf_file='/home/whr/code/FedScale/benchmark/dataset/data/device_info/client_device_capacity', dump_epoch=10000000000.0, embedding_file='glove.840B.300d.txt', engine='pytorch', epsilon=0.9, eval_interval=10, executor_configs='192.168.124.104:[1]=192.168.124.105:[1]=192.168.124.106:[1]', experiment_mode='standalone', exploration_alpha=0.3, exploration_decay=0.98, exploration_factor=0.9, exploration_min=0.3, filter_less=21, filter_more=1000000000000000.0, finetune=False, gamma=0.9, gradient_policy=None, hidden_layers=7, hidden_size=256, input_dim=0, input_shape=[1, 3, 28, 28], job_name='femnist_cluster', labels_path='labels.json', learning_rate=0.05, line_by_line=False, local_steps=5, log_path='/home/whr/code/FedScale/benchmark', loss_decay=0.2, malicious_factor=1000000000000000.0, max_concurrency=10, max_staleness=5, memory_capacity=2000, min_learning_rate=5e-05, mlm=False, mlm_probability=0.15, model='resnet18', model_size=65536, model_zoo='torchcv', n_actions=2, n_states=4, noise_dir=None, noise_factor=0.1, noise_max=0.5, noise_min=0.0, noise_prob=0.4, num_class=62, num_classes=35, num_executors=3, num_loaders=2, num_participants=3, output_dim=0, overcommitment=1.0, overwrite_cache=False, pacer_delta=5, 
pacer_step=20, proxy_mu=0.1, ps_ip='192.168.124.102', ps_port='29500', qfed_q=1.0, rnn_type='lstm', round_penalty=2.0, round_threshold=30, rounds=1000, sample_mode='random', sample_rate=16000, sample_seed=233, sample_window=5.0, save_checkpoint=True, spec_augment=False, speed_volume_perturb=False, target_delta=0.0001, target_replace_iter=15, task='cv', test_bsz=20, test_manifest='data/test_manifest.csv', test_output_dir='./logs/server', test_ratio=1.0, test_size_file='', this_rank=0, time_stamp='1227_143917', train_manifest='data/train_manifest.csv', train_size_file='', train_uniform=False, use_cuda=True, vocab_tag_size=500, vocab_token_size=10000, wandb_token='', weight_decay=0, window='hamming', window_size=0.02, window_stride=0.01, yogi_beta=0.9, yogi_beta2=0.99, yogi_eta=0.003, yogi_tau=1e-08)
(12-27) 14:39:20 INFO [aggregator.py:164] Initiating control plane communication ...
(12-27) 14:39:20 INFO [aggregator.py:188] %%%%%%%%%% Opening aggregator server using port [::]:29500 %%%%%%%%%%
(12-27) 14:39:20 INFO [fllibs.py:97] Initializing the model ...
(12-27) 14:39:20 INFO [aggregator.py:967] Start monitoring events ...
2023-12-27 14:39:31.090474: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:31.169358: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-27 14:39:31.478808: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:31.478836: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:31.478838: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:31 INFO [fllibs.py:97] Initializing the model ...
(12-27) 14:39:31 INFO [executor.py:77] (EXECUTOR:1) is setting up environ ...
(12-27) 14:39:32 INFO [executor.py:123] Data partitioner starts ...
(12-27) 14:39:32 INFO [divide_data.py:62] Partitioning data by profile /home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv...
(12-27) 14:39:32 INFO [divide_data.py:74] Trace names are client_id, sample_path, label_name, label_id
(12-27) 14:39:32 INFO [divide_data.py:105] Randomly partitioning data, 81674 samples...
(12-27) 14:39:32 INFO [executor.py:141] Data partitioner completes ...
(12-27) 14:39:32 INFO [channel_context.py:21] %%%%%%%%%% Opening grpc connection to 192.168.124.102 %%%%%%%%%%
(12-27) 14:39:32 INFO [executor.py:404] Start monitoring events ...
(12-27) 14:39:32 INFO [aggregator.py:318] Received executor 1 information, 1/3
(12-27) 14:39:32 INFO [aggregator.py:274] Loading 2800 client traces ...
(12-27) 14:39:32 INFO [aggregator.py:304] Info of all feasible clients {'total_feasible_clients': 2799, 'total_num_samples': 637858}
2023-12-27 14:39:33.925569: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:34.012208: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-27 14:39:34.334770: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:34.334812: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:34.334815: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:34 INFO [fllibs.py:97] Initializing the model ...
(12-27) 14:39:34 INFO [executor.py:77] (EXECUTOR:2) is setting up environ ...
2023-12-27 14:39:35.087146: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:35.167337: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(12-27) 14:39:35 INFO [executor.py:123] Data partitioner starts ...
(12-27) 14:39:35 INFO [divide_data.py:62] Partitioning data by profile /home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv...
(12-27) 14:39:35 INFO [divide_data.py:74] Trace names are client_id, sample_path, label_name, label_id
2023-12-27 14:39:35.479452: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:35.479481: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:35.479484: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:35 INFO [divide_data.py:105] Randomly partitioning data, 81674 samples...
(12-27) 14:39:35 INFO [executor.py:141] Data partitioner completes ...
(12-27) 14:39:35 INFO [channel_context.py:21] %%%%%%%%%% Opening grpc connection to 192.168.124.102 %%%%%%%%%%
(12-27) 14:39:35 INFO [executor.py:404] Start monitoring events ...
(12-27) 14:39:35 INFO [aggregator.py:318] Received executor 2 information, 2/3
(12-27) 14:39:35 INFO [aggregator.py:274] Loading 2800 client traces ...
(12-27) 14:39:35 INFO [aggregator.py:304] Info of all feasible clients {'total_feasible_clients': 5598, 'total_num_samples': 1275716}
(12-27) 14:39:35 INFO [fllibs.py:97] Initializing the model ...
(12-27) 14:39:35 INFO [executor.py:77] (EXECUTOR:3) is setting up environ ...
(12-27) 14:39:36 INFO [executor.py:123] Data partitioner starts ...
(12-27) 14:39:36 INFO [divide_data.py:62] Partitioning data by profile /home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv...
(12-27) 14:39:36 INFO [divide_data.py:74] Trace names are client_id, sample_path, label_name, label_id
(12-27) 14:39:36 INFO [divide_data.py:105] Randomly partitioning data, 81674 samples...
(12-27) 14:39:36 INFO [executor.py:141] Data partitioner completes ...
(12-27) 14:39:36 INFO [channel_context.py:21] %%%%%%%%%% Opening grpc connection to 192.168.124.102 %%%%%%%%%%
(12-27) 14:39:36 INFO [executor.py:404] Start monitoring events ...
(12-27) 14:39:36 INFO [aggregator.py:318] Received executor 3 information, 3/3
(12-27) 14:39:36 INFO [aggregator.py:274] Loading 2800 client traces ...
(12-27) 14:39:36 INFO [aggregator.py:304] Info of all feasible clients {'total_feasible_clients': 8397, 'total_num_samples': 1913574}
(12-27) 14:39:36 INFO [aggregator.py:583] Wall clock: 0 s, round: 1, Planned participants: 0, Succeed participants: 0, Training loss: 0.0
(12-27) 14:39:36 INFO [client_manager.py:195] Wall clock time: 0, 0 clients online, 8397 clients offline
(12-27) 14:39:36 INFO [aggregator.py:605] Selected participants to run: []
The aggregator reports 0 clients online and 8397 offline, selects no participants to run, and the program hangs at this point.
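To spot this symptom in a longer log, a small script can pull out the online/offline counts and the selected participant list. This is only a sketch: the regular expressions are written against the exact lines shown above, the function name summarize is hypothetical, and the script reports the symptom without diagnosing its cause.

```python
import re

# The last lines of the log above, copied verbatim.
LOG = """\
(12-27) 14:39:36 INFO [aggregator.py:583] Wall clock: 0 s, round: 1, Planned participants: 0, Succeed participants: 0, Training loss: 0.0
(12-27) 14:39:36 INFO [client_manager.py:195] Wall clock time: 0, 0 clients online, 8397 clients offline
(12-27) 14:39:36 INFO [aggregator.py:605] Selected participants to run: []
"""

ONLINE_RE = re.compile(r"(\d+) clients online, (\d+) clients offline")
SELECTED_RE = re.compile(r"Selected participants to run: \[(.*)\]")


def summarize(log_text):
    """Return the most recent online/offline counts and selected participants."""
    online = offline = selected = None
    for line in log_text.splitlines():
        match = ONLINE_RE.search(line)
        if match:
            online, offline = int(match.group(1)), int(match.group(2))
        match = SELECTED_RE.search(line)
        if match:
            body = match.group(1).strip()
            selected = [item.strip() for item in body.split(",")] if body else []
    return {"online": online, "offline": offline, "selected": selected}


summary = summarize(LOG)
print(summary)
if summary["online"] == 0 and summary["selected"] == []:
    print("No clients online and no participants selected: the round cannot start.")
```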
Versions / Dependencies
FedScale: 7ec441c
Python: 3.7.16
OS: Ubuntu20.04
Reproduction script
I put the YAML file above under $WORKDIR, so the launch command is: python $WORKDIR/docker/driver.py submit $WORKDIR/femnist_cluster.yml
Issue Severity
None