Open
Description
Discussed in #217
Originally posted by li1553770945 March 13, 2023
I have successfully installed the fedscale framework and downloaded the femnist dataset. I am trying to follow the information on this page Deploment to complete my first sample. I entered the fedscale driver start conf.yml
command and the contents of my conf.yml are as follows.
# Configuration file of FAR training experiment
# ========== Cluster configuration ==========
# ip address of the parameter server (need 1 GPU process)
ps_ip: localhost
# ip address of each worker:# of available gpus process on each gpu in this node
# Note that if we collocate ps and worker on same GPU, then we need to decrease this number of available processes on that GPU by 1
# E.g., master node has 4 available processes, then 1 for the ps, and worker should be set to: worker:3
worker_ips:
- localhost:[4]
exp_path: $FEDSCALE_HOME/fedscale/cloud
# Entry function of executor and aggregator under $exp_path
executor_entry: execution/executor.py
aggregator_entry: aggregation/aggregator.py
auth:
ssh_user: ""
ssh_private_key: ~/.ssh/id_rsa
# cmd to run before we can indeed run FAR (in order)
setup_commands:
- source $HOME/anaconda3/bin/activate fedscale
# ========== Additional job configuration ==========
# Default parameters are specified in config_parser.py, wherein more description of the parameter can be found
job_conf:
- job_name: femnist # Generate logs under this folder: log_path/job_name/time_stamp
- log_path: $FEDSCALE_HOME/benchmark # Path of log files
- num_participants: 50 # Number of participants per round, we use K=100 in our paper, large K will be much slower
- data_set: femnist # Dataset: openImg, google_speech, stackoverflow
- data_dir: $FEDSCALE_HOME/benchmark/dataset/data/femnist # Path of the dataset
- data_map_file: $FEDSCALE_HOME/benchmark/dataset/data/femnist/client_data_mapping/train.csv # Allocation of data to each client, turn to iid setting if not provided
- device_conf_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_device_capacity # Path of the client trace
- device_avail_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_behave_trace
- model: resnet18 # NOTE: Please refer to our model zoo README and use models for these small image (e.g., 32x32x3) inputs
# - model_zoo: fedscale-torch-zoo
- eval_interval: 10 # How many rounds to run a testing on the testing set
- rounds: 1000 # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
- filter_less: 21 # Remove clients w/ less than 21 samples
- num_loaders: 2
- local_steps: 5
- learning_rate: 0.05
- batch_size: 20
- test_bsz: 20
- use_cuda: False
- save_checkpoint: False
After that the message displayed is
starting aggregator on localhost...
Aggregator local PID 2767758. run kill -9 2767758 to kill the job.
Starting workers on localhost ...
Submitted job, please check your logs $FEDSCALE_HOME/benchmark/logs/femnist/0314_015101 for status
I checked the log file directory and didn't find the log file, and I checked the PID of his output and there is no process.
Metadata
Assignees
Labels
No labels
Activity