Conversation
| test('cleartext', clients.decentralized.ClearText) | ||
| test('secure', clients.decentralized.SecAgg) | ||
| test('cleartext', clients.decentralized.Base) | ||
| test('secure', clients.decentralized.Base) |
There was a problem hiding this comment.
do both tests still work if called on the same?
There was a problem hiding this comment.
I should actually delete the 'secure' one for the time being. As @swaglid explained to us, the current sec_agg is not working properly. My goal is to get this PR merged quick enough for him to begin re-implementing secure aggregation.
There was a problem hiding this comment.
the secure decentralized one should be fine, it has unit tests and end2end tests. would be better to keep it or migrate it during the refactor (maybe walid can help?)
did you check with him how it would have to be modified to become compatible with the new aggregator concept here (does the one here include decentralized and federated actually?)
| export { titanic } from './titanic' | ||
| export { simpleFace } from './simple_face' | ||
| export { geotags } from './geotags' | ||
| export { mnistBandit } from './mnist_bandit' |
There was a problem hiding this comment.
would you mind having the bandit later in a separate PR, after the refactor (the refactor is bigger)
There was a problem hiding this comment.
That's a good idea. It'll make the testing a little bit easier and reduce the scope of the PR a little
discojs/discojs-core/src/training/trainer/distributed_trainer.ts
Outdated
Show resolved
Hide resolved
|
thanks very much for the changes! let's try and fix the CI this week so that we can tackle the merge conflicts the next one :) there still remains a few changes to make server-side, right? such as making use of the aggregator classes |
LucasTrg
left a comment
There was a problem hiding this comment.
Overall LGTM, looking forwrd to test the edge case we might run into wth the bandit !
| this._averageNumberOfParticipants = this.totalNumberOfParticipants / this.round | ||
| this._totalNumberOfParticipants += this.currentNumberOfParticipants | ||
| } else { | ||
| this._round = this.aggregator.round |
There was a problem hiding this comment.
Not quite sure of this edge case, or could we get a number of round bigger than the number of aggregation round ?
Maybe some comments in the code could help
There was a problem hiding this comment.
this is here because the update method is called once per communication round, meaning possibly multiple times during a single training round, but the stats are over training rounds
but yes the code is unclear 😁
There was a problem hiding this comment.
This is making me think that there is a small distinction that we did not capture for participants. Namely, the participation graph can be directed ("I give you my model, but you did not give me yours"), and I don't think this current implementation would fit.
It's not a hard requirement yet, more of a food or thoughts comment
There was a problem hiding this comment.
you are right, the current code allows one to define "what" kind of contribution the client should send to its neighbors, but not "who" it should send its contributions to
similarly, the client will expect a contribution (possibly many) from every active neighbor, with an eventual non-failing timeout in the case where no contribution was received, which is is not ideal because it adds peer idleness
both issues can be fully handled by the aggregator by keeping lists of desirable nodes to receive from/send to (subsets of active nodes), in order to differ from the set of nodes usually kept in sync between the client and aggregator
do you need these changes for bandit? would you like to make these changes yourself? if not, I'll make the changes in a separate PR
There was a problem hiding this comment.
I think for the moment I can keep up with this discrepancy by sending empty payloads to the nodes that were not selected for the current round.
We can clean it up if the bandit happens to be a promising project
martinjaggi
left a comment
There was a problem hiding this comment.
amazing work, and excellent tests added also. thanks a lot. only minor comments above
|
|
||
| protected informant?: AsyncInformant<T> | ||
| /** | ||
| * The result promise which, on resolve, will contain the current aggregation result. |
There was a problem hiding this comment.
want to say if it's a model (params) or model difference, or could be either?
There was a problem hiding this comment.
it can be anything really, since the class is generic, but in the case of the subclasses, it can either be the model weights or model weights difference
currently, I'm pretty sure the weights passed to the client are the entire model, which will go back to model difference in the future polishing PR(s): write docs, support a directed communication graph, re-include DP (includes model diff & clipping), re-write byzantine-robustness, etc.
| */ | ||
| log (step: AggregationStep, from?: client.NodeID): void { | ||
| switch (step) { | ||
| case AggregationStep.ADD: |
There was a problem hiding this comment.
add might be maybe misinterpreted as addition potentially, but here you mean more like save or register?
There was a problem hiding this comment.
it corresponds to the aggregator's add method and will have a dedicated docstring in #580 so it'll make more sense then :)
| case AggregationStep.ADD: | ||
| console.log(`> Adding contribution from node ${from ?? '"unknown"'} for round (${this.communicationRound}, ${this.round})`) | ||
| return | ||
| case AggregationStep.UPDATE: |
There was a problem hiding this comment.
it corresponds to a node overriding its previous contribution with a new one, and will be clearer in #580 with its dedicated docstring
In the following PR, we separate the aggregation logic to the communication one by splitting
ClientsintoClient,AggregatorandPreparator.This will allow to develop more advanced aggregation techniques, different encryption scheme, etc... as well as different network topologies in the future.