
Distributed TensorFlow [Async, Between-Graph Replication]: What Exactly Is the Interaction Between Workers and Servers Regarding Variable Updates?

I've read the Distributed TensorFlow documentation and this question on StackOverflow, but I still have some doubts about the dynamics behind the distributed training that can be done with TensorFlow.

Solution 1:

Let's go in reverse order and start from your last question: what is the difference between computing a gradient and applying a gradient?

Computing the gradients means running the backward pass on the network after having computed the loss. For gradient descent, this means estimating the gradient values in the formula below (note: this is a huge simplification of what computing gradients actually entails; look up backpropagation and gradient descent for a proper explanation of how this works). Applying the gradients means updating the parameters according to the gradients you just computed. For gradient descent, this (roughly) means executing the following:

weights = weights - (learning_step * gradients)

Note that, depending on the value of learning_step, the new value of weights depends on both its previous value and the computed gradients.
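To make the split concrete, here is a minimal sketch using the TF 1.x Optimizer API; the toy model, loss, and learning rate are assumptions for illustration, but compute_gradients and apply_gradients are the two calls that correspond to the two steps above:

import tensorflow as tf

# Toy linear model; shapes and learning rate are arbitrary choices.
inputs = tf.placeholder(tf.float32, shape=[None, 10])
targets = tf.placeholder(tf.float32, shape=[None, 1])
weights = tf.Variable(tf.random_normal([10, 1]), name="weights")

loss = tf.reduce_mean(tf.square(tf.matmul(inputs, weights) - targets))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)

# 1) Compute the gradients: backward pass only, no variable is modified yet.
grads_and_vars = optimizer.compute_gradients(loss)

# 2) Apply the gradients: weights = weights - (learning_step * gradients).
train_op = optimizer.apply_gradients(grads_and_vars)

Calling optimizer.minimize(loss) is just these two steps chained together.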

With this in mind, it's easier to understand the PS/worker architecture. Let's make the simplifying assumption that there is only one PS (we'll see later how to extend to multiple PSs).

A PS (parameter server) keeps in memory the weights (i.e. the parameters) and receives gradients, running the update step I wrote in the code above. It does this every time it receives gradients from a worker.
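Conceptually (this is plain Python pseudo-logic, not TensorFlow's actual implementation), the PS side boils down to repeating the same update for every gradient message, in whatever order the workers happen to deliver them:

def ps_apply(weights, gradients, learning_step):
    """Called once per gradient message received from any worker."""
    for name, grad in gradients.items():
        # Same rule as above: weights = weights - (learning_step * gradients)
        weights[name] = weights[name] - learning_step * grad
    return weights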

A worker, on the other hand, looks up the current value of the weights on the PS, makes a local copy, runs a forward and a backward pass of the network on a batch of data to get new gradients, which it then sends back to the PS.

Note the emphasis on "current": there is no locking or inter-process synchronization between workers and the PS. If a worker reads the weights in the middle of an update (for example, half of them already have the new value and half are still being updated), those are the weights it will use for the next iteration. This keeps things fast.
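For completeness, here is roughly what one worker looks like with between-graph replication in TF 1.x. The cluster addresses, task index, and toy model are assumptions; the relevant piece is tf.train.replica_device_setter, which pins the variables to the PS so that reading and updating them goes over the network, while each worker drives its own session loop with no coordination:

import tensorflow as tf

# Assumed cluster layout: one PS and two workers, all on localhost.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables go to the PS task, compute ops stay on this worker.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    inputs = tf.placeholder(tf.float32, shape=[None, 10])
    targets = tf.placeholder(tf.float32, shape=[None, 1])
    weights = tf.Variable(tf.zeros([10, 1]), name="weights")
    loss = tf.reduce_mean(tf.square(tf.matmul(inputs, weights) - targets))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# Each worker runs this loop independently: every sess.run(train_op) reads
# the current weights from the PS, computes gradients locally, and sends
# them back. Nothing synchronizes the workers -- that is the async mode.
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=True) as sess:
    pass  # feed real batches and call sess.run(train_op, feed_dict=...) here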

What if there are multiple PSs? No problem! The parameters of the network are partitioned among the PSs; each worker simply contacts all of them to fetch the new values for each chunk of the parameters and sends back only the gradients relevant to each specific PS.
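To see that partitioning in action (the hosts below are assumptions), note that tf.train.replica_device_setter places variables on the PS tasks round-robin by default, so different variables end up on different parameter servers and a worker automatically talks to all of them:

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222", "localhost:2223"],
    "worker": ["localhost:2224"],
})

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    w1 = tf.Variable(tf.zeros([10, 10]), name="w1")
    w2 = tf.Variable(tf.zeros([10, 10]), name="w2")

print(w1.device)  # /job:ps/task:0
print(w2.device)  # /job:ps/task:1

replica_device_setter also accepts a ps_strategy argument if you need a different placement policy.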
