
What parallelization methods can make neural nets train faster? [Resolved]

A fairly straightforward way (on a theoretical level, at least) of parallelizing artificial neural networks (ANNs) is to divvy up the batches of training examples in every epoch so that several workers can compute their respective contributions to the error gradient in parallel.

This certainly helps learning when the mini-batch size is large enough, but I was wondering whether there are other avenues of parallelization to exploit, especially ones that would shorten the epoch time itself (batch-wise parallelization takes just as long per epoch as a serial stochastic gradient descent run on smaller batches).
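As a concrete illustration of the batch-splitting idea above, here is a minimal sketch of data-parallel gradient computation. It is my own toy example (a linear model with a mean-squared-error loss, NumPy, and a thread pool), not taken from any particular framework:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def mse_gradient(X, y, w):
    # Mean-squared-error gradient for a linear model on one shard of data.
    return X.T @ (X @ w - y) / len(y)

def data_parallel_gradient(X, y, w, n_workers=4):
    # Split the mini-batch row-wise; each worker computes its shard's
    # gradient. Threads suffice here because NumPy releases the GIL
    # during the matrix operations.
    X_shards = np.array_split(X, n_workers)
    y_shards = np.array_split(y, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        grads = list(ex.map(mse_gradient, X_shards, y_shards,
                            [w] * n_workers))
    # Average weighted by shard size so the result equals the full-batch
    # gradient even when the split is uneven.
    sizes = np.array([len(s) for s in y_shards])
    return sum(g * s for g, s in zip(grads, sizes)) / sizes.sum()
```

The weighted average is exactly the full-batch gradient, which is why this kind of parallelism changes wall-clock time but not the optimization trajectory.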

I would figure that in certain cases (such as CNNs), vectorized architectures enable faster propagation, but I'm referring to parallelization at a higher level: for instance, if the ANN is sparsely connected, then perhaps workers could run forward and backward propagation at the same time, with each worker responsible for some densely connected component of the ANN and message passing used for the edges along the cuts.

Do any of these ideas generalize? Have ANNs been successfully parallelized, perhaps using other approaches, on this high of a level?

Apparently, something along the lines of what I described is implemented in SPANN. From an overview of the paper, it seems the message-passing approach is indeed appropriate for a massive ANN with many "separable" components. I would still appreciate insight into other approaches to parallelization, especially for smaller nets (perhaps even those that would fit onto a single machine with many cores).


Question Credit: VF1
Question Reference
Asked October 6, 2019
Posted Under: Programming
18 views
1 Answer

Well, someone upvoted this question and I guess that made it pop up for me. It's nice to know I learned something in 4 years :)

Turns out the answer to this question is, in many wonderful and general ways, affirmative.

  • Decoupled Neural Interfaces make NNs layer-wise asynchronous, allowing multiple concurrent forward and backward passes through a network, in theory. Note that this means that sometimes you're getting values from stale layers, and you're training off of stale gradients. Eyeballing their paper shows a 4x speedup on some architectures.
  • The above can perhaps even be viewed as a finer-grained version of Hogwild!, a generic approach to parallel SGD where you simply run a bunch of mini-batches concurrently and update the parameters atomically, without locks (again, possibly on stale gradients). I'm calling DNI "more generic" since it lets even the forward pass be stale (for better parallelism).
  • Some recent work on polyhedral compilers automatically detects opportunities for loop skew in RNNs. Usually, with a stacked RNN, you run through L layers across T timesteps, LT serial operations total. Loop skew in this case is pretty simple, where you can actually reduce this to about T serial operations by evaluating diagonal bands in parallel.

E.g., for a 6 timestep 3-layer RNN, evaluation is typically:

L0 T0 -> L1 T0 -> L2 T0 -> L0 T1 -> L1 T1 -> L2 T1 -> ... -> L2 T5

Instead, we can recognize that Li Tj only requires inputs from layers <= i and timesteps <= j in stacked RNNs. So Li Tj can run in parallel with L(i+1) T(j-1), giving rise to parallel "bands" of evaluation:

L0 T0 -> (L1 T0 || L0 T1) -> (L2 T0 || L1 T1 || L0 T2) -> (L2 T1 || L1 T2 || L0 T3) -> ... 
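The band schedule above can be sketched directly. This is a toy stacked RNN with tanh cells and weights shared across layers, assuming the input and hidden sizes are equal; it is my own illustration of loop skewing, not code from the compiler work:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def step(W_h, W_x, h_prev, inp):
    # One toy RNN cell: h = tanh(W_h h_prev + W_x inp).
    return np.tanh(W_h @ h_prev + W_x @ inp)

def skewed_eval(W_h, W_x, x_seq, n_layers, hidden):
    # Evaluate an n_layers-deep stacked RNN over len(x_seq) timesteps in
    # anti-diagonal bands: cell (i, j) depends only on (i, j-1) and
    # (i-1, j), both of which lie on the previous band i + j - 1, so every
    # cell on a band can run concurrently.
    T = len(x_seq)
    H = {}
    zero = np.zeros(hidden)

    def run_cell(ij):
        i, j = ij
        h_prev = H.get((i, j - 1), zero)             # same layer, previous step
        inp = x_seq[j] if i == 0 else H[(i - 1, j)]  # layer below, same step
        return ij, step(W_h, W_x, h_prev, inp)

    with ThreadPoolExecutor() as ex:
        for d in range(n_layers + T - 1):            # ~L + T - 1 serial bands
            band = [(i, d - i) for i in range(n_layers) if 0 <= d - i < T]
            for ij, h in ex.map(run_cell, band):
                H[ij] = h
    # Return the top layer's outputs across time.
    return [H[(n_layers - 1, j)] for j in range(T)]
```

For fixed depth L, the L + T - 1 bands give roughly T serial steps instead of LT, matching the speedup described above.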

credit: VF1
Answered October 6, 2019