Paper TL;DR

Divide hidden layers into blocks, use RL to take binary actions that activate only some of the blocks, and save computation time!
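
The mechanics of that one-liner are easier to see in code. Here is a minimal NumPy sketch of my reading of it; all shapes, names, and the policy parameterization are made up for illustration, not the paper's actual architecture:

```python
# Sketch: split a hidden layer into blocks, let a tiny Bernoulli policy
# decide which blocks to activate per input, and only compute active ones.
import numpy as np

rng = np.random.default_rng(0)

n_in, block_size, n_blocks = 32, 16, 8
W = rng.normal(0, 0.1, (n_blocks, n_in, block_size))   # one weight slab per block
Wp = rng.normal(0, 0.1, (n_in, n_blocks))              # gating policy weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    # The policy proposes per-block activation probabilities, then we
    # sample binary actions; inactive blocks are skipped entirely.
    probs = sigmoid(x @ Wp)                   # (n_blocks,)
    mask = rng.random(n_blocks) < probs       # binary actions
    h = np.zeros((n_blocks, block_size))
    for b in np.flatnonzero(mask):            # compute only active blocks
        h[b] = np.maximum(0.0, x @ W[b])      # ReLU block
    return h.ravel(), mask, probs

x = rng.normal(size=n_in)
h, mask, probs = forward(x)
print(f"active blocks: {mask.sum()}/{n_blocks}")

# Training would combine the task loss with a sparsity penalty, e.g.
#   reward = -(task_loss + lam * mask.mean())
# and update Wp with a REINFORCE-style gradient:
#   dWp += reward * np.outer(x, mask - probs)
```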

Poor choices we made

  • I think we were too focused on staying within the our-model-must-be-a-normal-deep-net mindset. This meant we were limited by how poorly sparse operators perform for deep nets. Instead we could have looked at tree structures and the like.
  • Looking at a scatter plot of all the runs, our hyperparameter search + reporting of results was not very honest. We did a random search over hyperparameters, and a few configurations looked very good purely by chance; we picked those as our “best results”. Clearly, given the variance of the results, we should have rerun our best hyperparameters many times and reported the spread (see the first sketch after this list). I guess I hadn’t really internalized how massive the variance of RL methods can be.
  • Our approach shone when not using minibatches (i.e., one example at a time), but for some reason we didn’t focus on that and instead spent a lot of time making minibatches fast.
  • I don’t know why I was against modern momentum methods. Who knows, Adam or RMSprop could have made our results much better.
  • We gave up the multi-pass RL dream (à la Adaptive Computation Time, Graves, arXiv:1603.08983) too early. If I had to work on this again I would definitely try this approach (see the second sketch after this list).
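
On the hyperparameter point: the fix is cheap to state in code. A minimal sketch (the `train_and_eval` function here is a hypothetical placeholder) of what we should have done, i.e., rerun the winning configuration over many seeds and report the spread rather than the single lucky run:

```python
import numpy as np

def train_and_eval(config, seed):
    # Placeholder for a full training run; returns a scalar score.
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.9, scale=0.05)  # stand-in for real run-to-run variance

best_config = {"lr": 1e-3, "n_blocks": 8}   # the "lucky" random-search winner
scores = [train_and_eval(best_config, seed) for seed in range(20)]
print(f"mean {np.mean(scores):.3f} +/- {np.std(scores):.3f} over {len(scores)} seeds")
```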
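
And on the multi-pass point, here is a rough sketch of the ACT-style idea (Graves, arXiv:1603.08983): run the same block repeatedly, accumulate a halting probability each pass, and stop once it nearly reaches 1. The names and the toy step function are mine, and the halting bookkeeping is a simplified version of the paper's scheme:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
W = rng.normal(0, 0.3, (d, d))
w_halt = rng.normal(0, 0.3, d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def act_forward(x, max_steps=10, eps=0.01):
    state, outputs, weights = x, [], []
    total = 0.0
    for t in range(max_steps):
        state = np.tanh(W @ state)           # one "pass" of computation
        h = sigmoid(w_halt @ state)          # halting probability for this pass
        p = min(h, 1.0 - total)              # last pass gets the remainder
        outputs.append(state)
        weights.append(p)
        total += p
        if total >= 1.0 - eps:
            break
    # Final output is the halting-weighted mean of the intermediate states.
    return sum(w * o for w, o in zip(weights, outputs)), t + 1

y, steps = act_forward(rng.normal(size=d))
print(f"halted after {steps} passes")
```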

Some limitations

  • Our approach required fairly specialized code to be of any use. At the time I spent 100+ hours writing my own CUDA kernels. It paid off for the paper, but if we’d asked “how could anyone use this with a normal deep learning library?” I think we would have found better sparse computation structures.
  • We wanted to make learning faster, but in the end our method is really expensive to train (there are advantages at prediction time, but that wasn’t our main goal).
  • Variance. The variance of RL gradient estimates makes learning so slow (see the sketch after this list).
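
To make the variance point concrete, here is a small illustration (not from the paper): estimate the gradient of the expected reward for a single Bernoulli gate with REINFORCE, with and without a mean baseline, and compare the estimator spread. All numbers are toy values I made up:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.7  # probability of activating the block; true gradient d/dp E[R] = 0.8

def grad_estimates(n, baseline=0.0):
    a = (rng.random(n) < p).astype(float)   # sampled binary gate actions
    r = np.where(a == 1.0, 1.0, 0.2)        # toy reward for on/off
    score = (a - p) / (p * (1.0 - p))       # d/dp log Bernoulli(a; p)
    return (r - baseline) * score           # REINFORCE estimator

g_plain = grad_estimates(10_000)
g_base = grad_estimates(10_000, baseline=0.76)  # baseline = E[R] = 0.7*1.0 + 0.3*0.2
print(f"no baseline:   mean {g_plain.mean():.2f}, std {g_plain.std():.2f}")
print(f"with baseline: mean {g_base.mean():.2f}, std {g_base.std():.2f}")
```

Both estimators are unbiased, but the baselined one has a noticeably smaller standard deviation, and that gap compounds over every SGD step.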

Now that I think about it…

  • It’s not clear that RL was the best choice. We wanted to use it because we are RL researchers, but we overlooked classical search and structured prediction methods, which we should have tried first, even if just as baselines.
  • Discrete stochasticity isn’t great. A little Gaussian noise here and there can be nice, but the massive discontinuities in learning hurt our method badly, as they did not pair well with SGD. (To deal with this, we forced our policies to be very low-entropy, which in retrospect may have hurt a lot.) If I work on something like this in the future, I think it may be worth trying local-search-style methods with less variance and less stochasticity, as in the sketch below.
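
A sketch of the kind of low-variance local search I have in mind (all names and the toy objective are mine, not anything we actually ran): greedy bit-flips over a binary block mask instead of sampling gates from a stochastic policy.

```python
import numpy as np

rng = np.random.default_rng(3)
n_blocks, lam = 8, 0.05
importance = rng.random(n_blocks)   # stand-in for each block's usefulness

def objective(mask):
    # Toy stand-in for task loss + compute penalty on active blocks.
    task_loss = 1.0 - importance[mask].sum() / importance.sum()
    return task_loss + lam * mask.mean()

mask = rng.random(n_blocks) < 0.5
best = objective(mask)
improved = True
while improved:                       # greedy hill climbing over bit flips
    improved = False
    for b in range(n_blocks):
        mask[b] = ~mask[b]            # try flipping one gate
        cand = objective(mask)
        if cand < best:
            best, improved = cand, True
        else:
            mask[b] = ~mask[b]        # revert a non-improving flip
print(f"mask {mask.astype(int)}, objective {best:.3f}")
```

Each accepted flip strictly improves the objective, so this terminates, and there is no sampling noise in the updates at all; the obvious trade-off is that it can get stuck in local optima.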