ML Retrospectives | FiLM- Visual Reasoning with a General Conditioning Layer

Paper TL;DR

Many machine learning tasks involve multiple inputs: visual question-answering (image + language), instruction-following (video + language), class-conditional image generation (image + class label), style transfer (style image + context image), etc. How can we extend single-input neural models like CNNs or RNNs to multi-input tasks? Many multi-input models work well on a particular task, but ideally, we’d have an architecture that does well on many multi-input tasks. We formulated a few successful approaches as a single, generic neural network layer (FiLM). We showed that a network using the layer can learn to answer questions about images in CLEVR, a particularly challenging multi-input task.

Here’s how the layer works. If you were to process each input separately, you’d just use a neural network for each input (i.e., a CNN for visual input or an RNN for language input). To process a pair of inputs X_i together, just have one network predict a gain γ_i,c and bias β_i,c for each feature F_i,c in a layer of the another network:

New F_i,c = (γ_i,c * F_i,c) + β_i,c

Each feature F_i,c can be a scalar activation in an MLP or e.g. a 2D feature map in a CNN. Since the affine transformation is independent for each feature, we call the operation “Feature-wise Linear Modulation” (FiLM). Here’s a visualization of FiLM conditioning the activations in a CNN:

You can add multiple FiLM layers throughout the architecture. As a full model example, here’s the architecture we used to answer questions about images:

Overall Outlook

Since our paper, FiLM has indeed proven quite general. Here are a few of the diverse tasks where FiLM has seen success:

BigGAN generates (impressive) images by conditioning a GAN’s generator on a class label using FiLM.
Meta-Learning with FiLM: For example, during the inner training loop of meta-learning, you can do SGD on only FiLM’s gains and biases instead of on all parameters (a form of regularization). Here, FiLM’s gains and biases act like a low-dimensional, task-dependent representation that influences the whole network’s behavior.
Answering Questions about Audio: “Are there an equal number of loud cello sounds and quiet clarinet sounds?” Here, FiLM conditions the audio processing (CNN-based) on a question representation (RNN-based).
Language-Guided Image Segmentation: Correcting a predicted image segmentation based on language feedback from a person (“There should be water in the middle”). Here, FiLM conditions an image segmentation network on a neural representation of the language feedback.

You can find more examples in our review article in Distill.

When writing the FiLM paper, the general nature of FiLM seemed strictly beneficial to me. Since then, later experiments, papers, and reflection have made the picture more nuanced in my eyes. In particular, FiLM seems to work on such diverse tasks because it has great flexibility and capacity, which has its downsides (listed below).

1. Lack of Inductive Bias

For any given task, follow-on work often finds it possible to outperform FiLM by incorporating some task-relevant inductive bias. In the earlier example of Language-Guided Image Segmentation, the authors achieve better segmentations by adding spatial conditioning to FiLM (which is spatially-invariant). In instruction following, Chaplot et al. achieve good results by using FiLM to condition a image/CNN-based policy on language instructions, but the authors propose a Dual Attention unit that incorporates inductive biases to perform even better. On CLEVR, the MAC network incorporated inductive biases related to multi-hop reasoning, and MAC was able to outperform FiLM on CLEVR after our paper.

2. Limited Data Efficiency

Some methods can learn from fewer examples by incorporating more priors or inductive biases relevant to the task. For example, on CLEVR, the MAC network outperforms FiLM especially in low-data regimes. Here’s the plot I get if I plot MAC’s data efficiency against FiLM’s numbers:

I got FiLM’s number from comments I had in our paper’s LaTeX source while writing the paper. At the time, the results didn’t seem surprising, until the later MAC network showed better data efficiency curves. In hindsight, it seems worth noting that CLEVR is quite a large dataset (700,000 examples); the size of CLEVR seems to be at least part of the reason that FiLM works well.

3. Requires Careful Regularization

FiLM only worked after I heavily regularized the model. We started with FiLM as a baseline, thinking (hoping?) its performance would plateau so that we could move on to “more interesting” models. However, my experiments showed that any FiLM model could easily overfit the training set, despite achieving mediocre validation performance. Over a few weeks, I added more and more regularization, eventually achieving state-of-the-art. For example, two key additions were: 1) L2 weight decay and 2) decoding FiLM’s gains and biases with a linear layer instead of an RNN. In particular, regularization seemed crucial for the network that predicts FiLM’s gains and biases.

We highlighted the importance of regularization in our paper, but in hindsight, the observation about regularization could have been a key contribution of the paper. After the FiLM paper, Harm de Vries (one of the co-authors) found that removing L2 weight decay hurt CLEVR accuracy significantly (roughly 10%). We conducted many ablations in the paper, but we didn’t revisit high-capacity architectures. I wish that we had, as it would have highlighted a key pitfall in using FiLM and helped others get FiLM to work in different settings.

As one other quick example, let’s look at how BigGAN generates class-conditional images with FiLM. BigGAN transforms class embeddings labels into FiLM’s gains and biases using a simple linear projection. In the authors words, “We experimented with using MLPs instead of linear projections from [the generator’s] class embeddings to its [FiLM] gains and biases, but did not find any benefit to doing so.” Here, the authors explored different architectures and found that a simple linear layer worked well. In other papers, FiLM is used a quick or drop-in baseline to compare a method against. In these cases, I sometimes wonder if FiLM would have performed better if the authors had swept over weight decay or tried a simpler network to predict FiLM’s gains and biases.

Fortunately, adding regularization is simple to try, and there may even be more principled, straightforward ways to regularize FiLM. In the paper introducing TADAM, a few-shot learning method, the authors parametrize γ and β as deviations from 1 and 0, respectively (default, no-op values); the authors then regularize towards zero-valued deviations. Without enough regularization on γ and β deviations, the authors found considerably overfitting. With such regularization, the authors are able to effectively use a deep, high capacity network to predict γ and β. Such forms of regularization may make FiLM more robust across architectures and hyperparameters.

Why is FiLM high capacity?

FiLM is just a simple affine transformation. What gives FiLM its high capacity? It seems to be the deep network after a FiLM layer. Later neural network layers can transform a simple, linear change into a complex, non-linear one. For example, our ablations showed that using a single FiLM layer early in the CNN achieves roughly the same performance as using 4 FiLM layers throughout the network. Without any neural layers following a FiLM layer, FiLM would have very limited capacity—exactly the capacity of a simple affine transformation.

Conclusion

If you’re thinking of trying FiLM on a task, here are my overall recommendations:

FiLM is a great method to start with—just watch for overfitting. If the model overfits, tune any regularization hyperparameters and try a simpler architecture, especially for the network that predicts FiLM’s gains and biases.
If you need better performance or data efficiency, you may well outperform FiLM by adding task-relevant inductive biases.

Happy FiLMing!

Retrospectives @NeurIPS 2019

A Retrospective for "FiLM- Visual Reasoning with a General Conditioning Layer"