Paper TL;DR

We collect a dataset of human ratings (from 1-5) of dialogue response quality on Twitter. We train a dialogue evaluation model to predict these scores, and find that the model correlates well with human scores on our test sets.

Overall outlook

Quite a few people have referenced this work, and it won a best paper award at ACL 2017. However, to my knowledge, nobody has actually used it to evaluate their dialogue system. While I think the direction was promising, the small size of our dataset and its biases led to a model that doesn’t generalize very well. The main drawbacks are listed below.

Paper flaws

I take full responsibility for all of the flaws and mistakes involved in the experimentation and writing of this paper. I list these below.

The model doesn’t generalize very well

In the paper, we show that our model (ADEM) generalizes somewhat, on our test set, to evaluating dialogue models that were unseen during training (Table 4). However, I now believe the generalization performance of ADEM is quite poor. About a year ago, we received an email from a student who attempted a ‘sanity check’ on ADEM using some hand-crafted responses. Their summary is below:

We did a sanity check of ADEM. We evaluated 84 marked responses with ADEM and separated them into 7 groups. We computed the mean score for each group. We got the following results:

  • ground truth 3.1
  • relevant, short 3.5
  • relevant, long 3.1
  • irrelevant, short 3.5
  • irrelevant, long 2.7
  • general 2.9
  • trash 2.9

It seems that ADEM tends to give higher scores for shorter responses and lower scores for long ones. Also, there is no difference in mean score between relevant short and irrelevant short responses.

An example from their hand-crafted dataset is shown below:

It is clear that this is a small dataset that isn’t representative of average Twitter data. The results in our paper only show that, for a large enough number of samples, there is some correlation between the model scores and human scores. However, it is disconcerting to see the model fail this sanity check. I suspect the ‘trash’ responses get reasonable scores because they are replaced by <unk> tokens (due to vocabulary truncation).
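
To make the failure mode concrete, here is a minimal sketch (in Python) of this kind of sanity check. The adem_score function is a hypothetical stand-in for a trained evaluation model, not the actual ADEM interface; ADEM conditions on the dialogue context and a reference response in addition to the candidate response.

```python
from collections import defaultdict
from statistics import mean

def adem_score(context: str, reference: str, response: str) -> float:
    """Hypothetical scoring interface: returns a quality score in [1, 5].
    Plug in a trained evaluation model here; this stub is just for illustration."""
    raise NotImplementedError

def sanity_check(examples):
    """examples: list of (group_label, context, reference, candidate_response) tuples,
    where group_label is a hand-assigned category like 'relevant, short' or 'trash'."""
    scores_by_group = defaultdict(list)
    for group, context, reference, response in examples:
        scores_by_group[group].append(adem_score(context, reference, response))
    # Report the mean model score per hand-crafted group.
    for group, scores in sorted(scores_by_group.items()):
        print(f"{group:20s} mean={mean(scores):.2f} (n={len(scores)})")
```

A model with the desired behaviour should separate the ‘relevant’ groups from the ‘irrelevant’ and ‘trash’ groups, and should not systematically reward short responses.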

The reason for the lack of generalization, I believe, is two-fold: (1) the dataset really isn’t that big, at only 1000 contexts and 4000 labeled responses; and (2) the human responses are biased towards being short. This is because we collected the human responses from Mechanical Turkers, who satisfied the minimum criteria by writing reasonable responses that were as short as possible. (Note that this is not their fault: we did not pay the Turkers very well, and did not incentivize longer responses with a bonus. Based on the Turker IDs, I believe a majority of the responses were written by a small set of Turkers who were trying to finish as quickly as possible, although we did have attention checks. Given the chance, I would run this procedure differently, paying the Turkers more and explicitly incentivizing the behaviour we wanted.)

The result of this bias is that, in our early experiments, ADEM appeared to perform very well simply by giving high scores to shorter responses. We tried to fix this by oversampling longer responses, but this likely did not fully resolve the issue. We do discuss this in the paper.
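
For illustration, length-based oversampling can be sketched roughly as follows; the token-length threshold and duplication factor below are made up for the example, and this is not the exact procedure used in the paper.

```python
import random

def oversample_long_responses(examples, length_threshold=10, factor=3, seed=0):
    """Duplicate training examples whose response is longer than a token-length threshold,
    so that the response-length distribution in the training data is less skewed.

    examples: list of dicts with at least a 'response' field (whitespace-tokenized here).
    """
    rng = random.Random(seed)
    resampled = []
    for ex in examples:
        resampled.append(ex)
        if len(ex["response"].split()) > length_threshold:
            resampled.extend([ex] * (factor - 1))  # add extra copies of long responses
    rng.shuffle(resampled)
    return resampled
```

Resampling only reweights the existing examples; it does not add genuinely long human responses, which may be part of why it did not fully fix the bias.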

One thing that is not mentioned in the paper is that we ran a preliminary experiment testing ADEM on the Twitter and Ubuntu evaluation data from our ‘How not to evaluate your dialogue system’ paper, and found that, on this data, ADEM didn’t correlate with human judgements at all. While this was expected for Ubuntu, it was unexpected for Twitter. We chalked this up to the data collection procedures being different, along with the fact that we used Turkers to evaluate responses in the ADEM paper and CS students to do the evaluations in the “How not to evaluate” paper. But this result should have been a red flag prompting us to probe the model more deeply, and for full transparency it should have been mentioned in the paper.

The model is sensitive to adversarial attacks

There’s a great recent paper that dissects ADEM in more detail, subjecting it to adversarial attacks and more: https://arxiv.org/pdf/1902.08832.pdf. The conclusion, perhaps inevitably, is that ADEM is not very robust to relatively small changes in the data distribution.

Cherry-picking of qualitative results

This is something that I didn’t remember until I started writing this retrospective: the qualitative results in Table 5 were cherry-picked to show 2 examples where the ADEM scores made sense, and 1 example where they did not. We did not state that these results were cherry-picked; although it was ‘implied’, the reader may be left with the impression that this was a representative sample from our data. I no longer advocate for cherry-picking qualitative results at all, especially if they are made to seem ‘representative of the data’ and an extended (non-cherry-picked) dump of model examples isn’t also provided. I remember trying to select examples where ADEM wasn’t predicting values in the 2-3 range, even though such predictions were the most common (we do state this in the paper).

Data was not publicly available

We’ve received quite a few requests to make our dataset available to the public. Unfortunately, for reasons completely within our control, this was not allowed by the McGill Ethics Board. Essentially, in order to release the data, the board required that we had told the Turkers that their data would be made publicly available, and we forgot to do this.

New perspectives

The flaws with learning an automatic dialogue evaluation model

On the whole, it’s unclear how well we should expect a dialogue scoring model to perform. I do believe that, if we had lots of data from various dialogue models, along with a distribution of reference human responses (instead of just a single one), it should be pretty feasible to compare a new model’s response to this distribution over references and give it a score. The problem is that getting multiple reference responses for a single context is pretty hard, unless you use Reddit (which is generally against having its user data used for research).
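
As a toy illustration of that idea (not something from the paper), one could compare a candidate response to a set of reference responses for the same context using some sentence encoder; the embed function below is a hypothetical placeholder, and the max-similarity scoring rule is just one naive choice.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical sentence encoder (e.g. any pretrained embedding model)."""
    raise NotImplementedError

def score_against_references(candidate: str, references: list) -> float:
    """Score a candidate by its best cosine similarity to a set of reference responses.
    With many references per context, the max (or mean) similarity acts as a crude score."""
    cand = embed(candidate)
    cand = cand / np.linalg.norm(cand)
    best = -1.0
    for ref in references:
        r = embed(ref)
        best = max(best, float(cand @ (r / np.linalg.norm(r))))
    return best
```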

There are some other potential drawbacks worth considering: is the 1-5 Likert scale really the right way to go, given how much variance there is in human ratings on it? Are neural-network-based evaluation models biased towards giving neural-network-based dialogue models high scores? These questions are definitely worth investigating in future work. Overall, though, I’d say the highest priority in this line of research is collecting a better-quality dataset.