Prior work in visual dialog has focused on training deep neural models on the
VisDial dataset in isolation, which has led to great progress, but is limiting
and wasteful. In this work, following recent trends in representation learning
for language, we introduce an approach to leverage pretraining on related
large-scale vision-language datasets before transferring to visual dialog.
Specifically, we adapt the recently proposed ViLBERT (Lu et al., 2019) model
for multi-turn visually-grounded conversation sequences. Our model is
pretrained on the Conceptual Captions and Visual Question Answering datasets,
and finetuned on VisDial with a VisDial-specific input representation and the
masked language modeling and next sentence prediction objectives (as in BERT).
Our best single model achieves state-of-the-art on Visual Dialog, outperforming
prior published work (including model ensembles) by more than 1% absolute on
NDCG and MRR.
Next, we carefully analyse our model and find that additional finetuning
using ‘dense’ annotations, i.e. relevance scores for all 100 answer options
corresponding to each question on a subset of the training set, leads to even
higher NDCG (more than 10% over our base model) but hurts MRR (more than
17% below our base model)! This highlights a stark trade-off between the two
primary metrics for this task, NDCG and MRR. We find that this is because
dense annotations in the dataset do not correlate well with the original
ground-truth answers to questions, often rewarding the model for generic
responses (e.g. “can’t tell”).
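
To make the trade-off concrete, the two metrics can be sketched as follows: MRR rewards placing the single ground-truth answer near rank 1, while NDCG (computed, as in the VisDial evaluation, over the top-k positions where k is the number of options with non-zero relevance) rewards any densely-annotated relevant option. The relevance lists below are hypothetical, not from the dataset.

```python
import math

def mean_reciprocal_rank(gt_ranks):
    # MRR: mean of 1/rank of the single ground-truth answer,
    # with ranks 1-indexed over the 100 answer options.
    return sum(1.0 / r for r in gt_ranks) / len(gt_ranks)

def ndcg(relevance_in_ranked_order):
    # relevance_in_ranked_order[i] is the dense relevance score of the
    # option the model ranked at position i. NDCG is computed over the
    # top-k positions, where k is the number of options with non-zero
    # relevance, then normalized by the ideal (sorted) ordering.
    k = sum(1 for rel in relevance_in_ranked_order if rel > 0)
    dcg = sum(rel / math.log2(i + 2)
              for i, rel in enumerate(relevance_in_ranked_order[:k]))
    ideal = sorted(relevance_in_ranked_order, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

A model that ranks a generic but densely-relevant answer (e.g. “can’t tell”) first can score a high NDCG even when the original ground-truth answer sits low in the ranking, which drives MRR down, and vice versa.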