Evaluating Prerequisite Qualities for Learning End-to-end Dialog Systems

Introduction

The paper presents a suite of benchmark tasks to evaluate end-to-end dialogue systems such that performing well on the tasks is a necessary (but not sufficient) condition for a fully functional dialogue agent.

The dataset is built from large-scale real-world sources: OMDB (Open Movie Database), MovieLens and Reddit.
It consists of ~75K movie entities and ~3.5M training examples.

Answering Factoid Questions without relation to the previous dialogue.
KB (Knowledge Base) created using OMDB and stored as triples of the form (Entity, Relation, Entity).
Questions (in natural language form) are generated from templates created using SimpleQuestions.
Instead of giving out just 1 response, the system ranks all the answers in order of their relevance.
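The triple store, templated questions and ranked output described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual pipeline: the toy triples, the relation names, the templates and the 0/1 scoring rule are all assumptions made for the example, and a real model would produce learned scores instead.

```python
from collections import defaultdict

# Hypothetical toy KB of (entity, relation, entity) triples; these entries
# are illustrative and are not taken from the benchmark files.
TRIPLES = [
    ("Blade Runner", "directed_by", "Ridley Scott"),
    ("Blade Runner", "release_year", "1982"),
    ("Alien", "directed_by", "Ridley Scott"),
    ("Alien", "release_year", "1979"),
]

# Hypothetical question templates, one per relation.
TEMPLATES = {
    "directed_by": "Who directed {subject}?",
    "release_year": "When was {subject} released?",
}


def build_index(triples):
    """Map (subject, relation) to the set of objects that answer it."""
    index = defaultdict(set)
    for subject, relation, obj in triples:
        index[(subject, relation)].add(obj)
    return index


def make_question(subject, relation):
    """Fill the template for `relation` with the subject entity."""
    return TEMPLATES[relation].format(subject=subject)


def rank_candidates(subject, relation, index, candidates):
    """Order all candidates so that KB matches come first.

    Stand-in scoring: a candidate scores 1 if the KB lists it as an answer
    and 0 otherwise; ties are broken alphabetically.
    """
    gold = index.get((subject, relation), set())
    return sorted(candidates, key=lambda c: (c not in gold, c))


index = build_index(TRIPLES)
candidates = sorted({obj for _, _, obj in TRIPLES})
question = make_question("Alien", "directed_by")
ranking = rank_candidates("Alien", "directed_by", index, candidates)
```

Returning the full ordering rather than a single answer means the output can be judged by where the correct entity lands in the list.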

Providing personalised responses to the user via recommendation instead of providing universal facts as in the factoid question answering task.
MovieLens dataset with a user x item matrix of ratings.
Statements (for any user) are generated by sampling movies rated highly by that user and forming a statement about these movies using natural language templates.
Like the previous case, a list of ranked responses is generated.
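A rough sketch of how such a statement could be produced from a ratings matrix is shown below. The rating threshold, the template wording and the choice to hold out one liked movie as the target answer are assumptions made for illustration; the notes above only state that highly rated movies are sampled and placed into templates.

```python
import random

# Hypothetical slice of a user x item rating matrix (user -> {movie: rating}).
RATINGS = {
    "user_1": {"Alien": 5, "Blade Runner": 5, "Heat": 4, "Gigli": 1, "Big": 3},
}

# Hypothetical natural-language template for the user's statement.
TEMPLATE = "I liked {movies}. Can you suggest another movie?"


def make_example(user, ratings, min_rating=4, n_context=2, seed=0):
    """Build (statement, held-out answer) from a user's highly rated movies."""
    liked = sorted(m for m, r in ratings[user].items() if r >= min_rating)
    if len(liked) < n_context + 1:
        raise ValueError("user has too few highly rated movies")
    rng = random.Random(seed)
    # Sample without replacement; the last sampled movie is held out.
    sampled = rng.sample(liked, n_context + 1)
    context, answer = sampled[:-1], sampled[-1]
    statement = TEMPLATE.format(movies=", ".join(context))
    return statement, answer


statement, answer = make_example("user_1", RATINGS)
```

A model would then be asked to rank candidate movies given `statement`, with `answer` counted as a correct recommendation.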

Maintaining short dialogues involving both factoid and personalised content.
Dataset consists of short conversations of 3 exchanges (3 turns from each participant).

Identify the most likely response in discussions on Reddit.
Data is processed to flatten each potential conversation so that it appears to be a two-participant conversation.
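One plausible way to do this flattening is to walk each root-to-leaf chain of a comment tree and relabel the comments as alternating turns. The sketch below assumes that approach; the thread structure, the field names `text` and `replies`, and the speaker labels are hypothetical and are not taken from the paper's preprocessing code.

```python
# Hypothetical comment tree: each node has text and a list of replies.
THREAD = {
    "text": "Just rewatched Alien.",
    "replies": [
        {"text": "Still holds up.", "replies": [
            {"text": "The effects aged well.", "replies": []},
        ]},
        {"text": "Prefer the sequel.", "replies": []},
    ],
}


def root_to_leaf_paths(node, prefix=()):
    """Yield every chain of comments from the root to a leaf."""
    path = prefix + (node["text"],)
    if not node["replies"]:
        yield list(path)
        return
    for reply in node["replies"]:
        yield from root_to_leaf_paths(reply, path)


def flatten(path):
    """Relabel a comment chain as alternating turns of speakers A and B."""
    return [("A" if i % 2 == 0 else "B", text) for i, text in enumerate(path)]


dialogs = [flatten(p) for p in root_to_leaf_paths(THREAD)]
```

Each resulting dialog ignores who actually wrote a comment, so a thread with many authors still reads as an exchange between two speakers.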

Combines all the previous tasks into one single task to test all the skills at once.

Memory Networks - Comprise a memory component that includes both long-term memory and short-term context.
Supervised Embedding Models - Sum the word embeddings of the input and the target independently and compare them with a similarity metric.
Recurrent Language Models - RNN, LSTM, SeqToSeq
Question Answering Systems - Systems that answer natural language questions by converting them into search queries over a KB.
SVD (Singular Value Decomposition) - Standard benchmark for recommendation.
Information Retrieval Models - Given a message, find the most similar message in the training dataset and report its output, or find the response most similar to the input directly.
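To make the supervised embedding model from the list above concrete, here is a small scoring sketch: the input and the candidate response are each reduced to a sum of word embeddings, and the two sums are compared. The hand-set 2-d vectors and the choice of cosine similarity are assumptions for illustration; a trained model would learn both embedding tables, and the notes do not say which similarity metric is used.

```python
import math

# Hypothetical 2-d embeddings chosen by hand; a trained model would learn
# separate input and target embedding tables from data.
INPUT_EMB = {"who": [0.1, 0.0], "directed": [1.0, 0.0], "alien": [0.0, 0.2]}
TARGET_EMB = {"ridley": [0.9, 0.1], "scott": [0.8, 0.2], "1979": [0.0, 1.0]}


def bag_of_words(text, table, dim=2):
    """Sum the embeddings of known words; unknown words contribute nothing."""
    total = [0.0] * dim
    for word in text.lower().split():
        for i, value in enumerate(table.get(word, [0.0] * dim)):
            total[i] += value
    return total


def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def rank(message, candidates):
    """Sort candidate responses by similarity to the message, best first."""
    query = bag_of_words(message, INPUT_EMB)
    scored = [(cosine(query, bag_of_words(c, TARGET_EMB)), c) for c in candidates]
    return [c for _, c in sorted(scored, key=lambda pair: -pair[0])]


ranking = rank("Who directed Alien", ["1979", "Ridley Scott"])
```

Because word order is discarded by the sum, this model sees the input only as a bag of words.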

Supervised word embeddings perform very poorly even when using a large number of dimensions (2000).
Memory Networks perform better than embedding models as they can utilise the local context and the long-term memory, but they do not perform as well on standalone QA tasks.
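The way a memory network can draw on both sources at once can be illustrated with a single attention read over the combined memory. This is a simplified sketch assuming dot-product scoring and softmax attention; the vectors are hand-set stand-ins for learned embeddings, and the function name `memory_read` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def memory_read(query, long_term, short_term):
    """One attention hop over the union of long-term and short-term memories."""
    memories = long_term + short_term
    weights = softmax([dot(query, m) for m in memories])
    dim = len(query)
    read = [sum(w * m[i] for w, m in zip(weights, memories)) for i in range(dim)]
    # The updated state adds the attention-weighted read vector to the query.
    return [q + r for q, r in zip(query, read)], weights


# Hypothetical hand-set vectors standing in for learned embeddings.
long_term = [[1.0, 0.0], [0.0, 1.0]]   # e.g. embedded KB facts
short_term = [[2.0, 0.0]]              # e.g. embedded previous utterance
state, weights = memory_read([1.0, 0.0], long_term, short_term)
```

Here the short-term slot receives the largest weight because it aligns most strongly with the query, while the KB slots still contribute to the read vector.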