Notes for paper titled "How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation"

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Introduction

  • The paper explores the strengths and weaknesses of different evaluation metrics for end-to-end dialogue systems in the unsupervised setting.
  • Link to the paper: https://arxiv.org/abs/1603.08023

Evaluation Metrics Considered

Word Based Similarity Metric

BLEU

  • Analyses the co-occurrence of n-grams between the ground-truth and the proposed responses.
  • BLEU-N: the geometric mean of n-gram precisions up to order N, computed over the entire dataset.
  • A brevity penalty is added to avoid a bias towards short responses.

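A minimal sketch of a corpus-level BLEU-2 computation using NLTK; the toy sentences, smoothing choice, and weights here are illustrative rather than the paper's exact setup.

```python
# Corpus-level BLEU-2 with NLTK (pip install nltk); sentences are pre-tokenised.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["i", "am", "fine", "thanks"]]]        # one list of references per candidate
candidates = [["i", "am", "fine", "thank", "you"]]

# BLEU-2: uniform weights over 1-grams and 2-grams; smoothing avoids a zero
# score when some n-gram order has no overlap at all.
score = corpus_bleu(references, candidates,
                    weights=(0.5, 0.5),
                    smoothing_function=SmoothingFunction().method1)
print(score)
```
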
METEOR

  • Creates an explicit alignment between the candidate and target responses (using exact matches, WordNet synonyms, stemmed tokens, etc.).
  • Computes a recall-weighted harmonic mean of unigram precision and recall between the proposed and ground-truth responses.

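A simplified sketch of METEOR's core score, assuming the word alignment has already been computed; the real metric also matches words via stems and WordNet synonyms and multiplies by a fragmentation penalty, which this sketch omits.

```python
# Recall-weighted harmonic mean of unigram precision and recall over aligned
# words. alpha = 0.9 weights recall much more heavily than precision.
def meteor_fmean(num_matches, candidate_len, reference_len, alpha=0.9):
    if num_matches == 0:
        return 0.0
    precision = num_matches / candidate_len
    recall = num_matches / reference_len
    return precision * recall / (alpha * precision + (1 - alpha) * recall)

print(meteor_fmean(num_matches=3, candidate_len=5, reference_len=4))
```
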
ROUGE

  • F-measure based on the Longest Common Subsequence (LCS) between the candidate and target responses (i.e. ROUGE-L).

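A small sketch of the LCS-based ROUGE-L F-measure; the `beta` value and helper names are illustrative.

```python
# ROUGE-L: F-measure over the Longest Common Subsequence of candidate and reference.
def lcs_length(x, y):
    # Classic dynamic-programming LCS length.
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            table[i][j] = table[i-1][j-1] + 1 if xi == yj else max(table[i-1][j], table[i][j-1])
    return table[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)

print(rouge_l("i am fine thank you".split(), "i am fine thanks".split()))
```
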
Embedding Based Metric

Greedy Matching

  • Each token in the ground-truth response is greedily matched to the most similar token in the predicted response, based on cosine similarity of word embeddings (and vice versa).
  • The matched similarities are averaged over all words, and the two directions are averaged to give the final score.

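A sketch of greedy matching over word embeddings; `embed` is a hypothetical token-to-vector lookup (e.g. rows of a pre-trained Word2Vec matrix), and out-of-vocabulary handling is omitted.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def greedy_match(reference_tokens, candidate_tokens, embed):
    # For each reference token, keep its best cosine match among candidate tokens.
    scores = [max(cosine(embed(r), embed(c)) for c in candidate_tokens)
              for r in reference_tokens]
    return sum(scores) / len(scores)

def greedy_matching_score(reference_tokens, candidate_tokens, embed):
    # Average the two directions so the metric is symmetric.
    return 0.5 * (greedy_match(reference_tokens, candidate_tokens, embed)
                  + greedy_match(candidate_tokens, reference_tokens, embed))
```
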
Embedding Average

  • Calculates a sentence-level embedding by averaging the word-level embeddings.
  • Compares the sentence-level embeddings of the candidate and target sentences using cosine similarity.

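A sketch of the embedding-average metric under the same hypothetical `embed` lookup.

```python
import numpy as np

def sentence_embedding(tokens, embed):
    # Mean of the word vectors, normalised to unit length.
    vectors = np.stack([embed(t) for t in tokens])
    mean = vectors.mean(axis=0)
    return mean / (np.linalg.norm(mean) + 1e-12)

def embedding_average_score(candidate_tokens, reference_tokens, embed):
    # Cosine similarity of the two (unit-length) sentence embeddings.
    return float(np.dot(sentence_embedding(candidate_tokens, embed),
                        sentence_embedding(reference_tokens, embed)))
```
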
Vector Extrema

  • For each dimension of the word vectors, take the most extreme value amongst all word vectors in the sentence, and use that value in the sentence-level embedding.
  • The idea is that by taking the extrema along each dimension, common words (which are pulled towards the origin in the embedding space) are effectively ignored in favour of more informative words.

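A sketch of the vector-extrema metric: per dimension, keep the value of largest magnitude among the sentence's word vectors, then compare sentences with cosine similarity. The `embed` lookup is again a placeholder.

```python
import numpy as np

def extrema_embedding(tokens, embed):
    vectors = np.stack([embed(t) for t in tokens])     # shape (num_tokens, dim)
    max_vals = vectors.max(axis=0)
    min_vals = vectors.min(axis=0)
    # Per dimension, keep the maximum where it dominates in magnitude, else the minimum.
    return np.where(max_vals > np.abs(min_vals), max_vals, min_vals)

def vector_extrema_score(candidate_tokens, reference_tokens, embed):
    c = extrema_embedding(candidate_tokens, embed)
    r = extrema_embedding(reference_tokens, embed)
    return float(np.dot(c, r) / (np.linalg.norm(c) * np.linalg.norm(r) + 1e-12))
```
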
Dialogue Models Considered

Retrieval Models

TF-IDF

  • Compute the TF-IDF vectors for each context and response in the corpus.
  • C-TFIDF computes the cosine similarity between an input context and all other contexts in the corpus and returns the response with the highest score.
  • R-TFIDF computes the cosine similarity between the input context and each response directly.

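A sketch of the R-TFIDF variant using scikit-learn; `contexts` and `responses` are hypothetical parallel lists of strings, and the preprocessing is not the paper's.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_r_tfidf(contexts, responses):
    # Fit a shared TF-IDF vocabulary, then score responses directly against the context.
    vectorizer = TfidfVectorizer().fit(contexts + responses)
    response_matrix = vectorizer.transform(responses)

    def retrieve(context):
        context_vec = vectorizer.transform([context])
        scores = cosine_similarity(context_vec, response_matrix)[0]
        return responses[scores.argmax()]

    return retrieve

# C-TFIDF is analogous, but scores the input context against all *contexts* in
# the corpus and returns the response paired with the best-matching context.
```
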
Dual Encoder

  • Two RNNs which compute vector representations of the input context and the response, respectively.
  • The probability that the given response is the ground-truth response for the context is then computed from the two representations (via a bilinear scoring function followed by a sigmoid).

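A sketch of the dual-encoder scoring function, assuming `c` and `r` are the final RNN hidden states for the context and the response and `M` is a learned matrix; the dimensions and random values here are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dual_encoder_score(c, r, M):
    # Probability that r is the true response for context c.
    return sigmoid(c @ M @ r)

rng = np.random.default_rng(0)
c, r = rng.normal(size=300), rng.normal(size=300)
M = rng.normal(size=(300, 300)) * 0.01
print(dual_encoder_score(c, r, M))
```
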
Generative Models

LSTM language model

  • An LSTM language model trained to predict the next word in the concatenated (context, response) pairs.
  • Given a context, the model encodes it with the LSTM and generates a response using a greedy beam search procedure.

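A simplified greedy-decoding sketch (the paper uses beam search); `model` is a hypothetical LSTM language model that maps a token id and hidden state to next-token logits and an updated state.

```python
import torch

@torch.no_grad()
def generate(model, context_ids, eos_id, max_len=30):
    state, logits = None, None
    # Run the LSTM over the context tokens to build up the hidden state.
    for token_id in context_ids:
        logits, state = model(torch.tensor([token_id]), state)
    response = []
    for _ in range(max_len):
        # Pick the highest-probability next word (greedy decoding).
        next_id = int(logits.argmax(dim=-1))
        if next_id == eos_id:
            break
        response.append(next_id)
        logits, state = model(torch.tensor([next_id]), state)
    return response
```
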
Hierarchical Recurrent Encoder-Decoder (HRED)

  • Uses a hierarchy of encoders.
  • Each utterance in the context passes through an 'utterance-level' encoder, and the outputs of these encoders are passed through a 'context-level' encoder whose final state conditions the decoder that generates the response.
  • Handles long-term dependencies better than the conventional encoder-decoder.

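A minimal PyTorch sketch of the HRED encoder hierarchy; the layer sizes, GRU cells, and names are illustrative, not the paper's configuration, and the decoder is omitted.

```python
import torch
import torch.nn as nn

class HREDEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, utt_hidden=512, ctx_hidden=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.utterance_rnn = nn.GRU(embed_dim, utt_hidden, batch_first=True)
        self.context_rnn = nn.GRU(utt_hidden, ctx_hidden, batch_first=True)

    def forward(self, utterances):
        # utterances: list of LongTensors, one (batch, seq_len) tensor per turn.
        utterance_states = []
        for turn in utterances:
            _, h = self.utterance_rnn(self.embedding(turn))
            utterance_states.append(h[-1])                   # (batch, utt_hidden)
        turns = torch.stack(utterance_states, dim=1)         # (batch, n_turns, utt_hidden)
        _, context_state = self.context_rnn(turns)
        # Final context vector that would condition the response decoder.
        return context_state[-1]
```
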
Observations

  • A human survey was conducted to determine the correlation between human judgements of response quality and the score assigned by each metric.
  • Metrics (especially BLEU-4 and BLEU-3) correlate poorly with human evaluation.
  • Best-performing metrics:
    • Among word-overlap metrics: the BLEU-2 score.
    • Among embedding-based metrics: the embedding (vector) average.
  • Embedding-based metrics would benefit from a weighting of word saliency.
  • BLEU could still be a good evaluation metric in constrained tasks like mapping dialogue acts to natural language sentences.