- Open-domain Question Answering (Open QA) - efficiently querying a large-scale knowledge base (KB) using natural language.
- Two main approaches:
- Information Retrieval
- Transform the question (in natural language) into a valid query (in terms of the KB) to get a broad set of candidate answers.
- Perform fine-grained detection on candidate answers.
- Semantic Parsing
- Interpret the correct meaning of the question and convert it into an exact query.
- Limitations:
- Human intervention is needed to create the lexicon, grammar, and schema.
- This work builds upon previous work where an embedding model learns low-dimensional vector representations of words and symbols.
- Link to the paper.
- Input - Training set of questions (paired with answers).
- KB providing a structure among the answers.
- Answers are entities in the KB, and questions are strings with one identified KB entity.
- The paper uses FREEBASE as the KB.
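- For concreteness, a minimal sketch of the input format; the triplets and question below are hypothetical examples, not taken from the paper's data:

```python
# Hypothetical FREEBASE-style triplets: (subject, relation type, object).
kb_triples = [
    ("barack_obama", "people.person.place_of_birth", "honolulu"),
    ("honolulu", "location.location.containedby", "hawaii"),
]

# Training questions are natural-language strings paired with an answer
# entity; each question contains one identified KB entity.
training_set = [
    ("where was barack_obama born ?", "honolulu"),
]
```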
- Datasets
- WebQuestions - Built using FREEBASE, Google Suggest API, and Mechanical Turk.
- FREEBASE triplets transformed into questions.
- ClueWeb Extractions dataset with entities linked to FREEBASE triplets.
- Dataset of paraphrased questions using WIKIANSWERS.
- The model learns low-dimensional vector embeddings of words (appearing in questions) and of entities and relation types of FREEBASE, such that questions and their answers are represented close to each other in the joint embedding space.
- Scoring function S(q, a), where q is a question and a is an answer, produces a high score if a answers q.
- S(q, a) = f(q)^T g(a)
- f(q) maps question to embedding space.
- f(q) = Wφ(q)
- W is a matrix of dimension K × N.
- K - dimension of the embedding space (a hyperparameter).
- N - total number of words/entities/relation types.
- φ(q) - sparse vector encoding the number of times each word appears in q.
- Similarly, g(a) = Wψ(a) maps the answer to the embedding space.
- ψ(a) gives the answer representation, as discussed below.
- Answer represented as a single entity from FREEBASE; ψ(a) is then a 1-of-N coded vector.
- Answer represented as a path from the question entity to the answer entity. The paper considers only 1- or 2-hop paths, resulting in 3-of-N or 4-of-N coded vectors (middle entities are not recorded).
- Subgraph representation, which encodes both the path (as above) and the entire subgraph of entities connected to the answer entity. Two embedding representations are used to differentiate between entities in the path and entities in the subgraph.
- The subgraph approach is based on the hypothesis that including more information about the answer improves results.
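- A minimal Python sketch of the scoring model under these definitions; the vocabulary, dimensions, and feature extraction are illustrative assumptions, not the paper's actual setup (the two-representation trick for path vs. subgraph entities is omitted for brevity):

```python
import numpy as np

# Toy dictionary over words, entities, and relation types (assumed names).
vocab = ["where", "was", "born", "?",
         "barack_obama", "honolulu", "hawaii",
         "people.person.place_of_birth", "location.location.containedby"]
index = {sym: i for i, sym in enumerate(vocab)}

K, N = 64, len(vocab)                    # K = embedding dim, N = dictionary size
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(K, N))   # the single learned matrix W

def phi(question):
    """Sparse bag-of-words vector: counts of each word appearing in q."""
    v = np.zeros(N)
    for w in question.split():
        if w in index:
            v[index[w]] += 1
    return v

def psi(answer_symbols):
    """k-of-N encoding of an answer: the path symbols and, in the subgraph
    variant, everything connected to the answer entity."""
    v = np.zeros(N)
    for s in answer_symbols:
        v[index[s]] += 1
    return v

def score(question, answer_symbols):
    """S(q, a) = f(q)^T g(a), with f(q) = W @ phi(q) and g(a) = W @ psi(a)."""
    return (W @ phi(question)) @ (W @ psi(answer_symbols))

print(score("where was barack_obama born ?",
            ["barack_obama", "people.person.place_of_birth", "honolulu"]))
```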
- Minimize a margin-based ranking loss to learn the matrix W.
- Stochastic Gradient Descent, multi-threaded with Hogwild.
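- Continuing the sketch above, one SGD step on the hinge loss max(0, margin - S(q, a) + S(q, a')) for a corrupted (incorrect) answer a' might look like this; the margin and learning rate are illustrative, and the paper's norm constraints on the embeddings are omitted:

```python
MARGIN, LR = 0.1, 0.01

def sgd_step(q, a_pos, a_neg):
    """One update on max(0, MARGIN - S(q, a_pos) + S(q, a_neg))."""
    global W
    fq = W @ phi(q)
    g_pos, g_neg = W @ psi(a_pos), W @ psi(a_neg)
    if MARGIN - fq @ g_pos + fq @ g_neg > 0:
        # Gradient of the hinge loss w.r.t. W; f and g share the same W.
        grad = (np.outer(g_neg - g_pos, phi(q))
                + np.outer(fq, psi(a_neg) - psi(a_pos)))
        W -= LR * grad
```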
- To account for the large number of synthetically generated questions, the paper also multi-tasks the training of the model with paraphrase prediction.
- Scoring function S_prp(q1, q2) = f(q1)^T f(q2), where f uses the same weight matrix W as before.
- A high score is assigned if q1 and q2 belong to the same paraphrase cluster.
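- In the same sketch, the paraphrase score simply reuses f (and hence W) on both sides:

```python
def score_prp(q1, q2):
    # S_prp(q1, q2) = f(q1)^T f(q2); trained with the same ranking loss so
    # that paraphrase pairs score higher than random question pairs.
    return (W @ phi(q1)) @ (W @ phi(q2))
```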
- Additionally, the model multi-tasks mapping the embeddings of FREEBASE entities (mids) to actual words.
- For each question, a candidate set is generated.
- The answer (from the candidate set) with the highest score is reported as the correct answer, as sketched below.
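- Inference is then an argmax of S(q, a) over the candidate set (continuing the toy sketch; candidate generation itself is sketched after the next list):

```python
def predict(question, candidate_paths):
    """Return the candidate answer path with the highest score S(q, a)."""
    return max(candidate_paths, key=lambda path: score(question, path))
```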
- Candidate set generation strategies
- C1 - All KB triplets containing the KB entity from the question form the candidate set. Answers are limited to 1-hop paths.
- C2 - Rank all relation types, keep the top 10, and add only those 2-hop candidates where the selected relations appear in the path.
- The C2 strategy outperforms C1, supporting the hypothesis that a richer answer representation can store more information.
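- A hedged sketch of the two strategies over the toy kb_triples above; the real system works on the FREEBASE graph and ranks relation types with the model itself, whereas here the selected types are simply passed in:

```python
def candidates_c1(question_entity):
    """C1: 1-hop paths from all triples containing the question entity."""
    return [[s, r, o] for (s, r, o) in kb_triples if s == question_entity]

def candidates_c2(question_entity, top_relations):
    """C2: also add 2-hop paths, but only through the top-ranked relation
    types (supplied by the caller in this sketch)."""
    cands = candidates_c1(question_entity)
    for s1, r1, o1 in list(cands):
        for s2, r2, o2 in kb_triples:
            if s2 == o1 and r1 in top_relations and r2 in top_relations:
                # The middle entity o1 is kept here for readability, even
                # though the paper's encoding does not record it.
                cands.append([s1, r1, o1, r2, o2])
    return cands

print(candidates_c2("barack_obama",
                    {"people.person.place_of_birth",
                     "location.location.containedby"}))
```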
- The proposed approach outperforms the baseline methods but is itself outperformed by an ensemble of the proposed approach with a semantic parsing via paraphrasing model.