Deep Dive with WiTQA: When Does Retrieval Augmentation Help (or Hurt) Language Models?
Retrieval-Augmented Language Models (RALMs) have become the de facto standard for building question-answering (QA) systems: they generate responses grounded in externally retrieved knowledge relevant to the query. However, when incorrect external knowledge is retrieved, the RALM’s responses can be misguided.
At the same time, as model scale and the amount of pre-training data have grown, the capabilities of language models themselves have improved significantly: they memorize vast amounts of knowledge in their parameters.
This raises an important question for building a reliable RALM-based QA system: When is retrieval helpful, and when does it hinder the language model’s performance?
To address this question, we built a new question-answering dataset called WiTQA and comprehensively evaluated language models of varying sizes in combination with retrieval models. This extensive evaluation yielded practical insights for building RALM-based QA systems in real-world settings, where one must decide whether a language model should answer with or without retrieval augmentation to maximize QA accuracy.
Let’s first explore the dataset creation process in detail!
WiTQA Dataset: A New Frontier for Analyzing RALMs with Fact-level Popularity
To analyze the interplay between LMs and retrieval systems effectively, we introduced the WiTQA (Wikipedia Triple Question Answers) dataset. We give an example below:
- Triple: (Subject: “Nausicaä of the Valley of the Wind”, Relation: published in, Object: Animage)
- Question: “What Japanese anime and entertainment magazine was “Nausicaä of the Valley of the Wind” published in?”
- Answer: “Animage”
- Supporting passage in Wikipedia: “… Hayao Miyazaki’s internationally renowned manga, “Nausicaä of the Valley of the Wind”, was serialized in “Animage” from 1982 through 1994…”
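To make this structure concrete, here is a minimal sketch of how one WiTQA record might be represented in code. The `WiTQAExample` class and its field names are our own illustration, not the dataset’s official schema, and the popularity counts below are placeholders.

```python
from dataclasses import dataclass

@dataclass
class WiTQAExample:
    """One WiTQA record (illustrative schema, not the official one)."""
    subject: str    # subject entity of the triple
    relation: str   # relation linking subject and object
    obj: str        # object entity, which is also the answer
    question: str   # natural-language question about the triple
    answer: str     # gold answer string
    passage: str    # gold supporting passage from Wikipedia
    s_count: int    # subject-entity popularity (frequency in Wikipedia)
    sr_count: int   # subject-relation pair popularity

example = WiTQAExample(
    subject="Nausicaä of the Valley of the Wind",
    relation="published in",
    obj="Animage",
    question=('What Japanese anime and entertainment magazine was '
              '"Nausicaä of the Valley of the Wind" published in?'),
    answer="Animage",
    passage=("... Hayao Miyazaki's internationally renowned manga ... "
             "was serialized in Animage from 1982 through 1994 ..."),
    s_count=0,   # placeholder; real counts come from Wikipedia statistics
    sr_count=0,  # placeholder
)
```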
The WiTQA dataset is unique in the following aspects:
- For each question, WiTQA provides two popularity scores:
- The frequency count of the subject-entity (question entity) in Wikipedia
- The frequency count of the specific subject-relation pair (entity-relation pair) in Wikipedia
- Each QA pair is associated with a supporting passage from Wikipedia
The subject-relation popularity score allows the factual knowledge of language models to be analyzed through a fine-grained, fact-centric lens; subject-entity popularity, by contrast, treats all facts about the same entity as equally popular. The gold supporting passages make it possible to isolate reasoning ability from retrieval errors when evaluating models. Together, these features enable a deep analysis of LLMs’ capabilities from multiple angles.
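As a concrete illustration of the two scores, they can be computed by counting occurrences over a collection of (subject, relation, object) triples extracted from Wikipedia. The sketch below is a simplification under that assumption; the function name and counting details are ours, and the exact counting procedure used for WiTQA may differ.

```python
from collections import Counter

def popularity_scores(triples):
    """Compute subject (S) and subject-relation (S-R) frequency counts.

    `triples` is an iterable of (subject, relation, object) tuples
    extracted from Wikipedia. Returns two Counters: one keyed by
    subject entity, one keyed by (subject, relation) pairs.
    """
    s_counts, sr_counts = Counter(), Counter()
    for subject, relation, _obj in triples:
        s_counts[subject] += 1               # entity-level popularity
        sr_counts[(subject, relation)] += 1  # fact-level popularity
    return s_counts, sr_counts

triples = [
    ("Nausicaä of the Valley of the Wind", "published in", "Animage"),
    ("Nausicaä of the Valley of the Wind", "author", "Hayao Miyazaki"),
]
s_counts, sr_counts = popularity_scores(triples)
print(s_counts["Nausicaä of the Valley of the Wind"])                     # 2
print(sr_counts[("Nausicaä of the Valley of the Wind", "published in")])  # 1
```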
Creating WiTQA involved several steps, starting with the extraction of triples from Wikipedia. We then applied a meticulous sampling process to ensure a diverse representation of entities and relations based on their occurrence frequencies. Our goal was to capture the real-world challenge LMs face: recalling facts across a wide spectrum of popularity. With 14,837 QA pairs (13,251 unique subject entities, 32 relations, and 7,642 unique object entities), WiTQA offers a comprehensive playground for evaluating the performance of RALMs in various scenarios. We demonstrate that the distribution of subject-relation popularity (S-R counts) in WiTQA is more diverse than in existing QA datasets such as EntityQuestions and PopQA.
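The frequency-based sampling can be pictured as stratified sampling over popularity buckets, so that both rare and popular facts are represented. The sketch below shows this idea under our own assumptions; the bucket boundaries and per-bucket quota are hypothetical, not the exact recipe used to build WiTQA.

```python
import random
from collections import defaultdict

def stratified_sample(items, count_of, boundaries=(10, 100, 1000),
                      per_bucket=100, seed=0):
    """Sample up to `per_bucket` items from each frequency bucket.

    `count_of` maps an item (e.g., a triple) to its S-R count;
    `boundaries` defines the bucket edges (hypothetical values).
    """
    buckets = defaultdict(list)
    for item in items:
        # number of boundaries the count exceeds = bucket index
        idx = sum(count_of(item) > b for b in boundaries)
        buckets[idx].append(item)
    rng = random.Random(seed)
    sample = []
    for bucket in buckets.values():
        sample.extend(rng.sample(bucket, min(per_bucket, len(bucket))))
    return sample
```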
Insights from the WiTQA Dataset
Our extensive experiments with WiTQA shed light on several critical aspects of RALMs. We observed that:
- Recall vs. Retrieval: LMs recall popular facts well without retrieval augmentation, and the larger the LM, the better its recall. Notably, for popular facts, larger LMs achieve higher QA accuracy than RALMs, because RALMs are misled by retrieval errors. To substantiate this, we showed a strong correlation between RALM performance and retrieval errors.
- When Retrieval Helps: For questions involving less common entities and relations, retrievers consistently outperform the recall abilities of LMs. This suggests that retrieval augmentation is particularly beneficial for answering questions about obscure or rarely mentioned facts. For rare entity-relation pairs about popular entities, however, retrieval accuracy drops: accurately identifying relevant passages from a large pool of passages mentioning the entity becomes challenging. Even the most advanced models, such as GPT-4, struggle with less common entity-relation pairs, highlighting a crucial area where retrieval augmentation can play a significant role.
- Adaptive Retrieval Systems: Leveraging these insights, we proposed a selective memory integration approach that adaptively decides whether to engage retrieval based on the frequencies of the entities and relations in the question, as sketched below. This approach improves QA accuracy by up to 10.1%, demonstrating the potential of more nuanced, context-aware RALMs.
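To illustrate the idea (and not the paper’s exact rule), selective retrieval can be implemented as a simple threshold test on the question’s popularity scores: skip retrieval when the fact is popular enough for the LM to recall on its own, and retrieve otherwise. The `sr_threshold` value and the two answerer callables below are hypothetical.

```python
def answer(question, sr_count, lm_answer, ralm_answer, sr_threshold=50):
    """Route a question to the LM alone or to the full RALM.

    `lm_answer` and `ralm_answer` are callables that answer a question
    without and with retrieval; `sr_threshold` is a tunable cutoff on
    subject-relation popularity (the value 50 is a hypothetical example).
    """
    if sr_count >= sr_threshold:
        # Popular fact: the LM has likely memorized it, and retrieval
        # errors would only risk misleading the model.
        return lm_answer(question)
    # Rare fact: the retriever usually beats the LM's parametric recall.
    return ralm_answer(question)
```

In practice, the threshold would be tuned on a validation set, trading off the LM’s parametric recall against the retriever’s error rate.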
Conclusion
Our exploration into the efficacy of retrieval augmentation using the WiTQA dataset offers valuable insights into the strengths and limitations of current QA systems. By highlighting when retrieval helps and when it might hurt, we provide insights into developing more sophisticated and nuanced RALMs. As we continue to push the boundaries of NLP, datasets like WiTQA will play a crucial role in guiding our journey towards more intelligent and versatile language models.
Check out the GitHub repository for WiTQA and experiment with the future of question answering today!
Are you intrigued by the possibilities of adaptive retrieval and want to dive deeper into our findings? Don’t miss out on our detailed research paper, and join us in advancing the state-of-the-art in question answering and language model augmentation.
Written by Seiji Maekawa, Hayate Iso, and Megagon Labs.
Follow us on LinkedIn and X to stay up to date with new research and projects.