Table of Links
2 LongFact: Using LLMs to generate a multi-topic benchmark for long-form factuality
3 SAFE: LLM agents as factuality autoraters
4 LLM agents can be better factuality annotators than humans
5 F1@k: Extending F1 with recall from human-preferred length
6 Larger LLMs are more factual
9 Conclusion, Acknowledgments, Author Contribution, and References
Appendix
3 SAFE: LLM AGENTS AS FACTUALITY AUTORATERS
A key obstacle in benchmarking long-form factuality is that a benchmark requires a reliable method of evaluating model responses. Existing automated evaluation methods such as BLEURT (Sellam et al., 2020), ROUGE (Lin, 2004; Ganesan, 2018), and prompting language models (Min et al., 2023; Vu et al., 2023) operate by comparing a model response to a preset reference answer or knowledge source. This works well for prompts that only require a short-answer response, as there is a definite and singular correct answer. However, this setting is suboptimal for prompts that require a long-form response, as it is difficult to compile predetermined answers/knowledge that comprehensively and nonredundantly cover all facts that a model response may contain.
Instead, we propose that there are two key insights that are crucial to accurately evaluate factuality in a long-form setting. First, we agree with the idea that long-form responses should be evaluated at the granularity of individual facts, following Min et al. (2023). This is more fine-grained than the granularity of sentences, as a single sentence may contain multiple individual facts (see Figure 1 for an example), and it is essential for optimal performance since isolating each individual fact allows for precise and focused evaluation. Second, in lieu of predetermined reference answers or knowledge sources, Google Search can be leveraged to obtain dynamic reference answers for each fact in a response with the best tradeoff among accuracy, accessibility, and timeliness (Augenstein et al., 2023). This ensures that evaluation is done with respect to reference answers that specifically cover the particular set of facts that a long-form response can contain.
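To make the fact-level granularity concrete, the following is a minimal illustration (a hypothetical example, not the one shown in Figure 1): a single sentence in a response can yield several individual facts, each made self-contained by resolving vague references against the surrounding response.

```python
# Hypothetical illustration of fact-level granularity: one sentence from a
# model response contains several individual facts. The entity names below
# are made up; in SAFE, a language model performs this split and resolves
# vague references ("she", "its") using the context of the full response.
sentence = "She joined the lab in 2015 and later led its robotics team."

individual_facts = [
    "Jane Doe joined the Example Lab in 2015.",
    "Jane Doe later led the Example Lab's robotics team.",
]

for fact in individual_facts:
    print(fact)
```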
For these reasons, we propose Search-Augmented Factuality Evaluator (SAFE),[4] which combines these insights by using a language model to (a) split a long-form response into individual self-contained facts, (b) determine whether each individual fact is relevant to answering the prompt in the context of the response, and (c) for each relevant fact, iteratively issue Google Search queries in a multi-step process and reason about whether the search results support or do not support the fact. We consider the key innovation in SAFE to be using a language model as an agent to generate multi-step Google Search queries and carefully reason about whether facts are supported by search results (Figure 3 and Appendix C.2 show examples of these reasoning chains).
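A minimal sketch of this three-step pipeline, assuming caller-supplied functions for each step (the released SAFE implementation prompts a language model for all three) and placeholder label names:

```python
from dataclasses import dataclass
from typing import Callable, List

# Labels SAFE assigns to individual facts.
SUPPORTED, IRRELEVANT, NOT_SUPPORTED = "supported", "irrelevant", "not supported"

@dataclass
class FactRating:
    fact: str
    label: str

def evaluate_response(
    prompt: str,
    response: str,
    split_facts: Callable[[str], List[str]],        # step (a), assumed helper
    is_relevant: Callable[[str, str, str], bool],   # step (b), assumed helper
    rate_with_search: Callable[[str], str],         # step (c), assumed helper
) -> List[FactRating]:
    """Sketch of the SAFE pipeline: split, filter by relevance, rate with search."""
    ratings = []
    # (a) Split the long-form response into individual self-contained facts.
    for fact in split_facts(response):
        # (b) Keep only facts that are relevant to answering the prompt in
        #     the context of the full response.
        if not is_relevant(prompt, response, fact):
            ratings.append(FactRating(fact, IRRELEVANT))
            continue
        # (c) Rate each relevant fact via multi-step Google Search queries.
        ratings.append(FactRating(fact, rate_with_search(fact)))
    return ratings
```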
As shown in Figure 1, to split a long-form response into individual self-contained facts, we first prompt the language model to split each sentence in a long-form response into individual facts. We then revise each individual fact to be self-contained by instructing the model to replace vague references (e.g., pronouns) with the proper entities that they are referring to in the context of the response. To rate each self-contained individual fact, we use the language model to reason about whether the fact is relevant to answering the prompt in the context of the response.[5] Each remaining relevant fact is then rated as “supported” or “not supported” using a multi-step approach. In each step, the model generates a search query based on the fact to rate and the search results that have previously been obtained. After a set number of steps, the model performs reasoning to determine whether the fact is supported by the search results, as shown in Figure 3. Once all facts are rated, the metrics that SAFE outputs for a given prompt–response pair are the number of “supported” facts, the number of “irrelevant” facts, and the number of “not-supported” facts. Our detailed implementation of SAFE is shown in Appendix C.1, and SAFE is publicly available at https://github.com/google-deepmind/long-form-factuality/tree/main/eval/safe.
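The per-fact rating loop and the final counts can be sketched as follows; `llm` and `google_search` stand in for the model and search calls used in the released code, and the prompt strings are illustrative rather than the actual SAFE prompts.

```python
from collections import Counter
from typing import Callable, Dict, List

def rate_fact(
    fact: str,
    llm: Callable[[str], str],            # assumed interface: text in, text out
    google_search: Callable[[str], str],  # assumed interface: query in, snippet out
    max_steps: int = 5,
) -> str:
    """Sketch of SAFE's multi-step rating of one self-contained fact."""
    results: List[str] = []
    for _ in range(max_steps):
        # The model proposes the next search query from the fact and the
        # search results gathered so far.
        query = llm(
            "Fact to check: " + fact + "\n"
            "Search results so far:\n" + "\n".join(results) + "\n"
            "Propose the next Google Search query."
        )
        results.append(google_search(query))
    # After a set number of steps, the model reasons about whether the
    # accumulated search results support the fact.
    verdict = llm(
        "Fact: " + fact + "\nSearch results:\n" + "\n".join(results) + "\n"
        "Answer 'supported' or 'not supported'."
    )
    return "not supported" if "not supported" in verdict.lower() else "supported"

def aggregate(labels: List[str]) -> Dict[str, int]:
    """Per-prompt SAFE output: counts of supported, irrelevant, not-supported facts."""
    counts = Counter(labels)
    return {
        "supported": counts["supported"],
        "irrelevant": counts["irrelevant"],
        "not_supported": counts["not supported"],
    }
```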
This paper is available on arxiv.
[4] SAFE is similar to and inspired by prior work such as Gao et al. (2023) and Wang et al. (2023).
[5] This identifies statements that should not be rated (e.g., “I don’t know the answer to that”).
Authors:
(1) Jerry Wei, Google DeepMind and a lead contributor;
(2) Chengrun Yang, Google DeepMind and a lead contributor;
(3) Xinying Song, Google DeepMind and a lead contributor;
(4) Yifeng Lu, Google DeepMind and a lead contributor;
(5) Nathan Hu, Google DeepMind and Stanford University;
(6) Jie Huang, Google DeepMind and University of Illinois at Urbana-Champaign;
(7) Dustin Tran, Google DeepMind;
(8) Daiyi Peng, Google DeepMind;
(9) Ruibo Liu, Google DeepMind;
(10) Da Huang, Google DeepMind;
(11) Cosmo Du, Google DeepMind;
(12) Quoc V. Le, Google DeepMind.