Table of Links
2 LongFact: Using LLMs to generate a multi-topic benchmark for long-form factuality
3 SAFE: LLM agents as factuality autoraters
4 LLM agents can be better factuality annotators than humans
5 F1@k: Extending F1 with recall from human-preferred length
6 Larger LLMs are more factual
9 Conclusion, Acknowledgments, Author Contribution, and References
Appendix
2 LONGFACT: USING LLMS TO GENERATE A MULTI-TOPIC BENCHMARK FOR LONG-FORM FACTUALITY
While there are many existing prompt sets for factuality, there are no comprehensive prompt sets for examining factuality in long-form responses spanning several paragraphs (as shown in the right side of Figure 2). For example, many established benchmarks such as TruthfulQA (Lin et al., 2022), HaluEval (Li et al., 2023), FreshQA (Vu et al., 2023), HalluQA (Cheng et al., 2023b), and FELM (Chen et al., 2023) mostly consist of prompts that test knowledge of a single factoid (e.g., “How old is the world’s oldest verified living person?”) and thus only require a short response rather than several paragraphs to be fully answered. Other factuality datasets such as FActScore (Min et al., 2023) may require long-form responses but do not cover a broad range of topics.
For this reason, we present LongFact—a prompt set designed to probe a model’s factuality when generating long-form responses that may span multiple paragraphs. We generate LongFact by prompting GPT-4 to generate questions that ask about a specific concept or object within a given topic (e.g., “biology”) and that require a long-form response containing multiple detailed factoids. LongFact consists of two tasks, LongFact-Concepts and LongFact-Objects, separated based on whether the questions ask about concepts or objects. For both tasks, we use the same set of 38 manually-selected topics listed in Appendix B.2 (the left side of Figure 2 shows the breakdown of topics with respect to four supercategories). We generate 30 unique prompts per topic for a total of 1,140 prompts per task. Additional details about our data-generation process can be found in Appendix B.1, and canary text for LongFact is shown in Appendix A.11. LongFact is publicly available at https://github.com/google-deepmind/long-form-factuality/tree/main/longfact.
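To make the generation loop concrete, the sketch below shows one plausible way to sample 30 unique questions per topic from a GPT-4-style model. It is illustrative only and not the authors’ actual pipeline (whose details are in Appendix B.1): it assumes an OpenAI-style chat client, and the meta-prompt wording, topic list, and helper names (`make_instruction`, `generate_longfact_prompts`) are hypothetical.

```python
# Illustrative LongFact-style prompt generation (assumed OpenAI-style chat API).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TOPICS = ["biology", "astronomy", "world history"]  # the paper uses 38 manually-selected topics
PROMPTS_PER_TOPIC = 30


def make_instruction(topic: str, task: str, existing: list[str]) -> str:
    """Build a meta-prompt asking for one new question (hypothetical wording)."""
    kind = "a specific concept" if task == "concepts" else "a specific object"
    avoid = "\n".join(f"- {q}" for q in existing) or "(none yet)"
    return (
        f"Write one question about {kind} within the topic '{topic}'. "
        "The question should require a long-form answer containing multiple "
        "detailed factoids.\n"
        f"Do not repeat any of these existing questions:\n{avoid}\n"
        "Return only the question."
    )


def generate_longfact_prompts(topic: str, task: str) -> list[str]:
    """Sample questions until we have the target number of unique prompts for a topic."""
    prompts: list[str] = []
    while len(prompts) < PROMPTS_PER_TOPIC:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": make_instruction(topic, task, prompts)}],
            temperature=1.0,
        )
        question = response.choices[0].message.content.strip()
        if question and question not in prompts:  # simple string-level dedup
            prompts.append(question)
    return prompts


if __name__ == "__main__":
    dataset = {
        task: {topic: generate_longfact_prompts(topic, task) for topic in TOPICS}
        for task in ("concepts", "objects")
    }
    print(dataset["objects"]["biology"][:3])
```

With 38 topics and 30 prompts per topic, such a loop yields the 1,140 prompts per task reported above; the released dataset at the GitHub URL is the authoritative version.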
This paper is available on arXiv under a CC BY 4.0 DEED license.
Authors:
(1) Jerry Wei, Google DeepMind and a lead contributor;
(2) Chengrun Yang, Google DeepMind and a lead contributor;
(3) Xinying Song, Google DeepMind and a lead contributor;
(4) Yifeng Lu, Google DeepMind and a lead contributor;
(5) Nathan Hu, Google DeepMind and Stanford University;
(6) Jie Huang, Google DeepMind and University of Illinois at Urbana-Champaign;
(7) Dustin Tran, Google DeepMind;
(8) Daiyi Peng, Google DeepMind;
(9) Ruibo Liu, Google DeepMind;
(10) Da Huang, Google DeepMind;
(11) Cosmo Du, Google DeepMind;
(12) Quoc V. Le, Google DeepMind.