How LongFact Helps AI Models Improve Their Accuracy Across Multiple Topics

by Language Models (dot tech), April 8th, 2025

Too Long; Didn't Read

LongFact is a new dataset designed to evaluate AI factuality in long-form responses across 38 topics. It comprises two tasks of 1,140 prompts each, targeting multi-paragraph, detailed answers.

Abstract and 1 Introduction

2 LongFact: Using LLMs to generate a multi-topic benchmark for long-form factuality

3 SAFE: LLM agents as factuality autoraters

4 LLM agents can be better factuality annotators than humans

5 F1@k: Extending F1 with recall from human-preferred length

6 Larger LLMs are more factual

7 Related Work

8 Limitations

9 Conclusion, Acknowledgments, Author Contribution, and References

Appendix

A. Frequently asked questions

B. LongFact details

C. SAFE details

D. Metric details

E. Further analysis

2 LongFact: Using LLMs to generate a multi-topic benchmark for long-form factuality

While there are many existing prompt sets for factuality, there are no comprehensive prompt sets for examining factuality in long-form responses that span several paragraphs (as shown on the right side of Figure 2). For example, many established benchmarks such as TruthfulQA (Lin et al., 2022), HaluEval (Li et al., 2023), FreshQA (Vu et al., 2023), HalluQA (Cheng et al., 2023b), and FELM (Chen et al., 2023) mostly consist of prompts that test knowledge of a single factoid (e.g., “How old is the world’s oldest verified living person?”) and thus only require a short response rather than several paragraphs to be fully answered. Other factuality datasets such as FActScore (Min et al., 2023) may require long-form responses but do not cover a broad range of topics.


Figure 2: We present LongFact, a prompt set designed to probe a model’s factuality when generating long-form responses. Left: LongFact consists of 38 topics (listed in Appendix B.2) categorized into four supercategories. Right: compared to existing factuality datasets, LongFact (a) tests long-form responses rather than short answers and (b) covers a broad range of topics.


For this reason, we present LongFact, a prompt set designed to probe a model’s factuality when generating long-form responses that may span multiple paragraphs. We generate LongFact by prompting GPT-4 to generate questions that ask about a specific concept or object within a given topic (e.g., “biology”) and that require a long-form response containing multiple detailed factoids. LongFact consists of two tasks, LongFact-Concepts and LongFact-Objects, separated based on whether the questions ask about concepts or objects. For both tasks, we use the same set of 38 manually-selected topics listed in Appendix B.2 (the left side of Figure 2 shows the breakdown of topics with respect to four supercategories). We generate 30 unique prompts per topic for a total of 1,140 prompts per task. Additional details about our data-generation process can be found in Appendix B.1, and canary text for LongFact is shown in Appendix A.11. LongFact is publicly available at https://github.com/google-deepmind/long-form-factuality/tree/main/longfact.
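
To make the generation loop concrete, here is a minimal sketch in Python. It assumes access to GPT-4 through the OpenAI SDK; the topic list, prompt template, and constants below are illustrative placeholders, not the authors' exact instructions (those appear in Appendix B.1).

```python
# Minimal sketch of a LongFact-style prompt-generation loop. Assumes the
# OpenAI Python SDK and an OPENAI_API_KEY in the environment. The topic
# list, template, and constants are illustrative stand-ins; the authors'
# exact instructions and exemplars are given in Appendix B.1.
from openai import OpenAI

client = OpenAI()

TOPICS = ["biology", "world history"]  # the paper uses 38 hand-picked topics
PROMPTS_PER_TOPIC = 30                 # 30 prompts x 38 topics = 1,140 per task

# Illustrative template, not the authors' wording.
TEMPLATE = (
    "Write a question that asks about a specific {unit} within the topic "
    "'{topic}' and that requires a long-form answer containing multiple "
    "detailed factoids. Respond with only the question."
)

def generate_prompts(topic: str, unit: str) -> list[str]:
    """Sample model-written questions for one topic, deduplicating until
    the target count of unique prompts is reached."""
    prompts: set[str] = set()
    while len(prompts) < PROMPTS_PER_TOPIC:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": TEMPLATE.format(unit=unit, topic=topic)}],
            temperature=1.0,  # nonzero temperature encourages diverse questions
        )
        prompts.add(response.choices[0].message.content.strip())
    return sorted(prompts)

# The two tasks differ only in whether the questions target concepts or objects.
longfact_concepts = {t: generate_prompts(t, "concept") for t in TOPICS}
longfact_objects = {t: generate_prompts(t, "object") for t in TOPICS}
```

The set-based loop is the simplest way to enforce the requirement of 30 unique prompts per topic; any stronger near-duplicate filtering would follow the process described in Appendix B.1.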


This paper is available on arXiv under a CC BY 4.0 license.


Authors:

(1) Jerry Wei, Google DeepMind and a lead contributor;

(2) Chengrun Yang, Google DeepMind and a lead contributor;

(3) Xinying Song, Google DeepMind and a lead contributor;

(4) Yifeng Lu, Google DeepMind and a lead contributor;

(5) Nathan Hu, Google DeepMind and Stanford University;

(6) Jie Huang, Google DeepMind and University of Illinois at Urbana-Champaign;

(7) Dustin Tran, Google DeepMind;

(8) Daiyi Peng, Google DeepMind;

(9) Ruibo Liu, Google DeepMind;

(10) Da Huang, Google DeepMind;

(11) Cosmo Du, Google DeepMind;

(12) Quoc V. Le, Google DeepMind.

