2 RELATED WORK
Taxonomy Generation. Prior work in taxonomy generation falls into manual and automatic approaches. Handcrafted taxonomies, beyond being expensive to construct, tend to be developed for specific downstream tasks (e.g., web search intent analysis [4, 20], chatbot intent detection [31]) or tied to the development of specific datasets [22, 25]. Automated approaches, on the other hand, scale better but either rely on term extraction from corpora to obtain labels, which may hinder interpretability and/or coverage [24, 33], or require a set of seed labels in order to generate new ones [32]. TnT-LLM, in contrast, is automatic, abstractive (i.e., labels describe the corpus but need not be directly extracted from it), and requires no seed labels. Moreover, TnT-LLM treats taxonomy generation and text classification as interrelated problems in an end-to-end pipeline, whereas prior work has tended to focus mainly on the quality of the taxonomy produced, without considering its downstream utility for classification.
Text Clustering and Topic Modeling. Text clustering and topic modeling "invert" the traditional approach of first defining a label set, then applying the labels to the corpus. Given a set of documents, such approaches first group the documents into topical clusters using various definitions of textual similarity, then post-hoc label or summarize the clusters [1, 29]. While traditional approaches in theory accomplish the same goals as TnT-LLM, they suffer from a lack of interpretability [5], as they typically do not assign intelligible labels to clusters. More recently, attempts have been made to overcome these problems by using LLMs for topic modeling [18, 30], though these approaches still require supervision through either a predefined taxonomy [30] or a seed set of topics [18].
LLMs as Annotators. Recent work has explored using LLMs to replace human annotation for labor-intensive tasks such as search relevance quality labeling [28], topic and stance detection [8], and various computational social science labeling tasks [34]. These studies have found that, in general, LLMs perform on par with or even better than crowd-workers [15], often at a fraction of the cost. In the same vein, we explore using LLMs as annotators for text classification, although our main goal is to scale the process by distilling LLMs' label-specific capabilities into more efficient, lightweight classifiers.
This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.
Authors:
(1) Mengting Wan, Microsoft Corporation;
(2) Tara Safavi (Corresponding author), Microsoft Corporation;
(3) Sujay Kumar Jauhar, Microsoft Corporation;
(4) Yujin Kim, Microsoft Corporation;
(5) Scott Counts, Microsoft Corporation;
(6) Jennifer Neville, Microsoft Corporation;
(7) Siddharth Suri, Microsoft Corporation;
(8) Chirag Shah, University of Washington (work done while at Microsoft);
(9) Ryen W. White, Microsoft Corporation;
(10) Longqi Yang, Microsoft Corporation;
(11) Reid Andersen, Microsoft Corporation;
(12) Georg Buscher, Microsoft Corporation;
(13) Dhruv Joshi, Microsoft Corporation;
(14) Nagu Rangan, Microsoft Corporation.