TnT-LLM: Automating Text Taxonomy Generation and Classification With Large Language Models

Written by languagemodels | Published 2025/04/18

TL;DR: We begin with a high-level overview of TnT-LLM, our proposed two-phase framework for 1) LLM-powered taxonomy generation and 2) LLM-augmented text classification.

Table of Links

Abstract and 1 Introduction

2 Related Work

3 Method and 3.1 Phase 1: Taxonomy Generation

3.2 Phase 2: LLM-Augmented Text Classification

4 Evaluation Suite and 4.1 Phase 1 Evaluation Strategies

4.2 Phase 2 Evaluation Strategies

5 Experiments and 5.1 Data

5.2 Taxonomy Generation

5.3 LLM-Augmented Text Classification

5.4 Summary of Findings and Suggestions

6 Discussion and Future Work, and References

A. Taxonomies

B. Additional Results

C. Implementation Details

D. Prompt Templates

3 METHOD

We begin with a high-level overview of TnT-LLM, our proposed two-phase framework for 1) LLM-powered taxonomy generation and 2) LLM-augmented text classification. In the first phase, we sample a small-scale, representative subset of a corpus and perform zero-shot, multi-stage taxonomy generation in an iterative manner inspired by stochastic gradient descent [3]. In the second phase, we sample a larger dataset and use an LLM, guided by the taxonomy produced in Phase 1, to classify each instance. These LLM labels are then treated as “pseudo-labels” for training a lightweight text classifier. Once training is complete, the lightweight classifier is deployed to label the entire corpus offline, and it can also serve online, real-time classification requests.
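To make the flow concrete, here is a minimal sketch of the pipeline described above. It is an illustration rather than the paper's implementation: the function names and sample sizes are hypothetical, the Phase 1 and LLM-labeling steps are left as stubs (they are described in Sections 3.1 and 3.2), and the TF-IDF + logistic regression model is just one plausible stand-in for a lightweight classifier.

```python
# Illustrative sketch of the two-phase TnT-LLM pipeline (not the paper's exact code).
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def generate_taxonomy_phase1(sample: list[str]) -> list[str]:
    """Phase 1 (Section 3.1): prompt an LLM to derive a label taxonomy. Stub for this sketch."""
    raise NotImplementedError


def llm_classify(document: str, taxonomy: list[str]) -> str:
    """Phase 2 (Section 3.2): prompt an LLM to pick one label from `taxonomy`. Stub for this sketch."""
    raise NotImplementedError


def tnt_llm_pipeline(corpus: list[str], small_n: int = 500, large_n: int = 10_000):
    # Phase 1: a small, representative sample yields an LLM-generated label taxonomy.
    taxonomy = generate_taxonomy_phase1(random.sample(corpus, small_n))

    # Phase 2: a larger sample is labeled by an LLM using that taxonomy ("pseudo-labels").
    phase2_sample = random.sample(corpus, large_n)
    pseudo_labels = [llm_classify(doc, taxonomy) for doc in phase2_sample]

    # The pseudo-labels supervise a lightweight classifier, which can then label the
    # full corpus offline or serve online, real-time requests.
    classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    classifier.fit(phase2_sample, pseudo_labels)
    return taxonomy, classifier
```

Any similarly cheap, deployable model could play the role of the final classifier; the key design point is that the expensive LLM is only called on samples, never on the full corpus.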

3.1 Phase 1: Taxonomy Generation

Phase 1 of TnT-LLM is inspired by the classic mixture-model clustering process [17], but implemented in a prompt-based manner. We leverage a “stochastic optimization” approach [19] to iteratively update the intermediate taxonomy, so that a large and dynamic corpus sample can be handled effectively. Depending on the desired granularity of the taxonomy, we suggest using a “small-to-medium” corpus sample that is representative of the corpus in this phase: the sample should be large enough to capture the diversity of the corpus, but not so large that it incurs unnecessary cost.

• Stage 1: Summarization. To normalize all text samples and extract their most salient information, we first generate a concise and informative summary of each document in the sample. Specifically, we prompt an LLM to summarize each document, providing a short blurb about the intended use-case for the summary (e.g., intent detection) and a target summary length (e.g., 20 words); the full prompt template is provided in Figure 8 in the supplemental details. This stage reduces the size and variability of the input documents while extracting the aspects of each document most relevant to the use-case, which we find is especially important for label spaces that are not evident from surface-level semantics (e.g., user intent). The stage is also relatively fast, as it can be executed concurrently for each input document with a cost-efficient LLM such as GPT-3.5-Turbo; an illustrative sketch of such a call is given after these stage descriptions.

• Stage 2: Taxonomy Creation, Update, and Review. We next create and refine a label taxonomy using the summaries from the previous stage. As in SGD, we divide the summaries into equal-sized minibatches, which we then process with three types of zero-shot LLM reasoning prompts in sequence. The first, an initial generation prompt, takes the first minibatch and produces an initial label taxonomy as output. The second, a taxonomy update prompt, iteratively updates the intermediate label taxonomy with each new minibatch, performing three main tasks per step: 1) evaluating the given taxonomy on the new data; 2) identifying issues and suggestions based on that evaluation; and 3) modifying the taxonomy accordingly. Finally, after the taxonomy has been updated a specified number of times, we apply a review prompt that checks the formatting and quality of the output taxonomy; its output is regarded as the final taxonomy produced by Phase 1. In all three prompts, we provide the use-case instruction, which specifies the goal and the format of the desired label taxonomy (e.g., the desired number of labels, the target number of words per label), alongside the minibatch. The full prompt templates are provided in Figure 10 in the supplemental details, and a skeleton of this prompt chain is sketched after these stage descriptions.
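For concreteness, here is an illustrative version of the Stage 1 summarization call. The prompt wording, default parameters, and the use of the OpenAI Python SDK with GPT-3.5-Turbo are assumptions made for this sketch; the paper's actual template is the one referenced in Figure 8.

```python
# Illustrative Stage 1 call: summarize one document with a cost-efficient LLM.
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI API key in the environment


def summarize(document: str, use_case: str = "user intent detection", max_words: int = 20) -> str:
    prompt = (
        f"Summarize the following text in at most {max_words} words, keeping only "
        f"the information most relevant to {use_case}:\n\n{document}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # cheap enough to run concurrently over every sampled document
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```

And here is a skeleton of the Stage 2 prompt chain. The prompt texts are loose paraphrases written for illustration (the paper's real templates are those referenced in Figure 10), `call_llm` is a hypothetical stand-in for whichever chat model is used, and the batch size and label counts are placeholders.

```python
# Skeleton of the Stage 2 prompt chain: one generation prompt, repeated update
# prompts over the remaining minibatches, then a final review prompt.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Send `prompt` to an LLM and return its reply.")


def generate_taxonomy(summaries: list[str], use_case: str, batch_size: int = 64) -> str:
    batches = [summaries[i:i + batch_size] for i in range(0, len(summaries), batch_size)]

    # Initial generation prompt on the first minibatch.
    first = "\n".join(batches[0])
    taxonomy = call_llm(
        f"Given these summaries for the use case '{use_case}':\n{first}\n"
        "Propose a label taxonomy (e.g., about 10 labels of a few words each)."
    )

    # Update prompt: evaluate the taxonomy on each new minibatch, identify issues
    # and suggestions, then revise the taxonomy accordingly.
    for batch in batches[1:]:
        joined = "\n".join(batch)
        taxonomy = call_llm(
            f"Current taxonomy:\n{taxonomy}\n\nNew summaries:\n{joined}\n"
            "Evaluate how well the taxonomy fits these summaries, list any issues "
            "and suggestions, then output a revised taxonomy."
        )

    # Review prompt: final formatting and quality check; its output is the final
    # taxonomy of Phase 1.
    return call_llm(
        f"Review this taxonomy for formatting and quality, then output the final version:\n{taxonomy}"
    )
```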

Notice that this process naturally lends itself to hierarchy: After a first round of taxonomy generation, we can rerun Stage 2 for each subgroup of categorized samples to create new, more granular levels in the taxonomy. An overview of our proposed approach is presented in Figure 2.
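A rough sketch of that hierarchical extension, reusing the `generate_taxonomy` skeleton above: `assign_label` is a hypothetical helper (e.g., an LLM prompt that picks the best-fitting top-level label for a summary), and the two-level structure here only illustrates the idea, not the paper's exact procedure.

```python
# Sketch of the hierarchical extension: assign each summary to a top-level label,
# then rerun Stage 2 inside each group to obtain a more granular sub-taxonomy.
def assign_label(summary: str, taxonomy: str) -> str:
    raise NotImplementedError("Prompt an LLM to choose the best-fitting label.")


def build_two_level_taxonomy(summaries: list[str], use_case: str) -> dict[str, str]:
    top_level = generate_taxonomy(summaries, use_case)  # first round of Stage 2

    groups: dict[str, list[str]] = {}
    for summary in summaries:
        groups.setdefault(assign_label(summary, top_level), []).append(summary)

    # Second, more granular round of Stage 2 within each top-level group.
    return {label: generate_taxonomy(group, use_case) for label, group in groups.items()}
```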

The two stages above can also be read as a prompt-based analogue of a classic machine learning pipeline:

• Stage 1: Feature Representation. Our summarization stage is analogous to the featurization step in classic machine learning, where raw text inputs are projected onto a vector space via a feature transformation such as an embedding model. In our case, the output summary of each data point can be viewed as a concise and informative feature representation of the original text (𝒙𝑖).

• Stage 2: Stochastic Gradient Descent. The main taxonomy creation and update stage resembles prompt optimization with Stochastic Gradient Descent (SGD) [19], where the generation prompt is used to initialize the taxonomy (i.e., the parameters Θ₀), which is then optimized via SGD through the update prompt chain. In each update prompt, we assess how the current taxonomy (Θₘ) fits the given batch of data (i.e., calculating the loss function defined in Eq. (1)), then analyze and “backpropagate” the errors to update the taxonomy, i.e., Θₘ₊₁ = Θₘ − η∇L(Θₘ), where η refers to the learning rate, which we assume is implicitly adjusted by the LLM.
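Spelled out in one place, the analogy reads as follows. The notation dᵢ (raw document) and B₀, Bₘ (minibatch index sets) is introduced here only for illustration, and the “loss” and “gradient” are conceptual: the LLM expresses the evaluation and the revision in natural language rather than computing anything numerically.

```latex
\begin{align*}
x_i &= \mathrm{LLM}_{\mathrm{summarize}}(d_i)
  && \text{(Stage 1: featurization of raw document } d_i\text{)} \\
\Theta_0 &= \mathrm{LLM}_{\mathrm{generate}}\bigl(\{x_i\}_{i \in B_0}\bigr)
  && \text{(Stage 2: initialization from the first minibatch)} \\
\Theta_{m+1} &= \Theta_m - \eta \, \nabla \mathcal{L}(\Theta_m)
  && \text{(Stage 2: update after evaluating } \Theta_m \text{ on minibatch } B_m\text{)}
\end{align*}
```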

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.

Authors:

(1) Mengting Wan, Microsoft Corporation;

(2) Tara Safavi (Corresponding authors), Microsoft Corporation;

(3) Sujay Kumar Jauhar, Microsoft Corporation;

(4) Yujin Kim, Microsoft Corporation;

(5) Scott Counts, Microsoft Corporation;

(6) Jennifer Neville, Microsoft Corporation;

(7) Siddharth Suri, Microsoft Corporation;

(8) Chirag Shah, University of Washington (work done while at Microsoft);

(9) Ryen W. White, Microsoft Corporation;

(10) Longqi Yang, Microsoft Corporation;

(11) Reid Andersen, Microsoft Corporation;

(12) Georg Buscher, Microsoft Corporation;

(13) Dhruv Joshi, Microsoft Corporation;

(14) Nagu Rangan, Microsoft Corporation.

