4 EVALUATION SUITE
Due to the unsupervised nature of the problem we study and the lack of a benchmark standard, quantitative evaluation of end-to-end taxonomy generation and text classification is challenging. We therefore design a suite of strategies to evaluate TnT-LLM. These strategies fall into three categories, depending on the type and source of the evaluation criteria:
• Deterministic automatic evaluation: This type of approach is scalable and consistent, but requires well-defined, gold standard rules and annotations. It is less applicable for evaluating the abstract aspects studied in this paper, such as the quality and usefulness of a label taxonomy.
• Human evaluation: These approaches are useful for evaluating the abstract aspects that the automatic evaluations cannot address. However, they are also time-consuming, expensive, and may encounter data privacy and compliance constraints.
• LLM-based evaluations: Here, LLMs are used to perform the same or similar tasks as human evaluators. This type of evaluation is more scalable and cost-effective than human evaluation, albeit potentially subject to biases and errors if not applied properly. We therefore aim to combine and validate LLM-based evaluation with human evaluation metrics on small corpora so that we can extrapolate conclusions with sufficient statistical power.
4.1 Phase 1 Evaluation Strategies
Following prior studies [23, 30], we evaluate a label taxonomy on three criteria: coverage, accuracy, and relevance to the use-case instruction. Note that applying these metrics requires each method's native primary label assignment. For clustering-based methods, this is instantiated through the clustering algorithm itself; for TnT-LLM, it is done via the label assignment prompt described in Section 3.2. We also note that the label accuracy and use-case relevance metrics discussed here apply to both human and LLM raters.
Taxonomy Coverage. This metric measures how comprehensively the generated label taxonomy covers the corpus. Conventional text clustering approaches (e.g., embedding-based k-means) often achieve 100% coverage by design. In our LLM-based taxonomy generation pipeline, we deliberately include an ‘Other’ or ‘Undefined’ category in the label assignment prompt and measure the proportion of data points assigned to it: the lower this proportion, the higher the taxonomy coverage.
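As an illustration, the sketch below computes coverage from a list of primary label assignments. The `taxonomy_coverage` helper, the catch-all label names, and the example labels are our own illustrative choices, not the paper's implementation.

```python
from collections import Counter

# Illustrative catch-all labels; coverage is one minus the share of data
# points mapped to the 'Other'/'Undefined' category.
OUT_OF_TAXONOMY = {"Other", "Undefined"}

def taxonomy_coverage(assigned_labels: list[str]) -> float:
    """Fraction of data points assigned to a defined (non-catch-all) label."""
    if not assigned_labels:
        return 0.0
    counts = Counter(assigned_labels)
    uncovered = sum(counts[label] for label in OUT_OF_TAXONOMY)
    return 1.0 - uncovered / len(assigned_labels)

# Example: 2 of 6 points fall outside the taxonomy -> coverage = 4/6 ≈ 0.67
print(taxonomy_coverage([
    "Content Creation", "Other", "Code Assistance",
    "Undefined", "Content Creation", "Travel Planning",
]))
```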
Label Accuracy. This metric quantifies how well the assigned label reflects the text data point relative to the other labels in the same taxonomy. Analogous to mixture model clustering, the primary label should be the most probable one given the text. We assume human and LLM raters can assess a label's fit from its name and description. We treat accuracy as a pairwise comparison task: for each text, we take its primary label and a random negative label from the same taxonomy, and ask a rater to choose the more accurate label based on their names and descriptions.[1] If the rater correctly identifies the positive label, we count it as a "Hit" and report the average hit rate as the label accuracy metric. We do not explicitly evaluate the overlap across category labels, but rather expect it to be implicitly reflected in the pairwise label accuracy metric.
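A minimal sketch of the hit-rate computation is shown below, assuming a generic `rater` callable that wraps either a human judgment or an LLM rating prompt (label descriptions are omitted for brevity). The data structure and function names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PairwiseItem:
    text: str
    positive_label: str  # primary label assigned to this text
    negative_label: str  # random other label drawn from the same taxonomy

def label_accuracy(
    items: list[PairwiseItem],
    rater: Callable[[str, str, str], Optional[str]],
) -> float:
    """Average hit rate over pairwise comparisons.

    `rater(text, label_a, label_b)` returns the label it judges more accurate,
    or None for the "None" option; a None choice counts as a miss.
    """
    if not items:
        return 0.0
    hits = 0
    for item in items:
        # Shuffle presentation order so the rater cannot exploit position.
        a, b = item.positive_label, item.negative_label
        if random.random() < 0.5:
            a, b = b, a
        hits += int(rater(item.text, a, b) == item.positive_label)
    return hits / len(items)
```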
Relevance to Use-case Instruction. This metric measures how relevant the generated label taxonomy is to the use-case instruction. For example, “Content Creation” is relevant to an instruction to “understand user intent in a conversation”, while “History and Culture” is not. We operationalize this as a binary rating task: for each instance, we provide its primary label name and description to a human or LLM rater and ask them to decide whether the label is relevant to the given use-case instruction. We instruct the rater to treat the presented instance as context and to rate relevance conditioned on the label’s ability to accurately describe some aspect of the text input. The goal of this metric is not to re-evaluate label accuracy, but to rule out taxonomies that appear relevant to the use-case instruction yet are irrelevant to the corpus sample, and are therefore useless for downstream applications.
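The sketch below illustrates one way to pose the binary relevance question to an LLM rater and aggregate the ratings; the prompt wording and the `relevance_rate` helper are assumptions for illustration, not the prompt used in the paper.

```python
# Hypothetical prompt template for the binary relevance task; the paper's
# exact rating prompt is not reproduced here.
RELEVANCE_PROMPT = """Use-case instruction: {instruction}

Text (context): {text}

Primary label: {label_name} - {label_description}

Using the text only as context, is this label relevant to the use-case
instruction, i.e., does it accurately describe some aspect of the text in a
way that serves the instruction? Answer "Yes" or "No"."""

def relevance_rate(ratings: list[str]) -> float:
    """Proportion of instances rated relevant ('Yes') by the human or LLM rater."""
    if not ratings:
        return 0.0
    return sum(r.strip().lower().startswith("yes") for r in ratings) / len(ratings)
```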
This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.
[1] Raters are also offered a "None" option besides the pair, but are instructed to minimize the use of it.
Authors:
(1) Mengting Wan, Microsoft Corporation;
(2) Tara Safavi (Corresponding authors), Microsoft Corporation;
(3) Sujay Kumar Jauhar, Microsoft Corporation;
(4) Yujin Kim, Microsoft Corporation;
(5) Scott Counts, Microsoft Corporation;
(6) Jennifer Neville, Microsoft Corporation;
(7) Siddharth Suri, Microsoft Corporation;
(8) Chirag Shah, University of Washington (work done while at Microsoft);
(9) Ryen W. White, Microsoft Corporation;
(10) Longqi Yang, Microsoft Corporation;
(11) Reid Andersen, Microsoft Corporation;
(12) Georg Buscher, Microsoft Corporation;
(13) Dhruv Joshi, Microsoft Corporation;
(14) Nagu Rangan, Microsoft Corporation.