Evaluation Metrics for Assessing LLM Performance on Syllogistic Tasks

by Large Models (dot tech), December 14th, 2024

Too Long; Didn't Read

The evaluation metrics for LLM performance include accuracy, F1-score, recall, and precision for Task 1, and reasoning accuracy (RA) for Task 2. Supplementary metrics such as non-empty output, irrelevant text, and faithfulness are also reported to verify compliance with task guidelines and to ensure meaningful responses.
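As a rough illustration of how these quantities could be computed, here is a minimal Python sketch using scikit-learn for the Task 1 classification scores. The reasoning-accuracy and supplementary-metric functions reflect my own assumptions about how those measures are defined (exact-match over Task 2 answers, and simple response-level rates); the precise formulas are given in the paper, and all function and variable names below (task1_metrics, gold_labels, etc.) are hypothetical placeholders, not part of the SylloBio-NLI codebase.

```python
# Hedged sketch of the evaluation metrics described above.
# Assumes Task 1 is a binary entailment/non-entailment classification
# and Task 2 answers can be compared by exact match; these are assumptions,
# not the paper's official implementation.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def task1_metrics(gold_labels, predicted_labels):
    """Accuracy, precision, recall, and F1 for Task 1 (binary labels)."""
    return {
        "accuracy": accuracy_score(gold_labels, predicted_labels),
        "precision": precision_score(gold_labels, predicted_labels, zero_division=0),
        "recall": recall_score(gold_labels, predicted_labels, zero_division=0),
        "f1": f1_score(gold_labels, predicted_labels, zero_division=0),
    }

def task2_reasoning_accuracy(gold_answers, model_answers):
    """Reasoning accuracy (RA), assumed here as the fraction of Task 2
    instances where the model's answer exactly matches the gold answer."""
    correct = sum(g == m for g, m in zip(gold_answers, model_answers))
    return correct / len(gold_answers)

def supplementary_metrics(responses, has_irrelevant_text):
    """Share of non-empty outputs and of responses flagged as containing
    irrelevant text (faithfulness would require a separate check)."""
    n = len(responses)
    return {
        "non_empty_output": sum(bool(r.strip()) for r in responses) / n,
        "irrelevant_text": sum(has_irrelevant_text) / n,
    }
```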
  1. Abstract and Introduction
  2. SylloBio-NLI
  3. Empirical Evaluation
  4. Related Work
  5. Conclusions
  6. Limitations and References


A. Formalization of the SylloBio-NLI Resource Generation Process

B. Formalization of Tasks 1 and 2

C. Dictionary of gene and pathway membership

D. Domain-specific pipeline for creating NL instances and E. Accessing LLMs

F. Experimental Details

G. Evaluation Metrics

H. Prompting LLMs - Zero-shot prompts

I. Prompting LLMs - Few-shot prompts

J. Results: Misaligned Instruction-Response

K. Results: Ambiguous Impact of Distractors on Reasoning

L. Results: Models Prioritize Contextual Knowledge Over Background Knowledge

M. Supplementary Figures and N. Supplementary Tables

G Evaluation Metrics


Authors:

(1) Magdalena Wysocka, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;

(2) Danilo S. Carvalho, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom and Department of Computer Science, Univ. of Manchester, United Kingdom;

(3) Oskar Wysocki, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;

(4) Marco Valentino, Idiap Research Institute, Switzerland;

(5) André Freitas, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom, Department of Computer Science, Univ. of Manchester, United Kingdom and Idiap Research Institute, Switzerland.


This paper is available on arXiv under the CC BY-NC-SA 4.0 license.