Hyperparameters and Baseline Experiments in Dialog Systems

Written by feedbackloop | Published 2024/01/17

TL;DR: Baseline experiments in dialog systems hinge on a few key hyperparameter settings. Models were trained for five epochs (ten for erroneous dialogs) with a batch size of 32, a learning rate of 5e-5, and the AdamW optimizer; LLAMA used separate LoRA finetuning parameters. The results in Table 17 show how data quality, system errors, and model performance interact, measured by F1-Score and BLEU.

Authors:

(1) Dominic Petrak, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany;

(2) Nafise Sadat Moosavi, Department of Computer Science, The University of Sheffield, United Kingdom;

(3) Ye Tian, Wluper, London, United Kingdom;

(4) Nikolai Rozanov, Wluper, London, United Kingdom;

(5) Iryna Gurevych, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany.

Table of Links

Abstract & Introduction

Related Work

Datasets Examined

Manual Error Type Analysis and Taxonomies

Automatic Filtering for Potentially Relevant Dialogs

Statistical Analysis

Evaluation and Experiments

Discussion

Conclusion, Limitation, Acknowledgments, and References

A Integrated Error Taxonomy – Details

B Error-Indicating Sentences And Phrases

C Automatic Filtering – Implementation

D Automatic Filtering – Sentence-Level Analysis

E Task-Oriented Dialogs – Examples

F Effectiveness Of Automatic Filtering – A Detailed Analysis

G Inter-Annotator Agreement – Detailed Analysis

H Annotation Guidelines

I Hyperparameters and Baseline Experiments

J Human-Human Dialogs – Examples

I Hyperparameters and Baseline Experiments

Hyperparameters All baseline models were trained for five epochs. For the experiment using erroneous dialogs, we trained the models for ten epochs. We used a batch size of 32 and a learning rate of 5e-5 with no warmup steps. As optimizer, we used the PyTorch implementation of AdamW (Loshchilov and Hutter, 2019). Except for LLAMA (Touvron et al., 2023), we fully finetuned all models. For LLAMA, we only finetuned the LoRA (Hu et al., 2022) weights, using a rank of 8, an alpha of 16, and a dropout rate of 0.05.
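For concreteness, the sketch below shows how this setup could be expressed with Hugging Face Transformers and PEFT. The checkpoint path and the surrounding training loop are placeholders, not the authors' actual code; only the numeric hyperparameters come from the paper.

```python
# pip install torch transformers peft
from torch.optim import AdamW
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint; the paper finetunes LLAMA (Touvron et al., 2023).
model = AutoModelForCausalLM.from_pretrained("path/to/llama-checkpoint")

# LoRA configuration matching the reported values: rank 8, alpha 16, dropout 0.05.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)  # freezes base weights; only LoRA adapters train

# AdamW with a learning rate of 5e-5 and no warmup steps, as reported.
optimizer = AdamW(model.parameters(), lr=5e-5)

EPOCHS = 5       # ten for the erroneous-dialog experiment
BATCH_SIZE = 32
```

For the other baselines, the same loop applies without the PEFT wrapping, since all non-LLAMA models were fully finetuned.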

Results Table 17 shows the results of our baseline experiments, measured by word-overlap F1-Score and BLEU (Papineni et al., 2002); a minimal sketch of both metrics is given below.

Compared to the results of our experiment using erroneous dialogs (Table 11), both F1-Score and BLEU are higher in the case of GPT-2 and LLAMA. We attribute this to the comparatively poor quality of the erroneous dialogs. The majority of these dialogs come from the SFC (Hancock et al., 2019) dataset, where most errors are due to out-of-context system utterances (see Section 6.1) and were ignored by users (see Section 6.2).
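The following sketch illustrates both metrics under simple assumptions: whitespace tokenization for the word-overlap F1, and sacrebleu as one standard BLEU implementation. The paper's exact evaluation code is not specified, so details such as normalization and smoothing may differ.

```python
# pip install sacrebleu
from collections import Counter
import sacrebleu

def word_f1(prediction: str, reference: str) -> float:
    """Word-overlap F1 between a predicted and a gold response."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Toy example; real evaluation runs over the whole test set.
predictions = ["the hotel is booked for friday"]
references = [["the hotel has been booked for friday"]]  # one reference stream

print(word_f1(predictions[0], references[0][0]))             # sentence-level F1
print(sacrebleu.corpus_bleu(predictions, references).score)  # corpus-level BLEU
```

F1 rewards any shared tokens regardless of order, while BLEU additionally checks n-gram order, which is why the two metrics can diverge on the same outputs.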

This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.

