Table of Links
3 Methodology and 3.1 Causal language model as a classification model
3.4 Model development and training
4 Experiments and 4.1 Android function calls
4.2 Extension to Vehicle, Yelp, and DoorDash function sets
4.3 Full and partial training datasets and 4.4 Full training and LoRA training
4.5 Parallel and nested function call and 4.6 Weighted loss function for special tokens
5 Discussion and future works and References
Appendix
4.5 Parallel and nested function call
The benchmark tests above are intended for single function calls. To enable parallel and nested function calls, we need to prepare 4K data points for each API so that accuracy can reach the same level as the single function call.
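To illustrate the distinction, a parallel call produces several independent function calls in one response, while a nested call uses one function's output as another's argument. The sketch below uses hypothetical token names and function signatures purely for illustration; it is not the paper's actual output format.

```python
# Hypothetical target formats (token and function names are illustrative only).

# Single function call:
single = "<fn_take_photo>(mode='portrait')<fn_end>"

# Parallel function calls: two independent calls in one response.
parallel = (
    "<fn_take_photo>(mode='portrait')<fn_end>"
    "<fn_set_timer>(minutes=5)<fn_end>"
)

# Nested function call: one call's result feeds another call's argument.
nested = "<fn_send_message>(to=<fn_get_contact>(name='Alice')<fn_end>, body='Hi')<fn_end>"
```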
4.6 Weighted loss function for special tokens
A distinctive aspect of our approach involves incorporating numerous special tokens into the tokenizer and expanding the language model’s head. The loss function is defined as follows:
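In standard token-level cross-entropy form, with $y_{t,v}$ the one-hot target and $p_{t,v}$ the model's predicted probability (notation introduced here for clarity):

\mathcal{L} = -\sum_{t=1}^{T} \sum_{v=1}^{V} y_{t,v} \log p_{t,v},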
where T represents the sequence length, and V denotes the vocabulary size.
Because the newly introduced special function tokens, along with the distinct token marking the end of a function call, are absent from the Gemma-2B pretraining dataset, we confront an imbalanced-token challenge during model training. To address this, we adopt a weighted cross-entropy loss as a surrogate loss to improve convergence:
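A weighted variant consistent with this description, sketched here with $w_v$ denoting the per-vocabulary-item weight (1 for ordinary tokens and a larger value for the added special tokens), is:

\mathcal{L}_{\text{weighted}} = -\sum_{t=1}^{T} \sum_{v=1}^{V} w_v \, y_{t,v} \log p_{t,v}.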
In our configuration, non-special tokens are assigned a weight of 1, while special tokens receive elevated weights. Early-stage training experiments indicate that increasing the special-token weight can expedite convergence. Figure 6 shows the validation loss, computed with Equation (3), for training runs with different surrogate losses. Our findings suggest that employing a surrogate training loss early in training aids convergence. Nonetheless, experiments reveal no performance disparity in the fine-tuned model, nor significant differences in wall-clock time. Therefore, an equal-weighted token loss is recommended when only a small number of function tokens are introduced. The model evaluated in our benchmark tests was trained with equal token weights.
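As a minimal sketch of how such a weighted surrogate loss can be implemented, assuming a PyTorch training loop and illustrative values for the vocabulary size, the number of added special tokens, and the elevated weight (none of these values are taken from the paper):

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only: the actual vocabulary size, number of added
# special tokens, and weight value are assumptions, not the paper's values.
BASE_VOCAB = 256_000                       # assumed base Gemma-2B vocabulary size
NUM_SPECIAL = 21                           # assumed number of added special tokens
VOCAB_SIZE = BASE_VOCAB + NUM_SPECIAL

# Weight 1 for ordinary tokens, an elevated weight for the new special tokens.
token_weights = torch.ones(VOCAB_SIZE)
token_weights[BASE_VOCAB:] = 10.0          # elevated weight (illustrative value)

def weighted_causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token weighted cross-entropy for a causal language model.

    logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
    """
    shift_logits = logits[:, :-1, :].contiguous()   # predict token t+1 from tokens <= t
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, VOCAB_SIZE),
        shift_labels.view(-1),
        weight=token_weights.to(shift_logits.device),
        ignore_index=-100,                  # standard padding/ignore index
    )
```

Setting every entry of the weight vector to 1 recovers the standard equal-weighted loss used for the benchmarked model.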
This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.
Authors:
(1) Wei Chen, Stanford University, equal contribution, corresponding author ({weichen6}@stanford.edu);
(2) Zhiyuan Li, Stanford University, corresponding author ({zhiyuan8}@stanford.edu).