
Enabling Parallel and Nested Function Calls in Language Models: Dataset Requirements

by Language Models (dot tech), April 8th, 2025

Too Long; Didn't Read

To enable parallel and nested function calls, we need to prepare 4K data points for each API so that accuracy can reach the same level as the single function call.


Abstract and 1. Introduction

2 Related works

3 Methodology and 3.1 Causal language model as a classification model

3.2 Functional token

3.3 Dataset collection

3.4 Model development and training

4 Experiments and 4.1 Android function calls

4.2 Extension to Vehicle, Yelp, and DoorDash function sets

4.3 Full and partial training datasets and 4.4 Full training and LoRA training

4.5 Parallel and nested function call and 4.6 Weighted loss function for special tokens

5 Discussion and future works and References


Appendix

A.1 Android function examples

A.2 Vehicle function examples

4.5 Parallel and nested function call

The benchmark tests above are intended for single function calls. To enable parallel and nested function calls, we need to prepare 4K data points for each API so that accuracy can reach the same level as the single function call.
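
As a rough illustration of what such data points could look like, the sketch below uses the paper's functional-token style for a single, a parallel, and a nested call. The specific token names and the way multiple calls are serialized into one response are hypothetical assumptions for illustration, not the paper's published data format.

```python
# Hypothetical training examples for single, parallel, and nested function calls.
# The <nexa_i> functional tokens and the serialization of multiple calls into one
# response are illustrative assumptions, not the paper's exact data format.

single_call = {
    "query": "Take a selfie with the front camera",
    "response": "<nexa_0>('front')<nexa_end>",
}

# Parallel: one query resolves to several independent function calls.
parallel_call = {
    "query": "Turn on Do Not Disturb and set the screen brightness to 30%",
    "response": "<nexa_3>()<nexa_7>(30)<nexa_end>",
}

# Nested: the result of one call is passed as an argument to another.
nested_call = {
    "query": "Text my last missed caller that I'm running late",
    "response": "<nexa_5>(<nexa_4>(), 'I am running late')<nexa_end>",
}

if __name__ == "__main__":
    for name, example in [("single", single_call),
                          ("parallel", parallel_call),
                          ("nested", nested_call)]:
        print(f"{name:9s} {example['query']!r} -> {example['response']}")
```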

4.6 Weighted loss function for special tokens

A distinctive aspect of our approach involves incorporating numerous special tokens into the tokenizer and expanding the language model’s head. The loss function is defined as follows:

$$\mathcal{L} = -\sum_{t=1}^{T} \sum_{v=1}^{V} y_{t,v} \log\left(\hat{y}_{t,v}\right) \quad (3)$$

where T represents the sequence length, and V denotes the vocabulary size.
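
As a shape check, here is a minimal PyTorch sketch of this token-level cross-entropy; the tensor sizes are illustrative, and the default mean reduction averages over positions rather than summing, which differs from Equation (3) only by a constant factor.

```python
import torch
import torch.nn.functional as F

T, V = 16, 32_000            # sequence length and vocabulary size (illustrative)
logits = torch.randn(T, V)   # one row of vocabulary scores per position
targets = torch.randint(0, V, (T,))  # ground-truth token ids

# Token-level cross-entropy as in Equation (3); the default 'mean' reduction
# averages over the T positions instead of summing.
loss = F.cross_entropy(logits, targets)
print(float(loss))
```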


Given the introduction of special tokens ranging from <nexa_0> to <nexa_N-1>, along with the distinct token <nexa_end>, which are absent in the Gemma-2B pretraining dataset, we confront an imbalanced dataset challenge during model training. To address this, we adopt a weighted cross-entropy loss as a surrogate loss to improve convergence:

$$\mathcal{L}_{\text{surrogate}} = -\sum_{t=1}^{T} \sum_{v=1}^{V} w_v \, y_{t,v} \log\left(\hat{y}_{t,v}\right) \quad (4)$$

where w_v denotes the weight assigned to vocabulary token v.
In our configuration, non-special tokens are assigned a weight of 1, while special tokens receive elevated weights. Early-stage training experiments indicate that increasing the token weight can expedite convergence. The validation loss, based on Equation (3), for training runs with varying surrogate losses is illustrated in Figure (6). Our findings suggest that employing a surrogate training loss early in the training process aids convergence. Nonetheless, the experiments reveal no performance disparity in the fine-tuned model and no significant difference in wall-clock time. Therefore, we recommend an equal-weighted token loss when only a small number of functional tokens are involved. In our benchmark tests, the evaluated model is trained with equal token weights.
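
A minimal sketch of this weighted surrogate loss in PyTorch is shown below; the special-token ids (assumed to be appended at the end of the extended vocabulary) and the elevated weight value are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

BASE_V, NUM_SPECIAL = 32_000, 21        # illustrative sizes for the extended vocabulary
V = BASE_V + NUM_SPECIAL
special_ids = list(range(BASE_V, V))    # hypothetical ids of the added functional tokens

# Per-vocabulary-entry weights: 1 for ordinary tokens, an elevated weight for special tokens.
weights = torch.ones(V)
weights[special_ids] = 10.0             # illustrative value; setting this to 1 recovers Equation (3)

T = 16
logits = torch.randn(T, V)
targets = torch.randint(0, V, (T,))

# Weighted cross-entropy surrogate loss, as in Equation (4). The benchmarked model
# was ultimately trained with all weights equal to 1.
surrogate_loss = F.cross_entropy(logits, targets, weight=weights)
print(float(surrogate_loss))
```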


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.

Authors:

(1) Wei Chen, Stanford University, equal contribution and corresponding author ({weichen6}@stanford.edu);

(2) Zhiyuan Li, Stanford University, corresponding author ({zhiyuan8}@stanford.edu).

