Benchmarking Large Language Models for Function Calling: GPT-4, GPT-3.5, Llama, and Octopus

by Language Models (dot tech), April 8th, 2025

Too Long; Didn't Read

Our study conducts a comprehensive evaluation of language model capabilities via an extensive benchmarking approach, aimed at assessing their effectiveness in generating accurate function calls.

Abstract and 1. Introduction

2 Related works

3 Methodology and 3.1 Causal language model as a classification model

3.2 Functional token

3.3 Dataset collection

3.4 Model development and training

4 Experiments and 4.1 Android function calls

4.2 Extension to Vehicle, Yelp, and DoorDash function sets

4.3 Full and partial training datasets and 4.4 Full training and LoRA training

4.5 Parallel and nested function call and 4.6 Weighted loss function for special tokens

5 Discussion and future works and References


Appendix

A.1 Android function examples

A.2 Vehicle function examples

4 Experiments

Our study conducts a comprehensive evaluation of language model capabilities via an extensive benchmarking approach, aimed at assessing their effectiveness in generating accurate function calls. Initially, we compare our model’s accuracy and response time against premier models in the field, namely GPT-4 (checkpoint: gpt-4-0125-preview) and GPT-3.5 (checkpoint: gpt-3.5-turbo-0125).


In the next phase, we explore the efficacy of the RAG technique, renowned for its ability to reduce incorrect outputs (hallucinations) and latency by equipping language models with a concise selection of potential functions. Through the integration of Meta’s FAISS for semantic search, we enhance the function call description retrieval process, opting for the top 5 descriptions to navigate context length constraints seen in models like Meta’s Llama-7B and OpenAI’s GPT-3.5.
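

As a rough illustration of this setup, the sketch below assumes a hypothetical retrieve_top_k helper that returns the five most similar function descriptions for a query (a concrete FAISS-based version is sketched later in Section 4.1) and folds them into the prompt sent to the language model. The prompt wording is illustrative, not the paper's exact template.

```python
# Minimal sketch of the RAG flow: retrieve a small set of candidate function
# descriptions, then ask the language model to emit a single function call.
# `retrieve_top_k` is a hypothetical helper; the prompt wording is illustrative.
from openai import OpenAI

client = OpenAI()

def generate_function_call(query: str, retrieve_top_k) -> str:
    candidates = retrieve_top_k(query, k=5)      # top-5 function descriptions
    context = "\n\n".join(candidates)
    messages = [
        {"role": "system",
         "content": "You translate user queries into a single function call. "
                    "Choose only from the candidate functions below.\n\n" + context},
        {"role": "user", "content": query},
    ]
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",              # checkpoint used in the paper
        messages=messages,
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```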


Subsequently, we analyze the impact of training dataset size and model training methods on performance metrics.

4.1 Android function calls

To illustrate our model’s application, we select Android system function calls as a case study, focusing on accuracy and latency in function call generation. We choose 20 Android APIs as our dataset foundation (see the Appendix for details on their design) and adopt two distinct methods for generating function call commands. The first is a RAG approach: the function descriptions most similar to the user query are retrieved, and the language model then uses them, together with the query, to generate the expected function call command. Below, we detail the various models employed in this evaluation.


Utilizing Google Gemini, we sample relevant queries for the selected Android function calls and manually label the ground truth to form the evaluation dataset. We document our benchmark results, focusing on two critical metrics, accuracy and latency, as illustrated in Figure 4 and Figure 5, respectively.
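

For concreteness, a minimal sketch of such a benchmark loop is shown below. It assumes an evaluation set of (query, ground-truth call) pairs and a generate_call callable wrapping whichever model is under test, and it treats a prediction as correct only on exact match after whitespace normalization; the paper's relaxed format handling (e.g. tolerating missing parentheses for Llama-7B) is omitted.

```python
import time

def benchmark(eval_set, generate_call):
    """eval_set: list of (query, ground_truth_call) pairs.
    generate_call: callable mapping a query string to a function-call string."""
    correct, latencies = 0, []
    for query, truth in eval_set:
        start = time.perf_counter()
        prediction = generate_call(query)
        latencies.append(time.perf_counter() - start)
        # Naive exact match after whitespace normalization; the paper additionally
        # tolerates minor format deviations for some models.
        if " ".join(prediction.split()) == " ".join(truth.split()):
            correct += 1
    accuracy = correct / len(eval_set)
    mean_latency = sum(latencies) / len(latencies)
    return accuracy, mean_latency
```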


Llama-7B RAG Evaluation Initially, the pretrained Llama-7B model showed limited ability in generating the expected outcomes, leading us to employ a Llama-7B variant fine-tuned for function call generation [48]. For the Llama-7B assessment, we applied the RAG method without strict output format requirements, considering responses with missing parentheses as correct. This evaluation was conducted on a single NVIDIA A100 machine, with all results from Llama-7B compared against the ground truth. The primary errors were incorrect function name selection and erroneous parameter generation. Despite employing few-shot learning to guide the model towards accurate function generation, the performance was modest, with an accuracy rate of 68.095% when overlooking format requirements and a latency of 13.46 seconds, excluding model loading time. To improve latency, we implemented optimizations such as flash attention and a fast tokenizer.
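

A minimal sketch of how such a setup can be loaded with Hugging Face transformers is shown below; the checkpoint name is a placeholder for the function-calling fine-tune of Llama-7B referenced in [48], and flash attention 2 plus the fast (Rust) tokenizer are enabled as the latency optimizations mentioned above (flash-attn must be installed and an A100-class GPU is assumed).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint name for the function-calling fine-tune of Llama-7B.
CHECKPOINT = "path/to/llama-7b-function-calling"

# Fast (Rust-based) tokenizer and flash attention 2 are the latency optimizations
# mentioned above.
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    CHECKPOINT,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
).eval()

def generate_call(prompt: str, max_new_tokens: int = 128) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:],
                            skip_special_tokens=True)
```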


GPT-3.5 RAG Evaluation Similar to the approach with Llama-7B, we utilized GPT-3.5 for response generation, employing the same semantic search strategy for context acquisition. To enhance GPT-3.5’s performance, we designed a specific prompt style, incorporating one-shot learning to improve accuracy further.


In this benchmark test, an impressive accuracy of 98.095% was achieved, leveraging the gpt-3.5-turbo-0125 checkpoint, which is optimized for function calling tasks. The latency was significantly improved to 1.97 seconds for generating a single function call, a notable enhancement over the Llama-7B model’s performance. This improvement in speed is primarily attributed to the efficiency of the language model inference, as the RAG component remained consistent. GPT-3.5’s quicker response may be due to OpenAI’s use of multiple GPUs or a more advanced inference infrastructure. Further analysis revealed that a significant portion of the time was spent on content retrieval, despite only needing to fetch 5 function descriptions from a pool of 20. To optimize latency, all function descriptions’ embeddings were precomputed using OpenAI’s text-embedding-3-large model, with IndexFlatL2 employed for search indexing and parallel computation on multicore CPUs used to enhance speed.
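

The retrieval side can be sketched roughly as follows: the description embeddings are computed once up front with OpenAI's text-embedding-3-large and searched with a FAISS IndexFlatL2, with FAISS's thread count raised to use multiple CPU cores. The function descriptions shown are placeholders, not the paper's actual Android API set.

```python
import faiss                      # pip install faiss-cpu
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-large"

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

# Placeholder descriptions; the paper uses 20 Android API descriptions (see Appendix).
function_descriptions = [
    "take_a_photo(camera): captures a photo using the specified camera.",
    "set_alarm(time): sets an alarm for the given time.",
]

# Precompute the embeddings of all function descriptions once, offline.
desc_vectors = embed(function_descriptions)
index = faiss.IndexFlatL2(desc_vectors.shape[1])   # exact L2 search
index.add(desc_vectors)
faiss.omp_set_num_threads(8)      # use multiple CPU cores for the search

def retrieve_top_k(query: str, k: int = 5) -> list[str]:
    query_vec = embed([query])
    _, idx = index.search(query_vec, k)
    return [function_descriptions[i] for i in idx[0]]
```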



GPT-3.5 and GPT-4 Evaluations In an effort to further reduce latency for GPT-3.5 and GPT-4, we included all 20 function descriptions directly in the context, bypassing the RAG method to avoid microservice interactions and their associated IO-bound overheads. This adjustment reduced latency to 1.18 seconds for GPT-3.5. The prompt template mirrored that of the GPT-3.5 RAG setup, with the addition of more candidate functions. However, accuracy declined slightly to 97.143%, possibly because language models become less effective with longer text inputs. Conversely, GPT-4 exhibited superior accuracy at 98.571% and even lower latency than GPT-3.5, despite GPT-4 presumably being the larger model. This performance, evaluated on March 18 at 2 PM PDT, might reflect differences in API traffic or hardware configurations between the two models. GPT-4’s stronger showing suggests that OpenAI may allocate more GPU resources to it or that it experiences less demand than GPT-3.5.
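

The no-RAG variant only changes how the context is built: all candidate descriptions go into the prompt up front, so no retrieval service is called per query. A hedged sketch is below; the one-shot example and prompt wording are illustrative, and swapping model_name to "gpt-4-0125-preview" gives the GPT-4 run.

```python
import time
from openai import OpenAI

client = OpenAI()

def generate_call_direct(query: str, function_descriptions: list[str],
                         model_name: str = "gpt-3.5-turbo-0125") -> str:
    # All candidate descriptions go straight into the prompt; no retrieval step.
    messages = [
        {"role": "system",
         "content": "Choose one of the functions below and emit a single call.\n\n"
                    + "\n\n".join(function_descriptions)},
        # One illustrative in-context example (one-shot prompting).
        {"role": "user", "content": "Take a photo with the back camera."},
        {"role": "assistant", "content": "take_a_photo(camera='back')"},
        {"role": "user", "content": query},
    ]
    start = time.perf_counter()
    resp = client.chat.completions.create(model=model_name, messages=messages,
                                          temperature=0)
    print(f"latency: {time.perf_counter() - start:.2f} s")
    return resp.choices[0].message.content.strip()
```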


Octopus model We now present the Octopus model, trained with 1000 data points sampled for each API; it achieves 99.524% accuracy on our evaluation dataset. Moreover, the prompt used for this method is as simple as:


"Below is the query from the users, please call the correct function and generate the parameters to call the function. Query: <user query>"

In our approach, incorporating function information directly into the context is unnecessary, as the Octopus model has already learned to map functional tokens to their corresponding function descriptions, thereby conserving a significant number of tokens for processing. Given its compact size and the brevity of the required context, the Octopus model demonstrates a reduced latency of 0.38 seconds. To maintain an equitable comparison, we adhered to the same benchmark settings used for the Llama-7B evaluation, such as incorporating flash attention and not using quantization. Furthermore, we explored the deployment of our Octopus 2B model on mobile devices through quantization. By precomputing the state for the fixed prefix shown above, our on-device model achieves remarkable performance, completing a function call within 1.1 to 1.7 seconds for typical queries of 20 to 30 tokens on a standard Android phone.
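

A rough sketch of this prefix-caching idea with a recent version of Hugging Face transformers is given below; the checkpoint path is a placeholder, and passing a copy of the precomputed past_key_values into generate() is one way to avoid re-encoding the shared prefix on every query (the paper's on-device runtime and quantization details are not reproduced here).

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

CHECKPOINT = "path/to/octopus-2b"   # placeholder for the on-device model weights
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(CHECKPOINT,
                                             torch_dtype=torch.float16).eval()

PREFIX = ("Below is the query from the users, please call the correct function "
          "and generate the parameters to call the function. Query:")

# One-time forward pass over the fixed prefix; its KV cache is reused for every query.
prefix_inputs = tokenizer(PREFIX, return_tensors="pt")
with torch.no_grad():
    prefix_cache = model(**prefix_inputs, past_key_values=DynamicCache(),
                         use_cache=True).past_key_values

@torch.no_grad()
def generate_call(query: str, max_new_tokens: int = 64) -> str:
    inputs = tokenizer(PREFIX + " " + query, return_tensors="pt")
    # Copy the cache so the shared prefix state is not mutated between queries.
    out = model.generate(**inputs,
                         past_key_values=copy.deepcopy(prefix_cache),
                         max_new_tokens=max_new_tokens,
                         do_sample=False)
    return tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:],
                            skip_special_tokens=True)
```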


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.

Authors:

(1) Wei Chen, Stanford University, with equal contribution and a corresponding author {weichen6}@stanford.edu;

(2) Zhiyuan Li, Stanford University and a corresponding author {zhiyuan8}@stanford.edu.

