Toolformer outperforms GPT-3 on zero-shot NLP tasks by using APIs

Published on April 26, 2023

Meta AI Research announced Toolformer, a language model that learns to call APIs to help with natural language processing (NLP) tasks. Toolformer automatically annotates its own training dataset, which is then used to fine-tune the model; the fine-tuned model outperforms the much larger GPT-3 on several zero-shot NLP tasks.

Toolformer is based on a pre-trained GPT-J large language model with 6.7B parameters. The model is prompted with human-written examples of API calls, including their inputs and outputs, which are prefixed to samples of training data. From these prompts, the model produces annotated samples that mark where an API call should be inserted to help generate the text; for example, a calculator API can be called to answer an arithmetic question. The model is then fine-tuned on this annotated dataset. By making API calls, the fine-tuned model outperformed larger models, such as the 175B-parameter GPT-3, on several zero-shot NLP benchmarks.
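To make the idea of an annotated sample concrete, here is a minimal sketch of text with an inline calculator call being filled in with its result. The bracketed `[Calculator(...)]` marker, the `->` separator, and the function names `calculator` and `fill_in_results` are illustrative assumptions, not Meta's actual format or code.

```python
import ast
import operator
import re

# Arithmetic operators the toy calculator accepts.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str) -> float:
    """Evaluate a basic arithmetic expression without using eval()."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval").body)


# Hypothetical marker for a call that has not been executed yet.
_CALL = re.compile(r"\[Calculator\(([^\]]*)\)\]")


def fill_in_results(text: str) -> str:
    """Run every calculator call found in the text and append its result."""
    def _run(match):
        expr = match.group(1)
        result = round(calculator(expr), 2)
        return f"[Calculator({expr}) -> {result}]"

    return _CALL.sub(_run, text)


annotated = "Out of 1400 participants, 400 [Calculator(400 / 1400)] passed the test."
print(fill_in_results(annotated))
```

In this sketch the model would only be responsible for emitting the call; the result is computed outside the model and spliced back into the text.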

Large language models (LLMs) such as GPT-3 perform well on a wide range of NLP tasks, and larger models generally perform better. However, LLMs often struggle with some tasks, such as arithmetic, regardless of their scale. Likewise, regardless of scale, they will incorrectly answer questions about events that occurred after the model was trained, such as “Which team does Cristiano Ronaldo play for?” Meta’s solution is to teach the LLM to call external tools, or APIs, such as a web search engine or a calculator, for tasks where it would otherwise perform poorly.

The key idea is to have the language model generate its own training dataset. The dataset is built by taking a subset of the Common Crawl dataset and prepending each example with a prompt asking the model to add API calls and their results. The researchers also developed a loss-based metric, or “fitness score,” for the API calls: if adding a call’s result to the text makes the model’s prediction of the following tokens worse, the edit is discarded.
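The filtering step described above can be sketched as a comparison of two losses over the tokens that follow a candidate API call. This is a simplified illustration, not Meta's implementation: the per-token probabilities are made-up toy numbers standing in for model outputs, and the names `negative_log_likelihood`, `keep_api_call`, and the `min_gain` threshold are assumptions introduced here.

```python
import math


def negative_log_likelihood(token_probs):
    """Sum of -log(p) over the tokens that follow the candidate API call."""
    return -sum(math.log(p) for p in token_probs)


def keep_api_call(probs_without_call, probs_with_result, min_gain=0.0):
    """Keep the call only if its result makes the following tokens easier to predict.

    probs_without_call: probability the model gives each following token
        when the API call is absent.
    probs_with_result: probability of the same tokens when the call and
        its result are included in the context.
    min_gain: hypothetical threshold on the loss reduction; with the
        default of 0.0, a call that leaves the loss unchanged is dropped too.
    """
    loss_without = negative_log_likelihood(probs_without_call)
    loss_with = negative_log_likelihood(probs_with_result)
    return (loss_without - loss_with) > min_gain


# Toy numbers: the result makes the next tokens much more likely.
helpful = keep_api_call([0.2, 0.1, 0.5], [0.6, 0.7, 0.5])

# Toy numbers: the result makes the next tokens less likely.
unhelpful = keep_api_call([0.2, 0.1, 0.5], [0.2, 0.05, 0.5])

print(helpful, unhelpful)
```

A real pipeline would obtain these probabilities from the language model itself, once with and once without the API result in the context.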

Toolformer was trained to use five tools: a question-answering API, a Wikipedia search engine, a machine translation system, a calculator, and a calendar. Meta ran several experiments comparing Toolformer with baseline GPT-J models, a 66B-parameter OPT model, and the 175B-parameter GPT-3 model. GPT-3 performed better than Toolformer on question answering, and GPT-J performed better than Toolformer on some non-English languages in multilingual question answering. The researchers attribute this to Toolformer’s fine-tuning on the annotated, English-only dataset.
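One way to picture the five tools listed above is as a small registry that maps a tool name to a function taking a text query and returning a text result. The sketch below is an assumption about how such a dispatch layer could look, not a description of Meta's system: the tool names, `call_tool`, and the placeholder `not_wired_up` are hypothetical, and only the calendar is implemented because it needs no external service.

```python
from datetime import date
from typing import Callable, Dict, Optional


def calendar_tool(query: str = "", today: Optional[date] = None) -> str:
    """Return the current date in ISO format; `today` can be injected for testing."""
    return (today or date.today()).isoformat()


def not_wired_up(query: str) -> str:
    """Placeholder for tools that would need an external backend."""
    raise NotImplementedError("this tool needs an external backend")


# Hypothetical registry: one entry per tool named in the article.
TOOLS: Dict[str, Callable[[str], str]] = {
    "QA": not_wired_up,
    "WikiSearch": not_wired_up,
    "MT": not_wired_up,
    "Calculator": not_wired_up,
    "Calendar": calendar_tool,
}


def call_tool(name: str, query: str = "") -> str:
    """Look up a tool by name and run it on the query."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](query)


print(call_tool("Calendar"))
```

Keeping every tool behind the same text-in, text-out signature means the result can be spliced back into the model's text regardless of which tool produced it.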

Although Meta has not released the source code for Toolformer, independent AI developers Phil Wang and Enrico Shippole have each released their own implementations.

