Toolformer

Toolformer is an approach for creating large language models (LLMs) that enables them to independently use external tools through API calls^[1]. This method was proposed in 2023 by a group of researchers from Meta AI Research in collaboration with Pompeu Fabra University (Spain). In their paper, "Toolformer: Language Models Can Teach Themselves to Use Tools" (Timo Schick et al., 2023), the authors note a paradox: modern LLMs demonstrate impressive abilities to solve complex new tasks from text examples, yet they are often unable to reliably perform basic operations, such as arithmetic or fact retrieval^[1]. To overcome these limitations, the team developed Toolformer, a model that self-learns to select and call external tools (such as search engines, calculators, or translation services) to improve its performance on a variety of tasks^[1]. In February 2023, this model was presented as a preprint on arXiv, and it later gained recognition and was accepted at the NeurIPS 2023 conference^[2].

Core Idea and Model Capabilities

Toolformer is a fine-tuned language model capable of deciding which tool to call, when to call it, what parameters to pass, and how to incorporate the result into the generated text^[3]. Training is self-supervised: the model generates and evaluates examples of API usage on its own, requiring only a handful of demonstrations for each tool^[3]. Unlike previous approaches, it does not require extensive annotations or human-curated templates to integrate tools; the model learns when and how to use a particular API on its own, while retaining its general language abilities and not being limited to narrowly defined tasks^[1].

The authors integrated a wide range of tools into Toolformer, accessible through simple API calls. The experimental version utilized the following utilities^[3]:

Calculator – for performing arithmetic calculations.
Question Answering (Q&A) system – for finding answers to factual questions from a knowledge base.
Search engines (2 different ones) – for searching for up-to-date information on the internet.
Machine Translation system – for translating texts between languages.
Calendar – for retrieving information about dates and times.

Each tool is represented as a text snippet (a special text label), allowing the model to embed the API call directly into the text it generates^[4]. For instance, the model can insert a construct like `[Calculator(...)]` or `[Search("query")]` into the current context, signaling the need for an external call. During inference, Toolformer generates a special token (an arrow →), which causes the system to pause generation and execute the corresponding API call; the received result is then substituted into the text, and generation continues^[4]. This mechanism allows the model to dynamically leverage the capabilities of external services while remaining a single language module without architectural changes.
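The pause-execute-resume mechanism described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the `calculator` stand-in, the `TOOLS` registry, and the `complete_pending_call` helper are hypothetical names, not the paper's implementation, and a real system would hand the completed text back to the language model for further decoding.

```python
import operator
import re
from typing import Callable, Dict

# Hypothetical stand-in for the calculator tool: handles only "a op b".
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def calculator(expression: str) -> str:
    """Evaluate a two-operand arithmetic expression, rounded to two decimals."""
    match = re.fullmatch(
        r"\s*(-?\d+(?:\.\d+)?)\s*([-+*/])\s*(-?\d+(?:\.\d+)?)\s*", expression
    )
    if match is None:
        return "error"
    left, op, right = float(match.group(1)), match.group(2), float(match.group(3))
    if op == "/" and right == 0:
        return "error"
    return f"{OPS[op](left, right):.2f}"

# Assumed registry mapping tool labels in the text to callables.
TOOLS: Dict[str, Callable[[str], str]] = {"Calculator": calculator}

# Matches an unfinished call at the end of the text, e.g. "[Calculator(1 + 2) →".
PENDING_CALL = re.compile(r"\[(\w+)\((.*?)\)\s*→\s*$")

def complete_pending_call(text: str, tools: Dict[str, Callable[[str], str]]) -> str:
    """If the text ends with '[Tool(args) →', run the tool and append 'result]'."""
    match = PENDING_CALL.search(text)
    if match is None:
        return text  # nothing to execute; decoding would simply continue
    name, argument = match.group(1), match.group(2)
    tool = tools.get(name)
    if tool is None:
        return text  # unknown tool label: leave the text untouched
    return f"{text.rstrip()} {tool(argument)}]"

generated = "Out of 1400 participants, 400 (or [Calculator(400 / 1400) →"
print(complete_pending_call(generated, TOOLS))
# Out of 1400 participants, 400 (or [Calculator(400 / 1400) → 0.29]
```

Keeping the call and its result inline as ordinary text is what lets the approach work without architectural changes: the model only ever sees and produces token sequences.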

Model Training (Methodology)

The developers described the Toolformer training process in several stages^[4], using the technique of in-context learning to generate synthetic data^[1]:

Generating API Call Candidates. First, texts are taken from a large corpus (e.g., articles or web pages), and tool calls that could potentially help continue or supplement the text are artificially inserted. The model generates these insertions itself using few-shot prompting, based on a few manually provided examples of each API's usage^[1]. For example, the model might receive a fragment like: "In 2024, the city's population was [QA("What is the population of this city?") → ...] people," where QA() is a call to the question-answering system expected to return the missing data. Suitable contexts are selected for each tool; for instance, for the calculator, the model chooses sentences containing several numbers and words like "equals" or "total"^[4]—where an arithmetic result would genuinely be needed.
Executing API Calls and Augmenting Data. Next, all the API calls generated by the model are actually executed: for example, queries are sent to a search engine, or expressions are computed by the calculator. The returned answers are substituted back into the texts, producing variants of each sentence in which the call is followed by its result, as in `[QA("...") → {answer}]`^[4]. At the same time, "empty" versions (with the call but without the substituted answer) and the original texts without any calls are retained for later comparison.
Filtering and Self-Evaluating Utility. At this stage, Toolformer independently assesses which of the generated API insertions are genuinely useful for predicting the continuation of the text^[4]. A comparison is made of the language model's probability of continuing the text in three scenarios: (a) with no call, (b) with a tool call but without the result substituted, and (c) with both the call and the substituted result^[4]. If adding a specific API response increases the model's probability of correctly continuing the sentence (i.e., it genuinely helps the model predict the subsequent words), that example is considered useful. Only insertions that provide a probability gain are kept in the dataset. This process filters out cases where a tool call was redundant or did not contribute new information. Finally, the model is fine-tuned on the resulting filtered dataset, which contains real text examples with optimally inserted API calls^[1]. The training is performed using the standard language modeling objective: to predict the next token in a sequence, including tokens that represent the results of tool calls.
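The filtering step can be sketched as a comparison of losses over the tokens that follow the candidate call. The sketch below makes several assumptions that are not stated in the text above: the loss is a plain (unweighted) negative log-likelihood, the call with its result is compared against the better of the two baselines, and the gain must reach a hypothetical `threshold`. The per-token log-probabilities are made-up numbers standing in for real model outputs.

```python
from typing import Sequence

def continuation_loss(token_logprobs: Sequence[float]) -> float:
    """Negative log-likelihood of the continuation tokens (lower is better)."""
    return -sum(token_logprobs)

def is_useful_call(
    logprobs_no_call: Sequence[float],
    logprobs_call_only: Sequence[float],
    logprobs_call_with_result: Sequence[float],
    threshold: float = 1.0,  # assumed value, purely for illustration
) -> bool:
    """Keep a call only if seeing its result makes the continuation easier to predict."""
    # Scenarios (a) and (b): no call at all, or the call without its result.
    baseline = min(
        continuation_loss(logprobs_no_call),
        continuation_loss(logprobs_call_only),
    )
    # Scenario (c): the call together with the substituted result.
    with_result = continuation_loss(logprobs_call_with_result)
    return baseline - with_result >= threshold

# Made-up log-probabilities for the same three continuation tokens.
helpful = is_useful_call(
    logprobs_no_call=[-2.0, -1.5, -3.0],
    logprobs_call_only=[-2.1, -1.4, -3.2],
    logprobs_call_with_result=[-0.3, -0.2, -0.5],
)
redundant = is_useful_call(
    logprobs_no_call=[-2.0, -1.5, -3.0],
    logprobs_call_only=[-2.1, -1.4, -3.2],
    logprobs_call_with_result=[-2.0, -1.4, -3.0],
)
print(helpful, redundant)  # True False
```

Comparing against the call-without-result variant guards against keeping calls whose mere presence, rather than the returned answer, changes the prediction.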

It is important to emphasize that integrating each new tool required only a few manual examples of its use—the generation of training data proceeded automatically thereafter^[3]. Thanks to this, the Toolformer approach is almost independent of specialized annotated corpora and minimizes the labor required for data annotation. The model learns the format and appropriateness of API calls on its own while preserving its universality: it uses tools only when they are genuinely needed to solve the task at hand^[1].
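To make the role of the few manual examples concrete, the following sketch assembles a few-shot prompt that asks a model to annotate new text with calls to one tool. The instruction wording, the `Input:`/`Output:` layout, and the `build_annotation_prompt` helper are assumptions chosen for illustration; they are not the prompts used by the authors.

```python
from typing import List, Tuple

def build_annotation_prompt(
    instruction: str,
    demonstrations: List[Tuple[str, str]],
    text: str,
) -> str:
    """Assemble a few-shot prompt: instruction, demonstration pairs, then the new text."""
    lines = [instruction, ""]
    for plain, annotated in demonstrations:
        lines.append(f"Input: {plain}")
        lines.append(f"Output: {annotated}")
        lines.append("")
    # The model is expected to continue after the final "Output:" marker.
    lines.append(f"Input: {text}")
    lines.append("Output:")
    return "\n".join(lines)

# Illustrative demonstrations for a calculator tool (hand-written, hypothetical).
calculator_demos = [
    (
        "Out of 1400 participants, 400 passed the test.",
        "Out of 1400 participants, 400 (or [Calculator(400 / 1400)]) passed the test.",
    ),
    (
        "The total of 12 and 30 is 42.",
        "The total of 12 and 30 is [Calculator(12 + 30)] 42.",
    ),
]

prompt = build_annotation_prompt(
    instruction="Insert calculator API calls of the form [Calculator(expression)] where they help.",
    demonstrations=calculator_demos,
    text="The shop sold 15 items on Monday and 27 on Tuesday, 42 in total.",
)
print(prompt)
```

Adding a new tool under this scheme would amount to writing a new instruction and a few demonstration pairs, which matches the low annotation cost described above.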

Experimental Results

To experimentally validate the method, the researchers took an existing language model, GPT-J (6.7 billion parameters), an open LLM trained on The Pile corpus^[4]. This model was fine-tuned using the described procedure to create a GPT-J-based Toolformer. The new approach's performance was evaluated in a zero-shot setting (without in-context examples) on a range of standard tasks, including mathematical word problems, fact-finding and question answering (QA), as well as translation and text infilling. The test data included open datasets such as Natural Questions and TriviaQA (for question answering), the math word problem datasets ASDiv, MAWPS, and SVAMP, the multilingual QA benchmark MLQA, and the LAMA fact-completion benchmark^[5].

The results showed that Toolformer significantly outperforms the original model of the same size on many tasks^[1]. Moreover, by connecting to tools, the relatively small model (6.7 billion parameters) was able to surpass the much larger GPT-3 model (175 billion parameters) on several benchmarks^[1]^[6]. For example, in mathematical word problems requiring precise calculations, Toolformer demonstrated particularly notable progress compared to standard LLMs, solving such problems accurately by using the calculator, whereas even GPT-3 made errors^[1]. In factual question-answering (QA) tests, the GPT-J-based Toolformer clearly improved on the original GPT-J by querying its search tool, although the original paper reports that it still trailed the much larger GPT-3 on these QA benchmarks^[3]. Importantly, the new approach did not degrade the general capabilities of the language model, such as coherently continuing text on regular data: Toolformer maintained the same level of natural language generation without tools as the original GPT-J^[3]. In other words, adding the API-calling functionality did not diminish the model's core skills but rather expanded them with additional capabilities.

Significance and Further Research

The development of Toolformer demonstrated the fundamental possibility of training a model to use external tools with virtually no manual annotation, by generating synthetic data and self-evaluating its utility. This achievement paves the way for more efficient and reliable large language models: instead of relying on ever-larger parameter counts to memorize facts or mathematical rules, a relatively compact model can dynamically leverage external resources (knowledge bases, calculators, services) to fill its own gaps^[1]. This approach can help reduce the number of "hallucinations" (cases where the model invents non-existent information), increase the accuracy of answers, and keep knowledge current without retraining the entire model. In essence, Toolformer aims for "the best of both worlds": combining the linguistic skills of a large model with the precision of specialized tools^[3].

The work by Schick and colleagues was one of the first to demonstrate an LLM's autonomous mastery of external APIs and generated significant interest in the community. It has inspired further research developing this idea. For example, in 2023, the Graph-ToolFormer model was proposed, adapting the principles of Toolformer for working with graph data. Relying on prompts generated by ChatGPT, Graph-ToolFormer teaches a language model to call external tools for solving graph analysis tasks (e.g., retrieving graph properties, working with knowledge networks, etc.)^[7]. Another notable development is the Gorilla model (Univ. of California, Berkeley, 2023), which focuses on the precise use of numerous third-party software APIs^[8]. Gorilla is a LLaMA-based model fine-tuned on an extensive dataset of documented functions; it can generate correct API calls from a task description so successfully that in experiments, it surpassed even GPT-4 in the accuracy of generating calls to libraries and services^[8]. Gorilla integrates a documentation retrieval mechanism, allowing it to stay updated on API changes and significantly reducing errors and hallucinations when using tools^[8]. These works confirm the importance of the direction opened by Toolformer: combining LLMs with external tools is seen as a promising path toward creating more powerful and reliable AI systems.

The idea of integrating tools into language models is evolving not only in research but also in industry. In March 2023, OpenAI introduced a plugin system for ChatGPT, an interface that allows the model to connect to external services (a web browser, knowledge bases, computational engines, etc.) to obtain up-to-date information and perform calculations^[9]. Users had long requested such functionality, and the introduction of plugins effectively implements a concept similar to Toolformer in practice: the model is augmented with "tools" to expand its capabilities in handling diverse user requests^[9]. Thus, Toolformer aligns with the general trend in AI development in which large language models are becoming a platform capable of leveraging external knowledge and computation on demand. The research by the Meta AI and UPF team laid an important foundation for this direction, showing how LLMs can learn to work with tools with almost no manual guidance, thereby moving closer to a more versatile and safe intelligent assistant^[1]^[3].

References

Schick, T. et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761.
Schick, T. et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023. OpenReview.
Li, M. et al. (2023). API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. arXiv:2304.08244.
Zhang, T. et al. (2023). Graph-ToolFormer: To Empower LLMs with Graph Reasoning Ability via Prompt Augmented by ChatGPT. arXiv:2304.11116.
Patil, S. G. et al. (2023). Gorilla: Large Language Model Connected with Massive APIs. arXiv:2305.15334.
Qin, Y. et al. (2023). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs. arXiv:2307.16789.
Li, Y. et al. (2024). Tool Learning with Large Language Models: A Survey. arXiv:2405.17935.
OpenAI (2023). ChatGPT Plugins. OpenAI Blog.
Brown, T. B. et al. (2020). Language Models Are Few-Shot Learners. arXiv:2005.14165.
Schick, T. et al. (2023). Meta AI & UPF’s Toolformer: Enabling Language Models to Teach Themselves to Use External Tools. Synced Review.

Notes

[1] Schick, T. et al. "Meta AI & UPF's Toolformer: Enabling Language Models to Teach Themselves to Use External Tools". Synced.
[2] Schick, T. et al. "Toolformer: Language Models Can Teach Themselves to Use Tools". OpenReview.
[3] Schick, T. et al. "Toolformer: Language Models Can Teach Themselves to Use Tools". arXiv.
[4] OXEN AI. "Arxiv Dives Toolformer: Language models can teach themselves to use tools". Medium.
[5] "Toolformer: Language Models Can Teach Themselves to Use Tools". Papers With Code.
[6] Brown, Tom B. et al. "Language Models are Few-Shot Learners". arXiv.
[7] Zhang, Tingkai et al. "Graph-ToolFormer: To Empower LLMs with Graph Reasoning Ability via Prompt Augmented by ChatGPT". arXiv.
[8] Patil, Shishir G. et al. "Gorilla: Large Language Model Connected with Massive APIs". arXiv.
[9] "ChatGPT plugins". OpenAI.
