Automatic Prompt Engineer (APE)
Automatic Prompt Engineer (APE) is a method for the automated generation and optimization of textual instructions (prompts) to control the behavior of large language models (LLMs). The approach was proposed in 2022 by a group of researchers led by Yongchao Zhou[1].
Instead of manual creation and iterative refinement of prompts, APE formalizes prompt engineering as an optimization problem. In this framework, a prompt is treated as a natural language "program" that must be synthesized to maximize a specific scoring function (e.g., the accuracy or factuality of the model's responses)[2].
Core Concept and Method
The APE method uses two language models in tandem: a proposal model and a target model. The process is an iterative search-and-select cycle:
- Proposal Generation. The proposal model receives a few input-output examples for the target task and generates a set of candidate prompts that could have produced those outputs.
- Scoring. Each generated candidate prompt is passed to the target LLM. The target model executes the instruction on a new set of test data, and its responses are evaluated according to a predefined metric (e.g., accuracy, completeness, F1-score).
- Selection. The prompt that achieved the best score during the evaluation is selected.
- Iteration (optional). The cycle can be repeated. The proposal model is instructed to refine the best-found prompt by creating variations of it, after which the scoring and selection process is repeated to achieve maximum effectiveness[1].
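The cycle above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `propose_prompts` and `target_model` are deterministic stand-ins for what would be two LLM calls in a real system, and the evaluation metric is simple exact-match containment.

```python
# Minimal sketch of the APE propose-score-select loop.
# `propose_prompts` and `target_model` are stand-ins for LLM calls.

def propose_prompts(examples, n=4):
    """Stand-in proposal model: emit n candidate instructions."""
    templates = [
        "Answer the question.",
        "Think carefully, then answer.",
        "Let's think step by step.",
        "Let's work this out in a step by step way to be sure "
        "we have the right answer.",
    ]
    return templates[:n]

def target_model(prompt, question):
    """Stand-in target model: returns a canned response string."""
    return f"{prompt} {question} -> 42"

def score(prompt, eval_set):
    """Fraction of eval examples whose gold answer appears in the response."""
    correct = sum(gold in target_model(prompt, q) for q, gold in eval_set)
    return correct / len(eval_set)

def ape_search(examples, eval_set, n_candidates=4):
    """One propose-score-select pass; iteration would wrap this in a loop."""
    candidates = propose_prompts(examples, n_candidates)
    best_score, best_prompt = max((score(p, eval_set), p) for p in candidates)
    return best_prompt, best_score

eval_set = [("What is 6 * 7?", "42")]
best, s = ape_search([("2+2", "4")], eval_set)
```

In a real pipeline, the optional iteration step would feed `best` back to the proposal model with an instruction to generate paraphrased variants, then rescore.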
This approach automates the manual prompt engineering loop, using an LLM both to generate hypotheses (prompts) and to evaluate them.
Key Methodologies
The automation of prompt engineering is implemented using various algorithmic approaches.
LLM-Based Automation
This is the classic APE method described above, where one LLM is used to generate and evaluate prompts for another (or the same) LLM. This approach has proven very effective for discrete, text-based prompts[1].
Evolutionary Methods
Genetic algorithms or beam search are used to create and select prompts, especially long and complex ones. For example, the APEX (Automatic Engineering of Long Prompts) framework applies evolutionary algorithms to progressively "grow" and refine complex instructions[3].
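A toy sketch of the evolutionary idea: mutate a small population of prompts and keep the fittest. The mutation phrases and the `fitness` function here are illustrative placeholders, not taken from the APEX paper; in practice fitness would be measured by running the target LLM over a validation set.

```python
import random

# Toy evolutionary prompt search: mutate, score, keep the fittest.
PHRASES = ["Be concise.", "Show your reasoning.", "Check your answer."]

def mutate(prompt, rng):
    """Append or drop one sentence of the prompt."""
    sentences = [s.strip().rstrip(".") for s in prompt.split(".") if s.strip()]
    if rng.random() < 0.5 or len(sentences) <= 1:
        sentences.append(rng.choice(PHRASES).rstrip("."))
    else:
        sentences.pop(rng.randrange(len(sentences)))
    return ". ".join(sentences) + "."

def fitness(prompt):
    """Placeholder metric: reward mentioning reasoning, penalize length."""
    return ("reasoning" in prompt) * 10 - len(prompt) / 100

def evolve(seed_prompt, generations=20, pop_size=6, seed=0):
    rng = random.Random(seed)
    population = [seed_prompt]
    for _ in range(generations):
        children = [mutate(rng.choice(population), rng) for _ in range(pop_size)]
        # Survivor selection: keep the top pop_size prompts by fitness.
        population = sorted(population + children, key=fitness, reverse=True)
        population = population[:pop_size]
    return population[0]

best = evolve("Answer the question.")
```

Because survivors are always selected from the union of parents and children, the best fitness in the population never decreases across generations.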
Gradient-Based Methods (Soft Prompts)
This approach works with continuous or soft prompts, which are trainable vectors (embeddings) rather than textual instructions. These vectors are optimized using gradient descent directly on the target task. This category includes techniques like Prompt Tuning and Prefix-Tuning.
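The core idea can be illustrated with a deliberately tiny stand-in: the "model" below is a frozen scalar function, and only a prepended soft-prompt value is trained by gradient descent. Real prompt tuning optimizes embedding vectors inside a transformer; this sketch only mirrors the frozen-weights/trainable-prompt split.

```python
# Toy soft-prompt tuning: the model weight is frozen; only the
# prepended soft prompt `p` is updated by gradient descent.

FROZEN_W = 2.0  # frozen model parameter, never updated

def model(soft_prompt, x):
    # The soft prompt shifts the input, loosely analogous to how
    # prepended embeddings shift a transformer's activations.
    return FROZEN_W * (x + soft_prompt)

def loss(soft_prompt, data):
    return sum((model(soft_prompt, x) - y) ** 2 for x, y in data) / len(data)

def tune(data, steps=200, lr=0.05):
    p = 0.0
    for _ in range(steps):
        # Analytic gradient of the mean squared error w.r.t. p.
        grad = sum(2 * FROZEN_W * (model(p, x) - y) for x, y in data) / len(data)
        p -= lr * grad
    return p

# Task: targets are 2*(x + 3), so the optimal soft prompt is p = 3.
data = [(x, 2 * (x + 3)) for x in range(5)]
p = tune(data)
```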
Reinforcement Learning
In this paradigm, the LLM acts as an agent that generates a prompt (action), and the environment returns a score for the response quality (reward). The goal is to maximize the cumulative reward by finding an optimal prompt generation policy through reinforcement learning (RL) methods[2].
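A minimal way to see this framing is a bandit over a fixed set of candidate prompts: each prompt is an action, and a (here, simulated) response-quality score is the reward. The candidate prompts and reward function are invented for illustration; a real system would generate prompts with an LLM and score real model outputs.

```python
import random

# RL framing of prompt search as an epsilon-greedy bandit:
# actions = prompts, reward = simulated response-quality score.

PROMPTS = ["Answer briefly.", "Explain step by step.", "Answer in one word."]

def reward(prompt):
    # Stand-in environment: noisy score, higher for step-by-step prompts.
    base = 0.9 if "step" in prompt else 0.5
    return base + random.uniform(-0.1, 0.1)

def epsilon_greedy(steps=500, epsilon=0.1, seed=0):
    random.seed(seed)
    counts = {p: 0 for p in PROMPTS}
    values = {p: 0.0 for p in PROMPTS}
    for _ in range(steps):
        if random.random() < epsilon:
            p = random.choice(PROMPTS)      # explore
        else:
            p = max(PROMPTS, key=values.get)  # exploit
        r = reward(p)
        counts[p] += 1
        values[p] += (r - values[p]) / counts[p]  # running mean
    return max(PROMPTS, key=values.get)

best = epsilon_greedy()
```

Full RL prompt optimizers go further, learning a generation policy over the token space rather than selecting from a fixed candidate pool.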
Results and Discoveries
Experiments in the original APE study showed that automatically generated instructions outperformed human-written prompts in most cases.
- In tests on 24 Natural Language Processing (NLP) tasks, APE generated prompts that were more effective than human-written ones in 19 out of 24 cases[1].
- APE was able to automatically "discover" a more effective phrasing for Chain-of-Thought style prompting. Instead of the standard phrase "Let's think step by step," APE generated a more detailed and effective instruction: "Let's work this out in a step by step way to be sure we have the right answer." This phrasing improved accuracy on mathematical reasoning tasks on datasets like MultiArith and GSM8K[1].
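Using such a trigger phrase is mechanically simple: it is appended after the question as the start of the model's answer. The two phrases below are from the paper; the template itself is just one common zero-shot CoT format.

```python
# Zero-shot chain-of-thought prompt construction with the standard
# trigger versus the instruction APE discovered.

STANDARD = "Let's think step by step."
APE_FOUND = ("Let's work this out in a step by step way "
             "to be sure we have the right answer.")

def zero_shot_cot(question, trigger=APE_FOUND):
    """Place the trigger phrase at the start of the model's answer."""
    return f"Q: {question}\nA: {trigger}"

prompt = zero_shot_cot("A farmer has 15 sheep and buys 12 more. How many now?")
```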
Applications and Advantages
Applications
- Improving few-shot learning: Automatically selecting optimal examples and instructions.
- Enhancing model factuality: APE can be configured to find prompts that minimize "hallucinations" and maximize the truthfulness of responses on benchmarks like TruthfulQA.
- Development automation: Accelerating the creation of chatbots, information extraction systems, and other LLM-based applications[4].
Advantages
- Scalability: The ability to automatically generate and evaluate hundreds or thousands of prompts without human intervention.
- Adaptability: Prompts can be adapted to new, highly specialized domains without fine-tuning the model itself.
- Resource efficiency: Significant reduction in the time and effort spent on manual prompt engineering.
Evolution and Related Approaches
The APE concept continues to evolve. Fully autonomous systems have emerged, such as APET (Automatic Prompt Engineering Toolbox), which allow an LLM (e.g., GPT-4) to independently apply complex prompting strategies (Expert Prompting, Chain of Thought, Tree of Thoughts) and dynamically improve instructions without external intervention[5].
APE is part of a broader trend toward automating interactions with LLMs, which also includes:
- AutoPrompt: An early method that used a gradient-based search to find discrete "trigger" tokens.
- OPRO (Optimization by PROmpting): An approach from DeepMind similar to APE, which also uses an LLM to optimize prompts.
Links
- Official Website of the Automatic Prompt Engineer (APE) Project
- Research Paper: "Large Language Models Are Human-Level Prompt Engineers"
- AutoPrompt – official GitHub repository (2020)
Further Reading
- Zhou, Y. et al. (2022). Large Language Models Are Human-Level Prompt Engineers. arXiv:2211.01910.
- Li, W. et al. (2025). A Survey of Automatic Prompt Engineering: An Optimization Perspective. arXiv:2502.11560.
- Hsieh, C.-J. et al. (2024). Automatic Engineering of Long Prompts. Findings of ACL 2024. 2024.findings-acl.634.
- Hsieh, C.-J. et al. (2023). Automatic Long Prompt Engineering. arXiv:2311.10117.
- Shin, T. et al. (2020). AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. arXiv:2010.15980.
- Yang, C. et al. (2023). Large Language Models as Optimizers (OPRO). arXiv:2309.03409.
- Liu, Y. et al. (2024). Revisiting OPRO: The Limitations of Small-Scale LLMs as Optimizers. arXiv:2405.10276.
- Kepel, D.; Valogianni, K. (2024). Autonomous Prompt Engineering in Large Language Models (APET). arXiv:2407.11000.
- Yang, C. et al. (2024). Optimizing Instructions and Demonstrations for Multi-Stage LM Programs. arXiv:2406.11695.
- Hsieh, C.-J. et al. (2024). APEX: code repository and results.
Notes
1. Zhou, Y. et al. "Large Language Models Are Human-Level Prompt Engineers". arXiv:2211.01910, 2022.
2. Li, W. et al. "A Survey of Automatic Prompt Engineering: An Optimization Perspective". arXiv:2502.11560, 2025.
3. Hsieh, C.-J. et al. "Automatic Engineering of Long Prompts". Findings of the Association for Computational Linguistics: ACL 2024.
4. Fernandez-Garcia, A. et al. "Automatic Prompt Engineering for Foundation Models: A Survey". MDPI Electronics, 2025.
5. Kepel, D. & Valogianni, K. "Autonomous Prompt Engineering in Large Language Models". arXiv:2407.11000, 2024.