Prompt engineering
Prompt engineering is the discipline of designing and optimizing prompts (queries) for effective interaction with large language models (LLMs). The quality of a prompt directly determines the accuracy, relevance, and safety of the model's response. This field is rapidly evolving, moving from manual instruction crafting to the creation of complex agentic systems and the use of models with built-in reasoning mechanisms.
Basic Principles and Prompt Structure
While no single standard exists, effective prompts are built on common approaches proposed by various researchers and companies (e.g., OpenAI's 6 strategies or practical guides from Anthropic).
An effective prompt often includes the following components:
- Role (Persona): Sets the context and behavioral style for the model ("You are a senior research scientist...").
- Instructions: Clear, step-by-step directions on what to do.
- Context: Necessary information to complete the task.
- Examples: A demonstration of the desired format or style (few-shot prompting).
- Output Format: Specification of the response structure (e.g., JSON, Markdown).
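The five components above can be assembled mechanically. The following sketch is illustrative only (the component texts are placeholders, and `build_prompt` is a hypothetical helper, not part of any library):

```python
def build_prompt(role, instructions, context, examples, output_format):
    """Combine the five common prompt components into one string."""
    parts = [
        f"Role: {role}",
        f"Instructions: {instructions}",
        f"Context: {context}",
        "Examples:\n" + "\n".join(examples),
        f"Output format: {output_format}",
    ]
    return "\n\n".join(parts)

prompt = build_prompt(
    role="You are a senior research scientist.",
    instructions="Summarize the abstract in two sentences.",
    context="Abstract: ...",
    examples=["Input: ... -> Output: ..."],
    output_format="JSON with a single 'summary' field",
)
```

Keeping the components in separate, labeled sections makes prompts easier to version and A/B test than a single free-form paragraph.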
Techniques for Improving Reasoning
These techniques compel the model to "think" in a more structured way. An important caveat (emergence): the effectiveness of Chain-of-Thought is primarily evident in large models (≈100 billion parameters and above); smaller models show negligible gains or even a decline in performance.
- Chain-of-Thought (CoT): Instructing the model to generate a step-by-step reasoning process before providing the final answer ("Think step-by-step").
- Variations and alternatives to CoT:
- Self-Consistency: Generating multiple reasoning chains and selecting the most frequent answer through a "vote".
- Tree-of-Thoughts (ToT): Exploring multiple reasoning paths as a tree, with evaluation and backtracking to previous steps.
- Graph-of-Thought (GoT): An advanced technique with two main implementations: one models reasoning as a graph for more flexible logical flows (Besta et al.), while the other focuses on merging reasoning paths (Yao et al.).
- Built-in Reasoning Mechanisms (Reasoning Models): It has been reported that a new generation of models (such as OpenAI's o1 and o3) is being developed, which are trained from the outset with internal reasoning chains. This allows them to perform complex tasks without explicit CoT prompting.
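Self-Consistency reduces to a majority vote over independently sampled answers. A minimal sketch of that vote, using a canned stub in place of real LLM calls (`fake_sample` is a stand-in, not an API):

```python
from collections import Counter

def self_consistency(sample_fn, prompt, n=5):
    """Sample n reasoning chains and return the most frequent final answer."""
    answers = [sample_fn(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Stub standing in for repeated sampled LLM calls; answers are canned.
canned = iter(["42", "41", "42", "42", "40"])
def fake_sample(prompt):
    return next(canned)

result = self_consistency(fake_sample, "Q: ... Let's think step by step.", n=5)
print(result)  # -> 42
```

In practice each call samples a full reasoning chain at nonzero temperature, and only the extracted final answers are voted on.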
Context Management and Memory
As context windows expand, new challenges and solutions emerge.
- Context Windows (2024-2025):
| Model | Maximum Context Window |
|---|---|
| Google Gemini 2.0 Pro | 2 million tokens |
| Google Gemini 1.5 Pro | 2 million tokens |
| Anthropic Claude 3.5 Sonnet | ~200k tokens |
| OpenAI GPT-4o | ~128k tokens |
In March 2025, Google introduced Gemini 2.5 Pro with a 1 million token context window; support for a 2 million token window was announced as forthcoming.
- Retrieval-Augmented Generation (RAG): Classic RAG supplements the prompt with information from external databases. Modern implementations include:
- GraphRAG: Uses knowledge graphs to retrieve more semantically related data.
- Multimodal RAG: Works not only with text but also with images, audio, and video.
- Agentic RAG: Integrates RAG into agentic loops, where the agent autonomously decides when and what information to search for.
- Long-Context Techniques: To work effectively with large windows, advanced but often proprietary techniques such as Cascading KV Cache and Infinite Retrieval are used.
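The classic RAG pipeline described above is simple at its core: retrieve relevant documents, then prepend them to the prompt. A toy sketch, with word overlap standing in for the embedding-based retrieval real systems use (both functions are hypothetical helpers):

```python
def retrieve(query, corpus, k=2):
    """Toy retriever: rank documents by word overlap with the query.
    Real systems use vector embeddings; overlap is a stand-in."""
    q = set(query.lower().split())
    score = lambda doc: len(q & set(doc.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:k]

def rag_prompt(query, corpus, k=2):
    """Supplement the prompt with the top-k retrieved documents."""
    docs = retrieve(query, corpus, k)
    context = "\n".join(f"- {d}" for d in docs)
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\n"
            f"Question: {query}")

corpus = [
    "Claude 3.5 Sonnet has a context window of about 200k tokens.",
    "RAG supplements the prompt with retrieved documents.",
    "Bananas are rich in potassium.",
]
prompt = rag_prompt("What is the context window of Claude 3.5 Sonnet?", corpus)
```

The "answer using only the context" instruction is what grounds the model's output in the retrieved material rather than its parametric memory.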
Advanced Techniques: Agents and Tools
- Tool Usage:
- Function Calling: A built-in capability in models (GPT-4, Claude 3.5) to call external APIs.
- Model Context Protocol (MCP): An open standard introduced by Anthropic (November 2024) for connecting models to external tools and data sources; it has since seen growing adoption across the industry.
- Agents and Frameworks (2024-2025):
- LangChain (v0.3): With the release of LangChain v0.3 (September 2024), the framework fully transitioned to Pydantic 2 and dropped support for Python 3.8, in line with its EOL date in October 2024.[1]
- AutoGen: Has fully transitioned to an asynchronous, event-driven architecture (actor model).
- CrewAI: A high-performance framework for orchestrating multi-agent systems that is rapidly gaining popularity.
- No-code platforms: Tools like AutoGen Studio allow for the creation and configuration of complex agents without writing code.
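The function-calling flow described under Tool Usage follows a common shape across vendors: the developer declares a tool in JSON-schema style, the model emits a call with JSON-encoded arguments, and the application dispatches it. A minimal sketch (field names vary by vendor; this layout and the `dispatch` helper are illustrative only):

```python
import json

# A tool declaration in the JSON-schema style used by function-calling APIs.
get_weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def dispatch(tool_call, registry):
    """Route a model-emitted tool call to the matching local function."""
    fn = registry[tool_call["name"]]
    args = json.loads(tool_call["arguments"])  # models emit arguments as JSON text
    return fn(**args)

registry = {"get_weather": lambda city: f"Sunny in {city}"}
# Simulated model output: the model chose a tool and filled in arguments.
model_output = {"name": "get_weather", "arguments": json.dumps({"city": "Paris"})}
result = dispatch(model_output, registry)
print(result)  # -> Sunny in Paris
```

In an agentic loop, the tool's return value is fed back into the conversation so the model can decide its next step.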
Techniques for Reducing Hallucinations
Hallucinations (generating factually incorrect information) remain a key problem. As of 2024, reported hallucination rates in leading models range from roughly 3% to 16%, and the associated economic losses are estimated in the tens of billions of dollars.
- Classic Methods: RAG, requesting citations, adjusting generation parameters (temperature, top-p).
- Modern Alignment Approaches:
- Constitutional AI (CAI): A method proposed by Anthropic where the model is trained to follow a set of principles ("a constitution") using AI-generated feedback.
- Direct Preference Optimization (DPO): A simpler, more computationally efficient alternative to RLHF that optimizes directly on preference data. Studies of DPO in multimodal VLMs (e.g., for radiology reports) have recorded a 3 to 4.8-fold reduction in hallucinations.
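The generation-parameter knob listed under classic methods can be made concrete: sampling temperature rescales the logits before the softmax, and lowering it concentrates probability mass on the top token, making output more deterministic. A minimal numeric sketch:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens
    the distribution, making sampling more deterministic."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
p_default = softmax(logits, temperature=1.0)
p_cold = softmax(logits, temperature=0.2)
# At temperature 0.2 the top token's probability exceeds 0.99,
# versus roughly 0.63 at temperature 1.0.
```

Lower temperature does not make the model more *truthful*; it only reduces the chance of sampling low-probability (often confabulated) continuations.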
Prompt Patterns: Current State
Many patterns (Persona, Output Customization) remain relevant.
- Require Re-evaluation:
- Fact Check List Pattern: Recognized as unreliable. Models perform poorly at self-checking facts via prompting and require integration with external verification systems.
- New Patterns (2024-2025):
- Meta-prompting: Using one LLM to generate and optimize prompts for another LLM.
- Mixture-of-Experts (MoE) Prompts: Creating prompts that dynamically route to different "expert" parts of the model.
- Multimodal Patterns: Prompt structures that include text, images, and other data types for complex queries.
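Meta-prompting, the first of the new patterns above, can be sketched as one model writing a prompt for another. The helper and the stub optimizer below are hypothetical placeholders for real LLM calls:

```python
def make_meta_prompt(task_description):
    """Build an instruction asking an 'optimizer' model to write a
    prompt for a 'worker' model (the meta-prompting pattern)."""
    return (
        "You are an expert prompt engineer. Write a clear, structured prompt "
        "that another language model can follow to perform this task:\n"
        f"{task_description}\n"
        "Include a role, step-by-step instructions, and an output format."
    )

# Stub standing in for a call to the optimizer model.
def optimizer_llm(prompt):
    return "You are a data analyst. Step 1: ... Output: JSON."

meta = make_meta_prompt("Summarize quarterly sales data.")
improved_prompt = optimizer_llm(meta)  # would then be sent to the worker model
```

Automated prompt-optimization systems iterate this loop, scoring each candidate prompt on a validation set and feeding the scores back to the optimizer.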
Links
- Anthropic Prompt Engineering Guide
- OpenAI Prompt Engineering Guide
- Google's Prompting Guide
- PromptingGuide.ai
- Amazon Bedrock
Bibliography
- Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners. PDF.
- Brown, T. B. et al. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165.
- Li, X. L.; Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv:2101.00190.
- Liu, Y. et al. (2021). Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. arXiv:2104.08786.
- Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
- Kojima, T. et al. (2022). Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916.
- Wang, X. et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171.
- Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
- Zhang, Z. et al. (2022). Automatic Chain of Thought Prompting in Large Language Models. arXiv:2210.03493.
- Zhou, D. et al. (2022). Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv:2205.10625.
- Besta, M. et al. (2023). Graph of Thoughts: Solving Elaborate Problems with Large Language Models. arXiv:2308.09687.
- Madaan, A. et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651.
- Rafailov, R. et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290.
- Schick, T. et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761.
- Wang, Y. et al. (2023). Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv:2212.10560.
- Yao, S. et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601.
- Chang, K. et al. (2024). Efficient Prompting Methods for Large Language Models: A Survey. arXiv:2404.01077.
- Genkina, D. (2024). AI Prompt Engineering Is Dead. IEEE Spectrum. [2].
- Li, Z. et al. (2024). Prompt Compression for Large Language Models: A Survey. arXiv:2410.12388.
- Liang, X. et al. (2024). Internal Consistency and Self-Feedback in Large Language Models: A Survey. arXiv:2407.14507.
- Li, W. et al. (2025). A Survey of Automatic Prompt Engineering: An Optimization Perspective. arXiv:2502.11560.
- Wu, Z. et al. (2025). The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models. EMNLP 2025. PDF.
- Yang, B. et al. (2025). Hallucination Detection in Large Language Models with Metamorphic Relations. arXiv:2502.15844.