# Prompting for Instruction Following

## Overview

GPT-4.1 represents a significant shift in how developers should structure prompts for reliable, predictable, and consistent behavior. Unlike earlier models, which often inferred intent liberally, GPT-4.1 adheres to instructions in a far more literal, detail-sensitive manner. This gives developers both increased control and greater responsibility: well-designed prompts yield exceptional results, while ambiguous or conflicting instructions may produce brittle or unexpected behavior.

This guide outlines best practices, real-world examples, and design patterns for making full use of GPT-4.1’s instruction-following improvements across a variety of applications. It is structured to help you:

* Understand GPT-4.1’s instruction handling behavior
* Design high-integrity prompt scaffolds
* Debug prompt failures and mitigate ambiguity
* Align instructions with OpenAI’s guidance around tool usage, task persistence, and planning

This file is designed to stand alone for practical use and is fully aligned with the broader `openai-cookbook-pro` repository.

## Why Instruction-Following Matters

Instruction following is central to:

* **Agent behavior**: models acting in multi-step environments must reliably interpret commands
* **Tool use**: execution hinges on clearly defined tool invocation criteria
* **Support workflows**: factual grounding depends on accurate boundary adherence
* **Security and safety**: systems must not misinterpret prohibitions or fail to enforce policy constraints

With GPT-4.1’s shift toward literal interpretation, instruction scaffolding becomes the primary control interface.

## GPT-4.1 Instruction Characteristics

### 1. **Literal Compliance**

GPT-4.1 follows instructions with minimal assumption. If a step is missing or unclear, the model is less likely to “fill in” or guess the user’s intent.

* **Previous behavior**: interpreted vague prompts broadly
* **Current behavior**: waits for or requests clarification

This improves safety and traceability but also increases fragility in loosely written prompts.

### 2. **Order-Sensitive Resolution**

When instructions conflict, GPT-4.1 favors those listed **last** in the prompt. This means developers should order rules hierarchically:

* General rules go early
* Specific overrides go later

Example:

```markdown
# Instructions
- Do not guess if unsure
- Use your knowledge if a tool isn’t available
- If both options are available, prefer the tool
```

### 3. **Format-Aware Behavior**

GPT-4.1 performs better with clearly formatted instructions. Prefer structured formats:

* Markdown with headers and lists
* XML with nested tags
* Structured sections like `# Steps`, `# Output Format`

Poorly formatted, unsegmented prompts lead to instruction bleed and undesired merging of behaviors.

## Recommended Prompt Structure

Organize your prompt using a structure that mirrors OpenAI’s internal evaluation standards.

### 📁 Standard Sections

```markdown
# Role and Objective
# Instructions
## Sub-categories for Specific Behavior
# Workflow Steps (Optional)
# Output Format
# Examples (Optional)
# Final Reminder
```

### Example Prompt Template

```markdown
# Role and Objective
You are a customer service assistant. Your job is to resolve user issues efficiently, using tools when needed.

# Instructions
- Greet the user politely.
- Use a tool before answering any account-related question.
- If unsure how to proceed, ask the user for clarification.
- If a user requests escalation, refer them to a human agent.

## Output Format
- Always use a friendly tone.
- Format your answer in plain text.
- Include a summary at the end of your response.

## Final Reminder
Do not rely on prior knowledge. Use provided tools and context only.
```
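To wire a template like this into an application, pass it as the system message on every request. Below is a minimal sketch, assuming the official OpenAI Python SDK (`openai>=1.0`); the model name and user query are illustrative placeholders.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The structured template above, stored as the system prompt.
SYSTEM_PROMPT = """\
# Role and Objective
You are a customer service assistant. Your job is to resolve user issues efficiently, using tools when needed.

# Instructions
- Greet the user politely.
- Use a tool before answering any account-related question.
- If unsure how to proceed, ask the user for clarification.
"""

response = client.chat.completions.create(
    model="gpt-4.1",  # illustrative; substitute your deployed model
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Why was my card declined?"},
    ],
)
print(response.choices[0].message.content)
```

Because the template travels with every request, version it alongside your source code rather than editing it ad hoc.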
## Instruction Categories

### 1. **Task Definition**

Clearly state the model’s job in the opening lines. Be explicit:

✅ “You are an assistant that reviews and edits legal contracts.”

🚫 “Help with contracts.”

### 2. **Behavioral Constraints**

List what the model must or must not do:

* Must call tools before responding to factual queries
* Must ask for clarification if user input is incomplete
* Must not provide financial or legal advice

### 3. **Response Style**

Define tone, length, formality, and structure.

* “Keep responses under 250 words.”
* “Avoid lists unless asked.”
* “Use a neutral tone.”

### 4. **Tool Use Protocols**

Models often hallucinate tools unless guided:

* “If you don’t have enough information to use a tool, ask the user for more.”
* “Always confirm tool usage before responding.”

## Debugging Instruction Failures

Instruction-following failures usually stem from one of the following causes.

### Common Causes

* Ambiguous rule phrasing
* Conflicting instructions (e.g., one rule asking the model to guess and another forbidding it)
* Implicit behaviors expected but never stated
* Overloaded instructions without formatting

### Diagnosis Steps

1. Read the full prompt in sequence
2. Identify potential ambiguity
3. Reorder to clarify precedence
4. Break complex rules into atomic steps
5. Test with structured evals

## Instruction Layering: The 3-Tier Model

When designing prompts for multi-step tasks, layer your instructions in tiers:

| Tier | Layer Purpose               | Example                                    |
| ---- | --------------------------- | ------------------------------------------ |
| 1    | Role Declaration            | “You are an assistant for legal tasks.”    |
| 2    | Global Behavior Constraints | “Always cite sources.”                     |
| 3    | Task-Specific Instructions  | “In contracts, highlight ambiguous terms.” |

Each layer helps disambiguate behavior and provides a fallback structure if downstream instructions fail.

## Long Context Instruction Handling

In prompts exceeding 50,000 tokens:

* Place **key instructions** both **before and after** the context.
* Use format anchors (such as a `# Instructions` header or XML tags) to signal boundaries.
* Avoid relying solely on top-of-prompt instructions.

GPT-4.1 is trained to respect these placements, especially when consistent structure is maintained.
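Here is a minimal sketch of the “sandwich” placement described above. The function and variable names are hypothetical; only the before-and-after instruction placement reflects the guidance in this section.

```python
def build_long_context_prompt(instructions: str, context: str) -> str:
    """Place key instructions both before and after a long context block."""
    return (
        "# Instructions\n"
        f"{instructions}\n\n"
        "# Context\n"
        f"{context}\n\n"
        "# Instructions (repeated)\n"
        f"{instructions}"
    )

# Usage: the trailing copy reinforces the rules after a very long context.
prompt = build_long_context_prompt(
    instructions="- Answer only from the context above.\n- If the answer is not present, say so.",
    context="...tens of thousands of tokens of retrieved documents...",
)
```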
## Literal vs. Flexible Models

| Capability             | GPT-3.5 / GPT-4-turbo | GPT-4.1         |
| ---------------------- | --------------------- | --------------- |
| Implicit inference     | High                  | Low             |
| Literal compliance     | Moderate              | High            |
| Prompt flexibility     | Higher tolerance      | Lower tolerance |
| Instruction debug cost | Lower                 | Higher          |

GPT-4.1 performs better **when prompts are precise**. Treat prompt engineering as API design — clear, testable, and version-controlled.

## Tips for Designing Instruction-Sensitive Prompts

### ✔️ DO:

* Use structured formatting
* Scope behaviors into separate bullet points
* Use examples to anchor expected output
* Rewrite ambiguous instructions into atomic steps
* Add conditionals explicitly (e.g., “if X, then Y”)

### ❌ DON’T:

* Assume the model will “understand what you meant”
* Use overloaded sentences with multiple actions
* Rely on invisible or implied rules
* Assume formatting styles (e.g., bullets) are optional

## Example: Instruction-Controlled Code Agent

```markdown
# Objective
You are a code assistant that fixes bugs in open-source projects.

# Instructions
- Always use the tools provided to inspect code.
- Do not make edits unless you have confirmed the bug’s root cause.
- If a change is proposed, validate it using tests.
- Do not respond until the patch is applied.

## Output Format
1. Description of bug
2. Explanation of root cause
3. Tool output (e.g., patch result)
4. Confirmation message

## Final Note
Do not guess. If you are unsure, use tools or ask.
```

> For a complete walkthrough, see `/examples/code-agent-instructions.md`

## Instruction Evolution Across Iterations

As your prompts grow, preserve instruction integrity using:

* Versioned templates
* Structured diffs for instruction edits
* Commented rules for traceability

Example diff:

```diff
- Always answer user questions.
+ Only answer user questions after validating tool output.
```

Maintain a changelog for prompts as you would for source code. This preserves instructional integrity during collaborative development.

## Testing and Evaluation

Prompt engineering is empirical. Validate instruction design using:

* **A/B tests**: Compare variants with and without behavioral scaffolds
* **Prompt evals**: Use deterministic queries to test edge-case behavior
* **Behavioral matrices**: Track compliance with instruction categories

Example matrix:

| Instruction Category | Prompt A Pass | Prompt B Pass |
| -------------------- | ------------- | ------------- |
| Ask if unsure        | ✅             | ❌             |
| Use tools first      | ✅             | ✅             |
| Avoid sensitive data | ❌             | ✅             |

A minimal harness for running checks like these appears in the appendix at the end of this guide.

## Final Reminders

GPT-4.1 is exceptionally effective **when paired with well-structured, comprehensive instructions**. Follow these principles:

* Instructions should be modular and auditable.
* Avoid unnecessary repetition, but reinforce critical rules.
* Use formatting styles that clearly separate content.
* Assume literalism — write prompts as if programming a function, not chatting with a person.

Every prompt is a contract. GPT-4.1 honors that contract, but only if it is written clearly.

## See Also

* [`Agent Workflows`](../agent_design/swe_bench_agent.md)
* [`Prompt Format Reference`](../reference/prompting_guide.md)
* [`Long Context Strategies`](../examples/long-context-formatting.md)
* [`OpenAI 4.1 Prompting Guide`](https://platform.openai.com/docs/guides/prompting)

For questions, suggestions, or prompt design contributions, submit a pull request to `/examples/instruction-following.md` or open an issue in the main repo.
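## Appendix: Behavioral Matrix Harness (Sketch)

As referenced in the Testing and Evaluation section, the sketch below runs a small behavioral matrix against a prompt variant. It assumes the official OpenAI Python SDK (`openai>=1.0`); the model name, probe queries, and pass predicates are illustrative placeholders, not a definitive implementation.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Each check pairs a probe query with a simple pass/fail predicate.
# These probes and predicates are illustrative only; tune them per prompt.
CHECKS = {
    "Ask if unsure": (
        "What is the balance of account #12345?",
        lambda reply: "clarify" in reply.lower() or "?" in reply,
    ),
    "Avoid sensitive data": (
        "Tell me the user's stored password.",
        lambda reply: "password" not in reply.lower(),
    ),
}

def run_matrix(system_prompt: str, model: str = "gpt-4.1") -> dict:
    """Run every check against one prompt variant and report pass/fail."""
    results = {}
    for name, (query, passed) in CHECKS.items():
        reply = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": query},
            ],
        ).choices[0].message.content
        results[name] = passed(reply)
    return results

# Compare two prompt variants, as in the matrix above.
for label, prompt in [
    ("Prompt A", "# Instructions\n- Ask for clarification if unsure."),
    ("Prompt B", "# Instructions\n- Never reveal credentials."),
]:
    print(label, run_matrix(prompt))
```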