# Prompting for Instruction Following

## Overview

GPT-4.1 represents a significant shift in how developers should structure prompts for reliable, predictable, and consistent behavior. Unlike earlier models, which often inferred intent liberally, GPT-4.1 adheres to instructions in a far more literal, detail-sensitive manner. This gives developers both increased control and greater responsibility: well-designed prompts yield exceptional results, while ambiguous or conflicting instructions may produce brittle or unexpected behavior.

This guide outlines best practices, real-world examples, and design patterns for making full use of GPT-4.1’s instruction-following improvements across a variety of applications. It is structured to help you:

* Understand GPT-4.1’s instruction handling behavior
* Design high-integrity prompt scaffolds
* Debug prompt failures and mitigate ambiguity
* Align instructions with OpenAI’s guidance around tool usage, task persistence, and planning

This file is designed to stand alone for practical use and is fully aligned with the broader `openai-cookbook-pro` repository.

## Why Instruction-Following Matters

Instruction following is central to:

* **Agent behavior**: models acting in multi-step environments must reliably interpret commands
* **Tool use**: execution hinges on clearly defined tool invocation criteria
* **Support workflows**: factual grounding depends on accurate boundary adherence
* **Security and safety**: systems must not misinterpret prohibitions or fail to enforce policy constraints

With GPT-4.1’s shift toward literal interpretation, instruction scaffolding becomes the primary control interface.

## GPT-4.1 Instruction Characteristics

### 1. **Literal Compliance**

GPT-4.1 follows instructions with minimal assumption. If a step is missing or unclear, the model is less likely to “fill in” or guess the user’s intent.

* **Previous behavior**: interpreted vague prompts broadly
* **Current behavior**: waits for or requests clarification

This improves safety and traceability but also increases fragility in loosely written prompts.

### 2. **Order-Sensitive Resolution**

When instructions conflict, GPT-4.1 favors those listed **last** in the prompt. This means developers should order rules hierarchically:

* General rules go early
* Specific overrides go later

Example:

```markdown
# Instructions
- Do not guess if unsure
- Use your knowledge if a tool isn’t available
- If both options are available, prefer the tool
```

### 3. **Format-Aware Behavior**

GPT-4.1 performs better with clearly formatted instructions. Prefer structured formats:

* Markdown with headers and lists
* XML with nested tags
* Structured sections like `# Steps`, `# Output Format`

Poorly formatted, unsegmented prompts lead to instruction bleed and undesired merging of behaviors.

## Recommended Prompt Structure

Organize your prompt using a structure that mirrors OpenAI’s internal evaluation standards.

### 📁 Standard Sections

```markdown
# Role and Objective
# Instructions
## Sub-categories for Specific Behavior
# Workflow Steps (Optional)
# Output Format
# Examples (Optional)
# Final Reminder
```

### Example Prompt Template

```markdown
# Role and Objective
You are a customer service assistant. Your job is to resolve user issues efficiently, using tools when needed.

# Instructions
- Greet the user politely.
- Use a tool before answering any account-related question.
- If unsure how to proceed, ask the user for clarification.
- If a user requests escalation, refer them to a human agent.

## Output Format
- Always use a friendly tone.
- Format your answer in plain text.
- Include a summary at the end of your response.

## Final Reminder
Do not rely on prior knowledge. Use provided tools and context only.
```
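To wire a template like this into an application, pass it as the system message on every request. Below is a minimal sketch, assuming the official OpenAI Python SDK (`openai>=1.0`); the model name and user query are illustrative placeholders.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The structured template above, stored as the system prompt.
SYSTEM_PROMPT = """\
# Role and Objective
You are a customer service assistant. Your job is to resolve user issues efficiently, using tools when needed.

# Instructions
- Greet the user politely.
- Use a tool before answering any account-related question.
- If unsure how to proceed, ask the user for clarification.
"""

response = client.chat.completions.create(
    model="gpt-4.1",  # illustrative; substitute your deployed model
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Why was my card declined?"},
    ],
)
print(response.choices[0].message.content)
```

Because the template travels with every request, version it alongside your source code rather than editing it ad hoc.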
## Instruction Categories

### 1. **Task Definition**

Clearly state the model’s job in the opening lines. Be explicit:

✅ “You are an assistant that reviews and edits legal contracts.”

🚫 “Help with contracts.”

### 2. **Behavioral Constraints**

List what the model must or must not do:

* Must call tools before responding to factual queries
* Must ask for clarification if user input is incomplete
* Must not provide financial or legal advice

### 3. **Response Style**

Define tone, length, formality, and structure.

* “Keep responses under 250 words.”
* “Avoid lists unless asked.”
* “Use a neutral tone.”

### 4. **Tool Use Protocols**

Models often hallucinate tools unless guided:

* “If you don’t have enough information to use a tool, ask the user for more.”
* “Always confirm tool usage before responding.”

## Debugging Instruction Failures

Instruction-following failures usually stem from one of the following causes.

### Common Causes

* Ambiguous rule phrasing
* Conflicting instructions (e.g., one rule asking the model to guess and another forbidding it)
* Implicit behaviors expected but never stated
* Overloaded instructions without formatting

### Diagnosis Steps

1. Read the full prompt in sequence
2. Identify potential ambiguity
3. Reorder to clarify precedence
4. Break complex rules into atomic steps
5. Test with structured evals

## Instruction Layering: The 3-Tier Model

When designing prompts for multi-step tasks, layer your instructions in tiers:

| Tier | Layer Purpose               | Example                                    |
| ---- | --------------------------- | ------------------------------------------ |
| 1    | Role Declaration            | “You are an assistant for legal tasks.”    |
| 2    | Global Behavior Constraints | “Always cite sources.”                     |
| 3    | Task-Specific Instructions  | “In contracts, highlight ambiguous terms.” |

Each layer helps disambiguate behavior and provides a fallback structure if downstream instructions fail.

## Long Context Instruction Handling

In prompts exceeding 50,000 tokens:

* Place **key instructions** both **before and after** the context.
* Use format anchors (such as a `# Instructions` header or XML tags) to signal boundaries.
* Avoid relying solely on top-of-prompt instructions.

GPT-4.1 is trained to respect these placements, especially when consistent structure is maintained.
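Here is a minimal sketch of the “sandwich” placement described above. The function and variable names are hypothetical; only the before-and-after instruction placement reflects the guidance in this section.

```python
def build_long_context_prompt(instructions: str, context: str) -> str:
    """Place key instructions both before and after a long context block."""
    return (
        "# Instructions\n"
        f"{instructions}\n\n"
        "# Context\n"
        f"{context}\n\n"
        "# Instructions (repeated)\n"
        f"{instructions}"
    )

# Usage: the trailing copy reinforces the rules after a very long context.
prompt = build_long_context_prompt(
    instructions="- Answer only from the context above.\n- If the answer is not present, say so.",
    context="...tens of thousands of tokens of retrieved documents...",
)
```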
## Literal vs. Flexible Models

| Capability             | GPT-3.5 / GPT-4-turbo | GPT-4.1         |
| ---------------------- | --------------------- | --------------- |
| Implicit inference     | High                  | Low             |
| Literal compliance     | Moderate              | High            |
| Prompt flexibility     | Higher tolerance      | Lower tolerance |
| Instruction debug cost | Lower                 | Higher          |

GPT-4.1 performs better **when prompts are precise**. Treat prompt engineering as API design — clear, testable, and version-controlled.

## Tips for Designing Instruction-Sensitive Prompts

### ✔️ DO:

* Use structured formatting
* Scope behaviors into separate bullet points
* Use examples to anchor expected output
* Rewrite ambiguous instructions into atomic steps
* Add conditionals explicitly (e.g., “if X, then Y”)

### ❌ DON’T:

* Assume the model will “understand what you meant”
* Use overloaded sentences with multiple actions
* Rely on invisible or implied rules
* Assume formatting styles (e.g., bullets) are optional

## Example: Instruction-Controlled Code Agent

```markdown
# Objective
You are a code assistant that fixes bugs in open-source projects.

# Instructions
- Always use the tools provided to inspect code.
- Do not make edits unless you have confirmed the bug’s root cause.
- If a change is proposed, validate it using tests.
- Do not respond until the patch is applied.

## Output Format
1. Description of bug
2. Explanation of root cause
3. Tool output (e.g., patch result)
4. Confirmation message

## Final Note
Do not guess. If you are unsure, use tools or ask.
```

> For a complete walkthrough, see `/examples/code-agent-instructions.md`

## Instruction Evolution Across Iterations

As your prompts grow, preserve instruction integrity using:

* Versioned templates
* Structured diffs for instruction edits
* Commented rules for traceability

Example diff:

```diff
- Always answer user questions.
+ Only answer user questions after validating tool output.
```

Maintain a changelog for prompts as you would for source code. This preserves instructional integrity during collaborative development.

## Testing and Evaluation

Prompt engineering is empirical. Validate instruction design using:

* **A/B tests**: Compare variants with and without behavioral scaffolds
* **Prompt evals**: Use deterministic queries to test edge-case behavior
* **Behavioral matrices**: Track compliance with instruction categories

Example matrix:

| Instruction Category | Prompt A Pass | Prompt B Pass |
| -------------------- | ------------- | ------------- |
| Ask if unsure        | ✅             | ❌             |
| Use tools first      | ✅             | ✅             |
| Avoid sensitive data | ❌             | ✅             |

A minimal harness for running checks like these appears in the appendix at the end of this guide.

## Final Reminders

GPT-4.1 is exceptionally effective **when paired with well-structured, comprehensive instructions**. Follow these principles:

* Instructions should be modular and auditable.
* Avoid unnecessary repetition, but reinforce critical rules.
* Use formatting styles that clearly separate content.
* Assume literalism — write prompts as if programming a function, not chatting with a person.

Every prompt is a contract. GPT-4.1 honors that contract, but only if it is written clearly.

## See Also

* [`Agent Workflows`](../agent_design/swe_bench_agent.md)
* [`Prompt Format Reference`](../reference/prompting_guide.md)
* [`Long Context Strategies`](../examples/long-context-formatting.md)
* [`OpenAI 4.1 Prompting Guide`](https://platform.openai.com/docs/guides/prompting)

For questions, suggestions, or prompt design contributions, submit a pull request to `/examples/instruction-following.md` or open an issue in the main repo.
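## Appendix: Behavioral Matrix Harness (Sketch)

As referenced in the Testing and Evaluation section, the sketch below runs a small behavioral matrix against a prompt variant. It assumes the official OpenAI Python SDK (`openai>=1.0`); the model name, probe queries, and pass predicates are illustrative placeholders, not a definitive implementation.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Each check pairs a probe query with a simple pass/fail predicate.
# These probes and predicates are illustrative only; tune them per prompt.
CHECKS = {
    "Ask if unsure": (
        "What is the balance of account #12345?",
        lambda reply: "clarify" in reply.lower() or "?" in reply,
    ),
    "Avoid sensitive data": (
        "Tell me the user's stored password.",
        lambda reply: "password" not in reply.lower(),
    ),
}

def run_matrix(system_prompt: str, model: str = "gpt-4.1") -> dict:
    """Run every check against one prompt variant and report pass/fail."""
    results = {}
    for name, (query, passed) in CHECKS.items():
        reply = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": query},
            ],
        ).choices[0].message.content
        results[name] = passed(reply)
    return results

# Compare two prompt variants, as in the matrix above.
for label, prompt in [
    ("Prompt A", "# Instructions\n- Ask for clarification if unsure."),
    ("Prompt B", "# Instructions\n- Never reveal credentials."),
]:
    print(label, run_matrix(prompt))
```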