Eighteen months ago, “prompt engineering” sounded like a TikTok trick. Today it is a discipline with proven patterns, dedicated libraries, and best practices converging across teams. Models have improved, yes, but the bigger change is that we understand them better: teams integrating them into products have moved from wild experimentation to a shared vocabulary.
The Patterns That Have Consolidated
From the Center for AI Safety annual survey and engineering-team reports (OpenAI Cookbook, Anthropic prompt library), these patterns have stopped being experimental:
- Clear instructions, then context, then question. Order matters: GPT-3.5 and GPT-4 pay more attention to the start and end of the prompt. Begin with the task, insert context in the middle, end with the concrete question.
- Few-shot with representative examples. For structured tasks (extraction, classification, rewriting), giving 2-5 examples of input plus expected output boosts quality dramatically. The hard part is picking examples that cover edge cases, not just the ideal case.
- Explicit chain-of-thought. Asking the model to “reason step by step before answering” — introduced by Wei et al. 2022 — still measurably improves logical and mathematical reasoning on GSM8K and similar benchmarks.
- Structured output with schema. Telling the model “respond with JSON matching this schema: {…}” produces parseable results with much lower error rates than a bare “give me JSON”.
- Minimal negative instructions. “Don’t do X” works worse than “instead of X, do Y”. Models follow positive directions better.
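The ordering and few-shot patterns above can be sketched as a small prompt-assembly helper. This is a hypothetical illustration, not from any real library; the function name, the sentiment task, and the examples are all invented:

```python
# Hypothetical helper illustrating two patterns from the list above:
# task first, examples and context in the middle, concrete question last.
def build_prompt(task: str, examples: list[tuple[str, str]], context: str, question: str) -> str:
    """Assemble a prompt in the order models attend to best."""
    parts = [f"Task: {task}", ""]
    for src, expected in examples:  # few-shot examples, ideally covering edge cases
        parts += [f"Input: {src}", f"Output: {expected}", ""]
    parts += [f"Context:\n{context}", "", f"Question: {question}"]
    return "\n".join(parts)

prompt = build_prompt(
    task="Classify the sentiment of a product review as positive or negative.",
    examples=[
        ("Arrived broken and support ignored me.", "negative"),
        ("Cheap, sturdy, exactly as described.", "positive"),
    ],
    context="Review: 'Took three weeks to ship, but the build quality is superb.'",
    question="What is the sentiment of the review in the context?",
)
print(prompt)
```

The point of centralising assembly like this is that the ordering becomes a reviewable code decision rather than something re-improvised in every call site.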
Structured Output as Standard
A notable 2023 shift: structured output has moved from emerging pattern to native capability. In June 2023 OpenAI introduced function calling in its API: a formal mechanism for the model to return function invocations whose JSON arguments follow a declared schema. Claude has since adopted a similar pattern with tool use.
The practical impact is that many applications no longer need fragile regex parsing over free text. You define a function (say, extractInvoiceData(number, date, total, items[])), pass it to the model alongside text, and the response is directly invokable. This has cut a significant amount of “glue” code in LLM + backend integrations.
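To make the flow concrete, here is a minimal sketch. The schema shape follows OpenAI's 2023 chat-completions `tools` format; the invoice fields (a snake_case analogue of the example above) and the response string are invented for illustration, since a real call would go over the network:

```python
import json

# Tool definition you would pass to the chat-completions endpoint.
# Field names here are illustrative, not from any real API.
tools = [{
    "type": "function",
    "function": {
        "name": "extract_invoice_data",
        "description": "Extract structured fields from invoice text.",
        "parameters": {
            "type": "object",
            "properties": {
                "number": {"type": "string"},
                "date":   {"type": "string", "description": "ISO 8601 date"},
                "total":  {"type": "number"},
                "items":  {"type": "array", "items": {"type": "string"}},
            },
            "required": ["number", "date", "total"],
        },
    },
}]

# The API returns the function's arguments as a JSON string; parsing it
# replaces the fragile regex-over-free-text step. Hand-written sample:
fake_tool_call_arguments = (
    '{"number": "INV-042", "date": "2023-11-02", '
    '"total": 118.5, "items": ["hosting", "support"]}'
)
invoice = json.loads(fake_tool_call_arguments)
print(invoice["number"], invoice["total"])
```

In production you would still validate the parsed dict (the model can omit or mistype fields), but that is one `try/except` around `json.loads` plus a schema check, not a parser.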
Libraries like Instructor (Python) or Marvin wrap these patterns over Pydantic, giving typed output with no manual effort.
Self-Consistency and Verification
When high reliability is needed for critical decisions, a robust 2023 pattern is self-consistency: run the same prompt N times (typically 3-5) at high temperature to generate diversity, and majority-vote the answers. Wang et al. 2022 showed this can raise accuracy 10-20 points on complex reasoning benchmarks.
The token cost is real — 3x to 5x per inference — but in flows where an error has consequences (medical diagnosis, legal analysis, financial decisions) the cost/reliability trade is favorable.
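The voting step itself is a few lines. In this sketch `ask_model` is a stand-in for any sampled LLM call (it is not a real API); a seeded flaky stub simulates a model that occasionally slips at nonzero temperature:

```python
import random
from collections import Counter

def self_consistency(ask_model, prompt: str, n: int = 5) -> str:
    """Run the same prompt n times and majority-vote the answers."""
    answers = [ask_model(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical stub: answers correctly ~70% of the time, like a model
# sampled at high temperature. Seeded so the demo is reproducible.
rng = random.Random(0)
def flaky_model(prompt: str) -> str:
    return "42" if rng.random() < 0.7 else "41"

print(self_consistency(flaky_model, "What is 6 * 7? Reason step by step.", n=5))
```

With independent errors, the vote converges on the majority answer even when individual samples are wrong, which is the whole trade: N times the tokens for a lower variance result.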
A more efficient variant: critic-refine. Generate an initial answer, ask the model to critique its own response (“what problems does this answer have?”), then request a revised version. This costs two to three times a single call rather than five, while retaining much of the quality gain.
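The critic-refine loop is three chained calls. Again `ask_model` is a hypothetical stand-in for a chat-completion call, and the scripted stub below only exists so the sketch runs offline:

```python
def critic_refine(ask_model, question: str) -> str:
    """Draft, self-critique, then revise: three calls instead of N samples."""
    draft = ask_model(question)
    critique = ask_model(
        f"Question: {question}\nDraft answer: {draft}\n"
        "What problems does this answer have?"
    )
    return ask_model(
        f"Question: {question}\nDraft answer: {draft}\n"
        f"Critique: {critique}\nWrite a revised answer fixing these problems."
    )

# Toy stand-in that returns a canned reply per stage of the loop:
calls = []
def scripted_model(prompt: str) -> str:
    calls.append(prompt)
    return ["draft", "critique", "final"][len(calls) - 1]

print(critic_refine(scripted_model, "Summarise the contract."))  # prints "final"
```

Note that the critique call sees both the question and the draft; feeding it only the draft tends to produce generic complaints instead of concrete fixes.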
What’s No Longer Recommended
Some 2022 viral patterns have aged badly:
- “Take a deep breath and work the problem step by step”. Marginally useful with GPT-3.5, but it adds little on GPT-4 once explicit chain-of-thought is in place.
- “Act as an expert in X”. Less effective than giving specific instructions about style, rigour, and response format. Current models respond better to “provide technical analysis with citations” than to “you are a cybersecurity expert with 20 years of experience”.
- Jailbreaks and safety manipulation. Even when they work, they produce worse-quality output than the model operating normally, and they usually violate terms of service.
Observability and Evaluation Tools
With maturity, tools have appeared to treat prompts as production artefacts:
- LangSmith for prompt-chain tracing + automated evaluation.
- PromptLayer for prompt versioning and A/B testing in production.
- Weights & Biases Prompts for structured experimentation.
Equally important: automated evals. As we wrote about GitHub Copilot, the quality of any AI assistant is measured with reproducible test cases — prompt engineering is no exception.
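An automated eval in this spirit can start as a handful of lines: reproducible cases, a model function, a score. Everything here is illustrative (the case set, the stub model, the exact-match scoring), not a real framework:

```python
def run_evals(ask_model, cases: list[tuple[str, str]]) -> float:
    """Return the fraction of (prompt, expected) cases the model passes."""
    passed = sum(
        1 for prompt, expected in cases
        if ask_model(prompt).strip() == expected
    )
    return passed / len(cases)

# Reproducible test cases, versioned alongside the prompt they exercise.
cases = [
    ("Classify: 'refund still missing after a month' -> positive/negative", "negative"),
    ("Classify: 'exceeded every expectation' -> positive/negative", "positive"),
]

# Stub so the harness runs offline; in practice this is your real API call.
def stub_model(prompt: str) -> str:
    return "negative" if "missing" in prompt else "positive"

print(run_evals(stub_model, cases))  # prints 1.0
```

Exact match is the crudest possible scorer; the structure is what matters, since swapping in fuzzy matching or an LLM judge changes one function, not the pipeline.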
Conclusion
Prompt engineering is no longer a trick: it’s a reproducible engineering layer with patterns, libraries, and observability tools. Teams that treat prompts with the same discipline as code — versioned, tested, monitored — are consistently getting better results than those who “try things in ChatGPT” without process.
Follow us on jacar.es for more on production LLMs, AI engineering, and integration best practices.