Jacar mascot — reading along A laptop whose eyes follow your cursor while you read.
Inteligencia Artificial Metodologías

Prompt Engineering: From Trick to Mature Discipline

Prompt Engineering: From Trick to Mature Discipline

Actualizado: 2026-05-03

Eighteen months ago, “prompt engineering” sounded like a TikTok trick. Today it’s a discipline with proven patterns, dedicated libraries, and best practices converging across teams. The change isn’t only that models have improved — it’s that we understand them better, and teams integrating them into products have moved from undirected experimentation to a shared vocabulary.

Key takeaways

  • Consolidated patterns: instructions-context-question, few-shot, chain-of-thought, structured output, and positive instructions.
  • Structured output with function calling (OpenAI) or tool use (Anthropic) has eliminated most fragile regex parsing.
  • Self-consistency — running the same prompt N times and majority-voting — improves accuracy 10-20 points on complex reasoning.
  • 2022 viral patterns like “act as expert” or “take a deep breath” have aged poorly and should be avoided.
  • Treating prompts like code — versioned, tested, monitored — separates teams that get consistent results from those that don’t.

The patterns that have consolidated

From the OpenAI Cookbook[1] and Anthropic’s prompt library[2], these patterns have stopped being experimental:

  • Clear instructions, then context, then question. Order matters: GPT-4 and equivalent models pay more attention to the start and end of the prompt. Begin with the task, insert context in the middle, end with the concrete question.
  • Few-shot with representative examples. For structured tasks (extraction, classification, rewriting) giving 2-5 examples of input + expected output dramatically boosts quality. The key is choosing examples that cover edge cases, not just the ideal case.
  • Explicit chain-of-thought. Asking the model to “reason step by step before answering” — introduced by Wei et al. 2022[3] — measurably improves logical and mathematical reasoning on GSM8K and similar benchmarks.
  • Structured output with schema. Telling the model “respond with JSON matching this schema: {…}” produces parseable results with much lower error rates than “give me a JSON”.
  • Minimal negative instructions. “Don’t do X” works worse than “instead of X, do Y”. Models follow positive directions better.

Structured output as standard

A notable shift: structured output has moved from emerging pattern to native capability. OpenAI introduced function calling[4] — a formal mechanism for the model to return function invocations with JSON arguments validated against a schema. Anthropic’s Claude[5] adopted an equivalent pattern with tool use.

The practical impact is that many applications no longer need fragile regex parsing over free text. You define a function (say, extractInvoiceData(number, date, total, items[])), pass it to the model alongside text, and the response is directly invokable.

Libraries like Instructor[6] (Python) or Marvin[7] wrap these patterns over Pydantic, giving typed output with no manual effort. This connects with the ChatGPT plugins ecosystem, where function calling has channelled many integration patterns that previously required manual orchestration.

Transformer architecture: attention mechanism, encoder, and decoder underlying modern language models

Self-consistency and verification

When high reliability is needed for critical decisions, a robust pattern is self-consistency: run the same prompt N times (typically 3-5) at high temperature to generate diversity, then majority-vote the answers. Wang et al. 2022[8] showed this can raise accuracy 10-20 points on complex reasoning benchmarks.

The token cost is real — 3x to 5x per inference — but in flows where an error has consequences (medical diagnosis, legal analysis, financial decisions) the cost/reliability trade is favourable.

A more efficient variant: critic-refine:

  1. Generate an initial response with the prompt.
  2. Ask the model to critique its own response: “what problems does this answer have?”.
  3. Request a revised version incorporating the critique.

This typically doubles rather than 5x’s the cost, retaining much of the quality gain. The same logic we apply to software testing: a second review pass catches the errors the first pass misses.

Three 2022 viral patterns that have aged badly:

  • “Take a deep breath and work the problem step by step.” Added something with GPT-3.5, but with more recent models and explicit chain-of-thought it’s redundant.
  • “Act as an expert in X.” Less effective than giving specific instructions about style, rigour, and response format. Current models respond better to “provide technical analysis with citations” than to “you are a cybersecurity expert with 20 years of experience”.
  • Jailbreaks and safety manipulation. Even when they work, they produce worse quality than the model in normal mode — and typically violate terms of service.

Observability and evaluation tools

With maturity, tools have appeared to treat prompts as production artefacts:

  • LangSmith[9]: prompt-chain tracing and automated evaluation. Useful for detecting regressions when the model or prompt changes.
  • PromptLayer[10]: prompt versioning and A/B testing in production.
  • Weights & Biases Prompts[11]: structured experimentation with per-variant metric logging.

Equally important: automated evals. As we noted in GitHub Copilot as a code assistant, the quality of any AI assistant is measured with reproducible test cases — prompt engineering is no exception. Advances in NLP have yielded more precise metrics for evaluating output quality beyond human intuition.

Conclusion

Prompt engineering is no longer a trick: it is a reproducible engineering layer with patterns, libraries, and observability tooling. Teams that treat prompts with the same discipline as code — versioned, tested, monitored — consistently get better results. The real quality leap doesn’t come from finding the magic prompt, but from building the process that allows iterating on prompts with data.

Was this useful?
[Total: 6 · Average: 4.7]
  1. OpenAI Cookbook
  2. Anthropic’s prompt library
  3. Wei et al. 2022
  4. function calling
  5. Anthropic’s Claude
  6. Instructor
  7. Marvin
  8. Wang et al. 2022
  9. LangSmith
  10. PromptLayer
  11. Weights & Biases Prompts

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.