Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.

Desarrollo de Software Inteligencia Artificial

agentes agents assistants api openai rag threads tool calling

OpenAI Assistants API: Stateful Agents Without Your Own Infrastructure

September 17, 2024 12 min read 114 reads

Table of contents

Key takeaways
The conceptual model
The three tools that change the calculus
Persistent threads and the actual cost
When it pays off and when it doesn’t
The mixed pattern that works best
Conclusion

Actualizado: 2026-05-03

OpenAI’s Assistants API, now on its v2 after a significant redesign, is a deliberate attempt to package the patterns everyone ends up reinventing when building an agent: a multi-turn conversation that survives across sessions, a structured mechanism for the model to request function execution, an isolated Python interpreter, and a retrieval system over user-provided documents. All of that without having to stand up a database for history, an embedding pipeline, or a code sandbox. The price of that convenience is giving up some control and signing up for clear vendor dependency.

Key takeaways

The Assistant + Thread + Run abstraction eliminates the infrastructure needed for conversation state and history — useful in prototypes and internal bots.
File search (managed RAG) saves weeks of work in simple cases; it falls short when retrieval quality is critical.
Code interpreter is an ephemeral Python sandbox — useful for analytics on user data, with non-trivial latency.
Function calling is the point where the assistant stops being chat and becomes an agent with real capacity to act.
For high traffic, predictable cost, or multi-provider architecture, Chat Completions with your own stack is the right answer.

The conceptual model

The central abstraction is the assistant: a reusable configuration with a model (GPT-4o or GPT-4o mini), instructions as a system prompt, and the list of available tools. On top of that assistant we create threads (individual conversations with their messages). To have the model process a thread we launch a run, whose state transitions through queued, in_progress, requires_action, completed, failed, cancelled, and expired. The split between thread (data) and run (execution) is intentional: it lets you fire the same thread against different assistants or reuse one assistant across thousands of different threads.

python

from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Customer support",
    instructions="You are a concise, friendly support assistant.",
    model="gpt-4o",
    tools=[{"type": "code_interpreter"}, {"type": "file_search"}],
)

thread = client.beta.threads.create()

client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="I have a problem with order 12345",
)

run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id=assistant.id,
)

if run.status == "completed":
    messages = client.beta.threads.messages.list(thread_id=thread.id)

create_and_poll encapsulates the pattern of launching the run and waiting for it to finish. In production, the streaming variant emits events as the model generates tokens — necessary for responsive interfaces.

The three tools that change the calculus

Code interpreter spins up an ephemeral Python sandbox with the usual data-analysis libraries. The assistant can generate code, execute it, and return results including files (PNG charts, processed CSVs). Useful for analytics assistants where the user uploads a spreadsheet and asks for conclusions. It introduces non-trivial latency — starting the sandbox and running code has real cost.

File search is where the API saves the most work. We upload files, attach them to a vector store managed by OpenAI, and the model queries them automatically whenever it decides the question requires them. OpenAI handles chunking, embeddings, the index, and retrieval. For a small or mid-sized document base with homogeneous content, it avoids standing up pgvector or Pinecone, an ingestion pipeline, and all the citation logic. For serious cases, file search falls short: no control over chunk size, opaque retrieval strategy, no reranking customisation. See RAG in production: patterns that work for when full control is needed.

Function calling is the mechanism by which the model pauses execution, declares that it wants to invoke a function whose JSON schema we’ve defined, and waits for us to return the result. When the run enters requires_action we read which function was requested, execute it on our own infrastructure, and submit the response. This is the point where the assistant stops being a chat and becomes an agent with real capacity to act — connecting to our database, internal APIs, external services.

Persistent threads and the actual cost

Having the thread survive on OpenAI’s servers is convenient: the user comes back three days later, we retrieve the thread by its identifier, and continue with all prior context intact. No message table to maintain, no backup to orchestrate. The catch is that this history is re-sent to the model on every run. As the conversation grows, the per-turn cost grows with it.

Pricing has to be read in aggregate:

Input/output tokens at the standard rate of the chosen model.
Code interpreter: $0.03 per session-hour.
File search: $0.10/GB-day of vector store storage, plus the tokens retrieved content consumes in the prompt.

At high volumes the total can comfortably exceed a direct Chat Completions call backed by own RAG over pgvector — especially if documents are already indexed for other purposes. The comparison with alternative models like Mistral Large is relevant if cost control is a requirement.

When it pays off and when it doesn’t

The Assistants API shines in:

Prototypes that need to be in the hands of a non-technical stakeholder within days.
Internal support bots over documentation that fits in memory.
Analytics assistants where the value lies in combining natural language with code execution over user-provided data.

In all of these, savings from not building infrastructure outweigh the lock-in.

It’s worth avoiding when:

The application is serious and traffic is high.
RAG reliability is critical — the file search is a black box.
Costs need to be predictable to the cent.
The architecture contemplates more than one model provider.

In those cases, Chat Completions with your own vector database, conversation storage, and orchestration in code gives more control, more transparency, and more portability. The more controlled multi-agent pattern is described in CrewAI: agent teams.

The mixed pattern that works best

Use Assistants to iterate fast on experiments, internal bots, and secondary features, while reserving the custom stack for the core product where differentiation and cost at scale demand every architectural decision be yours. The question isn’t which side to pick: it’s recognising that these two options don’t compete on the same ground — one sells initial speed and the other sells long-term control.

Conclusion

The Assistants API is a powerful tool for cases where time-to-value matters more than granular control. For prototypes, internal bots, and analytics on user data, it eliminates weeks of infrastructure. For serious production with high traffic, critical RAG, or multi-provider architecture, the custom stack is the right answer. The decision isn’t technical in the strict sense: it is operational and strategic.

Was this useful?

[Total: 15 · Average: 4.4]

Post Views: 114

Written by

Javier Cañete

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.

Desarrollo de Software

AI editors in 2026: comparison after a year of use

Claude Code, Cursor, Aider, Copilot, Windsurf. Tras un año intenso con los principales editores asistidos por IA, esta es la comparativa que importa para quien elige hoy.

227 5 min April 28, 2026

Desarrollo de Software

AI tools for developers: the 2026 stack

El stack de herramientas IA que un desarrollador usa en 2026 es distinto al de hace dieciocho meses. Editores agénticos, herramientas de revisión, agentes de terminal y asistentes de pruebas se han estabilizado en roles reconocibles. Guía práctica por categoría.

163 13 min March 29, 2026 4.5

Desarrollo de Software

Rust in the Linux kernel: balance after several years

Cuatro años y medio después de la entrada oficial de Rust en el kernel Linux 6.1, con drivers reales de GPU Apple y NVMe en producción y tras varios conflictos mediáticos entre mantenedores, toca hacer balance técnico sin histrionismo. Qué funciona, qué cuesta y hacia dónde va la próxima fase.

140 11 min March 8, 2026 4.3

Desarrollo de Software

WASI preview 3: adoption and real cases

WASI preview 3 llegó como estándar estable a finales de 2025 y ha tenido unos meses para demostrar si realmente desbloquea los casos que preview 2 se quedaba cortos. Recorrido honesto por adopciones reales, bibliotecas maduras y patrones que empiezan a funcionar en producción.

240 13 min February 6, 2026 4.6

OpenAI Assistants API: Stateful Agents Without Your Own Infrastructure

Key takeaways

The conceptual model

The three tools that change the calculus

Persistent threads and the actual cost

When it pays off and when it doesn’t

The mixed pattern that works best

Conclusion

Related posts

AI editors in 2026: comparison after a year of use

AI tools for developers: the 2026 stack

Rust in the Linux kernel: balance after several years

WASI preview 3: adoption and real cases