Jacar mascot — reading along A laptop whose eyes follow your cursor while you read.
Desarrollo de Software Inteligencia Artificial

OpenAI Assistants API: Stateful Agents Without Your Own Infrastructure

OpenAI Assistants API: Stateful Agents Without Your Own Infrastructure

Actualizado: 2026-05-03

OpenAI’s Assistants API, now on its v2 after a significant redesign, is a deliberate attempt to package the patterns everyone ends up reinventing when building an agent: a multi-turn conversation that survives across sessions, a structured mechanism for the model to request function execution, an isolated Python interpreter, and a retrieval system over user-provided documents. All of that without having to stand up a database for history, an embedding pipeline, or a code sandbox. The price of that convenience is giving up some control and signing up for clear vendor dependency.

Key takeaways

  • The Assistant + Thread + Run abstraction eliminates the infrastructure needed for conversation state and history — useful in prototypes and internal bots.
  • File search (managed RAG) saves weeks of work in simple cases; it falls short when retrieval quality is critical.
  • Code interpreter is an ephemeral Python sandbox — useful for analytics on user data, with non-trivial latency.
  • Function calling is the point where the assistant stops being chat and becomes an agent with real capacity to act.
  • For high traffic, predictable cost, or multi-provider architecture, Chat Completions with your own stack is the right answer.

The conceptual model

The central abstraction is the assistant: a reusable configuration with a model (GPT-4o or GPT-4o mini), instructions as a system prompt, and the list of available tools. On top of that assistant we create threads (individual conversations with their messages). To have the model process a thread we launch a run, whose state transitions through queued, in_progress, requires_action, completed, failed, cancelled, and expired. The split between thread (data) and run (execution) is intentional: it lets you fire the same thread against different assistants or reuse one assistant across thousands of different threads.

python
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Customer support",
    instructions="You are a concise, friendly support assistant.",
    model="gpt-4o",
    tools=[{"type": "code_interpreter"}, {"type": "file_search"}],
)

thread = client.beta.threads.create()

client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="I have a problem with order 12345",
)

run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id=assistant.id,
)

if run.status == "completed":
    messages = client.beta.threads.messages.list(thread_id=thread.id)

create_and_poll encapsulates the pattern of launching the run and waiting for it to finish. In production, the streaming variant emits events as the model generates tokens — necessary for responsive interfaces.

The three tools that change the calculus

Code interpreter spins up an ephemeral Python sandbox with the usual data-analysis libraries. The assistant can generate code, execute it, and return results including files (PNG charts, processed CSVs). Useful for analytics assistants where the user uploads a spreadsheet and asks for conclusions. It introduces non-trivial latency — starting the sandbox and running code has real cost.

File search is where the API saves the most work. We upload files, attach them to a vector store managed by OpenAI, and the model queries them automatically whenever it decides the question requires them. OpenAI handles chunking, embeddings, the index, and retrieval. For a small or mid-sized document base with homogeneous content, it avoids standing up pgvector or Pinecone, an ingestion pipeline, and all the citation logic. For serious cases, file search falls short: no control over chunk size, opaque retrieval strategy, no reranking customisation. See RAG in production: patterns that work for when full control is needed.

Function calling is the mechanism by which the model pauses execution, declares that it wants to invoke a function whose JSON schema we’ve defined, and waits for us to return the result. When the run enters requires_action we read which function was requested, execute it on our own infrastructure, and submit the response. This is the point where the assistant stops being a chat and becomes an agent with real capacity to act — connecting to our database, internal APIs, external services.

Persistent threads and the actual cost

Having the thread survive on OpenAI’s servers is convenient: the user comes back three days later, we retrieve the thread by its identifier, and continue with all prior context intact. No message table to maintain, no backup to orchestrate. The catch is that this history is re-sent to the model on every run. As the conversation grows, the per-turn cost grows with it.

Pricing has to be read in aggregate:

  • Input/output tokens at the standard rate of the chosen model.
  • Code interpreter: $0.03 per session-hour.
  • File search: $0.10/GB-day of vector store storage, plus the tokens retrieved content consumes in the prompt.

At high volumes the total can comfortably exceed a direct Chat Completions call backed by own RAG over pgvector — especially if documents are already indexed for other purposes. The comparison with alternative models like Mistral Large is relevant if cost control is a requirement.

When it pays off and when it doesn’t

The Assistants API shines in:

  • Prototypes that need to be in the hands of a non-technical stakeholder within days.
  • Internal support bots over documentation that fits in memory.
  • Analytics assistants where the value lies in combining natural language with code execution over user-provided data.

In all of these, savings from not building infrastructure outweigh the lock-in.

It’s worth avoiding when:

  • The application is serious and traffic is high.
  • RAG reliability is critical — the file search is a black box.
  • Costs need to be predictable to the cent.
  • The architecture contemplates more than one model provider.

In those cases, Chat Completions with your own vector database, conversation storage, and orchestration in code gives more control, more transparency, and more portability. The more controlled multi-agent pattern is described in CrewAI: agent teams.

The mixed pattern that works best

Use Assistants to iterate fast on experiments, internal bots, and secondary features, while reserving the custom stack for the core product where differentiation and cost at scale demand every architectural decision be yours. The question isn’t which side to pick: it’s recognising that these two options don’t compete on the same ground — one sells initial speed and the other sells long-term control.

Conclusion

The Assistants API is a powerful tool for cases where time-to-value matters more than granular control. For prototypes, internal bots, and analytics on user data, it eliminates weeks of infrastructure. For serious production with high traffic, critical RAG, or multi-provider architecture, the custom stack is the right answer. The decision isn’t technical in the strict sense: it is operational and strategic.

Was this useful?
[Total: 15 · Average: 4.4]

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.