Streaming agent responses
Messages from the Letta server can be streamed to the client. If you’re building a UI on the Letta API, enabling streaming allows your UI to update in real-time as the agent generates a response to an input message.
Quick Start
Letta supports two streaming modes: step streaming (default) and token streaming.
To enable streaming, use the /v1/agents/{agent_id}/messages/stream endpoint instead of /messages:
```typescript
import { LettaClient } from "@letta-ai/letta-client";

const client = new LettaClient({ token: "YOUR_API_KEY" });

// Step streaming (default) - returns complete messages
const stream = await client.agents.messages.createStream(agent.id, {
  messages: [{ role: "user", content: "Hello!" }],
});
for await (const chunk of stream) {
  console.log(chunk); // Complete message objects
}

// Token streaming - returns partial chunks for real-time UX
const tokenStream = await client.agents.messages.createStream(agent.id, {
  messages: [{ role: "user", content: "Hello!" }],
  streamTokens: true, // Enable token streaming
});
for await (const chunk of tokenStream) {
  console.log(chunk); // Partial content chunks
}
```

```python
# Step streaming (default) - returns complete messages
stream = client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Hello!"}]
)
for chunk in stream:
    print(chunk)  # Complete message objects

# Token streaming - returns partial chunks for real-time UX
stream = client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Hello!"}],
    stream_tokens=True  # Enable token streaming
)
for chunk in stream:
    print(chunk)  # Partial content chunks
```

Streaming Modes Comparison
| Aspect | Step Streaming (default) | Token Streaming |
|---|---|---|
| What you get | Complete messages after each step | Partial chunks as tokens generate |
| When to use | Simple implementation | ChatGPT-like real-time UX |
| Reassembly needed | No | Yes (by message ID) |
| Message IDs | Unique per message | Same ID across chunks |
| Content format | Full text in each message | Incremental text pieces |
| Enable with | Default behavior | stream_tokens: true |
Understanding Message Flow
Message Types and Flow Patterns
The messages you receive depend on your agent’s configuration:
With reasoning enabled (default):

- Simple response: reasoning_message → assistant_message
- With tool use: reasoning_message → tool_call_message → tool_return_message → reasoning_message → assistant_message

With reasoning disabled (reasoning=false):

- Simple response: assistant_message
- With tool use: tool_call_message → tool_return_message → assistant_message
Message Type Reference
- reasoning_message: Agent’s internal thinking process (only when reasoning=true)
- assistant_message: The actual response shown to the user
- tool_call_message: Request to execute a tool
- tool_return_message: Result from tool execution
- stop_reason: Indicates end of response (end_turn)
- usage_statistics: Token usage and step count metrics
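If you want a single place to branch on these event types, a sketch like the one below dispatches on message_type for a Python stream such as the one in the Quick Start. The reasoning and content fields come from the examples later on this page; tool, stop, and usage chunks are printed raw here because their exact field names depend on your SDK version.

```python
# Sketch: dispatch on message_type (assumes `stream` from the Quick Start example)
for chunk in stream:
    msg_type = getattr(chunk, "message_type", None)
    if msg_type == "reasoning_message":
        print(f"Thinking: {chunk.reasoning}")
    elif msg_type == "assistant_message":
        print(f"Response: {chunk.content}")
    elif msg_type in ("tool_call_message", "tool_return_message"):
        print(f"Tool event: {chunk}")  # inspect the raw chunk for tool details
    elif msg_type in ("stop_reason", "usage_statistics"):
        print(f"Meta: {chunk}")  # end-of-turn marker and token accounting
```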
Controlling Reasoning Messages
```typescript
// With reasoning (default) - includes reasoning_message events
const agent = await client.agents.create({
  model: "openai/gpt-4o-mini",
  // reasoning: true is the default
});

// Without reasoning - no reasoning_message events
const agentNoReasoning = await client.agents.create({
  model: "openai/gpt-4o-mini",
  reasoning: false, // Disable reasoning messages
});
```

```python
# With reasoning (default) - includes reasoning_message events
agent = client.agents.create(
    model="openai/gpt-4o-mini",
    # reasoning=True is the default
)

# Without reasoning - no reasoning_message events
agent = client.agents.create(
    model="openai/gpt-4o-mini",
    reasoning=False  # Disable reasoning messages
)
```

Step Streaming (Default)
Step streaming delivers complete messages after each agent step completes. This is the default behavior when you use the streaming endpoint.
How It Works
- Agent processes your request through steps (reasoning, tool calls, generating responses)
- After each step completes, you receive a complete LettaMessage via SSE
- Each message can be processed immediately without reassembly
Example
```typescript
import { LettaClient } from "@letta-ai/letta-client";
import type { LettaMessage } from "@letta-ai/letta-client/api/types";

const client = new LettaClient({ token: "YOUR_API_KEY" });

const stream = await client.agents.messages.createStream(agent.id, {
  messages: [{ role: "user", content: "What's 2+2?" }],
});

for await (const chunk of stream as AsyncIterable<LettaMessage>) {
  if (chunk.messageType === "reasoning_message") {
    console.log(`Thinking: ${(chunk as any).reasoning}`);
  } else if (chunk.messageType === "assistant_message") {
    console.log(`Response: ${(chunk as any).content}`);
  }
}
```

```python
stream = client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "What's 2+2?"}]
)

for chunk in stream:
    if hasattr(chunk, 'message_type'):
        if chunk.message_type == 'reasoning_message':
            print(f"Thinking: {chunk.reasoning}")
        elif chunk.message_type == 'assistant_message':
            print(f"Response: {chunk.content}")
```

```bash
curl -N --request POST \
  --url https://api.letta.com/v1/agents/$AGENT_ID/messages/stream \
  --header "Authorization: Bearer $LETTA_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{"messages": [{"role": "user", "content": "What is 2+2?"}]}'

# For self-hosted: Replace https://api.letta.com with http://localhost:8283
```

Example Output
```
data: {"id":"msg-123","message_type":"reasoning_message","reasoning":"User is asking a simple math question."}

data: {"id":"msg-456","message_type":"assistant_message","content":"2 + 2 equals 4!"}

data: {"message_type":"stop_reason","stop_reason":"end_turn"}

data: {"message_type":"usage_statistics","completion_tokens":50,"total_tokens":2821}

data: [DONE]
```

Token Streaming
Token streaming provides partial content chunks as they’re generated by the LLM, enabling a ChatGPT-like experience where text appears character by character.
How It Works
- Set stream_tokens: true in your request
- Receive multiple chunks with the same message ID
- Each chunk contains a piece of the content
- Client must accumulate chunks by ID to rebuild complete messages
Example with Reassembly
```typescript
import { LettaClient } from "@letta-ai/letta-client";
import type { LettaMessage } from "@letta-ai/letta-client/api/types";

const client = new LettaClient({ token: "YOUR_API_KEY" });

// Token streaming with reassembly
interface MessageAccumulator {
  type: string;
  content: string;
}

const messageAccumulators = new Map<string, MessageAccumulator>();

const stream = await client.agents.messages.createStream(agent.id, {
  messages: [{ role: "user", content: "Tell me a joke" }],
  streamTokens: true, // Note: camelCase
});

for await (const chunk of stream as AsyncIterable<LettaMessage>) {
  if (chunk.id && chunk.messageType) {
    const msgId = chunk.id;
    const msgType = chunk.messageType;

    // Initialize accumulator for new messages
    if (!messageAccumulators.has(msgId)) {
      messageAccumulators.set(msgId, {
        type: msgType,
        content: "",
      });
    }

    // Accumulate content based on message type
    const acc = messageAccumulators.get(msgId)!;

    // Only accumulate if the type matches (in case types share IDs)
    if (acc.type === msgType) {
      if (msgType === "reasoning_message") {
        acc.content += (chunk as any).reasoning || "";
      } else if (msgType === "assistant_message") {
        acc.content += (chunk as any).content || "";
      }
    }

    // Update UI with accumulated content
    process.stdout.write(acc.content);
  }
}
```

```python
# Token streaming with reassembly
message_accumulators = {}

stream = client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream_tokens=True
)

for chunk in stream:
    if hasattr(chunk, 'id') and hasattr(chunk, 'message_type'):
        msg_id = chunk.id
        msg_type = chunk.message_type

        # Initialize accumulator for new messages
        if msg_id not in message_accumulators:
            message_accumulators[msg_id] = {
                'type': msg_type,
                'content': ''
            }

        # Accumulate content
        if msg_type == 'reasoning_message':
            message_accumulators[msg_id]['content'] += chunk.reasoning
        elif msg_type == 'assistant_message':
            message_accumulators[msg_id]['content'] += chunk.content

        # Display accumulated content in real-time
        print(message_accumulators[msg_id]['content'], end='', flush=True)
```

```bash
curl -N --request POST \
  --url https://api.letta.com/v1/agents/$AGENT_ID/messages/stream \
  --header "Authorization: Bearer $LETTA_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "stream_tokens": true
  }'
```

Example Output
```
# Same ID across chunks of the same message
data: {"id":"msg-abc","message_type":"assistant_message","content":"Why"}
data: {"id":"msg-abc","message_type":"assistant_message","content":" did"}
data: {"id":"msg-abc","message_type":"assistant_message","content":" the"}
data: {"id":"msg-abc","message_type":"assistant_message","content":" scarecrow"}
data: {"id":"msg-abc","message_type":"assistant_message","content":" win"}
# ... more chunks with same ID
data: [DONE]
```

Implementation Tips
Universal Handling Pattern
The accumulator pattern shown above works for both streaming modes:
- Step streaming: Each message is complete (single chunk per ID)
- Token streaming: Multiple chunks per ID need accumulation
This means you can write your client code once to handle both cases.
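As a minimal sketch of that idea, the helper below (the handle_stream name is purely illustrative) wraps the accumulator logic from the reassembly example and can be pointed at either mode:

```python
def handle_stream(stream):
    """Accumulate chunks by message ID; works for step and token streaming."""
    accumulators = {}
    for chunk in stream:
        if hasattr(chunk, 'id') and hasattr(chunk, 'message_type'):
            acc = accumulators.setdefault(chunk.id, {'type': chunk.message_type, 'content': ''})
            if chunk.message_type == 'reasoning_message':
                acc['content'] += chunk.reasoning
            elif chunk.message_type == 'assistant_message':
                acc['content'] += chunk.content
    return accumulators

# Step streaming: each message ID arrives as a single, complete chunk
step_result = handle_stream(client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Hello!"}]
))

# Token streaming: the same helper stitches partial chunks back together
token_result = handle_stream(client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Hello!"}],
    stream_tokens=True
))
```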
SSE Format Notes
All streaming responses follow the Server-Sent Events (SSE) format:
- Each event starts with data: followed by JSON
- Stream ends with data: [DONE]
- Empty lines separate events
Learn more about SSE format here.
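If you’re consuming the endpoint without an SDK, here is a rough sketch of parsing that format over raw HTTP with the requests library; the endpoint and headers are taken from the curl examples above, and agent_id and api_key are placeholders you would supply.

```python
import json
import requests

# Sketch: read the SSE stream directly (same endpoint/headers as the curl examples)
response = requests.post(
    f"https://api.letta.com/v1/agents/{agent_id}/messages/stream",
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    },
    json={"messages": [{"role": "user", "content": "Hello!"}], "stream_tokens": True},
    stream=True,
)

for line in response.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data:"):
        continue  # blank lines separate events
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        break  # end of stream
    event = json.loads(payload)
    print(event.get("message_type"), event)
```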
Handling Different LLM Providers
If your Letta server connects to multiple LLM providers, some may not support token streaming. Your client code will still work: the server falls back to step streaming automatically when token streaming isn’t available.