Streaming agent responses
Messages from the Letta server can be streamed to the client. If you’re building a UI on the Letta API, enabling streaming allows your UI to update in real-time as the agent generates a response to an input message.
Quick Start
Letta supports two streaming modes: step streaming (default) and token streaming.
To enable streaming, use the /v1/agents/{agent_id}/messages/stream endpoint instead of /messages:
```typescript
import { LettaClient } from "@letta-ai/letta-client";

const client = new LettaClient({ token: "YOUR_API_KEY" });

// Step streaming (default) - returns complete messages
const stream = await client.agents.messages.createStream(agent.id, {
  messages: [{ role: "user", content: "Hello!" }],
});
for await (const chunk of stream) {
  console.log(chunk); // Complete message objects
}

// Token streaming - returns partial chunks for real-time UX
const tokenStream = await client.agents.messages.createStream(agent.id, {
  messages: [{ role: "user", content: "Hello!" }],
  streamTokens: true, // Enable token streaming
});
for await (const chunk of tokenStream) {
  console.log(chunk); // Partial content chunks
}
```

```python
# Step streaming (default) - returns complete messages
stream = client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Hello!"}]
)
for chunk in stream:
    print(chunk)  # Complete message objects

# Token streaming - returns partial chunks for real-time UX
stream = client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Hello!"}],
    stream_tokens=True  # Enable token streaming
)
for chunk in stream:
    print(chunk)  # Partial content chunks
```

Streaming Modes Comparison
| Aspect | Step Streaming (default) | Token Streaming |
|---|---|---|
| What you get | Complete messages after each step | Partial chunks as tokens generate |
| When to use | Simple implementation | ChatGPT-like real-time UX |
| Reassembly needed | No | Yes (by message ID) |
| Message IDs | Unique per message | Same ID across chunks |
| Content format | Full text in each message | Incremental text pieces |
| Enable with | Default behavior | stream_tokens: true |
Understanding Message Flow
Message Types and Flow Patterns
The messages you receive depend on your agent’s configuration:
With reasoning enabled (default):

- Simple response: reasoning_message → assistant_message
- With tool use: reasoning_message → tool_call_message → tool_return_message → reasoning_message → assistant_message

With reasoning disabled (reasoning=false):

- Simple response: assistant_message
- With tool use: tool_call_message → tool_return_message → assistant_message
Message Type Reference
- reasoning_message: Agent’s internal thinking process (only when reasoning=true)
- assistant_message: The actual response shown to the user
- tool_call_message: Request to execute a tool
- tool_return_message: Result from tool execution
- stop_reason: Indicates end of response (end_turn)
- usage_statistics: Token usage and step count metrics
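If you want a single place to branch on these event types, a sketch like the one below dispatches on message_type for a Python stream such as the one in the Quick Start. The reasoning and content fields come from the examples later on this page; tool, stop, and usage chunks are printed raw here because their exact field names depend on your SDK version.

```python
# Sketch: dispatch on message_type (assumes `stream` from the Quick Start example)
for chunk in stream:
    msg_type = getattr(chunk, "message_type", None)
    if msg_type == "reasoning_message":
        print(f"Thinking: {chunk.reasoning}")
    elif msg_type == "assistant_message":
        print(f"Response: {chunk.content}")
    elif msg_type in ("tool_call_message", "tool_return_message"):
        print(f"Tool event: {chunk}")  # inspect the raw chunk for tool details
    elif msg_type in ("stop_reason", "usage_statistics"):
        print(f"Meta: {chunk}")  # end-of-turn marker and token accounting
```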
Controlling Reasoning Messages
```typescript
// With reasoning (default) - includes reasoning_message events
const agent = await client.agents.create({
  model: "openai/gpt-4o-mini",
  // reasoning: true is the default
});

// Without reasoning - no reasoning_message events
const agentNoReasoning = await client.agents.create({
  model: "openai/gpt-4o-mini",
  reasoning: false, // Disable reasoning messages
});
```

```python
# With reasoning (default) - includes reasoning_message events
agent = client.agents.create(
    model="openai/gpt-4o-mini",
    # reasoning=True is the default
)

# Without reasoning - no reasoning_message events
agent = client.agents.create(
    model="openai/gpt-4o-mini",
    reasoning=False  # Disable reasoning messages
)
```

Step Streaming (Default)
Step streaming delivers complete messages after each agent step completes. This is the default behavior when you use the streaming endpoint.
How It Works
- Agent processes your request through steps (reasoning, tool calls, generating responses)
- After each step completes, you receive a complete LettaMessage via SSE
- Each message can be processed immediately without reassembly
Example
```typescript
import { LettaClient } from "@letta-ai/letta-client";
import type { LettaMessage } from "@letta-ai/letta-client/api/types";

const client = new LettaClient({ token: "YOUR_API_KEY" });

const stream = await client.agents.messages.createStream(agent.id, {
  messages: [{ role: "user", content: "What's 2+2?" }],
});

for await (const chunk of stream as AsyncIterable<LettaMessage>) {
  if (chunk.messageType === "reasoning_message") {
    console.log(`Thinking: ${(chunk as any).reasoning}`);
  } else if (chunk.messageType === "assistant_message") {
    console.log(`Response: ${(chunk as any).content}`);
  }
}
```

```python
stream = client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "What's 2+2?"}]
)

for chunk in stream:
    if hasattr(chunk, 'message_type'):
        if chunk.message_type == 'reasoning_message':
            print(f"Thinking: {chunk.reasoning}")
        elif chunk.message_type == 'assistant_message':
            print(f"Response: {chunk.content}")
```

```bash
curl -N --request POST \
  --url https://api.letta.com/v1/agents/$AGENT_ID/messages/stream \
  --header "Authorization: Bearer $LETTA_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{"messages": [{"role": "user", "content": "What is 2+2?"}]}'

# For self-hosted: Replace https://api.letta.com with http://localhost:8283
```

Example Output
```
data: {"id":"msg-123","message_type":"reasoning_message","reasoning":"User is asking a simple math question."}

data: {"id":"msg-456","message_type":"assistant_message","content":"2 + 2 equals 4!"}

data: {"message_type":"stop_reason","stop_reason":"end_turn"}

data: {"message_type":"usage_statistics","completion_tokens":50,"total_tokens":2821}

data: [DONE]
```

Token Streaming
Token streaming provides partial content chunks as they’re generated by the LLM, enabling a ChatGPT-like experience where text appears character by character.
How It Works
- Set stream_tokens: true in your request
- Receive multiple chunks with the same message ID
- Each chunk contains a piece of the content
- Client must accumulate chunks by ID to rebuild complete messages
Example with Reassembly
```typescript
import { LettaClient } from "@letta-ai/letta-client";
import type { LettaMessage } from "@letta-ai/letta-client/api/types";

const client = new LettaClient({ token: "YOUR_API_KEY" });

// Token streaming with reassembly
interface MessageAccumulator {
  type: string;
  content: string;
}

const messageAccumulators = new Map<string, MessageAccumulator>();

const stream = await client.agents.messages.createStream(agent.id, {
  messages: [{ role: "user", content: "Tell me a joke" }],
  streamTokens: true, // Note: camelCase
});

for await (const chunk of stream as AsyncIterable<LettaMessage>) {
  if (chunk.id && chunk.messageType) {
    const msgId = chunk.id;
    const msgType = chunk.messageType;

    // Initialize accumulator for new messages
    if (!messageAccumulators.has(msgId)) {
      messageAccumulators.set(msgId, {
        type: msgType,
        content: "",
      });
    }

    // Accumulate content based on message type
    const acc = messageAccumulators.get(msgId)!;

    // Only accumulate if the type matches (in case types share IDs)
    if (acc.type === msgType) {
      if (msgType === "reasoning_message") {
        acc.content += (chunk as any).reasoning || "";
      } else if (msgType === "assistant_message") {
        acc.content += (chunk as any).content || "";
      }
    }

    // Update UI with accumulated content
    process.stdout.write(acc.content);
  }
}
```

```python
# Token streaming with reassembly
message_accumulators = {}

stream = client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream_tokens=True
)

for chunk in stream:
    if hasattr(chunk, 'id') and hasattr(chunk, 'message_type'):
        msg_id = chunk.id
        msg_type = chunk.message_type

        # Initialize accumulator for new messages
        if msg_id not in message_accumulators:
            message_accumulators[msg_id] = {
                'type': msg_type,
                'content': ''
            }

        # Accumulate content
        if msg_type == 'reasoning_message':
            message_accumulators[msg_id]['content'] += chunk.reasoning
        elif msg_type == 'assistant_message':
            message_accumulators[msg_id]['content'] += chunk.content

        # Display accumulated content in real-time
        print(message_accumulators[msg_id]['content'], end='', flush=True)
```

```bash
curl -N --request POST \
  --url https://api.letta.com/v1/agents/$AGENT_ID/messages/stream \
  --header "Authorization: Bearer $LETTA_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "stream_tokens": true
  }'
```

Example Output
```
# Same ID across chunks of the same message
data: {"id":"msg-abc","message_type":"assistant_message","content":"Why"}
data: {"id":"msg-abc","message_type":"assistant_message","content":" did"}
data: {"id":"msg-abc","message_type":"assistant_message","content":" the"}
data: {"id":"msg-abc","message_type":"assistant_message","content":" scarecrow"}
data: {"id":"msg-abc","message_type":"assistant_message","content":" win"}
# ... more chunks with same ID
data: [DONE]
```

Implementation Tips
Universal Handling Pattern
The accumulator pattern shown above works for both streaming modes:
- Step streaming: Each message is complete (single chunk per ID)
- Token streaming: Multiple chunks per ID need accumulation
This means you can write your client code once to handle both cases.
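As a minimal sketch of that idea, the helper below (the handle_stream name is purely illustrative) wraps the accumulator logic from the reassembly example and can be pointed at either mode:

```python
def handle_stream(stream):
    """Accumulate chunks by message ID; works for step and token streaming."""
    accumulators = {}
    for chunk in stream:
        if hasattr(chunk, 'id') and hasattr(chunk, 'message_type'):
            acc = accumulators.setdefault(chunk.id, {'type': chunk.message_type, 'content': ''})
            if chunk.message_type == 'reasoning_message':
                acc['content'] += chunk.reasoning
            elif chunk.message_type == 'assistant_message':
                acc['content'] += chunk.content
    return accumulators

# Step streaming: each message ID arrives as a single, complete chunk
step_result = handle_stream(client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Hello!"}]
))

# Token streaming: the same helper stitches partial chunks back together
token_result = handle_stream(client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Hello!"}],
    stream_tokens=True
))
```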
SSE Format Notes
All streaming responses follow the Server-Sent Events (SSE) format:
- Each event starts with data: followed by JSON
- Stream ends with data: [DONE]
- Empty lines separate events
Learn more about SSE format here.
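If you’re consuming the endpoint without an SDK, here is a rough sketch of parsing that format over raw HTTP with the requests library; the endpoint and headers are taken from the curl examples above, and agent_id and api_key are placeholders you would supply.

```python
import json
import requests

# Sketch: read the SSE stream directly (same endpoint/headers as the curl examples)
response = requests.post(
    f"https://api.letta.com/v1/agents/{agent_id}/messages/stream",
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    },
    json={"messages": [{"role": "user", "content": "Hello!"}], "stream_tokens": True},
    stream=True,
)

for line in response.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data:"):
        continue  # blank lines separate events
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        break  # end of stream
    event = json.loads(payload)
    print(event.get("message_type"), event)
```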
Handling Different LLM Providers
If your Letta server connects to multiple LLM providers, some may not support token streaming. Your client code will still work: the server falls back to step streaming automatically when token streaming isn’t available.