    Streaming UI

    Behest streams SSE exactly the way OpenAI does, so every OpenAI-compatible streaming client works unmodified. This guide covers cancellation, reconnection, error surfacing, and the common "typewriter" UX tricks.


    Basic stream (browser)

    After fetching a token from your backend (/api/behest/token), call Behest directly with the OpenAI SDK:

    ts
    import OpenAI from "openai";
     
    const { token, sessionId } = await fetchBehestToken();
    const openai = new OpenAI({
      apiKey: token,
      baseURL: `${BEHEST_BASE_URL}/v1`,
      dangerouslyAllowBrowser: true,
      defaultHeaders: { "X-Session-Id": sessionId },
    });
     
    const stream = await openai.chat.completions.create({
      messages,
      stream: true,
    });
     
    for await (const chunk of stream) {
      const delta = chunk.choices[0]?.delta?.content ?? "";
      appendToUI(delta);
    }
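
    The token fetch itself is just a call to your own backend route. A minimal sketch of fetchBehestToken, assuming /api/behest/token responds with { token, sessionId } as JSON (the exact route handler and response shape are yours to define):

    ts
    // Hypothetical helper for illustration; adapt to your backend's actual response shape.
    async function fetchBehestToken(): Promise<{ token: string; sessionId: string }> {
      const res = await fetch("/api/behest/token", { method: "POST" });
      if (!res.ok) throw new Error(`Behest token fetch failed: ${res.status}`);
      return res.json(); // expected: { token, sessionId }
    }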

    Back in the streaming loop: each chunk is a ChatCompletionChunk, and choices[0].delta.content is the incremental string. The last chunk carries finish_reason: "stop" | "length" | "tool_calls" | "content_filter".
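
    If you need to branch on how the generation ended, capture the finish reason as you stream. A minimal sketch (showTruncatedNotice is a placeholder for whatever your UI does when the reply is cut short):

    ts
    let finishReason: string | null = null;
    for await (const chunk of stream) {
      const choice = chunk.choices[0];
      if (choice?.delta?.content) appendToUI(choice.delta.content);
      if (choice?.finish_reason) finishReason = choice.finish_reason;
    }
    if (finishReason === "length") showTruncatedNotice(); // the model hit the token limit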


    Cancel a stream (AbortSignal)

    Always wire a cancel button. The user closing a tab or navigating away should not keep LLM tokens flowing (you pay for them).

    tsx
    const abortRef = useRef<AbortController | null>(null);
     
    async function send() {
      abortRef.current = new AbortController();
      const stream = await openai.chat.completions.create(
        { messages, stream: true },
        { signal: abortRef.current.signal }
      );
      try {
        for await (const chunk of stream) render(chunk);
      } catch (err) {
        if ((err as Error).name === "AbortError") return; // user cancelled
        throw err;
      }
    }
     
    <button onClick={() => abortRef.current?.abort()}>Stop</button>;

    Aborting closes the HTTP/2 stream immediately. Behest stops billing the moment Kong sees the disconnect.


    Cleanup on unmount

    tsx
    useEffect(() => {
      return () => abortRef.current?.abort(); // cancel in-flight request on nav
    }, []);

    Reconnect on transient failure

    For long generations, network blips happen. Pattern: if the stream dies before finish_reason, retry from the last chunk you rendered using the thread history:

    ts
    async function streamWithRetry(
      openai: OpenAI,
      messages: Msg[],
      threadId: string,
      attempts = 2
    ) {
      for (let i = 0; i <= attempts; i++) {
        try {
          const stream = await openai.chat.completions.create(
            {
              messages,
              stream: true,
            },
            {
              headers: { "X-Thread-Id": threadId },
            }
          );
          for await (const chunk of stream) {
            if (chunk.choices[0]?.delta?.content) render(chunk);
            if (chunk.choices[0]?.finish_reason) return; // done
          }
        } catch (err) {
          if (i === attempts || isFatal(err)) throw err;
          await sleep(500 * 2 ** i); // exponential backoff: 500ms, 1s, 2s
        }
      }
    }

    isFatal should return true for 400/401/402/403 — retrying won't help. Retry only network errors and 5xx. With the OpenAI SDK in the browser, check (err as APIError).status to decide.
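
    A possible isFatal for the retry sketch above, following that rule (tune the status handling to your app):

    ts
    function isFatal(err: unknown): boolean {
      const status = (err as { status?: number }).status;
      if (status === undefined) return false; // network blip: worth retrying
      if (status >= 500) return false;        // upstream hiccup: worth retrying
      return true;                            // 400/401/402/403 etc.: retrying won't help
    }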


    Typewriter smoothing

    Chunks arrive in bursts (10–100 tokens per TCP packet). To keep the readout from stuttering, queue incoming chunks and flush them at a fixed rate:

    ts
    const queue: string[] = [];
    let flushTimer: ReturnType<typeof setInterval> | null = null;
     
    function startFlush() {
      if (flushTimer) return;
      flushTimer = setInterval(() => {
        if (queue.length === 0) return;
        const next = queue.shift()!;
        appendCharByChar(next, 8); // your own helper: reveal `next` one character every 8ms
      }, 16); // 60fps
    }
     
    for await (const chunk of stream) {
      queue.push(chunk.choices[0]?.delta?.content ?? "");
      startFlush();
    }

    Or use requestAnimationFrame — whichever matches your framework's update cycle.
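
    A requestAnimationFrame variant of the same idea, draining a small character budget per frame (appendToUI is the same helper as in the basic example; the chars-per-frame value is just a starting point to tune):

    ts
    let pending = "";
    let rafId: number | null = null;
     
    function enqueue(delta: string) {
      pending += delta;
      if (rafId === null) rafId = requestAnimationFrame(drain);
    }
     
    function drain() {
      const charsPerFrame = 3; // ~180 chars/sec at 60fps
      appendToUI(pending.slice(0, charsPerFrame));
      pending = pending.slice(charsPerFrame);
      rafId = pending ? requestAnimationFrame(drain) : null;
    }

    Inside the for await loop, call enqueue(chunk.choices[0]?.delta?.content ?? "") instead of pushing onto the queue.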


    Error in the middle of a stream

    A stream can fail mid-response. Typically this is 402 (user exhausted quota) or 5xx (upstream hiccup). Behest sends a terminal event:

    data: {"error":{"code":"quota_exceeded","message":"..."}}
    
    data: [DONE]
    

    Browser (OpenAI SDK): the SDK throws an APIError with .status and .code. Render what you have and branch:

    ts
    try {
      for await (const chunk of stream) render(chunk);
    } catch (err) {
      const status = (err as { status?: number }).status;
      if (status === 402) showUpgradeModal(err);
      else if (status === 429) backoffToast(err);
      else renderError("Something went wrong — tap to retry.");
    }

    Backend (v1.5 SDK): the SDK throws BehestQuotaError, BehestRateLimitError, etc. See error handling for the full taxonomy.
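
    A sketch of the equivalent branch on the backend, assuming the error classes are exported from @behest/client-ts (forwardToClient and the two send* helpers are placeholders for your own response handling):

    ts
    import { BehestQuotaError, BehestRateLimitError } from "@behest/client-ts";
     
    try {
      for await (const chunk of stream) forwardToClient(chunk);
    } catch (err) {
      if (err instanceof BehestQuotaError) sendQuotaExceeded();        // e.g. 402 back to your client
      else if (err instanceof BehestRateLimitError) sendRateLimited(); // e.g. 429 with a retry hint
      else throw err;                                                  // let your error middleware handle the rest
    }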


    Progressive rendering for Markdown / code blocks

    Raw deltas often split markdown mid-token (`foo` can arrive as `fo`, then `o`). Render markdown only when the stream has been idle for > 60ms, or accumulate a full-text buffer and re-parse it on each update:

    tsx
    const [raw, setRaw] = useState("");
    const rendered = useMemo(() => markdownToHtml(raw), [raw]);
     
    // In the loop:
    setRaw((r) => r + delta);

    React's batching makes this cheap enough up to ~50k tokens.
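
    If re-parsing the whole buffer on every delta gets expensive, the idle-debounce variant mentioned above re-renders only after no chunk has arrived for ~60ms. A sketch (markdownToHtml as above; setRendered is a placeholder for however you push the HTML into your UI):

    ts
    let buffer = "";
    let idleTimer: ReturnType<typeof setTimeout> | null = null;
     
    function onDelta(delta: string) {
      buffer += delta;
      if (idleTimer) clearTimeout(idleTimer);
      idleTimer = setTimeout(() => setRendered(markdownToHtml(buffer)), 60); // parse once the stream goes quiet
    }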


    Server-to-server streaming (Node/Python)

    Use the v1.5 Behest SDK directly. Same pattern; no AbortController required:

    ts
    import { Behest } from "@behest/client-ts";
    const behest = new Behest();
     
    const stream = await behest.chat.completions.create({
      messages,
      stream: true,
      user_id, // auto-mints a per-user JWT for this call
    });
    let out = "";
    for await (const chunk of stream) out += chunk.choices[0]?.delta?.content ?? "";

    python
    from behest import Behest
    behest = Behest()
     
    stream = await behest.chat.completions.create(
        messages=...,
        stream=True,
        user_id=user_id,
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

    See also

    Enterprise Token FinOps: Enforce hard budgets and attribute costs per session.
