Overview

Welcome to the Fluxions API. Our hosted endpoints cover three product surfaces:

  • Transcription — akro-v1, our listening model: speech-to-text, speaker diarization, and non-speech events (breaths, laughter, hesitations) in one call. Production-ready today.
  • Text-to-Speech — hosted VUI for expressive, low-latency TTS over HTTP or WebSocket. Live today — see Speech.
  • Realtime Voice — OpenAI Realtime-compatible WebSocket for end-to-end streaming voice conversations. Coming soon.

This page covers the basics that apply across all surfaces: authentication, base URL, and a health check.

Authentication

API requests require an API key unless an endpoint says otherwise (for example, GET /health and renders with the built-in Speech voices need no key). Include your API key in the Authorization header:

curl "https://api.fluxions.ai/endpoint" \
-H "Authorization: YOUR_API_KEY"

Important: Do not use the "Bearer " prefix. Include the API key directly in the Authorization header.
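
As a rough illustration of the header rule above, here is a minimal Python sketch using only the standard library. The helper name build_request is hypothetical, and /endpoint is the placeholder path from the curl example, not a real route.

```python
import urllib.request

BASE_URL = "https://api.fluxions.ai"


def build_request(path: str, api_key: str) -> urllib.request.Request:
    """Build a request that carries the raw API key (no "Bearer " prefix)."""
    if api_key.lower().startswith("bearer "):
        raise ValueError('pass the raw API key, without the "Bearer " prefix')
    return urllib.request.Request(BASE_URL + path, headers={"Authorization": api_key})


# Nothing is sent here; the request object is only constructed.
req = build_request("/endpoint", "YOUR_API_KEY")
print(req.get_header("Authorization"))  # YOUR_API_KEY
```

Passing the result to urllib.request.urlopen would send it; the guard simply catches the most likely mistake before it reaches the network.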

Base URL

https://api.fluxions.ai

GET /health — Health Check

Check the API status and version information. No authentication required.

Request

curl "https://api.fluxions.ai/health"

Response

{
"status": "ok",
"gateway": "api.fluxions.ai"
}

Transcription

Our akro-v1 model is a comprehensive listening model that performs:

  • Transcription — Convert speech to text with high accuracy
  • Speaker Diarization — Identify and separate different speakers ("who said what")
  • Non-Speech Detection — Capture breathing, laughter, hesitation, and other contextual sounds

This makes it ideal for transcribing meetings, interviews, podcasts, and any audio where understanding the full context matters.

All transcription endpoints require authentication — see Overview for API key setup.

Pricing: $0.20 per hour of audio processed, billed by the second. See pricing.
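
To make the per-second billing concrete, a small worked calculation in Python. It assumes the charge is exactly proportional to audio length; how partial seconds are rounded is not stated here, so treat the result as an estimate.

```python
PRICE_PER_HOUR_USD = 0.20  # from the pricing line above


def transcription_cost_usd(audio_seconds: float) -> float:
    """Estimated cost of one job when $0.20/hour is billed by the second."""
    return audio_seconds * PRICE_PER_HOUR_USD / 3600


print(round(transcription_cost_usd(1800), 4))  # 0.1  (a 30-minute file)
```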

POST /submit — Submit Transcription

Submit audio for processing and receive a job ID immediately. Poll /transcriptions/{id} for results including transcription, speaker diarization, and non-speech events.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| non_speech | boolean | false | Include non-speech sounds |
| filename | string | "audio" | Name for the uploaded file |
| cache | boolean | true | Use cached results for identical files |

Request

Body: raw audio file bytes.

curl -X POST "https://api.fluxions.ai/akro/submit" \
-H "Authorization: YOUR_API_KEY" \
-H "Content-Type: audio/mpeg" \
--data-binary @audio.mp3

Response

{
"id": 124,
"status": "submitted",
"created_at": "2025-10-24T10:35:00.000Z",
"original_audio_url": "https://...",
"query_urls": {
"get": "https://api.fluxions.ai/transcriptions/124",
"status": "https://api.fluxions.ai/transcriptions/124"
},
"cached": false
}

Workflow

  1. Submit audio via /submit and receive a job ID
  2. Poll /transcriptions/{id} to check status
  3. When status is "completed", the same response carries the full results; when it is "failed", check error_message
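
The submit-then-poll workflow above can be sketched in Python with the standard library. The function names, the 2-second polling interval, and the 10-minute timeout are assumptions for illustration, and the submit path /akro/submit is taken from the curl example.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.fluxions.ai"


def submit_audio(path: str, api_key: str, content_type: str = "audio/mpeg") -> dict:
    """POST the raw audio bytes and return the job record (contains "id")."""
    with open(path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        BASE_URL + "/akro/submit", data=body, method="POST",
        headers={"Authorization": api_key, "Content-Type": content_type},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def fetch_job(job_id: int, api_key: str) -> dict:
    """GET /transcriptions/{id} and return the parsed JSON."""
    req = urllib.request.Request(
        f"{BASE_URL}/transcriptions/{job_id}", headers={"Authorization": api_key}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_job(fetch, interval_secs=2.0, timeout_secs=600.0, sleep=time.sleep) -> dict:
    """Call fetch() until the job is completed, failed, or the timeout passes."""
    deadline = time.monotonic() + timeout_secs
    while True:
        job = fetch()
        if job["status"] == "completed":
            return job
        if job["status"] == "failed":
            raise RuntimeError(job.get("error_message", "transcription failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError("transcription did not finish in time")
        sleep(interval_secs)
```

A typical call would be job = submit_audio("audio.mp3", key) followed by wait_for_job(lambda: fetch_job(job["id"], key)). Passing fetch as a callable keeps the polling loop independent of the HTTP client.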

GET /transcriptions/{id} — Get Transcription Results

Retrieve the full results for a specific job: transcription, speaker diarization, and non-speech events.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| word_level_timestamps | boolean | false | Include word-level timestamps in segments |

Request

curl "https://api.fluxions.ai/transcriptions/124" \
-H "Authorization: YOUR_API_KEY"

Response

{
"id": 124,
"status": "completed",
"created_at": "2025-10-24T10:35:00.000Z",
"updated_at": "2025-10-24T10:35:20.000Z",
"filename": "interview.mp3",
"audio_duration": 300.0,
"audio_format": "opus",
"processing_time": 245.5,
"language": "en",
"non_speech": false,
"num_chunks": 11,
"num_segments": 25,
"num_speakers": 2,
"text": "SPEAKER_0: Yeah, let's actually start off exactly, where we initially began.\nSPEAKER_1: Sounds perfect. That makes complete sense to me.\nSPEAKER_0: So I started thinking about what if this is just a construct?",
"segments": [
{
"speaker": "0",
"text": "Yeah, let's actually start off exactly, where we initially began.",
"start": 0.86,
"end": 6.42,
"segment_idx": 0
},
{
"speaker": "1",
"text": "Sounds perfect",
"start": 6.0,
"end": 7.2,
"segment_idx": 0
},
{
"speaker": "1",
"text": "That makes complete sense to me.",
"start": 7.5,
"end": 9.8,
"segment_idx": 1
}
],
"audio_url": "https://...r2.cloudflarestorage.com/...",
"cached": true
}

Status Values

  • submitted — Job has been submitted
  • processing — Transcription in progress
  • completed — Transcription finished successfully
  • failed — Transcription failed (check error_message)

GET /transcriptions — List Transcriptions

List all transcriptions for your account.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| limit | integer | 50 | Number of results per page (max: 100) |
| offset | integer | 0 | Pagination offset |

Request

curl "https://api.fluxions.ai/transcriptions?limit=10&offset=0" \
-H "Authorization: YOUR_API_KEY"

Response

{
"total": 150,
"limit": 10,
"offset": 0,
"transcriptions": [
{
"id": 150,
"status": "completed",
"created_at": "2025-10-24T10:40:00.000Z",
"filename": "interview.mp3",
"audio_duration": 1800.0,
"audio_format": "opus",
"processing_time": 45.2,
"num_speakers": 2,
"num_segments": 142,
"original_audio_url": "https://...",
"language": "en"
}
]
}
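
One way to walk the whole list with limit and offset is sketched below in Python. The fetch_page callable is hypothetical: it stands for whatever code performs GET /transcriptions?limit=...&offset=... and returns the parsed JSON shown above.

```python
def iter_transcriptions(fetch_page, limit=50):
    """Yield every transcription by advancing offset until total is reached."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        items = page["transcriptions"]
        yield from items
        offset += len(items)
        # Stop on an empty page too, so a changing total cannot loop forever.
        if not items or offset >= page["total"]:
            break
```

Advancing by len(items) rather than by limit keeps the walk correct even if the server returns a short page.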

Response Format

Text Field

The text field contains the full transcription with speaker labels and optional non-speech events:

  • Speaker Labels: SPEAKER_0:, SPEAKER_1:, etc. prefix each speaker's utterances
  • Line Breaks: Newlines (\n) separate different speaker turns
  • Non-speech Events: When enabled, events like [breath], [pause] appear inline

Example:

SPEAKER_0: Yeah, let's start [breath] where we began.
SPEAKER_1: Sounds good. That makes sense.
SPEAKER_0: So I was thinking about [pause] what if this is a construct?
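
Because the text field follows a fixed "SPEAKER_n: utterance" layout with one turn per line, it can be split back into turns. A minimal Python sketch, assuming every turn starts with that prefix; the function name parse_turns is illustrative.

```python
import re

TURN_RE = re.compile(r"^SPEAKER_(\d+):\s*(.*)$")


def parse_turns(text: str) -> list[tuple[str, str]]:
    """Split the text field into (speaker_id, utterance) pairs, one per line."""
    turns = []
    for line in text.split("\n"):
        match = TURN_RE.match(line.strip())
        if match:  # lines without a speaker prefix are skipped
            turns.append((match.group(1), match.group(2)))
    return turns


sample = (
    "SPEAKER_0: Yeah, let's start [breath] where we began.\n"
    "SPEAKER_1: Sounds good. That makes sense."
)
turns = parse_turns(sample)
# turns[1] == ("1", "Sounds good. That makes sense.")
```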

Segments Array

The segments array provides precise timing and speaker information for each utterance:

  • speaker: Speaker ID as a string ("0", "1", etc.)
  • text: The spoken text for this segment (without non-speech events)
  • start: Start time in seconds (decimal precision)
  • end: End time in seconds (decimal precision)
  • segment_idx: Sequential index for this segment
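
The start and end fields make simple analytics straightforward. As an illustration, this Python sketch totals speaking time per speaker from a segments array; it sums each segment's duration as-is, so overlapping speech is counted once for each speaker involved.

```python
from collections import defaultdict


def talk_time_by_speaker(segments):
    """Sum (end - start) in seconds for each speaker id."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


segments = [
    {"speaker": "0", "start": 0.86, "end": 6.42},
    {"speaker": "1", "start": 6.0, "end": 7.2},
    {"speaker": "1", "start": 7.5, "end": 9.8},
]
print({k: round(v, 2) for k, v in talk_time_by_speaker(segments).items()})
# {'0': 5.56, '1': 3.5}
```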

Non-Speech Events

When non_speech=true, our listening model captures various non-speech sounds and events that provide additional context to the conversation.

Common Non-Speech Sounds

| Event | Tag | Description | Example Usage |
| --- | --- | --- | --- |
| Breath | [breath] | Audible breathing sounds | ...end of sentence. [breath] Now this is important. |
| Laugh | [laugh] or hahaha | Laughter - can be written as text or tagged for longer laughs | Oh wow! hahaha [breath] that's hilarious. |
| Hesitation | [hesitation] or [hesitate] | Unclear thinking noises or mouth sounds while pausing - not specific words | Well [hesitation] um I'm not really sure. |
| Pause | [pause] | Unnaturally long, noticeable pause (e.g., looking something up) | Let me just uh... [pause] Let me look this up. |
| Environment | [env] | Background noise or environmental sounds | I was thinking [env] about what you said. |
| Tut | [tut] | Tongue click or lip smack sound | [tut] That's not quite right. |
| Sigh | [sigh] | Expressive exhale sound | [sigh] I suppose you're right. |
| Sniff | [sniff] | Nasal inhale or sniffing sound | [sniff] Something smells good in here. |
| Cough | [cough] | Coughing sound | Sorry, excuse me [cough] as I was saying... |

Usage Notes

  • Non-speech events are placed inline with the transcribed text
  • Events appear at their natural position in the conversation flow
  • Word elongation is marked with ellipsis: um... so... I think...
  • Emphasis on words uses asterisks: I *really* think so
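
If you need plain words for search or display, the inline markup can be separated out. The Python sketch below assumes every event tag is a lowercase word in square brackets and that emphasis is a pair of asterisks; laughter written out as text (hahaha) is left untouched.

```python
import re

EVENT_RE = re.compile(r"\[([a-z]+)\]")           # e.g. [breath], [pause]
EMPHASIS_RE = re.compile(r"\*([^*\s][^*]*)\*")   # e.g. *really*


def split_events(text: str) -> tuple[str, list[str]]:
    """Return (text without tags or emphasis marks, list of event names)."""
    events = EVENT_RE.findall(text)
    clean = EVENT_RE.sub(" ", text)
    clean = EMPHASIS_RE.sub(r"\1", clean)
    clean = re.sub(r"\s+", " ", clean).strip()
    return clean, events


clean, events = split_events("Well [hesitation] um I *really* think so. [breath] Okay.")
print(clean)   # Well um I really think so. Okay.
print(events)  # ['hesitation', 'breath']
```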

Speech

Hosted VUI — expressive, low-latency text-to-speech. Send text, get back audio in a natural voice, with support for non-verbal cues like [sigh] and [laugh].

Two ways to render text:

  • HTTP (POST /v1/tts) — one request, one render. Simplest to integrate.
  • WebSocket (/v1/tts/ws) — keep a warm socket open across renders so each one skips the TLS/TCP handshake and reaches first audio sooner. Use this for interactive UIs.

Pricing: $10 per 1M characters (≈ $0.45 per hour of audio). See pricing.
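
The two figures in the pricing line can be related with a little arithmetic. This Python sketch assumes every character of input is billed (whether bracketed cues count is not stated here) and derives the speaking rate implied by the per-hour estimate.

```python
PRICE_PER_MILLION_CHARS_USD = 10.0  # from the pricing line above
APPROX_PRICE_PER_HOUR_USD = 0.45    # the same line's per-hour estimate


def tts_cost_usd(text: str) -> float:
    """Estimated cost of one render, assuming each character is billed."""
    return len(text) * PRICE_PER_MILLION_CHARS_USD / 1_000_000


# Characters per hour of audio implied by the two published figures.
implied_chars_per_hour = APPROX_PRICE_PER_HOUR_USD / PRICE_PER_MILLION_CHARS_USD * 1_000_000
print(round(implied_chars_per_hour))            # 45000
print(round(implied_chars_per_hour / 3600, 1))  # 12.5 characters per second
```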

Base URL

Speech is served through the unified Fluxions API gateway under the /vui namespace:

https://api.fluxions.ai/vui

Authentication

Built-in voices are public — no API key needed. A private voice you've cloned requires your credential in the Authorization header (Bearer <token>). See Voices below.

GET /voices — List Voices

List the built-in voices available to everyone. No authentication required.

Request

curl "https://api.fluxions.ai/vui/voices"

Response

{
"voices": [
{ "voice_id": "maeve.h736bab09a", "preview_text": "I just, I want you to know how proud I am of you..." },
{ "voice_id": "abraham.h736bab09a", "preview_text": "I've finished analysing the document you uploaded..." },
{ "voice_id": "harry.h736bab09a", "preview_text": "Hello, this is Harry. I'm calling you..." }
]
}

Pass any voice_id as the voice field when rendering.

POST /v1/tts — Render (HTTP)

Synthesize speech from text. Returns a complete WAV by default, or streams audio chunk-by-chunk when stream=1.

Parameters

JSON body:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| voice | string | (required) | A voice_id from GET /voices |
| input | string | (required) | Text to speak. Supports non-verbal cues (see below) |
| temperature | float | 0.9 | Sampling temperature — higher is more varied |
| response_format | string | "wav" | "wav" (complete file) or "pcm" (raw s16le @ 24 kHz) |
| stream | boolean | false | Stream audio as it's generated instead of buffering the whole file |
| max_secs | float | (auto) | Hard ceiling on output length. Auto-estimated from text length when omitted |
| verify_chunks | boolean | true | Re-checks each rendered chunk with a fast speech-to-text pass and re-renders any that misread the text. Improves reliability at the cost of latency. Set false for the lowest-latency stream (see Streaming) |

Request

curl -X POST "https://api.fluxions.ai/vui/v1/tts" \
-H "Content-Type: application/json" \
-d '{"voice": "maeve.h736bab09a", "input": "[sigh] fine, I will say it one more time."}' \
--output speech.wav

Response

200 OK with the audio bytes. Content-Type is audio/wav (or audio/L16 when response_format is "pcm").
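
The same render can be issued from Python with the standard library. This is a sketch rather than an official client: build_tts_request and render_to_file are hypothetical names, and the optional token argument is assumed to be the full Authorization header value needed only for private cloned voices.

```python
import json
import urllib.request

TTS_URL = "https://api.fluxions.ai/vui/v1/tts"


def build_tts_request(voice, text, *, token=None, **options):
    """Build the POST /v1/tts request; options become extra JSON body fields."""
    body = {"voice": voice, "input": text, **options}
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = token  # only for private cloned voices
    return urllib.request.Request(
        TTS_URL, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )


def render_to_file(req, out_path):
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as out:
        out.write(resp.read())


req = build_tts_request("maeve.h736bab09a", "[sigh] fine, I will say it one more time.",
                        temperature=0.7)
```

Calling render_to_file(req, "speech.wav") would perform the render; building the request separately makes the body easy to inspect first.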

Streaming

Add stream=1 (query param or body field) to receive audio as it's generated, delivered as chunked transfer encoding.

By default (verify_chunks: true) each chunk is checked — and re-rendered if it misreads the text — before it streams, so the first audio lands once the first chunk is rendered and verified (~1 s for a typical sentence). Set verify_chunks: false to stream each chunk the instant the model produces it, unverified: first bytes then land within ~80 ms.

curl -X POST "https://api.fluxions.ai/vui/v1/tts?stream=1" \
-H "Content-Type: application/json" \
-d '{"voice": "maeve.h736bab09a", "input": "Streaming starts playing almost immediately."}' \
--output speech.wav
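
To consume the stream incrementally instead of saving it, something like the following Python sketch could work. The names iter_chunks and stream_tts are illustrative, and on_chunk is a hypothetical callback (for example, one that feeds an audio player). Chunk boundaries are arbitrary, so a chunk may end in the middle of a sample.

```python
import io
import json
import urllib.request

TTS_URL = "https://api.fluxions.ai/vui/v1/tts"


def iter_chunks(resp, max_bytes=4096):
    """Yield audio bytes as they arrive; read1 returns whatever is ready."""
    while True:
        chunk = resp.read1(max_bytes)
        if not chunk:
            return
        yield chunk


def stream_tts(voice, text, on_chunk, verify_chunks=True):
    """POST with stream enabled and hand each received chunk to on_chunk."""
    body = {"voice": voice, "input": text, "stream": True, "verify_chunks": verify_chunks}
    req = urllib.request.Request(
        TTS_URL, data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        for chunk in iter_chunks(resp):
            on_chunk(chunk)


# Offline check of the reader, with an in-memory stream standing in for the response.
print(list(iter_chunks(io.BytesIO(b"abcdefghij"), 4)))  # [b'abcd', b'efgh', b'ij']
```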

WebSocket /v1/tts/ws — Render (warm socket)

Identical render logic to POST /v1/tts, but the socket stays open between renders. Hold it open and the TLS/TCP/tunnel handshake is paid once — each subsequent speak goes straight to synthesis. Ideal for typing UIs or back-to-back lines.

Audio is delivered as binary frames of s16le PCM, mono, 24 kHz (no WAV header — assemble it yourself if you need a file).

Protocol

Client → server (text JSON):

{ "type": "speak", "voice": "<id>", "input": "<text>", "temperature": 0.9, "max_secs": 0, "verify_chunks": true, "token": "Bearer <jwt>" }
{ "type": "session.close" }

temperature, max_secs, and verify_chunks are optional. verify_chunks defaults to true; set it false for the lowest-latency stream (see Streaming).

Authentication. Built-in voices are public — omit token. A private cloned voice needs token set to the same value you'd put in the Authorization header: Bearer <clerk-jwt> for a signed-in session, or your raw API key. It rides in the speak message because browsers can't set headers on a WebSocket. The token is checked per speak, so you can mix public and private voices on one socket.
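
Since the token travels inside each speak message, a small builder keeps public and private renders uniform. A Python sketch with a hypothetical helper name; optional fields are simply left out when not provided.

```python
import json


def speak_message(voice, text, *, token=None, temperature=None,
                  max_secs=None, verify_chunks=None) -> str:
    """Serialize a speak message, omitting optional fields that are unset."""
    message = {"type": "speak", "voice": voice, "input": text}
    optional = {
        "temperature": temperature,
        "max_secs": max_secs,
        "verify_chunks": verify_chunks,
        "token": token,  # only needed for private cloned voices
    }
    message.update({k: v for k, v in optional.items() if v is not None})
    return json.dumps(message)


public = speak_message("maeve.h736bab09a", "Hello there.")
private = speak_message("u-<user>-<hash>", "Hello again.", token="YOUR_API_KEY")
```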

Server → client:

| Message | Meaning |
| --- | --- |
| {"type": "start"} | The worker stream opened — audio frames follow |
| (binary frame) | A chunk of s16le PCM @ 24 kHz |
| {"type": "done"} | Current render finished — socket stays open for the next speak |
| {"type": "error", "message": "..."} | Render failed (socket stays open) |

One render = one speak: start → binary PCM* → done. Send another speak on the same socket to render again.
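
The message sequence above can be handled by one coroutine that is called repeatedly on the same socket. In this Python sketch, ws is assumed to be any object with async send() and recv() methods (for instance a connection from the third-party websockets package); the name speak is illustrative.

```python
import json


async def speak(ws, voice, text):
    """Run one render on an already-open socket and return the raw PCM bytes."""
    await ws.send(json.dumps({"type": "speak", "voice": voice, "input": text}))
    pcm = bytearray()
    while True:
        frame = await ws.recv()
        if isinstance(frame, (bytes, bytearray)):
            pcm += frame  # binary frame: a chunk of s16le PCM
            continue
        event = json.loads(frame)
        if event["type"] == "done":
            return bytes(pcm)  # socket stays open for the next speak
        if event["type"] == "error":
            raise RuntimeError(event.get("message", "render failed"))
        # "start" needs no action: audio frames follow
```

Calling speak twice on one connection renders two lines while paying for the connection setup only once.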

Request

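import asyncio
import json
import websockets  # third-party: pip install websockets

async def render(text, voice='maeve.h736bab09a'):
    """Render one line over the WebSocket and return raw s16le PCM @ 24 kHz."""
    pcm = bytearray()
    async with websockets.connect('wss://api.fluxions.ai/vui/v1/tts/ws') as ws:
        await ws.send(json.dumps({'type': 'speak', 'voice': voice, 'input': text}))
        async for msg in ws:
            if isinstance(msg, bytes):
                pcm += msg  # binary frame: a chunk of PCM audio
                continue
            event = json.loads(msg)
            if event['type'] == 'done':
                break
            if event['type'] == 'error':
                # The socket stays open after an error, so stop waiting explicitly.
                raise RuntimeError(event.get('message', 'render failed'))
    return bytes(pcm)

audio = asyncio.run(render('[sigh] so you want to force me to say things.'))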

Non-Verbal Cues

Wrap a cue in square brackets inside input and the model renders it as an expressive sound rather than reading the word aloud:

| Cue | Effect |
| --- | --- |
| [sigh] | Audible sigh |
| [laugh] | Laughter |
| [gasp] | Sharp intake of breath |
| [sniff] | Sniffle |
| [cough] | Cough |
| [hesitate] | Filler / thinking sound |

Example: "[gasp] you did NOT just put pineapple on that pizza! [laugh] okay, okay."
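
Before sending text, it can be useful to flag bracketed cues that are not in the table above. This Python sketch assumes the table is the complete set, which is not confirmed here, and how the service treats other bracketed words is not stated, so treat the result as a warning rather than an error.

```python
import re

KNOWN_CUES = {"sigh", "laugh", "gasp", "sniff", "cough", "hesitate"}
CUE_RE = re.compile(r"\[([^\[\]]+)\]")


def unknown_cues(text: str) -> list[str]:
    """Return bracketed cues in text that are not in the documented table."""
    return [cue for cue in CUE_RE.findall(text) if cue not in KNOWN_CUES]


print(unknown_cues("[gasp] you did NOT just put pineapple on that pizza! [laugh] okay, okay."))  # []
print(unknown_cues("[yawn] good morning [sigh]"))  # ['yawn']
```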

Voices

Built-in voices (GET /voices) are public. You can also clone a custom voice from a short reference clip. Cloned voices are private to your account and require your token on every render: send it in the Authorization header (Bearer <token>) for HTTP, or in the token field of the speak message for the WebSocket.

POST /v1/voices — Clone a Voice

Upload a reference clip plus its transcript; the model encodes a private voice you can render with. Requires authentication. Sent as multipart/form-data.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| audio | file | yes | Reference clip (wav/opus/etc.). A few clean seconds is enough. Max 25 MB |
| text | string | no | Exact transcript of the reference clip. Omit it and we transcribe the clip for you before cloning |
| name | string | no | Display label (defaults to the filename) |

Leave text out and the server runs your clip through transcription automatically — so the simplest clone is just an audio file. Pass text yourself when you want exact control over the transcript.

curl -X POST "https://api.fluxions.ai/vui/v1/voices" \
-H "Authorization: Bearer YOUR_TOKEN" \
-F "audio=@reference.wav" \
-F "text=This is exactly what the reference clip says." \
-F "name=My Voice"

Response: { "voice_id": "u-<user>-<hash>", "name": "My Voice", "frames": 173, "seconds": 13.8 }. Pass the returned voice_id as voice in any render call (with your token).

GET /v1/voices/mine — List Your Cloned Voices

curl "https://api.fluxions.ai/vui/v1/voices/mine" \
-H "Authorization: Bearer YOUR_TOKEN"

Returns { "voices": [ { "voice_id": "u-...-ab12cd34", "name": "My Voice" } ] }.

POST /v1/voices/delete — Remove a Cloned Voice

curl -X POST "https://api.fluxions.ai/vui/v1/voices/delete" \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"voice_id": "u-...-ab12cd34"}'

Note: cloned voices currently live in the running server's memory, not a database — they're tied to your account but are not guaranteed to survive a server restart. Re-upload if a voice_id stops resolving.

Output Format

  • Sample rate: 24,000 Hz
  • Channels: mono
  • Sample format: signed 16-bit little-endian PCM
  • HTTP wav: PCM wrapped in a standard WAV container
  • HTTP pcm / WebSocket binary frames: raw s16le PCM (no header)
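
Given the format above, raw PCM from the WebSocket or from response_format "pcm" can be wrapped into a playable file with Python's standard wave module. A minimal sketch; the helper names are illustrative.

```python
import io
import wave

SAMPLE_RATE = 24_000  # Hz
SAMPLE_WIDTH = 2      # bytes per sample (signed 16-bit)
CHANNELS = 1          # mono


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw s16le mono 24 kHz PCM in a standard WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buffer.getvalue()


def duration_secs(pcm: bytes) -> float:
    """Length of the audio in seconds, derived from the byte count."""
    return len(pcm) / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)


one_second_of_silence = b"\x00\x00" * SAMPLE_RATE
print(duration_secs(one_second_of_silence))  # 1.0
```

Writing pcm_to_wav(audio) to a file opened in binary mode yields a WAV that ordinary players can open.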

History

The History API is one read-only surface over everything you've done on the platform — transcriptions, TTS renders, and voice conversations — under a single host. Use it to list, page, filter, and search your activity, and to fetch download links for the underlying audio and transcripts.

All history endpoints require authentication — see Overview for API key setup.

Base URL: https://api.fluxions.ai

One shape for everything

Every list response uses the same envelope:

{
"object": "list",
"page": 1,
"limit": 20,
"total": 137,
"has_more": true,
"data": [ /* items */ ]
}

Every item carries an object field telling you its type ("transcription", "tts", or "conversation") plus its native id. To fetch one item's detail, append the id to that type's collection path: /history/transcriptions/{id}, /history/tts/{id}, or /history/conversations/{id} (e.g. /history/tts/123). Timestamps are ISO-8601 UTC; costs are in US dollars.
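
Because every item is tagged with object, the detail route for any feed item can be looked up from a small table. A Python sketch using the three detail routes documented on this page; the helper name detail_path is hypothetical.

```python
from urllib.parse import quote

DETAIL_PATHS = {
    "transcription": "/history/transcriptions/{id}",
    "tts": "/history/tts/{id}",
    "conversation": "/history/conversations/{id}",
}


def detail_path(item: dict) -> str:
    """Map a list item to the path of its detail endpoint."""
    template = DETAIL_PATHS[item["object"]]  # KeyError for unknown types
    return template.format(id=quote(str(item["id"]), safe=""))


print(detail_path({"object": "tts", "id": 123}))                    # /history/tts/123
print(detail_path({"object": "conversation", "id": "sess_abc"}))    # /history/conversations/sess_abc
```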

Shared query parameters

These work on every collection (and the unified feed):

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| page | integer | 1 | Page number (1-based) |
| limit | integer | 20 | Results per page (max: 100) |
| order | string | desc | Sort by time: asc or desc |
| since | string | | Only items at/after this time (ISO-8601 or epoch seconds) |
| until | string | | Only items at/before this time (ISO-8601 or epoch seconds) |

Collection-specific filters: voice (tts, conversations), status (transcriptions), type (the unified feed).
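
A possible way to combine these parameters with the has_more flag is sketched below in Python. history_url and iter_history are hypothetical helpers, and fetch_page stands for whatever code performs the authenticated GET and returns the parsed envelope.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.fluxions.ai"


def history_url(collection: str = "", **params) -> str:
    """Build a history URL; parameters set to None are left out."""
    path = "/history" + (f"/{collection}" if collection else "")
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return BASE_URL + path + (f"?{query}" if query else "")


def iter_history(fetch_page, **params):
    """Yield items page by page until the envelope reports has_more = false."""
    page = 1
    while True:
        envelope = fetch_page(page=page, **params)
        yield from envelope["data"]
        if not envelope["has_more"]:
            break
        page += 1


print(history_url("tts", voice="maeve.en-us", limit=10))
# https://api.fluxions.ai/history/tts?voice=maeve.en-us&limit=10
```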

GET /history — Unified Feed

A merged feed across all three types, newest first by default (order=desc). Filter the streams with type (comma-separated).

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| type | string | (all) | Restrict to transcription, tts, and/or conversation (csv) |

(plus all shared parameters above)

Request

curl "https://api.fluxions.ai/history?limit=10&type=tts,conversation" \
-H "Authorization: YOUR_API_KEY"

Response

{
"object": "list",
"page": 1, "limit": 10, "total": 84, "has_more": true,
"data": [
{ "object": "conversation", "id": "sess_abc", "created_at": "2026-06-29T10:40:00Z",
"cost_usd": null, "voice": "maeve.en-us", "duration_secs": 312.4, "turn_count": 18 },
{ "object": "tts", "id": 123, "created_at": "2026-06-29T10:32:00Z",
"cost_usd": 0.0123, "voice": "maeve.en-us", "chars": 842, "audio_secs": 58.4 }
]
}

The feed is lightweight: it does not include presigned download_urls. Use the typed collection or detail endpoints to get them.

GET /history/transcriptions — Transcription History

List your transcriptions.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| status | string | | Filter by status (e.g. completed) |
| include_download_url | boolean | false | Include a presigned audio URL per item |

(plus all shared parameters)

Request

curl "https://api.fluxions.ai/history/transcriptions?status=completed&limit=5" \
-H "Authorization: YOUR_API_KEY"

Response

{
"object": "list",
"page": 1, "limit": 5, "total": 42, "has_more": true,
"data": [
{
"object": "transcription",
"id": 456,
"created_at": "2026-06-29T10:35:00Z",
"cost_usd": 0.10,
"status": "completed",
"filename": "interview.mp3",
"audio_duration_secs": 1800.0,
"audio_format": "opus",
"language": "en",
"num_speakers": 2,
"num_segments": 142
}
]
}

GET /history/transcriptions/{id} — One Transcription

Returns the full record with presigned download_url (audio), text_url, and segments_url. 404 if it isn't yours.

curl "https://api.fluxions.ai/history/transcriptions/456" \
-H "Authorization: YOUR_API_KEY"

GET /history/tts — TTS Render History

List your text-to-speech renders.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| voice | string | | Filter by voice id |
| include_download_url | boolean | true | Include a presigned Opus URL per item |

(plus all shared parameters)

Request

curl "https://api.fluxions.ai/history/tts?voice=maeve.en-us&limit=10" \
-H "Authorization: YOUR_API_KEY"

Response

{
"object": "list",
"page": 1, "limit": 10, "total": 60, "has_more": true,
"data": [
{
"object": "tts",
"id": 123,
"created_at": "2026-06-29T10:32:00Z",
"cost_usd": 0.0123,
"voice": "maeve.en-us",
"chars": 842,
"audio_secs": 58.4,
"download_url": "https://...r2.cloudflarestorage.com/...opus"
}
]
}

GET /history/tts/{id} — One Render

Returns one render with a fresh signed download_url. 404 if it isn't yours.

curl "https://api.fluxions.ai/history/tts/123" \
-H "Authorization: YOUR_API_KEY"

GET /history/conversations — Conversation History

List your voice conversations (agent calls).

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| voice | string | | Filter by voice id |

(plus all shared parameters)

Request

curl "https://api.fluxions.ai/history/conversations?limit=10" \
-H "Authorization: YOUR_API_KEY"

Response

{
"object": "list",
"page": 1, "limit": 10, "total": 23, "has_more": true,
"data": [
{
"object": "conversation",
"id": "sess_abc123",
"created_at": "2026-06-29T10:40:00Z",
"cost_usd": null,
"voice": "maeve.en-us",
"started_at": "2026-06-29T10:40:00Z",
"ended_at": "2026-06-29T10:45:12Z",
"duration_secs": 312.4,
"turn_count": 18
}
]
}

GET /history/conversations/{id} — One Conversation

Returns the session plus its turn-by-turn transcript. 404 if it isn't yours.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| include_turns | boolean | true | Include the transcript turns |
| include_tool_calls | boolean | false | Include tool invocations (calendar, email, …) |
| turns_limit | integer | 500 | Max turns to return (max: 2000) |

Request

curl "https://api.fluxions.ai/history/conversations/sess_abc123?include_tool_calls=true" \
-H "Authorization: YOUR_API_KEY"

Response

{
"object": "conversation",
"id": "sess_abc123",
"created_at": "2026-06-29T10:40:00Z",
"voice": "maeve.en-us",
"duration_secs": 312.4,
"turn_count": 18,
"turns": [
{ "object": "conversation_turn", "id": 9001, "session_id": "sess_abc123",
"role": "user", "text": "What's on my calendar today?", "created_at": "2026-06-29T10:40:05Z" },
{ "object": "conversation_turn", "id": 9002, "session_id": "sess_abc123",
"role": "assistant", "text": "You have two meetings...", "created_at": "2026-06-29T10:40:08Z" }
],
"tool_calls": [
{ "object": "tool_call", "id": 51, "tool": "calendar",
"args": {"range": "today"}, "result": "2 events", "created_at": "2026-06-29T10:40:07Z" }
]
}

GET /history/conversations/search — Search Turns

Full-text search across your conversation turns (Postgres websearch_to_tsquery).

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| q | string | (required) | Search query |
| limit | integer | 20 | Max results (max: 100) |

Request

curl "https://api.fluxions.ai/history/conversations/search?q=dentist+appointment" \
-H "Authorization: YOUR_API_KEY"

Response

{
"object": "list",
"query": "dentist appointment",
"data": [
{ "object": "conversation_turn", "id": 9100, "session_id": "sess_def456",
"role": "user", "text": "remind me about the dentist appointment",
"created_at": "2026-06-20T14:02:00Z" }
]
}

GET /history/search — Search All History

Search across your whole history in one call. Conversation turns are matched by full text; transcriptions are matched by filename (their text lives in object storage, not the database). Results are type-tagged via object.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| q | string | (required) | Search query |
| limit | integer | 20 | Max results per domain (max: 100) |

Request

curl "https://api.fluxions.ai/history/search?q=interview" \
-H "Authorization: YOUR_API_KEY"

Response

{
"object": "list",
"query": "interview",
"data": [
{ "object": "conversation_turn", "id": 9200, "session_id": "sess_ghi",
"role": "assistant", "text": "...the interview went well...", "created_at": "2026-06-25T09:00:00Z" },
{ "object": "transcription", "id": 456, "created_at": "2026-06-29T10:35:00Z",
"status": "completed", "filename": "interview.mp3", "audio_duration_secs": 1800.0 }
]
}