API Reference
Complete reference for the Scriptivox transcription API. All endpoints require an API key, passed in one of:
- Authorization: sk_live_… (the Bearer prefix is accepted but not required)
- X-Api-Key: sk_live_…
Header names are case-insensitive. Each account can have at most 5 active API keys at a time — revoke an unused key in the dashboard before creating a new one if you hit this ceiling.
Base URL: https://api.scriptivox.com/v1
Service status: real-time uptime + incident history at status.scriptivox.com.
Transcribe
Send audio for transcription. You can either pass a URL (we download it) or upload your own file.
From a URL
The simplest path — one POST request. We download the file, validate it, and start transcription automatically. Supports Google Drive, Dropbox, and OneDrive sharing links.
POST /v1/transcribe
Start a transcription from a public URL. The file is downloaded and validated in the background. Poll GET /v1/transcribe/{id} for status updates. Duration and cost are determined after download.
Authorization: sk_live_YOUR_KEY
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Required | Public URL to an audio/video file (http or https). Supports Google Drive, Dropbox, and OneDrive sharing links. Max 2048 characters. |
| language | string | Optional | ISO 639-1 language code (e.g. "en", "es", "fr"). Omit for auto-detection. Warning: forcing a wrong language may produce a translation instead of a transcription. |
| diarize | boolean | Optional | Enable speaker diarization to identify who said what. Default: false. |
| speaker_count | integer | Optional | Expected number of speakers (1–50). Requires diarize to be true. Strongly recommended when you know the speaker count — providing it noticeably improves diarization accuracy. If omitted, the model auto-detects, which can over- or under-segment speakers. |
| align | boolean | Optional | Enable word-level timestamps with start/end times for every word, plus per-word confidence scores when the alignment model supports them. Default: true. Confidence is language-dependent — some languages return null. Note: when diarize is true, alignment is automatically enabled (required for speaker assignment), even if you pass align: false. |
| webhook_url | string | Optional | URL to receive completion/failure webhook. HTTPS recommended. |
Response
{"id": "b2c3d4e5-f6a7-8901-bcde-f12345678901","status": "created","message": "Transcription created. The file will be downloaded and processed. Poll GET /v1/transcribe/{id} for status updates."}
Code Examples
    import time

    import requests

    resp = requests.post(
        "https://api.scriptivox.com/v1/transcribe",
        headers={"Authorization": "sk_live_YOUR_KEY"},
        json={
            "url": "https://example.com/podcast-episode.mp3",
            "diarize": True,
            "language": "en",
        },
    )
    job = resp.json()

    # Poll for status until the job reaches a terminal state
    while True:
        result = requests.get(
            f"https://api.scriptivox.com/v1/transcribe/{job['id']}",
            headers={"Authorization": "sk_live_YOUR_KEY"},
        ).json()
        if result["status"] in ("completed", "failed"):
            break
        time.sleep(5)
From a file upload
Upload your own file when you need full control or don't have a public URL.
POST /v1/upload
Get a presigned URL to upload an audio or video file. The URL expires in 1 hour. Upload your file to the returned URL with a PUT request, then pass the upload_id to POST /v1/transcribe.
Authorization: sk_live_YOUR_KEY
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| filename | string | Required | Name of the file being uploaded (e.g. "meeting.mp3"). Maximum 255 characters. |
Response
{"upload_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890","upload_url": "https://storage.supabase.co/...","expires_in": 3600,"method": "PUT","headers": {"Content-Type": "audio/mpeg"}}
Code Examples
    import requests

    resp = requests.post(
        "https://api.scriptivox.com/v1/upload",
        headers={"Authorization": "sk_live_YOUR_KEY"},
        json={"filename": "meeting.mp3"},
    )
    upload = resp.json()

    # Upload the file bytes to the presigned URL using the returned headers
    with open("meeting.mp3", "rb") as f:
        requests.put(upload["upload_url"], headers=upload["headers"], data=f)
POST /v1/transcribe
Start a transcription from an uploaded file. Pass the upload_id from POST /v1/upload. The file is validated in the background. Poll GET /v1/transcribe/{id} for status updates. Duration and cost are determined after validation.
Authorization: sk_live_YOUR_KEY
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| upload_id | string | Required | The upload ID from POST /v1/upload. |
| language | string | Optional | ISO 639-1 language code (e.g. "en", "es", "fr"). Omit for auto-detection. Warning: forcing a wrong language may produce a translation instead of a transcription. |
| diarize | boolean | Optional | Enable speaker diarization to identify who said what. Default: false. |
| speaker_count | integer | Optional | Expected number of speakers (1–50). Requires diarize to be true. Strongly recommended when you know the speaker count — providing it noticeably improves diarization accuracy. If omitted, the model auto-detects, which can over- or under-segment speakers. |
| align | boolean | Optional | Enable word-level timestamps with start/end times for every word, plus per-word confidence scores when the alignment model supports them. Default: true. Confidence is language-dependent — some languages return null. Note: when diarize is true, alignment is automatically enabled (required for speaker assignment), even if you pass align: false. |
| webhook_url | string | Optional | URL to receive completion/failure webhook. HTTPS recommended. |
Response
{"id": "b2c3d4e5-f6a7-8901-bcde-f12345678901","status": "created","message": "Transcription created. The file will be validated and processed. Poll GET /v1/transcribe/{id} for status updates."}
Code Examples
    import time

    import requests

    resp = requests.post(
        "https://api.scriptivox.com/v1/transcribe",
        headers={"Authorization": "sk_live_YOUR_KEY"},
        json={
            "upload_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
            "diarize": True,
            "speaker_count": 2,
            "language": "en",
        },
    )
    job = resp.json()

    # Poll for status until the job reaches a terminal state
    while True:
        result = requests.get(
            f"https://api.scriptivox.com/v1/transcribe/{job['id']}",
            headers={"Authorization": "sk_live_YOUR_KEY"},
        ).json()
        if result["status"] in ("completed", "failed"):
            break
        time.sleep(5)
Get result
GET /v1/transcribe/{id}
Get the status and result of a transcription. Poll this endpoint until status is completed or failed, or use webhooks for real-time notifications. Pass ?format=srt|vtt|text to export the transcript directly as captions or plain text instead of JSON.
Authorization: sk_live_YOUR_KEY
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| id | string | Required | The transcription ID returned from POST /v1/transcribe (passed in the URL path) |
| format | string | Optional | Output format. One of: "json" (default, full structured response), "srt" (SubRip captions), "vtt" (WebVTT captions), "text" (plain text only). Requires status=completed for srt/vtt/text. Caption formats return text/plain (srt) or text/vtt (vtt) Content-Type. |
| max_words | integer | Optional | Caption segmentation: maximum words per cue. 1–50. Default 4. Only applies to format=srt|vtt|text. Lower = shorter, faster-changing captions; higher = longer cues. Whichever of max_words/max_chars/max_duration trips first ends the segment. |
| max_chars | integer | Optional | Caption segmentation: maximum characters per cue. 10–500. Default 80. Standard subtitle convention is ~37 chars/line × 2 lines = ~80. |
| max_duration | integer | Optional | Caption segmentation: maximum seconds per cue. 1–60. Default 10. Caps how long a single subtitle stays on screen. |
| sentence_aware | boolean | Optional | Caption segmentation: end a cue whenever a sentence-ending punctuation mark appears (. ! ?). Default true. Produces more natural caption breaks at the cost of slightly more variable cue lengths. |
| include_speakers | string | Optional | Speaker labels in captions. "true" prefixes every cue with the speaker label. "false" never includes labels. "auto" (default) includes labels only if the job has more than one distinct speaker. Only meaningful when the job was created with diarize=true. |
| strip_chars | string | Optional | Characters to remove from cue text before output. Up to 32 chars. Example: strip_chars=",." removes all commas and periods. Useful for cleaner captions or fitting tighter character limits. |
Response
{"id": "b2c3d4e5-f6a7-8901-bcde-f12345678901","status": "completed","audio_duration_seconds": 120,"file_size_bytes": 1920000,"language": "en","diarize": true,"speaker_count": 2,"align": true,"cost_cents": 0.5,"source_url": "https://example.com/podcast.mp3","progress": "Transcription completed successfully.","created_at": "2025-01-15T10:30:00Z","started_at": "2025-01-15T10:30:01Z","completed_at": "2025-01-15T10:30:45Z","result": {"full_transcript": "Hello, thanks for joining...","language": "en","duration_seconds": 120,"speakers": ["SPEAKER 1", "SPEAKER 2"],"utterances": [{"start": 0.5,"end": 3.2,"text": "Hello, thanks for joining the call today.","speaker": "SPEAKER 1","confidence": 0.95,"words": [{"word": "Hello,","start": 0.5,"end": 0.9,"confidence": 0.98,"speaker": "SPEAKER 1"}]}]}}
Code Examples
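    import requests

    # The "id" returned by POST /v1/transcribe
    transcription_id = "b2c3d4e5-f6a7-8901-bcde-f12345678901"

    resp = requests.get(
        f"https://api.scriptivox.com/v1/transcribe/{transcription_id}",
        headers={"Authorization": "sk_live_YOUR_KEY"},
    )
    result = resp.json()
    if result["status"] == "completed":
        print(result["result"]["full_transcript"])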
Status values
The status lifecycle is created → downloading (URL flow only) → processing → completed | failed.
| Status | Description |
|---|---|
| created | Job accepted. Download (URL flow) or validation (upload flow) is about to start. |
| downloading | URL flow only: the file is being fetched from the provided URL. |
| processing | The file has been validated and is being transcribed on a GPU worker. |
| completed | Transcription finished successfully. The result object is now populated. |
| failed | The job failed. Inspect error.code and error.message. |
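The lifecycle above lends itself to a small polling helper. The following is a minimal sketch, not an official client: `wait_for_job`, `fetch_job`, and the default interval and timeout are illustrative assumptions. `fetch_job` stands in for whatever function performs GET /v1/transcribe/{id} and returns the decoded JSON.

```python
import time
from typing import Callable, Dict

# Terminal states taken from the status table; everything else is in progress.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    fetch_job: Callable[[], Dict],
    poll_interval: float = 5.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the job reaches a terminal status or the timeout elapses.

    fetch_job is a hypothetical callable that performs
    GET /v1/transcribe/{id} and returns the decoded JSON body.
    """
    waited = 0.0
    while True:
        job = fetch_job()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if waited >= timeout:
            raise TimeoutError(f"job still {job['status']!r} after {waited:.0f}s")
        sleep(poll_interval)
        waited += poll_interval
```

Injecting `sleep` keeps the helper easy to test; in production the default `time.sleep` is used.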
Caption / text export
GET /v1/transcribe/{id}?format=srt|vtt|text returns the transcript in the requested format with the appropriate Content-Type:
| format | Content-Type | Body |
|---|---|---|
| json (default) | application/json | Full structured response with result.utterances[] |
| srt | text/plain; charset=utf-8 | SubRip Text |
| vtt | text/vtt; charset=utf-8 | WebVTT, with <v Speaker> voice tags when speaker labels are shown |
| text | text/plain; charset=utf-8 | Plain text. Diarized jobs get a Speaker: prefix per turn with blank lines between turns. |
srt, vtt, and text require status=completed; otherwise returns 400 INVALID_REQUEST. JSON format works at any status.
Segmentation controls
The caption formats (srt, vtt, and text) accept the same segmentation knobs as the dashboard's Advanced Export modal — pass them as query params alongside format:
| Param | Default | Range | Effect |
|---|---|---|---|
| max_words | 4 | 1–50 | Maximum words per cue. Lower = shorter, faster-changing captions. |
| max_chars | 80 | 10–500 | Maximum characters per cue. ~80 is the industry convention (~37 chars × 2 lines). |
| max_duration | 10 | 1–60 (seconds) | Maximum seconds a single cue stays on screen. |
| sentence_aware | true | true / false | End a cue when a sentence ends (. ! ?). Produces more natural breaks. |
| include_speakers | auto | true / false / auto | Whether to prefix each cue with the speaker label. auto includes them only when the job has more than one distinct speaker. |
| strip_chars | (empty) | up to 32 chars | Characters to remove from cue text. E.g. strip_chars=,. drops all commas and periods. |
Whichever of max_words, max_chars, max_duration is exceeded first ends the current cue. sentence_aware adds a sentence-ending punctuation rule on top of those.
Examples:
    # Short cues (TikTok-style: max 3 words, max 3 seconds each, no sentence-awareness)
    curl "https://api.scriptivox.com/v1/transcribe/{id}?format=srt&max_words=3&max_duration=3&sentence_aware=false" \
      -H "Authorization: sk_live_YOUR_KEY"

    # Standard SRT with default settings
    curl "https://api.scriptivox.com/v1/transcribe/{id}?format=srt" \
      -H "Authorization: sk_live_YOUR_KEY"

    # WebVTT, always show speaker tags, strip filler punctuation
    curl "https://api.scriptivox.com/v1/transcribe/{id}?format=vtt&include_speakers=true&strip_chars=,." \
      -H "Authorization: sk_live_YOUR_KEY"

    # Plain text without speaker prefixes
    curl "https://api.scriptivox.com/v1/transcribe/{id}?format=text&include_speakers=false" \
      -H "Authorization: sk_live_YOUR_KEY"
When the job was made with align: false (no per-word timestamps), the segmentation falls back to utterance-level — max_words/max_chars/max_duration still cap each cue but cuts can only happen at utterance boundaries. For best segmentation control, keep align on (the default).
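To build intuition for how the limits interact, the sketch below applies the documented rules to a local list of word objects. It is an illustration only, not the service's actual segmentation algorithm: `segment_words` is a hypothetical name, and the choice to close a cue just before the word that would exceed a limit is an assumption.

```python
from typing import Dict, List

SENTENCE_END = (".", "!", "?")


def segment_words(
    words: List[Dict],
    max_words: int = 4,
    max_chars: int = 80,
    max_duration: float = 10.0,
    sentence_aware: bool = True,
) -> List[Dict]:
    """Group word dicts ({"word", "start", "end"}) into caption cues."""
    cues: List[Dict] = []
    current: List[Dict] = []

    def flush() -> None:
        # Emit the words collected so far as one cue.
        if current:
            cues.append({
                "start": current[0]["start"],
                "end": current[-1]["end"],
                "text": " ".join(w["word"] for w in current),
            })
            current.clear()

    for w in words:
        if current:
            candidate = current + [w]
            text_len = len(" ".join(x["word"] for x in candidate))
            duration = w["end"] - current[0]["start"]
            # Close the cue if adding this word would break any limit.
            if len(candidate) > max_words or text_len > max_chars or duration > max_duration:
                flush()
        current.append(w)
        # Optionally also close the cue at sentence-ending punctuation.
        if sentence_aware and w["word"].endswith(SENTENCE_END):
            flush()
    flush()
    return cues
```

With `max_words=3`, the words "Hello, thanks for joining. Welcome" split into three cues when `sentence_aware` is on and two when it is off, because the sentence rule adds a break after "joining.".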
List transcriptions
GET /v1/transcriptions
List your transcriptions with optional filters and cursor-based pagination. Newest first by default. Soft-deleted transcriptions are excluded.
Authorization: sk_live_YOUR_KEY
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| status | string | Optional | Filter by status. One of: "created", "downloading", "pending", "processing", "completed", "failed". |
| from | string | Optional | ISO-8601 timestamp. Inclusive lower bound on `created_at`. |
| to | string | Optional | ISO-8601 timestamp. Exclusive upper bound on `created_at`. |
| limit | integer | Optional | 1–200. Default 50. |
| order | string | Optional | "desc" (default, newest first) or "asc". |
| cursor | string | Optional | Opaque cursor from the `next_cursor` field of a previous response. Use to fetch the next page. |
Response
{"items": [{"id": "b2c3d4e5-f6a7-8901-bcde-f12345678901","status": "completed","audio_duration_seconds": 120,"file_size_bytes": 1920000,"language": "en","diarize": true,"speaker_count": 2,"align": true,"cost_cents": 0.4,"created_at": "2026-05-18T10:30:00Z","started_at": "2026-05-18T10:30:01Z","completed_at": "2026-05-18T10:30:45Z","source_url": "https://example.com/podcast.mp3"}],"next_cursor": "eyJ0IjoiMjAyNi0wNS0xOFQxMDozMDowMFoiLCJpIjoiYjJjM2Q0ZTUtZjZhNy04OTAxLWJjZGUtZjEyMzQ1Njc4OTAxIn0="}
Code Examples
    import requests

    resp = requests.get(
        "https://api.scriptivox.com/v1/transcriptions",
        headers={"Authorization": "sk_live_YOUR_KEY"},
        params={"status": "completed", "limit": 20},
    )
    page = resp.json()
    for item in page["items"]:
        print(item["id"], item["status"])

    if page["next_cursor"]:
        # Fetch the next page
        resp = requests.get(
            "https://api.scriptivox.com/v1/transcriptions",
            headers={"Authorization": "sk_live_YOUR_KEY"},
            params={"cursor": page["next_cursor"]},
        )
Each item matches the shape of GET /v1/transcribe/{id} except that the heavy result object is omitted — fetch individual transcriptions for the full transcript. Pagination is stable across new inserts: pass the next_cursor value from the response into the cursor query param to get the next page. next_cursor is null when there are no more pages.
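The cursor loop described above can be wrapped in a generator so callers never handle cursors directly. This is a sketch under stated assumptions: `iter_transcriptions` and `fetch_page` are hypothetical names, and `fetch_page` stands in for a function that calls GET /v1/transcriptions with the given cursor (or no cursor for the first page) and returns the decoded JSON.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_transcriptions(
    fetch_page: Callable[[Optional[str]], Dict],
) -> Iterator[Dict]:
    """Yield every item across all pages.

    fetch_page is a hypothetical callable: given a cursor (or None for the
    first page) it performs GET /v1/transcriptions and returns the JSON body.
    """
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if not cursor:  # next_cursor is null on the last page
            break
```

In practice `fetch_page` might wrap `requests.get` and pass `params={"cursor": cursor}` only when a cursor is present.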
Cancel transcription
POST /v1/transcribe/{id}/cancel
Stop an in-flight transcription. Releases the reserved balance and fires the transcription.failed webhook (if configured) with error.code=CANCELLED. Idempotent: calling cancel on an already-cancelled job returns the same response.
Authorization: sk_live_YOUR_KEY
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| id | string | Required | The transcription ID (passed in the URL path). |
Response
{"id": "b2c3d4e5-f6a7-8901-bcde-f12345678901","status": "failed","error": {"code": "CANCELLED","message": "Cancelled by customer"},"released_cents": 0.5}
Code Examples
requests.post(f"https://api.scriptivox.com/v1/transcribe/{transcription_id}/cancel",headers={"Authorization": "sk_live_YOUR_KEY"})
Cancel is allowed only while the job is in created, downloading, pending, or processing state. Cancelling a completed or failed job returns 409 CONFLICT. Cancellation is best-effort against the GPU — the model may still finish briefly after, but its result is discarded and you are not charged.
Delete transcription
DELETE /v1/transcribe/{id}
Soft-delete a completed or failed transcription. Removes the stored transcript from our storage. The job record is kept for 7 days for audit, then hard-deleted. In-flight jobs cannot be deleted; cancel them first.
Authorization: sk_live_YOUR_KEY
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| id | string | Required | The transcription ID (passed in the URL path). |
Response
(204 No Content — empty body)
Code Examples
requests.delete(f"https://api.scriptivox.com/v1/transcribe/{transcription_id}",headers={"Authorization": "sk_live_YOUR_KEY"})
Returns 204 No Content on success. Idempotent — deleting an already-deleted transcription also returns 204. Trying to delete an in-flight transcription returns 409 CONFLICT with a message telling you to cancel first.
Balance
A non-zero balance is required to start an upload or submit a transcription — POST /v1/upload and POST /v1/transcribe return 402 ZERO_BALANCE when your balance is $0. The exact cost is reserved once the audio duration is known (after download/validation), not on submission.
GET /v1/balance
Returns your current account balance in cents, the amount reserved for in-progress transcriptions, the amount available for new jobs, and an estimate of remaining hours at the current per-hour price.
Authorization: sk_live_YOUR_KEY
Response
{"balance_cents": 1500,"reserved_cents": 100,"available_cents": 1400,"price_per_hour_cents": 20,"estimated_hours_available": 70.0,"deposit_url": "https://platform.scriptivox.com/billing","updated_at": "2025-01-15T10:30:00Z"}
Code Examples
    import requests

    resp = requests.get(
        "https://api.scriptivox.com/v1/balance",
        headers={"Authorization": "sk_live_YOUR_KEY"},
    )
    balance = resp.json()
    print(f"Available: ${balance['available_cents'] / 100:.2f}")
    print(f"Hours remaining: {balance['estimated_hours_available']:.1f}")
Error Codes
All errors follow the same format:
{"error": {"code": "ERROR_CODE","message": "Human-readable description"}}
Synchronous errors — returned immediately on the request
| HTTP | Code | Description |
|---|---|---|
| 400 | INVALID_REQUEST | Malformed request body or missing required fields |
| 400 | INVALID_FILENAME | Filename is missing, too long, contains invalid characters, or has no extension |
| 400 | INVALID_MEDIA_FORMAT | Unsupported file extension at upload time (e.g. .txt, .pdf). Also returned async when ffprobe rejects the actual file contents. |
| 400 | FILE_NOT_UPLOADED | File not found at the upload URL |
| 400 | FILE_TOO_LARGE | File exceeds 5 GB limit |
| 400 | UPLOAD_ALREADY_USED | Upload already used for a transcription |
| 400 | UPLOAD_EXPIRED | Upload URL expired (1 hour TTL) |
| 401 | INVALID_API_KEY | Invalid or missing API key |
| 401 | API_KEY_REVOKED | API key has been revoked |
| 402 | INSUFFICIENT_BALANCE | Not enough balance for this transcription |
| 402 | ZERO_BALANCE | Balance is $0 — deposit required |
| 404 | UPLOAD_NOT_FOUND | Upload ID does not exist |
| 404 | TRANSCRIPTION_NOT_FOUND | Transcription ID does not exist |
| 404 | NOT_FOUND | Path doesn't match any endpoint |
| 405 | METHOD_NOT_ALLOWED | The path exists but only accepts a different HTTP method. The response includes an Allow header naming the accepted method (e.g. Allow: GET if you POST to /v1/balance). |
| 409 | CONFLICT | Action not allowed in the current state (e.g. delete on an in-flight job, cancel on a completed one) |
| 409 | IDEMPOTENCY_KEY_LOCKED | Another request with the same Idempotency-Key is mid-flight. Retry after a few seconds (Retry-After header sent). |
| 415 | UNSUPPORTED_MEDIA_TYPE | Request body sent without Content-Type: application/json on a POST/PUT/PATCH. The response includes an Accept-Post: application/json header. Parameters are allowed (application/json; charset=utf-8 works). |
| 422 | IDEMPOTENCY_KEY_CONFLICT | Idempotency-Key reused with a different request body — see Idempotency |
| 429 | RATE_LIMIT_EXCEEDED | Too many requests — see Rate Limits |
| 500 | INTERNAL_ERROR | Server error |
Asynchronous errors — surface only on GET /v1/transcribe/{id}
POST /v1/transcribe accepts the job and returns {"status":"created"} even if the input will ultimately fail. The errors below only appear on the GET endpoint, with the top-level status set to "failed" and the failure reason in the error object. Your client must handle these on the poll path, not on submit.
| Code | When it appears |
|---|---|
| URL_NOT_ACCESSIBLE | URL flow only — the URL returned 4xx/5xx, didn't resolve, or the connection was refused. |
| DOWNLOAD_FAILED | URL flow only — the download started but was interrupted. |
| INVALID_MEDIA_FORMAT | The file downloaded but ffprobe rejected it: not actually audio/video, no audio track, or shorter than the 1-second minimum (message names the measured duration, e.g. "Audio too short (0.10s). Minimum duration is 1 second."). |
| DURATION_TOO_LONG | Audio exceeds the 10-hour limit (only known after probing duration). |
| PROCESSING_ERROR | The GPU job failed after all retries. |
| CREATED_TIMEOUT | Job sat in created for more than 30 minutes — validation step never started. |
| DOWNLOAD_TIMEOUT | Job sat in downloading for more than 15 minutes — file download stalled. |
| PROCESSING_TIMEOUT | Job sat in processing for more than 45 minutes — GPU never returned a result. |
| BILLING_ERROR | Internal accounting issue while finalizing the charge — your balance was not debited and the transcript was not delivered. Safe to retry. |
| INTERNAL_ERROR | A server-side step (e.g. queueing the job to our processing layer) failed after we accepted the request. Safe to retry. |
| CANCELLED | The transcription was cancelled by the customer via POST /v1/transcribe/{id}/cancel. The reserved balance was released; you are not charged. |
Failed transcriptions are free. The reserved balance is released back to your account, so a failure costs $0 regardless of how far the job got.
error.message is always a sanitized, customer-safe string — we never forward raw GPU/library stack traces or internal paths. If you need more detail than the message provides for a PROCESSING_ERROR, contact support with the transcription ID and we can look up the raw error on our side.
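Because these failures only surface on the poll path, it can help to reduce each polled body to an outcome the caller can act on. The sketch below is one possible shape, not part of the API: `summarize_job` and its return format are hypothetical, and treating only BILLING_ERROR and INTERNAL_ERROR as retryable is an assumption based on the "Safe to retry" notes in the table.

```python
from typing import Dict

# Codes the table above describes as "Safe to retry" (assumed to be the full set).
RETRYABLE_CODES = {"BILLING_ERROR", "INTERNAL_ERROR"}


def summarize_job(job: Dict) -> Dict:
    """Reduce a GET /v1/transcribe/{id} body to an outcome the caller can act on."""
    status = job["status"]
    if status == "completed":
        return {"outcome": "done", "retry": False}
    if status == "failed":
        error = job.get("error") or {}
        code = error.get("code", "UNKNOWN")
        return {
            "outcome": "failed",
            "code": code,
            "message": error.get("message", ""),
            "retry": code in RETRYABLE_CODES,
        }
    # created, downloading, processing: keep polling.
    return {"outcome": "in_progress", "retry": False}
```

Whether timeouts such as DOWNLOAD_TIMEOUT are worth resubmitting depends on the input, so they are left out of the retryable set here.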
Important notes
Language parameter behavior
Recommended: always pass language explicitly. Auto-detection works in most cases but has a small failure rate that's fully avoidable. Passing the actual language gives more accurate transcripts, faster turnaround (the model skips its detection pass), and protects you from the known edge cases below.
When you specify a language code, the model is forced to interpret the audio as that language. If the audio is actually in a different language, the model may translate rather than transcribe — for example, setting language: "es" on English audio can produce a Spanish translation of the speech. Omit language (or pass null) to let the model auto-detect.
Auto-detect picks a single dominant language for the whole file and is not perfect:
- Code-switched audio (e.g. English/Spanish in the same clip) typically gets transcribed as the dominant language, and segments in the other language may be dropped or mistranscribed.
- Hindi audio is sometimes routed to Urdu by the detector. If you know the language in advance, pass it explicitly.
- Short clips (under 30s) give the detector less signal and are more likely to mis-route.
- Music or background noise during the first few seconds can throw the detector off.
If you know the language up front — even probabilistically — pass it. The accuracy cost of being wrong is roughly the same as the cost of auto-detect picking the wrong language, but passing it is faster and works on the edge cases above.
Silence and very short clips
Whisper-based models are prone to a known hallucination on near-silent or sub-second audio, often producing a phantom "Thank you." or similar filler. If your pipeline can produce silent or extremely short clips, filter them on your side before submitting.
Size, duration, and retention limits
- Max file size: 5 GB. Anything larger is rejected with `400 FILE_TOO_LARGE`. Very large uploads can also be rejected by the network layer with an HTML `413` page before they reach the API, so your client should handle non-JSON error bodies gracefully.
- Max duration: 10 hours. Files probing longer than this fail with `DURATION_TOO_LONG` (asynchronous; you'll see it on the GET endpoint after probing completes).
- Min duration: 1 second. Files shorter than 1 second fail asynchronously with `INVALID_MEDIA_FORMAT` and a message naming the actual measured duration. (Sub-second clips have too little speech signal to transcribe reliably.)
- Request body Content-Type: every `POST`/`PUT`/`PATCH` that carries a JSON body must send `Content-Type: application/json` (parameters allowed, e.g. `application/json; charset=utf-8`). Anything else is rejected at the gateway with `415 UNSUPPORTED_MEDIA_TYPE` and an `Accept-Post: application/json` header.
- Presigned URL TTL: 1 hour. Uploads must `PUT` to the URL within `expires_in` seconds of receiving it; after that the URL returns `400 UPLOAD_EXPIRED`.
- Audio retention: uploaded source audio is retained for 24 hours after the job ends, then deleted. Transcripts themselves remain available via `GET /v1/transcribe/{id}`.
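Since an oversized upload may come back as an HTML 413 page rather than the JSON error envelope, error handling should not assume JSON. A minimal sketch follows; `parse_error` and the synthetic `HTTP_<status>` code are illustrative assumptions, not values the API returns.

```python
import json
from typing import Dict


def parse_error(status_code: int, content_type: str, body: str) -> Dict:
    """Return {"code", "message"} for an error response, JSON or not."""
    if "application/json" in content_type.lower():
        try:
            error = json.loads(body).get("error", {})
            return {
                "code": error.get("code", "UNKNOWN"),
                "message": error.get("message", ""),
            }
        except (ValueError, AttributeError):
            pass  # malformed or unexpected JSON: fall through to generic handling
    # Non-JSON body, e.g. an HTML 413 page from the network layer.
    return {"code": f"HTTP_{status_code}", "message": body[:200].strip()}
```

The helper takes plain values rather than a response object so it works with any HTTP client.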
Filename rules
The filename you pass to POST /v1/upload must satisfy all of the following:
- ASCII only (rename files with accents, CJK, or emoji before upload — Supabase Storage rejects non-ASCII object keys).
- No path separators: `/` and `\` are rejected.
- No `<`, `>`, `"`, `'`, or backtick characters: rejected to prevent injection into dashboards that render filenames unescaped.
- Must include a supported extension (e.g. `.mp3`, `.wav`, `.mp4`). See Supported Formats for the full list.
- 255 characters or fewer.
Violations return 400 INVALID_FILENAME (or 400 INVALID_MEDIA_FORMAT if the extension is present but unsupported).
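These rules are simple enough to check on the client before calling POST /v1/upload. The sketch below is a hypothetical pre-flight check, not the server's validator: `filename_problem` is an invented name, the extension set contains only the three examples named above, and case-insensitive extension matching is an assumption.

```python
import os
from typing import Optional

# Illustrative subset only; the full list lives in Supported Formats.
EXAMPLE_EXTENSIONS = {".mp3", ".wav", ".mp4"}
FORBIDDEN_CHARS = set("/\\<>\"'`")


def filename_problem(filename: str) -> Optional[str]:
    """Return a description of the first rule the filename breaks, or None."""
    if not filename or len(filename) > 255:
        return "must be 1-255 characters"
    if not filename.isascii():
        return "must be ASCII only"
    if any(ch in FORBIDDEN_CHARS for ch in filename):
        return "contains a forbidden character"
    # Assumption: extensions are compared case-insensitively.
    extension = os.path.splitext(filename)[1].lower()
    if not extension:
        return "missing file extension"
    if extension not in EXAMPLE_EXTENSIONS:
        return "unsupported extension"
    return None
```

A local check like this only saves a round trip; the API response remains the source of truth.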
audio_duration_seconds is an integer
audio_duration_seconds in GET /v1/transcribe/{id} is a whole number of seconds, rounded down (so an 11.7-second clip returns 11). cost_cents, by contrast, is fractional and exact (e.g. 0.061111 for an 11-second clip at $0.20/hour); your account balance is debited against the precise value, not a rounded one.
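The cost figure in the example above is consistent with simple linear pricing: seconds times the hourly price, divided by 3600. The sketch below reproduces that arithmetic. The function names are hypothetical, and linear per-second billing is inferred from the example rather than stated by the API.

```python
def cost_cents(duration_seconds: float, price_per_hour_cents: float) -> float:
    """Cost of a clip, in cents, at a flat per-hour price."""
    return duration_seconds * price_per_hour_cents / 3600


def hours_available(available_cents: float, price_per_hour_cents: float) -> float:
    """How many hours of audio a balance covers at the given price."""
    return available_cents / price_per_hour_cents
```

For instance, `cost_cents(11, 20)` evaluates to roughly 0.061111, matching the 11-second example at $0.20/hour.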
speaker_count is a soft prior, not a hard ceiling
When you set speaker_count: N with diarize: true, you're giving the diarizer a hint about how many speakers to expect — not a hard cap. The result may include slightly more or fewer speaker labels than N (e.g. requesting 5 may produce 6). Values outside 1–50, or speaker_count without diarize: true, are rejected with 400 INVALID_REQUEST.
Recommended: whenever you know how many speakers are on the recording, pass speaker_count. It noticeably improves diarization accuracy — the model uses it as a prior instead of guessing, which cuts down on over-segmentation (one speaker split across two labels) and under-segmentation (two speakers merged into one).
Per-word leading whitespace
Word objects in align: true output have whitespace pre-stripped (e.g. {"word":"The"}, not {"word":" The"}). Reconstruct sentence text from utterance.text if you need exact spacing.
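As a small illustration of that advice, the helper below prefers the utterance's own text and only falls back to joining word tokens. `utterance_text` is a hypothetical name, and the single-space join is an approximation that will be wrong for languages written without spaces.

```python
from typing import Dict


def utterance_text(utterance: Dict) -> str:
    """Prefer the utterance's own text; fall back to joining its words."""
    text = utterance.get("text")
    if text:
        return text
    # Word tokens carry no leading whitespace, so a single-space join is
    # only an approximation of the original spacing.
    return " ".join(w["word"] for w in utterance.get("words", []))
```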
diarize forces align
diarize requires word-level alignment to map speakers onto each word, so any request with diarize: true is processed as if align: true — even when the caller explicitly passes align: false. The stored job and webhook payload reflect the effective value (align: true), and word objects are present on the result. If you don't want alignment data, leave diarize off.
Idempotency
Both POST /v1/upload and POST /v1/transcribe accept an optional Idempotency-Key HTTP header. When present, we cache the response for 24 hours and replay it byte-for-byte on subsequent requests with the same key and body. This protects you against double-billed transcriptions if your network drops the response and your client retries.
curl -X POST https://api.scriptivox.com/v1/transcribe \
  -H "Authorization: sk_live_YOUR_KEY" \
  -H "Idempotency-Key: 7c8f5b3a-1234-4d56-90ab-cdef01234567" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/audio.mp3"}'
Rules:
- Key is opaque — we don't parse it. Generate one per logical operation (a UUID is conventional). Reuse the same key when retrying after a failure.
- 1–255 printable ASCII characters.
- Same key + same body within 24h → cached response is replayed. The replayed response carries an Idempotent-Replay: true header so you can tell it was cached.
- Same key + different body → 422 IDEMPOTENCY_KEY_CONFLICT. This is a bug in your retry logic; use a fresh key for a different operation.
- Same key while a previous request is still in progress → 409 IDEMPOTENCY_KEY_LOCKED with a Retry-After header. Sleep briefly and retry; the cached response will be available once the first request finishes.
- No header sent → endpoint behaves normally (no caching).
The replay cache covers 200 success responses only. 4xx and 5xx errors are not cached, so you can retry freely.
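The rules above translate into a small retry loop. The sketch below is transport-agnostic: `send` is a hypothetical callable you supply (wrapping your HTTP client) that takes extra headers and a JSON body and returns `(status, headers, payload)`. The backoff values are assumptions, not API requirements.

```python
import time
import uuid
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str], dict]  # (status, headers, parsed JSON body)


def submit_once(send: Callable[[Dict[str, str], dict], Response],
                body: dict,
                max_attempts: int = 5,
                sleep: Callable[[float], None] = time.sleep) -> Response:
    """Submit `body`, reusing one Idempotency-Key across every retry."""
    # One key per logical operation; a UUID is the conventional choice.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    for attempt in range(max_attempts):
        try:
            status, resp_headers, payload = send(headers, body)
        except ConnectionError:
            # The response may have been lost: retry with the SAME key and body.
            sleep(min(2 ** attempt, 30))
            continue
        if status == 409:
            # IDEMPOTENCY_KEY_LOCKED: the first request is still in progress.
            sleep(float(resp_headers.get("Retry-After", 1)))
            continue
        # 200 (fresh or replayed) and every other status go back to the caller.
        return status, resp_headers, payload
    raise RuntimeError("gave up after %d attempts" % max_attempts)
```

A 422 IDEMPOTENCY_KEY_CONFLICT is returned to the caller rather than retried, since retrying cannot fix a key that was reused with a different body. Check for `Idempotent-Replay: true` in the returned headers if you need to know whether the result came from the cache.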
Unknown fields are rejected
Request bodies on POST /v1/upload and POST /v1/transcribe are strict: any top-level field that isn't in the documented parameter list returns 400 INVALID_REQUEST with the offending key name. This is to surface typos ({"dirize": true}) immediately instead of silently ignoring them. Nested objects (e.g. inside a future metadata field) are not currently inspected.
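You can surface the same typos before sending. The sketch below is a hypothetical client-side check; its allowed set lists only the /v1/transcribe parameters shown in the request-body table (url, language, diarize, speaker_count, align) and should be extended to match the full documented list for each endpoint.

```python
import difflib
from typing import Dict, Iterable, List

# Assumption: only the parameters visible in the request-body table are listed here.
TRANSCRIBE_FIELDS = {"url", "language", "diarize", "speaker_count", "align"}


def unknown_fields(body: dict, allowed: Iterable[str] = TRANSCRIBE_FIELDS) -> Dict[str, List[str]]:
    """Map each unrecognized top-level key to its closest documented name, if any."""
    names = sorted(allowed)
    return {
        key: difflib.get_close_matches(key, names, n=1)
        for key in body
        if key not in names
    }
```

With the typo from the paragraph above, `unknown_fields({"url": "https://example.com/audio.mp3", "dirize": True})` flags `dirize` and suggests `diarize`. Like the server, it inspects top-level keys only.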
Response shape changes with diarize and align
The GET /v1/transcribe/{id} example above shows the response with diarize: true and align: true (the maximal case). When you turn flags off, some result fields become null or empty:
| Field | diarize: false | align: false |
|---|---|---|
| result.speakers | null | unchanged |
| result.utterances[].speaker | null | unchanged |
| result.utterances[].words | unchanged | [] (empty array) |
| result.utterances[].words[].speaker | key absent | n/a (no words) |
Top-level metadata (status, cost_cents, audio_duration_seconds, etc.) is identical regardless of flags. result.utterances[].confidence and result.utterances[].words[].confidence may be null — confidence scores depend on the alignment model used for the detected language, and not every language is covered. Always handle null defensively (e.g. skip filtering by confidence rather than dropping the word).
Speaker labels are "SPEAKER 1", "SPEAKER 2", … (space, 1-indexed) — not "SPEAKER_00". source_url is only present on URL-flow transcriptions; upload-flow jobs omit it.
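A consumer that tolerates every flag combination might look like the sketch below. It uses only the field names mentioned in this section; the 0.5 threshold and the "UNKNOWN" placeholder are arbitrary choices for illustration, and both function names are hypothetical.

```python
from typing import List


def flag_low_confidence(result: dict, threshold: float = 0.5) -> List[str]:
    """Return words scored below `threshold`; words without a score are never flagged."""
    flagged = []
    for utterance in result.get("utterances") or []:
        for word in utterance.get("words") or []:   # [] when align is false
            score = word.get("confidence")          # None for some languages
            if score is not None and score < threshold:
                flagged.append(word["word"])
    return flagged


def speaker_label(utterance: dict) -> str:
    """Speaker for an utterance, or a placeholder when diarize was off (speaker is null)."""
    return utterance.get("speaker") or "UNKNOWN"
```

Using `.get` for speaker covers both the null case on utterances and the absent key on word objects.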
Rate Limits
Limits are enforced per API key per endpoint, plus a per-IP cap across all endpoints. Exceeding any limit returns 429 RATE_LIMIT_EXCEEDED with a Retry-After header.
| Scope | Limit | Notes |
|---|---|---|
| POST /v1/upload (per key) | 60/min | Presigned URL generation |
| POST /v1/transcribe (per key) | 60/min | Job submission |
| GET /v1/transcribe/{id} (per key) | 200/min | Higher limit for polling |
| GET /v1/transcriptions (per key) | 60/min | List your jobs |
| POST /v1/transcribe/{id}/cancel (per key) | 30/min | Cancel in-flight |
| DELETE /v1/transcribe/{id} (per key) | 30/min | Soft-delete |
| GET /v1/balance (per key) | 100/min | Balance checks |
| Per source IP (across all endpoints) | 300/min | Edge-level cap to prevent abuse |
Limits use a sliding 60-second window, not a fixed-window counter: a short burst faster than the average rate (for example, 30 requests in one second against a 60/min limit) is tolerated as long as the rolling 60-second total stays within the limit. Plan against the steady-state number, not the burst.
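To stay under a limit without relying on 429s, you can keep your own rolling window on the client. The sketch below is one common way to do that (a timestamp log); it is an approximation of the behavior described above, not the server's implementation, and `SlidingWindowLimiter` is a hypothetical class.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side rolling-window counter for one endpoint."""

    def __init__(self, limit: int, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.sent = deque()  # timestamps of requests still inside the window

    def wait_time(self) -> float:
        """Seconds until another request fits in the window (0.0 if it fits now)."""
        now = self.clock()
        # Drop timestamps that have aged out of the rolling window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.limit:
            return 0.0
        # Full: a slot frees up when the oldest request leaves the window.
        return self.window - (now - self.sent[0])

    def record(self) -> None:
        """Call once for every request actually sent."""
        self.sent.append(self.clock())
```

Create one limiter per endpoint (for example `SlidingWindowLimiter(200)` for polling), sleep for `wait_time()` before each call, then `record()` it. Clock skew and requests made by other processes sharing the key are not accounted for, so keep handling 429 as well.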
Rate limit headers
Every response includes:
| Header | Description |
|---|---|
| X-RateLimit-Limit | Maximum requests allowed per minute for this endpoint. |
| X-RateLimit-Remaining | Requests remaining in the current rolling window. |
| X-RateLimit-Reset | Unix timestamp when the window fully resets. |
| Retry-After | Seconds to wait before retrying. Only present on 429 responses. |
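These headers are enough to decide how long to pause between calls. A minimal sketch, assuming `headers` is a mapping keyed by the exact header names above (most HTTP clients offer a case-insensitive equivalent); `pause_seconds` is a hypothetical helper.

```python
import time
from typing import Mapping, Optional


def pause_seconds(status: int, headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to wait before the next request to this endpoint."""
    if status == 429 and "Retry-After" in headers:
        return float(headers["Retry-After"])
    reset = headers.get("X-RateLimit-Reset")
    if headers.get("X-RateLimit-Remaining") == "0" and reset is not None:
        now = time.time() if now is None else now
        # Conservative: wait for the full reset even though, with a sliding
        # window, a slot may free up earlier.
        return max(0.0, float(reset) - now)
    return 0.0
```

Call it after every response and sleep for the returned value. It returns 0.0 whenever there is budget left.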
Supported Formats
25 container/codec combinations are accepted (10 audio + 15 video). Maximum file size is 5 GB; maximum duration is 10 hours. Unrecognized extensions return 400 INVALID_MEDIA_FORMAT.
Audio (10)
| Extension | Format |
|---|---|
| .mp3 | MPEG Audio |
| .wav | Waveform Audio |
| .m4a | MPEG-4 Audio |
| .aac | Advanced Audio Coding |
| .ogg | Ogg Vorbis |
| .flac | Free Lossless Audio |
| .opus | Opus |
| .wma | Windows Media Audio |
| .aiff | Audio Interchange |
| .caf | Core Audio Format |
Video (15)
| Extension | Format |
|---|---|
| .mp4 | MPEG-4 Video |
| .mov | QuickTime |
| .avi | Audio Video Interleave |
| .mkv | Matroska Video |
| .webm | WebM |
| .wmv | Windows Media Video |
| .flv | Flash Video |
| .m4v | MPEG-4 Video (iTunes) |
| .3gp | 3GPP |
| .mpeg | MPEG Video |
| .mts | AVCHD |
| .ogv | Ogg Video |
| .ts | MPEG Transport Stream |
| .vob | DVD Video Object |
| .f4v | Flash MP4 Video |
Supported Languages
119 languages are supported. Pass the ISO 639-1 (or BCP-47 fallback, e.g. yue, kea) language code below in the language parameter.
We recommend always passing a language explicitly. Omitting the parameter (or passing null) triggers auto-detection, which works for most inputs but has a small chance of picking the wrong language — especially on short clips, code-switched audio, or files that start with music or background noise. See Language parameter behavior for details.
Invalid codes return 400 INVALID_REQUEST.
| Language | Code |
|---|---|
| Afrikaans | af |
| Albanian | sq |
| Amharic | am |
| Arabic | ar |
| Armenian | hy |
| Assamese | as |
| Asturian | ast |
| Azerbaijani | az |
| Bashkir | ba |
| Basque | eu |
| Belarusian | be |
| Bengali | bn |
| Bosnian | bs |
| Breton | br |
| Bulgarian | bg |
| Cantonese | yue |
| Cape Verdean Creole | kea |
| Catalan | ca |
| Cebuano | ceb |
| Chichewa | ny |
| Chinese | zh |
| Croatian | hr |
| Czech | cs |
| Danish | da |
| Dutch | nl |
| English | en |
| Estonian | et |
| Faroese | fo |
| Finnish | fi |
| French | fr |
| Fula | ff |
| Galician | gl |
| Georgian | ka |
| German | de |
| Greek | el |
| Gujarati | gu |
| Haitian Creole | ht |
| Hausa | ha |
| Hawaiian | haw |
| Hebrew | he |
| Hindi | hi |
| Hungarian | hu |
| Icelandic | is |
| Igbo | ig |
| Indonesian | id |
| Irish | ga |
| Italian | it |
| Japanese | ja |
| Javanese | jw |
| Kamba | kam |
| Kannada | kn |
| Kazakh | kk |
| Khmer | km |
| Korean | ko |
| Kyrgyz | ky |
| Lao | lo |
| Latin | la |
| Latvian | lv |
| Lingala | ln |
| Lithuanian | lt |
| Luganda | lg |
| Luo | luo |
| Luxembourgish | lb |
| Macedonian | mk |
| Malagasy | mg |
| Malay | ms |
| Malayalam | ml |
| Maltese | mt |
| Maori | mi |
| Marathi | mr |
| Mongolian | mn |
| Myanmar | my |
| Nepali | ne |
| Northern Sotho | nso |
| Norwegian | no |
| Nynorsk | nn |
| Occitan | oc |
| Odia | or |
| Oromo | om |
| Pashto | ps |
| Persian | fa |
| Polish | pl |
| Portuguese | pt |
| Punjabi | pa |
| Romanian | ro |
| Russian | ru |
| Sanskrit | sa |
| Serbian | sr |
| Shona | sn |
| Sindhi | sd |
| Sinhala | si |
| Slovak | sk |
| Slovenian | sl |
| Somali | so |
| Sorani Kurdish | ckb |
| Spanish | es |
| Sundanese | su |
| Swahili | sw |
| Swedish | sv |
| Tagalog | tl |
| Tajik | tg |
| Tamil | ta |
| Tatar | tt |
| Telugu | te |
| Thai | th |
| Tibetan | bo |
| Turkish | tr |
| Turkmen | tk |
| Ukrainian | uk |
| Umbundu | umb |
| Urdu | ur |
| Uzbek | uz |
| Vietnamese | vi |
| Welsh | cy |
| Wolof | wo |
| Xhosa | xh |
| Yiddish | yi |
| Yoruba | yo |
| Zulu | zu |