API Reference

Complete reference for the Scriptivox transcription API. All endpoints require an API key, passed in one of:

Authorization: sk_live_… (the Bearer prefix is accepted but not required)
X-Api-Key: sk_live_…

Header names are case-insensitive. Each account can have at most 5 active API keys at a time — revoke an unused key in the dashboard before creating a new one if you hit this ceiling.

Base URL: https://api.scriptivox.com/v1

Service status: real-time uptime and incident history are available at status.scriptivox.com.

Transcribe

Send audio for transcription. You can either pass a URL (we download it) or upload your own file.

From a URL

The simplest path — one POST request. We download the file, validate it, and start transcription automatically. Supports Google Drive, Dropbox, and OneDrive sharing links.

POST /v1/transcribe

Start a transcription from a public URL. The file is downloaded and validated in the background. Poll GET /v1/transcribe/{id} for status updates. Duration and cost are determined after download.

Authentication: Authorization: sk_live_YOUR_KEY

Request Body

Parameter	Type	Required	Description
url	string	Required	Public URL to an audio/video file (http or https). Supports Google Drive, Dropbox, and OneDrive sharing links. Max 2048 characters.
language	string	Optional	ISO 639-1 language code (e.g. "en", "es", "fr"). Omit for auto-detection. Warning: forcing a wrong language may produce a translation instead of a transcription.
diarize	boolean	Optional	Enable speaker diarization to identify who said what. Default: false.
speaker_count	integer	Optional	Expected number of speakers (1–50). Requires diarize to be true. Strongly recommended when you know the speaker count — providing it noticeably improves diarization accuracy. If omitted, the model auto-detects, which can over- or under-segment speakers.
align	boolean	Optional	Enable word-level timestamps with start/end times for every word, plus per-word confidence scores when the alignment model supports them. Default: true. Confidence is language-dependent — some languages return null. Note: when diarize is true, alignment is automatically enabled (required for speaker assignment), even if you pass align: false.
webhook_url	string	Optional	URL to receive completion/failure webhook. HTTPS recommended.

Response

json

{
  "id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
  "status": "created",
  "message": "Transcription created. The file will be downloaded and processed. Poll GET /v1/transcribe/{id} for status updates."
}

Code Examples

import time

import requests

resp = requests.post("https://api.scriptivox.com/v1/transcribe",
    headers={"Authorization": "sk_live_YOUR_KEY"},
    json={
        "url": "https://example.com/podcast-episode.mp3",
        "diarize": True,
        "language": "en"
    })
resp.raise_for_status()  # surface synchronous errors (4xx/5xx) right away
job = resp.json()

# Poll until the job reaches a terminal status
while True:
    result = requests.get(
        f"https://api.scriptivox.com/v1/transcribe/{job['id']}",
        headers={"Authorization": "sk_live_YOUR_KEY"}).json()
    if result["status"] in ("completed", "failed"):
        break
    time.sleep(5)

From a file upload

Upload your own file when you need full control or don't have a public URL.

POST /v1/upload

Get a presigned URL to upload an audio or video file. The URL expires in 1 hour. Upload your file to the returned URL with a PUT request, then pass the upload_id to POST /v1/transcribe.

Authentication: Authorization: sk_live_YOUR_KEY

Request Body

Parameter	Type	Required	Description
filename	string	Required	Name of the file being uploaded (e.g. "meeting.mp3"). Maximum 255 characters.

Response

json

{
  "upload_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "upload_url": "https://storage.supabase.co/...",
  "expires_in": 3600,
  "method": "PUT",
  "headers": {
    "Content-Type": "audio/mpeg"
  }
}

Code Examples

import requests

resp = requests.post("https://api.scriptivox.com/v1/upload",
    headers={"Authorization": "sk_live_YOUR_KEY"},
    json={"filename": "meeting.mp3"})
upload = resp.json()

with open("meeting.mp3", "rb") as f:
    requests.put(upload["upload_url"],
        headers=upload["headers"], data=f)

POST /v1/transcribe

Start a transcription from an uploaded file. Pass the upload_id from POST /v1/upload. The file is validated in the background. Poll GET /v1/transcribe/{id} for status updates. Duration and cost are determined after validation.

Authentication: Authorization: sk_live_YOUR_KEY

Request Body

Parameter	Type	Required	Description
upload_id	string	Required	The upload ID from POST /v1/upload.
language	string	Optional	ISO 639-1 language code (e.g. "en", "es", "fr"). Omit for auto-detection. Warning: forcing a wrong language may produce a translation instead of a transcription.
diarize	boolean	Optional	Enable speaker diarization to identify who said what. Default: false.
speaker_count	integer	Optional	Expected number of speakers (1–50). Requires diarize to be true. Strongly recommended when you know the speaker count — providing it noticeably improves diarization accuracy. If omitted, the model auto-detects, which can over- or under-segment speakers.
align	boolean	Optional	Enable word-level timestamps with start/end times for every word, plus per-word confidence scores when the alignment model supports them. Default: true. Confidence is language-dependent — some languages return null. Note: when diarize is true, alignment is automatically enabled (required for speaker assignment), even if you pass align: false.
webhook_url	string	Optional	URL to receive completion/failure webhook. HTTPS recommended.

Response

json

{
  "id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
  "status": "created",
  "message": "Transcription created. The file will be validated and processed. Poll GET /v1/transcribe/{id} for status updates."
}

Code Examples

import time

import requests

resp = requests.post("https://api.scriptivox.com/v1/transcribe",
    headers={"Authorization": "sk_live_YOUR_KEY"},
    json={
        "upload_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
        "diarize": True,
        "speaker_count": 2,
        "language": "en"
    })
resp.raise_for_status()  # surface synchronous errors (4xx/5xx) right away
job = resp.json()

# Poll until the job reaches a terminal status
while True:
    result = requests.get(
        f"https://api.scriptivox.com/v1/transcribe/{job['id']}",
        headers={"Authorization": "sk_live_YOUR_KEY"}).json()
    if result["status"] in ("completed", "failed"):
        break
    time.sleep(5)

Get result

GET /v1/transcribe/{id}

Get the status and result of a transcription. Poll this endpoint until status is completed or failed, or use webhooks for real-time notifications. Pass ?format=srt|vtt|text to export the transcript directly as captions or plain text instead of JSON.

Authentication: Authorization: sk_live_YOUR_KEY

Path and query parameters

Parameter	Type	Required	Description
id	string	Required	The transcription ID returned from POST /v1/transcribe (passed in the URL path)
format	string	Optional	Output format. One of: "json" (default, full structured response), "srt" (SubRip captions), "vtt" (WebVTT captions), "text" (plain text only). Requires status=completed for srt/vtt/text. Caption formats return text/plain (srt) or text/vtt (vtt) Content-Type.
max_words	integer	Optional	Caption segmentation: maximum words per cue. 1–50. Default 4. Only applies to format=srt|vtt|text. Lower = shorter, faster-changing captions; higher = longer cues. Whichever of max_words/max_chars/max_duration trips first ends the segment.
max_chars	integer	Optional	Caption segmentation: maximum characters per cue. 10–500. Default 80. Standard subtitle convention is ~37 chars/line × 2 lines = ~80.
max_duration	integer	Optional	Caption segmentation: maximum seconds per cue. 1–60. Default 10. Caps how long a single subtitle stays on screen.
sentence_aware	boolean	Optional	Caption segmentation: end a cue whenever a sentence-ending punctuation mark appears (. ! ?). Default true. Produces more natural caption breaks at the cost of slightly more variable cue lengths.
include_speakers	string	Optional	Speaker labels in captions. "true" prefixes every cue with the speaker label. "false" never includes labels. "auto" (default) includes labels only if the job has more than one distinct speaker. Only meaningful when the job was created with diarize=true.
strip_chars	string	Optional	Characters to remove from cue text before output. Up to 32 chars. Example: strip_chars=",." removes all commas and periods. Useful for cleaner captions or fitting tighter character limits.

Response

json

{
  "id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
  "status": "completed",
  "audio_duration_seconds": 120,
  "file_size_bytes": 1920000,
  "language": "en",
  "diarize": true,
  "speaker_count": 2,
  "align": true,
  "cost_cents": 0.5,
  "source_url": "https://example.com/podcast.mp3",
  "progress": "Transcription completed successfully.",
  "created_at": "2025-01-15T10:30:00Z",
  "started_at": "2025-01-15T10:30:01Z",
  "completed_at": "2025-01-15T10:30:45Z",
  "result": {
    "full_transcript": "Hello, thanks for joining...",
    "language": "en",
    "duration_seconds": 120,
    "speakers": ["SPEAKER 1", "SPEAKER 2"],
    "utterances": [
      {
        "start": 0.5,
        "end": 3.2,
        "text": "Hello, thanks for joining the call today.",
        "speaker": "SPEAKER 1",
        "confidence": 0.95,
        "words": [
          {
            "word": "Hello,",
            "start": 0.5,
            "end": 0.9,
            "confidence": 0.98,
            "speaker": "SPEAKER 1"
          }
        ]
      }
    ]
  }
}

Code Examples

resp = requests.get(
    f"https://api.scriptivox.com/v1/transcribe/{job['id']}",
    headers={"Authorization": "sk_live_YOUR_KEY"})
result = resp.json()

if result["status"] == "completed":
    print(result["result"]["full_transcript"])

Status values

The status lifecycle is created → downloading (URL flow only) → processing → completed | failed.

Status	Description
`created`	Job accepted. Download (URL flow) or validation (upload flow) is about to start.
`downloading`	URL flow only — the file is being fetched from the provided URL.
`processing`	The file has been validated and is being transcribed on a GPU worker.
`completed`	Transcription finished successfully. The `result` object is now populated.
`failed`	The job failed. Inspect `error.code` and `error.message`.
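
The lifecycle above can be wrapped in a small polling helper with a deadline, so a client does not loop forever if a job stalls. This is a minimal sketch, not an official client: `wait_for_job`, `fetch_job_http`, and the default interval and timeout are illustrative names and values that do not appear in this reference. The fetch function is passed in so the loop can be exercised without network access.

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(fetch_job, job_id, interval=5.0, timeout=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_job(job_id) until the job reaches a terminal status.

    fetch_job is any callable returning the parsed JSON body of
    GET /v1/transcribe/{id}.
    """
    deadline = clock() + timeout
    while True:
        job = fetch_job(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(
                f"job {job_id} still {job['status']!r} after {timeout}s")
        sleep(interval)


def fetch_job_http(job_id, api_key="sk_live_YOUR_KEY"):
    """Fetch one job over HTTP (needs the third-party requests package)."""
    import requests  # imported lazily so wait_for_job has no hard dependency

    resp = requests.get(
        f"https://api.scriptivox.com/v1/transcribe/{job_id}",
        headers={"Authorization": api_key},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```

A typical call would be `wait_for_job(fetch_job_http, job["id"])`, after which the caller inspects `status` and, on failure, the `error` object.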

Caption / text export

GET /v1/transcribe/{id}?format=srt|vtt|text returns the transcript in the requested format with the appropriate Content-Type:

`format`	Content-Type	Body
`json` (default)	`application/json`	Full structured response with `result.utterances[]`
`srt`	`text/plain; charset=utf-8`	SubRip Text
`vtt`	`text/vtt; charset=utf-8`	WebVTT, with `<v Speaker>` voice tags when speaker labels are shown
`text`	`text/plain; charset=utf-8`	Plain text. Diarized jobs get a `Speaker:` prefix per turn with blank lines between turns.

srt, vtt, and text require status=completed; any other status returns 400 INVALID_REQUEST. The json format works at any status.

Segmentation controls

The caption formats (srt, vtt, and text) accept the same segmentation knobs as the dashboard's Advanced Export modal — pass them as query params alongside format:

Param	Default	Range	Effect
`max_words`	4	1–50	Maximum words per cue. Lower = shorter, faster-changing captions.
`max_chars`	80	10–500	Maximum characters per cue. ~80 is the industry convention (~37 chars × 2 lines).
`max_duration`	10	1–60 (seconds)	Maximum seconds a single cue stays on screen.
`sentence_aware`	`true`	`true` / `false`	End a cue when a sentence ends (`. ! ?`). Produces more natural breaks.
`include_speakers`	`auto`	`true` / `false` / `auto`	Whether to prefix each cue with the speaker label. `auto` includes them only when the job has more than one distinct speaker.
`strip_chars`	(empty)	up to 32 chars	Characters to remove from cue text. E.g. `strip_chars=,.` drops all commas and periods.

Whichever of max_words, max_chars, max_duration is exceeded first ends the current cue. sentence_aware adds a sentence-ending punctuation rule on top of those.
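
To make the interaction of these limits concrete, here is a rough local sketch of the rule as described above. It is an illustration only, not the service's implementation: `segment_words` is a hypothetical name, and details such as how spaces count toward `max_chars` or how duration is measured are assumptions.

```python
def segment_words(words, max_words=4, max_chars=80, max_duration=10,
                  sentence_aware=True):
    """Group word dicts ({"word", "start", "end"}) into caption cues."""
    cues, current = [], []

    def flush():
        # Close the cue being built, if any.
        if current:
            cues.append({
                "start": current[0]["start"],
                "end": current[-1]["end"],
                "text": " ".join(w["word"] for w in current),
            })
            current.clear()

    for w in words:
        if current:
            text_len = len(" ".join(x["word"] for x in current)) + 1 + len(w["word"])
            too_many = len(current) + 1 > max_words
            too_wide = text_len > max_chars
            too_long = w["end"] - current[0]["start"] > max_duration
            # Whichever limit would be exceeded first ends the current cue.
            if too_many or too_wide or too_long:
                flush()
        current.append(w)
        # sentence_aware adds a break after sentence-ending punctuation.
        if sentence_aware and w["word"].endswith((".", "!", "?")):
            flush()
    flush()
    return cues
```

With `max_words=3`, the words of "Hello, thanks for joining the call today." split into three cues under this sketch: two cues of three words, then "today." on its own.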

Examples:

bash

# Short cues (TikTok-style, max 3 words, 3s each, no sentence-awareness)
curl "https://api.scriptivox.com/v1/transcribe/{id}?format=srt&max_words=3&max_duration=3&sentence_aware=false" \
  -H "Authorization: sk_live_YOUR_KEY"

# Standard SRT with default settings
curl "https://api.scriptivox.com/v1/transcribe/{id}?format=srt" \
  -H "Authorization: sk_live_YOUR_KEY"

# WebVTT, always show speaker tags, strip filler punctuation
curl "https://api.scriptivox.com/v1/transcribe/{id}?format=vtt&include_speakers=true&strip_chars=,." \
  -H "Authorization: sk_live_YOUR_KEY"

# Plain text without speaker prefixes
curl "https://api.scriptivox.com/v1/transcribe/{id}?format=text&include_speakers=false" \
  -H "Authorization: sk_live_YOUR_KEY"

When the job was made with align: false (no per-word timestamps), the segmentation falls back to utterance-level — max_words/max_chars/max_duration still cap each cue but cuts can only happen at utterance boundaries. For best segmentation control, keep align on (the default).
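
For clients that build export URLs in code rather than with curl, a small helper can keep the query string consistent. The sketch below is an assumption-laden example: `export_url` is a hypothetical helper, and lowercasing booleans simply mirrors the `sentence_aware=false` spelling in the curl examples (Python would otherwise serialise `False` as `False`).

```python
from urllib.parse import urlencode

BASE_URL = "https://api.scriptivox.com/v1"


def export_url(job_id, fmt="srt", **options):
    """Build a GET /v1/transcribe/{id} export URL with segmentation options."""
    if fmt not in ("json", "srt", "vtt", "text"):
        raise ValueError(f"unsupported format: {fmt!r}")
    params = {"format": fmt}
    for key, value in options.items():
        # Serialise booleans as lowercase true/false, as in the curl examples.
        params[key] = str(value).lower() if isinstance(value, bool) else value
    return f"{BASE_URL}/transcribe/{job_id}?{urlencode(params)}"
```

For example, `export_url(job_id, "srt", max_words=3, max_duration=3, sentence_aware=False)` produces the same query string as the first curl example.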

List transcriptions

GET /v1/transcriptions

List your transcriptions with optional filters and cursor-based pagination. Newest first by default. Soft-deleted transcriptions are excluded.

Authentication: Authorization: sk_live_YOUR_KEY

Query parameters

Parameter	Type	Required	Description
status	string	Optional	Filter by status. One of: "created", "downloading", "pending", "processing", "completed", "failed".
from	string	Optional	ISO-8601 timestamp. Inclusive lower bound on `created_at`.
to	string	Optional	ISO-8601 timestamp. Exclusive upper bound on `created_at`.
limit	integer	Optional	1–200. Default 50.
order	string	Optional	"desc" (default, newest first) or "asc".
cursor	string	Optional	Opaque cursor from the `next_cursor` field of a previous response. Use to fetch the next page.

Response

json

{
  "items": [
    {
      "id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
      "status": "completed",
      "audio_duration_seconds": 120,
      "file_size_bytes": 1920000,
      "language": "en",
      "diarize": true,
      "speaker_count": 2,
      "align": true,
      "cost_cents": 0.4,
      "created_at": "2026-05-18T10:30:00Z",
      "started_at": "2026-05-18T10:30:01Z",
      "completed_at": "2026-05-18T10:30:45Z",
      "source_url": "https://example.com/podcast.mp3"
    }
  ],
  "next_cursor": "eyJ0IjoiMjAyNi0wNS0xOFQxMDozMDowMFoiLCJpIjoiYjJjM2Q0ZTUtZjZhNy04OTAxLWJjZGUtZjEyMzQ1Njc4OTAxIn0="
}

Code Examples

resp = requests.get(
    "https://api.scriptivox.com/v1/transcriptions",
    headers={"Authorization": "sk_live_YOUR_KEY"},
    params={"status": "completed", "limit": 20})
page = resp.json()

for item in page["items"]:
    print(item["id"], item["status"])

if page["next_cursor"]:
    # fetch the next page
    resp = requests.get(
        "https://api.scriptivox.com/v1/transcriptions",
        headers={"Authorization": "sk_live_YOUR_KEY"},
        params={"cursor": page["next_cursor"]})

Each item matches the shape of GET /v1/transcribe/{id} except that the heavy result object is omitted — fetch individual transcriptions for the full transcript. Pagination is stable across new inserts: pass the next_cursor value from the response into the cursor query param to get the next page. next_cursor is null when there are no more pages.
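
The cursor loop can be expressed as a generator that walks every page. This is a sketch under stated assumptions: `iter_transcriptions` and `fetch_page_http` are hypothetical names, and resending the original filters alongside each cursor is an assumption (the example above sends only the cursor), so adjust if the API expects otherwise.

```python
def iter_transcriptions(fetch_page, **filters):
    """Yield every item across all pages.

    fetch_page(params) must return the parsed JSON body of
    GET /v1/transcriptions for the given query params.
    """
    params = dict(filters)
    while True:
        page = fetch_page(params)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if not cursor:  # next_cursor is null on the last page
            break
        params = {**filters, "cursor": cursor}


def fetch_page_http(params, api_key="sk_live_YOUR_KEY"):
    """Fetch one page over HTTP (needs the third-party requests package)."""
    import requests  # imported lazily so the generator has no hard dependency

    resp = requests.get(
        "https://api.scriptivox.com/v1/transcriptions",
        headers={"Authorization": api_key},
        params=params,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```

Usage would look like `for item in iter_transcriptions(fetch_page_http, status="completed", limit=20): ...`.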

Cancel transcription

POST /v1/transcribe/{id}/cancel

Stop an in-flight transcription. Releases the reserved balance and fires the transcription.failed webhook (if configured) with error.code=CANCELLED. Idempotent — calling cancel on an already-cancelled job returns the same response.

Authentication: Authorization: sk_live_YOUR_KEY

Path parameters

Parameter	Type	Required	Description
id	string	Required	The transcription ID (passed in the URL path).

Response

json

{
  "id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
  "status": "failed",
  "error": {
    "code": "CANCELLED",
    "message": "Cancelled by customer"
  },
  "released_cents": 0.5
}

Code Examples

requests.post(
    f"https://api.scriptivox.com/v1/transcribe/{transcription_id}/cancel",
    headers={"Authorization": "sk_live_YOUR_KEY"})

Cancel is allowed only while the job is in created, downloading, pending, or processing state. Cancelling a completed or failed job returns 409 CONFLICT. Cancellation is best-effort against the GPU — the model may still finish briefly after, but its result is discarded and you are not charged.

Delete transcription

DELETE /v1/transcribe/{id}

Soft-delete a completed or failed transcription. Removes the stored transcript from our storage. The job record is kept for 7 days for audit, then hard-deleted. In-flight jobs cannot be deleted — cancel them first.

Authentication: Authorization: sk_live_YOUR_KEY

Path parameters

Parameter	Type	Required	Description
id	string	Required	The transcription ID (passed in the URL path).

Response

(204 No Content, empty body)

Code Examples

requests.delete(
    f"https://api.scriptivox.com/v1/transcribe/{transcription_id}",
    headers={"Authorization": "sk_live_YOUR_KEY"})

Returns 204 No Content on success. Idempotent — deleting an already-deleted transcription also returns 204. Trying to delete an in-flight transcription returns 409 CONFLICT with a message telling you to cancel first.

Balance

A non-zero balance is required to start an upload or submit a transcription — POST /v1/upload and POST /v1/transcribe return 402 ZERO_BALANCE when your balance is $0. The exact cost is reserved once the audio duration is known (after download/validation), not on submission.

GET /v1/balance

Returns your current account balance in cents, the amount reserved for in-progress transcriptions, the amount available for new jobs, and an estimate of remaining hours at the current per-hour price.

Authentication: Authorization: sk_live_YOUR_KEY

Response

json

{
  "balance_cents": 1500,
  "reserved_cents": 100,
  "available_cents": 1400,
  "price_per_hour_cents": 20,
  "estimated_hours_available": 70.0,
  "deposit_url": "https://platform.scriptivox.com/billing",
  "updated_at": "2025-01-15T10:30:00Z"
}

Code Examples

resp = requests.get("https://api.scriptivox.com/v1/balance",
    headers={"Authorization": "sk_live_YOUR_KEY"})
balance = resp.json()
print(f"Available: ${balance['available_cents'] / 100:.2f}")
print(f"Hours remaining: {balance['estimated_hours_available']:.1f}")

Error Codes

All errors follow the same format:

json

{
  "error": {
    "code": "ERROR_CODE",
    "message": "Human-readable description"
  }
}
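
A client can turn this envelope into a (code, message) pair while tolerating bodies that are not JSON at all, such as an HTML error page from an intermediate network layer. The helper below is a sketch: `parse_error` and the `HTTP_<status>` fallback code are hypothetical conventions, not values the API returns.

```python
import json


def parse_error(status_code, body_text):
    """Return (code, message) for a failed response, tolerating non-JSON bodies."""
    try:
        err = json.loads(body_text)["error"]
        return err["code"], err["message"]
    except (ValueError, KeyError, TypeError):
        # Not the documented envelope (for example an HTML page):
        # fall back to a synthetic code and a truncated copy of the body.
        return f"HTTP_{status_code}", body_text[:200]
```

With the requests library this would be called as `parse_error(resp.status_code, resp.text)` whenever `resp.ok` is false.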

Synchronous errors — returned immediately on the request

HTTP	Code	Description
400	INVALID_REQUEST	Malformed request body or missing required fields
400	INVALID_FILENAME	Filename is missing, too long, contains invalid characters, or has no extension
400	INVALID_MEDIA_FORMAT	Unsupported file extension at upload time (e.g. `.txt`, `.pdf`). Also returned async when `ffprobe` rejects the actual file contents.
400	FILE_NOT_UPLOADED	File not found at the upload URL
400	FILE_TOO_LARGE	File exceeds 5 GB limit
400	UPLOAD_ALREADY_USED	Upload already used for a transcription
400	UPLOAD_EXPIRED	Upload URL expired (1 hour TTL)
401	INVALID_API_KEY	Invalid or missing API key
401	API_KEY_REVOKED	API key has been revoked
402	INSUFFICIENT_BALANCE	Not enough balance for this transcription
402	ZERO_BALANCE	Balance is $0 — deposit required
404	UPLOAD_NOT_FOUND	Upload ID does not exist
404	TRANSCRIPTION_NOT_FOUND	Transcription ID does not exist
404	NOT_FOUND	Path doesn't match any endpoint
405	METHOD_NOT_ALLOWED	The path exists but only accepts a different HTTP method. The response includes an `Allow` header naming the accepted method (e.g. `Allow: GET` if you POST to `/v1/balance`).
409	CONFLICT	Action not allowed in the current state (e.g. delete on an in-flight job, cancel on a completed one)
409	IDEMPOTENCY_KEY_LOCKED	Another request with the same `Idempotency-Key` is mid-flight. Retry after a few seconds (`Retry-After` header sent).
415	UNSUPPORTED_MEDIA_TYPE	Request body sent without `Content-Type: application/json` on a POST/PUT/PATCH. The response includes an `Accept-Post: application/json` header. Parameters are allowed (`application/json; charset=utf-8` works).
422	IDEMPOTENCY_KEY_CONFLICT	`Idempotency-Key` reused with a different request body — see Idempotency
429	RATE_LIMIT_EXCEEDED	Too many requests — see Rate Limits
500	INTERNAL_ERROR	Server error

Asynchronous errors — surface only on `GET /v1/transcribe/{id}`

POST /v1/transcribe accepts the job and returns {"status":"created"} even if the input will ultimately fail. The errors below only appear on the GET endpoint, with the top-level status set to "failed" and the failure reason in the error object. Your client must handle these on the poll path, not on submit.

Code	When it appears
URL_NOT_ACCESSIBLE	URL flow only — the URL returned 4xx/5xx, didn't resolve, or the connection was refused.
DOWNLOAD_FAILED	URL flow only — the download started but was interrupted.
INVALID_MEDIA_FORMAT	The file downloaded but `ffprobe` rejected it: not actually audio/video, no audio track, or shorter than the 1-second minimum (message names the measured duration, e.g. `"Audio too short (0.10s). Minimum duration is 1 second."`).
DURATION_TOO_LONG	Audio exceeds the 10-hour limit (only known after probing duration).
PROCESSING_ERROR	The GPU job failed after all retries.
CREATED_TIMEOUT	Job sat in `created` for more than 30 minutes — validation step never started.
DOWNLOAD_TIMEOUT	Job sat in `downloading` for more than 15 minutes — file download stalled.
PROCESSING_TIMEOUT	Job sat in `processing` for more than 45 minutes — GPU never returned a result.
BILLING_ERROR	Internal accounting issue while finalizing the charge — your balance was not debited and the transcript was not delivered. Safe to retry.
INTERNAL_ERROR	A server-side step (e.g. queueing the job to our processing layer) failed after we accepted the request. Safe to retry.
CANCELLED	The transcription was cancelled by the customer via `POST /v1/transcribe/{id}/cancel`. The reserved balance was released; you are not charged.

Failed transcriptions are free. The reserved balance is released back to your account, so a failure costs $0 regardless of how far the job got.

error.message is always a sanitized, customer-safe string — we never forward raw GPU/library stack traces or internal paths. If you need more detail than the message provides for a PROCESSING_ERROR, contact support with the transcription ID and we can look up the raw error on our side.
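
Because these failures only show up on the poll path, it can help to classify a polled job in one place. The sketch below treats only the two codes described above as "Safe to retry" as retryable; the function name and the returned labels are hypothetical.

```python
# Codes the table above explicitly describes as safe to retry.
RETRY_SAFE_CODES = {"BILLING_ERROR", "INTERNAL_ERROR"}


def classify_job(job):
    """Map a GET /v1/transcribe/{id} body to done / in_progress / retry / failed."""
    status = job["status"]
    if status == "completed":
        return "done"
    if status != "failed":
        return "in_progress"
    code = (job.get("error") or {}).get("code")
    return "retry" if code in RETRY_SAFE_CODES else "failed"
```

A caller might resubmit the job on `"retry"` and log `error.code` together with the transcription ID on `"failed"`.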

Important notes

Language parameter behavior

Recommended: always pass language explicitly. Auto-detection works in most cases but has a small failure rate that's fully avoidable. Passing the actual language gives more accurate transcripts, faster turnaround (the model skips its detection pass), and protects you from the known edge cases below.

When you specify a language code, the model is forced to interpret the audio as that language. If the audio is actually in a different language, the model may translate rather than transcribe — for example, setting language: "es" on English audio can produce a Spanish translation of the speech. Omit language (or pass null) to let the model auto-detect.

Auto-detect picks a single dominant language for the whole file and is not perfect:

Code-switched audio (e.g. English/Spanish in the same clip) typically gets transcribed as the dominant language, and segments in the other language may be dropped or mistranscribed.
Hindi audio is sometimes routed to Urdu by the detector. If you know the language in advance, pass it explicitly.
Short clips (under 30s) give the detector less signal and are more likely to mis-route.
Music or background noise during the first few seconds can throw the detector off.

If you know the language up front — even probabilistically — pass it. The accuracy cost of being wrong is roughly the same as the cost of auto-detect picking the wrong language, but passing it is faster and works on the edge cases above.

Silence and very short clips

Whisper-based models are prone to a known hallucination on near-silent or sub-second audio, often producing a phantom "Thank you." or similar filler. If your pipeline can produce silent or extremely short clips, filter them on your side before submitting.

Size, duration, and retention limits

Max file size: 5 GB. Anything larger is rejected with 400 FILE_TOO_LARGE. Very large uploads can also be rejected by the network layer with an HTML 413 page before they reach the API — your client should handle non-JSON error bodies gracefully.
Max duration: 10 hours. Files probing longer than this fail with DURATION_TOO_LONG (asynchronous; you'll see it on the GET endpoint after probing completes).
Min duration: 1 second. Files shorter than 1 second fail asynchronously with INVALID_MEDIA_FORMAT and a message naming the actual measured duration. (Sub-second clips have too little speech signal to transcribe reliably.)
Request body Content-Type: every POST / PUT / PATCH that carries a JSON body must send Content-Type: application/json (parameters allowed, e.g. application/json; charset=utf-8). Anything else is rejected at the gateway with 415 UNSUPPORTED_MEDIA_TYPE and an Accept-Post: application/json header.
Presigned URL TTL: 1 hour. Uploads must PUT to the URL within expires_in seconds of receiving it; after that the URL returns 400 UPLOAD_EXPIRED.
Audio retention: uploaded source audio is retained for 24 hours after the job ends, then deleted. Transcripts themselves remain available via GET /v1/transcribe/{id}.

Filename rules

The filename you pass to POST /v1/upload must satisfy all of the following:

ASCII only (rename files with accents, CJK, or emoji before upload — Supabase Storage rejects non-ASCII object keys).
No path separators — / and \ are rejected.
No <, >, ", ', or ` characters. These are rejected to prevent injection into dashboards that render filenames unescaped.
Must include a supported extension (e.g. .mp3, .wav, .mp4). See Supported Formats for the full list.
255 characters or fewer.

Violations return 400 INVALID_FILENAME (or 400 INVALID_MEDIA_FORMAT if the extension is present but unsupported).
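
These rules are easy to check locally before calling POST /v1/upload. The following is a sketch, not the server's validator: `check_filename` is a hypothetical helper, the extension set contains only the three examples named above (the full list is in Supported Formats), and treating extensions case-insensitively is an assumption.

```python
import os

# Only the extensions named in this section; extend from Supported Formats.
KNOWN_EXTENSIONS = {".mp3", ".wav", ".mp4"}
FORBIDDEN_CHARS = set('/\\<>"\'`')


def check_filename(name, extensions=KNOWN_EXTENSIONS):
    """Return None if name passes the rules above, else the likely error code."""
    if not name or len(name) > 255:
        return "INVALID_FILENAME"
    if not name.isascii() or FORBIDDEN_CHARS & set(name):
        return "INVALID_FILENAME"
    ext = os.path.splitext(name)[1].lower()
    if not ext:
        return "INVALID_FILENAME"
    if ext not in extensions:
        return "INVALID_MEDIA_FORMAT"
    return None
```

Running this before requesting a presigned URL avoids a round trip for names the API would reject anyway.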

`audio_duration_seconds` is an integer

audio_duration_seconds in GET /v1/transcribe/{id} is truncated to a whole second (so an 11.7-second clip returns 11). cost_cents, by contrast, is fractional and exact (e.g. 0.061111 for an 11-second clip at $0.20/hour); your account balance is debited against the precise value, not a rounded one.
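
The cost figure in that example follows from simple per-hour arithmetic, sketched below. `estimate_cost_cents` is a hypothetical helper, and the default of 20 cents per hour is taken from the $0.20/hour example; in practice the current price should be read from `price_per_hour_cents` on GET /v1/balance. Whether billing uses the exact or the whole-second duration is not specified here, so treat the result as an estimate.

```python
def estimate_cost_cents(duration_seconds, price_per_hour_cents=20):
    """Estimate the cost of a transcription in (fractional) cents."""
    return duration_seconds * price_per_hour_cents / 3600


# 11 seconds at 20 cents/hour, rounded to six decimals as in the example above
print(round(estimate_cost_cents(11), 6))  # 0.061111
```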

`speaker_count` is a soft prior, not a hard ceiling

When you set speaker_count: N with diarize: true, you're giving the diarizer a hint about how many speakers to expect — not a hard cap. The result may include slightly more or fewer speaker labels than N (e.g. requesting 5 may produce 6). Values outside 1–50, or speaker_count without diarize: true, are rejected with 400 INVALID_REQUEST.

Recommended: whenever you know how many speakers are on the recording, pass speaker_count. It noticeably improves diarization accuracy — the model uses it as a prior instead of guessing, which cuts down on over-segmentation (one speaker split across two labels) and under-segmentation (two speakers merged into one).

Per-word leading whitespace

Word objects in align: true output have whitespace pre-stripped (e.g. {"word":"The"}, not {"word":" The"}). Reconstruct sentence text from utterance.text if you need exact spacing.

`diarize` forces `align`

diarize requires word-level alignment to map speakers onto each word, so any request with diarize: true is processed as if align: true — even when the caller explicitly passes align: false. The stored job and webhook payload reflect the effective value (align: true), and word objects are present on the result. If you don't want alignment data, leave diarize off.

Idempotency

Both POST /v1/upload and POST /v1/transcribe accept an optional Idempotency-Key HTTP header. When present, we cache the response for 24 hours and replay it byte-for-byte on subsequent requests with the same key and body. This protects you against double-billed transcriptions if your network drops the response and your client retries.

bash

curl -X POST https://api.scriptivox.com/v1/transcribe \
  -H "Authorization: sk_live_YOUR_KEY" \
  -H "Idempotency-Key: 7c8f5b3a-1234-4d56-90ab-cdef01234567" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/audio.mp3"}'

Rules:

Key is opaque — we don't parse it. Generate one per logical operation (a UUID is conventional). Reuse the same key when retrying after a failure.
1–255 printable ASCII characters.
Same key + same body within 24h → cached response is replayed. The replayed response carries an Idempotent-Replay: true header so you can tell it was cached.
Same key + different body → 422 IDEMPOTENCY_KEY_CONFLICT. This is a bug in your retry logic — use a fresh key for a different operation.
Same key while a previous request is still in progress → 409 IDEMPOTENCY_KEY_LOCKED with a Retry-After header. Sleep briefly and retry; the cached response will be available once the first request finishes.
No header sent → endpoint behaves normally (no caching).

The replay cache covers 200 success responses only. 4xx and 5xx errors are not cached, so you can retry freely.
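The retry behaviour these rules call for can be sketched as below. The transport is injected as a send callable so the example stays self-contained; submit_with_retries, the (status, headers, body) tuple shape, and the backoff values are assumptions for illustration, not part of the API.

```python
import time
import uuid
from typing import Callable, Mapping, Tuple

# (status_code, response_headers, parsed_json_body) from the injected transport.
Response = Tuple[int, Mapping[str, str], dict]


def submit_with_retries(
    send: Callable[[dict, Mapping[str, str]], Response],
    body: dict,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """POST body, reusing one Idempotency-Key across every retry."""
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    for attempt in range(1, max_attempts + 1):
        try:
            status, resp_headers, payload = send(body, headers)
        except ConnectionError:
            # The response was lost; retrying with the same key is safe.
            if attempt == max_attempts:
                raise
            sleep(min(2 ** attempt, 30))
            continue
        code = payload.get("error", {}).get("code")
        if status == 409 and code == "IDEMPOTENCY_KEY_LOCKED" and attempt < max_attempts:
            # The first request is still in progress; wait as instructed.
            sleep(float(resp_headers.get("Retry-After", 1)))
            continue
        return status, resp_headers, payload
    raise RuntimeError("no attempts were made")
```

The key is generated once per logical operation and never regenerated inside the loop, which is what lets the server replay the cached response instead of creating a second job.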

Unknown fields are rejected

Request bodies on POST /v1/upload and POST /v1/transcribe are strict: any top-level field that isn't in the documented parameter list returns 400 INVALID_REQUEST with the offending key name. This is to surface typos ({"dirize": true}) immediately instead of silently ignoring them. Nested objects (e.g. inside a future metadata field) are not currently inspected.
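To catch typos before they reach the API, you could mirror the strict check on the client. The sketch below uses the POST /v1/transcribe parameters listed in this reference; the names TRANSCRIBE_FIELDS and unknown_fields are illustrative, and the set would need updating if the parameter list changes.

```python
# Top-level fields documented for POST /v1/transcribe in this reference.
TRANSCRIBE_FIELDS = {
    "url", "upload_id", "language", "diarize",
    "speaker_count", "align", "webhook_url",
}


def unknown_fields(body: dict, allowed: set = TRANSCRIBE_FIELDS) -> list:
    """Return top-level keys that are not in the documented parameter list."""
    return sorted(set(body) - allowed)


print(unknown_fields({"url": "https://example.com/audio.mp3", "dirize": True}))
# ['dirize']
```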

Response shape changes with `diarize` and `align`

The GET /v1/transcribe/{id} example above shows the response with diarize: true and align: true (the maximal case). When you turn flags off, some result fields become null or empty:

Field	`diarize: false`	`align: false`
`result.speakers`	`null`	unchanged
`result.utterances[].speaker`	`null`	unchanged
`result.utterances[].words`	unchanged	`[]` (empty array)
`result.utterances[].words[].speaker`	key absent	n/a (no words)

Top-level metadata (status, cost_cents, audio_duration_seconds, etc.) is identical regardless of flags. result.utterances[].confidence and result.utterances[].words[].confidence may be null — confidence scores depend on the alignment model used for the detected language, and not every language is covered. Always handle null defensively (e.g. skip filtering by confidence rather than dropping the word).

Speaker labels are "SPEAKER 1", "SPEAKER 2", … (space, 1-indexed) — not "SPEAKER_00". source_url is only present on URL-flow transcriptions; upload-flow jobs omit it.
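The null-handling advice above might look like this in practice. It is a minimal sketch that assumes the result shape shown in the GET /v1/transcribe/{id} example; the helper names and the 0.8 threshold are arbitrary choices for illustration.

```python
def confident_words(utterance: dict, min_confidence: float = 0.8) -> list:
    """Drop low-confidence words, but keep words whose confidence is null."""
    kept = []
    # words is [] when align is false; "or []" also tolerates a missing key.
    for word in utterance.get("words") or []:
        confidence = word.get("confidence")
        if confidence is None or confidence >= min_confidence:
            kept.append(word)
    return kept


def speaker_labels(result: dict) -> list:
    """Return the speaker list, treating null (diarize: false) as empty."""
    return result.get("speakers") or []
```

Keeping null-confidence words means a language without confidence support yields the full transcript rather than an empty one.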

Rate Limits

Limits are enforced per API key per endpoint, plus a per-IP cap across all endpoints. Exceeding any limit returns 429 RATE_LIMIT_EXCEEDED with a Retry-After header.

Scope	Limit	Notes
`POST /v1/upload` (per key)	60/min	Presigned URL generation
`POST /v1/transcribe` (per key)	60/min	Job submission
`GET /v1/transcribe/{id}` (per key)	200/min	Higher limit for polling
`GET /v1/transcriptions` (per key)	60/min	List your jobs
`POST /v1/transcribe/{id}/cancel` (per key)	30/min	Cancel in-flight
`DELETE /v1/transcribe/{id}` (per key)	30/min	Soft-delete
`GET /v1/balance` (per key)	100/min	Balance checks
Per source IP (across all endpoints)	300/min	Edge-level cap to prevent abuse

Limits use a sliding 60-second window rather than a fixed-window counter: a short burst faster than the average per-minute rate is tolerated as long as the rolling 60-second total stays under the limit. Plan against the steady-state number, not the burst.
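A client can approximate the same rolling window locally to avoid hitting 429 in the first place. The class below is a sketch of a client-side limiter, not a description of how the server implements its limits; SlidingWindowLimiter and its method names are hypothetical.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Track request timestamps within a rolling window."""

    def __init__(self, limit: int, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.sent = deque()  # timestamps of requests still inside the window

    def wait_time(self) -> float:
        """Seconds to wait before another request fits in the window."""
        now = self.clock()
        # Discard timestamps that have aged out of the rolling window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.limit:
            return 0.0
        return self.window - (now - self.sent[0])

    def record(self) -> None:
        """Call once for each request actually sent."""
        self.sent.append(self.clock())
```

Before each call, sleep for wait_time() seconds and then call record(). For polling GET /v1/transcribe/{id} you would construct it with limit=200.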

Rate limit headers

Every response includes:

Header	Description
`X-RateLimit-Limit`	Maximum requests allowed per minute for this endpoint.
`X-RateLimit-Remaining`	Requests remaining in the current rolling window.
`X-RateLimit-Reset`	Unix timestamp when the window fully resets.
`Retry-After`	Seconds to wait before retrying. Only present on `429` responses.
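When a 429 does arrive, the headers above are enough to decide how long to pause. The helper below is a sketch: retry_delay is a hypothetical name, and the 1-second fallback when neither header is present is an assumption rather than documented behaviour.

```python
def retry_delay(status: int, headers: dict, now: float) -> float:
    """Return seconds to wait before retrying; 0.0 if no wait is needed.

    now is the current Unix time, e.g. time.time().
    """
    if status != 429:
        return 0.0
    # Header names are case-insensitive, so normalise before lookup.
    lower = {k.lower(): v for k, v in headers.items()}
    if "retry-after" in lower:
        return max(0.0, float(lower["retry-after"]))
    if "x-ratelimit-reset" in lower:
        return max(0.0, float(lower["x-ratelimit-reset"]) - now)
    return 1.0  # assumed fallback, not specified by the API
```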

Supported Formats

25 container/codec combinations are accepted (10 audio + 15 video). Maximum file size is 5 GB; maximum duration is 10 hours. Unrecognized extensions return 400 INVALID_MEDIA_FORMAT.

Audio (10)

Extension	Format
.mp3	MPEG Audio
.wav	Waveform Audio
.m4a	MPEG-4 Audio
.aac	Advanced Audio Coding
.ogg	Ogg Vorbis
.flac	Free Lossless Audio
.opus	Opus
.wma	Windows Media Audio
.aiff	Audio Interchange
.caf	Core Audio Format

Video (15)

Extension	Format
.mp4	MPEG-4 Video
.mov	QuickTime
.avi	Audio Video Interleave
.mkv	Matroska Video
.webm	WebM
.wmv	Windows Media Video
.flv	Flash Video
.m4v	MPEG-4 Video (iTunes)
.3gp	3GPP
.mpeg	MPEG Video
.mts	AVCHD
.ogv	Ogg Video
.ts	MPEG Transport Stream
.vob	DVD Video Object
.f4v	Flash MP4 Video
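A quick pre-upload check against the two tables above could look like this. It is a sketch only: the constant and function names are illustrative, and treating extensions as case-insensitive is an assumption about server behaviour.

```python
import os

AUDIO_EXTENSIONS = {
    ".mp3", ".wav", ".m4a", ".aac", ".ogg",
    ".flac", ".opus", ".wma", ".aiff", ".caf",
}
VIDEO_EXTENSIONS = {
    ".mp4", ".mov", ".avi", ".mkv", ".webm", ".wmv", ".flv", ".m4v",
    ".3gp", ".mpeg", ".mts", ".ogv", ".ts", ".vob", ".f4v",
}


def is_supported(filename: str) -> bool:
    """Check the file extension against the documented format list."""
    extension = os.path.splitext(filename)[1].lower()
    return extension in AUDIO_EXTENSIONS | VIDEO_EXTENSIONS
```

A passing check only covers the extension; the file contents are still validated after upload and can fail with INVALID_MEDIA_FORMAT.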

Supported Languages

119 languages are supported. Pass the ISO 639-1 code (or the BCP-47 fallback, e.g. yue, kea) from the table below in the language parameter.

We recommend always passing a language explicitly. Omitting the parameter (or passing null) triggers auto-detection, which works for most inputs but has a small chance of picking the wrong language — especially on short clips, code-switched audio, or files that start with music or background noise. See Language parameter behavior for details.

Invalid codes return 400 INVALID_REQUEST.

Language	Code
Afrikaans	af
Albanian	sq
Amharic	am
Arabic	ar
Armenian	hy
Assamese	as
Asturian	ast
Azerbaijani	az
Bashkir	ba
Basque	eu
Belarusian	be
Bengali	bn
Bosnian	bs
Breton	br
Bulgarian	bg
Cantonese	yue
Cape Verdean Creole	kea
Catalan	ca
Cebuano	ceb
Chichewa	ny
Chinese	zh
Croatian	hr
Czech	cs
Danish	da
Dutch	nl
English	en
Estonian	et
Faroese	fo
Finnish	fi
French	fr
Fula	ff
Galician	gl
Georgian	ka
German	de
Greek	el
Gujarati	gu
Haitian Creole	ht
Hausa	ha
Hawaiian	haw
Hebrew	he
Hindi	hi
Hungarian	hu
Icelandic	is
Igbo	ig
Indonesian	id
Irish	ga
Italian	it
Japanese	ja
Javanese	jw
Kamba	kam
Kannada	kn
Kazakh	kk
Khmer	km
Korean	ko
Kyrgyz	ky
Lao	lo
Latin	la
Latvian	lv
Lingala	ln
Lithuanian	lt
Luganda	lg
Luo	luo
Luxembourgish	lb
Macedonian	mk
Malagasy	mg
Malay	ms
Malayalam	ml
Maltese	mt
Maori	mi
Marathi	mr
Mongolian	mn
Myanmar	my
Nepali	ne
Northern Sotho	nso
Norwegian	no
Nynorsk	nn
Occitan	oc
Odia	or
Oromo	om
Pashto	ps
Persian	fa
Polish	pl
Portuguese	pt
Punjabi	pa
Romanian	ro
Russian	ru
Sanskrit	sa
Serbian	sr
Shona	sn
Sindhi	sd
Sinhala	si
Slovak	sk
Slovenian	sl
Somali	so
Sorani Kurdish	ckb
Spanish	es
Sundanese	su
Swahili	sw
Swedish	sv
Tagalog	tl
Tajik	tg
Tamil	ta
Tatar	tt
Telugu	te
Thai	th
Tibetan	bo
Turkish	tr
Turkmen	tk
Ukrainian	uk
Umbundu	umb
Urdu	ur
Uzbek	uz
Vietnamese	vi
Welsh	cy
Wolof	wo
Xhosa	xh
Yiddish	yi
Yoruba	yo
Zulu	zu

`audio_duration_seconds` is an integer

`speaker_count` is a soft prior, not a hard ceiling

Per-word leading whitespace

Word objects in align: true output have whitespace pre-stripped (e.g. {"word":"The"}, not {"word":" The"}). Reconstruct sentence text from utterance.text if you need exact spacing.

`diarize` forces `align`

Idempotency

bash

curl -X POST https://api.scriptivox.com/v1/transcribe \
  -H "Authorization: sk_live_YOUR_KEY" \
  -H "Idempotency-Key: 7c8f5b3a-1234-4d56-90ab-cdef01234567" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/audio.mp3"}'

Rules:

Key is opaque — we don't parse it. Generate one per logical operation (a UUID is conventional). Reuse the same key when retrying after a failure.
1–255 printable ASCII characters.
Same key + same body within 24h → cached response is replayed. The replayed response carries an Idempotent-Replay: true header so you can tell it was cached.
Same key + different body → 422 IDEMPOTENCY_KEY_CONFLICT. This is a bug in your retry logic — use a fresh key for a different operation.
Same key while a previous request is still in progress → 409 IDEMPOTENCY_KEY_LOCKED with a Retry-After header. Sleep briefly and retry; the cached response will be available once the first request finishes.
No header sent → endpoint behaves normally (no caching).

The replay cache covers 200 success responses only. 4xx and 5xx errors are not cached, so you can retry freely.
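
One way to apply these rules in a client is sketched below. It generates a single key per logical operation and reuses it on every retry. The function post_with_idempotency and the send callable (which performs the actual HTTP call and returns status, headers, and body) are hypothetical. The sketch also assumes that any 409 is IDEMPOTENCY_KEY_LOCKED, that Retry-After is expressed in seconds, and that exponential backoff is an acceptable policy for 5xx responses.

```python
import time
import uuid


def post_with_idempotency(send, body, max_attempts=5, sleep=time.sleep):
    """Retry one logical operation under a single Idempotency-Key.

    send(body, headers) must return (status, response_headers, response_body).
    """
    key = str(uuid.uuid4())  # one key for the whole operation, reused on retry
    headers = {"Idempotency-Key": key, "Content-Type": "application/json"}
    for attempt in range(max_attempts):
        status, resp_headers, resp_body = send(body, headers)
        if status == 409:
            # Assumed to be IDEMPOTENCY_KEY_LOCKED: wait, then retry same key.
            sleep(float(resp_headers.get("Retry-After", 1)))
            continue
        if status >= 500:
            # Errors are not cached, so retrying with the same key is safe.
            sleep(2 ** attempt)
            continue
        # 2xx, or a 4xx such as 422 IDEMPOTENCY_KEY_CONFLICT: hand it back.
        return status, resp_headers, resp_body
    raise RuntimeError("gave up after %d attempts" % max_attempts)
```

Passing sleep as a parameter keeps the function easy to exercise in tests without real delays.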

Unknown fields are rejected

Response shape changes with `diarize` and `align`

The GET /v1/transcribe/{id} example above shows the response with diarize: true and align: true (the maximal case). When you turn flags off, some result fields become null or empty:

Field	`diarize: false`	`align: false`
`result.speakers`	`null`	unchanged
`result.utterances[].speaker`	`null`	unchanged
`result.utterances[].words`	unchanged	`[]` (empty array)
`result.utterances[].words[].speaker`	key absent	n/a (no words)

Speaker labels are "SPEAKER 1", "SPEAKER 2", … (space, 1-indexed) — not "SPEAKER_00". source_url is only present on URL-flow transcriptions; upload-flow jobs omit it.
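
Client code that reads results should tolerate all of these shapes. The sketch below is one defensive way to do so. The function names are hypothetical, and it relies only on the result fields mentioned in this reference (utterances, speaker, text, words, word).

```python
def format_transcript(result):
    """Render utterances as lines, prefixing the speaker label when present."""
    lines = []
    for utt in result.get("utterances") or []:
        speaker = utt.get("speaker")  # None when diarize was false
        prefix = f"{speaker}: " if speaker else ""
        lines.append(prefix + utt["text"])
    return "\n".join(lines)


def word_speakers(utterance):
    """Per-word speaker labels.

    Returns [] when align was false (no words), and None entries when
    diarize was false (the speaker key is absent on each word).
    """
    return [w.get("speaker") for w in utterance.get("words") or []]
```

Using .get with a fallback means the same code path handles diarized and non-diarized jobs without branching on the request flags.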

Rate Limits

Limits are enforced per API key per endpoint, plus a per-IP cap across all endpoints. Exceeding any limit returns 429 RATE_LIMIT_EXCEEDED with a Retry-After header.

Scope	Limit	Notes
`POST /v1/upload` (per key)	60/min	Presigned URL generation
`POST /v1/transcribe` (per key)	60/min	Job submission
`GET /v1/transcribe/{id}` (per key)	200/min	Higher limit for polling
`GET /v1/transcriptions` (per key)	60/min	List your jobs
`POST /v1/transcribe/{id}/cancel` (per key)	30/min	Cancel in-flight
`DELETE /v1/transcribe/{id}` (per key)	30/min	Soft-delete
`GET /v1/balance` (per key)	100/min	Balance checks
Per source IP (across all endpoints)	300/min	Edge-level cap to prevent abuse

Rate limit headers

Every response includes:

Header	Description
`X-RateLimit-Limit`	Maximum requests allowed per minute for this endpoint.
`X-RateLimit-Remaining`	Requests remaining in the current rolling window.
`X-RateLimit-Reset`	Unix timestamp when the window fully resets.
`Retry-After`	Seconds to wait before retrying. Only present on `429` responses.
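
A client can use these headers to pace itself. The sketch below is one possible approach, assuming header values arrive as strings in a plain dict with the exact casing shown in the table. The function name seconds_to_wait is hypothetical.

```python
import time


def seconds_to_wait(status, headers, now=None):
    """Suggest how long to pause before the next request to this endpoint."""
    now = time.time() if now is None else now
    if status == 429:
        # Retry-After is only present on 429 responses.
        return float(headers.get("Retry-After", 1))
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining <= 0:
        # Budget exhausted: wait until the window resets (Unix timestamp).
        reset_at = float(headers.get("X-RateLimit-Reset", now))
        return max(0.0, reset_at - now)
    return 0.0
```

For example, a response with X-RateLimit-Remaining: 0 and a reset timestamp 10 seconds in the future yields a 10-second pause, while a response with budget left yields no pause.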

Supported Formats

25 file formats are accepted, identified by extension (10 audio + 15 video). Maximum file size is 5 GB; maximum duration is 10 hours. Unrecognized extensions return 400 INVALID_MEDIA_FORMAT.

Audio (10)

Extension	Format
.mp3	MPEG Audio
.wav	Waveform Audio
.m4a	MPEG-4 Audio
.aac	Advanced Audio Coding
.ogg	Ogg Vorbis
.flac	Free Lossless Audio
.opus	Opus
.wma	Windows Media Audio
.aiff	Audio Interchange
.caf	Core Audio Format

Video (15)

Extension	Format
.mp4	MPEG-4 Video
.mov	QuickTime
.avi	Audio Video Interleave
.mkv	Matroska Video
.webm	WebM
.wmv	Windows Media Video
.flv	Flash Video
.m4v	MPEG-4 Video (iTunes)
.3gp	3GPP
.mpeg	MPEG Video
.mts	AVCHD
.ogv	Ogg Video
.ts	MPEG Transport Stream
.vob	DVD Video Object
.f4v	Flash MP4 Video
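
The two tables above can be encoded as lookup sets for a client-side check. This is a sketch only: the function name media_kind is hypothetical, and lowercasing the extension before lookup is an assumption about how the server matches it.

```python
import os

AUDIO_EXTENSIONS = {
    ".mp3", ".wav", ".m4a", ".aac", ".ogg",
    ".flac", ".opus", ".wma", ".aiff", ".caf",
}
VIDEO_EXTENSIONS = {
    ".mp4", ".mov", ".avi", ".mkv", ".webm",
    ".wmv", ".flv", ".m4v", ".3gp", ".mpeg",
    ".mts", ".ogv", ".ts", ".vob", ".f4v",
}


def media_kind(filename):
    """Return "audio", "video", or None for an unrecognized extension."""
    # Assumption: extension matching is case-insensitive.
    ext = os.path.splitext(filename)[1].lower()
    if ext in AUDIO_EXTENSIONS:
        return "audio"
    if ext in VIDEO_EXTENSIONS:
        return "video"
    return None
```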

Supported Languages

119 languages are supported. Pass a code from the table below in the language parameter: the ISO 639-1 code where one exists, otherwise a BCP-47 fallback (e.g. yue, kea).

Invalid codes return 400 INVALID_REQUEST.

Language	Code
Afrikaans	af
Albanian	sq
Amharic	am
Arabic	ar
Armenian	hy
Assamese	as
Asturian	ast
Azerbaijani	az
Bashkir	ba
Basque	eu
Belarusian	be
Bengali	bn
Bosnian	bs
Breton	br
Bulgarian	bg
Cantonese	yue
Cape Verdean Creole	kea
Catalan	ca
Cebuano	ceb
Chichewa	ny
Chinese	zh
Croatian	hr
Czech	cs
Danish	da
Dutch	nl
English	en
Estonian	et
Faroese	fo
Finnish	fi
French	fr
Fula	ff
Galician	gl
Georgian	ka
German	de
Greek	el
Gujarati	gu
Haitian Creole	ht
Hausa	ha
Hawaiian	haw
Hebrew	he
Hindi	hi
Hungarian	hu
Icelandic	is
Igbo	ig
Indonesian	id
Irish	ga
Italian	it
Japanese	ja
Javanese	jw
Kamba	kam
Kannada	kn
Kazakh	kk
Khmer	km
Korean	ko
Kyrgyz	ky
Lao	lo
Latin	la
Latvian	lv
Lingala	ln
Lithuanian	lt
Luganda	lg
Luo	luo
Luxembourgish	lb
Macedonian	mk
Malagasy	mg
Malay	ms
Malayalam	ml
Maltese	mt
Maori	mi
Marathi	mr
Mongolian	mn
Myanmar	my
Nepali	ne
Northern Sotho	nso
Norwegian	no
Nynorsk	nn
Occitan	oc
Odia	or
Oromo	om
Pashto	ps
Persian	fa
Polish	pl
Portuguese	pt
Punjabi	pa
Romanian	ro
Russian	ru
Sanskrit	sa
Serbian	sr
Shona	sn
Sindhi	sd
Sinhala	si
Slovak	sk
Slovenian	sl
Somali	so
Sorani Kurdish	ckb
Spanish	es
Sundanese	su
Swahili	sw
Swedish	sv
Tagalog	tl
Tajik	tg
Tamil	ta
Tatar	tt
Telugu	te
Thai	th
Tibetan	bo
Turkish	tr
Turkmen	tk
Ukrainian	uk
Umbundu	umb
Urdu	ur
Uzbek	uz
Vietnamese	vi
Welsh	cy
Wolof	wo
Xhosa	xh
Yiddish	yi
Yoruba	yo
Zulu	zu

Webhooks

Real-time completion notifications

Pricing

Pay-as-you-go at $0.20/hour
